Session: 07-03-02 Risk Management
Paper Number: 87258
87258 - Comparison of Machine Learning Models for Quantitative Risk Modelling of Pipeline Systems
Over the past decade, machine learning models have enabled significant technical achievements in a variety of fields; however, their application is far from established in conventional, regulated industries, which are often slower to adopt new technologies. While classical statistical models have a long history of providing data-driven predictions, machine learning methods offer several attractive benefits over these more traditional approaches. Data complexities such as non-linearity or multicollinearity between variables can be problematic for classical models, which may require iterative data manipulation and a time-consuming manual model-building process to handle them appropriately. In contrast, modern machine learning algorithms such as gradient-boosted decision trees and neural networks are highly flexible in adapting to patterns within the data and can readily handle a large number of input data sources without a loss in performance. These qualities offer the opportunity to streamline the model-building process while maintaining or improving predictive performance.
In recent years, machine learning models have been applied to various pipeline industry prediction problems, such as failure modeling, estimation of soil properties, identification of structures and land use from satellite imagery, and classification of in-line inspection defects. In this paper, a case study is presented in which several statistical and machine learning models, including logistic regression, random forest, and gradient-boosted decision trees, are trained and validated on a historical incident record dataset to quantify the probability of pipe failure on a distribution pipeline system. The relative performance of each model type is compared on a held-out test dataset using an evaluation framework based on lift charts and calibration plots. Observed strengths and limitations of the different model types are discussed with respect to performance, interpretability, and ease of incorporating additional data, along with key considerations for fitting and evaluating models, such as spatial and temporal cross-validation.
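The comparison workflow described above can be sketched as follows. This is a minimal, illustrative example using scikit-learn on a synthetic stand-in for the historical incident dataset; the feature set, class balance, model settings, and the simple top-decile lift metric are all assumptions, not the paper's actual data or configuration.

```python
# Illustrative sketch of comparing logistic regression, random forest, and
# gradient-boosted trees on a held-out test set (synthetic data throughout).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic pipeline-segment features with imbalanced failure labels (~5% positive).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}


def top_decile_lift(y_true, p_hat):
    """Failure rate in the 10% of segments with the highest predicted
    probability, relative to the overall failure rate (a point on a lift chart)."""
    order = np.argsort(p_hat)[::-1]
    top = y_true[order][: len(y_true) // 10]
    return top.mean() / y_true.mean()


for name, model in models.items():
    p_hat = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_test, p_hat):.3f}, "
          f"Brier={brier_score_loss(y_test, p_hat):.3f}, "
          f"lift@10%={top_decile_lift(y_test, p_hat):.2f}")
```

In practice, the random train/test split above would be replaced with spatially or temporally grouped splits (e.g. scikit-learn's `GroupKFold` or `TimeSeriesSplit`) so that nearby segments or overlapping time periods do not leak between training and test data, and calibration would be assessed with full calibration plots rather than a single Brier score.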
In general, more training data leads to better-performing models; however, the amount of historical data available varies widely between operators. To illustrate how model performance depends on the quantity of training data, cases are presented where each model type is trained using only a portion of the overall dataset. In addition, the benefit of augmenting existing asset data with external datasets, such as public geospatial datasets, is quantified by comparing the performance of models fit before and after this data is included. The results of this study provide operators with additional insights and guidance for developing and evaluating machine learning models for pipeline risk assessment and integrity management.
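A data-quantity experiment of the kind described above might look like the following sketch: the same model is refit on progressively larger subsets of the training data and scored on a fixed test set. The synthetic data, the choice of gradient boosting, and the particular fractions are illustrative assumptions only.

```python
# Illustrative learning-curve experiment: test performance vs. training-set size
# (synthetic data; fractions and model choice are assumptions, not the paper's).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=8000, n_features=20, weights=[0.9],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

aucs = {}
for frac in (0.1, 0.25, 0.5, 1.0):
    # train_test_split shuffles, so a leading slice is a random subsample.
    n = int(frac * len(y_train))
    model = GradientBoostingClassifier(random_state=1).fit(X_train[:n], y_train[:n])
    aucs[frac] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"train fraction {frac:.2f}: test AUC = {aucs[frac]:.3f}")
```

The same loop structure extends naturally to the data-augmentation comparison: fit once on the asset features alone, then again with appended geospatial columns, and compare the two scores on the same held-out test set.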
Presenting Author: Daryl Bandstra, Integral Engineering
Presenting Author Biography: Daryl Bandstra is a professional engineer and a co-founder of Integral Engineering, an Edmonton-based pipeline engineering consulting firm. He works in the area of pipeline risk and integrity assessment, where he aids operators in applying probabilistic and predictive models to their pipeline systems. He has authored papers for various conferences, including the International Pipeline Conference and the Rio Pipeline Conference, and developed a portion of the IPC machine learning tutorial, which was first run in 2020. His current area of focus is pioneering the application of probabilistic models and machine learning methods in the pipeline industry.
Paper Type: Technical Paper Publication
