
A big data-based ensemble for fault prediction in electrical secondary distribution network

Article: 2340183 | Received 19 Aug 2022, Accepted 03 Apr 2024, Published online: 18 Apr 2024

Abstract

The introduction of smart meters, sensors and integrated electronic devices in the electrical secondary distribution network (ESDN) has led to the collection of massive amounts of data. Accurate prediction of faults from these data can help to improve the reliability, safety and operational efficiency of the ESDN. Due to their complexity, ESDN big data are hard to process and manage using traditional technologies and tools. The difficulties posed by dataset complexity arise from issues including high dimensionality, imbalance and variability, and a current challenge is to address them simultaneously. The capability of existing fault prediction techniques to address this challenge remains limited, and new approaches are needed. To this end, this article presents a big data-based ensemble for fault prediction in ESDN (BDEFP-ESDN) on Apache Spark, with gradient-boosted trees, random forest, decision tree and binomial logistic regression base models. BDEFP-ESDN is tailored to the complexity of the ESDN dataset through dimension reduction, bootstrap sampling and hyperparameter optimization in the training process and a weighted voting approach in the prediction process. Our experimental results demonstrate the efficiency of BDEFP-ESDN against traditional classifiers such as ANN, SVM, RF and XGB, achieving an accuracy of 99.6% for both binary and multiclass classification.

Introduction

Generally, the electrical secondary distribution network (ESDN) is an electrical power distribution system configuration designed to deliver power to final consumers (Chen et al., 2012). One of the dominant challenges utility companies face in managing and controlling the ESDN is the process of fault detection, identification and clearance. Leading causes of this challenge include the rapid expansion of the distribution network due to increasing power demand, and aging facilities (Amin, 2011). As such, accurately predicting faults before they occur in the ESDN is an important measure for addressing this challenge.

Fault prediction analyzes power system historical data to discover patterns of abnormal operating conditions and forecast future failures, so that appropriate measures can be taken to optimize the lifespan of power system equipment and facilitate system recovery (Zhang et al., 2018). The advantages of fault prediction in power distribution systems are numerous: reduced maintenance time and cost, prolonged life of power system assets, and increased safety and reliability. There is therefore growing interest among researchers and practitioners in effectively using fault prediction mechanisms in the management and control of the ESDN.

Many machine learning classification algorithms, such as support vector machines (SVM), artificial neural networks (ANNs) and random forests (RF), have been employed as single models for fault prediction in power distribution systems. Nevertheless, studies have demonstrated that combining a collection of individual models in an ensemble algorithm can outperform single models (Dietterich, 2000; Sagi & Rokach, 2018; Zhou, 2021). While ensemble algorithms are generally accepted to yield the best results, the question of how to weight each model, as well as the inherent complexity of ESDN datasets, creates issues for ESDN fault prediction systems.

First, ESDN datasets may be high-dimensional, and the performance of some algorithms that perform well on low-dimensional datasets tends to deteriorate as the number of dimensions grows (Thudumu et al., 2020). Second, ESDN datasets are typically imbalanced: one or more classes in the training dataset contain relatively few data points. This degrades an algorithm's accuracy because a data point from a minority class may be classified as belonging to the majority class (Weiss & Provost, 2001).

Another issue is caused by the introduction of smart meters, sensors, smart appliances and integrated electronic devices (IEDs), resulting in ESDN big data. These data are hard to process and manage within a tolerable operating time and on available hardware resources using traditional technologies and tools, due to their high volume, high velocity, complexity and variability (Jacobs, 2009; Kezunovic et al., 2013). In addition, base learners come with default hyperparameter values chosen for general cases, which can be sub-optimal for a particular dataset. In some instances, such as with ESDN datasets, this may reduce prediction performance, increase the complexity of the learning model and increase the risk of overfitting (Probst et al., 2019).

The issues associated with ESDN dataset complexity hamper the performance and accuracy of machine learning fault prediction algorithms. To effectively and accurately predict faults in the ESDN, all issues associated with the complexity of ESDN datasets need to be addressed simultaneously (Della Giustina et al., 2014). Several studies (see the next section for a review) have proposed methods for fault prediction in power distribution systems; these methods, however, have limited capability to address all of the issues associated with ESDN datasets simultaneously.

This article proposes a machine learning approach for creating an efficient supervised classification ensemble algorithm that deals with the aforementioned issues simultaneously. The proposed method combines the individual predictions of the base classifiers using a weighted voting strategy based on their performance. To handle the complexity of ESDN datasets, namely high dimensionality, imbalance and variability, the classification ensemble algorithm is enhanced with dimension reduction, bootstrap sampling and hyperparameter optimization. The ensemble is implemented on the big data processing framework Apache Spark for parallel and distributed processing, and its performance is demonstrated through numerical experiments on the ESDN dataset.

Related work

Several methods have been put forth by researchers to perform fault prediction in the power distribution system. More specifically, various methodologies have been proposed based on classification algorithms aiming at monitoring the network’s health condition and the different assets to predict abnormalities in the power distribution system. This section presents a brief overview of some valuable outcomes of the proposed methodologies.

Roland and Eseosa (2015) proposed a predictive model for incipient transformer faults in the distribution network, applying ANN to dissolved gas analysis (DGA) data. Along the same line, Sayar and Yüksel (2020) proposed an algorithm for predicting power outages in the electrical distribution network, trained using ANN on hourly outage and meteorological data. The model was able to derive insights into how power system failures and network health conditions are affected by meteorological conditions, with 70.59% accuracy. However, neither study addresses the issues of high dimensionality, imbalance and variability inherent in the ESDN dataset.

Huang et al. (2020) focused on fault prediction in the distribution network using the substantial historical datasets available from distribution network management and monitoring systems. Their proposed model deals with different types of faults in the network by utilizing SVM. Their numerical experiments established that the proposed model achieved higher performance than ANN and C4.5 decision tree models. Wang et al. (2021) extended this work and proposed a prediction system for cascade failures in the smart grid; their extensive numerical experiments on real-time data concluded that the proposed method can efficiently predict cascade failures with accuracy close to 100%. However, SVM may suffer bottlenecks when applied to raw big data in ESDN (Qiu et al., 2016).

Lin et al. (2019) utilized a voted RF (VRF) classifier and developed an efficient fault prediction model for a smart grid distribution system. Their numerical experiments on distribution network fault logs and meteorological data demonstrated that robust classification ensemble models can be developed by modifying available ensemble algorithms to adopt more efficient voting methodologies. Furthermore, their results show that the proposed model outperformed ANN, RF and SVM. Although better performance has been reported, VRF still cannot handle the high dimensionality, imbalance and variability of the ESDN dataset (Qiu et al., 2016).

Cai et al. (2020) recommended a predictive system for feeder fault prediction in a power distribution network based on XGBoost. In numerical experiments on datasets collected over 17 years from Fujian Electric Power in China, the proposed method attained an AUC of 0.8899, indicating that the proposed system is valid and efficient. However, although XGBoost handles imbalanced data well (Le et al., 2022), its capability on datasets with high dimensionality and variability remains limited (Ma et al., 2021).

Stefenon et al. (2021) evaluated the classification efficacy of a hybrid time series model for fault prediction in distribution insulators. The proposed model was implemented using the wavelet technique and long short-term memory (LSTM). The authors presented results illustrating that wavelet LSTM outperformed non-linear auto-regressive (NAR) and NAR with eXogenous input (NARX) models.

Hou et al. (2021) focused on a case study of the 2015 typhoon 'Mijiage' to predict power outage faults caused by typhoon disasters. They built their model on RF using a dataset with 14 features from multiple sources, including geographical, power grid and meteorological data. Their numerical experiments illustrated the proposed model's efficiency for predicting power faults, achieving an accuracy of up to 92.44%. However, RF cannot perform well with the high-dimensional, imbalanced and variable ESDN dataset (Lin et al., 2019; Qiu et al., 2016).

Yang (2021) considered the problem of multi-label fault prediction for the distribution system by proposing a predictive model based on RF and multi-classification SVM (MSVM). RF is used to calculate and sort feature weights to obtain the optimal fault feature set, and MSVM is then used to predict the fault level. Numerical experiments on original data from the distribution network coupled with meteorological data indicated the practical value of the proposed method, with accuracy reaching 95%. However, while the study tackles the issue of high dimensionality, it does not address the imbalance and variability issues of the ESDN dataset.

Materials and methods

The objective of this study was to propose an approach for making reliable fault predictions in ESDN using an ensemble algorithm on the parallel and distributed big data processing framework Apache Spark. This study describes the results obtained from an aggregated ESDN dataset comprising data recorded using automatic meter reading (AMR) and the corresponding weather data of temperature and rainfall. The dataset contains 105,118 instances with a resolution of 20 min, collected from January 2015 to December 2018. Table 1 describes the ESDN dataset.

Table 1. Description of ESDN dataset.

Figure 1 depicts the main process of constructing BDEFP-ESDN, which uses four base predictors, namely gradient-boosted trees (GBT), RF, decision tree (DT) and binomial logistic regression (BLR) from Spark MLlib (Assefi et al., 2017). Van der Laan et al. (2007) argue that combining individual classification algorithms leads to optimal predictions.

Figure 1. Illustration of BDEFP-ESDN construction.


Data preparation and feature normalization

First, the collected ESDN measurement data files in text format and weather data files in Excel format were imported into a Jupyter Lab notebook. Next, the data were extracted from the data files into DataFrames. Then the current, voltage, power, temperature and rainfall DataFrames were aggregated into a single DataFrame. Missing rainfall and temperature data points were filled using linear interpolation.
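As an illustration, the aggregation and gap-filling steps can be sketched in pandas as follows; the DataFrame variables and column names (timestamp, temperature, rainfall) are assumptions for illustration, not the actual schema of the ESDN dataset.

```python
import pandas as pd

def aggregate_esdn(current_df, voltage_df, power_df, temp_df, rain_df):
    """Join the measurement and weather DataFrames on a shared timestamp."""
    frames = [current_df, voltage_df, power_df, temp_df, rain_df]
    merged = frames[0]
    for df in frames[1:]:
        merged = merged.merge(df, on="timestamp", how="outer")
    merged = merged.sort_values("timestamp")
    # Replace missing weather readings by linear interpolation, as in the text.
    merged[["temperature", "rainfall"]] = (
        merged[["temperature", "rainfall"]].interpolate(method="linear")
    )
    return merged
```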

Thereafter, the dataset was labeled for training the prediction model. Labels were added using a sliding window approach based on whether or not a fault appears in the next twenty minutes: a record is labeled '1' if a fault is likely to occur in the next twenty minutes and '0' otherwise. Figure 2 illustrates the sliding window approach used in the dataset labeling process.

Figure 2. Sliding window labeling.

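A minimal sketch of this labeling step in pandas, assuming a hypothetical boolean fault column at the dataset's 20-minute resolution:

```python
import pandas as pd

def label_next_window(df: pd.DataFrame) -> pd.DataFrame:
    """Label each record 1 if a fault occurs in the next 20-minute window."""
    df = df.copy()
    # shift(-1) looks one resolution step (20 minutes) ahead of each record.
    df["label"] = df["fault"].shift(-1).fillna(0).astype(int)
    return df
```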

After data preparation, feature normalization was performed to standardize the dataset so that the values of all features fall within a particular range, preventing features with large value ranges from dominating those with smaller ones. The standard score $Z$ is calculated using the Z-score normalization in Equation (1):
$$Z = \frac{x - \mu}{\sigma} \tag{1}$$
where $x$ is the feature sample to be scaled, $\mu$ is the mean of the training feature samples and $\sigma$ is the standard deviation of the training feature samples.
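In NumPy, Equation (1) amounts to the following sketch, where the normalization statistics are computed on the training split only:

```python
import numpy as np

def z_score(train: np.ndarray, data: np.ndarray) -> np.ndarray:
    """Apply Equation (1) using statistics from the training samples."""
    mu = train.mean(axis=0)      # mean of the training feature samples
    sigma = train.std(axis=0)    # standard deviation of the training samples
    return (data - mu) / sigma   # Z = (x - mu) / sigma
```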

Dimension reduction

Dimension reduction is performed by selecting the most important input feature variables, where the importance of each input feature variable is measured by its gain ratio. First, the information entropy of the target feature variable in the subset $S_i$ ($i = 1, 2, \ldots, n$) of the training dataset $S$ is calculated using Equation (2):
$$\mathrm{Entropy}(S_i) = -\sum_{a=1}^{b} p_a \log_2 p_a \tag{2}$$
where $b$ is the number of values in the range of the target variable in $S_i$ and $p_a$ is the probability of value $a$ among all values in $S_i$.

Then, the information entropy of each input feature variable $y_{ij}$ in the subset $S_i$, apart from the target variable, is calculated using Equation (3):
$$\mathrm{Entropy}(y_{ij}) = \sum_{v \in V(y_{ij})} \frac{|S_{(v,i)}|}{|S_i|}\,\mathrm{Entropy}\!\left(v(y_{ij})\right) \tag{3}$$
where $y_{ij}$ is the $j$-th input feature variable ($j = 1, 2, \ldots, M$) of the subset $S_i$, $V(y_{ij})$ is the set of all possible values of $y_{ij}$, $S_{(v,i)}$ is the subset of sampled data points from $S_i$ in which the value of $y_{ij}$ is $v$, $|S_{(v,i)}|$ is the number of samples in $S_{(v,i)}$ and $|S_i|$ is the number of samples in the subset $S_i$.

At this point, the information gain $G(y_{ij})$ for $y_{ij}$ in the subset $S_i$ of the training dataset $S$ can be calculated using Equation (4):
$$G(y_{ij}) = \mathrm{Entropy}(S_i) - \mathrm{Entropy}(y_{ij}) = \mathrm{Entropy}(S_i) - \sum_{v \in V(y_{ij})} \frac{|S_{(v,i)}|}{|S_i|}\,\mathrm{Entropy}\!\left(v(y_{ij})\right) \tag{4}$$

Next, the split information $I(y_{ij})$ of each input feature variable in the subset $S_i$ of the training dataset $S$ is computed using Equation (5):
$$I(y_{ij}) = -\sum_{a=1}^{b} p_{(a,j)} \log_2 p_{(a,j)} \tag{5}$$
where $b$ is the number of values in the range of the input variable $y_{ij}$ and $p_{(a,j)}$ is the probability of value $a$ among all values of $y_{ij}$.

After that, the gain ratio $GR(y_{ij})$ of each input feature variable in the subset $S_i$ of the training dataset $S$ is obtained using Equation (6):
$$GR(y_{ij}) = \frac{G(y_{ij})}{I(y_{ij})} \tag{6}$$

Finally, the importance $VI(y_{ij})$ of each input feature variable is computed using Equation (7). It assigns a score to each input feature variable $y_{ij}$ in the subset $S_i$ of the training dataset $S$ indicating its relative importance or relevance:
$$VI(y_{ij}) = \frac{GR(y_{ij})}{\sum_{a=1}^{M} GR(y_{(i,a)})} \tag{7}$$

The algorithm then sorts the input feature variables in descending order by their importance values, and the top $p$ features are selected as the most important ones ($p < m$). A further $(m - p)$ input feature variables are selected at random, resulting in $m$ selected input feature variables in total. As a result, the number of features is reduced from $M$ to $m$. Algorithm 1 presents the steps of the dimension reduction process.

Algorithm 1:

Dimension Reduction

Input:

Si;

p;

m;

Output:

FI; // FI is the set of m features of Si

BEGIN

1. Initialize FI, GRi, GRTi, Vi; // GRi is the set of gain ratios of all input feature variables in Si, GRTi is the total gain ratio $\sum_{a=1}^{M} GR(y_{(i,a)})$ of all input feature variables in Si, Vi is the set of pairs (yij, importance score) for all input variables in Si

2. Standardize the training subset Si;

3. Calculate Entropy(Si) using Equation (2);

4. FOREACH yij ∈ Si DO

5.  Calculate Entropy(yij) using Equation (3);

6.  Calculate G(yij) using Equation (4);

7.  Calculate I(yij) using Equation (5);

8.  Calculate GR(yij) using Equation (6);

9.  GRi ← GRi ∪ {GR(yij)};

10. GRTi ← GRTi + GR(yij);

11. END

12. FOREACH yij ∈ Si DO

13. Calculate VI(yij) using Equation (7);

14. Vi ← Vi ∪ {(yij, VI(yij))};

15. END

16. Sort Vi in descending order by VI(yij);

17. Select the top p feature variables from Vi into FI;

18. WHILE count(FI) < m DO

19. Push a randomly chosen feature variable from the remaining (M − p) entries of Vi to FI;

20. END

21. RETURN FI;

END
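A compact Python sketch of Algorithm 1 for a pandas DataFrame with discrete-valued features is given below; the target column name and the random seed are illustrative assumptions. Since the denominator of Equation (7) is a common constant, ranking features by their gain ratio is equivalent to ranking them by VI.

```python
import numpy as np
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Shannon entropy of a discrete series (Equation (2))."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def gain_ratio(df: pd.DataFrame, feature: str, target: str) -> float:
    # Conditional entropy of the target given the feature (Equation (3)).
    cond = sum(
        (len(sub) / len(df)) * entropy(sub[target])
        for _, sub in df.groupby(feature)
    )
    gain = entropy(df[target]) - cond      # information gain, Equation (4)
    split_info = entropy(df[feature])      # split information, Equation (5)
    return gain / split_info if split_info else 0.0   # Equation (6)

def select_features(df, target, p, m, rng=np.random.default_rng(0)):
    """Keep the top-p features by gain ratio, then pad to m at random."""
    feats = [c for c in df.columns if c != target]
    ranked = sorted(feats, key=lambda f: gain_ratio(df, f, target),
                    reverse=True)
    chosen = ranked[:p]
    # Fill up to m features with randomly chosen remaining ones.
    chosen += list(rng.choice(ranked[p:], size=m - p, replace=False))
    return chosen
```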

Bootstrap sampling

To enhance the accuracy of the ensemble on imbalanced ESDN datasets, an improved bootstrap sampling algorithm was designed based on the stationary block bootstrap resampling procedure proposed by Politis and Romano (1994). The bootstrap sampling algorithm creates the set $B$ of all bootstrap samples $\{S_1, S_2, \ldots, S_b\}$ from the training dataset $S$ with $n$ data points, as shown in Algorithm 2. The symbol $g$ represents the geometric distribution value that controls the typical size of the sampled blocks, and $b$ is the number of bootstrap samples.

Each bootstrap sample $S_l \in B$ ($l = 1, 2, \ldots, b$) is obtained by initially picking the first data point $x_k \in S$ ($k = 1$) as the first sampled data point of a pseudo-series value $x_k^* \in S_l$. Next, a uniform random variable $d$ is drawn from the interval $[0, 1]$. With probability $1 - g$ (that is, when $d > g$), the index $i$ ($i = 1, 2, \ldots, n$) of the next sampled data point from $S$ is $k + 1$; otherwise, an index $e$ of a data point in $S$ is selected at random. Then, $x_i \in S$ is selected as the next sampled data point for the pseudo-series value $x_j^* \in S_l$ ($j = 2, 3, \ldots, n$).

Algorithm 2:

Bootstrap Sampling

Input:

S;

g;

b;

Output:

B;

BEGIN

1. FOR l = 1, …, b DO

2. k=1; // k is current index of a sampled data point from S

3. Initialize Sl with xk*=xk;

4. FOR j = 2, …, COUNT(S) DO

5.  Choose d at random;

6.  IF d>g THEN

7.   i=k+1;

8.  ELSE

9.   Choose e at random;

10. i=e;

11.  END

12. xj*=xi;

13.  Sl ← Sl ∪ {xj*};

14.  k=i;

15. END

16. B ← B ∪ {Sl};

17. END

18. RETURN B;

END
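The sampling loop of Algorithm 2 can be sketched in NumPy as follows; the circular wraparound at the end of the series is an assumption, since the text does not specify the boundary handling. A bootstrap sample is then materialized as, for example, `S_l = X[idx]` for each returned index array.

```python
import numpy as np

def stationary_bootstrap(n: int, g: float, b: int,
                         rng=np.random.default_rng(0)):
    """Return b pseudo-series of indices into a dataset of n data points."""
    samples = []
    for _ in range(b):
        k = 0                      # start each pseudo-series at the first point
        idx = [k]
        for _ in range(1, n):
            if rng.uniform() > g:  # with probability 1 - g, continue the block
                k = (k + 1) % n    # wrap around at the end of the series
            else:                  # otherwise start a new block at random
                k = int(rng.integers(n))
            idx.append(k)
        samples.append(np.asarray(idx))
    return samples
```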

Weighted voting

The voting weights for the base predictors are computed based on their local performance. After training, each base predictor $F_i$ from the trained model $F$ is evaluated using the testing dataset $T$: each instance of $T$ is predicted by all predictors in the model, and the results of the predictors are aggregated by voting on a final result for each instance. The local performance $PP_i$ of each predictor $F_i$ is measured in terms of classification accuracy (CA), as defined in Equation (8):
$$PP_i = CA_i = \frac{I_L(F_i(T) = c)}{I_L(F_i(T) = c) + I_L(F_i(T) = e)} \tag{8}$$
where $T$ is the testing dataset, $c \in L$ denotes a correct prediction, $e \in L$ denotes an erroneous prediction and $I_L$ is the indicator function counting the predictions of the classifier $F_i$ on instances of $T$.

After obtaining $PP_i$, a set of trained models and their corresponding prediction performances $FE = \{(F_1, PP_1), (F_2, PP_2), \ldots, (F_n, PP_n)\}$ is created. Next, the normalized classification weights $nw_i$ are calculated using Equation (9):
$$nw_i = \frac{PP_i}{\sum_{k=1}^{n} PP_k} \tag{9}$$

The final classification result of the $n$ base predictors of the ensemble is the weighted majority vote of the individual classification results. The weighted classification result $P_c(t_i)$ is defined in Equation (10):
$$P_c(t_i) = \sum_{j=1}^{n} nw_j \times F_j(t_i) \tag{10}$$

In the prediction phase, this weighted voting mechanism increases the influence of the base predictors with better performance compared to those with poor performance. The steps of the weighted voting scheme in the designed algorithm are presented in Algorithm 3.

Algorithm 3:

Weighted Voting

Input:

T;

F;

Output:

P(T); //prediction result for T

BEGIN

1. Initialize FE, FW; // FE: models with evaluation scores, FW: models with weights

2. FOREACH Fi ∈ F DO // Fi is the i-th predictor of F

3.  Apply Fi on the testing dataset T;

4.  Compute PPi using Equation (8);

5.  FE ← FE ∪ {(Fi, PPi)};

6. END

7. Sort FE in descending order by PP;

8. FOREACH ti ∈ T DO

9.  FOREACH FEj ∈ FE DO

10.  Apply Fj on ti;

11.  Compute nwj using Equation (9);

12.  FW ← FW ∪ {(Fj, nwj)};

13. END

14. Vote the final prediction Pc(ti) using Equation (10);

15. P(T) ← P(T) ∪ {Pc(ti)};

16. END

17. RETURN P(T);

END
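A sketch of Algorithm 3 for classifiers with a scikit-learn-style predict() interface (an assumption for illustration); as in the paper, the weights are derived from each predictor's accuracy on the testing set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def weighted_vote(models, X_test, y_test, classes):
    """Accuracy-weighted voting over a list of trained base models."""
    # Local performance PP_i of each predictor (Equation (8)).
    preds = [m.predict(X_test) for m in models]
    pp = np.array([accuracy_score(y_test, p) for p in preds])
    nw = pp / pp.sum()                  # normalized weights, Equation (9)
    # Accumulate weighted votes per class for each instance (Equation (10)).
    votes = np.zeros((len(X_test), len(classes)))
    for w, p in zip(nw, preds):
        for c_idx, c in enumerate(classes):
            votes[p == c, c_idx] += w
    # The final label is the class with the largest weighted vote.
    return np.asarray(classes)[votes.argmax(axis=1)]
```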

Hyperparameter optimization

The best combination of hyperparameter values was determined using Bayesian optimization. The prediction model was trained and evaluated with each chosen combination of hyperparameters from $H = \{h_1, h_2, \ldots, h_n\}$, and the best prediction model was identified as the one whose hyperparameter values yielded the best performance. The objective function $f$ used to evaluate a combination of hyperparameter values is defined in Equation (11); it maps a set of hyperparameters $h_i$ to the score that measures the performance $PP_i$ of the model on the testing dataset:
$$PP_i = f(h_i) \tag{11}$$

The performance $PP_i$ of the prediction model $F_i$ with hyperparameter values $h_i$ and the set of unique classification labels $L$ is determined by the CA defined in Equation (8); clearly, the best prediction model is the one that produces the highest CA. We then maximize the objective function using Equation (12) to obtain the optimal set of hyperparameters $h^*$:
$$h^* = \underset{h_i \in H}{\arg\max}\, f(h_i) \tag{12}$$

Because the number of hyperparameter combinations is usually large, it is difficult and costly to evaluate the objective function on each of them. Bayesian optimization reduces the number of times the objective function is computed by applying a surrogate model, which estimates the conditional probability of the objective function in order to propose hyperparameter sets that are likely to improve the performance score. The surrogate model is used as presented in Equation (13):
$$h_i = \underset{h \in H}{\arg\max}\, A_i(h;\, \mathcal{S}_{i-1}) \tag{13}$$
where $\mathcal{S}$ is the Gaussian process (GP) surrogate model and $A$ is the acquisition function based on expected improvement (EI).

The steps of the hyperparameter optimization process are presented in Algorithm 4. First, a surrogate probability model of the objective function is initialized. Next, in each iteration, we find the combination of hyperparameters $h_i$ that maximizes the acquisition function $A_i$ under the surrogate model, and then evaluate it with the true objective function. After that, the hyperparameter combination and its corresponding score are appended to the observation history $OH$ of previous samples, and the surrogate probability model is re-trained on the updated history. Once the maximum number of iterations is exhausted, the best-performing combination of hyperparameters under the true objective function is returned as the optimal one.

Algorithm 4:

Hyperparameters Optimization

Input:

H;

Q; //maximum number of iterations

Output:

OP; //Optimal hyperparameters with performance score

BEGIN

1. Initialize h0 ∈ H randomly;

2. Evaluate f(h0) using Equation (8);

3. Initialize OH ← {(h0, f(h0))}; // OH is the observation history

4. Initialize h* ← h0;

5. Initialize f* ← f(h0); // f* is the performance score of the model with h*

6. FOR i = 1, …, Q DO

7.  Select hi ∈ H using Equation (13);

8.  Evaluate f(hi) using Equation (8);

9.  IF f(hi) > f* THEN

10. f* ← f(hi);

11. h* ← hi;

12. END

13. OH ← OH ∪ {(hi, f(hi))};

14. Fit a new surrogate model Si to OH;

15. END

16. OP ← (h*, f*);

17. RETURN OP;

END
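A self-contained sketch of Algorithm 4 using a scikit-learn Gaussian-process surrogate and an expected-improvement acquisition; the discrete candidate grid and the objective signature are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(objective, candidates, Q, rng=np.random.default_rng(0)):
    """Maximize objective(h) over a 2-D array of candidate vectors."""
    h = candidates[rng.integers(len(candidates))]    # random h0
    OH_X, OH_y = [h], [objective(h)]                 # observation history OH
    for _ in range(Q):
        # Re-fit the GP surrogate to the history (step 14 of Algorithm 4).
        gp = GaussianProcessRegressor().fit(np.array(OH_X), np.array(OH_y))
        mu, sigma = gp.predict(candidates, return_std=True)
        best = max(OH_y)
        # Expected-improvement acquisition over the candidate grid.
        with np.errstate(divide="ignore", invalid="ignore"):
            z = (mu - best) / sigma
            ei = np.where(sigma > 0,
                          (mu - best) * norm.cdf(z) + sigma * norm.pdf(z),
                          0.0)
        h = candidates[int(np.argmax(ei))]           # Equation (13)
        OH_X.append(h)
        OH_y.append(objective(h))                    # true objective score
    i = int(np.argmax(OH_y))
    return OH_X[i], OH_y[i]                          # OP = (h*, f*)
```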

Model integration

The integrated BDEFP-ESDN is presented in Algorithm 5. The input dataset is preprocessed and divided into training and testing datasets. A dimension reduction mechanism is then applied to the preprocessed data before the bootstrap samples are generated. During the training phase, the ensemble model is constructed using the determined set of optimal hyperparameters and the relevant features obtained during dimension reduction. The final prediction result is then obtained by combining the results of the ensemble predictors using the calculated voting weights.

Algorithm 5:

BDEFP-ESDN

Input:

D; //Historical dataset

p; //Number of selected important features

m; //Number of selected relevant input feature variables

b; //Number of bootstrap samples

k; //Number of nearest neighbors

n; // Minimum number of neighbors of the majority set from the minority set

E; // Set of predictors {E1, E2, …, En} which constitute the ensemble

H; //Hyperparameter search space

Q; //Maximum number of iterations

Output:

P(T); //Prediction result

BEGIN

1. Preprocess the data from D;

2. Split D into training dataset S and testing dataset T;

3. Select relevant features from S, IF ← dimensionReduction(S, p, m);

4. Create the bootstrap samples from S, B ← bootstrapSampling(S, b, k, n);

5. Initialize the empty set F of trained predictors of the ensemble;

6. FOREACH Ei ∈ E DO

7.  Determine the set of optimal hyperparameters, OP ← hyperOpt(H, Q);

8.  Train Ei with bootstrap sample Si and the combination of hyperparameters h*;

9.  Fi ← trained Ei;

10. F ← F ∪ {Fi};

11. END

12. Perform the weighted voting prediction, P(T) ← weightedVoting(T, F);

13. RETURN P(T);

END
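The training loop (steps 6-11) maps onto Spark MLlib roughly as sketched below, assuming the bootstrap samples are Spark DataFrames with features and label columns; the hyperparameter values shown are placeholders rather than the tuned h*.

```python
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier,
                                       LogisticRegression,
                                       RandomForestClassifier)

def train_ensemble(bootstrap_samples):
    """Fit one base learner per bootstrap sample (steps 6-11 of Algorithm 5).

    bootstrap_samples: list of Spark DataFrames with 'features' and 'label'
    columns, one per base learner. Hyperparameter values are placeholders.
    """
    base_learners = [
        GBTClassifier(maxIter=20),              # gradient-boosted trees
        RandomForestClassifier(numTrees=50),    # random forest
        DecisionTreeClassifier(maxDepth=5),     # decision tree
        LogisticRegression(family="binomial"),  # binomial logistic regression
    ]
    return [est.fit(sample)
            for est, sample in zip(base_learners, bootstrap_samples)]
```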

Performance evaluation

The metrics used to evaluate the performance of BDEFP-ESDN are accuracy, recall, precision and F1-score, as presented in Equations (14)-(17). These metrics derive from the four fundamental outcomes of any classification result: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). TP indicates that a positive label is predicted positive, while TN means a negative label is predicted negative. Conversely, FP means a negative label is predicted positive, and FN means a positive label is predicted negative.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$
$$F_1\text{-}\mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{17}$$

Other metrics are the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC). AUROC measures performance by plotting the true-positive rate against the false-positive rate. Unlike accuracy, AUROC is threshold-independent because it implicitly compares the base error rates between classifiers, and it is therefore considered more reliable in the presence of high false-positive rates (Ling et al., 2003; Sibona & Brickey, 2012). AUPR is another threshold-independent metric, obtained from the plot of precision against recall. For the highly imbalanced ESDN dataset, AUPR is considered more effective and informative than AUROC (Saito & Rehmsmeier, 2015).
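For the binary case, all six metrics can be computed with scikit-learn as sketched below, where y_score holds the predicted positive-class probabilities:

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Return the metrics of Equations (14)-(17) plus AUROC and AUPR."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
        "aupr": average_precision_score(y_true, y_score),
    }
```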

Results and discussion

This section discusses the results of applying BDEFP-ESDN, with the optimization methods presented in the previous sections, to the ESDN dataset on the Apache Spark platform. First, we discuss how BDEFP-ESDN was optimized with dimension reduction and bootstrap sampling, and how the optimal hyperparameter sets were obtained. Finally, we compare the classification performance of BDEFP-ESDN against methods used in previous works.

Dimension reduction

The performance analysis of BDEFP-ESDN with dimension reduction is presented in this section. The number of input features in the ESDN dataset is 11, as shown in Table 1. Both binary and multiclass classification models were generated with two to all 11 input feature variables, and the models with the highest performance were selected.

For the binary classification problem, the prediction model with 11 features produced the best AUPR (0.991235) and was selected as the best prediction model for binary fault prediction. The performance of the designed algorithm with dimension reduction for binary classification is presented in Figure 3.

Figure 3. Performance of binary classification prediction with different number of features.


For multiclass classification, the best accuracy (0.979934) was produced both by the prediction model with 11 features and by the combination of the best-performing feature counts from the individual base models (DT: 10 features; MLR: 3 features; OVR-GBT: 11 features; RF: 7 features). Hence, the model with 11 features was selected as the best prediction model for multiclass fault prediction. The performance of the designed algorithm with dimension reduction for multiclass classification is presented in Figure 4.

Figure 4. Performance of multiclass classification prediction with different number of features.


Bootstrap sampling

This section presents the performance analysis of BDEFP-ESDN with bootstrap sampling. With stationary block bootstrap sampling, optimal geometric distribution values are sought so as to sample block lengths that strongly influence the balance of minority classes in the generated bootstrap samples. In turn, the reduced imbalance in the generated bootstrap samples enhances the performance of the classification models used in fault prediction.

To find the optimal geometric distribution values for the ESDN dataset, each classification model was constructed with geometric distribution values from 0.111111 to 0.999999. Figure 5 shows the performance of BDEFP-ESDN with bootstrap sampling for binary classification.

Figure 5. Performance of binary classification prediction with different values of geometric distribution.


According to Figure 5, the prediction model with a geometric distribution value of 0.888888 achieves the highest performance (accuracy = 0.98989) and is selected as the optimal geometric distribution for binary classification of the ESDN dataset.

The performance of BDEFP-ESDN with bootstrap sampling for multiclass classification is illustrated in Figure 6. The best prediction model selected for multiclass classification of the ESDN dataset was the one with a geometric distribution value of 0.222222, which produced the highest accuracy (0.97963315).

Figure 6. Performance of multiclass classification prediction with different values of geometric distribution.


Hyperparameter optimization

This section presents the critical hyperparameters for generating optimal ensembles from BDEFP-ESDN for the ESDN. The best set of hyperparameters was determined using Bayesian optimization; the prediction model with the highest accuracy for multiclass classification and the highest AUPR for binary classification was selected. Table 2 shows the combinations of hyperparameters used for tuning and optimizing the performance of BDEFP-ESDN for both binary and multiclass classification problems.

Table 2. Classification optimal hyperparameters.

Result comparison

In this section, we compare the CA of the proposed algorithm against traditional classifiers used in previous works: ANN, SVM, RF and XGB. Both binary and multiclass classification results are presented to assess the effectiveness of the proposed approach and validate the results obtained from BDEFP-ESDN.

Tables 3 and 4 present the performance of the proposed algorithm BDEFP-ESDN against ANN, SVM, RF and XGB for binary and multiclass classification, respectively. The highest classification performance for each metric is highlighted in bold. The results show that BDEFP-ESDN is the most efficient algorithm, since it demonstrates the best overall classification performance for both binary and multiclass classification.

Table 3. Binary classification results. The values in bold indicate best performance.

Table 4. Multiclass classification results. The values in bold indicate best performance.

Figures 7 and 8 show the graphical representation of the binary classification results using the ROC curve and the PR curve, respectively. In the ROC plot, better-performing algorithms produce curves nearer to the top-left corner, while curves closer to the dashed diagonal of the ROC space indicate poor performance. Conversely, in the PR plot, curves nearer to the top-right corner indicate better performance. According to Figures 7 and 8, BDEFP-ESDN outperforms ANN, SVM, RF and XGB.

Figure 7. ROC curves for binary classification performance results.


Figure 8. PR curves for binary classification performance results.


Conclusions and future work

With high-dimensional, imbalanced and variable ESDN datasets, the performance of some algorithms tends to deteriorate, while the complexity of the learning model and the risk of overfitting increase. As far as the authors are aware, no comprehensive work has been dedicated to addressing these issues. This paper proposes a big data-based ensemble for fault prediction in ESDN that simultaneously tackles the issues associated with the inherent complexity of ESDN datasets and the weighting of base models. The proposed algorithm was implemented on the parallel and distributed big data framework Apache Spark using a weighted voting approach on DT, RF, GBT and BLR base models. To address the ESDN dataset complexity issues, the algorithm is optimized through dimension reduction, bootstrap sampling and hyperparameter optimization.

The experimental results on the ESDN dataset collected from AMR showed that the proposed algorithm is more suitable for fault prediction in ESDN than traditional classifiers like ANN, SVM, RF and XGB. For future work, we will focus on extending the algorithm to cater for regression problems in forecasting ESDN measurements associated with the predicted faults. Additionally, it will be of interest to explore the proposed approach in different areas of big data predictive modeling other than ESDN fault prediction.

Acknowledgment

This research was carried out as part of the iGrid-Project at the University of Dar es Salaam (UDSM) under the Swedish International Development Agency (SIDA) sponsorship. The authors also appreciate TANESCO for the collaboration provided during the research.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

David T. Makota

David Makota is an assistant lecturer and consultant in the Department of Computer Science, at the Institute of Finance Management (IFM), in Dar es Salaam, Tanzania. He is pursuing his doctoral degree in Computer and IT Systems Engineering at the University of Dar es Salaam. His research interests include machine learning, big data, knowledge sharing, smart grid and database systems.

References

  • Amin, S. M. (2011). Smart grid: Overview, issues and opportunities. advances and challenges in sensing, modeling, simulation, optimization and control. European Journal of Control, 17(5–6), 547–567. https://doi.org/10.3166/ejc.17.547-567
  • Assefi, M., Behravesh, E., Liu, G., & Tafti, A. P. (2017). Big data machine learning using Apache Spark MLlib [Paper presentation]. 2017 IEEE International Conference on Big Data (Big Data) (pp. 3492–3498), Boston, MA, USA. https://doi.org/10.1109/BigData.2017.8258338
  • Cai, J., Cai, Y., Cai, H., Shi, S., Lin, Y., & Xie, M. (2020). Feeder fault warning of distribution network based on XGBoost. Journal of Physics: Conference Series, 1639(1), 012037. https://doi.org/10.1088/1742-6596/1639/1/012037
  • Chen, W. G., Xi, H. J., Su, X. P., & Liu, W. (2012). Application of generalized regression neural network to transformer winding hot spot temperature forecasting. Gaodianya Jishu/High Voltage Engineering, 38(1), 16–21.
  • Della Giustina, D., Pau, M., Pegoraro, P. A., Ponci, F., & Sulis, S. (2014). Electrical distribution system state estimation: Measurement issues and challenges. IEEE Instrumentation & Measurement Magazine, 17(6), 36–42. https://doi.org/10.1109/MIM.2014.6968929
  • Dietterich, T. G. (2000). Ensemble methods in machine learning [Paper Presentation]. In: Multiple Classifier Systems. MCS 2000. Lecture Notes in Computer Science, 1857 (pp. 1–15). Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45014-9_1
  • Hou, H., Zhu, S., Geng, H., Li, M., Xie, Y., Zhu, L., & Huang, Y. (2021). Spatial distribution assessment of power outage under typhoon disasters. International Journal of Electrical Power & Energy Systems, 132, 107169. https://doi.org/10.1016/j.ijepes.2021.107169
  • Huang, W. S., Lu, X., Liu, Y., Chen, Q., Qi, M. H., & Gao, H. J. (2019). Fault prediction of distribution network based on support vector machine. DEStech Transactions on Engineering and Technology Research, no. amee. https://doi.org/10.12783/dtetr/amee2019/33488
  • Jacobs, A. (2009). The pathologies of big data. Communications of the ACM, 52(8), 36–44. https://doi.org/10.1145/1536616.1536632
  • Kezunovic, M., Xie, L., & Grijalva, S. (2013). The role of big data in improving power system operation and protection. Bulk Power System Dynamics and Control-IX Optimization, Security and Control of the Emerging Power Grid (IREP). 2013 IREP Symposium (pp. 1–9).
  • Le, T. T. H., Oktian, Y. E., & Kim, H. (2022). Xgboost for imbalanced multiclass classification-based industrial internet of things intrusion detection systems. Sustainability, 14(14), 8707. https://doi.org/10.3390/su14148707
  • Lin, R., Pei, Z., Ye, Z., Wu, B., & Yang, G. (2019). A voted based random forests algorithm for smart grid distribution network faults prediction. Enterprise Information Systems, 14(4), 496–514. https://doi.org/10.1080/17517575.2019.1600724
  • Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A statistically consistent and more discriminating measure than accuracy. IJCAI, 3, 519–524.
  • Ma, M., Zhao, G., He, B., Li, Q., Dong, H., Wang, S., & Wang, Z. (2021). XGBoost-based method for flash flood risk assessment. Journal of Hydrology, 598, 126382. https://doi.org/10.1016/j.jhydrol.2021.126382
  • Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. https://doi.org/10.1080/01621459.1994.10476870
  • Probst, P., Wright, M. N., & Boulesteix, A. L. (2019). Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3), e1301.
  • Qiu, J., Wu, Q., Ding, G., Xu, Y., & Feng, S. (2016). A survey of machine learning for big data processing. EURASIP Journal on Advances in Signal Processing, 2016(1), 67. https://doi.org/10.1186/s13634-016-0355-x
  • Roland, U., & Eseosa, O. (2015). Artificial neural network approach to distribution transformers maintenance. International Journal of Scientific Research Engineering Technology, 1(4), 62–70.
  • Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1249.
  • Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
  • Sayar, M., & Yüksel, H. (2020). Real-time prediction of electricity distribution network status using artificial neural network model: A case study in Salihli (Manisa, Turkey). Celal Bayar Üniversitesi Fen Bilimleri Dergisi, 16(3), 307–321. https://doi.org/10.18466/cbayarfbe.740343
  • Sibona, C., & Brickey, J. (2012). A statistical comparison of classification algorithms on a single data set. AMCIS 2012 Proceedings. https://aisel.aisnet.org/amcis2012/proceedings/ResearchMethods/2
  • Stefenon, S. F., Freire, R. Z., Meyer, L. H., Corso, M. P., Sartori, A., Nied, A., Klaar, A. C. R., & Yow, K. C. (2021). Fault detection in insulators based on ultrasonic signal processing using a hybrid deep learning technique. IET Science, Measurement & Technology, 14(10), 953–961. https://doi.org/10.1049/iet-smt.2020.0083
  • Thudumu, S., Branch, P., Jin, J., & Singh, J. J. (2020). A comprehensive survey of anomaly detection techniques for high dimensional big data. Journal of Big Data, 7(1), 1–30. https://doi.org/10.1186/s40537-020-00320-x
  • Van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 25. https://doi.org/10.2202/1544-6115.1309
  • Wang, Y., Li, Y., Liang, H., Weng, X., & Huang, M. (2021). An active power failure early warning probability model based on support vector machine algorithm. IOP Conference Series: Earth and Environmental Science, 632(4), 042042.
  • Weiss, G. M., & Provost, F. (2001). The effect of class distribution on classifier learning: An empirical study.
  • Yang, X. (2021). Power grid fault prediction method based on feature selection and classification algorithm. International Journal of Electronics Engineering and Applications, 9(2), 33–44.
  • Zhang, S., Wang, Y., Liu, M., & Bao, Z. (2018). Data-based line trip fault prediction in power systems using LSTM networks and SVM. IEEE Access, 6, 7675–7686. https://doi.org/10.1109/ACCESS.2017.2785763
  • Zhou, Z. H. (2021). Ensemble learning. In Z. H. Zhou (Ed.), Machine learning (pp. 181–210). Springer. https://doi.org/10.1007/978-981-15-1967-3_8