Creating optimized machine learning pipelines for PV power generation forecasting using the grid search and tree-based pipeline optimization tool

Article: 2323818 | Received 20 Apr 2023, Accepted 22 Feb 2024, Published online: 15 Mar 2024

Abstract

Demand for electric power, especially amid limited fossil fuel-based generation capacity, has elevated renewable energy sources to a forefront solution for growing energy needs. Solar energy, a key renewable source harvested through photovoltaic (PV) panels, faces challenges such as intermittency and non-dispatchability. Recent research has therefore focused on developing programs to predict near-future solar energy generation, with machine learning as a pivotal approach. This article details the creation of an effective machine-learning pipeline for predicting future hourly power generation from weather data (e.g. temperature, humidity, irradiance). The pipeline, intended for a scheduling system on a farm equipped with a Solar Power System (SPS) in Al-Salt, Jordan, was optimized using Genetic Algorithm and Grid Search methods, with the objective of producing an optimal pipeline with minimal loss. The evaluation shows that ensemble regressors, especially Gradient Boosting Regressors, are effective: the pipeline found by grid search outperformed the pipeline produced by TPOT optimization, the latter comprising stacked ensemble regressors and sequential preprocessors.

1. Introduction

The demand for electrical energy is increasing dramatically due to rapid population growth. Data from (Our World Data, Citation2021a) show that between 2015 and 2019, energy production rose by around 9.5% to meet demand. Consequently, alternative renewable energy sources were one of the proposed solutions to cover these needs. During the same period, dependency on fossil fuels dropped by 3.3% as a result of the efforts made and the policies followed to benefit from renewable energy (Our World Data, Citation2021b).

One of the most popular renewable sources is photovoltaic cells, which can generate electrical power by converting solar energy into electrical energy. As described by Wang et al. (Citation2020a), this conversion occurs through the photovoltaic effect. Photovoltaic cells together form what is known as a Solar Power System (SPS), where a group of cells is arranged in a panel (called a solar panel), and each panel can generate a specific amount of electrical power depending on its characteristics. These panels, along with other power electronics components, make up the SPS.

SPS power generation is intermittent and non-dispatchable, as it depends mainly on solar irradiance. Once the irradiance level drops for weather-related reasons, power generation decreases considerably. Furthermore, a better match needs to be made between the typical load profile and the regular power generation of the SPSs, as mentioned in Khasawneh and Illindala (Citation2013).

Several technologies are being used to overcome these challenges, especially the non-dispatchable generation, such as energy storage for later usage and electricity-based operations time scheduling through Energy Management Systems (EMS) as presented in (Illindala et al., Citation2015; Khasawneh & Illindala, Citation2014). The primary purpose of an EMS is to reduce energy costs by providing adaptive operational optimization, as mentioned in Abraham (Citation2017).

Forecasting SPS power generation can improve EMS functionality by providing forward-looking data that enhance optimization, such as day-ahead power generation forecasts. Meanwhile, forecasting future power generation for a facility equipped with an SPS can reduce the facility's energy bill. For instance, using the forecasts to understand the SPS's hourly energy generation and scheduling the operating hours of specified electrical appliances and machines in the facility to match the high-generation hours could reduce the amount of energy sourced from utilities (billed energy). This reduction leads to lower costs and more environmentally friendly operations.

There has been a growing interest in using connected and off-grid solar power systems due to the increasing demand for renewable energy sources. However, the unpredictable nature of renewable energy sources, such as solar irradiance, poses a challenge to effective power management in these systems. To address this challenge, this study aims to develop an optimal ML pipeline for accurate forecasting of power generation in an existing solar power system in Jordan.

The ML pipeline can be defined as a series of data preprocessors concatenated with a machine learning model; Figure 1 shows an illustration. The pipeline passes the input data into the preprocessors, where data are transformed (e.g. scaled, categorized, encoded, etc.) and fed to the machine learning (ML) model to predict the power generation values. Note that the ML models will be called “models” in the following text for simplicity.
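As a concrete sketch, such a preprocessor-plus-model pipeline can be assembled with scikit-learn. The scaler, regressor choice, feature count, and toy data below are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of a preprocessing + model pipeline, assuming scikit-learn.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

pipeline = Pipeline([
    ("scaler", StandardScaler()),            # preprocessor: scale input features
    ("model", GradientBoostingRegressor()),  # ML model: predict power (W)
])

# Toy weather-like inputs (rows = hours, columns = features) and power targets.
np.random.seed(0)
X = np.random.rand(50, 3)
y = 1000.0 * X[:, 1] + 50.0 * np.random.rand(50)  # power roughly follows irradiance

pipeline.fit(X, y)                 # preprocessors and model are fit together
predictions = pipeline.predict(X[:5])
print(predictions.shape)           # one prediction per input row
```

Wrapping the preprocessors and model in a single `Pipeline` object ensures the same transformations are applied identically at training and prediction time.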

Figure 1. Machine learning pipeline structure.


To build such a pipeline, experimenting with several combinations of data preprocessors and models is required, where each combination represents a pipeline. One of the goals of this study is to find the best (optimal) pipeline that achieves the highest power forecasting accuracy. Two optimization methods (illustrated in Section 1.2) were used to achieve that.

The proposed pipeline aims to improve the utilization of the generated power and extend the life of the battery bank, thus reducing the dependency on non-renewable energy sources.

1.1. Literature review

Solar power generation forecasting has advanced rapidly in recent years. Artificial Intelligence (AI) algorithms are being heavily utilized to build models that predict power generation at different resolutions and horizons, both of which are essential to consider when building a forecasting model. Resolution indicates the time range of each prediction (e.g. hourly power generation predictions), while the horizon is the length of time into the future for which predictions are prepared. For example, a model might be trained to predict the average power generation values every 15 min for the next day; in that case, the resolution is 15 min and the horizon is 1 day.

The following section illustrates related studies on building solar power forecasting models:

Al-Dahidi et al. (Citation2019) developed an ensemble of optimized Artificial Neural Networks (ANNs) for predicting SPS power generation 24 h ahead. The ANNs receive meteorological weather data as input. The study also covered quantification of the uncertainty in the obtained predictions. Wang et al. (Citation2020b) conducted a broad review of publications that studied solar power generation forecasting using AI, including Machine Learning (ML), expert systems, and Deep Learning (DL). Examples of the algorithms covered are Support Vector Machines (SVM), Extreme Learning Machines (ELM), stacked auto-encoders, and several types of neural networks. Furthermore, various optimization methods were reviewed to study their effect on enhancing AI models' performance, including Particle Swarm Optimization (PSO), the Genetic Algorithm (GA), and Differential Evolution (DE). Utilizing the DL approach, Shin et al. (Citation2021) used an adaptive neuro-fuzzy inference system and an ANN to predict hourly solar radiation and sunshine from various meteorological weather factors, appending these predictions to the factors to predict future solar power generation over different horizons.

Moreover, Kim et al. (Citation2022) compared and analyzed solar power prediction accuracy using weather forecast meteorological data to train Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) learning models; the data were provided by the Meteorological Data Open Portal of the Korea Meteorological Administration. Another approach was taken by Zang et al. (Citation2018), considering historical time-series power generation data as input instead of weather metrics; the work proposed a hybrid method based on deep convolutional neural networks and variational mode decomposition for short-term power forecasting. To overcome one of the issues facing power generation prediction models, noisy data, Liu and Sun (Citation2019) proposed a new approach utilizing Principal Component Analysis (PCA), the K-means clustering algorithm, and the random forest algorithm. Additionally, optimization was applied using the differential evolution Grey Wolf Optimizer to model photovoltaic power generation in three regions.

Many studies have also covered the potential and advantages of integrating power generation forecasting into EMSs. Lee and Cheng (Citation2016) reviewed 276 papers related to Building Energy Management Systems (BEMS), leading to the categorization of the three main functions for increasing BEMS efficiency: scheduling control, tariff and load control, and AI-enabled smart home/environment, which lifted BEMS energy savings from 11.39% to 16.22%. Bourhnane et al. (Citation2020) investigated the several approaches adopted for energy consumption prediction and scheduling, given their crucial role in deploying energy-efficient management systems through the scheduling of typical home appliances. Furthermore, Setlhaolo et al. (Citation2014) studied residential demand response to minimize electricity costs; their case study, which utilized consumption shifting, indicated that a household may realize an electricity cost saving of over 25%.

1.2. Paper contribution

This article illustrates the experimental work done by a research team at the University of Jordan to create an optimal ML pipeline. The proposed pipeline takes weather forecast metrics for a specific date as input and predicts that day's average hourly power generation (in watts). The study was carried out on an off-grid Solar Power System located on a farm in the city of Al-Salt, in the west of Jordan.

The purpose of the proposed optimal ML pipeline is to generate accurate power generation forecasts that can be fed to a scheduling algorithm, which defines the optimal working hours for specific electrical appliances, gadgets, and machines on the farm. This increases the utilization of the power generated by the off-grid SPS and reduces the dependency on the battery bank, thus extending its life.

Figure 2 illustrates the impact of employing power generation forecasts generated by the optimal ML pipeline on scheduling the operational hours of electrical appliances on the farm. In this particular example, it can be observed that there was a reduction in the demand for power supplied by the grid due to the enhanced utilization of the power generated by the SPS.

Figure 2. An example of scheduling the appliances’ working hours in the farm, based on the power generation forecasts given by the created optimal ML pipeline.


The off-grid SPS supplies the farm with the energy needed for daily activities. The main components of the system are the battery bank, solar panels, and power inverter (which converts the direct current (DC) generated by the panels to alternating current (AC)). The structure of the off-grid SPS is visualized in Figure 3. When sufficient solar irradiance is available, the farm is supplied entirely by the electrical energy generated from the solar panels; when solar irradiance is lacking, batteries supply the remaining energy.

Figure 3. Main components of the farm’s solar power system.


The SPS power generation capacity is 4480 Watts under the Standard Test Conditions (STC). In practice, the maximum power generation recorded was around 3800 Watts, based on the SPS's previous years' power generation statistics recorded by the system controller.

The optimal ML pipeline-building process is summarized in Figure 4. Historical weather and power generation data were used to train several pipelines, evaluate them, and identify the one with the best performance (the optimal ML pipeline). Two approaches were applied to achieve this goal.

Figure 4. Training process visualization.


The first approach utilized the Tree-based Pipeline Optimization Tool (TPOT). As illustrated in Olson (Citation2022), TPOT uses genetic programming stochastic search to find the optimal pipeline: it trains and evaluates an immense number of different pipelines, progressively moving toward the optimal pipeline structure using GA.

The other method was the Grid Search (GS) optimized training, where several pipelines were structured and trained based on different preprocessors, regressors, and hyperparameter combinations. These pipelines were evaluated, and the best (the pipeline with minimum cost function value) was compared with the optimal one resulting from the TPOT approach. Accordingly, the best of the two was considered to be the optimal ML pipeline.

2. Dataset

2.1. Dataset overview

The historical power generation data were collected between 27 October 2021 and 1 March 2022. These data were collected manually by retrieving the saved data from the SPS controller memory via a USB flash drive. The controller could save data for only 20 days, after which they would be overwritten. Accordingly, the team had to collect the data from the site every 2 weeks, to be used later as the training set target variable.

The historical weather data used as training input were downloaded from meteoblue.com (MeteoBlue, Citation2022), a web service that has provided hourly historical weather simulation data since 1979 for nearly every location on Earth. The downloaded weather data corresponded to the dates covered by the power generation data collection period.

Only records captured during daylight hours were kept in the dataset, to avoid the over-complexity that could result from training the ML pipeline to predict already-known values (zero watts for night hours). The daylight hours considered were 7:00–18:00 (24-h clock). This window was chosen because the sunrise time at the farm location varies between 5:29 and 7:36 and sunset between 16:43 and 19:46, so a time window between 7:00 and 18:00 is sufficient to capture practically all the solar generation.

For the testing data, meteoblue also provides weather forecasts for up to 7 days ahead. During May 2022, the forecast data were downloaded day by day for 10 days, and the power generation data were recorded and collected from the farm's SPS for the same period. These data were used to evaluate the pipelines: the weather forecast data represented the testing set input, and the collected power generation data represented the testing set target values. The dataset structure is shown in Figure 5.

Figure 5. Dataset splitting strategy for the training and testing processes.


2.2. Power generation data overview

The historical power generation data consisted of 4000 records. The data were recorded by the SPS controller, which saves the instantaneous power generation value every 20 min, along with some other variables: date-time (Month – Day – Hour – Minute), load (Watt), string voltage (Volt), battery charge (%), string current (A), inverter output voltage (Volt), and inverter output frequency (Hz). Figure 6 shows the actual setup on the farm, including the inverter, battery packs, and the PC used to read and store data.

Figure 6. The experimental setup.


The same specifications apply to the data collected in May 2022 for testing purposes. The test data size was 110 records collected over 10 days (records taken during daylight hours were only considered).

2.3. Weather data overview

The historical weather data features used to train the ML pipeline represent the weather conditions for each hour. The following features were originally considered:

  • Temperature (°C)

  • Sunshine Duration (minute)

  • Total Shortwave Radiation (W/m²)

  • Direct Shortwave Radiation (W/m²)

  • Diffused Shortwave Radiation (W/m²)

  • Precipitation Rate (mm)

  • Relative Humidity (%)

  • Cloud Total Coverage (%)

  • Cloud Coverage High (%)

  • Cloud Coverage Medium (%)

  • Cloud Coverage Low (%)

  • Sea Level Pressure (hPa)

  • Evapotranspiration (mm)

  • FAO Reference Evapotranspiration (mm)

  • Soil Temperature (°C)

  • Soil Moisture (m³/m³)

  • Wind Speed (km/h)

  • Wind Direction (°)

  • Time-related Features (e.g. Month, Hour…)

The weather data were sourced by meteoblue for the farm location: 32°09’18.2” N 35°42’51.2” E. The forecast weather data had the same characteristics as the historical data.

2.4. Power generation data preparation

The target data were collected serially from the SPS controller. The data had tidiness issues and invalid records that had to be dropped (e.g. records with all values equal to zero). Out-of-logging-time readings and noisy power generation values represent the majority of the data issues. Table 1 shows a sample of the raw data.

Table 1. Sample of raw data exported from the SPS controller.

Moreover, the data were aggregated into hourly averages to obtain the mean power generation during each hour. For example, for one day, the power values recorded at 1:20 pm, 1:40 pm, and 2:00 pm are averaged to get the mean power generation during the interval 1:01 pm–2:00 pm.
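This hourly averaging step can be sketched with pandas; the column names, timestamps, and the one-minute shift used to implement the 1:01 pm–2:00 pm binning convention are illustrative assumptions, not the controller's actual export format.

```python
# Hedged sketch of aggregating 20-min controller readings to hourly means.
import pandas as pd

readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-11-01 13:20", "2021-11-01 13:40", "2021-11-01 14:00",
    ]),
    "pv_power_w": [1200.0, 1400.0, 1000.0],
})

# Shift back one minute so a reading at 14:00 falls in the 13:01-14:00 bin,
# matching the interval convention described in the text.
readings["hour_bin"] = (readings["timestamp"] - pd.Timedelta(minutes=1)).dt.floor("h")
hourly = readings.groupby("hour_bin")["pv_power_w"].mean()
print(hourly.iloc[0])  # mean of 1200, 1400, 1000 -> 1200.0
```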

After aggregation, the PV Power (Watt) column was merged with the historical weather data based on the date and time of the record to create the desired training set.

2.5. Weather data preparation

Weather data were downloaded from the Meteoblue weather service. The data were clean and tidy, but some statistical-based preparation was required.

The weather data suffered mainly from high dimensionality and collinearity. Accordingly, highly correlated variables were identified using the Pearson correlation coefficient. The highly correlated features (|Coeff| > 0.75) were grouped; from each group, only one feature was kept, and the others were eliminated. The feature chosen from each group was the one with the higher correlation with the target variable. Figure 7 shows the Pearson correlation coefficients between the features. Note that the lower triangular part is clipped to avoid redundancy.
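The pruning rule above can be sketched as follows; the feature names, toy data, and loop structure are illustrative assumptions, while the 0.75 threshold and keep-the-stronger-target-correlation rule mirror the text.

```python
# Sketch of correlation-based feature pruning, assuming pandas/numpy.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
direct = rng.random(100)
df = pd.DataFrame({
    "direct_radiation": direct,
    "total_radiation": direct * 1.1 + rng.random(100) * 0.01,  # near-duplicate
    "humidity": rng.random(100),
})
target = pd.Series(direct * 2.0)  # toy power-generation target

corr = df.corr().abs()
to_drop = set()
for i, a in enumerate(df.columns):
    for b in df.columns[i + 1:]:
        if corr.loc[a, b] > 0.75:  # highly correlated pair
            # keep the feature more correlated with the target
            weaker = a if abs(df[a].corr(target)) < abs(df[b].corr(target)) else b
            to_drop.add(weaker)

kept = [c for c in df.columns if c not in to_drop]
print(kept)  # total_radiation dropped in favor of direct_radiation
```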

Figure 7. Correlation Matrix: Features Pearson Correlation.


This step was important to reduce the complexity of the pipeline and enhance the performance of parametric regressors. The following features remained after preparation:

  • Temperature (°C)

  • Direct Shortwave Radiation (W/m²)

  • Mean Sea Level Pressure (hPa)

  • Cloud Cover Low (%)

  • Relative Humidity (%)

  • Wind Direction (°)

  • Wind Speed (km/h)

  • Hour

3. Methodology

This section consists of a detailed description of the algorithms used to build the pipelines, the data preprocessing techniques, and the two optimization methods used to obtain the optimal ML pipeline.

3.1. Algorithms

Many algorithms can be used to train ML models. This work required algorithms to predict the continuous variable, which is the hourly power generation values. Accordingly, regression algorithms (regression models) were utilized. The ML algorithms used in this article are:

3.1.1. Ridge regression

Ridge regression is a linear regression algorithm. As proposed by Stephanie (Citation2022), the power of ridge regression comes from the regularization of L2, which adds a penalty to the high coefficients of the linear function. This regularization enhances the model performance by dealing with the multicollinearity in data and the overfitting issue.

Eq. (1) represents the cost calculation formula, where the error for each data point is squared and the sum of these squared errors is added to the penalty value. The penalty is found by squaring each coefficient in the linear equation, summing the squared values, and multiplying the sum by λ, the penalization strength parameter:

$$\mathrm{Cost}_{\mathrm{ridge}} = \sum_{i=1}^{n}\Big(y_i - \sum_{j} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \quad (1)$$

3.1.2. Lasso regression

Lasso is another type of linear regression. As opposed to ridge regression, Lasso uses L1 regularization, which also penalizes high coefficients. The difference is that the L1 type may result in sparse models: some coefficients can become zero and be discarded, whereas the L2 type does not eliminate coefficients. Eq. (2) shows the cost formula for the Lasso type:

$$\mathrm{Cost}_{\mathrm{lasso}} = \sum_{i=1}^{n}\Big(y_i - \sum_{j} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \quad (2)$$
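The two cost formulas differ only in their penalty term. A toy numeric check, with made-up values for X, y, β, and λ, illustrates the difference:

```python
# Didactic evaluation of the ridge (L2) and lasso (L1) cost formulas
# on made-up values; not a fitted model.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0]])
y = np.array([3.0, 4.0])
beta = np.array([1.0, 0.5])
lam = 0.1  # penalization strength

residuals = y - X @ beta                       # y_i - sum_j x_ij * beta_j
sse = np.sum(residuals ** 2)                   # shared squared-error term
ridge_cost = sse + lam * np.sum(beta ** 2)     # + lambda * sum(beta_j^2)
lasso_cost = sse + lam * np.sum(np.abs(beta))  # + lambda * sum(|beta_j|)
print(round(ridge_cost, 3), round(lasso_cost, 3))  # 3.375 3.4
```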

3.1.3. Support vector machines (SVM)

As explained by Raj (Citation2020), the objective of the SVM algorithm is to find a hyperplane in an n-dimensional space that distinctly classifies the data points. In the regression case, the SVM tries to fit the best hyperplane within a threshold value. Radial Basis Function (RBF) and linear kernels were used to train a subset of the models, as illustrated by Raschka et al. (Citation2022b). The linear kernel is the simplest type of kernel and is usually used when data are linearly separable. The RBF kernel can be used to solve non-linear problems; it can be considered a similarity function that produces bounded regions separating data points based on their closeness in the n-dimensional space. The RBF kernel is given in Eq. (3):

$$k\big(x^{(i)}, x^{(j)}\big) = \exp\big(-\gamma\,\lVert x^{(i)} - x^{(j)}\rVert^2\big) \quad (3)$$
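The RBF kernel can be written directly from this formula; the function name and sample points below are illustrative.

```python
# Direct implementation of the RBF kernel: exp(-gamma * ||xi - xj||^2).
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    """Similarity of two points: 1.0 for identical points, decaying toward 0
    as the squared Euclidean distance between them grows."""
    diff = np.asarray(xi) - np.asarray(xj)
    return np.exp(-gamma * np.sum(diff ** 2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical points -> 1.0
```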

3.1.4. AdaBoost ensemble method

AdaBoost works on training several simple models called weak learners and then produces the final model that utilizes these weak learners to make the final prediction. Summarized by Raschka et al. (Citation2022a), the process is carried out in the following steps:

  1. The type and number of models to be trained on the complete training dataset are chosen, e.g. 500 decision trees.

  2. Each decision tree is trained and evaluated, and misclassified points are noted (as in Figure 8).

  3. Consequently, another decision tree is trained on the same dataset, giving higher weight to the misclassified points, as in the second chart of Figure 8, where large circles receive higher weights than other points. The current training focuses on the previously missed points to minimize the loss.

  4. The process continues in the same way. Assuming the process consists of three rounds of boosting, the three trained weak learners (decision trees) are combined by a weighted majority vote, as shown in Figure 8.

Figure 8. Visualization of Adaboost working procedure given a sample of data points. (Raschka et al. Citation2022a).


3.1.5. Gradient boosting ensemble method

Gradient boosting is another member of the boosting family. It shares the same concepts as AdaBoost but with some differences; the key point is that gradient boosting trains the weak learners on the residual errors, i.e. the difference between the true label and the predicted value. Each subsequent model equals the previous model with a slight modification (enhancement) based on the previous model's residual error.

3.1.6. Random forest ensemble method

The principle behind random forests is called bagging: a collection of uncorrelated models (e.g. decision trees) is created, each trained on a data sample drawn from the training set with replacement. The final prediction is the majority vote (classification problem) or the mean of all models' predictions (regression problem).

3.2. Preprocessors

3.2.1. Standard scaler

Scaling the input data features is critically important to enhance training quality, as ML models tend to perform better on scaled data. One well-known method is standard scaling, in which each feature distribution is centered around a mean of zero with a standard deviation of one, so all features have nearly the same range of values. Eq. (4) presents the scaling formula:

$$x_{\mathrm{scaled}} = \frac{x - \bar{x}}{\sigma} \quad (4)$$

where x is the original data point, and x̄ and σ are the mean and standard deviation of the data points, respectively.
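This step maps directly onto scikit-learn's StandardScaler; the toy feature column below is illustrative.

```python
# Standard scaling: subtract the mean, divide by the standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0]])   # one toy feature column
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(), scaled.std())       # centered at 0 with unit spread
```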

3.2.2. Power transformer

The power transformer is a data transformation applied to make the data distribution more normal (bell-shaped), as described in Scikit Learn (Citation2022). This transformer was used in the preprocessing step to bring the data closer to a Gaussian distribution, which is desirable for training the proposed linear regressors. Specifically, the Yeo-Johnson transformation method was chosen due to the presence of both positive and negative values. Eq. (5) shows the data point transformation formula:

$$\psi(y,\lambda)=\begin{cases}\dfrac{(y+1)^{\lambda}-1}{\lambda} & y \ge 0 \ \text{and}\ \lambda \ne 0\\[4pt] \log(y+1) & y \ge 0 \ \text{and}\ \lambda = 0\\[4pt] -\dfrac{(-y+1)^{2-\lambda}-1}{2-\lambda} & y < 0 \ \text{and}\ \lambda \ne 2\\[4pt] -\log(-y+1) & y < 0 \ \text{and}\ \lambda = 2\end{cases} \quad (5)$$
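In practice this transformation is available as scikit-learn's PowerTransformer; the mixed-sign toy values below are illustrative, and λ is estimated from the data by maximum likelihood.

```python
# Yeo-Johnson power transform via scikit-learn, which handles mixed-sign data.
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-2.0], [0.5], [1.0], [5.0], [50.0]])  # skewed, mixed-sign column
pt = PowerTransformer(method="yeo-johnson", standardize=True)
Xt = pt.fit_transform(X)                             # more Gaussian-like output
print(Xt.mean())                                     # standardized: centered near 0
```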

3.3. Optimization method 1: Grid Search

The Grid Search (GS) was utilized to find the best combination among sets of hyperparameters, preprocessors, and regressors. The search process consisted of structuring and training various pipelines using all possible combinations and then evaluating each resulting pipeline.

Table 2 describes the hyperparameters tuned for each model, the total number of models, and the resulting pipeline count. Note that the goal of the GS was to completely train and evaluate the built pipelines over all possible combinations.

Table 2. Grid Search involved regressors, tuned hyperparameters, and the number of trained regressors and total pipelines.
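The exhaustive search over combinations can be sketched with scikit-learn's GridSearchCV; the parameter grid, toy data, and pipeline below are illustrative stand-ins for the study's full search space, not its exact configuration.

```python
# Minimal grid-search sketch: every combination in the grid is trained
# and scored, and the best pipeline configuration is reported.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.random((60, 4))                       # toy weather-like features
y = 800.0 * X[:, 0] + 100.0 * X[:, 1]         # toy power target

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", GradientBoostingRegressor(random_state=0))])
grid = {"model__n_estimators": [50, 100],     # illustrative hyperparameters
        "model__learning_rate": [0.05, 0.1]}
search = GridSearchCV(pipe, grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)                              # trains all 4 combinations x 3 folds
print(search.best_params_)
```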

3.4. Optimization method 2: Tree-based pipeline optimization tool

TPOT uses a tree-based structure to represent pipelines and a version of genetic programming to train and evaluate pipelines, producing the best (optimal) trained pipeline, i.e. the one achieving the lowest loss. Each pipeline structure includes data cleaning, feature selection, feature processing, feature construction, and regressor steps. Regressor hyperparameter optimization is also included in the optimization process. Note that TPOT is a wrapper around the Python ML package scikit-learn.

Taking into account the processing power of the machine used, TPOT was run for 400 generations, with a population size of 50, a crossover rate of 0.6, and a mutation rate of 0.4. The mean absolute error (MAE) was used as the evaluation metric. Figure 9, inspired by Olson et al. (Citation2016), illustrates the TPOT optimization process in a simplified way.

Figure 9. Machine learning pipeline steps automated by TPOT.


4. Results and discussion

The metric chosen to compare the pipelines' performance and identify the optimal ML pipeline is the Average Normalized Mean Absolute Error (ANMAE). ANMAE indicates the average difference between the predicted and actual values, normalized by the range of the actual values: lower ANMAE values mean more accurate predictions, and higher values mean less accurate ones. For example, an ANMAE of 0.1 means the pipeline's predictions are, on average, off by 10% of the range of the actual values. Therefore, lower ANMAE values are desirable when evaluating a machine learning pipeline.

To find the ANMAE, the Normalized Mean Absolute Error (NMAE) is calculated first: for each hour, the difference between each prediction and the corresponding true value is found. This difference is then divided by the 90th percentile range (the range between the 5th and 95th percentiles) of the power generation values for that specific hour. For example, suppose a power generation prediction is made for the hour 9 am. The difference between the predicted value (ŷ) and the true value (y) is calculated and divided by the 90th percentile range (R_H); here, R_9 represents the 90th percentile range of historical power generation values recorded between 9:00 am and 9:59 am. The ratios between the differences and the percentile ranges are then summed and divided by the number of data points (n) for that hour, giving an NMAE for that specific hour (NMAE_H). This yields as many NMAEs as there are daylight hours. The NMAE calculation formulas are shown in Eqs. (6) and (7):

$$R_H = Q_{95} - Q_{5} \quad (6)$$

$$\mathrm{NMAE}_H = \frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{y}_i - y_i|}{R_H} \quad (7)$$

where n is the number of data records for the hour H, Q₉₅ and Q₅ are the 95th and 5th percentiles, respectively, and R_H is the percentile range of the historical power generation values recorded between H:00 and H:59.

The ANMAE is calculated by averaging the NMAEs: all NMAEs are summed and the result is divided by the number of unique hours found in the dataset (the daylight hours from 7:00 to 18:00). Eq. (8) shows the ANMAE formula:

$$\mathrm{ANMAE} = \frac{1}{11}\sum_{H=7}^{18}\mathrm{NMAE}_H \quad (8)$$
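The per-hour NMAE computation can be sketched as follows; the historical values, predictions, and function name are illustrative, while the 5th–95th percentile normalization follows Eqs. (6) and (7).

```python
# Sketch of the NMAE/ANMAE metric: MAE for one hour, normalized by that
# hour's 5th-95th percentile range of historical generation values.
import numpy as np

def nmae_for_hour(y_true, y_pred, history):
    """Eq. (7): mean absolute error divided by Eq. (6)'s percentile range."""
    r_h = np.percentile(history, 95) - np.percentile(history, 5)  # Eq. (6)
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))) / r_h

history = np.linspace(0.0, 1000.0, 101)  # toy historical generation for hour H
y_true = [500.0, 600.0]
y_pred = [550.0, 650.0]
nmae = nmae_for_hour(y_true, y_pred, history)
anmae = np.mean([nmae])  # Eq. (8): averaged over all daylight hours (one here)
print(round(nmae, 4))
```

With the toy history, the percentile range is 950 − 50 = 900 and the MAE is 50, so the hour's NMAE is 50/900 ≈ 0.056.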

The ANMAE was chosen because it gives a relative indication of the pipelines’ performance. Normalization was applied using the percentile range due to the presence of outliers in power generation values.

The Percentage Root Mean Squared Error (PRMSE) serves as another metric to evaluate the performance of ML pipelines. It is particularly effective in comparing different pipeline performances and is robust against outliers (Despotovic et al., Citation2015). A lower PRMSE value indicates better performance. The calculation is given by Eq. (9), where N is the total number of records in the dataset, ŷᵢ denotes the predicted value, and yᵢ the true value:

$$\mathrm{PRMSE} = \sqrt{\frac{\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{N} y_i^2}} \times 100 \quad (9)$$
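Eq. (9) translates directly into a few lines of numpy; the function name and sample values are illustrative.

```python
# Direct implementation of the PRMSE formula in Eq. (9).
import numpy as np

def prmse(y_true, y_pred):
    """Root of (sum of squared errors / sum of squared true values), as %."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.sum((y_pred - y_true) ** 2) / np.sum(y_true ** 2)) * 100.0

print(prmse([100.0, 200.0], [100.0, 200.0]))  # perfect forecast -> 0.0
```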

4.1. Evaluation results

The assessment of the pipelines was conducted on the designated test set detailed in Section 2. Throughout the optimization phase, multiple machine learning pipelines underwent training and evaluation; each ML pipeline training and evaluation is referred to as an experiment. Figure 10 shows the distribution of the NMAE values grouped by hour for all pipelines. During the early morning hours, there are considerable counts of NMAE values exceeding 100%, due to the low generation values during these hours. The noon hours also show NMAE values exceeding 100%; investigating the results revealed that pipelines containing linear models overestimated the power generation values during these hours (11:00–13:00).

Figure 10. NMAE values distribution, grouped by each hour.


The distributions of ANMAE and PRMSE values, resulting from the evaluation conducted in each experiment, are depicted in Figures 11 and 12, respectively. A red vertical line in both charts marks the mean value.

Figure 11. ANMAE values distribution for all the experimented pipelines.


Figure 12. PRMSE values distribution for all the experimented pipelines.


Moreover, Table 3 lists the top- and bottom-ranked pipelines by ANMAE value. Owing to the large number of experimented pipelines, only the evaluation results and details of the top and bottom 3 pipelines are presented. Pipelines containing the Gradient Boosting regressor performed best overall. Conversely, the Support Vector Regressor (SVR) yielded the worst predictions, especially in pipelines that pass the original data through without any preprocessing (Preprocessor = Null).

Table 3. Details of top and bottom 3 pipelines resulted from the 2 optimization methods: Grid Search and TPOT.

The bar chart in Figure 13 illustrates the minimum ANMAE achieved by each regressor type, regardless of the optimization method used. In addition, the GS and TPOT Genetic Algorithm optimization methods were compared through the best pipeline produced by each. The ANMAE of these best pipelines is visualized in Figure 14, and Table 4 details their structures.

Figure 13. Lowest ANMAE achieved by each regressor type.


Figure 14. ANMAE achieved by the best pipeline resulted from each of the 2 optimization methods.


Table 4. Details of best pipeline resulted from each of the 2 optimization methods: Grid Search and TPOT.

5. Conclusion

The work presented in this article aimed to create an ML pipeline that takes forecast weather data metrics for a day as input and predicts the hourly power generation produced by the SPS. The pipeline was built to serve a farm in Jordan equipped with an SPS, so an ML pipeline had to be trained on weather and power generation data specific to the farm’s location.

Weather data were downloaded from a web service that provides simulation-based hourly historical weather data, available since 1979 for nearly every location on Earth. Power generation data were collected through the SPS’s system controller, which records hourly power generation values and saves them in memory.

The pipeline structure consists of two steps: a preprocessor and a regressor. To build and train a robust pipeline, two optimization methods were used to optimize the pipeline steps and their associated parameters. First, GS was used to find the best combination from a set of preprocessors and a set of regressors, together with their hyper-parameters. The second method was TPOT. At the end of the optimization process, each method produced the pipeline that achieved the minimum loss. The loss results and structures of the intermediate pipelines were also recorded; an intermediate pipeline is any pipeline created and trained during the process that did not achieve the lowest loss.
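The GS procedure over a two-step preprocessor-plus-regressor pipeline can be sketched with scikit-learn's `Pipeline` and `GridSearchCV`. The candidate preprocessors, regressors, and hyper-parameter values below are illustrative placeholders, not the article's actual search space.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Two-step pipeline: a preprocessor followed by a regressor.
pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("regressor", GradientBoostingRegressor()),
])

# Illustrative search space: "passthrough" corresponds to the
# Preprocessor = Null case (original data, no preprocessing).
param_grid = [
    {
        "preprocessor": ["passthrough", StandardScaler(), MaxAbsScaler()],
        "regressor": [GradientBoostingRegressor(random_state=0)],
        "regressor__n_estimators": [50, 100],
        "regressor__learning_rate": [0.05, 0.1],
    },
    {
        "preprocessor": ["passthrough", StandardScaler()],
        "regressor": [RandomForestRegressor(random_state=0)],
        "regressor__n_estimators": [50, 100],
    },
]

search = GridSearchCV(pipe, param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
# search.fit(X_train, y_train)
# search.best_estimator_ is then the pipeline with the minimum loss.
```

Every intermediate pipeline's score is retained in `search.cv_results_`, mirroring the article's practice of recording the losses and structures of non-winning pipelines.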

The evaluation results were investigated to compare the two optimization methods and to choose the optimal ML pipeline based on the lowest ANMAE achieved. Pipelines built around ensemble regressors achieved the best performance. The Gradient Boosting regressor performed best overall, with an ANMAE of 46.31% and a PRMSE of 27.07%; it was part of the best pipeline found by the GS method.

On the other hand, the best pipeline resulting from TPOT contained two stacked ensemble regressors, the random forest and gradient boosting. Its preprocessing step comprised two sequential preprocessors, the max normalizer and the max absolute scaler. This pipeline recorded an ANMAE of 51.81% and a PRMSE of 30.16%.
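The structure just described can be approximated in scikit-learn as follows. This is a sketch under stated assumptions: TPOT's exported pipelines chain estimators with its own `StackingEstimator` helper rather than scikit-learn's `StackingRegressor`, and the hyper-parameters here are defaults rather than the tuned values found by TPOT.

```python
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, Normalizer

# Stacked ensemble: random forest predictions feed a gradient
# boosting meta-learner (an approximation of TPOT's stacking).
stacked = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0))],
    final_estimator=GradientBoostingRegressor(random_state=0),
)

# Two sequential preprocessors: per-sample max normalization,
# then per-feature max-absolute scaling.
tpot_like = make_pipeline(Normalizer(norm="max"), MaxAbsScaler(), stacked)
# tpot_like.fit(X_train, y_train); tpot_like.predict(X_test)
```

The order of the two preprocessors matters: the max normalizer acts row-wise on each record, while the max absolute scaler then rescales each weather feature column to the range [-1, 1].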

The pipelines’ performance did not fully meet the requirements of an optimal ML pipeline, for which a lower ANMAE is desired. Nevertheless, the best pipeline was employed to support the farm by predicting the hourly power generation for the next day. These predictions were used as inputs to a scheduling algorithm that matches the operating times of various machines and gadgets on the farm with the hours of peak power generation. This scheduling reduces dependency on the battery bank by shifting consumption activities into high-generation periods, and it contributed to a reduction in power demand from the grid by approximately 27.3%.

Accordingly, retraining should be performed using the same methods, as they can produce pipelines with reasonably good performance. However, more data should be collected to achieve better results: the new training data should cover at least a whole calendar year, so that the training and optimization processes are performed on data spanning the weather conditions of all seasons.

Supplemental material

Cogent_revised.zip


revisedCOGENT.bbl


interact.cls


Acknowledgment

We would like to express our sincere thanks to Beyond Limits – an Industrial and Enterprise-grade AI technology company – for their collaboration and for providing mentorship during the research period.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Data are available on request from the authors.

Additional information

Notes on contributors

Hussam J. Khasawneh

Hussam J. Khasawneh is an Associate Professor of Mechatronics Engineering at the University of Jordan and an Associate Professor of Electrical Engineering at Al Hussein Technical University (HTU). His research focuses on advanced mechatronics and energy systems.

Zaid A. Ghazal

Zaid A. Ghazal is a mechatronics engineering graduate from the University of Jordan.

Waseem M. Al-Khatib

Waseem M. Al-Khatib is a mechatronics engineering graduate from the University of Jordan.

Ahmad M. Al-Hadi

Ahmad M. Al-Hadi is a mechatronics engineering graduate from the University of Jordan.

Zaid M. Arabiyat

Zaid M. Arabiyat is a mechatronics engineering graduate from the University of Jordan.

References

  • Abraham, M. (2017). Encyclopedia of sustainable technologies. Elsevier.
  • Al-Dahidi, S., Ayadi, O., Alrbai, M., & Adeeb, J. (2019). Ensemble approach of optimized artificial neural networks for solar photovoltaic power prediction. IEEE Access,7, 81741–81758. https://doi.org/10.1109/ACCESS.2019.2923905
  • Bourhnane, S., Abid, M. R., Lghoul, R., Zine-Dine, K., Elkamoun, N., & Benhaddou, D. (2020). Machine learning for energy consumption prediction and scheduling in smart buildings. SN Applied Sciences, 2(2), 1–10. https://doi.org/10.1007/s42452-020-2024-9
  • Despotovic, M., Nedic, V., Despotovic, D., & Cvetanovic, S. (2015). Review and statistical analysis of different global solar radiation sunshine models. Renewable and Sustainable Energy Reviews, 52, 1869–1880. https://doi.org/10.1016/j.rser.2015.08.035
  • Illindala, M. S., Khasawneh, H. J., & Renjit, A. A. (2015). Flexible distribution of energy and storage resources: Integrating these resources into a microgrid. IEEE Industry Applications Magazine, 21(5), 32–42. https://doi.org/10.1109/MIAS.2014.2345838
  • Khasawneh, H. J., & Illindala, M. S. (2013). Quantitative and qualitative evaluation of flexible distribution of energy and storage resources [Paper presentation]. In 2013 IEEE Energy Conversion Congress and Exposition, 43–50. https://doi.org/10.1109/ECCE.2013.6646679
  • Khasawneh, H. J., & Illindala, M. S. (2014). Battery cycle life balancing in a microgrid through flexible distribution of energy and storage resources. Journal of Power Sources, 261, 378–388. https://doi.org/10.1016/j.jpowsour.2014.02.043
  • Kim, S.-Y., Lee, E.-J., Khatri, U., Shin, S., Kim, J.-I., & Kwon, G.-R. (2022). Comparative analysis of solar power generation prediction system using deep learning [Paper presentation]. In 2022 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 383–386. IEEE. https://doi.org/10.1109/ICAIIC54071.2022.9722650
  • Lee, D., & Cheng, C.-C. (2016). Energy savings by energy management systems: A review. Renewable and Sustainable Energy Reviews, 56, 760–777. https://doi.org/10.1016/j.rser.2015.11.067
  • Liu, D., & Sun, K. (2019). Random forest solar power forecast based on classification optimization. Energy, 187, 115940. https://doi.org/10.1016/j.energy.2019.115940
  • MeteoBlue. (2022). “History Weather.” https://www.meteoblue.com/en/historyplus.
  • Olson, R. S. (2022). “TPOT—epistasislab.github.io.” http://epistasislab.github.io/tpot/.
  • Olson, R. S., Bartley, N., Urbanowicz, R. J., & Moore, J. H. (2016). Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, 485–492. https://doi.org/10.1145/2908812.2908918
  • Our World Data. (2021a). “Electricity production by source.” https://ourworldindata.org/grapher/electricity-prod-source-stacked.
  • Our World Data. (2021b). “Share of electricity production from fossil fuels.” https://ourworldindata.org/grapher/share-electricity-fossil-fuels.
  • Raj, A. (2020). “Unlocking the True Power of Support Vector Regression—towardsdatascience.com.” https://towardsdatascience.com/unlocking-the-true-power-of-support-vector-regression-847fd123a4a0.
  • Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022a). Leveraging weak learners via adaptive boosting. Packt Publishing.
  • Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022b). Using the kernel trick to find separating hyperplanes in a high-dimensional space. Packt Publishing.
  • Scikit Learn. (2022). “Preprocessing Data - PowerTransformer.” https://scikit-learn.org/stable/modules/preprocessing.html.
  • Setlhaolo, D., Xia, X., & Zhang, J. (2014). Optimal scheduling of household appliances for demand response. Electric Power Systems Research, 116, 24–28. https://doi.org/10.1016/j.epsr.2014.04.012
  • Shin, D., Ha, E., Kim, T., & Kim, C. (2021). Short-term photovoltaic power generation predicting by input/output structure of weather forecast using deep learning. Soft Computing, 25(1), 771–783. https://doi.org/10.1007/s00500-020-05199-7
  • Stephanie. (2022). “Ridge Regression: Simple Definition—statisticshowto.com.” https://www.statisticshowto.com/ridge-regression/.
  • Wang, F., Xuan, Z., Zhen, Z., Li, K., Wang, T., & Shi, M. (2020a). A day-ahead PV power forecasting method based on LSTM-RNN model and time correlation modification under partial daily pattern prediction framework. Energy Conversion and Management, 212, 112766. https://doi.org/10.1016/j.enconman.2020.112766
  • Wang, H., Liu, Y., Zhou, B., Li, C., Cao, G., Voropai, N., & Barakhtenko, E. (2020b). Taxonomy research of artificial intelligence for deterministic solar power forecasting. Energy Conversion and Management, 214, 112909. https://doi.org/10.1016/j.enconman.2020.112909
  • Zang, H., Cheng, L., Ding, T., Cheung, K. W., Liang, Z., Wei, Z., & Sun, G. (2018). Hybrid method for short-term photovoltaic power forecasting based on deep convolutional neural network. IET Generation, Transmission & Distribution, 12(20), 4557–4567. https://doi.org/10.1049/iet-gtd.2018.5847