ABSTRACT
ABSTRACT
This paper introduces a method for deriving an accurate regression equation from any set of paired data, and a technique for solving the equation. As a practical example, we used 571 pairs of sediment concentration and river flow data to derive an accurate sediment rating equation. The graphs of the measured and predicted sediment concentrations matched each other, and the data correlation showed a Nash–Sutcliffe efficiency (NSE) of 0.9999860, coefficient of determination (R²) of 0.99998679, root mean square error (RMSE) of 0.0345, mean absolute error (MAE) of 0.0067, volume error (VE) of 1, and sum of squared errors (SSE) of 0.678631. To explain the technique of deriving and solving the accurate regression equation, a video presentation and an Excel spreadsheet are provided as supplementary materials. In general, the method can be used to model any process, and calibration and validation can be addressed.
1. Introduction
The relationship between independent and dependent variables is governed by accepted scientific laws (Seber and Wild 2003) or is expressed by mathematical, statistical, empirical, analytical, or numerical models. To find the model that best fits the measured data, the parameters of the model can be estimated either through calibration or by regression analysis. The performance of the model is then evaluated using different statistical indicators.
Regression analysis, a technique for finding the relation among variables, is important to all scientific work where interpretations need to be drawn from measured data sets (Wu and Yen 1992). Several authors (Seal 1967; Finney 1996; Barnes 1998; Galton 2001) have reviewed the history of regression analysis, and Fernández-Delgado et al. (2019) provided an extensive experimental survey of regression methods.
If the relationship between dependent and independent variables is known, or is defined by a chosen model, the parameters of the model can be determined by parametric regression. The results of analysing data with a parametric model may depend heavily on the model chosen for the regression and variance functions, and also on any preliminary transformation of the variables (Bunke et al. 1999). Non-parametric regression methods, on the other hand, generally have a slower rate of convergence, but need no explicit specification of the form of the regression function; the resulting curve is hence completely determined by the data themselves (Glad 1998). Different types of parametric and non-parametric regression methods, and their descriptions or applications, are given in Linnet (1998), Seber and Wild (2003), Qian and Reckhow (2005), Li and Yin (2009), Lolli and Gasperini (2012), Wang and Du (2014), Yong (2014), Özsoy and Örkçü (2016), and Fernández-Delgado et al. (2019).
Regression methods that can be used for both parametric and non-parametric analysis include the artificial neural network (Specht 1991; Wu and Yen 1992; Zhang et al. 1998) and the fuzzy regression method (Bárdossy et al. 1993; Yang and Lin 2002; Hao and Chiang 2008). Compared to other regression approaches, the artificial neural network is often the more appropriate choice (Wiese and Schaper 1993; Pao 2008; Rahman and Asadujjaman 2021). Artificial neural networks were designed to study the behaviour of real, nonlinear, complex systems, and they are particularly effective for problems where the correlations between the dependent and independent variables are well known (Kopal et al. 2022), but where a precise description by classical mathematical methods is too complicated, too simplified, or impossible (Du and Swamy 2014; Kopal et al. 2022); they also embody much uncertainty and difficulty (Masters and Land 1997; Zhang et al. 1998; Tomandl and Schober 2001; Morala et al. 2021). The neural network model could be a more useful nonlinear regression tool if it successfully incorporated human knowledge (heuristics) and other regression techniques (Wang 1999).
In actual modelling, the underlying processes are generally complex and not well understood, which means that we have little or no idea about the form of the relationship (Seber and Wild 2003). For example, different authors indicate that the power function is a commonly used nonlinear regression approach to model the sediment rating curve (e.g., Asselman 2000; Heng and Suetsugi 2014; Hapsari et al. 2019). However, the error of such a regression equation is very large. Therefore, finding a regression method that can derive an accurate regression equation from any set of paired data is important for the most accurate representation of any process. In this paper, we provide a procedure to derive a complex equation expressing the relationship between dependent and independent variables based on any set of paired data.
2. Methodology
2.1. An iterative approach to derive an accurate regression equation
To arrive at the iteration steps, let us begin with the following definition.
Definition 2.1.
For given values of paired variables and , variables and are defined by
Let , where is the function.
Since a polynomial function can accommodate negative values, positive values, or both, let us consider a polynomial function
Therefore,
where is the error value
Substituting equation (1) into equation (4) gives
Rearranging equation (5) gives
In equation (6), the variables are connected by plus and minus signs. This shows that each variable has an individual effect on the value of the dependent variable (i.e. if the variables are interchanged, the resulting value will be different). This is the reason why we defined the variables in the above way to arrive at equation (6).
Let
Substituting equation (7) into equation (6) gives
Equation (8) represents the actual value of the dependent variable. In equation (8), if an error value is the minimum tolerable error that can be ignored, then the sum of the remaining error values represents an approximate value of the total error. Therefore, the predicted value of the variable is given by
Therefore, the difference between the actual and predicted values is an error, where the subscript refers to the number of error values required to derive an accurate regression equation over the corresponding number of iteration steps. If there is a given number of error values of one kind, there is the same number of corresponding error values of the other kind.
The logic is now that if we are able to express one error as a function of the corresponding error, we can derive an accurate regression equation, because both errors are functions of the paired variables. Therefore, we define an iterative procedure to approximate the value of each error from the value of the corresponding error. The following iteration steps are defined based on equation (9) and the explanations above.
For the first iteration step (), , and . Therefore, the first predicted value of variable (i.e. ) is determined by
If , there is no need to proceed to the next iteration step. If , we proceed to the next iteration step.
For the second iteration step (), , and is determined by
where is the polynomial regression function between the values of and . Therefore, at the second iteration step, the second predicted value of variable (i.e. ) is determined by
If , there is no need to proceed to the next iteration step. If , we proceed to the next iteration step.
For the third iteration step (), , and is determined by
where is the polynomial regression function between the values of and . Therefore, at the third iteration step, the third predicted value of variable (i.e. ) is determined by
If , there is no need to proceed to the next iteration step. If , we proceed to the next iteration step.
For the fourth iteration step (), , and is determined by
where is the polynomial regression function between the values of and . Therefore, at the fourth iteration step, the fourth predicted value of variable (i.e. ) is determined by
If , there is no need to proceed to the next iteration step. If , we proceed to the next iteration step.
For the th iteration step, , and is determined by
where is the polynomial regression function between the values of and . Therefore, at the th iteration step, the th predicted value of variable (i.e. ) is determined by
For the th iteration step, , and is determined by
where is the polynomial regression function between the values of and . Therefore, at the th iteration step, the th predicted value of variable (i.e. ) is determined by
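The iteration steps above can be sketched in code. The following Python snippet is a minimal, simplified analogue, not the paper's exact scheme: the transformation constants are omitted, and the correction polynomials are regressed on the running prediction rather than on the paper's paired error series, since those equations are not reproduced here. The function name and structure are illustrative assumptions.

```python
import numpy as np

def iterative_regression(x, y, degree=3, tol=1e-8, max_iter=20):
    """Simplified analogue of the iterative scheme in Section 2.1:
    fit a base polynomial, then repeatedly fit a correction polynomial
    to the remaining residuals (here regressed on the running
    prediction) and add it, until the error is tolerably small."""
    base = np.polyfit(x, y, degree)           # initial regression (first iteration step)
    y_hat = np.polyval(base, x)
    corrections = []
    for _ in range(max_iter):
        resid = y - y_hat                     # current error values
        if np.sqrt(np.mean(resid ** 2)) < tol:
            break                             # measured and predicted match
        g = np.polyfit(y_hat, resid, degree)  # polynomial regression of the error
        corrections.append(g)
        y_hat = y_hat + np.polyval(g, y_hat)  # next predicted values
    return base, corrections, y_hat
```

Each pass can only reduce the in-sample sum of squared errors, since the zero polynomial is always an admissible correction; as with the paper's procedure, driving the training error toward zero in this way says nothing by itself about performance on unseen data.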
2.2. Determining the final form of the accurate regression equation
Suppose at the th iteration step, . Then, the final form of an accurate regression equation is obtained through substitutions.
Substituting equation (10) into equation (12) gives
Substituting equation (14) into equation (16) gives
Substituting equation (18) into equation (20) gives
Substituting equation (26) into equation (28) gives
Substituting equation (31) into equation (13) gives
Substituting equation (32) into equation (17) gives
Substituting equation (33) into equation (21) gives
Substituting equation (34) into equation (29) gives
Substituting equation (35) into equation (36); equations (35) and (36) into equation (37); equations (35), (36) and (37) into equation (38); and so on. After all substitutions have been made one after the other, the final resulting equation is very long. However, we can see that the error terms are functions of the paired variables, and for given values of the paired variables the constants are all fixed. Therefore,
Substituting equation (39) into equation (30) gives
Substituting equation (2) into equation (3) gives
From equation (41), the term is a function of the variable. Therefore,
Substituting equation (42) into equation (40) gives
Suppose that at the th iteration step, . Then, equation (43) is given by
Equation (44) is the shorthand form of a very long equation. The power constants of the substituting equations make the equation complex and difficult to simplify. However, the substituting equations that form the complex equation are easily interconnected in an Excel spreadsheet or programmed in MATLAB. As we can see from equation (44), there are only two variables, so we can solve the equation for a given value of either one. A procedure for solving the equation is provided in Section 2.5.
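Although the fully substituted equation is very long, it is naturally evaluated as a chain, exactly as one would interconnect the substituting equations in a spreadsheet. A minimal sketch, assuming a simplified structure in which a base polynomial is followed by correction polynomials applied to the running prediction (the actual equation's transformation constants and power terms are not reproduced here):

```python
import numpy as np

def evaluate_chain(base_poly, correction_polys, x):
    """Evaluate a shorthand chained equation: a base polynomial
    followed by correction polynomials, each applied to the running
    prediction, mirroring the substitutions of Section 2.2."""
    y = np.polyval(base_poly, x)
    for g in correction_polys:
        y = y + np.polyval(g, y)
    return y
```

This is the spreadsheet idea in miniature: each correction step consumes the previous step's output as its input column.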
2.3. Determining initial values for deriving an accurate regression equation
In Sections 2.1 and 2.2, we showed the steps to derive and determine the final form of the accurate regression equation based on values of the paired variables. To start deriving the equation, we first have to determine the constants (see equations (1) and (2)). The polynomial function (see equation (3)) directly describes the relationship between the transformed variables, but only indirectly describes the relationship between the original variables. Therefore, for given values of the paired variables, we find values of the constants in equations (1) and (2) such that a plot of the transformed variables yields a smooth polynomial curve. Once all the constants are known, the initial and final values of the variables are determined by following the iteration steps above.
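The search for suitable transformation constants can be automated. As a hedged sketch (the paper's transformation and its constants are not reproduced here; the power-law transforms and grid bounds below are assumptions for illustration only), one can scan candidate transforms and keep the one under which a polynomial fits the transformed pairs best:

```python
import numpy as np

def poly_r2(X, Y, degree=3):
    """R^2 of a polynomial fit of Y on X."""
    c = np.polyfit(X, Y, degree)
    resid = Y - np.polyval(c, X)
    return 1.0 - np.sum(resid ** 2) / np.sum((Y - Y.mean()) ** 2)

def choose_transform(x, y, degree=3):
    """Grid-search hypothetical power exponents (p, q) so that the
    transformed pairs (x**p, y**q) are well described by a polynomial."""
    best = None
    for p in np.linspace(0.1, 2.0, 20):
        for q in np.linspace(0.1, 2.0, 20):
            r2 = poly_r2(x ** p, y ** q, degree)
            if best is None or r2 > best[0]:
                best = (r2, p, q)
    return best
```

In the paper the constants are chosen so that the transformed scatter plot looks like a smooth polynomial curve; maximising the polynomial-fit R² is one way to express that visual criterion numerically.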
2.4. Deriving an accurate sediment rating equation
In the above sections, we indicated the general directions for deriving and determining the final form of the accurate regression equation, as well as how to determine the initial values needed to start the derivation. For a practical example, we use sediment concentration and corresponding river or streamflow data (see the table) to derive an accurate sediment rating equation. In the table, one variable represents the suspended sediment concentration data and the other represents the flow data.
To make it clear, we use the following steps to derive an accurate sediment rating equation based on the above pairs of sediment concentration and river or streamflow data.
For given values of the paired variables, estimate the constants such that a plot of the transformed variables yields a smooth polynomial curve (refer to equations (1) and (2))
Choose a polynomial regression function that fits the plots of versus
From the regression equation in step 2, find the constants of equation (3)
Calculate by using equation (1)
Calculate by using equation (2)
Calculate based on steps 3 and 5
Calculate the first predicted value by using equation (10). Plot the graphs of the measured and predicted values. If the graphs do not match each other, proceed to the next iteration step.
Calculate by using equation (11)
Calculate by using equation (12)
Consider a polynomial regression function to correlate and
Calculate by using the regression equation from step 10 (i.e. refer to equation (35))
Replace the calculated value from step 11 in equation (14)
Calculate the second predicted value by using equation (14). Plot the graphs of the measured and predicted values. If the graphs do not match each other, proceed to the next iteration step.
Replace the calculated value from step 11 in equation (15)
Then, calculate by using equation (15)
Calculate by using equation (16)
Consider a polynomial regression function to correlate and
Calculate by using the regression equation from step 17 (i.e. refer to equation (36))
Replace the calculated value from step 11, and the calculated value from step 18, in equation (18)
Then, calculate the third predicted value by using equation (18). Plot the graphs of the measured and predicted values. If the graphs do not match each other, proceed to the next iteration step, and so on.
We repeat the same procedure to calculate further predicted values by using equation (30), where the subscript stands for the number of iteration steps. During each iteration step, we plot the graphs of the measured and predicted values. The iteration procedure ends when the graphs almost match each other.
Based on the paired data given in the table, the values of the required constants and variables were determined by following the above steps; they are given below. The figure shows the graph of the original river or streamflow versus sediment concentration data, and the graph of the transformed data (see Section 2.3).
As the values of the above constants and variables were already determined, the final form of the accurate regression equation is obtained by direct substitutions (refer to Section 2.2). Therefore, the final form of the accurate sediment rating equation is given by
For the final form of the equation, the graphs of the measured and predicted sediment concentrations matched each other (see the figure).
Since the final form of the equation is very long and complex, the above values of the variables are easily interconnected in an Excel spreadsheet or programmed in MATLAB. A video presentation and an Excel spreadsheet are provided as supplementary files.
2.5. Solving the accurate sediment rating equation
In the above section, we showed the procedure for deriving the accurate sediment rating equation. For the paired suspended sediment concentration and flow data, we calculated each value of the error and the corresponding error. At the fifteenth iteration step, we found that . Therefore, the last remaining errors are and . According to the steps in Section 2.1, the value of is determined by
Based on equation (43),
Therefore,
Based on equations (1), (27) and (39),
For each pair of values of the variables, there are corresponding values of the errors. Now, we take these error values as paired input data to derive another equation that relates them, by following the above steps, and so on. To derive this equation, we calculate further error values (see the steps above). To avoid confusion, let us express these further error values with new symbols. Therefore, we define the following relationship.
For the given paired data, at the value of , we found that . According to the steps in Section 2.1, the value of is determined by
Consider equation (43)
Therefore,
For each paired value of the variables, there is a corresponding unique value of the error. That is, for a given value of one variable, there is only one value of the other that results in the minimum or zero value of the error (i.e. there is no possibility of two different values for the same paired data). From this relationship (refer to equation (73)), to approximate the value for an unknown suspended sediment concentration or flow, we keep deriving a series of equations until the error value is approximately zero or is far from the previous value. The remaining error value determines the accuracy of the approximation. Therefore, to estimate an unknown suspended sediment concentration for a given flow value, the suspended sediment concentration that results in the minimum error value is the solution.
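The "minimum remaining error" criterion of this section can be sketched as a one-dimensional search. In the snippet below, `F` is a stand-in for the derived implicit rating relation (the actual chained equation is not reproduced), and the grid bounds and resolution are illustrative assumptions:

```python
import numpy as np

def solve_for_c(F, q, c_lo, c_hi, n_grid=2000):
    """Given an implicit rating relation F(q, c) ~ 0, scan candidate
    sediment concentrations and keep the one with the smallest |F|,
    mirroring the minimum-remaining-error criterion of Section 2.5."""
    cs = np.linspace(c_lo, c_hi, n_grid)
    errs = np.abs([F(q, c) for c in cs])
    return cs[np.argmin(errs)]
```

A bracketing method such as bisection or golden-section search would converge faster once the error is known to be unimodal in the concentration; the grid scan above is simply the most direct translation of "pick the value with the minimum error".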
Since the system of equations forming the complex equation is very long, a video presentation and an Excel spreadsheet on deriving and solving the accurate sediment rating equation are provided as supplementary files.
3. Results
The iterative approach for deriving an accurate regression equation based on values of paired variables is given in Section 2.1. The procedures to determine the final form of the accurate regression equation are given in Section 2.2. Accordingly, the shorthand form of the final accurate regression equation is given by
where and are variables, and , , and are constants for given values of paired data.
The accurate sediment rating equation, which was derived from 571 records of suspended sediment concentration and flow data, is given by
The graphs of measured and predicted suspended sediment concentration matched each other (see the figure), and statistical measures for the data correlation are given in the table. The procedures to solve the accurate regression equation are given in Section 2.5.
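The statistical indicators reported above can be computed directly from the measured and predicted series. A sketch follows; note that the volume error is expressed here as the ratio of predicted to observed totals, one common convention, and the paper's exact definition may differ:

```python
import numpy as np

def goodness_of_fit(obs, pred):
    """Statistical indicators used to compare measured and predicted
    suspended sediment concentrations."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    resid = obs - pred
    sse = float(np.sum(resid ** 2))
    return {
        "NSE": 1.0 - sse / float(np.sum((obs - obs.mean()) ** 2)),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "MAE": float(np.mean(np.abs(resid))),
        "SSE": sse,
        "VE": float(np.sum(pred) / np.sum(obs)),  # assumed ratio convention
    }
```

For a perfect fit, NSE = 1, RMSE = MAE = SSE = 0, and VE = 1, which is the pattern the reported values approach.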
4. Discussion
The relationship between sediment concentration and flow was given by the complex equation (it is not a polynomial or any other kind of known function). This equation may reflect the complex relationship between the dynamic behaviour of flow and sediment transport.
A power function is a commonly used nonlinear regression approach for predicting sediment concentration from flow data. However, its regression error is very large. A comparison of the sediment prediction accuracy of the proposed regression equation and the power function is given in the table. The proposed regression equation is very accurate, and the regression error can be made as small as desired by increasing the number of iteration steps.
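For comparison, the conventional power-function rating curve, C = aQ^b, is usually fitted by ordinary least squares in log-log space. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def fit_power_rating(q, c):
    """Fit the conventional sediment rating curve C = a * Q**b by
    ordinary least squares on ln(C) = ln(a) + b * ln(Q)."""
    b, ln_a = np.polyfit(np.log(q), np.log(c), 1)
    return np.exp(ln_a), b
```

Fitting in log space minimises multiplicative rather than additive error, and the back-transformed predictions are known to be biased low, which is one reason power-function rating curves carry large errors in the original units.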
Model calibration and validation are challenging tasks when applying a model for a particular purpose, and even for further improvement of the model. For example, if we consider the Modified Universal Soil Loss Equation (MUSLE) or the improved MUSLE, finding the coefficient, soil erodibility, cover, and conservation practice factors through calibration is not a feasible approach (Tsige et al. 2022a, 2022b). This is because only the product effect of the coefficient and these factors, rather than their individual effects, is reflected during the calibration of sediment yield (Tsige et al. 2022a, 2022b). The individual effect of model variables on the engaged physical processes is therefore what matters, and expressing the relationship between model variables in such a way that their individual effects can be seen is essential. The proposed regression method may play a significant role in this regard.
5. Conclusions
The accurate sediment rating equation was derived by following the proposed iteration steps. For the paired values of suspended sediment concentration and flow data, the shorthand form of the final accurate sediment rating equation is given by
where , , and are constants for given values of paired data.
In this paper, polynomial regression functions were used to derive the very long and complex accurate regression equation; however, any other known functions can be used. Likewise, the variables were defined in such a way that the individual effects of the other variables are reflected in the dependent variable (refer to Section 2.1); however, they can be defined in another way, and the proposed iterative approach can still be followed to derive an accurate regression equation.
The proposed iterative approach can be used to derive an accurate regression equation from given values of paired variables. Therefore, it can be used to model any process, and calibration and validation can be addressed.
In this paper, an iterative procedure is provided to solve the accurate regression equation. For further research, an analytical solution of the equation is recommended.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/13873954.2024.2313014
References
- Asselman NEM. 2000. Fitting and interpretation of sediment rating curves. J Hydrol. 234(3):228–248. doi:10.1016/S0022-1694(00)00253-5.
- Bárdossy A, Bogárdi I, Duckstein L. 1993. Theory and methodology: fuzzy nonlinear regression analysis of dose-response relationships. Eur J Oper Res. 66(1):36–51. doi:10.1016/0377-2217(93)90204-Z.
- Barnes TJ. 1998. A history of regression: actors, networks, machines, and numbers. Envir & Plan. 30(2):203–223. doi:10.1068/a300203.
- Bunke O, Droge B, Polzehl J. 1999. Model selection, transformations and variance estimation in nonlinear regression. Stat: A J Theo & Appl Stat. 33(3):197–240. doi:10.1080/02331889908802692.
- Du KL, Swamy MNS. 2014. Neural networks and statistical learning. London: Springer. doi:10.1007/978-1-4471-5571-3.
- Fernández-Delgado M, Sirsat MS, Cernadas E, et al. 2019. An extensive experimental survey of regression methods. Neural Networks. 111:11–34. doi:10.1016/j.neunet.2018.12.010.
- Finney DJ. 1996. A note on the history of regression. J Appl Stat. 23(5):555–557. doi:10.1080/02664769624099.
- Galton SJM. 2001. Pearson, and the Peas: a brief history of linear regression for statistics instructors. J Stat Educ, 9.
- Glad IK. 1998. Parametrically guided non-parametric regression. Scand J Stat. 25(4):649–668. Available from https://www.jstor.org/stable/4616530.
- Hao P, Chiang J. 2008. Fuzzy regression analysis by support vector learning approach. IEEE Trans Fuzzy Syst. 16(2):428–441. doi:10.1109/TFUZZ.2007.896359.
- Hapsari D, Onishi T, Imaizumi F, et al. 2019. The use of sediment rating curve under its limitations to estimate the suspended load. Rev Agric Sci. 7:88–101. doi:10.7831/ras.7.0_88
- Heng S, Suetsugi T. 2014. Comparison of regionalization approaches in parameterizing sediment rating curve in ungauged catchments for subsequent instantaneous sediment yield prediction. J Hydrol. 512:240–253. doi:10.1016/j.jhydrol.2014.03.003.
- Kopal I, Labaj I, Vršková J, et al. 2022. A generalized regression neural network model for predicting the curing characteristics of carbon black-filled rubber blends. Polymers. 14(4):653. doi:10.3390/polym14040653
- Li H, Yin G. 2009. Generalized method of moments estimation for linear regression with clustered failure time data. Biometrika. 96(2):293–306. doi:10.1093/biomet/asp005.
- Linnet K. 1998. Performance of deming regression analysis in case of misspecified analytical error ratio in method comparison studies. Clin Chem. 44(5):1024–1031. doi:10.1093/clinchem/44.5.1024.
- Lolli B, Gasperini P. 2012. A comparison among general orthogonal regression methods applied to earthquake magnitude conversions. Geophy J Int. 190(2):1135–1151. doi:10.1111/j.1365-246X.2012.05530.x.
- Masters T, Land W. 1997. A new training algorithm for the general regression neural network. In: Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics; Orlando, FL, USA.
- Morala P, Cifuentes JA, Lillo RE, et al. 2021. Towards a mathematical framework to inform neural network modelling via polynomial regression. Neural Networks. 142:57–72. doi:10.1016/j.neunet.2021.04.036.
- Özsoy VS, Örkçü HH. 2016. Estimating the parameters of nonlinear regression models through Particle Swarm optimization. Gazi Univ J Sci. 29:187–199.
- Pao H. 2008. A comparison of neural network and multiple regression analysis in modeling capital structure. Expert Syst Appl. 35(3):720–727. doi:10.1016/j.eswa.2007.07.018.
- Qian SS, Reckhow KH. 2005. Nonlinear regression modeling of nutrient loads in streams: a Bayesian approach. Water Resour Res. 41.
- Rahman M, Asadujjaman M. 2021. Implementation of artificial neural network on regression analysis. In: 2021 5th Annual Systems Modelling Conference; Canberra, Australia. doi:10.1109/SMC53803.2021.9569881.
- Seal HL. 1967. Studies in the history of probability and statistics. XV: the historical development of the gauss linear model. Biometrika. 54(1):1–24.
- Seber GAF, Wild CJ. 2003. Nonlinear regression. Hoboken (NJ): John Wiley & Sons.
- Specht DF. 1991. A general regression neural network. IEEE Trans Neural Net. 2(6):568–576. doi:10.1109/72.97934.
- Tomandl D, Schober A. 2001. A modified general regression neural network (MGRNN) with new, efficient training algorithms as a robust ‘black box’-tool for data analysis. Neural Networks. 14(8):1023–1034. doi:10.1016/S0893-6080(01)00051-X.
- Tsige MG, Malcherek A, Seleshi Y. 2022a. Estimating the best exponent and the best combination of the exponent and topographic factor of the modified universal soil loss equation under the hydro-climatic conditions of Ethiopia. Water. 14(9):1501. Available from https://www.mdpi.com/2073-4441/14/9/1501.
- Tsige MG, Malcherek A, Seleshi Y. 2022b. Improving the modified universal soil loss equation by physical interpretation of its factors. Water. 14(9):1450. Available from https://www.mdpi.com/2073-4441/14/9/1450.
- Wang F, Du T. 2014. Implementing support vector regression with differential evolution to forecast motherboard shipments. Expert Syst Appl. 41(8):3850–3855.
- Wang S. 1999. Nonlinear regression: a hybrid model. Comput Oper Res. 26(8):799–817. doi:10.1016/S0305-0548(98)00088-4.
- Wiese M, Schaper KJ. 1993. Application of neural networks in the QSAR analysis of percent effect biological data: comparison with adaptive least squares and nonlinear regression analysis. SAR and QSAR in Envir Res. 1(2–3):137–152. doi:10.1080/10629369308028825.
- Wu FY, Yen KK. 1992. Applications of neural network in regression analysis. Comput Ind Eng. 23(1–4):93–95. doi:10.1016/0360-8352(92)90071-Q.
- Yang M, Lin T. 2002. Fuzzy least-squares linear regression analysis for fuzzy input–output data. Fuzzy Sets Syst. 126(3):389–399. doi:10.1016/S0165-0114(01)00066-5.
- Yong L. 2014. Novel global harmony search algorithm for least absolute deviation. J Appl Math. 2014:1–6. doi:10.1155/2014/632975.
- Zhang G, Patuwo BE, Hu MY. 1998. Forecasting with artificial neural networks: the state of the art. Int J Forecasting. 14(1):35–62. doi:10.1016/S0169-2070(97)00044-7.