
European Journal of Remote Sensing Volume 56, 2023 - Issue 1


Open access


Research Article

County-level corn yield prediction using supervised machine learning

Shahid Nawaz Khan (a) Department of Geography, University of Alabama, Tuscaloosa, USA; (b) Institute of Geographical Information Systems, National University of Sciences and Technology, Islamabad, Pakistan

Abid Nawaz Khan (c) Faculty of Information Technology and Communication Sciences (Data Science), Tampere University, Tampere, Finland

Aqil Tariq (d) Department of Wildlife, Fisheries and Aquaculture, College of Forest Resources, Mississippi State University, Mississippi, USA. Correspondence: [email protected]. ORCID: https://orcid.org/0000-0003-1196-1248

Linlin Lu (e) Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. Correspondence: [email protected]

Naeem Abbas Malik (f) Department of Remote Sensing and GIS, PMAS Arid Agriculture University, Rawalpindi, Pakistan

Muhammad Umair (g) Département de Géographie, Université de Montréal, Montréal, QC, Canada

Wesam Atef Hatamleh (h) Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

Farah Hanna Zawaideh (i) Department of Business Intelligence and Data Analysis, Faculty of Financial and Business Science, Irbid National University, Irbid, Jordan

Article: 2253985 | Received 08 Jun 2023, Accepted 28 Aug 2023, Published online: 05 Sep 2023

https://doi.org/10.1080/22797254.2023.2253985


ABSTRACT

The main objectives of this study are (1) to compare several machine learning models for predicting county-level corn yield in the study area and (2) to assess the feasibility of machine learning models for in-season yield prediction. We acquired remotely sensed vegetation indices from the Moderate Resolution Imaging Spectroradiometer (MODIS) using Google Earth Engine (GEE). Vegetation indices for a span of 15 years (2006–2020) were processed and downloaded using GEE for the months corresponding to crop growth (April–October). We compared nine machine learning models for predicting county-level corn yield and then analyzed in-season yield prediction performance using the top three models. The results show that partial least squares regression (PLSR) outperformed the other models for corn yield prediction, achieving the highest training and testing performance. The top three models for county-level corn yield prediction in the study area were PLSR, support vector regression (SVR) and ridge regression. For in-season yield prediction, SVR outperformed the other models, achieving a testing R² of 0.875. Overall, machine learning models predicted both in-season yield (best model R² = 0.875) and end-of-season yield (best model R² = 0.861) with satisfactory performance. These results indicate that remote sensing data and machine learning models can be used to predict crop yield before harvest with reasonable accuracy, which can inform food security planning and early decision making related to climate change impacts on food security.

KEYWORDS: Remote sensing; yield prediction; MODIS; vegetation indices; food security

Introduction

Corn is one of the most important crops globally and is used for several purposes such as food, fodder, and ethanol production (Ranum et al., 2014). The United States (US) is one of the largest corn-producing countries, accounting for approximately 36% of global corn production (Green et al., 2018). The increasing population and the changing climate pose a threat to global food production. The United Nations Sustainable Development Goals aim to eliminate food insecurity by 2030. Therefore, accurate crop yield estimation is important. County-level yield estimation provides key decision-making information for several stakeholders, because key policy decisions, such as the import/export of grains, depend on these estimates, which need to be accurate and timely (Krasnova et al., 2021; Ma et al., 2021). In-season crop yield prediction can also help identify problems in key areas and support early intervention to address them (Shahhosseini et al., 2021).

Crop yield and its estimation are affected by several factors, which can be grouped into two main categories: biotic and abiotic. Biotic factors include crop genotype, diseases, weeds, and pests. Abiotic factors generally include environmental factors such as soil quality, nutrient availability, and climate. The interaction between biotic and abiotic factors in agricultural systems can significantly affect crop yield (Wairegi et al., 2010). For crop yield prediction, parameters such as crop type, growing and harvesting dates, soil moisture, soil temperature, precipitation, land surface temperature, and crop phenology are typically used. However, increasing the number of input factors can increase the uncertainty in prediction models, and collecting data for so many parameters is costly and time-consuming.

With advancements in remote sensing and computing systems, the potential of remotely sensed data for crop yield estimation has been explored over the last few decades. Remote sensing data collected with high spatial and temporal resolution can be used to monitor crop growth and assess the stresses that affect yield, and can thus help estimate crop yield. Earlier studies have shown that crop yield can be predicted using remote sensing data at various scales (Feng et al., 2021). Remote sensing-derived vegetation indices (VIs) characterize vegetation conditions better than raw reflectance values, as they emphasize the vegetation signal and suppress background information. VIs such as the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Soil-Adjusted Vegetation Index, Green Chlorophyll Index, and Normalized Difference Water Index have been used for crop yield prediction and forecasting (Son et al., 2013; Wall et al., 2008). The use of VIs is not limited to crop monitoring; they can also characterize other phenomena such as vegetation change and phenology (Ahmad et al., 2021; Khan et al., 2020). Besides VIs, several other datasets, such as climate, environmental, and soil data, have been used for crop yield prediction in previous studies (Khaki & Wang, 2019). Climate variables such as temperature, precipitation, and vapor pressure deficit are very important for crop growth monitoring and thus for yield prediction. Soil variables such as bulk density, soil pH, available water storage and soil organic carbon have also been used. These variables affect crop growth conditions, and changes in any of them can improve or reduce grain yield, which is why they are helpful for yield prediction.

Remotely sensed data for crop yield estimation are usually used in two types of models: process-based crop simulations (Guan et al., 2017; Jones et al., 2017; Muller et al., 2017; Peng et al., 2018) and empirical statistical models (Bocca & Rodrigues, 2016; Gornott & Wechsung, 2016; Guan et al., 2017; Kern et al., 2018). Crop simulation models represent crop growth at various phenological stages, which limits their applicability to farm or field levels (Sun et al., 2019), whereas empirical statistical models can be applied at regional or national levels because they require fewer input features than process-based models. Over the last decade, artificial intelligence has been applied in agriculture, including crop yield estimation, owing to its capacity to statistically characterize relationships between crop yield and its determining factors. The biophysical factors that indicate crop yield change with time and often show non-linear relationships with crop yield (Dash et al., 2018; Whetton et al., 2017; Wieder et al., 2018). Machine learning (ML) techniques can better model the dynamic factors affecting crop yield within the growing season (Elavarasan et al., 2018). ML algorithms for predicting crop yields have proved robust and accurate compared with traditional statistical crop yield estimation models (Cai et al., 2018; Johnson et al., 2016; Pantazi et al., 2016).

ML algorithms include supervised and unsupervised algorithms. Supervised ML models are trained on labelled data (Caruana & Niculescu-Mizil, 2006). Algorithms such as Decision Tree (DT), Random Forest (RF), Bayesian Network, Artificial Neural Network (ANN) and Support Vector Machines (SVM) are supervised ML models, whereas unsupervised ML algorithms such as the Markov chain model, K-means clustering, the expectation maximization algorithm, density-based spatial clustering of applications with noise and the Apriori algorithm process unlabelled data (Hahne et al., 2008). The DT method, which partitions the data into subsets based on certain features for decision making, is simple to understand and helps identify important features, but tends to overfit the data (Xu et al., 2005). RF, being an ensemble method, uses multiple DTs for prediction, handles a large number of input features and is less prone to overfitting because the final prediction is the average of all individual predictions. ANNs consist of multiple layers of interconnected nodes, loosely analogous to biological neurons, and can identify complex relationships between a large number of variables, but can be prone to overfitting. The support vector regression (SVR) model finds a hyperplane that best fits the data points while minimizing the error between the predicted and actual target values (Asif et al., 2023; Li et al., 2023; Liu et al., 2023). SVR is very effective in high-dimensional spaces and can model non-linear relationships. However, it can be sensitive to the choice of kernel function and computationally expensive for large datasets.

Importantly, supervised ML algorithms rely on the provided input data and corresponding output values to model the relationship between the independent features and the target feature. Among the various supervised ML algorithms, neural networks, linear regression algorithms, RF and SVM are widely used in crop yield estimation studies (van Klompenburg et al., 2020). ANNs were used to predict wheat and rice yields using within-season data on various yield-limiting factors (Baral et al., 2011; Russ et al., 2008). Johnson et al. (2016) used multiple linear regression (MLR) and ANN to predict barley, canola, and spring wheat yields from satellite-derived VIs, i.e. NDVI and EVI (Tariq & Qin, 2023; Tariq et al., 2023). A polynomial regression model using satellite-derived NDVI and leaf area index (LAI) derived from a field spectrometer performed better than logistic regression in predicting maize yield (Kunapuli et al., 2015). Similarly, counter-propagation ANNs (CP-ANNs), XY-fused networks (XY-Fs) and supervised Kohonen networks (SKNs) were used with multi-layer soil data and NDVI to predict wheat yield. Reportedly, SKN performed best, with an overall accuracy of 81.65% compared with 78.3% and 80.92% for CP-ANNs and XY-Fs, respectively (Pantazi et al., 2016). Compared with MLR, RF was reported to be a versatile method for global and regional wheat, maize and potato yield prediction owing to its high precision and accuracy (Jeong et al., 2016). RF was also used to predict sugarcane yield with decent accuracy using simulated biomass indices, observed climate and seasonal climate prediction indices (Everingham et al., 2016).
Spiking neural networks (SNNs) were able to predict winter wheat yield six weeks before harvest with an average accuracy of 95.64% using Moderate Resolution Imaging Spectroradiometer (MODIS) 250-m resolution time series and historical crop yield data (Bose et al., 2016). Furthermore, MODIS NDVI time series and an ensemble of ANNs were used to predict sugarcane yield in Brazil with a relative root mean square error (RRMSE) of 8% and a coefficient of determination (R²) of 0.61 (Fernandes et al., 2017). Khanal et al. (2018) integrated high-resolution remotely sensed data, a crop monitor yield dataset, a one-meter digital elevation model and soil properties data to predict soil properties and corn yield using ANN, SVM and RF (Tariq et al., 2022, 2023; Wahla et al., 2022). RF was reported to outperform the other methods, and soil and VI inputs were more useful than topographic variables (Tariq & Mumtaz, 2022; Tariq et al., 2023).

The studies mentioned above used ML models to predict crop yield at several scales and relied on many input features. As mentioned earlier, collecting data for many features is not always feasible. Furthermore, crop yield prediction is sensitive to the scale at which yield is predicted. Since ML models have different working principles, we hypothesize that their performance will differ, with some models showing superior performance over others. Moreover, since end-of-season yield prediction uses more data than in-season yield prediction, we expect its accuracy to be higher (Sun et al., 2019). Therefore, we propose to compare several ML algorithms for predicting county-level corn yield. We train the ML models using NDVI and EVI data obtained from MODIS and compare the performance of in-season and end-of-season yield prediction. Specifically, we aim to address the following research questions:

(1) What is the predictive performance of several machine learning models for county-level corn yield?
(2) How well can machine learning models predict county-level corn yield using in-season and end-of-season data?

The remainder of this article is structured as follows: Section 2 details the study area, data, and methods used in this study. Section 3 presents the results, while Section 4 discusses the results and their implications. Conclusions are presented in Section 5.

Materials and methods

Materials

Study area

This study was carried out in the state of South Dakota, US. South Dakota is one of the US Midwestern states and accounts for a considerable share of US corn production (Olson et al., 2007). It is situated between approximately 42.5° N and 45.95° N latitude and 96.43° W and 104.06° W longitude, and is bordered by Nebraska to the south, North Dakota to the north, Minnesota to the east, and Wyoming and Montana to the west. The study area has a diverse climate, ranging from humid continental in the east to semi-arid in the west. It has four distinct seasons, with conditions ranging from extremely cold and arid winters to warm and semi-humid summers (Todey et al., 2009). In summer, the average temperature ranges from 20.5°C to 32°C, while in winter it ranges from −12°C to −3.8°C. South Dakota has a total of 66 counties. The eastern part of the state accounts for most crop production. The main crops are corn, soybean, alfalfa, and wheat. Figure 1 illustrates the study area map.

Figure 1. Study area map. Corn pixels for the year 2020 are shown in yellow.

Corn yield data

County-level corn yield data were collected from the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS). USDA reports the yield data annually; each value represents the harvested yield after the season. In South Dakota, corn is usually harvested from October to early November. The yield data were downloaded through the NASS web service Quick Stats (https://www.nass.usda.gov/Quick_Stats/, accessed 18 December 2021). Yield is reported in bushels per acre (bu/ac); we converted it to metric tons per hectare (MT/ha) for this study. County-level yield records for the study area were collected and mapped to their geographic data using the county-level shapefiles obtained from the US Census Topologically Integrated Geographic Encoding and Referencing (TIGER) project (Marx, 1986).
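The unit conversion mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard 56 lb bushel for shelled corn, and the function name is hypothetical.

```python
# Assumed conversion constants: a standard corn bushel weighs 56 lb.
LB_PER_BUSHEL_CORN = 56.0
KG_PER_LB = 0.45359237
HA_PER_ACRE = 0.40468564224


def bu_per_acre_to_mt_per_ha(yield_bu_ac: float) -> float:
    """Convert a corn yield from bushels per acre to metric tons per hectare."""
    kg_per_acre = yield_bu_ac * LB_PER_BUSHEL_CORN * KG_PER_LB
    mt_per_acre = kg_per_acre / 1000.0
    return mt_per_acre / HA_PER_ACRE


# Example: a yield of 100 bu/ac is roughly 6.28 MT/ha.
example_mt_ha = bu_per_acre_to_mt_per_ha(100.0)
```

Under this assumption, one bu/ac of corn corresponds to about 0.0628 MT/ha.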

Vegetation indices (VIs)

Several VIs are available to monitor vegetation health; however, NDVI and EVI are the most widely used VIs for crop health monitoring and yield prediction. For this study, we used the NDVI and EVI derived from MODIS data (Didan, 2015). Compared with raw reflectance, VIs highlight key vegetation conditions such as health, disease and stress, which aids interpretation. This is because certain bands of the electromagnetic spectrum are sensitive to vegetation conditions such as vigor and stress. NDVI values range from −1 to 1. High positive values indicate healthy green vegetation, while lower values represent relatively unhealthy vegetation. Values less than 0 generally indicate an area with no vegetation. EVI is a modified form of NDVI mostly used for areas with higher biomass. EVI also reduces soil and atmospheric effects through its soil-adjustment and aerosol-resistance terms. To extract the VI values at the county level, we used the MODIS product MOD13Q1. MOD13Q1 is a processed remote sensing product derived from MODIS surface reflectance that has undergone several levels of processing to remove known sources of error and noise. We extracted the time series of NDVI and EVI during the study area’s corn growing season (April–October) for 15 years (2006–2020). Since the temporal resolution of MOD13Q1 is 16 days, there were 12 time series observations for each county each year.
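For reference, the two indices can be written out as below. The study uses the precomputed MOD13Q1 layers rather than computing the indices itself, so this is only an illustrative sketch; the EVI coefficients are the commonly cited MODIS values (gain 2.5, C1 = 6, C2 = 7.5, L = 1), and the function names and reflectance values are hypothetical.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from surface reflectances."""
    return (nir - red) / (nir + red)


def evi(nir: float, red: float, blue: float,
        gain: float = 2.5, c1: float = 6.0, c2: float = 7.5,
        soil_adjust: float = 1.0) -> float:
    """Enhanced Vegetation Index; c1 and c2 are the aerosol-resistance
    coefficients and soil_adjust is the canopy background adjustment."""
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + soil_adjust)


# Hypothetical reflectances for a dense green canopy.
sample_ndvi = ndvi(nir=0.5, red=0.1)
sample_evi = evi(nir=0.5, red=0.1, blue=0.05)
```

With these sample values NDVI is about 0.67 and EVI about 0.58, which reflects the typical behavior that EVI is lower than NDVI over dense vegetation.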

Croplands data layer (CDL)

CDL is a specialized land cover dataset developed by USDA with a primary focus on delineating crop types (Craig, 2010). It is provided annually by USDA at a spatial resolution of 30 m. In this research, we used the CDL as a mask to exclude non-corn pixels from the VI dataset. Since county-level yield prediction requires aggregating the VIs to the county level, it is very important to eliminate pixels that belong to other crops or other land cover types. The CDL mask was applied to the MODIS VIs in Google Earth Engine (GEE) (Gorelick et al., 2017).

Data preprocessing

The following steps were applied in data preprocessing:

1. County- and state-level shapefiles were obtained from the US Census TIGER project. Since TIGER shapefiles are available in GEE, they can be imported directly from the GEE data catalog (https://developers.google.com/earth-engine/datasets/catalog/TIGER_2018_Counties, accessed December 20, 2021). The shapefiles were filtered to include only the counties in the study area.
2. The boundaries of the study area were used to retrieve MODIS VIs for the study period (2006–2020). Specifically, the MOD13Q1 product, which is available in GEE, was used.
3. CDL data were used to filter the VIs so that only corn pixels were kept. A binary raster was generated from the CDL and used as a mask to eliminate pixels of other crops and other land cover types.
4. The resulting data were aggregated to the county level using the median and downloaded from GEE. The median aggregation operation is available in GEE and is widely used to aggregate data.
5. The downloaded data were matched to the county-level yield records obtained from USDA, and the yield data were integrated with the VIs. The records were matched on GEOID, a unique identifier consisting of the state and county codes.
6. The data were cleaned. Counties for which USDA provided no yield records were removed, as were counties for which the CDL contained no corn pixels.

Figure 2 illustrates the step-by-step workflow of this study.
Figure 2. Overall workflow of the study.
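The aggregation, matching and cleaning steps listed above can be sketched outside GEE as follows. This is a simplified, hypothetical illustration: the GEOID keys, pixel values and yields are made up, and the real workflow performs masking and median aggregation inside GEE.

```python
from statistics import median


def county_median(pixel_values_by_geoid):
    """Median VI of the corn pixels in each county; counties without
    corn pixels (empty lists) are dropped."""
    return {geoid: median(values)
            for geoid, values in pixel_values_by_geoid.items() if values}


def join_on_geoid(vi_by_geoid, yield_by_geoid):
    """Keep only counties that have both a VI value and a reported yield."""
    return {geoid: (vi, yield_by_geoid[geoid])
            for geoid, vi in vi_by_geoid.items()
            if yield_by_geoid.get(geoid) is not None}


# Hypothetical inputs: GEOID -> corn-pixel NDVI values, GEOID -> yield (MT/ha).
pixels = {"46099": [0.6, 0.7, 0.8], "46011": [0.5, 0.6, 0.9], "46137": []}
yields = {"46099": 10.2, "46011": None}

vi = county_median(pixels)
records = join_on_geoid(vi, yields)
```

Here only one county survives cleaning: the second has no reported yield and the third has no corn pixels.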

Methods

The following ML methods were used to predict county-level corn yield.

Multiple linear regression (MLR)

MLR is used to estimate the value of one dependent variable from multiple (at least two) independent variables (Eberly, 2007). It is the basic form of regression, in which the relationship between several independent variables and one dependent variable is determined. For example, the effect of rainfall, temperature and vapor pressure deficit on crop yield can be determined by establishing the relationship between those variables and crop yield. In MLR, a linear function of the independent variables is fitted to the dependent variable, as shown in Equation (1):

$$y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{n} x_{n} + \varepsilon \tag{1}$$

where $y$ represents the dependent variable (crop yield in this case), $\beta_{0}$ is the y-intercept (the value of the dependent variable when all the independent variables are zero), $\beta_{i}$ represents the regression coefficient of independent variable $x_{i}$, and $\varepsilon$ is the model error, i.e. the variation in $y$ that the model does not explain. Several assumptions are associated with MLR: (1) the relationship between the dependent variable and the independent variables is linear, (2) all the independent variables are normally distributed, (3) there is no multicollinearity in the data, i.e. the independent variables are not correlated with each other, (4) the mean of the residuals (the differences between observed and predicted values) is zero, and (5) the residuals are independent of each other. These assumptions do not always hold, which reduces model performance.
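As a minimal sketch of how the coefficients in Equation (1) can be estimated, the following fits MLR by ordinary least squares through the normal equations. It uses only the standard library to stay dependency-free (a numerical library would normally be used), the data are synthetic, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_mlr(features, target):
    """Return [beta_0, beta_1, ..., beta_n] minimizing the squared error."""
    design = [[1.0] + list(row) for row in features]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]
beta = fit_mlr(X, y)
```

Because the toy data contain no noise, the fit recovers the generating coefficients (1, 2, 3) up to rounding error.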

Polynomial regression (PR)

In linear regression, as discussed above, the model assumes that the relationship between the independent and dependent variables is linear, which is not always true and may make the model inaccurate. When a straight line cannot capture the relationship, a more flexible model is needed. Polynomial regression fits the data using a curvilinear equation rather than a strictly linear one to establish the relationship between dependent and independent variables (Ostertagová, 2012). It takes its name from expressing this relationship through a polynomial equation of the nth degree, as shown in Equation (2):

$$y = \beta_{0} + \beta_{1} x + \beta_{2} x^{2} + \dots + \beta_{n} x^{n} \tag{2}$$
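A polynomial model remains linear in its coefficients, so it can be fitted like MLR once the predictor is expanded into powers. The sketch below shows only that expansion and the evaluation of Equation (2); the coefficient values and function names are hypothetical.

```python
def polynomial_features(x: float, degree: int) -> list:
    """Expand a single predictor into [x, x^2, ..., x^degree]."""
    return [x ** k for k in range(1, degree + 1)]


def predict_polynomial(x: float, coefficients: list) -> float:
    """Evaluate Equation (2); coefficients = [beta_0, beta_1, ..., beta_n]."""
    intercept, slopes = coefficients[0], coefficients[1:]
    terms = polynomial_features(x, len(slopes))
    return intercept + sum(b * t for b, t in zip(slopes, terms))


# Hypothetical second-degree model: y = 1 + 0.5*x + 0.25*x^2.
prediction = predict_polynomial(2.0, [1.0, 0.5, 0.25])
```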

Ridge regression (RR)

RR is a modified form of linear regression that adds a penalty term to ordinary least squares (OLS) regression to avoid overfitting (McDonald, 2009). The penalty is proportional to the sum of the squared coefficients (l2 regularization), which discourages large coefficients and improves the model’s generalization. RR is particularly helpful when there is a possibility of multicollinearity among the independent variables. Equation (3) illustrates RR as the linear model with the penalty term appended; during fitting, this penalty is added to the OLS sum of squared errors. Here $\lambda$ is the regularization parameter and the term $\sum_{i = 1}^{p} \beta_{i}^{2}$ is the sum of the squares of the $p$ coefficients, which penalizes large coefficients:

$$y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{n} x_{n} + \lambda \sum_{i = 1}^{p} \beta_{i}^{2} \tag{3}$$
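The shrinking effect of the l2 penalty is easiest to see with a single predictor. The sketch below assumes the objective is the sum of squared errors plus $\lambda \beta_{1}^{2}$ with an unpenalized intercept (handled by centering); under that assumption the slope has the closed form shown in the code. The data and function name are illustrative.

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b0 - b1*x)^2) + lam * b1^2.

    The intercept is left unpenalized by centering x and y first, which
    gives the closed form b1 = Sxy / (Sxx + lam).
    """
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / (sxx + lam)


# Toy data lying exactly on y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
ols_slope = ridge_slope(x, y, lam=0.0)     # no penalty: ordinary least squares
shrunk_slope = ridge_slope(x, y, lam=5.0)  # penalty pulls the slope toward 0
```

With no penalty the slope is 2; with $\lambda = 5$ it is halved, and it approaches zero (without reaching it) as $\lambda$ grows.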

Lasso regression (LR)

LR is another regression technique that applies regularization in the same way as RR; however, it uses l1 regularization instead of l2 (Ranstam & Cook, 2018). The key difference is that l1 regularization penalizes the absolute values of the model’s coefficients, while l2 regularization penalizes their squared values. Like RR, LR is used in problems where there is potential multicollinearity among the independent variables. The mathematical equation of LR is depicted in Equation (4):

$$y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{n} x_{n} + \lambda \sum_{i = 1}^{p} | \beta_{i} | \tag{4}$$

As the equations show, the only difference between MLR and the other two regression models, i.e. RR and LR, is the additional penalty term, which acts as a regularizer.
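A practical consequence of the l1 penalty is that coefficients can become exactly zero. The single-predictor sketch below assumes the objective is the sum of squared errors plus $\lambda |\beta_{1}|$ with an unpenalized intercept; under that assumption the solution is a soft-thresholded version of the OLS slope. The data and function name are illustrative.

```python
def lasso_slope(x, y, lam):
    """Slope minimizing sum((y - b0 - b1*x)^2) + lam * |b1|.

    After centering, the minimizer is a soft-thresholded OLS slope:
    Sxy is shrunk toward zero by lam / 2 and set to zero if it is smaller.
    """
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    threshold = lam / 2.0
    if sxy > threshold:
        return (sxy - threshold) / sxx
    if sxy < -threshold:
        return (sxy + threshold) / sxx
    return 0.0  # the penalty removes this predictor entirely


# Toy data lying exactly on y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
mild = lasso_slope(x, y, lam=4.0)
strong = lasso_slope(x, y, lam=30.0)
```

A mild penalty shrinks the slope from 2 to 1.6, while a strong one sets it to exactly zero, which is how lasso can act as a feature selector.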

Partial least square regression (PLSR)

PLSR is a technique used to analyze and model the relationship between several independent variables and a dependent variable. It is very useful when many highly correlated predictors are used, and particularly when the number of samples is relatively small. PLSR transforms a large set of correlated predictors into a small number of uncorrelated components that explain as much of the variance in both the input and response variables as possible. These uncorrelated components are then used to construct a linear regression model for the dependent variable. Equation (5) illustrates the PLSR model. The only difference from the previous models is the presence of the terms $(t_{1}, t_{2}, \dots, t_{n})$, where $t_{n}$ represents the scores of the nth PLSR component. These scores are linear combinations of the original predictor variables ($x$) and capture the maximum covariance between $x$ and $y$.

$$y = \beta_{0} + \beta_{1} t_{1} + \beta_{2} t_{2} + \dots + \beta_{n} t_{n} \tag{5}$$
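To make the role of the scores concrete, the sketch below extracts a single PLS component for one response variable: the weight vector is proportional to the covariance of each centered predictor with $y$, the scores $t_{1}$ are the projection of the predictors onto that vector, and $y$ is then regressed on $t_{1}$. It is a simplified one-component illustration with made-up data, not a full PLSR implementation.

```python
from math import sqrt


def first_pls_component(features, target):
    """Return (weights, scores, beta_1, beta_0) for one PLS component."""
    n_rows, n_cols = len(features), len(features[0])
    col_means = [sum(r[j] for r in features) / n_rows for j in range(n_cols)]
    y_mean = sum(target) / n_rows
    xc = [[r[j] - col_means[j] for j in range(n_cols)] for r in features]
    yc = [v - y_mean for v in target]
    # Weights: direction of maximum covariance between predictors and y.
    w = [sum(xc[i][j] * yc[i] for i in range(n_rows)) for j in range(n_cols)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores t_1: linear combination of the centered predictors.
    t = [sum(xc[i][j] * w[j] for j in range(n_cols)) for i in range(n_rows)]
    # Regress centered y on the scores; the intercept is the mean of y.
    beta_1 = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return w, t, beta_1, y_mean


# Two strongly correlated hypothetical predictors.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
y = [2.0, 4.1, 6.0, 8.1]
weights, scores, beta_1, beta_0 = first_pls_component(X, y)
fitted = [beta_0 + beta_1 * t for t in scores]
```

Even with two nearly collinear predictors, a single component is enough to summarize them here, which is the situation PLSR is designed for.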

Decision tree regression (DTR)

DTR uses a decision tree to model the relationship between dependent and independent variables (Xu et al., 2005). DTR models recursively divide the data based on the features that provide the most information gain about the predicted variable. This process is repeated until a stopping condition is met. The DTR model estimates the value at each leaf node by averaging the target values of the samples that fall into that leaf (Asadollah et al., 2021). Equation (6) illustrates this, where $y$ is the predicted value at the leaf node, $y_{i}$ are the target values of the samples in the leaf and $n$ is the number of those samples:

$$y = \frac{1}{n} \sum_{i = 1}^{n} y_{i} \tag{6}$$
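The leaf average in Equation (6) can be illustrated with the smallest possible tree, a single split on one feature. This sketch chooses the split that minimizes the squared error, which is a common criterion for regression trees and is assumed here; the data and function names are hypothetical.

```python
def best_split(x, y):
    """Return (threshold, left_mean, right_mean) for the single split that
    minimizes the total squared error of the two leaf means."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical feature values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_leaf(x_new, threshold, left_mean, right_mean):
    """Prediction is the mean target of the leaf the sample falls into."""
    return left_mean if x_new <= threshold else right_mean


# Hypothetical peak-season NDVI values and yields (MT/ha).
peak_ndvi = [0.40, 0.45, 0.50, 0.70, 0.75, 0.80]
yield_mt_ha = [4.0, 4.2, 4.4, 9.0, 9.2, 9.4]
threshold, low_leaf, high_leaf = best_split(peak_ndvi, yield_mt_ha)
```

A full tree applies the same step recursively inside each leaf until a stopping condition such as a maximum depth is reached.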

Support vector regression (SVR)

SVR is another supervised learning method used to solve regression problems (Smola & Schölkopf, 2004). It is the regression counterpart of the well-known SVM model, which is widely used for classification problems. SVR uses a kernel function to map the data into a high-dimensional space (Ramedani et al., 2014) and then finds the optimal hyperplane in that space. The optimal hyperplane is then used to make predictions on new data samples. The most important choice in an SVR model is the kernel function, since it determines how the data are transformed into the high-dimensional space (Üstün et al., 2007). The mathematical formula of the SVR model is shown in Equation (7), where $y$ is the predicted output, $x$ is a set of features, $w$ is the weight vector that determines the optimal hyperplane and $b$ is the bias term. Equation (8) shows the constraint that the SVR model aims to satisfy while finding the optimal hyperplane, where $\varepsilon$ is the allowed margin of error.

$$y = \langle w, x \rangle + b \tag{7}$$

$$| y - \langle w, x \rangle - b | \leq \varepsilon \tag{8}$$
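Equations (7) and (8) together imply that errors inside the margin $\varepsilon$ cost nothing, while larger errors are penalized by the amount they exceed it. The sketch below shows the linear prediction and this epsilon-insensitive loss; it covers only the linear case without a kernel, and the weights and function names are hypothetical.

```python
def svr_predict(weights, bias, features):
    """Linear SVR prediction, Equation (7): <w, x> + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the margin of Equation (8); otherwise the excess error."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical weight vector, bias and feature vector.
prediction = svr_predict([2.0, -1.0], 0.5, [1.0, 3.0])
inside_margin = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
outside_margin = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

The first loss is zero because the error of 0.05 lies within the margin; the second is about 0.4, the part of the error beyond $\varepsilon$.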

Random forest regression (RFR)

RF is an ensemble learning method used for both classification and regression tasks (Breiman, 2001). RFR is its regression variant and is based on the decision tree algorithm (Huang & Liu, 2022). Multiple decision trees are combined to form an RF model. Each decision tree is built using a random subset of variables, and each tree makes its own prediction. In classification problems, the majority vote determines the predicted class, while in regression problems, the final prediction is the average of the predictions from all the trees (Belgiu & Drăguţ, 2016). One of the major benefits of the RF model is its ability to handle nonlinear data very effectively, which makes it a popular choice for complex regression problems. Equation (9) shows the mathematical formula of the RFR model, where $y$ is the regression result, $N_{trees}$ is the number of decision trees in the RF model and $f_{i}(x)$ is the result of the i-th decision tree for input features $x$:

$$y = \frac{1}{N_{trees}} \sum_{i = 1}^{N_{trees}} f_{i}(x) \tag{9}$$
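Equation (9) amounts to a plain average over the trees. In the sketch below each "tree" is a hypothetical one-split rule written by hand; in a real RF the trees would be grown on bootstrap samples with random feature subsets.

```python
def forest_predict(trees, x):
    """Average the predictions of all trees, as in Equation (9)."""
    return sum(tree(x) for tree in trees) / len(trees)


# Three hand-written stand-ins for fitted trees (NDVI -> yield in MT/ha).
trees = [
    lambda ndvi: 7.0 if ndvi > 0.6 else 4.0,
    lambda ndvi: 8.0 if ndvi > 0.5 else 5.0,
    lambda ndvi: 9.0 if ndvi > 0.7 else 3.0,
]
forest_estimate = forest_predict(trees, 0.65)
```

For an NDVI of 0.65 the individual trees return 7, 8 and 3, so the forest estimate is 6.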

Gradient boosting machines (GBM)

GBM is a ML model used for regression and classification problems (Friedman, Citation2001). The GBM model also belongs to the family of ensemble learning models which use multiple models to achieve better prediction performance (Sagi & Rokach, Citation2018). The main idea behind the GBM model is to combine several weak models to create a strong model that can improve prediction performance. The model works in such a way that each tree corrects the errors of the previous tree sequentially, thus minimizing the loss function as the model is trained. Initially, GBM uses a decision tree for training and then uses the errors generated by the first tree to train a subsequent tree, which improves the predictions and reduces the error (Spedicato et al., Citation2018). The mathematical equation of GBM for regression is shown in EquationEquation 10(10) $y = y_{0} + α \cdot f_{1} (X) + α \cdot f_{2} (X) \cdot \dots α \cdot f_{n} (X)$ (10) . All the trees are trained based on the errors generated by preceding trees, and final predictions are made based on the collective output of all the trees. GBM is widely used due to its ability to handle complex datasets and produce highly accurate predictions. The term $f_{1}, f_{2,} . . . f_{n}$ represents the prediction from $n t^{h}$ tree:

$y = y_{0} + \alpha \cdot f_{1}(X) + \alpha \cdot f_{2}(X) + \dots + \alpha \cdot f_{n}(X)$ (10)
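The sequential error correction behind Equation (10) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the GBM configuration used in this study: the loss is squared error, the weak learners are one-split stumps on a single feature, the learning rate of 0.3 and the 20 trees are arbitrary choices, and the data are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Least-squares one-split regression tree fitted to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbm(xs, ys, n_trees, alpha):
    """Start from the mean (y_0), then fit each tree to the current errors."""
    y0 = mean(ys)
    trees, current = [], [y0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, current)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        current = [p + alpha * tree(x) for p, x in zip(current, xs)]
    return y0, trees

def gbm_predict(y0, trees, alpha, x):
    """Equation (10): y = y_0 + alpha * f_1(x) + ... + alpha * f_n(x)."""
    return y0 + alpha * sum(tree(x) for tree in trees)

def training_mse(y0, trees, alpha, xs, ys):
    return mean([(y - gbm_predict(y0, trees, alpha, x)) ** 2
                 for x, y in zip(xs, ys)])

# Hypothetical NDVI-like feature and yields (MT/ha), not data from the study.
xs = [0.35, 0.42, 0.50, 0.58, 0.63, 0.70, 0.76, 0.81]
ys = [3.1, 4.0, 5.2, 6.8, 7.1, 8.4, 9.0, 9.3]
alpha = 0.3
y0, trees = fit_gbm(xs, ys, n_trees=20, alpha=alpha)

for n in (0, 5, 20):  # training error shrinks as trees are added
    print(n, training_mse(y0, trees[:n], alpha, xs, ys))
```

The training error falls with every added tree, which is also why the number of trees and the learning rate are usually tuned by cross-validation rather than chosen by training error alone.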

Experimental setup

In this study, our aim was to predict county-level corn yield and to compare nine ML models for this task. We used corn yield and VI data from 2006 to 2020. Furthermore, we compared models trained on in-season and end-of-season data to assess how accurately yield can be predicted at an early stage. End-of-season yield prediction generally produces good results; however, it is important to predict yield early, before the harvest. To this end, the season was divided into approximately two halves based on the input data. The input data contained 12 timesteps from April to October at a 16-day temporal resolution, and the first six timesteps were used for in-season yield prediction. The data were split into 70% for training and 30% for testing. For a reliable comparison of the ML models, the training and testing sets were saved once and reused, so that every model was evaluated on the same split rather than on a new random split. Hyperparameter tuning was performed using 10-fold cross-validation, which is effective for county-scale crop yield prediction (Shahhosseini et al., 2021). To assess the performance of the ML models, we used the coefficient of determination (R²), root mean squared error (RMSE), and relative root mean squared error (RRMSE, %).
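The three evaluation metrics can be written out as a short Python sketch. The definitions below are the common ones and are an assumption here, since this section does not spell them out: R² is computed as one minus the ratio of residual to total sum of squares, and RRMSE as RMSE divided by the mean observed yield, expressed in percent. The observed and predicted values are hypothetical.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error, in the units of the yield (e.g. MT/ha)."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def rrmse_percent(observed, predicted):
    """RMSE relative to the mean observed value, in percent (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    return 100.0 * rmse(observed, predicted) / mean_obs

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values, not results from the study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(round(r_squared(observed, predicted), 3))      # 0.2
print(round(rmse(observed, predicted), 3))           # 1.0
print(round(rrmse_percent(observed, predicted), 1))  # 40.0
```

RMSE keeps the units of the yield, whereas RRMSE is unitless and therefore easier to compare across crops or regions with different average yields.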

Results

Descriptive statistics

Yield was reported in a total of 751 records during 2006–2020. The minimum and maximum yields in the study area were 1.08 and 13.44 MT/ha, respectively, and the average yield over the study period was 7.61 MT/ha. Figure 3 illustrates the yield distribution in the study area (2006–2020), and Figure 4 shows the temporal distribution of yield. Corn yield in the study area increased consistently, except in 2012, when extreme drought hit the Midwestern US and reduced corn yield.

Figure 3. County-level corn yield distribution in the study area.

Figure 4. Temporal distribution of corn yield in the study area.

Performance of ML models

Nine ML models were tested to predict county-level corn yield in South Dakota, USA. The training and testing R², RMSE, and RRMSE (%) were tabulated for all the ML models used in this study. All models achieved relatively high accuracy. Ranked by testing R², the PLSR model outperformed the rest, with R² = 0.902 and 0.861 for training and testing, respectively. The second-best model was SVR, with R² = 0.950 and 0.856 for training and testing, respectively. The third-best model was RR, with R² = 0.867 and 0.854 for training and testing, respectively. Other evaluation metrics followed the same pattern. The top three models for county-level corn yield prediction are shown in bold and underlined in the results table. The scatterplots of observed and predicted yield for the top three models are presented in Figures 5–7. These three models have relatively high R² and low RRMSE compared with the other models. The scatterplots show no consistent tendency of the three leading models towards overestimation or underestimation. Year-wise scatterplots for the best model (PLSR) are shown in Figure 8.

Figure 5. The scatterplots of observed and predicted yield using PLSR model (a) training, (b) testing.

Figure 6. The scatterplots of observed and predicted yield using SVR model (a) training, (b) testing.

Figure 7. The scatterplots of observed and predicted yield using RR model (a) training, (b) testing.

Figure 8. The scatterplots of in-season observed and predicted corn yield using PLSR model (year wise).

In-season and end-of-season yield prediction

The timing of the prediction is an important aspect of crop yield prediction. For end-of-season yield prediction, we used the full data, i.e. all 12 timesteps of both NDVI and EVI. To compare in-season and end-of-season prediction, we used the three best models from the end-of-season comparison, selected on the basis of performance metrics such as R² and RMSE. Table 4 presents the results of in-season corn yield prediction using these three models. Compared with end-of-season prediction, the performance of all three models dropped because they use fewer timeseries features; however, the drop for the SVR model is comparatively small. For end-of-season yield prediction, RR achieved a testing R² of 0.854, against 0.688 for in-season prediction, a drop of 0.166. Similarly, PLSR achieved testing R² = 0.861 and 0.692 for end-of-season and in-season prediction, respectively, a drop of 0.169. Finally, SVR achieved testing R² = 0.856 and 0.771, respectively, a drop of 0.085. RR and PLSR thus experienced nearly the same loss in performance with the reduced feature set, whereas SVR was less affected by halving the number of timesteps (6 instead of 12). These results are important for yield prediction using long-term data in support of food security and agricultural decision-making.
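The drops quoted above follow directly from the reported testing R² values, as the short Python check below shows. The R² values are those reported in the text; the relative drop (as a percentage of the end-of-season R²) is an additional, illustrative way of expressing the same comparison and is not a metric used in the study.

```python
# Testing R² reported in the text: (end-of-season, in-season).
testing_r2 = {
    "RR": (0.854, 0.688),
    "PLSR": (0.861, 0.692),
    "SVR": (0.856, 0.771),
}

drops = {}
for model, (end_r2, in_season_r2) in testing_r2.items():
    drops[model] = end_r2 - in_season_r2
    relative = 100.0 * drops[model] / end_r2  # illustrative, not from the study
    print(f"{model}: drop = {drops[model]:.3f} "
          f"({relative:.1f}% of end-of-season R²)")

# Model whose testing R² is least affected by using six timesteps.
print(min(drops, key=drops.get))  # SVR
```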

Table 4. Training and testing R² of ML models for in-season corn yield prediction.


Discussion

Traditionally, county-level yield is estimated through surveys, which are time-consuming and costly. Moreover, traditional county-level yield estimates can only be made once the crop is harvested, by which time it is too late to inform policy decisions for that season. Since ML models can predict county-level corn yield with considerable accuracy, they can be used on a large scale to predict yield before the harvest (Sun et al., 2019).

In terms of performance, three models, PLSR, SVR, and RR, outperformed the rest by achieving high training and testing performance. Only two models, MLR and DTR, performed noticeably worse than the others, owing to inherent limitations. MLR assumes a linear relationship between the dependent and independent variables, which does not always hold, and it assumes normally distributed errors. Linear models are also sensitive to outliers. When any of these assumptions is violated, model performance suffers. The low performance of the DTR model can be attributed to its tendency to overfit when few data samples are available. These findings matter in agriculture, where typically only a small number of samples is available. Our results agree with other studies that have used the PLSR model in different domains (Ali et al., 2023; Cheng & Sun, 2017).

The strong performance of PLSR, SVR, and RR stems from properties of these models. Firstly, they handle multicollinearity, i.e. high correlation among the features, effectively. MLR is very sensitive to multicollinearity, which degrades its performance, whereas PLSR overcomes the issue by creating new composite variables that reduce multicollinearity in the features. PLSR also copes well with small sample sizes and high-dimensional data, which is relevant here because this study involves 12 timesteps of two predictors (24 features). The SVR model, in turn, can capture non-linear relationships between the input and output data, giving it an advantage over linear models such as MLR. Furthermore, SVR can yield better results in situations with limited data. This advantage stems from SVR's inherent properties, such as robustness to outliers and model simplicity, which aid generalization, help prevent overfitting, and contribute to superior performance even when data are limited. Although RR is a variant of linear regression, it adds a regularization term to ordinary least squares (OLS), which improves model performance and reduces overfitting.

Our results also indicate that VIs derived from satellite data can predict corn yield with high accuracy. Compared with raw reflectance, VIs highlight the vegetation signal (Khan et al., 2022). This is because healthy vegetation reflects most of the incident light in the near-infrared region of the electromagnetic spectrum and absorbs most of the light in the red region. NDVI has been used extensively for crop monitoring and for estimating crop parameters such as LAI, chlorophyll content, nitrogen content, and crop yield. Similarly, EVI is an enhanced VI that can model high-biomass regions effectively. Because VIs focus on the vegetation signal, they can explain the variability in yield. Another advantage of MODIS-derived VIs is their global availability: most products derived from MODIS data are accessible on a global scale and can offer valuable insights in regions where collecting data on other variables is challenging.

In-season yield prediction is very important for early planning and decision-making. Figure 9 shows the scatterplots and model performance in terms of R², RMSE, and RRMSE (%). The in-season yield prediction models also produced good results. The training and testing R² of the SVR model for in-season corn yield prediction were 0.87 and 0.77, respectively (Table 4). This means that, by mid-season, VIs derived from satellite data explain about 77% of the variance in county-level corn yield in the test set. Results produced by the models using in-season data show that the first half of the growing season is very important for county-level corn yield prediction. Although the second half of the growing season provides valuable additional information, it is less important for yield prediction than the first half. This is in line with previous research in which county-level soybean yield was predicted using data from the start to the end of the growing season at multiple timesteps. Each timestep in the growing season adds unique information about crop conditions that helps predict yield; however, the information obtained during the initial stages proved more useful. For in-season yield prediction, the performance of all models decreased because of the smaller number of features; however, the SVR model was less affected. Its superior performance with few features indicates that it is more robust and can be used for yield prediction in future studies when only a small number of features is available.

Figure 9. The scatterplots of observed and predicted corn yield of SVR model using in-season data.

The availability of data at large scale and high temporal resolution provides several opportunities to explore crop yields and other related parameters (Jin et al., 2020). Such studies can be expanded by incorporating additional parameters and applying a wide range of ML models at different scales to see how these choices affect crop yield prediction across regions, crop species, and time. Year-wise scatterplots of in-season corn yield prediction using the SVR model are presented in Figure 10.

Figure 10. The scatterplots of in-season observed and predicted corn yield using SVR model (year wise).

Finally, our study has some limitations. For yield prediction at the county level, yield and other variables are aggregated over a large area, which can result in loss of information because of the variation in land cover within each county. To address this, yield can be modelled at different scales to check whether the county-level results hold at other scales. In addition, several factors affect crop yield in ways that may not be captured by RS-derived VIs. These issues can be addressed in future through detailed studies across different scales, crops, and regions.

Conclusions

In this study, we compared the performance of several ML models in estimating corn yield at the county level in the study area. Our results indicate that the ML models used in this study were effective in predicting corn yield at the county level. The data and models used here can also be applied to other crops, and their utility is not limited to yield: they can be useful for predicting other parameters such as disease severity, LAI, and nitrogen content. Supervised ML models trained with MODIS-derived VIs demonstrated robust performance, even with relatively small sample sizes. The top three models were PLSR, SVR, and RR, which achieved testing R² values of 0.861, 0.856, and 0.854, respectively. The performance of the other ML models was also noteworthy. We also tested yield prediction using in-season data. The top three models from the end-of-season comparison were used to predict in-season yield, and their testing R² dropped by between 0.085 (SVR) and 0.169 (PLSR). The SVR model outperformed the other tested models for in-season yield prediction. This suggests that ML models and MODIS-derived VIs can predict corn yield with considerable accuracy before harvest. Further studies are needed to check the robustness of the ML models under diverse climatic conditions and at different scales.

Author contributions

Shahid Nawaz Khan: methodology, software, formal analysis, visualization, data curation, writing – original draft, investigation, validation, writing – review and editing. Abid Nawaz Khan: writing – review and editing. Aqil Tariq: supervision, investigation, validation, funding, writing – review and editing. Linlin Lu: investigation, validation, funding, writing – review and editing. Naeem Abbas Malik: writing – original draft, investigation, validation, writing – review and editing. Muhammad Umair: validation, writing – review and editing. Wesam Atef Hatamleh: writing – review and editing. Farah Hanna Zawaideh: writing – review and editing. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

The authors extend their appreciation to the researchers supporting project number (RSP2023R384) King Saud University, Riyadh, Saudi Arabia.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The datasets used in this study are available for free. Information about the sources of data is mentioned in Sections 2.1.2–2.1.4.

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

National Key R&D Program of China (Project No. 2022YFC3800700). The authors extend their appreciation to the researchers supporting project number (RSP2023R384) King Saud University, Riyadh, Saudi Arabia.

References

Ahmad, W., Iqbal, J., Nasir, M. J., Ahmad, B., Khan, M. T., Khan, S. N., & Adnan, S. (2021). Impact of land use/land cover changes on water quality and human health in district Peshawar Pakistan. Scientific Reports, 11(1), 16526. https://doi.org/10.1038/s41598-021-96075-3
Ali, S., Khorrami, B., Jehanzaib, M., Tariq, A., Ajmal, M., Arshad, A., Shafeeque, M., Dilawar, A., Basit, I., Zhang, L., Sadri, S., Niaz, M. A., Jamil, A., & Khan, S. N. (2023). Spatial downscaling of GRACE data based on XGBoost model for improved understanding of hydrological droughts in the Indus Basin Irrigation System (IBIS). Remote Sensing, 15(4), 873. https://doi.org/10.3390/rs15040873
Asadollah, S. B. H. S., Sharafati, A., Motta, D., & Yaseen, Z. M. (2021). River water quality index prediction and uncertainty analysis: A comparative study of machine learning models. Journal of Environmental Chemical Engineering, 9(1), 104599. https://doi.org/10.1016/j.jece.2020.104599
Asif, M., Kazmi, J. H., & Tariq, A. (2023). Traditional ecological knowledge based indicators for monitoring rangeland conditions in Thal and Cholistan Desert, Pakistan. Environmental Challenges, 13, 100754. https://doi.org/10.1016/j.envc.2023.100754
Baral, S., Tripathy, A. K., & Bijayasingh, P. (2011). Yield prediction using artificial neural networks. In V. V. Das (Ed.), Computer networks and information technologies. CNC 2011. Communications in computer and information science (Vol. 142). https://doi.org/10.1007/978-3-642-19542-6_57
Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing, 114, 24–31. https://doi.org/10.1016/j.isprsjprs.2016.01.011
Bocca, F. F., & Rodrigues, L. H. A. (2016). The effect of tuning, feature engineering, and feature selection in data mining applied to rainfed sugarcane yield modelling. Computers and Electronics in Agriculture, 128, 67–76. https://doi.org/10.1016/j.compag.2016.08.015
Bose, P., Kasabov, N. K., Bruzzone, L., & Hartono, R. N. (2016). Spiking neural networks for crop yield estimation based on spatiotemporal analysis of image time series. IEEE Transactions on Geoscience and Remote Sensing, 54(11), 6563–6573. https://doi.org/10.1109/TGRS.2016.2586602
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Cai, Y. P., Guan, K. Y., Peng, J., Wang, S. W., Seifert, C., Wardlow, B., & Li, Z. (2018). A high-performance and in-season classification system of field-level crop types using time-series Landsat data and a machine learning approach. Remote Sensing of Environment, 210, 35–47. https://doi.org/10.1016/j.rse.2018.02.045
Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd international conference on Machine learning, USA (pp. 161–168)
Cheng, J.-H., & Sun, D.-W. (2017). Partial least squares regression (PLSR) applied to NIR and HSI spectral data modeling to predict chemical properties of fish muscle. Food Engineering Reviews, 9(1), 36–49. https://doi.org/10.1007/s12393-016-9147-1
Craig, M. (2010). A history of the cropland data layer at NASS. USDA NASS CropScape. http://www.nass.usda.gov/Research_and_Science/Cropland/CDL_History_MEC.pdf
Dash, Y., Mishra, S. K., & Panigrahi, B. K. (2018). Rainfall prediction for the Kerala state of India using artificial intelligence approaches. Computers & Electrical Engineering, 70, 66–73. https://doi.org/10.1016/j.compeleceng.2018.06.004
Didan, K. (2015). MOD13Q1 MODIS/Terra vegetation indices 16-day L3 global 250m SIN grid V006. NASA EOSDIS Land Processes DAAC, 10.
Eberly, L. E. (2007). Multiple linear regression. Methods in molecular biology (Clifton, N.J.), 404, 165–187 https://doi.org/10.1007/978-1-59745-530-5_9.
Elavarasan, D., Vincent, D. R., Sharma, V., Zomaya, A. Y., & Srinivasan, K. (2018). Forecasting yield by integrating agrarian factors and machine learning models: A survey. Computers and Electronics in Agriculture, 155, 257–282. https://doi.org/10.1016/j.compag.2018.10.024
Everingham, Y., Sexton, J., Skocaj, D., & Inman-Bamber, G. (2016). Accurate prediction of sugarcane yield using a random forest algorithm. Agronomy for Sustainable Development, 36(2). https://doi.org/10.1007/s13593-016-0364-z
Feng, L., Wang, Y., Zhang, Z., & Du, Q. (2021). Geographically and temporally weighted neural network for winter wheat yield prediction. Remote Sensing of Environment, 262, 112514. https://doi.org/10.1016/j.rse.2021.112514
Fernandes, J. L., Ebecken, N. F. F., & Esquerdo, J. C. D. (2017). Sugarcane yield prediction in Brazil using NDVI time series and neural networks ensemble. International Journal of Remote Sensing, 38(16), 4631–4644. https://doi.org/10.1080/01431161.2017.1325531
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
Gentleman, R., & Carey, V. (2008). Unsupervised machine learning. In Bioconductor Case Studies. Use R! (pp. 137–157). Springer. https://doi.org/10.1007/978-0-387-77240-0_10.
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google earth engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031
Gornott, C., & Wechsung, F. (2016). Statistical regression models for assessing climate impacts on crop yields: A validation study for winter wheat and silage maize in Germany. Agricultural and Forest Meteorology, 217, 89–100. https://doi.org/10.1016/j.agrformet.2015.10.005
Green, T. R., Kipka, H., David, O., & McMaster, G. S. (2018). Where is the USA corn Belt, and how is it changing? Science of the Total Environment, 618, 1613–1618. https://doi.org/10.1016/j.scitotenv.2017.09.325
Guan, K., Sultan, B., Biasutti, M., Baron, C., & Lobell, D. B. (2017). Assessing climate adaptation options and uncertainties for cereal systems in West Africa. Agricultural and Forest Meteorology, 232, 291–305. https://doi.org/10.1016/j.agrformet.2016.07.021
Huang, Z., & Liu, Z. (2022). A complex terrain simulation approach using ensemble learning of random forest regression. Journal of the Indian Society of Remote Sensing, 50(10), 2011–2023. https://doi.org/10.1007/s12524-022-01585-w
Jeong, J. H., Resop, J. P., Mueller, N. D., Fleisher, D. H., Yun, K., Butler, E. E., Timlin, D. J., Shim, K. M., Gerber, J. S., Reddy, V. R., Kim, S. H., & Gonzalez-Andujar, J. L. (2016). Random forests for global and regional crop yield predictions. PLoS One, 11(6), e0156571. https://doi.org/10.1371/journal.pone.0156571
Jin, X., Zarco-Tejada, P. J., Schmidhalter, U., Reynolds, M. P., Hawkesford, M. J., Varshney, R. K., Yang, T., Nie, C., Li, Z., & Ming, B. (2020). High-throughput estimation of crop traits: A review of ground and aerial phenotyping platforms. IEEE Geoscience and Remote Sensing Magazine, 9(1), 200–231. https://doi.org/10.1109/MGRS.2020.2998816
Johnson, M. D., Hsieh, W. W., Cannon, A. J., Davidson, A., & Bedard, F. (2016). Crop yield forecasting on the Canadian Prairies by remotely sensed vegetation indices and machine learning methods. Agricultural and Forest Meteorology, 218, 74–84. https://doi.org/10.1016/j.agrformet.2015.11.003
Jones, J. W., Antle, J. M., Basso, B., Boote, K. J., Conant, R. T., Foster, I., Godfray, H. C. J., Herrero, M., Howitt, R. E., Janssen, S., Keating, B. A., Munoz-Carpena, R., Porter, C. H., Rosenzweig, C., & Wheeler, T. R. (2017). Brief history of agricultural systems modeling. Agricultural Systems, 155, 240–254. https://doi.org/10.1016/j.agsy.2016.05.014
Kern, A., Barcza, Z., Marjanovic, H., Arendas, T., Fodor, N., Bonis, P., Bognar, P., & Lichtenberger, J. (2018). Statistical modelling of crop yield in central Europe using climate data and remote sensing vegetation indices. Agricultural and Forest Meteorology, 260, 300–320. https://doi.org/10.1016/j.agrformet.2018.06.009
Khaki, S., & Wang, L. (2019). Crop yield prediction using deep neural networks. Frontiers in Plant Science, 10, 621. https://doi.org/10.3389/fpls.2019.00621
Khanal, S., Fulton, J., Klopfenstein, A., Douridas, N., & Shearer, S. (2018). Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Computers and Electronics in Agriculture, 153, 213–225. https://doi.org/10.1016/j.compag.2018.07.016
Khan, K., Iqbal, J., Ali, A., & Khan, S. (2020). Assessment of sentinel-2-derived vegetation indices for the estimation of above-ground biomass/carbon stock, temporal deforestation and carbon emissions estimation in the moist temperate forests of Pakistan. Applied Ecology and Environmental Research, 18(1), 783–815. https://doi.org/10.15666/aeer/1801_783815
Khan, S. N., Li, D., & Maimaitijiang, M. (2022). A geographically weighted random forest approach to predict corn yield in the US corn Belt. Remote Sensing, 14(12), 2843. https://doi.org/10.3390/rs14122843
Krasnova, T., Khan, S. N., Pozdnyakov, A., & Vilgelm, A. (2021). Determinants of regional agroindustry and spillovers between Siberian local markets. In E3S Web of Conferences, Russia (pp. 01015). EDP Sciences
Kunapuli, S. S., Rueda-Ayala, V., Benavídez-Gutiérrez, G., Córdova-Cruzatty, A., Cabrera, A., Fernández, C., & Maiguashca, J. (2015). Yield prediction for precision territorial management in maize using spectral data. In Precision agriculture’15 (pp. 344–358). Wageningen Academic Publishers. https://doi.org/10.3920/978-90-8686-814-8_24
Li, P., Tariq, A., Li, Q., Ghaffar, B., Farhan, M., Jamil, A., Soufan, W., El Sabagh, A., & Freeshah, M. (2023). Soil erosion assessment by RUSLE model using remote sensing and GIS in an arid zone. International Journal of Digital Earth, 16(1), 3105–3124. https://doi.org/10.1080/17538947.2023.2243916
Liu, J., Yang, K., Tariq, A., Lu, L., Soufan, W., & El Sabagh, A. (2023). Interaction of climate, topography and soil properties with cropland and cropping pattern using remote sensing data and machine learning methods. Egypt Journal of Remote Sensors Space Science, 26, 415–426. https://doi.org/10.1016/j.ejrs.2023.05.005
Marx, R. W. (1986). The TIGER system: Automating the geographic structure of the United States census. Government Publications Review, 13(2), 181–201. https://doi.org/10.1016/0277-9390(86)90003-8
Ma, Y., Zhang, Z., Kang, Y., & Özdoğan, M. (2021). Corn yield prediction and uncertainty analysis based on remotely sensed variables using a Bayesian neural network approach. Remote Sensing of Environment, 259, 112408. https://doi.org/10.1016/j.rse.2021.112408
McDonald, G. C. (2009). Ridge regression. Wiley Interdisciplinary Reviews: Computational Statistics, 1(1), 93–100. https://doi.org/10.1002/wics.14
Muller, C., Elliott, J., Chryssanthacopoulos, J., Arneth, A., Balkovic, J., Ciais, P., Deryng, D., Folberth, C., Glotter, M., Hoek, S., Iizumi, T., Izaurralde, R. C., Jones, C., Khabarov, N., Lawrence, P., Liu, W. F., Olin, S., Pugh, T. A. M. … Yang, H. (2017). Global gridded crop model evaluation: Benchmarking, skills, deficiencies and implications. Geoscientific Model Development, 10(4), 1403–1422. https://doi.org/10.5194/gmd-10-1403-2017
Olson, A. L. (2007). The Impact of Increased Ethanol Production on Corn Basis in South Dakota [Electronic Theses and Dissertations]. https://openprairie.sdstate.edu/etd/6027
Ostertagová, E. (2012). Modelling using polynomial regression. Procedia Engineering, 48, 500–506. https://doi.org/10.1016/j.proeng.2012.09.545
Pantazi, X. E., Moshou, D., Alexandridis, T., Whetton, R. L., & Mouazen, A. M. (2016). Wheat yield prediction using machine learning and advanced sensing techniques. Computers and Electronics in Agriculture, 121, 57–65. https://doi.org/10.1016/j.compag.2015.11.018
Pantazi, X. E., Moshou, D., & Bravo, C. (2016). Active learning system for weed species recognition based on hyperspectral sensing. Biosystems Engineering, 146, 193–202. https://doi.org/10.1016/j.biosystemseng.2016.01.014
Peng, B., Guan, K. Y., Chen, M., Lawrence, D. M., Pokhrel, Y., Suyker, A., Arkebauer, T., & Lu, Y. Q. (2018). Improving maize growth processes in the community land model: Implementation and evaluation. Agricultural and Forest Meteorology, 250, 64–89. https://doi.org/10.1016/j.agrformet.2017.11.012
Ramedani, Z., Omid, M., Keyhani, A., Shamshirband, S., & Khoshnevisan, B. (2014). Potential of radial basis function based support vector regression for global solar radiation prediction. Renewable and Sustainable Energy Reviews, 39, 1005–1011. https://doi.org/10.1016/j.rser.2014.07.108
Ranstam, J., & Cook, J. (2018). LASSO regression. The British Journal of Surgery, 105(10), 1348–1348. https://doi.org/10.1002/bjs.10895
Ranum, P., Peña‐Rosas, J. P., & Garcia‐Casal, M. N. (2014). Global maize production, utilization, and consumption. Annals of the New York Academy of Sciences, 1312(1), 105–112. https://doi.org/10.1111/nyas.12396
Russ, G., Kruse, R., Schneider, M., & Wagner, P. (2008). Data mining with neural networks for wheat yield prediction. In P. Perner (Ed.), Advances in data mining. Medical Applications, E-commerce, marketing, and theoretical aspects. ICDM 2008. Lecture notes in Computer Science (Vol. 5077, pp. 47). Springer. https://doi.org/10.1007/978-3-540-70720-2_4
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1249. https://doi.org/10.1002/widm.1249
Shahhosseini, M., Hu, G., Huber, I., & Archontoulis, S. V. (2021). Coupling machine learning and crop modeling improves crop yield prediction in the US corn Belt. Scientific Reports, 11(1), 1–15. https://doi.org/10.1038/s41598-020-80820-1
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222. https://doi.org/10.1023/B:STCO.0000035301.49549.88
Son, N., Chen, C., Chen, C., Chang, L., Duc, H., & Nguyen, L. (2013). Prediction of rice crop yield using MODIS EVI–LAI data in the Mekong Delta, Vietnam. International Journal of Remote Sensing, 34(20), 7275–7292. https://doi.org/10.1080/01431161.2013.818258
Spedicato, G. A., Dutang, C., & Petrini, L. (2018). Machine learning methods to perform pricing optimization. A comparison with standard GLMs. Variance, 12, 69–89. http://www.ressources-actuarielles.net/EXT/ISFA/1226.nsf/0/2405c6a5afef8798c12582fe0046fc30/$FILE/Machine-Spedicato.pdf
Sun, J., Di, L. P., Sun, Z. H., Shen, Y. L., & Lai, Z. L. (2019). County-level soybean yield prediction using deep CNN-LSTM model. Sensors, 19(20), 4363. https://doi.org/10.3390/s19204363
Tariq, A., Hashemi Beni, L., Ali, S., Adnan, S., & Hatamleh, W. A. (2023). An effective geospatial-based flash flood susceptibility assessment with hydrogeomorphic responses on groundwater recharge. Groundwater for Sustainable Development, 23, 100998. https://doi.org/10.1016/j.gsd.2023.100998
Tariq, A., Jiango, Y., Li, Q., Gao, J., Lu, L., Soufan, W., Almutairi, K. F., & Habib-Ur-Rahman, M. (2023). Modelling, mapping and monitoring of forest cover changes, using support vector machine, kernel logistic regression and naive Bayes tree models with optical remote sensing data. Heliyon, 9(2), e13212. https://doi.org/10.1016/j.heliyon.2023.e13212
Tariq, A., & Mumtaz, F. (2022). Modeling spatio-temporal assessment of land use land cover of Lahore and its impact on land surface temperature using multi-spectral remote sensing data. Environmental Science and Pollution Research, 30(9), 23908–23924. https://doi.org/10.1007/s11356-022-23928-3
Tariq, A., Mumtaz, F., Majeed, M., & Zeng, X. (2023). Spatio-temporal assessment of land use land cover based on trajectories and cellular automata Markov modelling and its impact on land surface temperature of Lahore district Pakistan. Environmental Monitoring and Assessment, 195(1), 114. https://doi.org/10.1007/s10661-022-10738-w
Tariq, A., & Qin, S. (2023). Spatio-temporal variation in surface water in Punjab, Pakistan from 1985 to 2020 using machine-learning methods with time-series remote sensing data and driving factors. Agricultural Water Management, 280, 108228. https://doi.org/10.1016/j.agwat.2023.108228
Tariq, A., Yan, J., Gagnon, A. S., Riaz Khan, M., & Mumtaz, F. (2022). Mapping of cropland, cropping patterns and crop types by combining optical remote sensing images with decision tree classifier and random forest. Geo-Spatial Information Science, 1–19. https://doi.org/10.1080/10095020.2022.2100287
Todey, D., Trobec, J., & Mogil, H. M. (2009). South Dakota's climate and weather. Weatherwise, 62, 16–23.
Üstün, B., Melssen, W., & Buydens, L. (2007). Visualisation and interpretation of support vector regression models. Analytica Chimica Acta, 595(1–2), 299–309. https://doi.org/10.1016/j.aca.2007.03.023
van Klompenburg, T., Kassahun, A., & Catal, C. (2020). Crop yield prediction using machine learning: A systematic literature review. Computers and Electronics in Agriculture, 177, 105709. https://doi.org/10.1016/j.compag.2020.105709
Wahla, S. S., Kazmi, J. H., Sharifi, A., Shirazi, S. A., Tariq, A., & Joyell Smith, H. (2022). Assessing spatio-temporal mapping and monitoring of climatic variability using SPEI and RF machine learning models. Geocarto International, 37(27), 14963–14982. https://doi.org/10.1080/10106049.2022.2093411
Wairegi, L. W., Van Asten, P. J., Tenywa, M. M., & Bekunda, M. A. (2010). Abiotic constraints override biotic constraints in East African highland banana systems. Field Crops Research, 117(1), 146–153. https://doi.org/10.1016/j.fcr.2010.02.010
Wall, L., Larocque, D., & Léger, P. M. (2008). The early explanatory power of NDVI in crop yield modelling. International Journal of Remote Sensing, 29(8), 2211–2225. https://doi.org/10.1080/01431160701395252
Whetton, R., Zhao, Y. F., Shaddad, S., & Mouazen, A. M. (2017). Nonlinear parametric modelling to study how soil properties affect crop yields and NDVI. Computers and Electronics in Agriculture, 138, 127–136. https://doi.org/10.1016/j.compag.2017.04.016
Wieder, W., Shoop, S., Barna, L., Franz, T., & Finkenbiner, C. (2018). Comparison of soil strength measurements of agricultural soils in Nebraska. Journal of Terramechanics, 77, 31–48. https://doi.org/10.1016/j.jterra.2018.02.003
Xu, M., Watanachaturaporn, P., Varshney, P. K., & Arora, M. K. (2005). Decision tree regression for soft classification of remote sensing data. Remote Sensing of Environment, 97(3), 322–336. https://doi.org/10.1016/j.rse.2005.05.008

County-level corn yield prediction using supervised machine learning

ABSTRACT

Introduction