Research Article

The use of maximum entropy and ecological niche factor analysis to decrease uncertainties in samples for urban gain models

Article: 2222980 | Received 27 Jan 2023, Accepted 05 Jun 2023, Published online: 13 Jun 2023

ABSTRACT

Uncertainty is a common problem in spatial modeling and geographical information systems (GIS). Furthermore, urban gain modeling (UGM) contains various dimensions and components of uncertainty. Data sampling is important in UGM: it can introduce considerable uncertainty into the model results and affect their precision and accuracy. A poorly sampled or biased dataset can lead to inaccurate predictions and decreased model performance. This paper aims to present and develop novel strategies for sampling and building training datasets that can enhance the performance of data-driven models. Specifically, the present study used maximum entropy (ME) and ecological niche factor analysis (ENFA) models to select pure non-change samples with minimal uncertainty for training datasets in UGM of Isfahan and Tabriz cities in Iran. The urban gain of two time intervals, 1992–2002 and 2002–2012, was used for Tabriz City, and that of 1994–2004 and 2004–2014 for Isfahan City. Nine and 14 urban gain drivers were used in the UGM of Isfahan and Tabriz cities, respectively. After the ME and ENFA models produced a training dataset with change and non-change samples with the lowest uncertainty, three well-known models, namely random forest (RF), artificial neural network (ANN), and support vector machine (SVM), were used for the modeling. Moreover, the ME and ENFA models that were used to investigate the uncertainty of the sampling procedure were also used as one-class prediction models. Compared to extant studies, the proposed ME-based sampling strategy increased the area under the receiver operating characteristic curve (AUROC), figure of merit, producer’s accuracy, and overall accuracy by 5.5%, 5%, 5%, and 3%, respectively, in the validation phase of Isfahan City and by 5%, 6%, 14%, and 17%, respectively, for Tabriz City.
For Isfahan, the accuracies of the ME (AUROC = 0.649) and ENFA (AUROC = 0.661) one-class models were close to those of the ANN-ME (AUROC = 0.646), ANN-ENFA (AUROC = 0.619), and RF-ENFA (AUROC = 0.631) models but differed significantly from that of the RF-ME (AUROC = 0.737) model. For Tabriz, the accuracies of the ME (AUROC = 0.657) and ENFA (AUROC = 0.688) one-class models were lower than those of the two-class RF-ME (AUROC = 0.852) and ANN-ME (AUROC = 0.778) models. The results showed that the ME model was able to identify relatively pure non-change samples and properly remove impure non-change samples from the training dataset. This study found that binary models are preferable to one-class models, and showed that an optimal sampling strategy is an essential step in UGM as it can decrease uncertainty. As such, modelers must adopt efficient sampling methods.

1. Introduction

In various scientific fields, uncertainty analysis and management are important steps in efficient modeling (Aven 2010). Urban gain models (UGMs) help policy- and decision-makers adopt vital policies that avert environmental issues (Matthews et al. 2007). These include policies that can reduce the potential threats of urban gain such as environmental degradation (El Araby 2002), loss of biodiversity (Hansen, DeFries, and Turner 2012; McDonald et al. 2020), changes in land surface temperature (Nurwanda and Honjo 2020; Ullah, Jing, and Wadood 2020), destruction of farmlands (Liu et al. 2014; Surjan, Ara Parvin, and Shaw 2016), and changes in water quality (Dong, Liu, and Chen 2014; Zhao et al. 2015). UGMs can therefore help urban planners and decision-makers predict future urban gain areas, plan land usage, create basic urban infrastructure and amenities, and adopt the necessary policies to protect the environment (Bakker et al. 2008; Don, Schumacher, and Freibauer 2011; Martin et al. 2013; Van Minnen et al. 2009). Although a plethora of statistical, machine learning, and data mining models have been used to create UGMs, the resulting UGMs all contain uncertainties in various dimensions and components. Uncertainty may arise from both the data and the models used (Tayyebi, Tayyebi, and Khanna 2014). If left unaddressed, these uncertainties distort the estimated relationship between the independent and dependent variables, which in turn causes the UGMs to produce erroneous results. Therefore, the various dimensions of uncertainty in urban gain modeling (UGM) must be examined.

UGM using data-driven models such as artificial neural networks (ANNs) is conducted by considering two time intervals, t1–t2 and t2–t3 (Ahmadlou, Karimi, and Pontius 2021; Shafizadeh-Moghadam et al. 2017, 2017). For example, the UGM uses all or part of the data from the first time interval, including the urban gain drivers and the urban gain variable, as the training dataset. The models are then validated using the data from the second time interval, again including the urban gain drivers and the urban gain variable. Significant uncertainties have been noted in the input data, such as in the predictor variables and even the dependent variable, as well as in the models themselves (Tayyebi, Tayyebi, and Khanna 2014). Tayyebi et al. (2014) concluded that uncertainties in the input data were more damaging than uncertainties in the parameters of the model. Therefore, the various dimensions and components of uncertainty in the input data of a model must be examined. As such, this present study analyzes the uncertainty of a sampling strategy proposed for creating a training dataset for UGM, under the simplifying and admittedly unrealistic assumption that the input data are error-free and that the parameters of the models are uncertainty-free.

In UGM, the first time interval contains both change and non-change samples, with typically more non-change samples than change samples. This is known as the imbalance problem (Ahmadlou, Karimi, and Pontius 2021; Gu et al. 2008), one of the biggest challenges plaguing UGM training datasets, even in highly researched fields (Ahmadlou, Karimi, and Pontius 2021; Pirizadeh et al. 2021). More specifically, as change samples are often surrounded by a large number of non-change samples, an imbalance problem occurs when selecting samples for the modeling (Ahmadlou, Karimi, and Pontius 2021).

Machine learning models are built using training datasets (Jaydhar et al. 2022; Ruidas et al. 2021, 2022). Although multiple methods have been examined for creating training datasets for UGMs using samples from the first time interval, they all contain different degrees of uncertainty. The first and most common approach is to randomly select 70% of all the samples, including change and non-change samples, from the first time interval (Shafizadeh-Moghadam et al. 2017; Parvinnezhad et al. 2021; Tayyebi and Pijanowski 2014; Tayyebi et al. 2014). Although this approach is the most frequently used, it produces the highest level of uncertainty as the first time interval contains significantly fewer change samples than non-change samples (imbalance problem). As such, the training dataset contains more non-change samples (Ahmadlou, Karimi, and Pontius 2021). As machine learning and data mining models are more likely to learn the non-change samples that are present in higher quantities, they fail to model the change samples, which is the primary goal. The second approach is to select equal quantities of change and non-change samples (Karimi et al. 2019; Pal et al. 2022). Ahmadlou, Karimi, and Pontius (2021) and Karimi et al. (2019) randomly selected equal quantities of the change and non-change samples from the first time interval and discovered that the non-change samples contained a significant amount of uncertainty. Recent studies have assumed that these non-change samples do not have the potential to change in the future and have entered them into the modeling as the opposite of the change samples. In other words, randomly selecting non-change samples from samples that have the potential to change in the future causes significant uncertainty to arise in the modeling.
This is because the model may encounter many non-change samples that have the same spatial drivers and features as change samples but have not changed. Thus, another severe modeling challenge is the selection of pure non-change samples with no to low potential to change. The third approach is to use all the data from the first interval for modeling (Ahmadlou, Karimi, and Pontius 2021). The challenges of this approach include differing degrees of class inequality (imbalance problem) and the presence of non-change samples that have the potential to change, both of which affect the accuracy of the models. As these uncertainties in the training dataset make it challenging to build a UGM, efficient approaches are needed to manage and overcome these problems.
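The effect of the first approach can be illustrated with a minimal sketch. The cell counts below (5% change cells) are hypothetical and are not taken from either study area; the point is only that simple random sampling carries the class ratio of the full interval into the training dataset.

```python
import random

def random_training_sample(labels, fraction=0.7, seed=0):
    """Draw a simple random sample of labels (1 = change, 0 = non-change)."""
    rng = random.Random(seed)
    k = int(len(labels) * fraction)
    return rng.sample(labels, k)

# Hypothetical first-interval cells: 5% change, 95% non-change.
labels = [1] * 500 + [0] * 9500
sample = random_training_sample(labels)

# The share of change samples stays close to the original 5%,
# so the training dataset inherits the imbalance.
change_share = sum(sample) / len(sample)
print(f"{len(sample)} samples, change share = {change_share:.3f}")
```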

A balanced training dataset that contains change and pure non-change samples should be used to develop UGMs. It is easy to select change samples for the training dataset of the UGMs as uncertainties only ensue from errors in extracting and providing these samples. However, it is very difficult to select pure non-change samples with no or low potential to change, as the number of urban gain samples is much smaller than the number of non-change samples, and significantly more non-change samples than change samples are selected for the training dataset. Therefore, random sampling causes the model to be biased in favor of the non-change samples. Such models are accurate for non-change samples (the frequent class) and inaccurate for change samples (the infrequent class), even though the goal is to model urban gain samples. Multiple studies have faced these modeling issues (Parvinnezhad et al. 2021; Pontius et al. 2018; Shafizadeh-Moghadam et al. 2017; Ahmadlou, Karimi, and Pontius Jr 2021). Pontius et al. (2018) used various models to simulate land use changes (LUCs) in 13 study areas with varying change rates. They found that low LUC rates lead to lower predictive accuracy. Therefore, the novelty of this paper lies in proposing and examining the efficacy of two approaches, maximum entropy (ME) and ecological niche factor analysis (ENFA), for creating balanced training datasets containing equal quantities of change samples and pure non-change samples for UGM of Tabriz and Isfahan Cities in Iran, which have different urban gain rates.

2. Study areas and data sets

The urban gain of two major megacities in Iran, Tabriz and Isfahan, was used to model and evaluate the proposed sampling strategies due to their non-linear and complex urban gain patterns as well as their differing urban gain rates (Figure 1).

Figure 1. Location of both study areas.


2.1. Study areas

The urban gain of two periods was examined for each city: 1992 to 2002 and 2002 to 2012 for Tabriz City, and 1994 to 2004 and 2004 to 2014 for Isfahan City. Isfahan and Tabriz are two old, large, and industrial cities in Iran with numerous tourist attractions. Their growing immigration rates have increased the demand for residential sites and industrial developments, leading to the urban expansion and growth of both cities.

Isfahan is located in central Iran at 32.38° N and 51.38° E with an average elevation of 1587 m above sea level. Its northern and southern halves are divided by the Zayandeh Rud River, which played an important role in the urban gain of Isfahan in the past. In 1994 and 2004, agricultural and open lands were the dominant types of land use. However, by 2014, most of the city had been urbanized, which indicates Isfahan’s high urban gain rate.

Tabriz is a major city in northwestern Iran located at approximately 38.08° N and 46.29° E, with an average elevation of 1500 m above sea level. As the industrial center of the northwest, its population is expected to increase to 1,940,000 by 2030. The population of the city increased 6-fold and its urban area increased 18-fold between 1956 and 2011. In 1992, its lands were predominantly barren or urbanized.

2.2. Dataset

Landsat images of Isfahan in 1994, 2004, and 2014 were obtained from the United States Geological Survey (USGS) and used to prepare land use maps and identify urban gain areas. The area had five land-use classes, namely croplands, open lands, built-up areas, water bodies, and salt marshes. The images were classified using maximum likelihood classification (MLC) with an overall accuracy of 84%, 86%, and 87% for 1994, 2004, and 2014, respectively. The urban gain maps of 1994–2004 and 2004–2014 were obtained by comparing the land-use map of 1994 with that of 2004 and the map of 2004 with that of 2014. Nine significant urban gain factors of Isfahan were used for the modeling procedure (Table 1). The distance maps were obtained using Euclidean distance analysis in the geographical information system (GIS) environment, the elevation map was obtained from the 30 m digital elevation model (DEM) of the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), and the slope map was derived from the elevation map in the GIS environment. Landsat images of Tabriz in 1992, 2002, and 2012 were obtained from the USGS and used to prepare land use maps. The MLC was used to classify the satellite images into urban, vegetation, and open lands. Fourteen significant urban gain factors of Tabriz were used for the modeling procedure (Table 1). The altitude and slope maps of Tabriz were obtained from the ASTER DEM. The urban gain drivers differed between the two cities because they were selected using the available expert opinions and extant studies on each city. Moreover, to compare the modeling results of the present study with other studies that have been carried out in these two cities, the urban gain of two time intervals, 1992–2002 and 2002–2012, was used for Tabriz City, and that of 1994–2004 and 2004–2014 for Isfahan City.
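As a minimal sketch of how an urban gain map can be derived by comparing two classified land-use maps, the snippet below marks a cell as gain when it was not built-up at the first date but is built-up at the second. The numeric class code `BUILT_UP = 3` and the tiny grids are hypothetical; the source does not state how the classes were encoded.

```python
# Hypothetical class code for built-up areas; not given in the text.
BUILT_UP = 3

def urban_gain_map(land_use_t1, land_use_t2, built_up=BUILT_UP):
    """Return a 0/1 grid: 1 where a cell is built-up at t2 but was not at t1."""
    return [
        [1 if (a != built_up and b == built_up) else 0 for a, b in zip(row1, row2)]
        for row1, row2 in zip(land_use_t1, land_use_t2)
    ]

# Toy 2 x 3 land-use grids for two dates.
lu_t1 = [[1, 1, 3],
         [2, 1, 3]]
lu_t2 = [[1, 3, 3],
         [3, 1, 3]]

gain = urban_gain_map(lu_t1, lu_t2)
print(gain)
```

Cells that were already built-up at the first date are labeled 0 here, since they cannot gain urban use; whether such cells are excluded from the analysis altogether is a modeling decision that this sketch does not address.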

Table 1. List of spatial drivers of urban gain in Isfahan and Tabriz.

3. Methodology

Figure 2 depicts the modeling process and the proposed sampling strategy. The method used to create the training dataset for the various data-driven models should solve the class imbalance problem and include non-change samples with no to low potential for change. The study involves several steps, which are outlined below:

Figure 2. The flowchart of the urban gain modeling.


Step 1: After preparing the necessary land use maps, the urban gain cells of the selected time intervals 1992 to 2002 and 2002 to 2012 for Tabriz and 1994 to 2004 and 2004 to 2014 for Isfahan were extracted in the GIS environment. The urban gain drivers were then prepared for the two cities.

Step 2: The first time interval of both study areas was used for the sampling procedure and to build the models. The urban gain potential of the non-change samples was calculated using the ENFA and ME models together with the urban gain samples and drivers of the first time interval. The proposed sampling strategy keeps the change samples of the first time interval, enters them into the training dataset, and then selects an equal number of non-change samples with the lowest ME and ENFA values to complete the training dataset (Figure 2).

Step 3: This training dataset was then used to construct three well-known machine learning models, random forest (RF), artificial neural network (ANN), and support vector machine (SVM). These models have been used in various fields (Das and Chandra Pal 2020; Saha et al. 2022).

Step 4: The models were validated using the urban gain that occurred in the second time interval and the urban gain drivers at t2. The total operating characteristic (TOC), figure of merit (FoM), producer’s accuracy (PA), and overall accuracy (OA) were used for validation; they were calculated from the Hits, False Alarms, and Misses entries of the confusion matrix (Table 2).

Table 2. The error matrix.

Step 5: The suitability maps obtained from the ME and ENFA were entered directly into the validation phase and compared to those of the three machine learning models (Figure 2).
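The confusion matrix entries referred to in Step 4 can be counted as in the following minimal sketch. It assumes the observed and predicted maps have been flattened to lists of 0 (non-change) and 1 (change); the function name and the toy data are illustrative only.

```python
def confusion_entries(observed, predicted):
    """Count Hits, Misses, False Alarms, and Correct Rejections."""
    hits = misses = false_alarms = correct_rejections = 0
    for obs, pred in zip(observed, predicted):
        if obs == 1 and pred == 1:
            hits += 1                  # observed change, predicted change
        elif obs == 1 and pred == 0:
            misses += 1                # observed change, predicted non-change
        elif obs == 0 and pred == 1:
            false_alarms += 1          # observed non-change, predicted change
        else:
            correct_rejections += 1    # observed and predicted non-change
    return hits, misses, false_alarms, correct_rejections

observed = [1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1]
entries = confusion_entries(observed, predicted)
print(entries)
```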

3.1. Maximum entropy (ME)

Shannon entropy is a basic concept in information theory that Claude Shannon introduced in 1948 to quantify uncertainty in a random process (Gray 2011). The ME approach identifies a probability distribution that meets the constraints in the data (Berger, Della Pietra, and Della Pietra 1996). With a series of constraints on urban gain cells, UGMs aim to identify the unknown distribution (P), given a set of urban gain drivers. The information available for this distribution is the mean of the features (X) under P in the n change cells, defined as follows:

$$P_X = \frac{1}{n}\sum_{i=1}^{n} f_j(x_i) \tag{1}$$

The goal is to identify the distribution $P_X$ as an approximation of the actual distribution in the change cells. According to the ME principle, of all the possible distributions that satisfy the constraints, the distribution with the maximum entropy is the best, where the entropy is calculated by Eq. 2 (Buchen and Kelly 1996; Van Campenhout and Cover 1981):

$$H(\bar{P}) = -\sum_{x \in X} \bar{P}(x)\,\ln \bar{P}(x) \tag{2}$$

where ln represents the natural logarithm. Based on convex duality theory, the distribution that maximizes the entropy under the given constraints is the Gibbs distribution, which is the only distribution with the smallest Kullback-Leibler divergence that satisfies all the constraints without additional presumptions (Della Pietra, Pietra, and Lafferty 1997; Nasser and Cessac 2014). This distribution function is proportional to the conditional probability of being positive. Refer to Phillips, Anderson, and Schapire (2006) and Phillips, Dudík, and Schapire (2004) for more details.
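The entropy in Eq. 2 can be illustrated with a minimal sketch. The two four-cell distributions below are hypothetical and only show that, in the absence of further constraints, the uniform distribution has the highest entropy; fitting an actual ME model with feature constraints is not attempted here.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (natural logarithm) of a discrete distribution p."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(x * math.log(x) for x in p if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # no cell is preferred
peaked = [0.70, 0.10, 0.10, 0.10]    # one cell is strongly preferred

print(shannon_entropy(uniform), shannon_entropy(peaked))
```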

3.2. Ecological niche factor analysis (ENFA) model

Ecological niche factor analysis (ENFA) is a multivariate method that uses factor analysis and ecological niche theory to study the distribution of species according to environmental variables and presence-only locations without the need for absence locations (Hirzel et al. 2002). An ecological niche is formed in the model via habitats that are available to and used by the species (Brotons et al. 2004). This present study treats the urban gain cells as a species spreading across various regions over time. The ENFA model calculates the difference between the predictor variables in the change cells and those in the other cells of the entire study area (Basille et al. 2008). The model does not require any information on the non-change cells and calculates the suitability of the cells using the change cells and the predictor variables. Refer to Basille et al. (2008) for more details on the ENFA model.
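The idea of comparing predictor values in the change cells with those of the whole study area can be sketched as a standardized mean difference per driver. This is a simplified illustration and not the full ENFA factor extraction; the function name, the two drivers (distance to road in metres, slope in degrees), and the values are all hypothetical.

```python
from statistics import mean, pstdev

def standardized_mean_shift(all_cells, change_cells):
    """Per driver: (mean over change cells - global mean) / global std."""
    shifts = []
    for j in range(len(all_cells[0])):
        global_vals = [cell[j] for cell in all_cells]
        change_vals = [cell[j] for cell in change_cells]
        sd = pstdev(global_vals)
        # A constant driver carries no information; report a zero shift.
        shifts.append((mean(change_vals) - mean(global_vals)) / sd if sd > 0 else 0.0)
    return shifts

# Each cell: (distance to road, slope). Hypothetical values.
all_cells = [(100, 2), (200, 4), (300, 6), (400, 8)]
change_cells = [(100, 2), (200, 4)]

shifts = standardized_mean_shift(all_cells, change_cells)
print(shifts)
```

In this toy example both shifts are negative, meaning the change cells sit closer to roads and on gentler slopes than the study area as a whole.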

3.3. Validating the proposed sampling strategies

The data from the second time interval were used to evaluate the quality of the two proposed sampling strategies. Their outputs were converted to binary maps with values of 0 and 1 using a threshold of 0.5. Cells with values above 0.5 are considered to have the potential to change use, while cells with values below 0.5 are considered not to have that potential (Tayyebi and Pijanowski 2014). The strategy with the highest consistency was chosen as the best sampling strategy.
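The thresholding step can be sketched as follows. The text does not define how "consistency" was measured, so the simple share of matching cells used here is an assumption, as is the treatment of values exactly equal to 0.5.

```python
def binarize(scores, threshold=0.5):
    """Convert suitability scores to a 0/1 map."""
    # Values exactly equal to the threshold are treated as 0; the text does
    # not say how ties are handled, so this is an assumption.
    return [1 if s > threshold else 0 for s in scores]

def agreement(predicted, observed):
    """Share of cells where the binary map matches the observed map (assumed measure)."""
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)

scores = [0.9, 0.2, 0.6, 0.4]     # hypothetical ME or ENFA outputs
observed = [1, 0, 0, 0]           # hypothetical second-interval urban gain

binary = binarize(scores)
share = agreement(binary, observed)
print(binary, share)
```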

3.4. Applying the proposed sampling strategies in UGM

Three well-known and widely used machine learning algorithms, RF, ANN, and SVM, were used to test the proposed sampling strategies. Multiple studies on UGM have used these models (Jun 2021; Shafizadeh-Moghadam et al. 2017). As the purpose of this present study was to examine the ability of the two proposed sampling strategies to efficiently control and manage uncertainty in training datasets, the ANN, SVM, and RF models are not discussed in detail. After preparing the urban gain maps of Tabriz and Isfahan, the urban gain samples from the first time interval were used to develop the ME and ENFA models. The outputs of these models are maps with values between 0 and 1, which indicate the change potential estimated from the change samples in the first interval. In these maps, a cell with a value close to 1 has driver values close to those of the change samples. Non-change cells with values that exceed a threshold of 0.5 are removed from the training dataset. A number of non-change samples equal to the number of change samples was then selected from the cells with values below 0.5 according to the clustering-based sampling approach proposed by Ahmadlou, Karimi, and Pontius Jr (2021). The ANN, SVM, and RF models were then developed using this training dataset and evaluated using the urban gain of the second time interval. The predictions of these three models were compared to the observed values. The models were evaluated using the TOC (Pontius and Kangping 2014) and the Hits, Misses, and False Alarms entries in the confusion matrix.
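The sampling strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: each cell is a hypothetical `(cell_id, label, score)` tuple, and a seeded random draw stands in for the clustering-based selection of Ahmadlou, Karimi, and Pontius Jr (2021), which is not reproduced here.

```python
import random

def build_training_set(cells, threshold=0.5, seed=0):
    """Keep all change cells and add an equal number of low-potential non-change cells.

    cells: list of (cell_id, label, score), where label is 1 (change) or
    0 (non-change) and score is the ME or ENFA change potential in [0, 1].
    """
    change = [c for c in cells if c[1] == 1]
    # Non-change cells at or above the threshold are treated as impure and dropped.
    pure_non_change = [c for c in cells if c[1] == 0 and c[2] < threshold]
    # If too few pure cells exist, the result cannot be fully balanced.
    k = min(len(change), len(pure_non_change))
    rng = random.Random(seed)
    # Random draw used here as a stand-in for clustering-based sampling.
    return change + rng.sample(pure_non_change, k)

cells = [
    ("a", 1, 0.9), ("b", 1, 0.8),                 # change cells
    ("c", 0, 0.7), ("g", 0, 0.6),                 # impure non-change cells
    ("d", 0, 0.1), ("e", 0, 0.3), ("f", 0, 0.2),  # pure non-change cells
]
training = build_training_set(cells)
print(training)
```

The resulting list contains both change cells and two non-change cells drawn only from those scoring below 0.5.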

This study primarily uses the ME and ENFA models to examine the uncertainty in the urban gain training dataset. Nonetheless, the outputs of these models can also be used directly for UGM. These models, also known as one-class algorithms (Moya and Hush 1996), are developed solely using change samples. Their outputs were compared with the urban gain of the second time interval. The results were evaluated using the TOC, the FoM (Eq. 3), PA (Eq. 4), and OA (Eq. 5):

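$$\mathrm{FoM} = \frac{\mathrm{Hits}}{\mathrm{Hits} + \mathrm{Misses} + \mathrm{False\ Alarms}} \tag{3}$$
$$\mathrm{PA} = \frac{\mathrm{Hits}}{\mathrm{Hits} + \mathrm{Misses}} \tag{4}$$
$$\mathrm{OA} = \frac{\mathrm{Hits} + \mathrm{Correct\ Rejections}}{\mathrm{Hits} + \mathrm{Misses} + \mathrm{False\ Alarms} + \mathrm{Correct\ Rejections}} \tag{5}$$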
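Eqs. 3 to 5 translate directly into code. The sketch below uses hypothetical counts (30 Hits, 20 Misses, 20 False Alarms, 930 Correct Rejections) that are not taken from the study; it shows how a high OA can coexist with a modest FoM when non-change cells dominate.

```python
def figure_of_merit(hits, misses, false_alarms):
    """Eq. 3: Hits / (Hits + Misses + False Alarms)."""
    return hits / (hits + misses + false_alarms)

def producers_accuracy(hits, misses):
    """Eq. 4: Hits / (Hits + Misses)."""
    return hits / (hits + misses)

def overall_accuracy(hits, misses, false_alarms, correct_rejections):
    """Eq. 5: correctly predicted cells divided by all cells."""
    total = hits + misses + false_alarms + correct_rejections
    return (hits + correct_rejections) / total

# Hypothetical confusion matrix entries.
hits, misses, false_alarms, correct_rejections = 30, 20, 20, 930

fom = figure_of_merit(hits, misses, false_alarms)
pa = producers_accuracy(hits, misses)
oa = overall_accuracy(hits, misses, false_alarms, correct_rejections)
print(f"FoM = {fom:.3f}, PA = {pa:.3f}, OA = {oa:.3f}")
```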

4. Results

4.1. Sampling of the urban gain modeling using ME and ENFA

The urban gain drivers for Isfahan in 1994 and Tabriz in 1992 were used as the predictor variables, and the urban gain of Isfahan between 1994 and 2004 and that of Tabriz between 1992 and 2002 were used as the target or dependent variable. The ME and ENFA of all the non-change cells were calculated using the change samples and the urban gain drivers of the first time interval. Figures 3 and 4 depict the ME and ENFA maps calculated for the non-change cells for Isfahan in the first time interval. These maps were generated using only the urban gain samples of the first time interval, without any of the non-change samples. As seen, the ME and ENFA range from 0 to 0.96 and from 0 to 1, respectively. The closer the ME is to one, the higher the potential of a non-change sample to change. Conversely, the lower the ME, the lower the potential for urban gain. A number of non-change samples equal to the number of change samples was randomly generated from the first time interval (black points in Figures 3 and 4) to depict the uncertainty of the sampling approaches that extant studies have adopted. As seen in Figures 3 and 4 (A1–A6), a large number of the randomly generated non-change samples were placed in cells with high ME and ENFA. Therefore, if these non-change samples were used to create training datasets for data mining and machine learning models, they would lead to a high level of sampling uncertainty, because some of them had a high potential for urban gain. However, extant studies treated such samples as non-change samples in the training dataset. The green points in Figures 3 and 4 identify samples that could serve as non-change samples and have the lowest ME.

Figure 3. The maximum entropy of non-change samples for Isfahan City.


Figure 4. The ecological niche factor analysis of non-change samples for Isfahan City.


The ME and ENFA of all the cells were calculated using the urban gain samples and drivers of the first time interval for Tabriz (Figures 5 and 6). A number of non-change samples equal to the number of change samples was randomly generated, and a large number of them fell in cells with high ME and ENFA. To create the training dataset, non-change cells with ME and ENFA above 0.5 were removed from the study areas. Then, as Ahmadlou, Karimi, and Pontius Jr (2021) recommend, a clustering-based approach was used to select as many non-change samples as there were change samples in the first time interval from cells with ME and ENFA less than 0.5. To test the efficacy of the proposed sampling strategies, the ANN, SVM, and RF models were trained using the training dataset and their results were compared with those of extant studies that had used common sampling methods to create training datasets.

Figure 5. The maximum entropy of non-change samples for Tabriz City.


Figure 6. The ecological niche factor analysis of non-change samples for Tabriz City.


4.2. Urban gain modeling using the proposed sampling strategies

4.2.1. Binary classifiers

These models were developed using the binary labels 0 and 1, where 0 indicates that the examined phenomenon did not occur and 1 indicates that it did occur. In urban gain modeling, these values refer to the non-change (0) and change (1) samples. Binary models are the most common examples of UGMs. After using ENFA and ME to create the training datasets for Isfahan and Tabriz, three well-known binary models, RF, ANN, and SVM, were used in UGM. More specifically, the ME and ENFA models were used to create two training datasets for Tabriz and two training datasets for Isfahan. Six hybrid models, ME-ANN, ME-RF, ME-SVM, ENFA-ANN, ENFA-RF, and ENFA-SVM, were then developed for each city. The data of the second time interval, including the urban gain variable and the drivers at t2, were used to validate the developed models. Figures 7 and 8 depict the error maps as well as the values of the four entries in the error matrix (Table 2) for Isfahan and Tabriz, respectively. They also provide the prediction success rates of the models.

Figure 7. The spatially distributed errors of the (a) ME-ANN, (b) ME-RF, (c) ME-SVM, (d) ENFA-ANN, (e) ENFA-RF, and (f) ENFA-SVM model’s urban gain predictions for Isfahan City in the second time interval (2004–2014).


Figure 8. The spatially distributed errors of the (a) ME-ANN, (b) ME-RF, (c) ME-SVM, (d) ENFA-ANN, (e) ENFA-RF, and (f) ENFA-SVM model’s urban gain predictions for Tabriz City in the second time interval (2002–2012).


4.2.2. ENFA and ME models as one-class classifiers for UGM

Apart from the binary ANN, RF, and SVM models, the outputs of the ME and ENFA models as one-class classifiers were used directly as suitability maps. The cells with the highest ME and ENFA values were selected as the predicted urban gain cells for the second time interval, with the number of selected cells set equal to the number of urban gain cells observed in that interval. Figures 9 and 10 depict the error maps of these models for the two cities in the second time interval.
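This selection rule can be sketched as a top-k ranking of the suitability scores, where k is the number of observed urban gain cells. The scores and observations below are hypothetical, and the sketch assumes that only candidate cells (for example, cells not already urban at t2) are passed in.

```python
def predict_top_k(scores, k):
    """Label the k cells with the highest suitability scores as predicted gain."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]

scores = [0.9, 0.2, 0.7, 0.4, 0.8]   # hypothetical ME or ENFA suitability values
observed = [1, 0, 0, 0, 1]           # hypothetical second-interval urban gain

k = sum(observed)                    # predict as many gain cells as were observed
predicted = predict_top_k(scores, k)
print(predicted)
```

Because the predicted and observed quantities of change are equal by construction, every False Alarm in such a map is matched by one Miss.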

Figure 9. The spatially distributed errors of the (a) ME (b) ENFA model’s urban gain predictions for Isfahan City in the second time interval (2004–2014).


Figure 10. The spatially distributed errors of the (a) ME (b) ENFA model’s urban gain predictions for Tabriz City in the second time interval (2002–2012).


4.2.3. Validating the models using TOC and the four entries in the confusion matrix

Figures 11 and 12 show the TOC curves of the six hybrid binary models (i.e., ME-ANN, ME-SVM, ME-RF, ENFA-ANN, ENFA-SVM, and ENFA-RF) together with the two one-class models, ME and ENFA, for Isfahan and Tabriz, respectively. For Isfahan, the proposed ME-based sampling approach outperformed the proposed ENFA-based sampling approach. As seen in Figure 11, the RF-ME model (AUC = 0.737) was the most accurate, followed by the ANN-ME (AUC = 0.646), RF-ENFA (AUC = 0.631), ANN-ENFA (AUC = 0.619), SVM-ENFA (AUC = 0.512), and SVM-ME (AUC = 0.509) models. Furthermore, the SVM-based models were overfitted. The accuracies of the ME (AUC = 0.649) and ENFA (AUC = 0.661) one-class models were close to those of the ANN-ME, ANN-ENFA, and RF-ENFA models but differed significantly from that of the RF-ME model. Compared to the ANN (AUC = 0.682), SVM (AUC = 0.481), and RF (AUC = 0.661) models constructed using balanced sampling without the ME and ENFA models, the proposed RF-ME model (AUC = 0.737) increased the area under the receiver operating characteristic curve (AUROC) by 5.5% in the validation phase of Isfahan City.
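For readers who want to see how an AUC value of this kind arises, the following minimal sketch computes the AUROC as the share of (change, non-change) pairs in which the change cell receives the higher score, with ties counted as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve; it is not claimed to be the procedure the authors used to build their TOC curves, and the data are hypothetical.

```python
def auc_pairwise(observed, scores):
    """AUROC as the probability that a change cell outranks a non-change cell."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

observed = [1, 0, 1, 0, 0]             # hypothetical second-interval urban gain
scores = [0.9, 0.8, 0.6, 0.3, 0.2]     # hypothetical model outputs

auc = auc_pairwise(observed, scores)
print(round(auc, 3))
```

The double loop is quadratic in the number of cells, which is acceptable for an illustration but would need a rank-based implementation for full raster maps. A value near 0.5 corresponds to a model that ranks cells no better than chance.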

Figure 11. The TOC and AUC of the proposed models for Isfahan City.


Figure 12. The TOC and AUC of the proposed models for Tabriz City.


The proposed ME-based sampling approach outperformed the proposed ENFA-based sampling approach for Tabriz as well. As seen in Figure 12, the RF-ME model (AUC = 0.852) was the most accurate, followed by the ANN-ME (AUC = 0.778), RF-ENFA (AUC = 0.504), SVM-ENFA (AUC = 0.503), ANN-ENFA (AUC = 0.502), and SVM-ME (AUC = 0.449) models. The RF-ENFA, SVM-ENFA, ANN-ENFA, and SVM-ME models were overfitted. The accuracies of the ME (AUC = 0.657) and ENFA (AUC = 0.668) one-class models were lower than those of the two-class RF-ME and ANN-ME models. Compared to the ANN (AUC = 0.71), SVM (AUC = 0.521), and RF (AUC = 0.791) models constructed using balanced sampling without the ME and ENFA models, the proposed RF-ME model (AUC = 0.852) increased the AUROC by 6% in the validation phase of Tabriz City. Table 3 provides the PA, OA, and FoM of all the models. As seen, the PA, OA, and FoM of the ME-RF and ME-ANN models were higher than those of the other models.

Table 3. The validation of the eight models using FoM, OA, and PA.

5. Discussion

Urban gain is very important due to its impact on ozone concentration, water quality, pollution, and food security, among other things. Although multiple studies have developed models to depict urban gain behaviors, very few have examined the uncertainties in the training datasets used in these models, even though these datasets are often plagued by imbalance issues and the impurity of non-change samples.

The class imbalance problem in the training dataset can be overcome by using equal quantities of change and non-change samples. Past studies have randomly selected change and non-change samples to create training datasets. However, as urban gain datasets contain significantly fewer change samples than non-change samples, random sampling causes the training dataset to contain more non-change samples, which skews the model toward these samples. Therefore, urban gain modelers should consider using equal quantities of change and non-change samples in the training dataset, as machine learning and statistical models require balanced training datasets. This present study used under-sampling, which some extant studies have also used, to build a balanced training dataset.

Apart from the class imbalance problem, the impurity of the non-change samples used in the training dataset is another significant issue. More specifically, UGMs may encounter samples that have identical features but of which some are labeled change and others non-change. As there was no logical approach for selecting non-change samples from the available cells in the past, UGMs contained samples that had been randomly chosen from those cells. The findings of this present study indicate that randomly selecting cells for non-change samples yields samples that, in reality, may have a high potential for change but have been erroneously labeled non-change and entered into the training datasets. Therefore, this present study proposed a balanced sampling approach that uses two models, namely ME and ENFA, to select cells with the lowest potential for change as non-change samples and enter them into the training dataset. Conway and Wellen (2011) used the ENFA model to examine the purity of the non-change samples that they used to model the urban gain of the Barnegat Bay watershed in New Jersey, United States. However, their one-class ENFA model failed to outperform the logistic regression model. Conversely, this present study found that the one-class ENFA model outperforms the ANN, RF, and SVM binary models in both study areas. This could be because Conway and Wellen (2011) created their logistic regression model using non-change samples with the lowest urban gain potential that their ENFA-based urban gain suitability map had overestimated. The ENFA-based binary models of this present study, however, had reasonable results. This present study also used the ME model to select pure non-change samples and build a one-class model. However, the ENFA model outperformed the ME model in both Tabriz and Isfahan.
Nevertheless, the binary ANN and RF models constructed using the non-change samples selected from the ME probability map outperformed the one-class models as well as the other models built on the ENFA probability map. Although there were no significant differences between the ANN-ME model for Isfahan and the one-class ME and ENFA models, the ME-based binary ANN and RF models outperformed the other models in both study areas. Similar to the findings of Ahmadlou, Karimi, and Pontius (2021), the SVM-based models of this present study were overfitted in both study areas. Therefore, they failed to model the urban gain of both study areas.
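
The selection of non-change samples with the lowest potential for change can be sketched as follows. The sketch assumes that a change-potential (suitability) raster has already been produced by a one-class model such as ME or ENFA and that higher values indicate higher potential for change. It simply keeps the lowest-scoring unchanged cells and omits the subsequent diversification step, so it is a simplification rather than the exact procedure of this study. The function name `select_pure_non_change` and the toy rasters are hypothetical.

```python
import numpy as np

def select_pure_non_change(suitability, changed, n_samples):
    """Return flat indices of the n_samples unchanged cells with the lowest change potential."""
    suitability = np.asarray(suitability, dtype=float).ravel()
    changed = np.asarray(changed, dtype=bool).ravel()

    # Only cells that did not change are candidates for non-change samples.
    candidates = np.flatnonzero(~changed)
    if n_samples > candidates.size:
        raise ValueError("not enough unchanged cells to sample from")

    # Sort candidates by ascending change potential and keep the lowest ones.
    order = np.argsort(suitability[candidates], kind="stable")
    return candidates[order[:n_samples]]

# Hypothetical 3 x 3 suitability map and observed change map.
suitability = np.array([[0.90, 0.10, 0.40],
                        [0.05, 0.80, 0.20],
                        [0.70, 0.30, 0.60]])
changed = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 0]], dtype=bool)

# Take as many non-change samples as there are change cells to keep the set balanced.
n_change = int(changed.sum())
picked = select_pure_non_change(suitability, changed, n_change)
print(picked)
```

In this toy case, a random draw could have selected the unchanged cell with a potential of 0.70, which is the kind of impure non-change sample discussed above.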

Ensuring that the training dataset for UGMs contains samples with the greatest variety is also a significant challenge. Therefore, after using the ENFA and ME models to remove non-change samples with change potential from the training dataset, this present study used the framework proposed by Ahmadlou, Karimi, and Pontius (2021) to diversify the non-change samples.

As the urban gain patterns of Isfahan and Tabriz cities are very complex, multiple studies have attempted to model the urban gain in these areas. Most of these studies have focused on developing new hybrid models. More specifically, Parvinnezhad et al. (2021) proposed using support vector regression to integrate an adaptive neural fuzzy inference system and a fuzzy rough set to model the urban gain of Tabriz City. A comparison of the accuracy of the modeling results of that study with that of this present study indicates that focusing on sampling can improve the performance of a model more than developing hybrid models. Apart from that, Shafizadeh-Moghadam et al. (2017) developed a model that used the Land Transformation Model (LTM) and cellular automata to model the urban gain of Isfahan City. A comparison of the accuracy of the modeling results of that study with that of this present study also indicates that using a suitable sampling approach is more important than developing hybrid models.

This present study examined the use of the ME and ENFA models both for sampling and as one-class classifiers and found that, as one-class classifiers, they were less accurate than the binary models. Therefore, binary models are preferable to one-class models. This finding is in line with the study of Zhu et al. (2018), which compared two one-class models, namely one-class SVM and kernel density estimation, with two binary models, namely ANN and SVM. A similar finding was reported by Pandey, Reza Pourghasemi, and Chand Sharma (2020), who compared one-class ME and binary SVM.

The imbalance issue and the impurity of non-change samples are very complex in multiple LUC modeling as, apart from imbalances between the change and non-change classes, imbalances also occur between the change classes. Therefore, future studies may endeavor to overcome these issues in multiple LUC modeling. Imbalance issues in the training datasets of study areas that contain various interclass imbalance ratios also warrant further study. To address the impurity of non-change samples in multiple LUC modeling, researchers can produce a suitability map using ME for each LUC class and select the cells with the lowest potential for change as non-change samples. Also, a simple solution to the imbalance between the change classes is to select, from each LUC class, the same number of change samples as are available in the class with the fewest change samples.
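
The simple solution suggested for imbalance between change classes can be sketched as below. The sketch assumes that every change sample carries a label naming its LUC class and draws samples at random within each class. The function name `equalize_change_classes` and the class names and counts are hypothetical.

```python
import numpy as np

def equalize_change_classes(change_type, seed=0):
    """Return sorted indices that keep the same number of samples from every change class."""
    change_type = np.asarray(change_type)
    rng = np.random.default_rng(seed)

    classes, counts = np.unique(change_type, return_counts=True)
    n = counts.min()  # size of the change class with the fewest samples

    picked = [
        rng.choice(np.flatnonzero(change_type == c), size=n, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))

# Hypothetical change samples from three LUC classes with unequal counts.
change_type = np.array(["urban"] * 40 + ["agriculture"] * 15 + ["forest"] * 5)
idx = equalize_change_classes(change_type)
_, counts = np.unique(change_type[idx], return_counts=True)
print(counts)
```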

6. Conclusion

This present study explored the uncertainties that arise in the samples used in UGM, namely the class imbalance problem and the impurity of the non-change samples. Sampling is one of the most important steps in UGM with data-driven models, as it may introduce many uncertainties into the model outputs and affect their precision and accuracy. As such, this present study used two balanced sampling approaches, one ME-based and one ENFA-based, for UGM. Three well-known and widely used data mining models, namely ANN, SVM, and RF, and six hybrid models, namely ME-ANN, ME-SVM, ME-RF, ENFA-ANN, ENFA-SVM, and ENFA-RF, which had been constructed using the proposed sampling strategies, were used to evaluate the efficacy of the proposed sampling approaches. Two one-class models, one ME-based and one ENFA-based, were also developed and compared with the proposed two-class hybrid models. The urban gain of Isfahan City at the two time intervals of 1994–2004 and 2004–2014 and that of Tabriz City at the two time intervals of 1992–2002 and 2002–2012 were used to evaluate the proposed sampling approaches. The proposed sampling approaches were found to significantly increase the accuracy of the data mining models and decrease the size of the training dataset and the computational load of these models. Furthermore, the non-change samples that are selected for use in a training dataset should have the lowest potential for change and differ completely from the change samples. Therefore, the concept of “garbage in, garbage out” is important in data mining, and selecting the correct samples for the training dataset significantly affects the success rate of machine learning and data mining models. The binary data mining models that this present study developed also outperformed the one-class models.

This study provided a new perspective on sampling strategy and proposed two sampling approaches, one ME-based and one ENFA-based, for creating the training dataset for urban gain models. As data sampling is one of the most significant preprocessing steps in the data mining process, researchers and modelers may use the sampling strategies adapted and proposed in the present study to improve the accuracy of the many other machine learning and data mining techniques, such as decision trees. Moreover, future studies may investigate the use of the sampling approaches proposed in this present study in other study areas with different rates of urban gain.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data presented in this study are available on request from the corresponding author.

References

  • Ahmadlou, M., M. Karimi, and R. G. Pontius Jr. 2021. “A New Framework to Deal with the Class Imbalance Problem in Urban Gain Modeling Based on Clustering and Ensemble Models.” Geocarto International 37 (19): 5669–20. https://doi.org/10.1080/10106049.2021.1923826.
  • Aven, T. 2010. “Some Reflections on Uncertainty Analysis and Management.” Reliability Engineering & System Safety 95 (3): 195–201. https://doi.org/10.1016/j.ress.2009.09.010.
  • Bakker, M. M., G. Govers, A. van Doorn, F. Quetier, D. Chouvardas, and M. Rounsevell. 2008. “The Response of Soil Erosion and Sediment Export to Land-Use Change in Four Areas of Europe: The Importance of Landscape Pattern.” Geomorphology 98 (3–4): 213–226. https://doi.org/10.1016/j.geomorph.2006.12.027.
  • Basille, M., C. Calenge, E. Marboutin, R. Andersen, and J.-M. Gaillard. 2008. “Assessing Habitat Selection Using Multivariate Statistics: Some Refinements of the Ecological-Niche Factor Analysis.” Ecological Modelling 211 (1–2): 233–240. https://doi.org/10.1016/j.ecolmodel.2007.09.006.
  • Berger, A., S. A. Della Pietra, and V. J. Della Pietra. 1996. “A Maximum Entropy Approach to Natural Language Processing.” Computational Linguistics 22 (1): 39–71.
  • Brotons, L., W. Thuiller, M. B. Araújo, and A. H. Hirzel. 2004. “Presence‐Absence versus Presence‐Only Modelling Methods for Predicting Bird Habitat Suitability.” Ecography 27 (4): 437–448. https://doi.org/10.1111/j.0906-7590.2004.03764.x.
  • Buchen, P. W., and M. Kelly. 1996. “The Maximum Entropy Distribution of an Asset Inferred from Option Prices.” The Journal of Financial and Quantitative Analysis 31 (1): 143–159. https://doi.org/10.2307/2331391.
  • Conway, T. M., and C. C. Wellen. 2011. “Not Developed Yet? Alternative Ways to Include Locations without Changes in Land Use Change Models.” International Journal of Geographical Information Science 25 (10): 1613–1631. https://doi.org/10.1080/13658816.2010.534738.
  • Das, B., and S. Chandra Pal. 2020. “Assessment of Groundwater Recharge and Its Potential Zone Identification in Groundwater-Stressed Goghat-I Block of Hugli District, West Bengal, India.” Environment, Development and Sustainability 22 (6): 5905–5923. https://doi.org/10.1007/s10668-019-00457-7.
  • Della Pietra, S., V. D. Pietra, and J. Lafferty. 1997. “Inducing Features of Random Fields.” IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (4): 380–393. https://doi.org/10.1109/34.588021.
  • Dong, Y., Y. Liu, and J. Chen. 2014. “Will Urban Expansion Lead to an Increase in Future Water Pollution Loads?—A Preliminary Investigation of the Haihe River Basin in Northeastern China.” Environmental Science and Pollution Research 21 (11): 7024–7034. https://doi.org/10.1007/s11356-014-2620-6.
  • Don, A., J. Schumacher, and A. Freibauer. 2011. “Impact of Tropical Land‐Use Change on Soil Organic Carbon Stocks–A Meta‐Analysis.” Global Change Biology 17 (4): 1658–1670. https://doi.org/10.1111/j.1365-2486.2010.02336.x.
  • El Araby, M. 2002. “Urban Growth and Environmental Degradation: The Case of Cairo, Egypt.” Cities 19 (6): 389–400. https://doi.org/10.1016/S0264-2751(02)00069-0.
  • Gray, R. M. 2011. Entropy and Information Theory. Springer Science & Business Media. https://doi.org/10.1007/978-1-4419-7970-4.
  • Gu, Q., Z. Cai, L. Zhu, and B. Huang. 2008. Data Mining on Imbalanced Data Sets. Paper presented at the 2008 International Conference on advanced computer theory and engineering, Phuket, Thailand.
  • Hansen, A. J., R. S. DeFries, and W. Turner. 2012. “Land Use Change and Biodiversity.” In Land Change Science, 277–299. Springer Netherlands. https://doi.org/10.1007/978-1-4020-2562-4_16.
  • Hirzel, A. H., J. Hausser, D. Chessel, and N. Perrin. 2002. “Ecological‐Niche Factor Analysis: How to Compute Habitat‐Suitability Maps without Absence Data?” Ecology 83 (7): 2027–2036. https://doi.org/10.1890/0012-9658(2002)083[2027:ENFAHT]2.0.CO;2.
  • Jaydhar, A. K., S. Chandra Pal, A. Saha, A. R. M. T. Islam, and D. Ruidas. 2022. “Hydrogeochemical Evaluation and Corresponding Health Risk from Elevated Arsenic and Fluoride Contamination in Recurrent Coastal Multi-Aquifers of Eastern India.” Journal of Cleaner Production 369:133150. https://doi.org/10.1016/j.jclepro.2022.133150.
  • Jun, M.-J. 2021. “A Comparison of a Gradient Boosting Decision Tree, Random Forests, and Artificial Neural Networks to Model Urban Land Use Changes: The Case of the Seoul Metropolitan Area.” International Journal of Geographical Information Science 35 (11): 2149–2167. https://doi.org/10.1080/13658816.2021.1887490.
  • Karimi, F., S. Sultana, A. Shirzadi Babakan, and S. Suthaharan. 2019. “An Enhanced Support Vector Machine Model for Urban Expansion Prediction.” Computers, Environment and Urban Systems 75:61–75. https://doi.org/10.1016/j.compenvurbsys.2019.01.001.
  • Liu, Y., R. Yang, H. Long, J. Gao, and J. Wang. 2014. “Implications of Land-Use Change in Rural China: A Case Study of Yucheng, Shandong Province.” Land Use Policy 40:111–118. https://doi.org/10.1016/j.landusepol.2013.03.012.
  • Martin, Y., H. Van Dyck, N. Dendoncker, and N. Titeux. 2013. “Testing Instead of Assuming the Importance of Land Use Change Scenarios to Model Species Distributions Under Climate Change.” Global Ecology and Biogeography 22 (11): 1204–1216. https://doi.org/10.1111/geb.12087.
  • Matthews, R. B., N. G. Gilbert, J. G. P. Alan Roach, and N. M. Gotts. 2007. “Agent-Based Land-Use Models: A Review of Applications.” Landscape Ecology 22 (10): 1447–1459. https://doi.org/10.1007/s10980-007-9135-1.
  • McDonald, R. I., A. V. Mansur, M. C. Fernando Ascensão, K. Crossman, T. Elmqvist, A. Gonzalez, B. Güneralp, D. Haase, and M. Hamann. 2020. “Research Gaps in Knowledge of the Impact of Urban Growth on Biodiversity.” Nature Sustainability 3 (1): 16–24. https://doi.org/10.1038/s41893-019-0436-6.
  • Moya, M. M., and D. R. Hush. 1996. “Network Constraints and Multi-Objective Optimization for One-Class Classification.” Neural Networks 9 (3): 463–474. https://doi.org/10.1016/0893-6080(95)00120-4.
  • Nasser, H., and B. Cessac. 2014. “Parameter Estimation for Spatio-Temporal Maximum Entropy Distributions: Application to Neural Spike Trains.” Entropy 16 (4): 2244–2277. https://doi.org/10.3390/e16042244.
  • Nurwanda, A., and T. Honjo. 2020. “The Prediction of City Expansion and Land Surface Temperature in Bogor City, Indonesia.” Sustainable Cities and Society 52:101772. https://doi.org/10.1016/j.scs.2019.101772.
  • Pal, S. C., D. Ruidas, A. Saha, A. R. M. T. Islam, and I. Chowdhuri. 2022. “Application of Novel Data-Mining Technique-Based Nitrate Concentration Susceptibility Prediction Approach for Coastal Aquifers in India.” Journal of Cleaner Production 346:131205. https://doi.org/10.1016/j.jclepro.2022.131205.
  • Pandey, V. K., H. Reza Pourghasemi, and M. Chand Sharma. 2020. “Landslide Susceptibility Mapping Using Maximum Entropy and Support Vector Machine Models Along the Highway Corridor, Garhwal Himalaya.” Geocarto International 35 (2): 168–187. https://doi.org/10.1080/10106049.2018.1510038.
  • Parvinnezhad, D., M. Reza Delavar, B. C. Pijanowski, and C. Claramunt. 2021. “Integration of Adaptive Neural Fuzzy Inference System and Fuzzy Rough Set Theory with Support Vector Regression to Urban Growth Modelling.” Earth Science Informatics 14 (1): 17–36. https://doi.org/10.1007/s12145-020-00522-0.
  • Phillips, S. J., R. P. Anderson, and R. E. Schapire. 2006. “Maximum Entropy Modeling of Species Geographic Distributions.” Ecological Modelling 190 (3–4): 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026.
  • Phillips, S. J., M. Dudík, and R. E. Schapire. 2004. A Maximum Entropy Approach to Species Distribution Modeling. Paper presented at the Proceedings of the twenty-first international conference on Machine learning, Banff, Alberta, Canada.
  • Pirizadeh, M., N. Alemohammad, M. Manthouri, and M. Pirizadeh. 2021. “A New Machine Learning Ensemble Model for Class Imbalance Problem of Screening Enhanced Oil Recovery Methods.” Journal of Petroleum Science and Engineering 198:108214. https://doi.org/10.1016/j.petrol.2020.108214.
  • Pontius, R. G., J.-C. Castella, T. De Nijs, Z. Duan, E. Fotsing, N. Goldstein, K. Kok, E. Koomen, C. D. Lippitt, and W. McConnell. 2018. Trends in Spatial Analysis and Modelling: Decision-Support and Planning Strategies. 143–164.
  • Pontius, R. G., Jr, and S. Kangping. 2014. “The Total Operating Characteristic to Measure Diagnostic Ability for Multiple Thresholds.” International Journal of Geographical Information Science 28 (3): 570–583. https://doi.org/10.1080/13658816.2013.862623.
  • Ruidas, D., S. Chandra Pal, A. Reza Md Towfiqul Islam, and A. Saha. 2021. “Characterization of Groundwater Potential Zones in Water-Scarce Hardrock Regions Using Data Driven Model.” Environmental Earth Sciences 80 (24): 1–18. https://doi.org/10.1007/s12665-021-10116-8.
  • Ruidas, D., S. Chandra Pal, A. Reza Md Towfiqul Islam, and A. Saha. 2022. “Hydrogeochemical Evaluation of Groundwater Aquifers and Associated Health Hazard Risk Mapping Using Ensemble Data Driven Model in a Water Scares Plateau Region of Eastern India.” Exposure and Health 15 (1): 1–19. https://doi.org/10.1007/s12403-022-00480-6.
  • Saha, A., S. Chandra Pal, I. Chowdhuri, P. Roy, and R. Chakrabortty. 2022. “Effect of Hydrogeochemical Behavior on Groundwater Resources in Holocene Aquifers of Moribund Ganges Delta, India: Infusing Data-Driven Algorithms.” Environmental Pollution 314:120203. https://doi.org/10.1016/j.envpol.2022.120203.
  • Shafizadeh-Moghadam, H., A. Asghari, M. Taleai, M. Helbich, and A. Tayyebi. 2017. “Sensitivity Analysis and Accuracy Assessment of the Land Transformation Model Using Cellular Automata.” GIScience & Remote Sensing 54 (5): 639–656. https://doi.org/10.1080/15481603.2017.1309125.
  • Shafizadeh-Moghadam, H., A. Asghari, A. Tayyebi, and M. Taleai. 2017. “Coupling Machine Learning, Tree-Based and Statistical Models with Cellular Automata to Simulate Urban Growth.” Computers, Environment and Urban Systems 64:297–308. https://doi.org/10.1016/j.compenvurbsys.2017.04.002.
  • Shafizadeh-Moghadam, H., A. Tayyebi, M. Ahmadlou, M. Reza Delavar, and M. Hasanlou. 2017. “Integration of Genetic Algorithm and Multiple Kernel Support Vector Regression for Modeling Urban Growth.” Computers, Environment and Urban Systems 65:28–40. https://doi.org/10.1016/j.compenvurbsys.2017.04.011.
  • Surjan, A., G. Ara Parvin, and R. Shaw. 2016. “Impact of Urban Expansion on Farmlands: A Silent Disaster.” In Urban Disasters and Resilience in Asia, 91–112. Elsevier. https://doi.org/10.1016/B978-0-12-802169-9.00007-0.
  • Tayyebi, A., and B. C. Pijanowski. 2014. “Modeling Multiple Land Use Changes Using ANN, CART and MARS: Comparing Tradeoffs in Goodness of Fit and Explanatory Power of Data Mining Tools.” International Journal of Applied Earth Observation and Geoinformation 28:102–116. https://doi.org/10.1016/j.jag.2013.11.008.
  • Tayyebi, A., B. C. Pijanowski, M. Linderman, and C. Gratton. 2014. “Comparing Three Global Parametric and Local Non-Parametric Models to Simulate Land Use Change in Diverse Areas of the World.” Environmental Modelling & Software 59:202–221. https://doi.org/10.1016/j.envsoft.2014.05.022.
  • Tayyebi, A. H., A. Tayyebi, and N. Khanna. 2014. “Assessing Uncertainty Dimensions in Land-Use Change Models: Using Swap and Multiplicative Error Models for Injecting Attribute and Positional Errors in Spatial Data.” International Journal of Remote Sensing 35 (1): 149–170. https://doi.org/10.1080/01431161.2013.866293.
  • Ullah, M., L. Jing, and B. Wadood. 2020. “Analysis of Urban Expansion and Its Impacts on Land Surface Temperature and Vegetation Using RS and GIS, a Case Study in Xi’an City, China.” Earth Systems and Environment 4 (3): 583–597. https://doi.org/10.1007/s41748-020-00166-6.
  • Van Campenhout, J., and T. Cover. 1981. “Maximum Entropy and Conditional Probability.” IEEE Transactions on Information Theory 27 (4): 483–489. https://doi.org/10.1109/TIT.1981.1056374.
  • Van Minnen, J. G., K. Klein Goldewijk, E. Stehfest, B. Eickhout, G. van Drecht, and R. Leemans. 2009. “The Importance of Three Centuries of Land-Use Change for the Global and Regional Terrestrial Carbon Cycle.” Climatic Change 97 (1–2): 123–144. https://doi.org/10.1007/s10584-009-9596-0.
  • Zhao, W., X. Zhu, X. Sun, Y. Shu, and Y. Li. 2015. “Water Quality Changes in Response to Urban Expansion: Spatially Varying Relations and Determinants.” Environmental Science and Pollution Research 22 (21): 16997–17011. https://doi.org/10.1007/s11356-015-4795-x.
  • Zhu, A.-X., Y. Miao, L. Yang, S. Bai, J. Liu, and H. Hong. 2018. “Comparison of the Presence-Only Method and Presence-Absence Method in Landslide Susceptibility Mapping.” Catena 171:222–233. https://doi.org/10.1016/j.catena.2018.07.012.