A large-scale extraction framework for mapping urban informal settlements using remote sensing and semantic segmentation


Abstract

Urban informal settlements (UISs) are densely populated and poorly developed residential areas in cities. The mapping of UISs using remote sensing is crucial for urban planning and management. However, the large-scale extraction of UISs is impeded by the labor-intensive task of collecting numerous training samples and the lack of automatic and effective city partition. To overcome these challenges, we proposed a large-scale extraction framework for UISs based on semantic segmentation of high-resolution remote sensing images. Utilizing Deeplab V3 Plus as the foundational extraction model, the proposed framework introduces fast sample collection based on GLCM features. In addition, an automatic city partition approach combining clustering and fine-tuning was proposed to enhance the performance of extracting a specific category of UISs. The results of the case study conducted in 36 major Chinese cities show that the proposed framework achieved good performance, with an overall F1 score of 85.76%. Furthermore, comparative assessments were performed to demonstrate the effectiveness of automatic city partition. The proposed framework offers a practical approach for the large-scale extraction of UISs, which holds great significance for sustainable development, poverty estimation, infrastructure construction, and urban planning.


1. Introduction

Urban informal settlements (UISs) are densely populated areas with low-quality and non-durable housing, inadequate facilities, insufficient living space, and an insecure residential status (UN-Habitat 2015). These settlements usually lack basic infrastructure and public services, leading to low living standards and an increased risk of disasters and social problems. UISs are on the rise worldwide, particularly in Eastern and Southeastern Asia, sub-Saharan Africa, and Central and Southern Asia (UN-Habitat 2013; Arif et al. 2023). It is estimated that 25% of the world’s urban population lives in UISs, with this proportion continuing to grow (UN-Habitat 2016).

During China’s rapid urbanization in the last twenty years, a multitude of UISs have emerged (Zhang et al. 2018). Immigrant workers constructed houses of varying sizes on vacant lots to obtain living spaces in urban areas without top-down planning, which led to the formation of UISs (Dong et al. 2020). This type of settlement process is prevalent in Northeast China’s old industrial base. Furthermore, due to extensive urban expansion, certain rural villages were surrounded and integrated into newly formed urban areas, transforming into urban villages. This phenomenon is particularly notable in southern China, including cities like Shenzhen and Guangzhou (Wang et al. 2009; Lin and De Meulder 2012). Currently, an estimated 100 million people reside in UISs across China.

Advances in the spatial resolution of satellite remote sensing have enabled the monitoring of UISs’ spatial patterns by visual interpretation (Kuffer et al. 2016; Fallatah et al. 2022). UISs are frequently characterized by small-scale structures, limited vegetation, and high population density. They look different from other urban areas on high-resolution remote sensing images, which allows them to be extracted by visual interpretation (Gruebner et al. 2014). Although visual-interpretation methods have been applied at the city scale, the manual extraction of UISs is labor-intensive and time-consuming. Moreover, interpreters differ in how they understand the extraction criteria, so visual-interpretation results vary. The limitations of manual methods become evident when they are applied to large areas encompassing numerous cities.

Automatic extraction methods for UISs have gained popularity due to their potential to reduce labor and time costs. The most frequently used method for extracting UISs is geographic object-based image analysis (GEOBIA) (Blaschke et al. 2014), which employs multi-scale segmentation to identify irregular but internally consistent objects. The GEOBIA approach achieved good performance (Hofmann 2001; Rhinane et al. 2011; Shekhar 2012; Kohli et al. 2013; Zhao et al. 2020) because the hierarchical network of objects successfully represents the composition and configuration of houses in UISs. However, determining appropriate segmentation parameters for the large-scale and fine-grained extraction of UISs is a challenging and time-consuming task.

Compared with GEOBIA, machine learning-based methods have demonstrated significant effectiveness in larger study areas (Huang et al. 2015; Leonita et al. 2018; Prabhu and Parvathavarthini 2021; Chen et al. 2022; Matarira et al. 2022). One reason for their success is the utilization of powerful classifiers, including neural networks, random forest, support vector machines (SVM), and decision trees, to identify UISs. Another reason is the utilization of human-designed representations and features. Contextual image features, such as physical appearances and the gray-level co-occurrence matrix (GLCM), are effective for extracting UISs (Liu et al. 2017; Zhao et al. 2020). However, manually designed features are incomplete and over-specified, and the traditional classifiers in machine learning can only be trained on limited samples.

Deep learning is a crucial subfield of machine learning that concentrates on effectively training multilayer convolutional neural networks to capture complex feature representations from input data. The features that are automatically learned from a large number of samples are highly robust. Deep convolutional neural networks have been applied to semantic segmentation, the task of assigning a semantic category to every pixel. In urban environments, it is hard to design hand-crafted pixel features that are robust to the various land uses whose textures resemble those of UISs. End-to-end semantic segmentation networks based on convolutional neural networks have therefore been employed to map UISs at the pixel level. Variations of fully convolutional networks, Deeplab, and UNet have achieved more accurate mapping results than previous methods (Mboga et al. 2017; Persello and Stein 2017; Pan et al. 2020; Zhao et al. 2020; Fan, Li, Han, et al. 2022; Fan, Li, Li, et al. 2022; Fan, Li, Song, et al. 2022).

The application of deep learning to the large-scale mapping of UISs faces two primary challenges. The first is sample collection. The key to the success of deep learning is the automatic learning of features from large samples, and extracting UISs at a larger scale increases the workload of collecting positive samples as well as negative samples that are easily confused with UISs. The second challenge is automatic and effective partition. In the existing research on large-scale mapping and data production using remote sensing, partition, which divides research areas into multiple regions with relatively consistent features, improved mapping accuracies (Zhao et al. 2020). Partition was performed using regular shapes, such as hexagons (Huang et al. 2022), without considering feature differences within mapping zones. Some research considered feature similarity more carefully and used expert knowledge for partition (Zhao et al. 2020; Mao et al. 2022). However, automatic partition methods for UIS extraction have not been thoroughly investigated.

To address these challenges, this research proposes a large-scale extraction framework of UISs based on semantic segmentation of remote sensing images. Utilizing Deeplab V3 Plus as the foundational extraction model, the proposed framework used fast sample collection based on GLCM features to obtain features robust to easily confused parcels. Furthermore, an automatic city partition approach combined with clustering and fine-tuning was proposed to enhance the performance on extracting a specific category of UISs. The case study conducted on 36 major Chinese cities reveals that the proposed framework had high accuracy and strong robustness for large-scale UIS extraction.

The main contributions of this article are summarized as follows:

We proposed a large-scale framework for extracting UISs from remote sensing images, which can efficiently obtain accurate results.
A practical approach for collecting negative samples was developed based on texture features to better discriminate UISs from the various parcels that are easily confused with them.
By using automatic city partition, the large intra-class variation of UISs in large-scale study areas was reduced, resulting in improved UIS mapping accuracy.

The structure of this article is as follows: Section 2 introduces the study area and datasets. Section 3 outlines the proposed methodology for the large-scale automatic extraction of UISs. Section 4 reports the results. Section 5 offers a discussion of the results. Section 6 concludes this study.

2. Study area and data

2.1. Study area

The study was conducted in the urban areas of China’s 36 major cities, including four municipalities, 27 provincial capitals, and five municipalities with independent planning status. The 36 cities, located in different regions of China (Figure 1), were grouped into three levels according to urban population: level I (greater than 10 million, 7 cities), level II (5 million to 10 million, 12 cities), and level III (less than 5 million, 17 cities). Harbin, Shenyang, and Jilin, located in the old industrial base, were the first regions in China to become industrialized. Due to the large number of industrial workers and their concentrated residence, these cities have witnessed the emergence of numerous shantytown areas. After the 1980s, the Pearl River Delta region underwent large-scale industrialization and urbanization. Villages were surrounded by industrial and commercial areas, resulting in numerous urban villages. In the twentieth century, the cities in western China, including Chengdu, Xi’an, and Chongqing, also experienced rapid expansion. Despite the variations in urbanization processes, UISs are prevalent in the chosen 36 cities.

Figure 1. Distribution of the 36 cities included in this study.

2.2. Data sources

Remote sensing images and auxiliary data were used for the extraction and mapping of UISs, and all data used are listed in Table 1.

Table 1. Data details.


The high-resolution remote sensing images acquired from China’s Ziyuan-3 satellite were employed to produce digital ortho-photo maps (DOMs) through a series of preprocessing steps, including orthorectification, geometric correction, and mosaicking. The produced DOMs had a spatial resolution of 2.1 meters and consisted of three bands: red, green, and blue. Despite variations in the acquisition dates of remote sensing images, all utilized images were obtained in 2017. After producing the DOMs, the maps were utilized to extract UISs in the study area.

The auxiliary data in this study included urban and administrative boundaries. The global urban boundary and prefecture-level administrative unit were employed to limit extraction areas. The urban extent data of the 36 cities were obtained from the multi-temporal dataset of global urban boundaries (GUB; Li et al. 2020). We removed the patches that were smaller than 1 km² in the GUB dataset. The global administrative area was utilized for mapping extraction results.

3. Methods

This study presents a framework for the large-scale extraction of UISs, as illustrated in Figure 2. In order to reduce confusion with features exhibiting similar textures to UISs, such as villas and forests, a substantial number of negative samples were collected (Section 3.1). Based on the Deeplab V3 Plus model, the focal loss function (Lin et al. 2017) was employed in the subsequent training step to mitigate the impact of class imbalance, as described in Section 3.2. Due to the substantial intra-class variance of UISs, an automatic city partition method was developed (Section 3.3) by clustering and fine-tuning to obtain models that performed well on one category of UIS. The complete extraction process is detailed in Section 3.4.

Figure 2. Workflow of the proposed framework.

3.1. Reference samples collection

Homogeneous and dense textures were observed on remote sensing images of UISs where similar small buildings were repeated. However, similar textures also appeared in villa areas, forests, and large roofs, leading to potential confusion with UISs. In order to reduce confusion, we employed GLCM to extract the regions with homogeneous and dense textures. Training samples were then selected from across the study area, and especially from these extracted regions.

The GLCM contrast ratio is calculated with two different offsets to identify regions with dense textures. The offset parameter controls the distance between two pixels in the GLCM. Similar pixel value pairs can be detected in regions with dense textures regardless of the offset value, so the GLCM contrast in those areas stays nearly constant as the offset increases. In contrast, in formal areas with sparse textures, where buildings are far apart, the GLCM contrast increases with larger offsets. Dense textures can therefore be identified through the following ratio:

$r = \frac{C_{offset = (0, w_{2})}}{C_{offset = (0, w_{1})}}$ (1)

where C represents the GLCM contrast. Buildings are typically oriented along an east-west axis, so the texture pattern is more pronounced in the north-south direction. Therefore, the offset in the east-west direction was set to 0, while the offsets $w_{2}$ and $w_{1}$ in the north-south direction were set to different values, with $w_{2}$ larger than $w_{1}$.
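As an illustration of Equation (1), the following minimal sketch computes the contrast ratio on toy images. It relies on the fact that the contrast of a normalized GLCM equals the mean squared gray-level difference of pixel pairs separated by the offset. The function names, the striped test images, and the offsets w1 = 1 and w2 = 3 are assumptions made for illustration. The offset (0, w) is read here as zero columns and w rows, and none of these choices are taken from the study's implementation.

```python
def glcm_contrast(gray, d_row, d_col=0):
    """GLCM contrast for one offset.

    For a normalized GLCM P, contrast = sum_ij (i - j)^2 * P(i, j), which equals
    the mean squared difference of pixel pairs separated by (d_row, d_col).
    """
    rows, cols = len(gray), len(gray[0])
    total, count = 0, 0
    for r in range(rows - d_row):
        for c in range(cols - d_col):
            diff = gray[r][c] - gray[r + d_row][c + d_col]
            total += diff * diff
            count += 1
    return total / count


def contrast_ratio(gray, w1, w2):
    """Equation (1): contrast at north-south offset w2 over contrast at w1 (w2 > w1)."""
    return glcm_contrast(gray, w2) / glcm_contrast(gray, w1)


def stripes(width, size=16):
    """Toy image of horizontal stripes `width` rows thick (hypothetical stand-in for roofs)."""
    return [[(r // width) % 2 for _ in range(size)] for r in range(size)]


dense = stripes(1)    # tightly packed pattern
sparse = stripes(4)   # widely spaced pattern
r_dense = contrast_ratio(dense, w1=1, w2=3)
r_sparse = contrast_ratio(sparse, w1=1, w2=3)
print(r_dense, r_sparse)
```

On these toy inputs the ratio of the tightly packed pattern stays at 1.0, while the widely spaced pattern gives a ratio above 1, which mirrors the behavior described in the text.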

We collected training samples of UISs from the gridded study area using random and selective sampling. The urban area of a city was divided into equally sized grids, from which a given number of grids were randomly selected for collecting training samples. We then selected grids with a GLCM contrast ratio higher than a set threshold value. The UISs in the selected grids were labeled through visual interpretation. Random sampling diversified background features and captured multi-story and high-rise formal settlements. Selective sampling, in turn, intentionally incorporated easily confusable parcels, such as villa areas. By randomly and selectively collecting samples, we augmented the training dataset with informal settlements and formal settlements, including commodity houses and villa areas, so that the model could learn robust features to distinguish between UISs, formal settlements, and other land uses in urban areas.
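The two-stage sampling can be sketched as follows, assuming that a GLCM contrast ratio has already been computed for every grid cell. The function name, the grid identifiers, the ratio values, and the threshold of 2.0 are hypothetical. The source does not report the grid size, the number of random grids, or the threshold.

```python
import random


def select_grids(ratios, n_random, threshold, seed=0):
    """Pick grid cells for labeling by random plus selective sampling.

    ratios: dict mapping a grid-cell id to its GLCM contrast ratio r.
    Returns (random_ids, selective_ids, to_label).
    """
    rng = random.Random(seed)  # fixed seed keeps the random draw repeatable
    grid_ids = sorted(ratios)
    random_ids = rng.sample(grid_ids, n_random)
    # Selective sampling as worded in the text: keep cells whose ratio exceeds the threshold.
    selective_ids = [g for g in grid_ids if ratios[g] > threshold]
    # A cell chosen by both strategies is labeled only once.
    to_label = sorted(set(random_ids) | set(selective_ids))
    return random_ids, selective_ids, to_label


# Hypothetical ratios for six grid cells of one city.
ratios = {"g0": 0.9, "g1": 1.4, "g2": 2.6, "g3": 1.1, "g4": 3.0, "g5": 1.0}
random_ids, selective_ids, to_label = select_grids(ratios, n_random=2, threshold=2.0)
print(selective_ids)  # ['g2', 'g4']
```

Taking the union of the two selections keeps the background diversity of the random draw while guaranteeing that the texture-flagged cells are always inspected.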

3.2. Model training

The Deeplab V3 Plus model (Chen et al. 2018) was employed as a binary classifier to classify pixels into UISs and others. Popular semantic segmentation models use two types of structures to capture multi-scale features. The first is an encoder-decoder structure, where a high-to-low encoder extracts low-resolution representations layer by layer, and a low-to-high decoder recovers high-resolution feature maps. The second is the atrous spatial pyramid pooling (ASPP) module, which directly connects one layer to four atrous convolution layers with different atrous rates and then stacks the features of different scales. The Deeplab V3 Plus model combines these two architectures and performs well in semantic segmentation tasks. In the remote sensing community, the Deeplab V3 Plus model has been used directly or as a base architecture for identifying eco-environment elements (Wang et al. 2022), clouds (Segal-Rozenhaimer et al. 2020), and other targets.

The effects of class imbalance were mitigated by employing the focal loss function. Urban areas are complex regions for living, production, and trade, and the proportion of UIS is limited. Random selection of training samples from the study area resulted in a smaller number of samples of UISs compared to other background features. Additionally, negative samples were collected from dense texture areas, further exacerbating the issue of class imbalance in our sample dataset. To address the issue, the cross-entropy loss function in Deeplab V3 Plus was replaced with the focal loss function.

Focal loss, a variant of the standard cross-entropy loss, addresses the issues of class imbalance and hard samples. It is calculated as follows:

$FL(p_{t}) = -\alpha_{t} (1 - p_{t})^{\gamma} \log(p_{t})$ (2)

$p_{t} = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases}$ (3)

$\alpha_{t} = \begin{cases} \alpha & \text{if } y = 1 \\ 1 - \alpha & \text{otherwise.} \end{cases}$ (4)

Here p represents the predicted probability, y is the ground-truth label, the parameter α controls the class weight, and the parameter γ controls the degree of up-weighting of hard-to-classify pixels. The value of α is in the range of 0 to 1 and was set based on the ratio of positive to negative samples. Since the number of positive samples was limited, we emphasized the importance of correctly classifying difficult positive samples by setting γ larger than 0.
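A per-pixel version of Equations (2) to (4) can be written in a few lines. This is a minimal sketch for illustration rather than the loss implementation used for training. The defaults α = 0.75 and γ = 2 are the values reported in Section 3.4, and treating y = 1 as the UIS class is an assumption.

```python
import math


def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Focal loss of Equations (2)-(4) for a single pixel.

    p: predicted probability of the positive class (0 < p < 1).
    y: ground-truth label, 1 for positive (assumed to be UIS) and 0 otherwise.
    """
    p_t = p if y == 1 else 1.0 - p                 # Equation (3)
    alpha_t = alpha if y == 1 else 1.0 - alpha     # Equation (4)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)  # Equation (2)


easy_positive = focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard_positive = focal_loss(0.1, 1)   # confident and wrong: keeps a large loss
print(easy_positive, hard_positive)
```

With γ = 0 the modulating factor disappears and the expression reduces to an α-weighted cross-entropy, which is why a positive γ shifts the training signal toward hard pixels.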

3.3. Automatic city partition

UISs exhibit significant regional variations. The physical characteristics of UISs, such as the color, shape, and structure of roofs, diverge considerably in cities that are geographically distant. Moreover, inter-city differences in the appearance of UISs usually surpass intra-city differences. To address these regional disparities, we proposed an automatic scheme of city partition comprising two steps: (1) partitioning cities into distinct categories via clustering, and (2) fine-tuning the extraction model using the samples from cities within the same category.

In the partition step, we sampled patches of 64 × 64 pixels from the images of UISs and extracted a 384-dimensional feature for each patch. The patches were then clustered using a density-based method (Rodriguez and Laio 2014). If a majority of the patches sampled from a city belonged to a cluster i, the city was assigned to cluster i. We utilized color, texture, and shape features to describe patches. The features used were as follows:

The histogram of RGB bands from the DOM images was used as a color representation for each patch. The DOM images contained three bands, and we computed a 64-bin histogram for each band. The three histograms were then concatenated to generate a 192-dimensional feature.
The histogram of window variance was employed as a texture representation for each patch. The RGB images of each patch were first converted to grayscale. The window variance for all pixels was then computed to generate a 64-bin histogram of window variance.
The Scale-Invariant Feature Transform (SIFT) feature (Lowe 2004) was used as a shape representation of each patch. The SIFT method consists of two steps: detecting and describing stable feature points. Without detecting feature points, we divided one patch into non-overlapping sub-patches of 16 × 16 pixels and directly described the sub-patches. The 16 SIFT features were fused into a 128-dimensional feature using max-pooling.
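The color part of the descriptor and the majority rule for assigning a city to a cluster can be sketched as below. The function names are hypothetical, and normalizing each histogram, using equal-width bins over 0 to 255, and returning None when no cluster holds a strict majority are all assumptions. The window-variance and SIFT parts, as well as the density-based clustering itself, are left out of this sketch.

```python
from collections import Counter


def histogram(values, bins=64, max_value=255):
    """Normalized histogram of values in [0, max_value] with equal-width bins."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins / (max_value + 1)), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def color_feature(patch):
    """192-dimensional color feature: three concatenated 64-bin band histograms.

    patch: list of rows, each row a list of (r, g, b) tuples with values in 0..255.
    """
    feature = []
    for band in range(3):
        feature += histogram([px[band] for row in patch for px in row])
    return feature


def assign_city(patch_clusters):
    """Return the cluster holding a strict majority of a city's patches, else None."""
    cluster, count = Counter(patch_clusters).most_common(1)[0]
    return cluster if count > len(patch_clusters) / 2 else None


# A uniform 64 x 64 toy patch and hypothetical cluster labels of one city's patches.
patch = [[(10, 200, 255)] * 64 for _ in range(64)]
feature = color_feature(patch)
city_cluster = assign_city([3, 3, 1, 3, 2])
print(len(feature), city_cluster)  # 192 3
```

Concatenating this vector with the 64-dimensional texture histogram and the 128-dimensional pooled SIFT feature gives the 384 dimensions stated in the text.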

After partitioning the study area into distinct categories, we fine-tuned a semantic segmentation model for each category of UISs using a smaller learning rate. The base model for fine-tuning was the model obtained by training on all the samples collected from the 36 cities. The base model was fine-tuned with the samples belonging to the same category.

3.4. Extraction process

With the samples collected from the 36 cities as delineated in Section 3.1, we trained the DeepLab V3 Plus model using focal loss as a loss function as described in Section 3.2. After training and city partition as detailed in Section 3.3, we extracted UISs using fine-tuned models. The complete steps of extraction and assessment are listed below:

(1) A large number of training samples were collected using the method described in Section 3.1. A total of 26,769 samples of 256 × 256 pixels were collected from the urban areas of the 36 cities of China. Of these samples, 90% were used for training and the remaining 10% were used to assess the accuracy of extraction results.
(2) The DeepLab V3 Plus model with focal loss was constructed using the mmsegmentation (MMSegmentation Contributors 2020) codebase. The momentum algorithm was used to train the model. When training the DeepLab V3 Plus model, the batch size and the number of training iterations were set to 4 and 160,000, respectively. The first 140,000 iterations were used to train on all samples and obtain a base model M_b, and the remaining 20,000 iterations were used for fine-tuning. The learning rate started at an initial value of 0.01 and followed a polynomial decay with a power of 0.9. For focal loss, the parameter α was set to 0.75 and the parameter γ to 2 to handle the imbalance of training data.
(3) UISs of each city were extracted by fine-tuned models. The 36 cities were classified into 8 categories (Table 2), as described in Section 3.3. Then, we fine-tuned the model M_b by training on the samples of a category for 20,000 iterations. We extracted the UISs of a city using its corresponding fine-tuned model.
(4) We calculated the producer’s accuracy (PA), user’s accuracy (UA), and F1 score of the extraction results to assess accuracy. We also trained on all samples from the 36 cities and on the samples from a single cluster separately, using the same training parameters as those described in step 2. We thus obtained the model M_a by training on all samples, the fine-tuned model M_f, and the model M_c by training on the samples of one category, and compared their performance to show the effectiveness of automatic city partition.
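The learning-rate schedule of step 2 can be illustrated with the usual polynomial decay formula. This is a sketch that assumes a single schedule spanning all 160,000 iterations and a minimum learning rate of zero. Neither assumption is stated in the text, and the function name is hypothetical.

```python
def poly_lr(iteration, base_lr=0.01, max_iter=160_000, power=0.9, min_lr=0.0):
    """Polynomial decay: (base_lr - min_lr) * (1 - iteration / max_iter) ** power + min_lr."""
    return (base_lr - min_lr) * (1.0 - iteration / max_iter) ** power + min_lr


BASE_ITERS = 140_000    # iterations on all samples, yielding the base model M_b
TOTAL_ITERS = 160_000   # the remaining 20,000 iterations fine-tune on one category

lr_start = poly_lr(0)
lr_finetune_start = poly_lr(BASE_ITERS)
print(f"learning rate at iteration 0: {lr_start:.5f}")
print(f"learning rate at iteration {BASE_ITERS}: {lr_finetune_start:.5f}")
```

Under these assumptions the rate at the start of fine-tuning is well below the initial 0.01, which is consistent with the statement in Section 3.3 that fine-tuning used a smaller learning rate.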

Table 2. City partition results of UISs.


4. Results

4.1. Accuracy assessment

The local extraction results of UISs in typical cities are shown in Figure 3. The proposed method successfully distinguished commodity housing, which differs significantly from UISs, and small houses with disorganized layouts that have a similar texture to UISs on remote sensing images. Moreover, small green areas were separated from UIS patches, further demonstrating the fine-scale capabilities of our framework. Due to the diverse acquisition times and background features, the remote sensing images of the study area showed differences in tone and color. As indicated by the local extraction results in Figure 3, our framework performed well under various illumination conditions and urban backgrounds.

Figure 3. Local extraction results of representative cities.

The mean F1 score of extracting UISs in the 36 cities was 85.76%, and the overall PA and UA were 87.35% and 84.23%, respectively. We selected 20 villa areas from each city to evaluate the performance of distinguishing between informal and formal settlements. Of the 720 villa areas examined, 11 were erroneously classified as informal settlements, so the accuracy of identifying villa areas as formal settlements remained high at 98.47%. The high extraction accuracy suggests that the proposed framework can meet the demand for the large-scale mapping of UISs using remote sensing images.
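The reported figures can be cross-checked with the standard definitions, in which PA corresponds to recall, UA to precision, and F1 to their harmonic mean. The sketch below is a worked check rather than the study's evaluation code, and the function names are hypothetical.

```python
def f1_score(pa, ua):
    """F1 as the harmonic mean of producer's accuracy (recall) and user's accuracy (precision)."""
    return 2.0 * pa * ua / (pa + ua)


def accuracy_from_counts(tp, fp, fn):
    """Return (PA, UA, F1) from counts of true positives, false positives, and false negatives."""
    pa = tp / (tp + fn)
    ua = tp / (tp + fp)
    return pa, ua, f1_score(pa, ua)


overall_f1 = f1_score(87.35, 84.23)
villa_accuracy = 100.0 * (720 - 11) / 720
print(f"F1 from the overall PA and UA: {overall_f1:.2f}%")       # 85.76%
print(f"villa areas kept as formal:    {villa_accuracy:.2f}%")   # 98.47%
```

The harmonic mean of the overall PA and UA reproduces the 85.76% F1 score to two decimals, and 709 correctly identified villa areas out of 720 give the stated 98.47%.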

Figure 4 displays the extraction results of 9 cities, revealing three typical patterns of spatial distribution of UISs. In cities such as Harbin and Xi’an, the density of UISs gradually increased from the center to the periphery of urban areas. In contrast, Shenzhen exhibited a near-uniform distribution of UISs throughout its urban areas. The third pattern was the agglomeration of UISs in specific regions of urban areas, as exemplified by the high agglomeration of UISs in the western part of Nanning.

Figure 4. UIS maps of typical cities.

4.2. Robustness performance

For individual cities, the F1 scores of the extraction results ranged from 82.23% for Chongqing to 89.56% for Shenzhen, with over 85% accuracy achieved in 19 cities. Additionally, the automatic framework demonstrated high stability and reliability, with a small standard deviation of 1.93%.

Table 3 presents a comparison between the model M_c, trained with the samples belonging to one category, and the fine-tuned model M_f for each category. The overall performance of the fine-tuned models was significantly better than that of the model M_c, with F1-score improvements ranging from 2.04% to 3.58%. The difference in UA was small, while the difference in PA was much larger, indicating a more precise boundary of UISs, as shown in Figure 5. The extraction results of the models M_f and M_c also differed significantly in Figure 5, because some roads and small buildings were misidentified as UISs by the model M_c.

Figure 5. Illustrations of some UIS extraction results.

Table 3. Performance of the models trained on the samples of a specific category and of the fine-tuned models.


Table 4 presents a comparison between the model M_a, trained on all the samples, and the fine-tuned model M_f for each category of UISs. In comparison to the model M_a, the overall F1 score of the fine-tuned models improved, with an average improvement of 0.67%. Except for categories B and G, the overall PAs of the categories increased, while the overall UAs decreased. The increase in PA outweighed the decrease in UA, resulting in improved F1 scores. In Figure 5, the visual difference between the extraction results of the models M_a and M_f was small but perceptible. The model M_f excluded larger structures from the extraction results, leading to more uniform textures within the extracted regions. After fine-tuning, the model M_f performed worse than the model M_a for category G, while the model M_f exhibited limited improvement in extraction accuracy for category B. In the cities of categories B and G, some areas were misidentified by both the model M_a and the model M_f, with more pixels misidentified by the fine-tuned model M_f, as shown in Figure 6. Overall, the F1 score of the fine-tuned models was higher than that of the model obtained by training on all samples, and the boundaries of UISs extracted by the fine-tuned models were more accurate.

Figure 6. Typical misidentified example of the model M_a obtained by training all samples and the fine-tuned model M_f: (a) image; (b) result of the model M_a; (c) result of the model M_f.

Table 4. Comparison of the accuracies of the model obtained by training on all samples and of the fine-tuned models.


4.3. Extraction results of 36 cities

The area and proportion of UISs varied significantly among the 36 cities studied (Table 5). The total area of UISs in these cities was 3882.11 km², accounting for 7.39% of the urban areas. Overall, there were 9 cities where UISs accounted for more than 10% of urban areas, with Shijiazhuang being the only city exceeding 20%. Beijing ranked first in both UIS area and urban area, while Shanghai, with the second-largest urban area, had a lower area ratio, illustrating a balance between quantity and quality of urban development. Chongqing and Chengdu had the lowest proportions of UISs, indicating a higher level of urbanization and better living spaces for citizens.

Table 5. Area and proportion of UISs for China’s 36 major cities.


Figure 7 illustrates the spatial distribution of the proportion of UISs in the 36 major cities. With the Yangtze River as the dividing line, the cities in North and Southeast China had higher proportions of UISs. The cities located in the Yangtze River Economic Belt, including Chengdu, Chongqing, Wuhan, Hefei, Nanjing, and Shanghai, were characterized by their expansive urban areas and relatively low proportions of UISs. Beijing, Tianjin, Shijiazhuang, Jinan, and Qingdao in North China all had high proportions of UISs. The southeast coast was the most economically developed area in China, with high UIS proportions observed in Guangzhou, Haikou, Xiamen, and Fuzhou. The urbanization levels of the Yangtze River Economic Belt, in terms of UISs, were significantly higher compared to the southeastern coastal regions and North China.

Figure 7. Map of UIS proportions in urban areas of the 36 cities.

5. Discussion

5.1. Extraction performance

UISs in the 36 cities exhibited significant intra-class differences, which were alleviated by automatic city partition. The city partition allowed the extraction model to focus on a certain category of UISs. While learning with all samples captured the common features of UISs, fine-tuning with the samples from a specific category enabled the model to capture the unique details of that category. After automatic city partition, target images were matched with their corresponding fine-tuned model, resulting in more accurate extraction of UIS boundaries and an overall improvement in accuracy. The effectiveness of city partition highlighted the limitations of the Deeplab V3 Plus model in simultaneously learning the details of all categories of UISs, despite its large number of parameters. City partition proved to be both effective and necessary for large-scale extraction, given the significant intra-class variance and limited learning capacity of semantic segmentation models.

An adequate number of negative samples was crucial for the effectiveness of city partition. The combination of city partition and fine-tuned models resulted in the more accurate extraction of UIS boundaries. However, we observed that a fine-tuned model and a model trained on all the samples both misidentified a region resembling UISs, with the fine-tuned model mistakenly labeling a higher number of pixels (Figure 6). Many such instances of misclassification offset the accuracy improvement brought by city partition. For instance, the fine-tuned model of Category H exhibited a decrease in accuracy. Using a specific semantic segmentation model and ensuring an adequate number of negative samples can mitigate the misidentification of areas with similar textures, thereby highlighting the performance of fine-tuned models in extracting precise boundaries. Therefore, the combination of city partition and rapid collection of negative samples was meaningful for the large-scale mapping of UISs.

The integration of random and selective sample collection with end-to-end learning facilitated the differentiation between informal and formal settlements. The subtle variations in textural features between villa areas and UISs in the ZY-3 DOM posed a challenge for designing artificial features suitable for large-scale extraction. Instead of checking extraction results manually and taking erroneous results as training samples, we actively collected samples with similar textures to UISs, ensuring that the training dataset included representative parcels of UISs, villas, and commercial housing. By leveraging a semantic segmentation model based on deep learning, robust features were automatically acquired from the collected training samples, ensuring accurate discrimination between formal and informal settlements.

The proposed framework is unable to differentiate between historical residences and UISs. Historical residences, such as Shanghai’s Shikumen, are characterized by houses arranged in adjacent rows, resulting in similar texture features to UISs in remote sensing images. The proposed method utilized computer vision techniques to differentiate UISs based on their distinct physical characteristics in complex urban environments. Because the visual distinction between UISs and historical residences is small, the method cannot separate the two types of residential areas. In the urban areas of China, the quantity of historical residences is relatively small compared to that of UISs. Therefore, the historical residences extracted by the proposed framework had limited influence on the extraction results of UISs.

5.2. Potential applications

The proposed framework for the large-scale extraction of UISs from high-resolution remote sensing images would greatly benefit regionwide and nationwide extraction of UISs. UISs are prevalent in cities across China, which makes extraction at the national scale necessary for a better geographical understanding of UISs. The 36 major cities studied cover the main types of UISs. The proposed model enables rapid location of UISs in additional cities and speeds up the collection of positive samples. Moreover, as the study area expands to more prefecture-level cities, manual sample collection and city partition based on expert knowledge become increasingly tedious. The rapid collection and automated partition methods presented in this study are therefore significant for the nationwide extraction of UISs.

China has initiated and expedited its annual shantytown renovation program (Li et al. 2018), leading to fluctuations in the number of UISs each year. Monitoring the spatio-temporal distribution patterns of UISs using remote sensing images is necessary for urban management and planning. The proposed framework allowed for the rapid acquisition of initial results, serving as a reference for subsequent multi-temporal extraction. In cases where the quality and color differences between multi-temporal remote sensing images are not significant, the model trained on single-temporal images can be applied to images from other dates. After city partition, collecting samples sequentially from each temporal image according to the proposed method and training on the multi-temporal samples were effective for multi-temporal extraction. However, the multi-temporal extraction of UISs encounters challenges related to the consistency of multi-temporal results, as well as the rapid and effective reuse of models and samples obtained from single-temporal images. It is necessary to extend the proposed framework to gain a precise understanding of the temporal dynamics of UISs.

5.3. The UIS in China

After city partition, the eight categories of urban villages exhibited salient similarities and differences in terms of building density, size, roof colors, and building layout. Across the 36 cities, small houses were densely distributed in UISs. Typically, the spacing between adjacent houses was only 1–2 pixels in ZY-3 DOM images. The compact layout allowed local residents to make the most of limited space. The main difference among UISs was roof color: blue, red, and gray roofs were all observed in the study. Additionally, the building layouts of UISs varied, with some being chaotically arranged and others following a neat and orderly pattern. Buildings were often arranged in rows in the villages of Northern China, while buildings in UISs in Southern China were laid out in arrays after demolition and reconstruction. Regional variations in rural buildings across China and the transformation of real estate into urban villages were possible reasons for the differences among UISs across the 36 cities.

A possible reason why the proportions of UISs in the urban areas along the Yangtze River were low while those in other regions were high was the morphology of rural settlements. The physical morphology of rural settlements showed remarkable regional disparities. The settlements in the rural areas of Northern China were clustered and compact. Dispersed rural settlements and arc-belt rural settlements were the main types of settlement patterns in the Yangtze River Basin. In mountainous Southeastern China, with a large population and little land for construction, villagers built their houses around a center, forming clustered rural settlements. Under rapid urbanization, dispersed rural settlements in the Yangtze River Basin were prone to expropriation, while clustered rural settlements in Northern and Southeastern China were likely to be encircled by urban areas.

The proportions of slums in some cities of developing countries are listed in Table 6. The proportions of UISs in Jinan and Shijiazhuang were higher than the proportions of slums in Mumbai and Rio, further indicating the significance of UISs in China. The proportion of slums varied greatly among cities, as shown in Table 6, and our extraction results showed that the proportion of UISs in the 36 cities of China varied greatly, too. It is necessary to monitor UISs globally in order to compare the urban living environments of low-income and poverty-stricken populations (Arif and Gupta 2020).
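As a rough illustration of how such a proportion can be derived from an extraction result, the sketch below divides the number of UIS pixels inside the urban area by the number of urban pixels. The function name, the mask layout, and the toy values are assumptions for illustration; this section does not specify how the reported proportions were calculated.

```python
def uis_proportion(uis_mask, urban_mask):
    """Share of urban pixels labeled as UIS.

    Both masks are equal-size 2D lists of 0/1 values. UIS pixels that fall
    outside the urban mask are ignored (an assumption of this sketch).
    """
    urban_pixels = 0
    uis_pixels = 0
    for uis_row, urban_row in zip(uis_mask, urban_mask):
        for is_uis, is_urban in zip(uis_row, urban_row):
            if is_urban:
                urban_pixels += 1
                if is_uis:
                    uis_pixels += 1
    return uis_pixels / urban_pixels if urban_pixels else 0.0


# Toy masks (hypothetical): 10 urban pixels, 2 of them labeled as UIS.
urban_mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]
uis_mask = [
    [0, 1, 0, 1],  # the last pixel lies outside the urban area
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
proportion = uis_proportion(uis_mask, urban_mask)
print(f"UIS proportion: {proportion:.1%}")
```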

Table 6. Slum proportions of typical cities in developing countries.


6. Conclusions

In this article, we proposed a framework for the large-scale extraction of UISs using high-resolution remote sensing images. In the proposed framework, the ratio of GLCM contrasts with two different offsets was employed to accelerate the collection of training samples. Further, an automatic city partition by clustering was combined with fine-tuning to train models for a specific category of UISs. To the best of our knowledge, this is the first study to highlight the distinction between the large-scale and city-scale extraction of UISs in the era of deep learning. The experiments were conducted in the 36 major cities of China. The framework displayed good performance in extracting UISs. The overall F1 score of the 36 cities was 85.76%, with a small standard deviation of 1.93%. Comparative experiments showed that the proposed automatic city partition improved accuracy, and the main accuracy improvement was reflected in the boundary details of UISs. This finding demonstrates the advantage of the proposed framework in handling UIS extraction tasks.
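The GLCM contrast ratio mentioned above can be sketched in a few lines of plain Python. The offsets `(0, 1)` and `(0, 2)`, the direction of the ratio, and the toy patches are assumptions chosen for illustration; how the ratio was thresholded or used to select candidate samples is not described in this section.

```python
from collections import Counter


def glcm(image, offset):
    """Normalized gray-level co-occurrence matrix as {(i, j): probability}."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
    total = sum(counts.values())
    if total == 0:  # patch smaller than the offset
        return {}
    return {pair: n / total for pair, n in counts.items()}


def glcm_contrast(image, offset):
    """GLCM contrast: sum over (i, j) of (i - j)^2 * P(i, j)."""
    return sum((i - j) ** 2 * p for (i, j), p in glcm(image, offset).items())


def contrast_ratio(image, near=(0, 1), far=(0, 2)):
    """Ratio of GLCM contrasts at two offsets (near / far)."""
    far_contrast = glcm_contrast(image, far)
    if far_contrast == 0:
        return 0.0  # convention of this sketch for flat or periodic patches
    return glcm_contrast(image, near) / far_contrast


# Toy patches with quantized gray levels (hypothetical values).
dense_patch = [[0, 0, 3, 0, 0, 3]] * 4   # fine, repetitive texture
smooth_patch = [[0, 1, 2, 3, 4, 5]] * 4  # gradual intensity ramp

print(contrast_ratio(dense_patch), contrast_ratio(smooth_patch))
```

A patch whose gray levels change abruptly over one pixel yields a ratio much closer to 1 than a smoothly varying patch, which is why a ratio of this kind can help rank candidate parcels before manual checking.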

The large number of UISs extracted from remote sensing images underscored the critical need for the redevelopment of UISs. We propose that local governments institute a comprehensive suite of supportive policies aimed at encouraging social capital engagement in the redevelopment process. Implementing top-down land-use planning is imperative for ensuring the sustainable development of UISs. Additionally, the extracted urban villages located in central urban areas offered valuable guidance for real estate development. We recommend collaborative efforts to bolster the refurbishment of urban villages for the well-being of local people.

The proposed framework in this article facilitated the efficient and accurate large-scale mapping of UISs from remote sensing images. The multi-temporal framework for large-scale mapping of UISs will be further investigated in our future work. Moreover, for the purpose of understanding the dynamics of UISs, we plan to conduct multi-temporal mapping of UISs in China using remote sensing images.

Acknowledgments

The authors thank the anonymous reviewers for their suggestions and comments, which greatly improved this manuscript.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the Open Research Fund of Key Laboratory of Land Environment and Disaster Monitoring, Ministry of Natural Resources, China University of Mining and Technology (No. LEDM2021B08) and the National Key R&D Program of China (No. 2022YFF1303405).

References

Arif M, Gupta K. 2020. Spatial development planning in peri-urban space of Burdwan City, West Bengal, India: statutory infrastructure as mediating factors. SN Appl Sci. 2(11):1779. doi: 10.1007/s42452-020-03587-0.
Arif M, Sengupta S, Mohinuddin SK, Gupta K. 2023. Dynamics of land use and land cover change in peri urban area of Burdwan city, India: a remote sensing and GIS based approach. GeoJournal. 88(4):4189–4213. doi: 10.1007/s10708-023-10860-3.
Blaschke T, Hay GJ, Kelly M, Lang S, Hofmann P, Addink E, Queiroz Feitosa R, Van Der Meer F, Van Der Werff H, Van Coillie F, et al. 2014. Geographic object-based image analysis – towards a new paradigm. ISPRS J Photogramm Remote Sens. 87(100):180–191. doi: 10.1016/j.isprsjprs.2013.09.014.
Chen D, Tu W, Cao R, Zhang Y, He B, Wang C, Shi T, Li Q. 2022. A hierarchical approach for fine-grained urban villages recognition fusing remote and social sensing data. Int J Appl Earth Obs Geoinformation. 106:102661. doi: 10.1016/j.jag.2021.102661.
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, editors. Comput Vis – ECCV 2018 [Internet]. Vol. 11211. Cham: Springer International Publishing; [accessed 2024 Mar 12]; p. 833–851. doi: 10.1007/978-3-030-01234-2_49.
Dong L, Wang Y, Lin J, Zhu E. 2020. The community renewal of shantytown transformation in old industrial cities: evidence from Tiexi Worker Village in Shenyang, China. Chin Geogr Sci. 30(6):1022–1038. doi: 10.1007/s11769-020-1164-6.
Dubovyk O, Sliuzas R, Flacke J. 2011. Spatio-temporal modelling of informal settlement development in Sancaktepe district, Istanbul, Turkey. ISPRS J Photogramm Remote Sens. 66(2):235–246. doi: 10.1016/j.isprsjprs.2010.10.002.
Engstrom R, Sandborn A, Yu Q, Burgdorfer J, Stow D, Weeks J, Graesser J. 2015. Mapping slums using spatial features in Accra, Ghana. In 2015 Jt Urban Remote Sens Event JURSE [Internet]. Lausanne, Switzerland: IEEE; [accessed 2024 Mar 15]; p. 1–4. doi: 10.1109/JURSE.2015.7120494.
Fallatah A, Jones S, Wallace L, Mitchell D. 2022. Combining object-based machine learning with long-term time-series analysis for informal settlement identification. Remote Sens. 14(5):1226. doi: 10.3390/rs14051226.
Fan R, Li F, Han W, Yan J, Li J, Wang L. 2022. Fine-scale urban informal settlements mapping by fusing remote sensing images and building data via a transformer-based multimodal fusion network. IEEE Trans Geosci Remote Sensing. 60:1–16. doi: 10.1109/TGRS.2022.3204345.
Fan R, Li J, Li F, Han W, Wang L. 2022. Multilevel spatial-channel feature fusion network for urban village classification by fusing satellite and streetview images. IEEE Trans Geosci Remote Sensing. 60:1–13. doi: 10.1109/TGRS.2022.3208166.
Fan R, Li J, Song W, Han W, Yan J, Wang L. 2022. Urban informal settlements classification via a transformer-based spatial-temporal fusion network using multimodal remote sensing and time-series human activity data. Int J Appl Earth Obs Geoinformation. 111:102831. doi: 10.1016/j.jag.2022.102831.
Gruebner O, Sachs J, Nockert A, Frings M, Khan M, Lakes T, Hostert P. 2014. Mapping the slums of Dhaka from 2006 to 2010. Dataset Pap Sci. 2014:1–7. doi: 10.1155/2014/172182.
Hofmann P. 2001. Detecting informal settlements from Ikonos image data using methods of object oriented image analysis: an example from Cape Town (South Africa). In: Proceedings of the 2nd international symposium remote sensing of urban areas. Regensburg, Germany; p. 107–118.
Huang X, Liu H, Zhang L. 2015. Spatiotemporal detection and analysis of urban villages in mega city regions of China using high-resolution remotely sensed imagery. IEEE Trans Geosci Remote Sens. 53(7):3639–3657. doi: 10.1109/TGRS.2014.2380779.
Huang X, Yang J, Wang W, Liu Z. 2022. Mapping 10 m global impervious surface area (GISA-10m) using multi-source geospatial data. Earth Syst Sci Data. 14(8):3649–3672. doi: 10.5194/essd-14-3649-2022.
Kit O, Lüdeke M, Reckien D. 2012. Texture-based identification of urban slums in Hyderabad, India using remote sensing data. Appl Geogr. 32(2):660–667. doi: 10.1016/j.apgeog.2011.07.016.
Kohli D, Warwadekar P, Kerle N, Sliuzas R, Stein A. 2013. Transferability of object-oriented image analysis methods for slum identification. Remote Sens. 5(9):4209–4228. doi: 10.3390/rs5094209.
Kuffer M, Pfeffer K, Sliuzas R. 2016. Slums from space—15 years of slum mapping using remote sensing. Remote Sens. 8(6):455. doi: 10.3390/rs8060455.
Leonita G, Kuffer M, Sliuzas R, Persello C. 2018. Machine learning-based slum mapping in support of slum upgrading programs: the Case of Bandung City, Indonesia. Remote Sens. 10(10):1522. doi: 10.3390/rs10101522.
Li X, Kleinhans R, van Ham M. 2018. Shantytown redevelopment projects: state-led redevelopment of declining neighbourhoods under market transition in Shenyang, China. Cities. 73:106–116. doi: 10.1016/j.cities.2017.10.016.
Li X, Gong P, Zhou Y, Wang J, Bai Y, Chen B, Hu T, Xiao Y, Xu B, Yang J, et al. 2020. Mapping global urban boundaries from the global artificial impervious area (GAIA) data. Environ Res Lett. 15(9):094044. doi: 10.1088/1748-9326/ab9be3.
Lin T-Y, Goyal P, Girshick R, He K, Dollar P. 2017. Focal loss for dense object detection. In: Proc IEEE Int Conf Comput Vis ICCV. Venice, Italy; p. 2999–3007.
Lin Y, De Meulder B. 2012. A conceptual framework for the strategic urban project approach for the sustainable redevelopment of “villages in the city” in Guangzhou. Habitat Int. 36(3):380–387. doi: 10.1016/j.habitatint.2011.12.001.
Liu H, Huang X, Wen D, Li J. 2017. The use of landscape metrics and transfer learning to explore urban villages in China. Remote Sens. 9(4):365. doi: 10.3390/rs9040365.
Lowe DG. 2004. Distinctive image features from scale-invariant keypoints. Int J Comput Vis. 60(2):91–110. doi: 10.1023/B:VISI.0000029664.99615.94.
Mao L, Zheng Z, Meng X, Zhou Y, Zhao P, Yang Z, Long Y. 2022. Large-scale automatic identification of urban vacant land using semantic segmentation of high-resolution remote sensing images. Landsc Urban Plan. 222:104384. doi: 10.1016/j.landurbplan.2022.104384.
Matarira D, Mutanga O, Naidu M. 2022. Google earth engine for informal settlement mapping: a random forest classification using spectral and textural information. Remote Sens. 14(20):5130. doi: 10.3390/rs14205130.
Mboga N, Persello C, Bergado J, Stein A. 2017. Detection of informal settlements from VHR images using convolutional neural networks. Remote Sens. 9(11):1106. doi: 10.3390/rs9111106.
MMSegmentation Contributors. 2020. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark [Internet]. https://github.com/open-mmlab/mmsegmentation.
Pan Z, Xu J, Guo Y, Hu Y, Wang G. 2020. Deep learning segmentation and classification for urban village using a worldview satellite image based on U-Net. Remote Sens. 12(10):1574. doi: 10.3390/rs12101574.
Persello C, Stein A. 2017. Deep fully convolutional networks for the detection of informal settlements in VHR images. IEEE Geosci Remote Sensing Lett. 14(12):2325–2329. doi: 10.1109/LGRS.2017.2763738.
Prabhu R, Parvathavarthini B. 2021. An enhanced approach for informal settlement extraction from optical data using morphological profile-guided filters: A case study of Madurai city. Int J Remote Sens. 42(17):6688–6705. doi: 10.1080/01431161.2021.1943039.
Rhinane H, Hilali A, Berrada A, Hakdaoui M. 2011. Detecting slums from SPOT data in Casablanca Morocco using an object based approach. JGIS. 3(3):217–224. doi: 10.4236/jgis.2011.33018.
Rodriguez A, Laio A. 2014. Clustering by fast search and find of density peaks. Science. 344(6191):1492–1496. doi: 10.1126/science.1242072.
Segal-Rozenhaimer M, Li A, Das K, Chirayath V. 2020. Cloud detection algorithm for multi-modal satellite imagery using convolutional neural-networks (CNN). Remote Sens Environ. 237:111446. doi: 10.1016/j.rse.2019.111446.
Shekhar S. 2012. Detecting slums from Quick Bird data in Pune using an object oriented approach. Int Arch Photogramm Remote Sens Spatial Inf Sci. XXXIX-B8:519–524. doi: 10.5194/isprsarchives-XXXIX-B8-519-2012.
UN-Habitat. 2013. Streets as public spaces and drivers of urban prosperity. Nairobi, Kenya: UN-Habitat.
UN-Habitat. 2015. Habitat III issue papers 22—informal settlements. Nairobi, Kenya: UN-Habitat.
UN-Habitat. 2016. World Cities Report 2016: urbanization and development-emerging futures. New York: UN-Habitat.
Wang C, Zhang R, Chang L. 2022. A study on the dynamic effects and ecological stress of eco-environment in the headwaters of the Yangtze river based on improved DeepLab V3+ Network. Remote Sens. 14(9):2225. doi: 10.3390/rs14092225.
Wang YP, Wang Y, Wu J. 2009. Urbanization and informal development in China: urban villages in Shenzhen. Int J Urban Regional Res. 33(4):957–973. doi: 10.1111/j.1468-2427.2009.00891.x.
Wurm M, Taubenböck H. 2018. Detecting social groups from space–assessment of remote sensing-based mapped morphological slums using income data. Remote Sens Lett. 9(1):41–50. doi: 10.1080/2150704X.2017.1384586.
Wurm M, Taubenböck H, Weigand M, Schmitt A. 2017. Slum mapping in polarimetric SAR data using spatial features. Remote Sens Environ. 194:190–204. doi: 10.1016/j.rse.2017.03.030.
Zhang F, Zhang C, Hudson J. 2018. Housing conditions and life satisfaction in urban China. Cities. 81:35–44. doi: 10.1016/j.cities.2018.03.012.
Zhao L, Ren H, Cui C, Huang Y. 2020. A partition-based detection of urban villages using high-resolution remote sensing imagery in Guangzhou, China. Remote Sens. 12(14):2334. doi: 10.3390/rs12142334.
