Research Article

Capturing deprived areas using unsupervised machine learning and open data: a case study in São Paulo, Brazil

Article: 2214690 | Received 03 Dec 2022, Accepted 12 May 2023, Published online: 19 May 2023

ABSTRACT

Managing the rapid growth of deprived areas (commonly known as slums, informal settlements, etc.) in cities of Low- to Middle-Income Countries (LMICs) demands detailed and consistent information that is often unavailable. Recent Earth Observation (EO) mapping approaches with supervised classification models overlook the diversity of deprived areas and require resource-intensive training sets. In this study, we analyse the potential of unsupervised machine learning (ML) models to capture intra-urban diversity of deprived areas in São Paulo, using solely open geodata. We provide a workflow of characterising deprivation at a city scale with a disaggregated approach, offering scalability and transferability potential. First, we extract a pool of spatial features from open geospatial datasets to characterise the morphological and environmental conditions of the study area. After input preparation, we train and optimise a k-means model, including a coupled feature importance tool. Four cluster types emerged with different deprivation aspects such as higher and lower accessibility to services and infrastructure, sparser and denser occupation; regular and complex morphology; flat and steep terrain. This alternative methodology to capture diversity of deprived areas with open EO-based features can inform locally targeted, thus more efficient, urban policies and interventions.

Introduction

One out of eight people in the world lives in slum-like settlements, marked by poor housing and service provision (UN-Habitat, Citation2016). Most of them are located in cities of Low-to-Middle-Income Countries (LMIC), where high urban growth rates and weak planning capacities coincide (Bastos da Cunha et al., Citation2015). This leads to high housing demand that local governments cannot accommodate and, in turn, to the fast development of deprived areas (Kohli et al., Citation2012).

The traditional concept of slums has been used widely in literature (UN-Habitat, 2003). Despite its irrefutable significance, it is limited to identifying deprivation at the household level and overlooks the areal characteristics of slums (Lucci et al., Citation2018; Olthuis et al., Citation2015). This definition leads to entire areas being classified as slums, disregarding area-level nuances and multiple manifestations (Thomson et al., Citation2020; Patel et al., Citation2014). Therefore, this study adopts the term “deprived area” to describe neighbourhoods facing deprivation in social, environmental and ecological dimensions that differ across and within cities and are highly dependent on their local contexts (Abascal et al., Citation2021).

With the current international interest in slum improvement, there are multiple governmental efforts to upgrade these areas, requiring consistent and up to date spatial information about their living conditions to support effective pro-poor plans and interventions (Kohli et al., Citation2013; Thomson et al., Citation2020). Yet, acquiring such information is a challenge. The most traditional method to acquire data on deprived areas is census surveys. The fieldwork produces information at the household level typically aggregated at administrative units, which are likely to ignore the morphological local heterogeneities and stigmatise neighbourhoods by assigning average values to the entire unit (Martínez et al., Citation2016; Taubenböck et al., Citation2018).

Besides, the large temporal gaps and the resource-consuming process of household data collection (e.g. Census) are a huge challenge to LMICs (Owen & Wong, Citation2013). To overcome such issues, the increasing availability of Earth Observation (EO) data allows disaggregated information with high temporal resolution. The recent remote sensing (RS) studies extract spatial, spectral and textural features from satellite imagery and translate it into information on the living conditions in deprived areas (Leonita et al., Citation2018). However, even with the rising availability of low-cost drone imagery, the acquisition of Very-High-Resolution (VHR) multispectral data is still cost prohibitive, restraining the EO-based approaches to generate image-based information to small urban districts (Gram-Hansen et al., Citation2019; Merodio et al., Citation2021; Small, Citation2014).

In this context, the potential of open geodata sources – such as the United States Geological Survey (USGS), the Copernicus Programme and OpenStreetMap (OSM) platforms – to provide a rich repository for characterising different urban areas has been systematically explored in recent years (Bhuyan et al., Citation2022; Geiß et al., Citation2017). Within deprivation research, however, only a few studies have incorporated such datasets, owing to the challenges of harnessing and integrating them (Ajami et al., Citation2019; Mahabir et al., Citation2020). Additionally, the quality and completeness of open geodata, especially data derived from volunteered mapping initiatives, should be inspected for uncertainties. Cartographic and thematic inaccuracies, non-systematic updates and uneven completeness across different areas of the globe are some of the limitations to consider when building an approach on such datasets (Eckle et al., Citation2016; Yeboah et al., Citation2021).

In addition to data constraints to acquire information on deprived areas, literature shows methodological shortcomings. EO-based studies present different purposes and modelling approaches. Most research is focused on mapping the settlements, i.e. delineation purposes, classifying deprived and non-deprived areas based on their physical characteristics extracted from VHR imagery (Gevaert et al., Citation2019; Kohli et al., Citation2016). There are different methods to identify deprived areas – visual interpretation, pixel-based image classification and object-based image analysis (Dos Santos et al., Citation2022) – but recent Machine Learning (ML) and Deep Learning (DL) methods have demonstrated high performance (Ajami et al., Citation2019; Duque et al., Citation2017; Luo et al., Citation2022).

Supervised ML models are developed to classify deprived settlements in (semi-)automatic ways, combining spectral and textural information (Kuffer et al., Citation2016), using ML-OBIA techniques (Duque et al., Citation2017) or applying pixel-based classifiers such as in Mahabir et al. (Citation2020). Despite the undeniable importance of these detection models, they overlook the diverse nature of deprived settlements, obstructing the development of more contextualised, and hence more effective, policies (Kuffer et al., Citation2020). Moreover, the few studies that do work on capturing the different characteristics of deprived areas (Ajami et al., Citation2019; Luo et al., Citation2022) rely on the acquisition of high-quality training and reference data for the supervised models, making adoption in LMICs difficult (Jochem et al., Citation2020). This task is not only cost-prohibitive due to high VHR costs, but also resource-intensive, which hampers the scalability of these models (Gevaert et al., Citation2019). Especially in the context of the COVID-19 pandemic, there is a pressing demand for data-driven methods that can provide an alternative to field work (Brito et al., Citation2020). Unsupervised ML techniques could offer such an alternative, but to date no gridded unsupervised ML approach to capture local urban deprivation has been documented in the literature. Even though non-gridded approaches are more common in research, they are much more computationally demanding and require the availability of homogeneous units (e.g. land parcels or street blocks) that are often inaccessible in LMICs (Kuffer et al., Citation2020).

Motivated by these challenges, the present study proposes a workflow that explores the potential of unsupervised ML models to capture the intra-urban diversity of deprived areas at city scale, using solely open data sources. The city of São Paulo, Brazil, serves as a case study.

The remainder of this paper is structured as follows. First, we describe the research methods and materials, covering the case study area, the data collection, the data processing, the unsupervised machine learning model and the validation process. Then, we present the results, followed by the discussion. Finally, we summarise the results and provide recommendations for further research.

Materials and methods

The general workflow to capture deprived areas using a gridded unsupervised approach is presented in Figure 1 and is based on four main steps: (i) we select a pool of indicators and collect open data sources, considering a specific set of requirements and the physical aspects of deprivation in the study area; (ii) we define a sensible unit of analysis, extract spatial features and integrate them; (iii) we employ an unsupervised ML model and feature importance techniques; and (iv) we assess the model results with multiple validation procedures. After the subsection on the case study, the following four subsections describe each of the four steps presented in the methodology flowchart.

Figure 1. Overview of the proposed methodology.

Case study: São Paulo

São Paulo, capital of the State of São Paulo, is the most populated city in Brazil (12.3 million) and covers 1,521 km² according to the latest census (IBGE, Citation2010). We selected the city to develop the methodology for three main reasons: (1) The city’s complex layout and unprecedented growth rate of deprived settlements towards the peripheral areas. The population living in informal areas grew four times as fast as the formal population, which poses an urgent challenge to policy makers (Pasternak & D’Ottaviano, Citation2016). (2) In 2019, the Agency of Urban Development and the Federal University of ABC Region in São Paulo developed a mapping and classification study for the Metropolitan Area of the State of São Paulo using Logistic Regression (da Fonseca Feitosa et al., Citation2021). The study was applied in different cities, but to date no such research has been released for the capital city. (3) The availability of a recent national data set on the spatial extent of deprived areas. The existence of an open-source GIS layer delineating the deprived areas is a prime requirement of the present workflow. The Brazilian Institute of Geography and Statistics (IBGE) released the new census layer in 2019, mapping deprived areas as Aglomerados Subnormais, abbreviated as AGSN and directly translated as Substandard Settlements (Figure 2). The concept, used since 1991, describes settlements with at least 51 houses in a substandard urban setting, deprived of legal ownership, regular arrangement and service provision (IBGE, Citation2010).

Figure 2. Study area São Paulo with substandard settlements in 2019. Source: (IBGE, Citation2019).

The previous AGSN layer, from the 2010 census, has omission and topological issues. Our visual inspection yielded results similar to those of Ferreira and Feitosa (Citation2020), who investigated such inconsistencies in the new layer and found better spatial coverage and corrected topological errors.

Step 1 - data collection

EO-based studies have shown that the physical characteristics of deprived areas can be derived from RS imagery that can be acquired worldwide (Ghaffarian et al., Citation2018; Leonita et al., Citation2018). However, in view of their underlying heterogeneity, some local adjustments are necessary in order to choose the spatial features that can translate most of the socioeconomic information of the study area (Olthuis et al., Citation2015; Wurm et al., Citation2017). We structure the data collection process by summarising the observable features that can indicate the spatial variations of deprived areas on the ground (Figure 3). Figure 3 shows the built-up morphology domain, which encompasses the differences in building geometry (size, shape), density and layout (texture) characteristics, as well as accessibility to services and infrastructure; and the land morphology domain, which includes topography, the presence of green areas and hydrography.

Figure 3. Diagram of the domains of the diversity of deprived areas.

Besides the conceptual structure, the data collection process was guided by the following set of requirements, relying on the work of Mahabir et al. (Citation2020): all features must be spatial, quantitative, available for the entire study area, recurrently updated, provided from open sources and manifest the local nuances of the study area.

Given the concepts and requirements outlined above, we compiled a list of 32 spatial features identified in the literature as capturing deprivation and used them to build an open geodatabase for São Paulo (Table 1). In addition, we collected two auxiliary datasets to support the validation of the results: Census data (2010), acquired from the IBGE web portal and aggregated at the census tract level – the 10-year gap remains only because the 2020 census activities were suspended due to the pandemic – and the municipal land use layer. Considering the replicability potential of the model and the unprecedented growth rate of deprived areas, there is a risk of erroneous assessment of the mapping results. For this reason, we focused this statistical validation on analysing how well the clusters separate the socioeconomic aspects of deprivation.

Table 1. List of EO-based features found in literature.

Step 2 - data processing

After data collection, we conducted a series of processing steps to extract the spatial features described in Table 1 and integrate them as the model input.

Spatial unit of analysis

Considering the benefits of providing gridded outputs with fuzzy continuous boundaries, we followed a pixel-based approach, which required the definition of a sensibly sized grid. This is a very important task, but it is rarely documented in literature. The few studies that do explain their decision process adopt a trial-and-error approach, mostly considering: (1) the spatial resolution of the input datasets – here ranging from 10 to 100 m – and the morphology of the settlements (Taubenböck & Kraff, Citation2014; Wang et al., Citation2019); (2) the homogeneity of the sampled pixels (Owen & Wong, Citation2013). Thus, after inspecting the 2019 AGSN layer and correcting clear digitisation errors, e.g. settlement polygons extending into water bodies, we analysed the area of the 1,575 polygons through descriptive statistics, which indicate high variability in size (Table 2). Based on previous studies on mapping and characterisation of deprived areas, we created four regular grids of 10 × 10, 20 × 20, 50 × 50 and 100 × 100 m² cells and overlaid them with the AGSN layer to choose the one that best depicts the urban structure of the deprived areas in São Paulo (Duque et al., Citation2017; Kit et al., Citation2012; Mahabir et al., Citation2020).

Table 2. Area descriptive statistics of AGSN polygons.

Next, we excluded non-homogeneous pixels, i.e. those where deprived and non-deprived areas coexist, which are mainly found on the boundaries of the polygons. We used a data-driven approach to establish the sampling threshold, i.e. the minimum percentage of each pixel that must be intersected by the AGSN layer and, therefore, the degree of homogeneity of the deprived pixels. We assessed three thresholds based on quartiles, comparing the total grid area each one retains with the total area of the AGSN polygons.
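The threshold comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' GIS procedure: the function names, the 20 m cell size, the three quartile thresholds (0.25, 0.50, 0.75) and the overlap fractions are all assumptions made for the example.

```python
import numpy as np

CELL_AREA_M2 = 20 * 20  # assumed cell size, for illustration only


def sampled_area(overlap_fraction, threshold, cell_area=CELL_AREA_M2):
    """Total area of the cells whose overlap with the polygons meets the threshold."""
    keep = np.asarray(overlap_fraction) >= threshold
    return int(keep.sum()) * cell_area


def closest_threshold(overlap_fraction, polygon_area, thresholds=(0.25, 0.50, 0.75)):
    """Return the threshold whose sampled grid area is closest to the polygon area."""
    gaps = {t: abs(sampled_area(overlap_fraction, t) - polygon_area) for t in thresholds}
    return min(gaps, key=gaps.get)


# Toy data (made up): share of each cell covered by deprived-area polygons.
fractions = np.array([1.0, 1.0, 0.9, 0.6, 0.55, 0.4, 0.3, 0.1])
polygon_area = float(fractions.sum()) * CELL_AREA_M2  # area actually covered
best = closest_threshold(fractions, polygon_area)
```

In this toy case the lowest threshold keeps too many boundary cells and the highest discards too many, so the middle threshold gives the grid area closest to the polygon area.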

Feature extraction process

After the selection of the spatial unit of analysis, we derived the 32 GIS- and RS-based input features listed in Table 1. Figure 4 illustrates the workflow of seven main operations, some encompassing only standard processing steps (clip to extent, project, resample to the chosen resolution) and others followed by individualised extraction processes. All the operations shown in Figure 4 were performed in an open GIS platform with inbuilt functions, except for the fourth and fifth, which require specific processing steps to define the moving window size from which the features are calculated.

Figure 4. Processing steps of feature extraction structured in seven main operations.

Operation 1 extracted population count and night-time lights from WorldPop datasets. Operation 2 derived the Digital Elevation Model (DEM) and the respective slope values from the ASTER GDEM layer. Operation 3 included the processing of spectral features derived from the mean values of Sentinel 2A imagery and the calculation of the Normalised Difference Vegetation Index (NDVI).
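For reference, NDVI is the normalised difference between near-infrared and red reflectance, which for Sentinel-2 are bands 8 and 4. The sketch below shows the calculation on made-up values; the function name and the choice to return 0 where both bands are 0 are assumptions for the example, not details taken from the study.

```python
import numpy as np


def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red), with 0 where both bands are 0."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    out = np.zeros_like(total)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


# Toy reflectance values (made up): vegetation, bare soil, empty pixel.
nir_band = np.array([0.50, 0.30, 0.0])
red_band = np.array([0.10, 0.30, 0.0])
values = ndvi(nir_band, red_band)
```

Values close to 1 indicate dense, healthy vegetation, while values near 0 are typical of bare soil and built-up surfaces.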

Operations 4 and 5 followed Hall-Beyer (Citation2017) in using a Principal Component Analysis (PCA), computed with Python packages (Footnote 1), to support the choice of window size (also known as kernel size) by interpreting three main output metrics: the correlation matrix, the factor loadings and the communality table. In operation 4, we extracted the Gray-Level Co-occurrence Matrix (GLCM) from the Landsat 8 panchromatic band in RStudio to derive seven texture features – Mean, Variance, Homogeneity, Contrast, Dissimilarity, Entropy and Second Moment – for five different window sizes (3 × 3, 5 × 5, 7 × 7, 9 × 9, 11 × 11), which were further assessed with the PCA metrics.
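To make the seven texture features concrete, the sketch below builds a symmetric, normalised GLCM for a single window and one pixel offset, and derives the features from it using the standard textbook definitions. It is an illustration in plain NumPy, not the R workflow used in the study; the function names, the horizontal offset and the two toy windows are assumptions.

```python
import numpy as np


def glcm(window, levels, dx=1, dy=0):
    """Symmetric, normalised grey-level co-occurrence matrix for one offset."""
    window = np.asarray(window)
    counts = np.zeros((levels, levels), dtype=float)
    rows, cols = window.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = window[r, c], window[r2, c2]
                counts[i, j] += 1
                counts[j, i] += 1  # count each pair in both directions
    return counts / counts.sum()


def texture_features(p):
    """Seven GLCM texture measures computed from a normalised matrix p."""
    i, j = np.indices(p.shape)
    mean = float((i * p).sum())
    nonzero = p[p > 0]
    return {
        "mean": mean,
        "variance": float((p * (i - mean) ** 2).sum()),
        "homogeneity": float((p / (1.0 + (i - j) ** 2)).sum()),
        "contrast": float((p * (i - j) ** 2).sum()),
        "dissimilarity": float((p * np.abs(i - j)).sum()),
        "entropy": float(-(nonzero * np.log(nonzero)).sum()),
        "second_moment": float((p ** 2).sum()),
    }


# Toy 3 x 3 windows already quantised to 2 grey levels (made up).
flat = np.zeros((3, 3), dtype=int)                      # uniform patch
checker = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])   # alternating patch
flat_features = texture_features(glcm(flat, levels=2))
checker_features = texture_features(glcm(checker, levels=2))
```

The uniform patch has zero contrast and maximal homogeneity, whereas the alternating patch has higher contrast and entropy, which is the kind of difference these features are meant to pick up between regular and fragmented urban fabric.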

Operation 5 extracted the Landscape Metrics (LSM) from the World Settlement Footprint (WSF) layer. Relying on the study of Frazier & Kedron (Citation2017), we calculated five metrics – Patch Area Mean, Shape Index, Fractal Dimension Index, Patch Density and Aggregation Index – using Fragstats with four window sizes (3 × 3, 5 × 5, 7 × 7 and 9 × 9), which were also evaluated with a PCA. The flowchart in Figure 5 illustrates the PCA experiments above in more detail. First, we assessed the metrics using all window sizes as input. If we could not conclude which window size maintains the highest PC loading scores, we ran the model with the features of one window size at a time.
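The three PCA outputs used in this comparison (correlation matrix, factor loadings and communalities) can be obtained from an eigendecomposition of the correlation matrix. The following is a minimal NumPy sketch under that assumption; it is not the package code referenced in the footnote, and the function name and toy data are hypothetical.

```python
import numpy as np


def pca_loadings(features, n_components):
    """PCA on the correlation matrix; returns (correlation, loadings, communalities)."""
    corr = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest components first
    eigvals = np.clip(eigvals[order], 0.0, None)
    loadings = eigvecs[:, order] * np.sqrt(eigvals)
    communalities = (loadings ** 2).sum(axis=1)
    return corr, loadings, communalities


# Toy data (made up): two nearly redundant features plus one independent feature.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = np.column_stack([
    base,
    base + 0.05 * rng.normal(size=500),
    rng.normal(size=500),
])
corr, loadings, communalities = pca_loadings(toy, n_components=2)
```

A loading is the correlation between a feature and a component, and a feature's communality is the share of its variance reproduced by the retained components, so a low communality flags a feature that contributes little to the retained structure.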

Figure 5. Workflow of PCA iterations to choose the best moving window for hand-crafted features.

Operations 6 and 7 involved the extraction of the GIS-based features. After comparison, data from the Humanitarian OpenStreetMap (HOT) was used to calculate the Euclidean distance that determines accessibility to services, while data from OSM was used to compute accessibility to infrastructure. The SQL query expressions used to extract each attribute are described in Appendix 1. The WDPA layer was combined with OSM to measure the distance to green areas. Following the work of Mahabir et al. (Citation2020), Kernel Density Estimation (KDE) was used to calculate the density-related features with a 1000 m search radius, which keeps the spatial structure of the features explicit even after the smoothing effect (Leonita et al., Citation2018).
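Both kinds of GIS-based feature can be illustrated with a short sketch: the distance from each cell centroid to its nearest facility, and a kernel density with a 1000 m search radius. The quartic kernel is an assumption here (it is a common default in GIS software, but the text does not name the kernel), and the coordinates are made up. For a city-scale grid a spatial index would replace the full pairwise distance matrix used below.

```python
import numpy as np


def nearest_distance(cells, facilities):
    """Euclidean distance from each cell centroid to its nearest facility (metres)."""
    diff = cells[:, None, :] - facilities[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)


def quartic_density(cells, points, radius=1000.0):
    """Kernel density per cell using a quartic kernel with the given search radius."""
    diff = cells[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    u = np.clip(dist / radius, 0.0, 1.0)  # points beyond the radius get u = 1
    weights = (3.0 / (np.pi * radius ** 2)) * (1.0 - u ** 2) ** 2
    return weights.sum(axis=1)


# Toy projected coordinates in metres (made up).
cells = np.array([[0.0, 0.0], [3000.0, 4000.0]])
facilities = np.array([[0.0, 0.0], [600.0, 800.0]])
distances = nearest_distance(cells, facilities)
density = quartic_density(cells, facilities, radius=1000.0)
```

The second cell lies 4000 m from its nearest facility and receives zero density, because no facility falls within the search radius.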

Data Integration

After feature extraction, we integrated the features and prepared them as input for the unsupervised ML model. The chosen and sampled grid was converted to a point layer, and the value of each spatial feature at every pixel centroid was extracted and recorded. Given the size of the study area (149,947 cells and, hence, points) and the processing capacity of GIS platforms, this process was performed in seven batches to cover the entire extent.

Then, we merged the resulting attribute tables into a single georeferenced tabular file, exported it to RStudio (Footnote 2) and ran an Exploratory Data Analysis (EDA). Due to the differences in value ranges among features, we standardised the feature values (Z-Score) and analysed the descriptive statistics to detect outliers and avoid biases, check for normal distribution and handle null values of population count as zero (Cai et al., Citation2010). Finally, we also computed the Pearson correlation coefficient to investigate possible multicollinearity between features.
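The preparation steps above (null population counts set to zero, Z-score standardisation and a Pearson correlation check) can be sketched as follows. This is an illustrative NumPy version rather than the R code used in the study; the feature names and values are made up, and the 0.8 limit mirrors the |r| > 0.8 cut-off mentioned later in the exploratory data analysis.

```python
import numpy as np


def zscore(table):
    """Standardise each column to zero mean and unit (population) standard deviation."""
    table = np.asarray(table, dtype=float)
    return (table - table.mean(axis=0)) / table.std(axis=0)


def collinear_pairs(table, names, limit=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds the limit."""
    corr = np.corrcoef(table, rowvar=False)
    pairs = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            if abs(corr[a, b]) > limit:
                pairs.append((names[a], names[b], float(corr[a, b])))
    return pairs


# Toy table (made up); the missing population count is treated as zero.
names = ["population", "slope", "band_3", "band_4"]
raw = np.array([
    [np.nan, 2.0, 0.10, 0.11],
    [120.0, 8.0, 0.20, 0.21],
    [300.0, 3.0, 0.30, 0.32],
    [80.0, 15.0, 0.40, 0.41],
    [200.0, 1.0, 0.50, 0.52],
])
raw[:, 0] = np.nan_to_num(raw[:, 0], nan=0.0)
scaled = zscore(raw)
flagged = collinear_pairs(scaled, names)
```

In this toy table only the two spectral bands are flagged as collinear.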

Step 3 - unsupervised machine learning model

After input preparation, we chose the algorithm best suited to the unsupervised ML task. Following the guidance of Han et al. (Citation2012), we selected the K-Means algorithm because (1) it handles continuous numerical data such as the spatial features used here; and (2) it performs well on high-volume datasets, with short processing times on regular machines (Footnote 3). Besides, when compared to other suitable algorithms, such as HDBSCAN, K-Means has simpler parameter requirements, higher stability and compatibility with open programming languages (Campello et al., Citation2013; Madhulatha, Citation2012; McInnes et al., Citation2016). Considering the usability of the model in LMICs, simplicity and availability are fundamental technical criteria for the choice of the modelling approach in this study.

K-Means is one of the most popular unsupervised algorithms, falling into the category of partitioning, distance-based methods. The model initialises by partitioning the data points around “k” (the number of clusters) initial centroids. It then assigns each point to the closest centroid based on the squared distance, recomputes each centroid as the mean of its assigned points, and repeats both steps until the cluster assignments no longer change. The within-cluster sum of squares (WCSS), aka inertia or distortion score, sums the squared distances from each data point within a cluster to the cluster centroid (Chiang & Mirkin, Citation2010). Once the WCSS no longer decreases between iterations, the clusters have reached their highest internal coherence for that initialisation, and the model stops.
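The assignment and update steps described above are sketched below in plain NumPy. This is a bare-bones illustration of the standard algorithm (often called Lloyd's algorithm) with random initial centroids, not the library implementation used in the study; the function name, its parameters and the toy points are assumptions.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, wcss)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    wcss = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss


# Toy data (made up): two well-separated groups of three points each.
toy_points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids, wcss = kmeans(toy_points, k=2)
```

Because the result depends on the initial centroids, production implementations typically run several initialisations and keep the one with the lowest WCSS.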

We ran the algorithm using Python (Footnote 4) and used the elbow method to choose the optimal k value. A plot of WCSS as a function of the number of clusters is used to find the value of k at which the WCSS stops decreasing sharply and begins to level off (Pedregosa et al., Citation2011). This breakpoint (“elbow”) suggests the optimal number of clusters. If more than one k value is indicated, it is suggested to train the models and assess the results (Han et al., Citation2012). We trained successive models, increasing the number of clusters from 3 to 10, and computed and plotted the WCSS of each. The assessment considered the flattening of the elbow curve in these plots and the computational costs (the more clusters, the longer the machine takes to derive them).
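Reading the elbow from a plot is a visual judgement, but it can be approximated numerically. The sketch below uses one simple geometric heuristic, assumed here for illustration and not taken from the study: after rescaling both axes, it picks the k whose point lies furthest from the straight line joining the first and last points of the curve. The WCSS values are made up.

```python
import numpy as np


def elbow_k(k_values, wcss_values):
    """Pick k at the point of the WCSS curve furthest from the straight line
    joining its first and last points (a simple geometric elbow heuristic)."""
    k = np.asarray(k_values, dtype=float)
    w = np.asarray(wcss_values, dtype=float)
    # Rescale both axes to [0, 1] so the distance is not dominated by WCSS units.
    kn = (k - k.min()) / (k.max() - k.min())
    wn = (w - w.min()) / (w.max() - w.min())
    # Perpendicular distance of each point from the chord between the end points.
    dx, dy = kn[-1] - kn[0], wn[-1] - wn[0]
    dist = np.abs(dy * (kn - kn[0]) - dx * (wn - wn[0])) / np.hypot(dx, dy)
    return int(k_values[int(dist.argmax())])


# Toy WCSS curve (made up): steep drop up to k = 4, then a slow decline.
ks = list(range(3, 11))
wcss = [1000.0, 600.0, 560.0, 530.0, 505.0, 485.0, 470.0, 460.0]
best_k = elbow_k(ks, wcss)
```

Such a heuristic can return a different k from the one a reader would pick by eye when the curve is smooth, which is one reason to inspect the candidate models as described above.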

Optimisation techniques

In addition to the EDA applied to the model input, literature highlights the usefulness of dimensionality reduction techniques for model optimisation (Martino et al., Citation2017). Clustering models like K-Means can be sensitive to the input and prone to overfitting and biassed results (Roy et al., Citation2020). There are several supervised feature selection algorithms, but few unsupervised algorithms are documented (Alelyani et al., Citationn.d.; Li et al., Citation2017). We included a feature importance tool in R, “FeatureImpCluster” (Pfaffel, Citation2021), which is documented, openly available in the Comprehensive R Archive Network (CRAN) and has not yet been applied to deprivation studies.

The tool calculates a permutation misclassification rate, i.e. the number of wrong cluster assignments divided by the number of iterations. When the algorithm permutes the values of a feature and the model error increases, the resulting permutation score is higher, indicating a higher relevance of the feature (Thrun et al., Citation2021). The tool works coupled with the k-means algorithm and computes the feature scores with simple statistical analysis, so it is less computationally intensive than most feature importance techniques. The main by-products are an overall mean misclassification rate per feature and the rate aggregated by cluster type.
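The idea behind a permutation score for clustering can be sketched as follows: shuffle one feature, reassign every point to its nearest centroid, and record the share of points whose cluster changes. This is a simplified NumPy illustration of the principle, not the FeatureImpCluster implementation; the function names, the normalisation by number of points and the toy data are assumptions.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def permutation_importance(points, centroids, n_iter=10, seed=0):
    """Mean share of points that change cluster when one feature is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = assign(points, centroids)
    rates = np.zeros(points.shape[1])
    for feature in range(points.shape[1]):
        for _ in range(n_iter):
            shuffled = points.copy()
            shuffled[:, feature] = rng.permutation(shuffled[:, feature])
            rates[feature] += np.mean(assign(shuffled, centroids) != baseline)
    return rates / n_iter


# Toy data (made up): feature 0 separates the two clusters, feature 1 does not.
toy_points = np.array([
    [0.0, 4.9], [0.0, 5.1], [0.0, 5.0], [0.0, 5.2],
    [10.0, 4.8], [10.0, 5.0], [10.0, 5.1], [10.0, 4.9],
])
toy_centroids = np.array([[0.0, 5.0], [10.0, 5.0]])
rates = permutation_importance(toy_points, toy_centroids)
```

Shuffling the second feature never changes an assignment here, because both centroids share the same value on it, so its score is zero while the separating feature scores higher.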

Following the instructions of Fränti and Sieranoja (Citation2019), we conducted seven experiments, including and removing input features, to check the robustness of the model and how much information is changed or maintained accordingly. The findings of the PCA, EDA and feature importance were used to justify the selection of specific features.

Step 4 – model evaluation

To validate the refined model results, we used three qualitative assessments. First, we visually inspected the results using Google Earth and Street View images for internal validation, followed by a statistical assessment with auxiliary reference datasets (AGSN polygons, output of a model trained only with census-derived features and the municipal land use map). Next, we conducted an external validation with a local specialist through a semi-structured interview, where we had a guided discussion on the modelled spatial patterns, the major influencing factors and the applicability of the approach.

Results

In this section, we present the results of the proposed methodology of capturing intra-urban deprivation at city scale. We illustrate the selected spatial unit of analysis and moving window size, as well as the EDA findings. Then, we provide the outputs of the refined k-means model and the assessment of such findings.

Selection of spatial unit of analysis

The inspection of the AGSN polygons (Figure 6) indicates that the 10 m and 100 m resolutions only depict the extreme settlement sizes well, i.e. the minority of settlements. The medium settlement size is more appropriately depicted by the 20 and 50 m grid sizes; however, 20 m performs better for elongated settlements. For the choice between the 10 and 20 m grid sizes, we considered the reduction of computational time and the avoidance of unnecessary identifiable details of vulnerable communities. These considerations led to the conclusion that a small grid size (10 m) does not necessarily benefit the model, especially given the original resolution of the input datasets (Gram-Hansen et al., Citation2019). Thus, 20 × 20 m was chosen as the unit of analysis for São Paulo.

Figure 6. Exemplary AGSN polygons of deprived settlements overlaid with varying grid sizes: from left to right 100 m, 50 m, 20 m, 10 m.

Next, we compared the different sampling thresholds to maximise the pixels’ homogeneity. We chose to sample the cells that have at least 50% of their area intersected by the AGSN layer. Their total area is 59,978,800 m², which is the closest of the tested thresholds to the total AGSN polygon area of 60,244,428 m². Figure 7 shows that the sampled grid of 20 × 20 m can depict even irregularly shaped deprived settlements, while avoiding most non-deprived information that could lead to misinterpretation of the spatial features.
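The reported areas can be cross-checked with simple arithmetic: dividing the sampled area by the area of one 20 × 20 m cell recovers the 149,947 cells mentioned in the data integration step. The variable names below are illustrative; the three figures are taken from the text.

```python
CELL_SIDE_M = 20
SAMPLED_AREA_M2 = 59_978_800   # total area of the sampled grid cells
AGSN_AREA_M2 = 60_244_428      # total area of the AGSN polygons

cell_area = CELL_SIDE_M ** 2                      # 400 m2 per cell
n_cells = SAMPLED_AREA_M2 // cell_area            # number of sampled cells
coverage = SAMPLED_AREA_M2 / AGSN_AREA_M2         # share of the polygon area
shortfall_m2 = AGSN_AREA_M2 - SAMPLED_AREA_M2     # polygon area not covered
```

The sampled grid therefore covers about 99.6% of the AGSN polygon area, leaving a difference of 265,628 m².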

Figure 7. Comparison between the grid base layer before and after the sampling process (threshold of 50%).

Selection of moving window sizes: PCA approach

As described in the Materials and Methods section, the feature extraction process created 35 GLCM texture features and 20 LSM features. Several PCAs were run to select the moving window sizes; the results are documented in detail in the open repository (Footnote 5) (Trento Oliveira, Citation2021). For the GLCM features, the PCA metrics indicated that the 5 × 5 window size shows less correlation among features and higher loading scores than the other kernels. The metrics also show that the second moment and contrast features have the lowest communality values and account for a small proportion of the model variation.

For the LSM features, the 5 × 5 moving window was chosen because it shows fewer collinearity patterns than the other window sizes. We also decided to remove the Aggregation Index (AI) and Fractal Index (FRAC) features from the model due to their very high correlation with each other and with the other features.

Exploratory data analysis

Exploratory data analysis consisted of generating boxplots and excluding outliers based on them, and plotting histograms and a correlation matrix (Footnote 6). The results provide insights into input patterns, for example that most data points are close to service and infrastructure facilities, but a few outliers are very far from such provision (Figure 8). They also show the similar distribution and high collinearity of the spectral bands, which justify the model experiment without them. Considering the possible model improvements from dimensionality reduction, one experiment was also run excluding features with high collinearity (|r|>0.8).

Figure 8. EDA histograms created in R displaying input data patterns through descriptive statistics.

K-Means implementation

For model optimisation, we performed seven experiments, checking the model consistency and the refinement of the results. As shown in Figure 9, the tests include the following inputs: (1) only GIS-based features; (2) only RS-based features; (3) both GIS- and RS-based features; (4) without Sentinel 2A spectral bands; (5) without Sentinel 2A but including Band 5; (6) only the selected features from Test 4; (7) including census-derived features.

Figure 9. Flowchart of model optimisation experiments.

The first two models tested the sensitivity of the model to the input, so they were initialised 20 times (Karmitsa et al., Citation2018). Each initialisation produced a very similar result in terms of the number of clusters and the spatial cluster patterns, which indicates good model robustness, especially when combined with the feature importance scores. When GIS- and RS-based features were included in Model 3, the significance of the features remained similar to the results of Models 1 and 2. However, feature importance tools often indicate input features with high collinearity as highly significant to the model (Nugrahita & Surjandari, Citation2020), which can generate biassed results. For this reason, we trained Model 4 without all spectral bands and Model 5 with only band 5, as band 5 provides an estimate of vegetation chlorophyll content and vegetation is present within deprived settlements in São Paulo. Illustration A in Figure 10 shows that both models have less pixelation effect, which likely derived from the spectral information, but Model 4 can also detect non-residential land use types within the settlements, which is very significant information.

Figure 10. Comparison of model outputs.

Model 6 tests the value of data redundancy for the K-Means algorithm by removing the highly correlated features and those with the lowest misclassification rates. The result shows that the algorithm requires detailed information to detect finer differences between cluster types and, hence, to highlight morphological heterogeneities at the intra-urban level. Illustration B in Figure 10 shows how Model 4 captures the non-residential land uses that are not detected by Model 6, mainly due to the removal of the contrast and homogeneity texture features that identify fragmentation of the urban fabric.

Finally, we ran a model combining the input features of Model 4 with 12 census-derived features (see Appendix 2) to assess the impact of adding household-level information that is not depicted by the EO-derived features. Despite the temporal mismatch between the AGSN boundaries (2019) and the census data (2010), this model adds important dimensions of deprivation, such as average income and homeownership. However, Model 7 cannot capture specific land uses, and it detects the discrete census tract boundaries, creating spatial fallacies in the gridded approach (Illustration C in Figure 10).

It is important to state that in all experiments, the elbow method was calculated using two Python packages for consistency. The resulting plots for Model 4, chosen as the optimal one for further assessment, indicate three eligible k values (Appendix 3). We also mapped each of them and inspected the maps visually for a qualitative interpretation.
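For readers unfamiliar with the elbow method, the sketch below computes the within-cluster sum of squares for several k values on toy data and picks the k where the curve bends most. It is a self-contained illustration, not the packages used in this study: the deterministic farthest-first initialisation, the second-difference rule for locating the bend and the 15 toy points are all assumptions made to keep the example reproducible.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from chosen centres."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    return centres

def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centres = farthest_first(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            groups[nearest].append(p)
        new_centres = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c   # keep centre of an empty cluster
            for g, c in zip(groups, centres)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return sum(min(sq_dist(p, c) for c in centres) for p in points)

# Toy data: three compact groups of five points each
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
blobs = [(0, 0), (10, 0), (0, 10)]
points = [(bx + dx, by + dy) for bx, by in blobs for dx, dy in offsets]

inertia = {k: kmeans_inertia(points, k) for k in range(1, 7)}
# Second difference of the curve: largest where the decrease in inertia flattens out
bend = {k: (inertia[k - 1] - inertia[k]) - (inertia[k] - inertia[k + 1]) for k in range(2, 6)}
elbow_k = max(bend, key=bend.get)
print(inertia)
print(elbow_k)
```

On real data the curve is rarely this sharp, and several k values can be defensible. That is consistent with the three eligible values reported for Model 4 and the additional visual inspection.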

K-Means results

São Paulo is differentiated into four clusters, and the model provides two main outputs: (1) a map with the generated cluster types and violin plots with feature values per cluster type (Appendix 4), which we condense into radar graphs using the mean feature values for easier visual comparison (Figure 11); (2) the misclassification scores, providing information on the relevance of each feature per cluster type ().

Figure 11. Radar graphs and clustering map of emerged cluster types.

Cluster 0 is allocated to fewer cells, predominantly located in the outskirts of the city, with low accessibility to financial, educational and health services. It occupies elevated areas close to dense and healthy vegetation, as indicated by the higher DEM and NDVI values. The lowest night-time light (VIIRS) values indicate both lower population density and lower economic status. We visually inspected the cells to check whether the lowest values of mean and variance (texture features) reveal the presence of bare soil or slum-like areas, and found several construction sites, hence bare soil. Due to their peripheral, unpopulated location, the cells of Cluster 0 present very low geometry and density values, which indicate the absence of buildings captured by the WSF layer. These features could be depicting small, scattered settlements in the periphery. Thus, we label this cluster “Infant Settlements in Open Spaces”.

Cluster 1, located near non-built-up and rural areas, reflects higher deprivation levels than Cluster 0 because of its higher population counts, even greater distance from main roadways, highest density of poor-quality roads and highest slope values. The high levels of both homogeneity and entropy appear contradictory, as they indicate spatial uniformity and disorder at the same time; this relates to the organic landscape layout. The geometry features show fragmentation of the urban fabric and irregular, sprawled structures, typical of settlements in an early development stage. For this reason, we named the cluster “Poorly Consolidated Settlements”.

Cluster 2 can be found in various locations around the city, often near roadways and waterways and clearly closer to services and infrastructure, which assures basic provision. The proximity to waterways mostly refers to channelised rivers, since we only used major watercourses in the study; even so, it might indicate higher flood risk. The lower vegetation values and high night-time light values can indicate industrial areas included in the sampling. Texture-wise, the high mean and variance features, combined with the lowest homogeneity values and the skewed distribution of patch mean area, suggest more pixel edges and sharpening, indicating a formal layout such as non-residential land uses. Therefore, we named Cluster 2 “Less deprived settlements connected to non-residential areas”.

Finally, Cluster 3 corresponds to almost half of the total cells and is mostly centrally located. It presents the highest population and dead-end street density values. The highest homogeneity and lowest entropy, contrast and dissimilarity values indicate pixel uniformity. Combined with the high mean texture values, the cells depict denser and more compact urban areas. The skewed distribution of the shape index and the large urban patches indicate more complex settlements. Based on this, we named the cluster “Densely urbanised and mature settlements”.

The output of the feature importance tool is also very relevant for the analysis of the cluster types (Appendix 5). Several RS-based features have prevailing importance, which is expected, as they make up two thirds of the features included. In light of the above interpretation, the highest importance of the geometry and vegetation features for Cluster 0 is quite meaningful. For Cluster 1, service provision scores highest, together with high DEM and night-time light, suggesting peripheral areas with lower income in elevated terrain. The inverse influence of the texture features on Clusters 2 and 3 reflects the built-up structure of each cluster type: for Cluster 2, it refers to the presence of non-residential areas, and for Cluster 3, it indicates a slum-like morphology.
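The idea behind a misclassification-based importance score can be sketched as follows, assuming importance is measured as the share of cells that change cluster when one feature column is randomly shuffled. This loosely follows the permutation approach of the cited tool (Pfaffel, 2021) as we understand it; it is not that package's interface, and the function names, toy cells and centres are hypothetical.

```python
import random

def nearest(point, centres):
    """Index of the centre closest to the point (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centres[i])))

def misclassification_by_cluster(points, centres, feature, repeats=20, seed=0):
    """Share of each cluster's cells that switch cluster when one feature column is shuffled."""
    rng = random.Random(seed)
    base = [nearest(p, centres) for p in points]
    sizes = [base.count(c) for c in range(len(centres))]
    changed = [0] * len(centres)
    for _ in range(repeats):
        column = [p[feature] for p in points]
        rng.shuffle(column)
        for p, value, label in zip(points, column, base):
            shuffled = p[:feature] + (value,) + p[feature + 1:]
            if nearest(shuffled, centres) != label:
                changed[label] += 1
    return [changed[c] / (repeats * sizes[c]) if sizes[c] else 0.0
            for c in range(len(centres))]

# Hypothetical cells with two features: the first separates the clusters, the second is constant
points = [(0.0, 1.0), (0.2, 1.0), (0.1, 1.0), (5.0, 1.0), (5.2, 1.0), (5.1, 1.0)]
centres = [(0.1, 1.0), (5.1, 1.0)]
informative = misclassification_by_cluster(points, centres, feature=0)
constant = misclassification_by_cluster(points, centres, feature=1)
print(informative, constant)
```

A feature whose shuffling leaves the assignment unchanged scores zero, while a feature that drives the separation scores high. Reporting the score per cluster, as here, mirrors the per-cluster reading of importance used in the interpretation above.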

K-Means validation

Visual assessment

The visual assessment of the clusters shows clear morphological differences. Figure 12 shows two deprived areas, each containing two of the cluster types. The first area is located in the far north, with much poorer accessibility and service provision. View A presents a low-density settlement within a highly vegetated area (Cluster 0) and View B shows a hilly terrain with precarious building structures (Cluster 1). The second area shows the Paraisópolis district, the largest and most populated deprived settlement in the city, which is close to the centre and thus provided with basic infrastructure. View C exhibits larger non-residential building sizes and even proper sidewalks for Cluster 2, while View D illustrates Cluster 3 with a dense residential fabric of multi-storey buildings.

Figure 12. Visual assessment of clusters using satellite and street-view images. Source: (Google Maps, retrieval date: 24th May 2021).

Statistical assessment

We compared the distribution of cluster polygon sizes with the areal statistics of the AGSN polygons (Table 3). We assigned a cluster type to a settlement when at least 50% of the settlement area belongs to that cluster. Table 3 shows that the two peripheral clusters (0 and 1) differ considerably. Cluster 0 is mostly allocated to large settlements, which encompass commission errors, while Cluster 1 is often assigned to small ones. This suggests another level of area deprivation, as the settlements are not only located in peri-urban areas but are also scattered.
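The 50% assignment rule can be expressed in a few lines. The sketch below assumes that all grid cells have the same area, so that the share of cells equals the share of settlement area; the settlement names and cell labels are hypothetical.

```python
from collections import Counter

def assign_cluster(cell_labels, min_share=0.5):
    """Return the cluster covering at least `min_share` of a settlement's cells, else None."""
    if not cell_labels:
        return None
    # On an exact 50/50 split, the cluster encountered first in the list wins
    label, count = Counter(cell_labels).most_common(1)[0]
    return label if count / len(cell_labels) >= min_share else None

# Hypothetical settlements, each listed with the cluster label of its cells
settlements = {
    "settlement_a": [1, 1, 1, 0],
    "settlement_b": [2, 3, 3, 3, 3, 2],
    "settlement_c": [0, 1, 2, 3],
}
assigned = {name: assign_cluster(labels) for name, labels in settlements.items()}
print(assigned)
```

Settlements without a dominant cluster, such as `settlement_c` here, remain unassigned under this rule and would need to be counted separately.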

Table 3. Number of settlements per cluster type according to the polygons areal statistics – Quartile 1 (Q1): 0.7 ha. Median (M): 1.9 ha. Quartile 3 (Q3): 4.2 ha.

Then, we trained a model with 12 census-derived variables (Appendix 2) to be used as reference data for estimating the socioeconomic conditions of the settlements. Two clusters were derived from the elbow method, labelled “less deprived” and “more deprived” (workflow details in Appendix 6). Figure 13 illustrates the differences between the two clusters, where higher built-up density, average income and provision of sewage infrastructure, together with lower average household population, refer to less deprived areas. Clusters 2 and 3 mainly overlap with less deprived cells, while half of the Cluster 1 cells overlap the more deprived cluster. This indicates another level of precariousness for Cluster 1, already depicted as the most morphologically deprived among them.

Figure 13. Radar graph with mean features values of census-derived model and bar graph showing the distribution of each cluster type per census-based category.

Lastly, we overlaid the results with the land use layer. We aggregated the initial 30 land use classes into six: mixed use, conservation areas, high-income residential, low-income residential, commerce/industry and transport corridors. Figure 14 shows the distribution of each cluster type per aggregated land use class, and it validates the results of our model. Cluster 0 mostly occurs in conservation areas but also in industrial areas (with bare soil). In relative values per cluster, Cluster 1 is mostly present in low-income residential and conservation areas. Considering the hilly location, this might indicate more social vulnerability. Cluster 2 has a diverse distribution across land uses, which is consistent with its various locations, but the graph also indicates its predominance in non-residential classes.
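The aggregation and the relative values per cluster can be sketched as a small cross-tabulation. The detailed class names in `aggregate` and the cell records are hypothetical placeholders, since the 30 original classes are not listed here; only the aggregated class names come from the text.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from detailed land use classes to the aggregated ones
aggregate = {
    "park": "conservation areas",
    "factory": "commerce/industry",
    "shop": "commerce/industry",
    "favela": "low-income residential",
    "mixed": "mixed use",
}

def share_per_cluster(cluster_labels, land_use_labels):
    """Relative distribution of aggregated land use classes within each cluster."""
    counts = defaultdict(Counter)
    for cluster, land_use in zip(cluster_labels, land_use_labels):
        counts[cluster][aggregate[land_use]] += 1
    return {
        cluster: {lu: n / sum(row.values()) for lu, n in row.items()}
        for cluster, row in counts.items()
    }

# Hypothetical cells: cluster label and detailed land use class of each cell
clusters = [0, 0, 0, 1, 1, 2, 2, 2, 2]
land_use = ["park", "park", "factory", "favela", "park", "shop", "factory", "shop", "mixed"]
shares = share_per_cluster(clusters, land_use)
print(shares)
```

Normalising within each cluster rather than over all cells keeps small clusters such as Cluster 0 comparable with large ones such as Cluster 3.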

Figure 14. Cells count per cluster type for each aggregated land-use class.

Expert evaluation

The results of the model were presented to a local expert on urban deprivation, who confirmed the diversity of the deprived settlements in São Paulo and how well the resulting cluster types characterise these areas. She emphasised the heterogeneity between central and peripheral areas and the existence of early-development-stage settlements that face severe deprivation, depicted by Cluster 1. The fact that the model can capture these recently occupied regions, which lack service and infrastructure provision, was found to be very important for local policies.

Another useful aspect of the clustering results is the differentiation of built-up and non-built-up areas, especially considering the errors of commission in the AGSN layer. She highlighted that the proximity of Cluster 2 to waterways and highways also reflects a contamination factor, e.g. open sewers and industrial pollution. The expert also stressed the existence of non-residential land uses within the AGSN boundaries and praised the ability of the model to identify these areas.

Discussion

Without discrediting the necessity of traditional binary approaches of deprived vs non-deprived areas as a starting point, we are proposing a step further by exploring the potential of machine learning models to capture their intra-urban diversity. We designed a methodology that uses a gridded unsupervised approach and solely open GIS- and RS-based data and software packages, dealing with scalability and transferability issues of state-of-the-art models.

By employing an unsupervised ML method, one does not require training and testing datasets, which are computationally demanding and often unaffordable for LMICs; this ensures scalability potential. Using features derived from freely available datasets guarantees transferability potential and counters current data poverty issues. This combination allows applicability to other contexts and provides a step forward in understanding deprivation at different scales. Yet, the model still requires tailoring on two main fronts: the selection and extraction of features, and the spatial unit of analysis. Particularly regarding the use of open geodata, the cleaning and preparation process is quite demanding, since we inspected each dataset for quality issues in order to reduce modelling uncertainties.

This research sheds light on the local manifestation of deprivation by developing a list of potential spatial features, using open data sources in an effort to bridge the gaps of census data and VHR imagery in terms of spatial coverage, resolution and completeness. The eight dimensions conceptualised here to characterise the living conditions of neighbourhoods of São Paulo can be generalised to other study areas but require tailoring, not only regarding the inclusion or removal of a certain feature, but also their combined interpretability for planning decisions. For example, Cluster 3 shows a high density of dead-end streets and high built-up density, which can increase costs for infrastructure implementation, while for Cluster 1 the intervention costs might be more affected by the local topography and sprawled distribution. However, it is important to state that the feature extraction processes proposed here are time-consuming and not fully automated, which can hinder the practicality of the workflow.

The sensible selection of the unit of analysis is a major step in the methodology. In addition to avoiding ecological fallacies, it increases comparability with other reference datasets for decision-making, such as global gridded outputs or hazard maps for disaster risk plans. Despite the increasing number of recent slum studies employing gridded approaches, they rarely document how the unit of analysis was chosen. Moreover, most of the work uses VHR imagery, whereas we present a promising result with 20 m resolution, sufficiently detailed for a city-scale analysis. Nevertheless, the increasing availability of high-quality sources would only benefit deprivation research.

It is important to stress that feature interpretations can change depending on the adopted unit of analysis. Texture- and geometry-related features in particular can provide very different depictions at a different spatial resolution. For example, at 1 m resolution, high entropy values indicate complex and dense morphology, while at 20 m resolution, where one pixel encompasses two housing units on average, high entropy suggests higher variability between the pixel and its surroundings, which is associated with more formal areas.
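To make the entropy feature concrete, the sketch below computes the entropy of a grey-level co-occurrence matrix for two toy patches. It is a simplified illustration, not the texture extraction used in this study: it assumes integer grey levels, a single horizontal offset of one pixel, a symmetric matrix and the natural logarithm.

```python
from collections import Counter
from math import log

def glcm_entropy(image):
    """Entropy of a symmetric co-occurrence matrix built from horizontally adjacent pixels."""
    pairs = Counter()
    for row in image:
        for left, right in zip(row, row[1:]):
            pairs[(left, right)] += 1
            pairs[(right, left)] += 1      # count both directions so the matrix is symmetric
    total = sum(pairs.values())
    return sum(-(n / total) * log(n / total) for n in pairs.values())

# Toy patches: one uniform surface and one where neighbouring pixels always differ
uniform = [[3, 3, 3, 3]] * 4
checker = [[0, 1, 0, 1], [1, 0, 1, 0]] * 2
print(glcm_entropy(uniform))
print(glcm_entropy(checker))
```

The uniform patch yields zero entropy because only one grey-level pair occurs, while the alternating patch spreads its pairs over two matrix entries. Which real-world surfaces produce which pattern depends on how many objects fall inside one pixel, which is why the same entropy value reads differently at 1 m and at 20 m.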

Additionally, with respect to reproducibility, this research employs an efficient and simple algorithm whose results depend highly on the number of clusters and the input dimensionality. The outcomes presented here indicate that removing some features due to high collinearity or low model significance can hamper cluster separability. The usefulness of the feature importance tool is undeniable, especially for interpreting model robustness and performance, but for dimensionality reduction the effects could be more considerable at larger scales and hence with larger datasets.

Conclusion

We have demonstrated that an unsupervised gridded ML approach, not yet exploited in the literature for slum characterisation studies, can capture the morphological dimensions of deprived areas and identify intra-city differences. The multiple experiments and the validation procedure show good model performance and a significant contribution of the input features to each resulting cluster type. The main advantage of the methodology is its transferability and scalability, which pushes the model application towards larger scales and is therefore critical to global development goals.

We have shown that physical characteristics of such settlements can be derived from open GIS- and RS-based data and how well they can generate relevant information on the nuances of deprivation within the city. Despite our results, future work should not drop spectral features automatically, because the model is context-based and a good cluster separability can come from land cover details, e.g. in areas where detection of bare soil is important. Local specialists are fundamental for the decision on the inclusion or exclusion of certain features. They can provide insights during both the development of indicators and interpretation of clustering results.

In this sense, recently available land cover products could also be beneficial to the model results. Considering the environment of certain deprived areas, high-resolution land cover maps can improve the model’s ability to differentiate between bare soil and other land cover classes (Dos Santos et al., 2022), between bare soil and roofing materials (Kuffer et al., 2016; Owen & Wong, 2013) and between levels of road infrastructure quality (Nobrega et al., 2008). We acknowledge the importance of a basic understanding of the study area, while the proposed feature extraction workflow remains the same.

In terms of applicability, this study can inform decision-makers for locally tailored policy-making on issues such as: the location of pocket slums in conservation areas (Cluster 0); the urgency of service provision for settlements in steep locations (Cluster 1); the presence of non-residential land uses in bordering zones of the AGSN layer (Cluster 2); and the predominance of complex and dense urban fabric (Cluster 3).

Nevertheless, we see room for further investigation and improvement, which should primarily focus on upscaling the model to the regional level and on the full automation of the workflow. Future studies should automate the model in an API environment that can process all datasets without relying on the computational capacity of the user. This would support the applicability of this pilot research to the municipalities. To ensure full applicability, GIS experts with knowledge of the local urban systems are also required for tailoring the model features and unit of analysis.

Acknowledgments

The authors acknowledge the support of Alexandra Pedro on the validation procedure, Dr. Raian Maretto and Dr. Flávia Feitosa for the technical consultancy on machine learning models, and Dr. Caroline Gevaert for special assistance on uncertainties sources.

Disclosure statement

No potential conflict of interest was reported by the authors.

Data availability statement

The entire workflow, datasets, model input, code scripts and model output are archived and available on GitHub (https://github.com/ltrentooliveira/MSc_Archive) to ensure maximal replicability.

Additional information

Funding

This research received no external funding.

References

  • Abascal, Á., Rothwell, N., Shonowo, A., Thomson, D., Elias, P., Elsey, H., Yeboah, G., & Kuffer, M. (2021, March). “Domains of deprivation framework” for mapping slums, informal settlements, and other deprived areas in LMICs to improve Urban planning and policy: A scoping review. Preprints, 1–23. https://doi.org/10.20944/preprints202102.0242.v1
  • Ajami, A., Kuffer, M., Persello, C., & Pfeffer, K. (2019). Identifying a slums’ degree of deprivation from VHR images using convolutional neural networks. Remote Sensing, 11(11), 1282. https://doi.org/10.3390/rs11111282
  • Alelyani, S., Tang, J., & Liu, H. (2014). Feature selection for clustering: A review. In C. C. Aggarwal & C. K. Reddy (Eds.), Data Clustering: Algorithms and Applications (1st ed.). Chapman and Hall/CRC. ISBN 9781315373515.
  • Bastos da Cunha, M., Firpo de Souza Porto, M., Pivetta, F., Zancan, L., Santos Francisco, M., Brum Pinheiro, A., Melo Souza, F., & Calazans, R. (2015). O desastre no cotidiano da favela: reflexões a partir de três casos no Rio de Janeiro [An everyday disaster in favela: reflections based on three cases in Rio de Janeiro]. O Social Em Questão, 33, 95–122.
  • Bhuyan, K., Westen, C. V., Wang, J., & Raj, S. (2022). Mapping and characterising buildings for flood exposure analysis using open-source data and artificial intelligence. Natural Hazards. Springer Netherlands. https://doi.org/10.1007/s11069-022-05612-4
  • Brito, P. L., Kuffer, M., Koeva, M., Pedrassoli, J. C., Wang, J., Costa, F., & Freitas, A. D. D. (2020). The spatial dimension of COVID-19: The potential of earth observation data in support of slum communities with evidence from Brazil. ISPRS International Journal of Geo-Information, 9(9), 557. https://doi.org/10.3390/ijgi9090557
  • Cai, D., Zhang, C., & He, X. (2010). Unsupervised feature selection for multi-cluster data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 333–342. https://doi.org/10.1145/1835804.1835848
  • Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In J. Pei, V.S. Tseng, L. Cao, H. Motoda, & G. Xu (Eds.), Advances in knowledge discovery and data mining (Vol. 7819, pp. 160–172). Springer. https://doi.org/10.1007/978-3-642-37456-2_14
  • Chiang, M. M. T., & Mirkin, B. (2010). Intelligent choice of the number of clusters in k-means clustering: An experimental study with different cluster spreads. Journal of Classification, 27(1), 3–40. https://doi.org/10.1007/s00357-010-9049-5
  • da Fonseca Feitosa, F., Vieira Vasconcelos, V., Moutinho Duque de Pinho, C., Frizzi Galdino da Silva, G., da Silva Gonçalves, G., Correa Danna, L. C., & Seixas Lisboa, F. (2021). IMMerSe: An integrated methodology for mapping and classifying precarious settlements. Applied Geography, 133, 102494. https://doi.org/10.1016/J.APGEOG.2021.102494
  • Dos Santos, B. D., de Pinho, C. M. D., Oliveira, G. E. T., Korting, T. S., Escada, M. I. S., & Amaral, S. (2022). Identifying precarious settlements and Urban fabric typologies based on GEOBIA and data mining in Brazilian amazon cities. Remote Sensing, 14(3), 704. https://doi.org/10.3390/RS14030704
  • Duque, J. C., Patino, J. E., & Betancourt, A. (2017). Exploring the potential of machine learning for automatic slum identification from VHR imagery. Remote Sensing, 9(9), 1–23. https://doi.org/10.3390/rs9090895
  • Ebert, A., Kerle, N., & Stein, A. (2009). Urban social vulnerability assessment with physical proxies and spatial metrics derived from air- and spaceborne imagery and GIS data. Natural Hazards, 48(2), 275–294. https://doi.org/10.1007/s11069-008-9264-0
  • Eckle, M., De Albuquerque, J. P., Herfort, B., Zipf, A., Leiner, R., Wolff, R., & Jacobs, C. (2016). Leveraging OpenStreetMap to support flood risk management in municipalities: A prototype decision support system. Proceedings of the International ISCRAM Conference Krystiansand, Norway, May.
  • Ferreira, N. D. J., & Feitosa, F. D. F. (2020). Cartografias das Favelas : Uma Análise Comparativa. In Pandemia e Cotidiano, 5, I Seminário Nacional – Urbanismo, Espaço e Tempo. Cidade.
  • Fränti, P., & Sieranoja, S. (2019). How much can k-means be improved by using better initialization and repeats? Pattern Recognition, 93, 95–112. https://doi.org/10.1016/j.patcog.2019.04.014
  • Frazier, A. E., & Kedron, P. (2017). Landscape metrics: Past progress and future directions. Current Landscape Ecology Reports, 2(3), 63–72. https://doi.org/10.1007/s40823-017-0026-0
  • Geiß, C., Schauß, A., Riedlinger, T., Dech, S., Zelaya, C., Guzmán, N., Hube, M. A., Arsanjani, J. J., & Taubenböck, H. (2017). Joint use of remote sensing data and volunteered geographic information for exposure estimation: Evidence from Valparaíso, Chile. Natural Hazards, 86(S1), 81–105. https://doi.org/10.1007/s11069-016-2663-8
  • Gevaert, C. M., Kohli, D., & Kuffer, M. (2019, May). Challenges of mapping the missing spaces. 2019 Joint Urban Remote Sensing Event, JURSE 2019, 2019–2022. https://doi.org/10.1109/JURSE.2019.8809004
  • Ghaffarian, S., Kerle, N., & Filatova, T. (2018). Remote sensing-based proxies for Urban disaster risk management and resilience: A review. Remote Sensing, 10(11), 1760. https://doi.org/10.3390/rs10111760
  • Gram-Hansen, B. J., Helber, P., Coca-Castro, A., Kopackova, V., & Bilinski, P. (2019). Mapping informal settlements in developing countries using machine learning and low resolution multi-spectral data. AAAI/ACM Conference on AI, Ethics, and Society (AIES ’19), January 27–28, 2019, Honolulu, HI, USA, 361–368.
  • Hall-Beyer, M. (2017). Practical guidelines for choosing GLCM textures to use in landscape classification tasks over a range of moderate spatial scales. International Journal of Remote Sensing, 38(5), 1312–1338. https://doi.org/10.1080/01431161.2016.1278314
  • Han, J., Kamber, M., & Pei, J. (2012). Cluster analysis: Basic concepts and methods. In Data Mining 3rd, (pp. 443–495). https://doi.org/10.1016/B978-0-12-381479-1.00010-1
  • IBGE. (2010, August). SP Capital. Censo Demográfico.
  • Instituto Brasileiro de Geografia e Estatística (IBGE). (2019). Aglomerados Subnormais IBGE. https://ibge.gov.br/geociencias/organizacao-do-territorio/tipologias-do-territorio/15788-aglomerados-subnormais.html
  • Jochem, W. C., Leasure, D. R., Pannell, O., Chamberlain, H. R., Jones, P., & Tatem, A. J. (2020). Classifying settlement types from multi-scale spatial patterns of building footprints. Environment & Planning B: Urban Analytics & City Science, 48(5), 1–19. https://doi.org/10.1177/2399808320921208
  • Karmitsa, N., Bagirov, A. M., & Taheri, S. (2018). Clustering in large data sets with the limited memory bundle method. Pattern Recognition, 83, 245–259. https://doi.org/10.1016/j.patcog.2018.05.028
  • Kit, O., Lüdeke, M., & Reckien, D. (2012). Texture-based identification of urban slums in Hyderabad, India using remote sensing data. Applied Geography, 32(2), 660–667. https://doi.org/10.1016/j.apgeog.2011.07.016
  • Kohli, D., Sliuzas, R., Kerle, N., & Stein, A. (2012). An ontology of slums for image-based classification. Computers, Environment and Urban Systems, 36(2), 154–163. https://doi.org/10.1016/j.compenvurbsys.2011.11.001
  • Kohli, D., Stein, A., & Sliuzas, R. (2016). Uncertainty analysis for image interpretations of urban slums. Computers, Environment and Urban Systems, 60, 37–49. https://doi.org/10.1016/j.compenvurbsys.2016.07.010
  • Kohli, D., Warwadekar, P., Kerle, N., Sliuzas, R., & Stein, A. (2013). Transferability of object-oriented image analysis methods for slum identification. Remote Sensing, 5(9), 4209–4228. https://doi.org/10.3390/rs5094209
  • Kuffer, M., Pfeffer, K., Baud, I., & Sliuzas, R. (2013). Analysing sub-standard areas using high resolution remote (VHR) sensing imagery, N-AERUS XIV, Enschede, 12 - 14th September 2013. https://ris.utwente.nl/ws/portalfiles/portal/30133835/Kuffer2013analysing.pdf
  • Kuffer, M., Pfeffer, K., Sliuzas, R., & Baud, I. (2016). Extraction of slum areas from VHR imagery using GLCM variance. IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing, 9(5), 1830–1840. https://doi.org/10.1109/JSTARS.2016.2538563
  • Kuffer, M., Thomson, D. R., Boo, G., Mahabir, R., Grippa, T., Vanhuysse, S., Engstrom, R., Ndugwa, R., Makau, J., Darin, E., de Albuquerque, J. P., & Kabaria, C. (2020). The role of earth observation in an integrated deprived area mapping “system” for low-to-middle income countries. Remote Sensing, 12(6), 982. https://doi.org/10.3390/rs12060982
  • Leonita, G., Kuffer, M., Sliuzas, R., & Persello, C. (2018). Machine learning-based slum mapping in support of slum upgrading programs: The case of Bandung City, Indonesia. Remote Sensing, 10(10), 1522. https://doi.org/10.3390/rs10101522
  • Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature selection: A data perspective. ACM Computing Surveys, 50(6), 1–45. https://doi.org/10.1145/3136625
  • Lilford, R., Kyobutungi, C., Ndugwa, R., Sartori, J., Watson, S. I., Sliuzas, R., Kuffer, M., Hofer, T., Porto de Albuquerque, J., & Ezeh, A. (2019). Because space matters: Conceptual framework to help distinguish slum from non-slum urban areas. BMJ Global Health, 4(2), e001267. https://doi.org/10.1136/bmjgh-2018-001267
  • Lucci, P., Bhatkal, T., & Khan, A. (2018). Are we underestimating urban poverty? World Development, 103, 297–310. https://doi.org/10.1016/j.worlddev.2017.10.022
  • Luo, E., Kuffer, M., & Wang, J. (2022). Urban poverty maps - from characterising deprivation using geo-spatial data to capturing deprivation from space. Sustainable Cities and Society, 84(July), 104033. https://doi.org/10.1016/j.scs.2022.104033
  • Madhulatha, T. S. (2012). An overview on clustering methods. IOSR Journal of Engineering, 2(4), 719–725. https://doi.org/10.9790/3021-0204719725
  • Mahabir, R., Agouris, P., Stefanidis, A., Croitoru, A., & Crooks, A. T. (2020). Detecting and mapping slums using open data: A case study in Kenya. International Journal of Digital Earth, 13(6), 683–707. https://doi.org/10.1080/17538947.2018.1554010
  • Martínez, J., Pfeffer, K., & Baud, I. (2016). Factors shaping cartographic representations of inequalities. Maps as products and processes. Habitat International, 51, 90–102. https://doi.org/10.1016/j.habitatint.2015.10.010
  • Martino, A., Rizzi, A., & Mascioli, F. M. F. (2017). Efficient approaches for solving the large-scale k-medoids problem. IJCCI 2017 - Proceedings of the 9th International Joint Conference on Computational Intelligence, Ijcci, 338–347. https://doi.org/10.5220/0006515003380347
  • McInnes, L., Healy, J., & Astels, S. (2016). The HDBSCAN Clustering Library. IBM.
  • Merodio, P., Jimena, O., Carrillo, J., Kuffer, M., Thomson, D. R., Luis, J., Quiroz, O., Villaseñor, E., Vanhuysse, S., Abascal, Á., Oluoch, I., Nagenborg, M., Persello, C., & Brito, P. L. (2021). Earth observations and statistics: Unlocking sociodemographic knowledge through the power of satellite images. Sustainability, 13(22), 12640. https://doi.org/10.3390/su132212640
  • Nobrega, R. A. A., Hara, C. G. O., & Quintanilha, J. A. (2008). An object-based approach to detect road features for informal settlements near Sao Paulo, Brazil. In Object-Based Image Analysis. Lecture Notes in Geoinformation and Cartography. Springer, Berlin, Heidelberg. ISBN 978-3-540-77058-9. https://doi.org/10.1007/978-3-540-77058-9_32
  • Nugrahita, R., & Surjandari, I. (2020). Identify product families using cluster analysis: Case study in Passenger Car Radial (PCR) tire product. IOP Conference Series: Materials Science and Engineering. https://doi.org/10.1088/1757-899X/909/1/012057
  • Olthuis, K., Benni, J., Eichwede, K., & Zevenbergen, C. (2015). Slum upgrading: Assesing the importance of location and a plea for a spatial approach. Habitat International, 50, 270–288. https://doi.org/10.1016/j.habitatint.2015.08.033
  • Owen, K. K., & Wong, D. W. (2013). An approach to differentiate informal settlements using spectral, texture, geomorphology and road accessibility metrics. Applied Geography, 38, 107–118. https://doi.org/10.1016/j.apgeog.2012.11.016
  • Pasternak, S., & D’Ottaviano, C. (2016). Favelas no Brasil e em São Paulo: avanços nas análises a partir da Leitura Territorial do Censo de 2010*. Cadernos Metrópole, 18(35), 75–100. https://doi.org/10.1590/2236-9996.2016-3504
  • Patel, A., Koizumi, N., & Crooks, A. (2014). Measuring slum severity in Mumbai and Kolkata: A household-based approach. Habitat International, 41, 300–306. https://doi.org/10.1016/j.habitatint.2013.09.002
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). 2.3. Clustering — scikit-learn 0.24.2 documentation. Machine Learning in Python.
  • Pfaffel, O. (2021). Feature importance for partitional clustering (0.1.4). CRAN.
  • Roy, D., Bernal, D., & Lees, M. (2020). An exploratory factor analysis model for slum severity index in Mexico City. Urban Studies, 57(4), 789–805. https://doi.org/10.1177/0042098019869769
  • Sliuzas, R., Kuffer, M., & Masser, I. (2010). The Spatial and Temporal Nature of Urban Objects. In C. Jürgens & T. Rashed (Eds.), Remote Sensing of Urban and Suburban Areas. Springer Science+Business Media. https://doi.org/10.1007/978-1-4020-4385-7
  • Small, C. (2014). Mapping urban growth and development as continuous fields in space and time. Geography Department University of Sao Paulo, 1(spe), 155–179. https://doi.org/10.11606/rdg.v0i0.555
  • Taubenböck, H., & Kraff, N. J. (2014). The physical face of slums: A structural comparison of slums in Mumbai, India, based on remotely sensed data. Journal of Housing and the Built Environment, 29(1), 15–38. https://doi.org/10.1007/s10901-013-9333-x
  • Taubenböck, H., Kraff, N. J., & Wurm, M. (2018). The morphology of the Arrival City - A global categorization based on literature surveys and remotely sensed data. Applied Geography, 92, 150–167. https://doi.org/10.1016/j.apgeog.2018.02.002
  • Thomson, D. R., Kuffer, M., Boo, G., Hati, B., Grippa, T., Elsey, H., Linard, C., Mahabir, R., Kyobutungi, C., Maviti, J., Mwaniki, D., Ndugwa, R., Makau, J., Sliuzas, R., Cheruiyot, S., Nyambuga, K., Mboga, N., Kimani, N. W., de Albuquerque, J. P., & Kabaria, C. (2020). Need for an Integrated Deprived Area “Slum” Mapping System (IDEAMAPS) in Low- and Middle-Income Countries (LMICs). Social Sciences, 9(5), 80. https://doi.org/10.3390/socsci9050080
  • Thrun, M. C., Ultsch, A., & Breuer, L. (2021). Explainable AI framework for multivariate hydrochemical time series. Machine Learning and Knowledge Extraction, 3(February), 1–29. https://doi.org/10.3390/make3010009
  • Trento Oliveira, L. (2021). The diversity of deprived areas: Applications of unsupervised machine learning and open geodata [University of Twente]. https://essay.utwente.nl/88986/
  • UN-Habitat. (2016). World Cities Report 2016: Urbanization and Development – Emerging Futures. UN-Habitat (Nairobi, Kenya). Accessed 11 07 2022. Available online: https://unhabitat.org/world-cities-report-2016
  • Van Dijk, M., Moorthy, I., Nguyen, B., See, L., & Fritz, S. (2019). Tracking poverty using satellite imagery and big data. The International Institute for Applied Systems Analysis, December, 1–16. http://pure.iiasa.ac.at/id/eprint/16240/
  • Wang, J., Kuffer, M., & Pfeffer, K. (2019). The role of spatial heterogeneity in detecting urban slums. Computers, Environment and Urban Systems, 73(April 2018), 95–107. https://doi.org/10.1016/j.compenvurbsys.2018.08.007
  • Wurm, M., Taubenböck, H., Weigand, M., & Schmitt, A. (2017). Slum mapping in polarimetric SAR data using spatial features. Remote Sensing of Environment, 194, 190–204. https://doi.org/10.1016/j.rse.2017.03.030
  • Yeboah, G., de Albuquerque, J. P., Troilo, R., Tregonning, G., Perera, S., Shifat Ahmed, S. A. K., Ajisola, M., Alam, O., Aujla, N., Azam, S. I., Azeem, K., Bakibinga, P., Chen, Y. F., Choudhury, N. N., Diggle, P. J., Fayehun, O., Gill, P., Griffiths, F., Harris, B., … Yusuf, R. (2021). Analysis of OpenStreetMap data quality at different stages of a participatory mapping process: Evidence from slums in Africa and Asia. ISPRS International Journal of Geo-Information, 10(4), 265. https://doi.org/10.3390/ijgi10040265

Appendix 1.

SQL expressions for GIS attribute selection.

Appendix 2.

Using data acquired by IBGE, we derive features 01 to 04 from table ‘Básico_SP’, feature 05 from table ‘DomicílioRenda_SP’, and features 06 to 12 from table ‘Domicilio01_UF’.

Appendix 3.

Elbow method calculation from different implementation packages. Left: from sklearn package; Right: from yellowbrick.cluster package.
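The elbow calculation named in this caption can be illustrated with a minimal sketch. It assumes scikit-learn's `KMeans` and uses synthetic blobs as stand-in data; the helper name `elbow_inertias`, the range of k, and the data are illustrative assumptions, not the study's actual inputs or settings. The `KElbowVisualizer` in `yellowbrick.cluster` wraps a similar loop, but it is not used here so that the example stays self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def elbow_inertias(X, k_values, random_state=0):
    """Fit k-means for each k and return the inertia (within-cluster sum of squares)."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        inertias.append(model.inertia_)
    return np.array(inertias)


# Synthetic stand-in data (an assumption, NOT the study's features):
# four well-separated blobs in two dimensions.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

k_values = list(range(1, 9))
inertias = elbow_inertias(X, k_values)

# The "elbow" is the k after which the drop in inertia flattens out.
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` gives the curve that is inspected visually for the elbow; different packages may scale or score the curve differently, which is one reason two implementations can be worth comparing.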

Appendix 4.

Appendix 5.

Visualisation of features using FeatureImpCluster tool.
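FeatureImpCluster is an R package, so the following is only a rough Python analogue of the general idea of permutation-based feature importance for k-means, not the package's own code. It assumes importance is measured as the share of points whose cluster assignment changes when one feature's values are shuffled; the function name `permutation_importance_kmeans` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def permutation_importance_kmeans(model, X, n_repeats=10, seed=0):
    """Mean share of points reassigned when each feature is permuted in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.predict(X)  # cluster labels on the unpermuted data
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        rates = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            rates.append(np.mean(model.predict(X_perm) != baseline))
        importances[j] = np.mean(rates)
    return importances


# Synthetic stand-in data (an assumption): two informative features
# plus one near-constant noise feature that should matter very little.
X_info, _ = make_blobs(n_samples=300, centers=3, n_features=2,
                       cluster_std=0.5, random_state=0)
noise = np.random.default_rng(0).normal(scale=0.01, size=(300, 1))
X = np.hstack([X_info, noise])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
imp = permutation_importance_kmeans(model, X)
print(imp)
```

Features whose permutation reassigns many points contribute more to the cluster structure; a bar or box plot of these values per feature is the kind of visualisation the caption refers to.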

Appendix 6.

Visualisation of features using FeatureImpCluster tool.