A machine learning method for Arctic lakes detection in the permafrost areas of Siberia

ABSTRACT

Thermokarst lakes are the main components of the vast Arctic and subarctic landscapes. These lakes can serve as geoindicators of permafrost degradation; therefore, proper lake distribution assessment methods are necessary. In this study, we compared four machine learning methods to improve existing lake detection systems. The northern part of Yakutia was selected as the study area owing to its complex environment. We used data from Landsat 8 and spectral indices to take into account the spectral characteristics of the lakes, and MERIT DEM data to take into account the topography. The lowest accuracy was found for the classification and regression trees (CART) method (overall accuracy = 81%). On the other hand, the random forests (RF) classification provided the best results (overall accuracy = 92%), and only this classification coped well in all problematic areas, such as shaded and humid areas, near steep slopes, burn scars, and rivers. The altitude and bands SWIR1 (Short wave infrared 1), SWIR2 (Short wave infrared 2), and Green were the most important. Spectral indices did not have significant impact on the classification results in the specific conditions of the thermokarst lakes environment. 17,700 lakes were identified with the total area of 271.43 km².

Introduction

Permafrost regions cover a quarter of the land surface in the northern hemisphere and contain large amounts of organic carbon (Brown et al., Citation1997; Jorgenson & Grosse, Citation2016; Obu et al., Citation2019). Climate change is expected to significantly degrade permafrost (Anisimov et al., Citation2010; Lawrence & Slater, Citation2005; Romanovsky et al., Citation2017), leading to the formation of thermokarst lakes (Zandt et al., Citation2020). Lakes in Siberia are point sources of carbon dioxide and methane that release long-term carbon stocks into the atmosphere (Hughes-Allen et al., Citation2021; Kallistova et al., Citation2019; Serikova et al., Citation2019). This process initiates positive climate feedback, potentially contributing to a 0.39°C rise in surface air temperature by 2300 (Schneider von Deimling et al., Citation2015; Zandt et al., Citation2020). Thermokarst lakes are one of the main features of the Arctic and Subarctic landscapes of Siberia, Alaska, Canada (Grosse et al., Citation2008) and Qinghai-Tibet Plateau (Chen et al., Citation2021) creating characteristic thermokarst landscapes. The formation of lakes in Siberia significantly impacts regional landscape morphology, hydrology, and biogeochemistry. Such lakes are characterized by landforms that develop as a result of the melting of ice-rich permafrost or massive ground ice (Brown & Grave, Citation1979). The common occurrence of thermokarst lakes has a large impact on the arctic areas, despite the fact that it is a local thermal disturbance of the ground (Grosse et al., Citation2013). The active thermokarst often indicates that permafrost is unstable and warming up (Niu et al., Citation2011; Serreze et al., Citation2000).

Proper lake distribution assessment methods are necessary to accurately determine their location and characteristics as well as to study their temporal and spatial dynamics. After studying these characteristics, it will be possible to ascertain whether thermokarst lakes can serve as geoindicators of permafrost degradation. The concept of geoindicators appears to be particularly well suited to determining changes in morphogenetic and sedimentary environments or geosystems in general, especially in sensitive polar and subpolar systems (Zwoliński et al., Citation2008; Zwoliński, Citation2004).

Scientists have used various sensors to map lakes in permafrost regions. Kripotkin et al. (Citation2008) studied the dynamics of thermokarst lakes in continuous and discontinuous cryolithozones of Western Siberia using a combination of different sensors (Landsat 1, 4, 5, and 7, Meteor-3 M, and ERS-2). They used manual interpretation; therefore, their approach cannot be applied to large areas. Kravtsova et al. (2009, 2011; Kravtsova & Rodinova, Citation2016) studied changes in the size of thermokarst lakes using Landsat, Soyuz-22 sensors, and aerial images. They pointed out many difficulties in mapping lakes in northern regions, caused by the similar spectral characteristics of lakes, burn scars, and shadows. Sentinel-2 satellite imagery has been used since 2015 to map water bodies (Bogoyavlensky, Citation2019; Du et al., Citation2016; Yang et al., Citation2017). Owing to the high spatial resolution of Sentinel-2, it is possible to map lakes smaller than 1 ha (Șerban et al., Citation2020), but these images cannot be used to analyse surface changes over long periods.

The main methods used in lake mapping are digitizing through visual interpretation, thresholding, standard image classification techniques, edge detection using a single or a combination of multiple bands, algebraic operations (e.g. band ratios and spectral water indices), spectral transformation, and texture analysis (Song et al., Citation2014).

Previous research on this topic is based mainly on the definition of threshold values that best separate the image pixel values of water from land. Grosse et al. (Citation2008) used panchromatic Spot-5 and Ikonos-2 images for threshold assessments. They observed strong reflectance differences between the land surface and the water bodies. However, they highlighted difficulties with lakes with very shallow water levels, turbid water with high sediment suspension, frozen lakes, deep thermo-erosional valleys, and steep north-facing cliffs or slopes, all of which resulted in misclassification. They corrected the errors manually, an approach that is impractical for research covering large territories. Morgenstern et al. (Citation2011) used the thresholding method with mid-infrared wavelengths of Landsat data and encountered similar problems. Lindgren et al. (Citation2021) applied the Object-Based Image Analysis (OBIA) classification for lake detection; however, they faced misclassification caused by terrain shadows, rivers, streams, and channels, which were manually removed.

Numerous spectral indices exist, but their suitability for water detection in different settings must be investigated (Palmer et al., Citation2015). The Normalized Difference Vegetation Index (NDVI) is widely used in remote sensing research to capture regional and global vegetation changes (Wang et al., Citation2014). McFeeters (Citation1996) developed the most basic water index, the Normalized Difference Water Index (NDWI), using the green and near-infrared (NIR) bands of the Landsat thematic mapper (TM) to maximize the identification of water bodies. Lu et al. (Citation2011) used the combined difference between NDVI and NDWI to enhance the contrast between water bodies and the surrounding surface features. The NDWI index exhibited some shortcomings in built-up areas. To solve this problem, the NIR range was replaced with the short-wave infrared (SWIR) range, resulting in the Modified Normalized Difference Water Index (MNDWI) index (Xu, Citation2006). However, there are still severe problems owing to shadows in the highlands. Therefore, Feyisa et al. (Citation2014) proposed an automated water extraction index (AWEI) for identifying water bodies, which has two conditions: AWEIsh is primarily designed to remove shadow pixels, while AWEInsh is designed for areas with urban backgrounds.

Machine-learning methods have also been used for the classification of water objects. Both supervised (Acharya et al., Citation2019; Huang et al., Citation2015) and unsupervised (Brezonik et al., Citation2007; Kloiber et al., Citation2002; Olmanson et al., Citation2008) approaches have been tested. In the Arctic regions, machine learning algorithms are mostly used for permafrost modelling (Deluigi et al., Citation2017; Wang et al., Citation2019) and water body mapping (Dirscherl et al., Citation2020; Nitze et al., Citation2017; Rokni et al., Citation2015; Șerban et al., Citation2020). The accuracies reported in these studies vary widely. Xie et al. (Citation2016) obtained an accuracy of 96% using Landsat TM imagery (30 m resolution), while Pradhan achieved an accuracy of 58% using radar TerraSAR-X (3 m resolution). Nitze et al. (Citation2017) used the random forest method for lake change detection. They optimized the data quality to remove the misclassifications described in the previously mentioned works; however, they still faced problems mainly attributed to the presence of snow, ice, and shadows. Șerban et al. (Citation2020) compared five machine learning methods for the classification of water bodies using Sentinel-2 data. Automatic classifications have proved their efficiency and accuracy in water body extraction (Acharya et al., Citation2019; Nie et al., Citation2017; Xie et al., Citation2016) and performed better than simple thresholding of water indices such as NDWI (Șerban et al., Citation2020). Bangira et al. (Citation2019) compared thresholding with machine learning classifiers for mapping complex water bodies and determined that machine learning methods were less sensitive to variations such as water turbidity and aquatic plants.

Attempts have been made to create global lake databases, both by combining existing databases (Lehner & Döll, Citation2004) and by using satellite imagery (Donchyts et al., Citation2016; Verpoorter et al., Citation2014). A recent global high-resolution mapping approach was conducted using Landsat satellite imagery; however, this method does not distinguish between rivers and lakes (Pekel et al., Citation2016). Such data cannot be used to map thermokarst lakes in large territories or areas crossed by river channels.

Most of the related work focuses on the classification of water objects only, with no specific distinction between lakes and other water objects. Such a distinction may be needed when studying thermokarst lakes. Often, rivers must be excluded manually from the dataset. In this study, we assumed that rivers differ spectrally from lakes and are characterized by different topographic conditions and morphology, so that distinguishing between the two would be possible.

Previous works highlighted that most classification errors are caused by similar spectral features of lakes and other objects such as shadows, burn scars, or rivers. However, lakes often have different spectral characteristics owing to their depth and the presence of vegetation or ice, which causes false positive errors. The main goal of this study was to develop a method for detecting water bodies as lakes in a differentiated environmental setting. It was assumed that machine learning techniques perform better than thresholding in a permafrost environment covered by wetlands and peatlands with complex relief. The northern part of Yakutia was selected as the study area because of its complex environment and because it remains poorly explored. It is characterized by a significant variety of terrain and is entirely located in a continuous permafrost zone, which means that most lakes are of thermokarst origin. The main objectives of this study were to: (i) identify the features that best distinguish lakes from other spectrally similar objects, (ii) compare and determine the accuracy of the supervised classification methods for lake detection, and (iii) design a method for detecting lakes over large areas.

Methods

Study area

As the main aim of this research was to develop a method for lake detection in a complex environment, we selected the northern part of Yakutia as the study area. The territory is distinguished by various landforms and landscapes as well as complex geodiversity and biodiversity, thereby making it possible to test the designed method under various environmental conditions. The study area is significant owing to its location in a continuous permafrost zone. Thermokarst lakes are widespread in the study area. The region’s area is more than 137,400 km², and its population is just over 11,000 people. Field research in this area is financially unprofitable and logistically difficult, making it very important to improve the relevant research technologies using remote sensing.

The study area belongs to the Verkhoyansk and Chersky mountain systems (Figure 1). The climate of the area is harsh and sharply continental (Matveev, Citation1989). It is shaped by the geographical location between the Arctic and subarctic climatic zones as well as by the isolation of the area by mountain ranges. The main river in the study area is the Yana River. The Yana River basin covers highlands with mountain tundra, rocky deserts, and plateaus with mountain larch forests. The area is dominated by sparse woodlands, mainly Cajander larch (Larix cajanderi), with birch, dwarf pine, cranberries, and extensive moss-lichen complexes (Matveev, Citation1989). Lichen and moss tundra are common in the north of the district. The mountainous landscape is dominated by cedar elfin and shrubby birch, and the soil surface is covered with lichens (Matveev, Citation1989).

Figure 1. a) Location of the study area in Siberia against the background of the permafrost distribution (Obu et al., Citation2020). b) Location of the studied Yana River drainage basin (Global Runoff Data Centre, Citation2020) on a digital elevation model (DEM based on Yamazaki et al., Citation2017).

Visual validation was performed on three selected test sites (Figure 2). The total lake area and number of lakes at each site are as follows: test site 1, 1.86 km² and 8 lakes; test site 2, 5.58 km² and 20 lakes; test site 3, 1.24 km² and 8 lakes. Each test site has an area of 62 km², giving a total area of 186 km² for the three test sites.

Figure 2. Three selected test sites a) Lowland conditions; b) River and wetland conditions; c) Mountainous area with river-lake system.

Data

An extended processing time is one of the main problems when using machine learning methods over large areas (Nath & Deb, Citation2010). The problem of insufficient computing power has largely been solved since the release of the Google Earth Engine (GEE) (Google). GEE is a cloud platform dedicated to geographic data processing and analysis (Gorelick et al., Citation2017). GEE is widely used for mapping water bodies on a regional (Mahdianpari et al., Citation2020; Nguyen et al., Citation2019; Wang et al., Citation2020) or even on a global scale (Pekel et al., Citation2016). Thus, all of the presented analyses were performed using GEE with code written in JavaScript. The north-eastern part of the Siberian plateau was first imaged in 1999 by Landsat sensors (Gutman et al., Citation2013; Pekel et al., Citation2016), and since that year the number of acquisitions has been increasing. Analyses were performed using Landsat 8 data from 2014, the first year for which a cloud-free mosaic could be built for the study area. In the study area, there is no snow cover from June to September (Matveev, Citation1989). Therefore, only images from this period were selected for the mosaic. The images were taken from the USGS Landsat 8 Surface Reflectance Level 1 collection. Therefore, atmospheric and geometric corrections were unnecessary. Only images with cloud cover lower than 30% were retained. The collection contained 55 images in total.

In addition to Landsat 8 images, a digital elevation model, the Multi-Error-Removed Improved-Terrain DEM, was used (Yamazaki et al., Citation2017). MERIT DEM is a high-precision global digital elevation model with a spatial resolution of 3 arc seconds (ca 36 m at the study area, 90 m at the equator), obtained by excluding major error components from existing DEMs (NASA SRTM3 DEM, JAXA AW3D DEM, Viewfinder Panoramas DEM).

Preprocessing

An image mosaic covering the entire study area was created for the period from June to September 2014. If two images were available for a specific area, the image with the lower cloud cover was selected. In the next step, a cloud mask created using the CFMask algorithm was applied (Foga et al., Citation2017).
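The scene selection described above can be illustrated with a minimal plain-Python sketch. It is not the authors' GEE script: the scene records, the `tile` key, and the function name `select_scenes` are hypothetical and introduced only for illustration. The June to September window, the 2014 season, and the 30% cloud limit are taken from the text.

```python
from datetime import date

def select_scenes(scenes, year=2014, max_cloud=30.0):
    """Keep, per tile, the least cloudy scene from June-September of `year`.

    `scenes` is a list of dicts with hypothetical keys:
    'tile' (any tile identifier), 'date' (datetime.date), 'cloud' (percent).
    """
    best = {}
    for s in scenes:
        in_season = s["date"].year == year and 6 <= s["date"].month <= 9
        if not in_season or s["cloud"] >= max_cloud:
            continue
        current = best.get(s["tile"])
        if current is None or s["cloud"] < current["cloud"]:
            best[s["tile"]] = s
    return best

# Hypothetical scene list (assumption, not data from the study).
scenes = [
    {"tile": "A", "date": date(2014, 7, 3), "cloud": 12.0},
    {"tile": "A", "date": date(2014, 8, 4), "cloud": 4.5},
    {"tile": "B", "date": date(2014, 5, 20), "cloud": 1.0},   # outside the season
    {"tile": "B", "date": date(2014, 9, 1), "cloud": 45.0},   # too cloudy
]
selected = select_scenes(scenes)
```

In this toy input only tile A yields a usable scene, and the less cloudy of its two candidates is kept, which mirrors the rule of preferring the image with the lower cloud cover.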

Based on the literature review mentioned in the introduction section, 18 variables were selected for model training (bands: B1, B2, B3, B4, B5, B6, B7, B10, and B11; spectral indices: NDVI, NDWI, MNDWI1, MNDWI2, AWEIsh, and AWEInsh; geomorphometric parameters: altitude, slope, and aspect).

The spectral indices used in the study were selected according to Acharya et al. (Citation2018), who tested spectral indices for water body detection in a mountainous environment (Table 1). As some mountain shadows cannot be distinguished from lake features using spectral variables, MERIT DEM data were used to create an additional dataset with the geomorphometric variables.

Table 1. Spectral indices used in the study, acc. to Acharya et al. (Citation2018).
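For orientation, the indices named above can be sketched in Python using their commonly published forms (NDWI after McFeeters, MNDWI after Xu, AWEI after Feyisa et al.). This is a hedged illustration rather than a reproduction of Table 1: the function names are hypothetical, the assignment of MNDWI1 to the SWIR1 band and MNDWI2 to the SWIR2 band is an assumption, and the reflectance values are invented. For Landsat 8, blue, green, red, NIR, SWIR1, and SWIR2 correspond to bands B2 to B7.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def water_indices(blue, green, red, nir, swir1, swir2):
    """Compute the six indices for one pixel of surface reflectance (0-1)."""
    return {
        "NDVI": normalized_difference(nir, red),
        "NDWI": normalized_difference(green, nir),
        "MNDWI1": normalized_difference(green, swir1),  # assumed: SWIR1 variant
        "MNDWI2": normalized_difference(green, swir2),  # assumed: SWIR2 variant
        "AWEInsh": 4 * (green - swir1) - (0.25 * nir + 2.75 * swir2),
        "AWEIsh": blue + 2.5 * green - 1.5 * (nir + swir1) - 0.25 * swir2,
    }

# Hypothetical reflectances for a clear-water pixel and a vegetated pixel.
water = water_indices(blue=0.06, green=0.05, red=0.03, nir=0.02, swir1=0.01, swir2=0.005)
forest = water_indices(blue=0.03, green=0.05, red=0.04, nir=0.30, swir1=0.15, swir2=0.07)
```

With these invented values the water indices are positive for the water pixel and negative for the vegetated pixel, while NDVI behaves the other way round.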

Classification

In the next stage, the training and test data were prepared. Ground truth points were selected based on the Joint Research Centre (JRC) Global Surface Water Mapping Layers, v1.3 (Pekel et al., Citation2016), and on visual assessment of the Landsat data mosaic and high-resolution images available from Google Earth Pro. The number of ground truth points (400) was determined by taking the square root of the study area in square kilometres and rounding the result to the nearest hundred. In total, 200 points were assigned to the lake category and another 200 to the non-lake category. Field validation was not possible because of the inaccessibility of the territory and the large area of research (137,400 km²). The points characterizing the lakes were chosen to reflect all the spectral and spatial characteristics of the lakes, while the rest of the points were chosen with an emphasis on the objects that are most difficult to distinguish, such as shadows, river bends, and fire scars. Examples of the ground truth points are shown in Figure 3. The entire dataset was randomly divided into training (75%) and test (25%) data.
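The sample-size rule and the random split can be checked with a short Python sketch. The function names and the fixed random seed are assumptions made for illustration; the text does not say whether the split was stratified, so a simple random shuffle is used here.

```python
import math
import random

def ground_truth_count(area_km2):
    """Square root of the area, rounded to the nearest hundred."""
    return int(round(math.sqrt(area_km2) / 100.0) * 100)

def split_points(points, train_fraction=0.75, seed=0):
    """Shuffle the points and split them into training and test subsets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

n_points = ground_truth_count(137_400)          # sqrt(137400) is about 370.7
labels = ["lake"] * (n_points // 2) + ["non-lake"] * (n_points // 2)
train, test = split_points(labels)
```

For the 137,400 km² study area this gives 400 points, of which 300 fall into the training subset and 100 into the test subset.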

Figure 3. Examples of ground points in study area. The white colour indicates points in the lake category. Points in the non-lake category are marked in yellow. A Landsat 8 composite of bands 5, 4, and 3 was used as the base map.

We compared four types of popular modelling techniques available in GEE and widely used in remote sensing: classification and regression tree (CART), naive Bayes (NB), random forest (RF), and support vector machine (SVM) (Clemente et al., Citation2020; Shelestov et al., Citation2017; Shetty et al., Citation2021; Yang et al., Citation2021).

Classification and regression trees (Lewis, Citation2000) is a well-proven machine learning algorithm for identifying and classifying objects in the remote sensing field (Shelestov et al., Citation2017). The CART classification does not require expert knowledge, automatically selects useful spectral and ancillary information from data supplied and can be used with continuous and categorical ancillary data (Lawrence & Wright, Citation2001).

Bayes’ theorem statistically expresses the relationship between the conditional probabilities of two events. NB is a particular case of Bayesian networks, in which the class node has no parents and each attribute node has the class node as its unique parent. Despite the simplicity of this method, it often performs better than other, more complex classification methods (Bressan et al., Citation2009).
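The structure described above, in which every attribute depends only on the class, can be sketched as a small Gaussian naive Bayes classifier. This is an illustration under stated assumptions, not the GEE implementation used in the study: the Gaussian likelihood, the function names, and the two-feature samples are all hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate per-class priors and per-feature mean/variance."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids division by zero
            for col, m in zip(zip(*rows), means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda y: log_posterior(*model[y]))

# Hypothetical two-feature samples: (NIR reflectance, altitude in m).
X = [(0.02, 40), (0.03, 55), (0.025, 35), (0.25, 300), (0.30, 420), (0.28, 380)]
y = ["lake", "lake", "lake", "non-lake", "non-lake", "non-lake"]
nb = fit_gaussian_nb(X, y)
```

Because the per-feature likelihoods are simply multiplied (added in log space), each attribute contributes independently given the class, which is the defining assumption of NB.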

The RF algorithm was developed by Breiman (Citation2001). It relies on building many decision trees, each trained on a random subset of the data. Random forests, which generalize the decision tree concept, are classified as ensemble methods. The final decision is made by a majority vote on the classes indicated by the individual decision trees. To find the best partitioning at each node of a decision tree, the RF uses a metric called the Gini index. In this study, the Gini index was also used to calculate the average Gini decrease, which expresses the importance of the variables used during classification (Belgiu & Drăguţ, Citation2016). The average Gini decrease was calculated to assess the importance of all input variables used to train the model.
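The two ingredients named above, the Gini-based split criterion and the majority vote, can be sketched in a few lines of Python. The function names and the toy label lists are hypothetical; a real forest averages the impurity decrease of each variable over all nodes and trees, which this sketch does not do.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

def majority_vote(tree_predictions):
    """Final forest decision: the class predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

parent = ["lake"] * 5 + ["non-lake"] * 5
# Hypothetical split that separates the classes perfectly.
perfect = gini_decrease(parent, ["lake"] * 5, ["non-lake"] * 5)
# Hypothetical split that barely separates them.
weak = gini_decrease(parent, ["lake"] * 3 + ["non-lake"] * 2, ["lake"] * 2 + ["non-lake"] * 3)
vote = majority_vote(["lake", "non-lake", "lake"])
```

A variable that produces splits like the first one accumulates a large Gini decrease and therefore ranks high in importance, whereas a variable producing splits like the second contributes little.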

In recent decades, SVMs have become popular supervised machine learning algorithms for classification and regression (Suthaharan, Citation2016). The SVM is based on the concept of the perpendicular distance between a decision plane (the decision boundary) and the data points. The plane with the maximum margin is chosen as the decision boundary. Objects are separated and assigned to different classes based on this distance. A subset of the data, called the support vectors, determines the position of the boundary (Suthaharan, Citation2016).
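The geometric idea behind the SVM can be made concrete with a small Python sketch of a linear decision boundary. The weights, the offset, and the function names are hypothetical, and no training is performed here; the sketch only shows how the perpendicular distance and the margin width follow from a given plane.

```python
import math

def signed_distance(w, b, x):
    """Signed perpendicular distance from point x to the plane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(w, b, x):
    """Assign a class from the side of the decision boundary x falls on."""
    return "lake" if signed_distance(w, b, x) >= 0 else "non-lake"

def margin_width(w):
    """Width of the margin between the two supporting planes w.x + b = +/-1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical boundary in a two-feature space.
w, b = (3.0, 4.0), -5.0
d = signed_distance(w, b, (3.0, 4.0))   # (9 + 16 - 5) / 5
```

Training an SVM amounts to choosing `w` and `b` so that `margin_width(w)` is as large as possible while the training points stay on the correct side.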

Validation

The classification results were evaluated using a confusion matrix, which yields general measures of statistical accuracy, including precision, recall, F-score, and Cohen’s kappa (K). Precision (P), or user accuracy, was computed by dividing the number of correctly classified pixels in a class by the total number of pixels predicted as that class. Recall (R), often called producer accuracy, was calculated by dividing the number of correctly classified pixels in a class by the total number of true samples in the class. To capture the information from P and R simultaneously, the F-score (F1) was calculated as their harmonic mean (Jolly, Citation2018):

F_1 = \frac{2 \cdot R \cdot P}{R + P} \qquad (1)

The omission error (EO) and commission error (EC) were also calculated. EO returns the number of false negative (FN) (classification of a lake as non-lake) results in relation to all true samples of a class, while EC describes the frequency of all false positive (FP) (classification of a non-lake as lake) results in relation to all predicted samples in a class. All indicators were calculated for each class (lakes and non-lakes).

The overall accuracy (OA) was calculated by dividing the sum of all true positive (TP) and true negative (TN) classification results by the total number of samples (TS). The OA represents the probability that a sample is classified correctly; a lower OA indicates poorer model performance. Because the OA does not account for agreement that occurs by chance, the expected accuracy (EA) was also calculated from TP, TN, FP, FN, and TS (Cohen, Citation1960):

EA = \frac{(TN + FP)(TN + FN) + (FN + TP)(FP + TP)}{TS^2} \qquad (2)

As a general measure of accuracy, Cohen’s kappa coefficient (K) was calculated using the following formula (Cohen, Citation1960):

K = \frac{OA - EA}{1 - EA} \qquad (3)

K determines the degree of agreement between multiple measurements of the same variable under different conditions (Cohen, Citation1960; Landis & Koch, Citation1977). A similar approach to accuracy measures was used by Dirscherl et al. (Citation2020), who classified Antarctic supraglacial lakes using the random forest method. All calculated measures of accuracy lie between 0 and 1. For K, OA, F1, P, and R, higher values represent better performance, while for EO and EC lower values represent better results.

Visual validation was performed for the three selected test sites described in the study area section. The lakes in these areas were manually digitized. The areas of the classified lakes, false negatives, and false positives were also calculated.
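The accuracy measures defined in this section can be sketched as one Python function operating on the four cells of a two-class confusion matrix. The function name and the counts are hypothetical and are not the study's results. One assumption is stated explicitly: the expected accuracy is normalised by TS squared, the standard form of Cohen's chance agreement, so that EA is a proportion comparable with OA.

```python
def accuracy_metrics(tp, tn, fp, fn):
    """Accuracy measures for the positive (lake) class of a 2x2 confusion matrix."""
    ts = tp + tn + fp + fn
    precision = tp / (tp + fp)          # user accuracy
    recall = tp / (tp + fn)             # producer accuracy
    f1 = 2 * recall * precision / (recall + precision)
    oa = (tp + tn) / ts
    # Chance agreement as a proportion (normalised by ts squared; assumption).
    ea = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / ts ** 2
    kappa = (oa - ea) / (1 - ea)
    return {
        "P": precision, "R": recall, "F1": f1,
        "EO": fn / (tp + fn),           # omission error = 1 - R
        "EC": fp / (tp + fp),           # commission error = 1 - P
        "OA": oa, "EA": ea, "K": kappa,
    }

# Hypothetical counts for a 100-point test set (not the study's results).
m = accuracy_metrics(tp=45, tn=40, fp=10, fn=5)
```

For these invented counts OA is 0.85, EA is 0.5, and K is 0.7, which shows how kappa discounts the share of agreement that a random assignment with the same class totals would already achieve.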

Results

Model performance

Table 2 presents the calculated values of accuracy and error rates for each type of classification. The omission error (EO) of the lake class was the highest for the CART classification (0.16), and the EO for the non-lake class was the highest for the CART and NB classifications (0.22). The NB classification had the lowest EO for the lake class, and the RF classification had the lowest EO for the non-lake class. The commission error (EC) for the lake class was lowest in the RF classification. In contrast, the non-lake class had the lowest EC in the NB classification. R and P were analysed jointly in terms of the F-score (F1). The RF classification had the best F1 for both the lake and non-lake classes, and the CART classification had the worst F1 in both classes.

Table 2. Accuracy assessment results for each classification type. The best values are bold.

The best OA and K were found for RF classification and the worst for CART classification. The RF classification yielded the lowest error in both the lake and non-lake classes.

Based on the visual assessment, the RF classification yielded the best results in the lowland conditions (Figure 4). The other types of classification significantly overestimated the number of lakes and were characterized by a high error. The CART and SVM classifications were characterized by a considerable error and often classified objects such as shadows and non-vegetated areas as lakes. Areas close to the waterbodies were also often classified as lakes. False positives significantly degraded the classification results.

Figure 4. Lake mapping results in lowland conditions on test site 1: False colour - composition of bands 5, 4, and 3, Landsat 8 OLI; RGB - composition of bands 2, 3, and 4, Landsat 8 OLI; CART, classification and regression trees; NB, naive Bayes; RF, random forest; SVM, support vector machine.

It can be concluded that only the RF model coped well in river and wetland conditions (Figure 5). The NB model classified all rivers as lakes. The CART and SVM methods often classified objects such as shadows, bare ground, and areas close to the waterbodies as lakes. For all types of classifications, the shallow parts of the lakes were characterized by a large number of false negatives.

Figure 5. Lake mapping results in river and wetland conditions on test site 2: false color - composition of bands 5, 4, and 3, Landsat 8 OLI; RGB - composition of bands 2, 3, and 4, Landsat 8 OLI; CART, classification and regression trees; NB, naive Bayes; RF, random forest; SVM, support vector machine.

The CART and SVM classifications were characterized by high errors under mountainous conditions (Figure 6), especially in highly shaded areas. In both classifications, shadows produced a large number of false positives. The NB and RF classifications had the lowest error in mountain conditions. Both classifications were resistant to the misclassification of shaded mountain slopes.

Figure 6. Lake mapping results in mountainous conditions on test site 3: false color - composition of bands 5, 4, and 3, Landsat 8 OLI; RGB - composition of bands 2, 3, and 4, Landsat 8 OLI; CART, classification and regression trees; NB, naive Bayes; RF, random forest; SVM, support vector machine.

Table 3 shows the classified lake areas at each of the three test sites. At test site 1, the RF classification had the lowest false positive and false negative errors. The CART and SVM classifications had the highest number of false positives, and the NB classification had the highest number of false negatives. At test site 2, the CART and NB classifications had the highest false positive and false negative values, whereas the RF classification had the lowest error values. Test site 3 had the lowest number of false negatives; none of the classifications was characterized by a high number of false negatives there. The CART classification had the most false positives, whereas the RF and NB classifications had the fewest. Based on the information shown in Table 3, it can be concluded that the RF classification is the most accurate, as it had the lowest number of false positives and false negatives at all three sites. Similar results were obtained for the NB classification, but only at test site 3. In contrast, the worst results were obtained for the CART and SVM classifications.

Table 3. Lake area [km²], false positive and false negative error percentage for each of the tested classifications on three test sites. Area value is given in square kilometres while false positives and false negatives are given as percentages. The best values are bold.

Variable importance

Figure 7 shows the importance of all input variables of the RF model. Altitude had the greatest importance, while other terrain features, such as slope and aspect, had medium importance. Almost all Landsat 8 bands had high importance, except for the thermal bands (B10 and B11) and band 1. AWEIsh was the most significant and NDVI the least significant among all the indices. In general, the indices had the lowest overall impact on the classification results.

Figure 7. Importance of the variables based on the Gini index. The variables were divided into three groups based on their characteristics and application.

Rivers, burn scars and shadows were the hardest to distinguish from the lakes (Figure 8). Noticeable differences between lakes and rivers are observed in bands 2, 3 and 4 of the Landsat images. Burn scars have different spectral characteristics than lakes in the thermal bands (B10 and B11). Water indices did not show any significant differences between lakes and rivers. However, these indices did show large differences between water features on the one hand and shadows and burn scars on the other. The NDVI values differed between the lake and river classes, and were higher in the lakes than in the rivers. The altitude differed significantly for the shadows class, as most shaded places are in mountainous areas with high slope values. On the other hand, lakes and rivers are in flat areas at low altitudes; moreover, rivers are situated at lower altitudes than lakes.

Figure 8. Signatures of the features: lakes (0), rivers (1), shadows (2), burn scars (3), and other training plots (4). The values of the variables are normalized.

Lakes inventory

According to the RF classification results, 24,923 lakes were identified with a total area of 275.84 km². The smallest identified lake had an area of 900 m². We set the smallest lake area according to Pekel et al. (Citation2016), who reported issues with the detection of water bodies smaller than 30 × 30 m (900 m²). However, after the classification we decided to raise this value to 0.5 ha because of the high number of false positives among the smaller lakes. Lakes smaller than this value were removed from the dataset. After the data cleaning, the total area of lakes was 271.43 km², with 17,770 identified lakes. The largest lake had an area of 9.95 km². The average lake area was 0.015 km². Most of the lakes were located in the central and northern parts of the research area (Figure 9). Lakes were located mainly along rivers. Mountain areas and lowlands in the south were not characterized by a high density of lakes.
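The minimum-area filter and the summary statistics can be illustrated with a short Python sketch. The lake areas in the list and the function names are hypothetical; only the 30 m pixel size, the 0.5 ha threshold, and the reported totals come from the text.

```python
PIXEL_AREA_M2 = 30 * 30          # one Landsat pixel, 900 m²
MIN_LAKE_AREA_M2 = 0.5 * 10_000  # 0.5 ha expressed in m²

def filter_lakes(areas_m2, min_area_m2=MIN_LAKE_AREA_M2):
    """Drop lake polygons smaller than the minimum mapping unit."""
    return [a for a in areas_m2 if a >= min_area_m2]

def summarize(areas_m2):
    """Return count, total area (km²) and mean area (km²)."""
    total_km2 = sum(areas_m2) / 1e6
    return len(areas_m2), total_km2, total_km2 / len(areas_m2)

# Hypothetical lake areas in m² (multiples of the pixel area).
lakes = [900, 1800, 4500, 5400, 27_000, 9_950_000]
kept = filter_lakes(lakes)
count, total_km2, mean_km2 = summarize(kept)

# Consistency check on the figures reported in the text.
reported_mean_km2 = 271.43 / 17_770
```

A threshold of 0.5 ha corresponds to roughly 5.6 Landsat pixels, so in practice a lake must cover at least six pixels to be kept. Dividing the reported total area by the reported number of lakes gives about 0.015 km², in agreement with the average lake area stated above.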

Figure 9. Lakes distribution plot, according to the RF classification results. a) altitude, b) longitude, c) latitude.

Discussion

Model performance

The statistical accuracy metrics indicated that the workflow functions well. However, some classification methods showed lower overall accuracy than others. The lowest accuracy scores were obtained using CART and SVM. In other studies, different classifiers showed various results depending on the subject and location of study. In an assessment of machine learning classifiers for global lake ice cover mapping, the RF method (96% accuracy) outperformed SVM (79% accuracy), and SVM turned out to be the least accurate of the tested methods (Wu et al., Citation2021). However, SVM proved to be suitable for water quality assessment (Bouamar & Ladjal, Citation2007; Danades et al., Citation2016). Misra et al. (Citation2018) used SVM for shallow water bathymetry mapping, with an accuracy of 80%. The CART classification has been used successfully (92%) for mapping urban lake areas, which is contrary to our results. This is probably due to the completely different character of the study area (Zhang et al., Citation2018). The main disadvantage of CART is its sensitivity to training data. Thus, a slight change in the training dataset could significantly affect the results. Such conclusions were confirmed by Yasmin et al. (Citation2019), who compared the CART and SVM methods for detecting water bodies using Landsat 8 OLI. Our results are in line with older studies comparing RF and CART, in which RF was superior to CART (Oliveira et al., Citation2012; Peters et al., Citation2007; Vorpahl et al., Citation2012). Rithin Paul Reddy et al. (Citation2020) assessed multiple classifiers and found that the NB method identified water regions better than the others, contrary to our results.

In our study, the RF method proved robust across all landscape types and error-inducing factors. This study supports the use of RF models based on multiple variables to extract water bodies from massive remote sensing data. This is consistent with other studies on the mapping of water bodies in China (Deng et al., 2017) and Australia (Tulbure & Broich, 2019), where the RF method was more than 90% accurate, as was the case in this study. Șerban et al. (2020) tested various machine learning techniques for thermokarst lake detection on the north-eastern Qinghai-Tibet Plateau. They reported that maximum likelihood classification gave the most accurate results; nevertheless, the RF classification achieved an accuracy higher than 90%. Our accuracy is also similar to that of land cover studies using the RF method (Deluigi et al., 2017; Gislason et al., 2006; Shi et al., 2019; Wang et al., 2019).

Despite the high accuracy, a visual check of the classification results is necessary, particularly in areas close to rivers and multichannel rivers. Visual interpretations of the results revealed some deficiencies in the classification methods in problematic places (rivers, shadows, burn scars). According to the visual assessment, only the RF classification was not characterized by a large error in distinguishing between lakes and rivers. The other classification results were characterized by considerable errors in cloud-shaded areas and mountain slopes, areas with high moisture, and areas with burn scars.
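To make the accuracy terms used in this comparison concrete, the following sketch computes overall accuracy and Cohen's kappa from a two-class confusion matrix. The labels and sample predictions are hypothetical, and including kappa is an assumption based on the citation of Cohen (1960) in the reference list; this is not the authors' validation code.

```python
def confusion_counts(y_true, y_pred, positive="lake"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # non-lake mapped as lake (commission error)
        elif truth == positive:
            fn += 1  # lake mapped as non-lake (omission error)
        else:
            tn += 1
    return tp, fp, fn, tn


def overall_accuracy(tp, fp, fn, tn):
    """Share of validation points that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Agreement corrected for the agreement expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected)


# Made-up validation sample: 4 lake points and 6 non-lake points.
y_true = ["lake"] * 4 + ["non-lake"] * 6
y_pred = ["lake", "lake", "lake", "non-lake",
          "lake", "non-lake", "non-lake", "non-lake", "non-lake", "non-lake"]

counts = confusion_counts(y_true, y_pred)
accuracy = overall_accuracy(*counts)
kappa = cohen_kappa(*counts)
print(counts, accuracy, round(kappa, 3))
```

Reporting false positives and false negatives separately, rather than overall accuracy alone, shows whether a classifier tends to over-map or under-map lakes.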

Variable importance

The Gini index was calculated to determine which variables influenced the classification results the most. Although some variables had lower importance than others, we decided not to reduce the number of input variables when training the model. Each explanatory variable has different characteristics, so removing any of them could worsen the results, especially when classifying objects with similar spectral signatures. The input variables were selected based on a careful literature review (Acharya et al., 2018; Dirscherl et al., 2020; Feyisa et al., 2014; Kravtsova & Rodinova, 2016; Palmer et al., 2015). As the primary goal of this study was to develop an approach that is portable in space and time, a wider range of input variables provides more flexibility to classify spatially independent regions.
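As a rough illustration of how a Gini-based importance arises, the sketch below computes the Gini impurity of a labelled sample and the impurity decrease produced by a single threshold split. In a random forest, such decreases are accumulated over all splits and trees for each variable. The variable values, thresholds, and function names are hypothetical, and the authors' actual implementation is not reproduced here.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting the sample at `threshold`."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Made-up sample: three lake pixels and three river pixels.
labels = ["lake", "lake", "lake", "river", "river", "river"]
altitude_m = [40, 35, 38, 12, 15, 10]          # separates the classes cleanly
index_value = [0.3, 0.5, 0.4, 0.45, 0.35, 0.5]  # overlaps between classes

altitude_gain = gini_decrease(altitude_m, labels, threshold=20)
index_gain = gini_decrease(index_value, labels, threshold=0.4)
print(altitude_gain, round(index_gain, 4))
```

In this toy sample the altitude split removes all impurity, while the overlapping index removes very little, which is the kind of contrast that a high or low importance score summarizes.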

Among the raw Landsat 8 channels, the SWIR 2 (band 7) and SWIR 1 (band 6) bands had the highest importance values. These bands are commonly used to assess soil and vegetation moisture (Bao et al., 2018; Khellouk et al., 2020). However, the indices built using these bands did not reach such high values, which is contrary to other studies (Acharya et al., 2018; Yang et al., 2017; Zhai et al., 2015). Most of the indices use only two bands (NIR and Green), and for this reason a shallow thaw lake with a weak signal in the green spectral band can be underestimated (Szabó et al., 2020; Zhao et al., 2020). This may explain why the water indices did not work well for mapping thermokarst lakes, as these water bodies are often shallow. However, Chen et al. (2021) prepared a dataset of permafrost lakes across the Qinghai-Tibetan Plateau permafrost region using the NDWI index with satisfactory results. Our study area, especially in the lowlands, is characterized by high soil moisture. Depending on the season, these areas are often flooded, which may lead to large errors in the classification. According to Șerban et al. (2020), the NIR band, which is used in most of the indices, is prone to commission errors, as clusters of wet soil, shadows, and clouds can be misinterpreted as water. Muster et al. (2013) and Wangchuk and Bolch (2020) claim that the shadow problem can be solved with the R, G, and B bands, which explains the high importance of these bands in our study. However, apart from these bands, we observed a large spectral difference between shadows and water bodies in the SWIR bands and in the spectral indices built on these bands (MNDWI). The high importance of the SWIR bands was also confirmed by a study of water bodies with Sentinel 2 images (Du et al., 2016). The RGB bands of the Landsat images differed spectrally between the lake and river classes. This could be caused by different water clarity and turbidity in lakes and rivers (Brezonik et al., 2007; Zhao et al., 2011). This is also supported by higher NDVI values, which may indicate greater eutrophication of lakes than rivers, more developed aquatic vegetation, and thus different spectral values. The potential of using NDVI for lake detection was previously confirmed by multiple studies in various environments (Fan et al., 2020; Han & Niu, 2020; Kiage & Walker, 2009; Propastin, 2008).
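The normalized-difference indices discussed above share one algebraic form. The sketch below uses the commonly published definitions: NDWI from the Green and NIR bands (McFeeters, 1996), MNDWI from Green and SWIR1, and NDVI from NIR and Red. The reflectance values are hypothetical, and returning 0.0 for a zero denominator is a convenience chosen for this example rather than part of any cited method.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b); 0.0 if the denominator is zero."""
    denominator = a + b
    return (a - b) / denominator if denominator != 0 else 0.0


def ndwi(green, nir):
    """Normalized Difference Water Index (Green vs. NIR)."""
    return normalized_difference(green, nir)


def mndwi(green, swir1):
    """Modified NDWI, replacing NIR with SWIR1."""
    return normalized_difference(green, swir1)


def ndvi(nir, red):
    """Normalized Difference Vegetation Index (NIR vs. Red)."""
    return normalized_difference(nir, red)


# Made-up surface reflectances for a single open-water pixel.
water = {"green": 0.06, "red": 0.04, "nir": 0.02, "swir1": 0.01}

water_ndwi = ndwi(water["green"], water["nir"])
water_mndwi = mndwi(water["green"], water["swir1"])
water_ndvi = ndvi(water["nir"], water["red"])
print(round(water_ndwi, 3), round(water_mndwi, 3), round(water_ndvi, 3))
```

Because each index combines only two bands, a pixel whose Green reflectance is unusually low (for example, a shallow or turbid lake) shifts NDWI and MNDWI together, which is one reason the raw bands can carry information that the indices do not.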

Mapping lakes remains uniquely challenging in mountainous regions, especially because of the usually small size of lakes (Wangchuk & Bolch, 2020). Altitude was far more important than the other variables. This is most likely because meltwater from thawing permafrost runs off the mountain slopes, whereas in flat areas it remains in situ. Polishchuk et al. (2018) highlighted that the number and proportion of lakes are strongly controlled by variations in topography, which confirms the importance of altitude in the classification. Most of the lakes in the region have a thermokarst origin and lie in shallow depressions called alas, formed by permafrost subsidence in the Yedoma regions. Yedoma is an ice-rich permafrost deposit containing large syngenetic ice wedges, which accumulated in regions of Eurasia, Alaska, and Northwest Canada during the Pleistocene (Strauss et al., 2013). These Yedoma areas are strongly affected by thermokarst processes, which results in the formation of lakes.

The difficult aspect of this study was finding differences between rivers and lakes that could help in the classification. We assumed that rivers, despite having spectral characteristics similar to those of lakes, would be distinguishable once topography was taken into account. The results indicated that the rivers are situated at lower elevations than the lakes, which was important for distinguishing between these two classes.

Lakes inventory

In our study, we assumed that the minimum lake area that can be mapped using Landsat imagery is the pixel size of this sensor (30 × 30 m). The spatial resolution of the Landsat data (30 m) may have led to the exclusion of many small lakes and, therefore, an underestimate of the water extent (Șerban et al., 2020). Other studies have used different area thresholds. Polishchuk et al. (2008) specified that the smallest identifiable lakes were 5.55 Landsat pixels in size (5,000 m²). The thresholds assumed in other studies vary widely, from 1,000 m² (Muster et al., 2013) to 200,000 m² (Morgenstern et al., 2008). According to Smith et al. (2005), the smallest lake area that can be reliably mapped is 400,000 m², to avoid seasonal variability. Our preliminary results show that lakes below 5,000 m² are often characterized by high classification error and are also prone to seasonal variability, but this topic needs further exploration.
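The area thresholds quoted in this paragraph are easier to compare when expressed as Landsat pixel counts. The short sketch below performs that conversion for a 30 m pixel; the dictionary keys are descriptive labels chosen for this example.

```python
PIXEL_SIZE_M = 30
PIXEL_AREA_M2 = PIXEL_SIZE_M ** 2  # 900 m² per Landsat pixel


def area_to_pixels(area_m2):
    """Convert an area in square metres to a number of 30 m pixels."""
    return area_m2 / PIXEL_AREA_M2


# Minimum mappable lake areas mentioned in the text, in m².
thresholds_m2 = {
    "single Landsat pixel": 900,
    "Muster et al. (2013)": 1_000,
    "0.5 ha cutoff": 5_000,
    "Morgenstern et al. (2008)": 200_000,
    "Smith et al. (2005)": 400_000,
}

pixels = {name: area_to_pixels(area) for name, area in thresholds_m2.items()}
for name, count in pixels.items():
    print(f"{name}: {count:.2f} pixels")
```

The 5,000 m² threshold corresponds to 5,000 / 900 ≈ 5.56 pixels (5.55 when truncated), so a lake at this limit is represented by only a handful of pixels, most of which lie on its shoreline.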

Our inventory showed that the spatial distribution of lakes in the study area is not uniform. Grosse et al. (2008) showed that thermokarst lake distribution is strongly connected to hydrological and geomorphological parameters. We found a high density of lakes in the lowlands, valleys, and bogs of the central part of the region. Most lakes are located along rivers or on their interfluves. These areas, with a high density of thermokarst lakes, are probably highly vulnerable to thermokarst processes and form part of the thermokarst landscapes of the region (Olefeldt et al., 2016). Mountain areas are characterized by a much lower lake density, which is in line with results from the Qinghai-Tibet Plateau, where the number of lakes is considerably smaller in mountainous areas (Niu et al., 2011).

The field mapping and monitoring of lakes in these areas are complicated owing to their large numbers, remoteness, and inaccessibility. Consequently, remote sensing-based approaches are highly preferred over field-based approaches. In recent years, remote sensing methods have rapidly developed owing to the free availability of data and extensive spatial coverage. However, remote sensing approaches have certain drawbacks. Limitations, in particular, include their inability to reliably map lakes over large areas owing to many disruptive factors, such as shadows from mountains and clouds, lake turbidity, cloud cover, areas with high moisture, burn scars, and similar spectral characteristics of lakes and rivers. This study demonstrated that the errors caused by these factors can be overcome.

In the future, using Sentinel 2 data, or a fusion of Sentinel 2 and Landsat 8 data, could help identify smaller lakes. On the other hand, Sentinel 2 does not have a time series as long as that of Landsat, so analysing lake dynamics over a long period would not be possible. More training data can be used in areas with high errors to improve classification accuracy. According to Millard and Richardson (2015), who investigated the importance of training data in RF image classification, as many training and validation sample points as possible should be collected. More sample points should be collected from mountainous areas than from lowland areas. The classification methods can be adapted for time-series analyses when annual DEMs are available. The investigated area is strongly exposed to thermokarst processes; therefore, the relief changes dynamically, even at annual intervals. It is subject to strong land subsidence and landslides, which may shape the distribution of lakes as well as changes in their surface. High-resolution data are required to identify these changes accurately. To analyse the temporal dynamics of lakes, data sources that capture annual changes in hypsometry are required.

Conclusion

This study provides an adaptation of known classification methods for lake detection in a very specific territory with cold climate, such as a permafrost area, using Landsat and MERIT DEM data. Based on the results of this case study, the following conclusions can be drawn:

•Altitude and bands 7 (SWIR2), 6 (SWIR1), and 3 (Green) of Landsat 8 OLI had the greatest impact on the lake classification results.

•Most classification errors were false positives (classification of a non-lake as a lake) rather than false negatives (classification of a lake as a non-lake).

•The CART, NB, and SVM classifications were characterized by high error in shaded and moist areas, near steep slopes, burn scars, and rivers. Only the RF classification coped with these limitations and provided the best results.

•The greatest error characterized the CART classification owing to the large number of false-positive errors.

•Future developments may involve improving the RF model with more training data and using different variables (spectral indices based on the SWIR bands), with a greater spatial resolution.

The RF technique has exploratory potential and can be used to map and observe changes in permafrost lakes on a definite time-space scale in arctic and subarctic permafrost areas.

Description of author’s responsibilities

JP: conception of the paper; JP, JN, and ZZ: designed the analysis; JP: collected the data and performed the analysis; JP, JN, and ZZ: data interpretation; JP: wrote the paper; JN and ZZ: critical revision of the paper; ZZ: final approval.

Acknowledgments

The authors would like to thank the reviewers very much for their comprehensive and detailed comments, which helped to improve the paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data and code that support the findings of this study are available from the corresponding author, upon reasonable request. Data used in this study were derived from the following resources available in the public domain, https://developers.google.com/earth-engine/datasets.

Additional information

Funding

The work was supported by the Initiative of Excellence - Research University.

References

Acharya, T. D., Subedi, A., & Lee, D. H. (2018). Evaluation of water indices for surface water extraction in a Landsat 8 scene of Nepal. Sensors, 18(8), 2580. https://doi.org/10.3390/s18082580
Acharya, T. D., Subedi, A., & Lee, D. H. (2019). Evaluation of machine learning algorithms for surface water extraction in a Landsat 8 scene of Nepal. Sensors, 19(12), 2769. https://doi.org/10.3390/s19122769
Anisimov, O. A., Belolutskaya, M. A., Grigoriev, M. N., Instanes, A., Kokorev, V. A., Oberman, N. G., Reneva, S. A., Strelchenko, Y. G., Streletskiy, D., & Shiklomanov, N. I. (2010). Major natural and social-economic consequences of climate change in the permafrost region: predictions based on observations and modeling. Greenpeace, Moscow, Russia.
Bangira, T., Alfieri, S. M., Menenti, M., & van Niekerk, A. (2019). Comparing thresholding with machine learning classifiers for mapping complex water. Remote Sensing, 11(11), 1351. https://doi.org/10.3390/rs11111351
Bao, Y., Lin, L., Wu, S., Kwal Deng, K. A., & Petropoulos, G. P. (2018). Surface soil moisture retrievals over partially vegetated areas from the synergy of Sentinel-1 and Landsat 8 data using a modified water-cloud model. International Journal of Applied Earth Observation and Geoinformation, 72, 76–18. https://doi.org/10.1016/j.jag.2018.05.026
Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing, 114, 24–31. https://doi.org/10.1016/j.isprsjprs.2016.01.011
Bogoyavlensky, V. I. (2019). Innovative technologies and results of studying processes of natural and man-made degassing of the earth in the lithosphere-cryosphere-hydrosphere-atmosphere system. Third International Conference on Geology of the Caspian Sea and Adjacent Areas, 2019, 1–5. Baku, Azerbaijan. https://doi.org/10.3997/2214-4609.201952015
Bouamar, M., & Ladjal, M. (2007). Evaluation of the performances of ANN and SVM techniques used in water quality classification. 2007 14th IEEE International Conference on Electronics, Circuits and Systems, 1047–1050. Marrakech, Morocco.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Bressan, G. M., Oliveira, V. A., Hruschka, E. R., Jr., & Nicoletti, M. C. (2009). Using Bayesian networks with rule extraction to infer the risk of weed infestation in a corn-crop. Engineering Applications of Artificial Intelligence, 22(4–5), 579–592. https://doi.org/10.1016/j.engappai.2009.03.006
Brezonik, P. L., Olmanson, L. G., Bauer, M. E., & Kloiber, S. M. (2007). Measuring water clarity and quality in Minnesota lakes and rivers: A census-based approach using remote-sensing techniques. Cura Reporter, 37(3), 3–313.
Brown, J., Ferrians, O. J., Jr., Heginbottom, J. A., & Melnikov, E. S. (1997). Circum-arctic map of permafrost and ground ice conditions.
Brown, J., & Grave, N. A. (1979). Physical and thermal disturbance and protection of permafrost. Cold Regions Research and Engineering Lab. https://apps.dtic.mil/sti/citations/ADA069405
Chen, X., Mu, C., Jia, L., Li, Z., Fan, C., Mu, M., Peng, X., & Wu, X. (2021). High-resolution dataset of thermokarst lakes on the Qinghai-Tibetan Plateau. Earth System Science Data Discussions, 1–23.
Clemente, J. P., Fontanelli, G., Ovando, G. G., Roa, Y. L. B., Lapini, A., & Santi, E. (2020). Google Earth Engine: Application of algorithms for remote sensing of crops in Tuscany (Italy). 2020 IEEE Latin American GRSS ISPRS Remote Sensing Conference (LAGIRS), 195–200. https://doi.org/10.1109/LAGIRS48042.2020.9165561
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
Danades, A., Pratama, D., Anggraini, D., & Anggriani, D. (2016). Comparison of accuracy level K-nearest neighbor algorithm and support vector machine algorithm in classification water quality status. 6th International Conference on System Engineering and Technology (ICSET), 137–141. Bandung, Indonesia.
Deluigi, N., Lambiel, C., & Kanevski, M. (2017). Data-driven mapping of the potential mountain permafrost distribution. The Science of the Total Environment, 590–591, 370–380. https://doi.org/10.1016/j.scitotenv.2017.02.041
Deng, Y., Jiang, W., Tang, Z., Li, J., Lv, J., Chen, Z., & Jia, K. (2017). Spatio-temporal change of lake water extent in Wuhan urban agglomeration based on Landsat images from 1987 to 2015. Remote Sensing, 9(3), 270. https://doi.org/10.3390/rs9030270
Dirscherl, M., Dietz, A. J., Kneisel, C., & Kuenzer, C. (2020). Automated mapping of Antarctic supraglacial lakes using a machine learning approach. Remote Sensing, 12(7), 1203. https://doi.org/10.3390/rs12071203
Donchyts, G., Baart, F., Winsemius, H., Gorelick, N., Kwadijk, J., & van de Giesen, N. (2016). Earth’s surface water change over the past 30 years. Nature Climate Change, 6(9), 810–813. https://doi.org/10.1038/nclimate3111
Du, Y., Zhang, Y., Ling, F., Wang, Q., Li, W., & Li, X. (2016). Water bodies’ mapping from Sentinel-2 Imagery with modified normalized difference water index at 10-m spatial resolution produced by sharpening the SWIR band. Remote Sensing, 8(4), 354. https://doi.org/10.3390/rs8040354
Fan, X., Liu, Y., Wu, G., & Zhao, X. (2020). Compositing the minimum NDVI for daily water surface mapping. Remote Sensing, 12(4), 700. https://doi.org/10.3390/rs12040700
Feyisa, G. L., Meilby, H., Fensholt, R., & Proud, S. R. (2014). Automated water extraction index: A new technique for surface water mapping using Landsat imagery. Remote Sensing of Environment, 140, 23–35. https://doi.org/10.1016/j.rse.2013.08.029
Foga, S., Scaramuzza, P. L., Guo, S., Zhu, Z., Dilley, R. D., Beckmann, T., Schmidt, G. L., Dwyer, J. L., Joseph Hughes, M., & Laue, B. (2017). Cloud detection algorithm comparison and validation for operational Landsat data products. Remote Sensing of Environment, 194, 379–390. https://doi.org/10.1016/j.rse.2017.03.026
Gislason, P. O., Benediktsson, J. A., & Sveinsson, J. R. (2006). Random Forests for land cover classification. Pattern Recognition Letters, 27(4), 294–300. https://doi.org/10.1016/j.patrec.2005.08.011
Global Runoff Data Centre, G. (2020). GRDC (2020): Major river basins of the world. Federal Institute of Hydrology (BfG).
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031
Grosse, G., Jones, B., & Arp, C. (2013). 8.21 Thermokarst lakes, drainage, and drained basins. In J.F. Shroder (Ed.), Treatise on Geomorphology (pp. 325–353). Academic Press. https://doi.org/10.1016/B978-0-12-374739-6.00216-5
Grosse, G., Romanovsky, V., Walter, K., Morgenstern, A., Lantuit, H., Zimov, S., Dreger, S. C., Grosse, G., Henneberger, C., Grantyn, R., Just, I., & Ahnert-Hilger, G. 2008. Distribution of thermokarst lakes and ponds at three yedoma sites in Siberia Ninth International Conference on Permafrost, 23, 1115–1126. Fairbanks, USA. https://doi.org/10.1096/fj.08-116855
Gutman, G., Huang, C., Chander, G., Noojipady, P., & Masek, J. G. (2013). Assessment of the NASA–USGS global land survey (GLS) datasets. Remote Sensing of Environment, 134, 249–265. https://doi.org/10.1016/j.rse.2013.02.026
Han, Q., & Niu, Z. (2020). Construction of the long-term global surface water extent dataset based on water-NDVI spatio-temporal parameter set. Remote Sensing, 12(17), 2675. https://doi.org/10.3390/rs12172675
Huang, X., Xie, C., Fang, X., & Zhang, L. (2015). Combining pixel- and object-based machine learning for identification of water-body types from urban high-resolution remote-sensing imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(5), 2097–2110. https://doi.org/10.1109/JSTARS.2015.2420713
Hughes-Allen, L., Bouchard, F., Laurion, I., Séjourné, A., Marlin, C., Hatté, C., Costard, F., Fedorov, A., & Desyatkin, A. (2021). Seasonal patterns in greenhouse gas emissions from thermokarst lakes in Central Yakutia (Eastern Siberia). Limnology and Oceanography, 66(S1), S98–116. https://doi.org/10.1002/lno.11665
Jolly, K. (2018). Machine Learning with scikit-learn Quick Start Guide: Classification, regression, and clustering techniques in Python. Packt Publishing Ltd.
Jorgenson, M. T., & Grosse, G. (2016). Remote sensing of landscape change in permafrost regions. Permafrost and Periglacial Processes, 27(4), 324–338. https://doi.org/10.1002/ppp.1914
Kallistova, A. Y., Savvichev, A. S., Rusanov, I. I., & Pimenov, N. V. (2019). Thermokarst lakes, ecosystems with intense microbial processes of the methane cycle. Microbiology (Reading, England), 88(6), 649–661. https://doi.org/10.1134/S0026261719060043
Khellouk, R., Barakat, A., Boudhar, A., Hadria, R., Lionboui, H., El Jazouli, A., Rais, J., El Baghdadi, M., & Benabdelouahab, T. (2020). Spatiotemporal monitoring of surface soil moisture using optical remote sensing data: A case study in a semi-arid area. Journal of Spatial Science, 65(3), 481–499. https://doi.org/10.1080/14498596.2018.1499559
Kiage, L. M., & Walker, N. D. (2009). Using NDVI from MODIS to monitor duckweed bloom in Lake Maracaibo, Venezuela. Water Resources Management, 23(6), 1125–1135. https://doi.org/10.1007/s11269-008-9318-9
Kloiber, S. M., Brezonik, P. L., Olmanson, L. G., & Bauer, M. E. (2002). A procedure for regional lake water clarity assessment using Landsat multispectral data. Remote Sensing of Environment, 82(1), 38–47. https://doi.org/10.1016/S0034-4257(02)00022-6
Kravtsova, V. I., & Rodinova, T. V. (2016). Issledovaniye dinamiki ploshchadi i kolichestva termokarstovykh ozer v razlichnykh rayonakh kriolitozony Rossii po kosmicheskim snimkam [Study of the dynamics of the area and the number of thermokarst lakes in various areas of the permafrost zone of Russia based on space images]. Earth Cryosphere, 20(1). https://elibrary.ru/item.asp?id=25664814
Kripotkin, S. N., Polnshchuk, Y. M., & Bryksina, N. A. (2008). Dinamika Ploshchadey Termokarstovykh Ozer v Sploshnoy i Preryvistoy Kriolitozonakh Zapadnoy Sibiri v Usloviyakh Global’nogo Potepleniya [Dynamics of Areas of Thermokarst Lakes in Continuous and Discontinuous Permafrost Zones of Western Siberia under the Conditions of Global Warming]. Vestnik Tomskogo Gosudarstvennogo Universiteta, 311.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
Lawrence, D. M., & Slater, A. G. (2005). A projection of severe near-surface permafrost degradation during the 21st century. Geophysical Research Letters, 32(24). https://doi.org/10.1029/2005GL025080
Lawrence, R. L., & Wright, A. (2001). Rule-based classification systems using classification and regression tree (CART) analysis. Photogrammetric Engineering and Remote Sensing, 67(10), 1137–1142.
Lehner, B., & Döll, P. (2004). Development and validation of a global database of lakes, reservoirs and wetlands. Journal of Hydrology, 296(1–4), 1–22. https://doi.org/10.1016/j.jhydrol.2004.03.028
Lewis, R. J. (2000). An introduction to classification and regression tree (CART) analysis. Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, 14, California, San Francisco, USA.
Lindgren, P. R., Farquharson, L. M., Romanovsky, V. E., & Grosse, G. (2021). Landsat-based lake distribution and changes in western Alaska permafrost regions between the 1970s and 2010s. Environmental Research Letters, 16(2), 025006. https://doi.org/10.1088/1748-9326/abd270
Lu, S., Wu, B., Yan, N., & Wang, H. (2011). Water body mapping method with HJ-1A/B satellite imagery. International Journal of Applied Earth Observation and Geoinformation, 13(3), 428–434. https://doi.org/10.1016/j.jag.2010.09.006
Mahdianpari, M., Jafarzadeh, H., Granger, J. E., Mohammadimanesh, F., Brisco, B., Salehi, B., Homayouni, S., & Weng, Q. (2020). A large-scale change monitoring of wetlands using time series Landsat imagery on Google Earth Engine: A case study in Newfoundland. GIScience & Remote Sensing, 57(8), 1102–1124. https://doi.org/10.1080/15481603.2020.1846948
Matveev, I. A. (1989). Agricultural atlas of the Republic Sakha (Yakutia) (Moscow: GUGK).
McFeeters, S. K. (1996). The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. International Journal of Remote Sensing, 17(7), 1425–1432. https://doi.org/10.1080/01431169608948714
Millard, K., & Richardson, M. (2015). On the importance of training data sample selection in random forest image classification: A case study in peatland ecosystem mapping. Remote Sensing, 7(7), 8489–8515. https://doi.org/10.3390/rs70708489
Misra, A., Vojinovic, Z., Ramakrishnan, B., Luijendijk, A., & Ranasinghe, R. (2018). Shallow water bathymetry mapping using Support Vector Machine (SVM) technique and multispectral imagery. International Journal of Remote Sensing, 39(13), 4431–4450. https://doi.org/10.1080/01431161.2017.1421796
Morgenstern, A., Grosse, G., Günther, F., Fedorova, I., & Schirrmeister, L. (2011). Spatial analyses of thermokarst lakes and basins in Yedoma landscapes of the Lena Delta. The Cryosphere, 5(4), 849–867. https://doi.org/10.5194/tc-5-849-2011
Morgenstern, A., Grosse, G., & Schirrmeister, L. (2008). Genetic, morphological, and statistical characterization of lakes in the permafrost-dominated Lena Delta. 9th International Conference on Permafrost, Fairbanks, Alaska.
Muster, S., Heim, B., Abnizova, A., & Boike, J. (2013). Water body distributions across scales: A remote sensing based comparison of three Arctic tundra wetlands. Remote Sensing, 5(4), 1498–1523. https://doi.org/10.3390/rs5041498
Nath, R. K., & Deb, S. K. (2010). Water-body area extraction from high resolution satellite images-an introduction, review, and comparison. International Journal of Image Processing (IJIP), 3(6), 265–384.
Nguyen, U. N. T., Pham, L. T. H., & Dang, T. D. (2019). An automatic water detection approach using Landsat 8 OLI and Google Earth Engine cloud computing to map lakes and reservoirs in New Zealand. Environmental Monitoring and Assessment, 191(4), 235. https://doi.org/10.1007/s10661-019-7355-x
Nie, Y., Sheng, Y., Liu, Q., Liu, L., Liu, S., Zhang, Y., & Song, C. (2017). A regional-scale assessment of Himalayan glacial lake changes using satellite observations from 1990 to 2015. Remote Sensing of Environment, 189, 1–13. https://doi.org/10.1016/j.rse.2016.11.008
Nitze, I., Grosse, G., Jones, B. M., Arp, C. D., Ulrich, M., Fedorov, A., & Veremeeva, A. (2017). Landsat-based trend analysis of lake dynamics across northern permafrost regions. Remote Sensing, 9(7), 640. https://doi.org/10.3390/rs9070640
Niu, F., Lin, Z., Liu, H., & Lu, J. (2011). Characteristics of thermokarst lakes and their influence on permafrost in Qinghai–Tibet Plateau. Geomorphology, 132(3), 222–233. https://doi.org/10.1016/j.geomorph.2011.05.011
Obu, J., Westermann, S., Barboux, C., Bartsch, A., Delaloye, R., Grosse, G., & Wiesmann, A. (2020). ESA permafrost climate change initiative (Permafrost_cci): Permafrost active layer thickness for the northern hemisphere, v2. 0. Centre for Environmental Data Analysis. https://catalogue.ceda.ac.uk/uuid/67a3f8c8dc914ef99f7f08eb0d997e23
Obu, J., Westermann, S., Bartsch, A., Berdnikov, N., Christiansen, H. H., Dashtseren, A., Delaloye, R., Elberling, B., Etzelmüller, B., Kholodov, A., Khomutov, A., Kääb, A., Leibman, M. O., Lewkowicz, A. G., Panda, S. K., Romanovsky, V., Way, R. G., Westergaard-Nielsen, A., Wu, T., … Zou, D. (2019). Northern Hemisphere permafrost map based on TTOP modelling for 2000–2016 at 1 km² scale. Earth-Science Reviews, 193, 299–316. https://doi.org/10.1016/j.earscirev.2019.04.023
Olefeldt, D., Goswami, S., Grosse, G., Hayes, D., Hugelius, G., Kuhry, P., McGuire, A. D., Romanovsky, V. E., Sannel, A. B. K., Schuur, E. A. G., & Turetsky, M. R. (2016). Circumpolar distribution and carbon storage of thermokarst landscapes. Nature Communications, 7(1), 13043. https://doi.org/10.1038/ncomms13043
Oliveira, S., Oehler, F., San-Miguel-Ayanz, J., Camia, A., & Pereira, J. M. (2012). Modeling spatial patterns of fire occurrence in Mediterranean Europe using Multiple Regression and Random Forest. Forest Ecology and Management, 275, 117–129. https://doi.org/10.1016/j.foreco.2012.03.003
Olmanson, L. G., Bauer, M. E., & Brezonik, P. L. (2008). A 20-year Landsat water clarity census of Minnesota’s 10,000 lakes. Remote Sensing of Environment, 112(11), 4086–4097. https://doi.org/10.1016/j.rse.2007.12.013
Palmer, S. C. J., Kutser, T., & Hunter, P. D. (2015). Remote sensing of inland waters: Challenges, progress and future directions. Remote Sensing of Environment, 157, 1–8. https://doi.org/10.1016/j.rse.2014.09.021
Pekel, J. -F., Cottam, A., Gorelick, N., & Belward, A. S. (2016). High-resolution mapping of global surface water and its long-term changes. Nature, 540(7633), 418–422. https://doi.org/10.1038/nature20584
Peters, J., De Baets, B., Verhoest, N. E., Samson, R., Degroeve, S., De Becker, P., & Huybrechts, W. (2007). Random forests as a tool for ecohydrological distribution modelling. Ecological Modelling, 207(2–4), 304–318. https://doi.org/10.1016/j.ecolmodel.2007.05.011
Polishchuk, Y., Bogdanov, A. N., Muratov, I. N., Polishchuk, Vy., Lim, A., Manasypov, R. M., Shirokova, L. S., & Pokrovsky, O. S. (2018). Minor contribution of small thaw ponds to the pools of carbon and methane in the inland waters of the permafrost-affected part of the Western Siberian Lowland. Environmental Research Letters, 13(4), 045002.
Propastin, P. A. (2008). Simple model for monitoring Balkhash Lake water levels and Ili River discharges: Application of remote sensing. Lakes & Reservoirs: Research & Management, 13(1), 77–81. https://doi.org/10.1111/j.1440-1770.2007.00354.x
Rithin Paul Reddy, K., Srija, S. S., Karthi, R., & Geetha, P. (2020). Evaluation of water body extraction from satellite images using open-source tools. In S.M. Thampi, L. Trajkovic, S. Mitra, P. Nagabhushan, J. Mukhopadhyay, J.M. Corchado, S. Berretti, & D. Mishra (Eds.), Intelligent Systems, Technologies and Applications (pp. 129–140). Springer. https://doi.org/10.1007/978-981-13-6095-4_10
Rokni, K., Ahmad, A., Solaimani, K., & Hazini, S. (2015). A new approach for surface water change detection: Integration of pixel level image fusion and image classification techniques. International Journal of Applied Earth Observation and Geoinformation, 34, 226–234. https://doi.org/10.1016/j.jag.2014.08.014
Romanovsky, V., Isaksen, K., Drozdov, D., Anisimov, O., Instanes, A., Leibman, M., McGuire, A. D., Shiklomanov, N., Smith, S., & Walker, D. (2017). In Snow, Water, Ice and Permafrost in the Arctic (SWIPA) 2017 (pp. 65–102). AMAP, Oslo.
Schneider von Deimling, T., Grosse, G., Strauss, J., Schirrmeister, L., Morgenstern, A., Schaphoff, S., Meinshausen, M., & Boike, J. (2015). Observation-based modelling of permafrost carbon fluxes with accounting for deep carbon deposits and thermokarst activity. Biogeosciences, 12(11), 3469–3488. https://doi.org/10.5194/bg-12-3469-2015
Șerban, R.-D., Jin, H., Șerban, M., Luo, D., Wang, Q., Jin, X., & Ma, Q. (2020). Mapping thermokarst lakes and ponds across permafrost landscapes in the Headwater Area of Yellow River on northeastern Qinghai-Tibet Plateau. International Journal of Remote Sensing, 41(18), 7042–7067. https://doi.org/10.1080/01431161.2020.1752954
Serikova, S., Pokrovsky, O. S., Laudon, H., Krickov, I. V., Lim, A. G., Manasypov, R. M., & Karlsson, J. (2019). High carbon emissions from thermokarst lakes of Western Siberia. Nature Communications, 10(1), 1–7. https://doi.org/10.1038/s41467-019-09592-1
Serreze, M. C., Walsh, J. E., Chapin, F. S., Osterkamp, T., Dyurgerov, M., Romanovsky, V., Oechel, W. C., Morison, J., Zhang, T., & Barry, R. G. (2000). Observational evidence of recent change in the northern high-latitude environment. Climatic Change, 46(1), 159–207. https://doi.org/10.1023/A:1005504031923
Shelestov, A., Lavreniuk, M., Kussul, N., Novikov, A., & Skakun, S. (2017). Exploring Google Earth Engine platform for big data processing: Classification of multi-temporal satellite imagery for crop mapping. Frontiers in Earth Science, 5, 17. https://doi.org/10.3389/feart.2017.00017
Shetty, S., Gupta, P. K., Belgiu, M., & Srivastav, S. K. (2021). Assessing the effect of training sampling design on the performance of machine learning classifiers for land cover mapping using multi-temporal remote sensing data and Google Earth Engine. Remote Sensing, 13(8), 1433. https://doi.org/10.3390/rs13081433
Shi, L., Ling, F., Foody, G. M., Chen, C., Fang, S., Li, X., Zhang, Y., & Du, Y. (2019). Permanent disappearance and seasonal fluctuation of urban lake area in Wuhan, China monitored with long time series remotely sensed images from 1987 to 2016. International Journal of Remote Sensing, 40(22), 8484–8505. https://doi.org/10.1080/01431161.2019.1612119
Smith, L. C., Sheng, Y., MacDonald, G. M., & Hinzman, L. D. (2005). Disappearing Arctic Lakes. Science, 308(5727), 1429. https://doi.org/10.1126/science.1108142
Song, C., Huang, B., Ke, L., & Richards, K. S. (2014). Remote sensing of alpine lake water environment changes on the Tibetan Plateau and surroundings: A review. ISPRS Journal of Photogrammetry and Remote Sensing, 92, 26–37. https://doi.org/10.1016/j.isprsjprs.2014.03.001
Strauss, J., Schirrmeister, L., Grosse, G., Wetterich, S., Ulrich, M., Herzschuh, U., & Hubberten, H.-W. (2013). The deep permafrost carbon pool of the Yedoma region in Siberia and Alaska. Geophysical Research Letters, 40(23), 6165–6170. https://doi.org/10.1002/2013GL058088
Suthaharan, S. (2016). Support Vector Machine. In S. Suthaharan (Ed.), Machine learning models and algorithms for big data classification: Thinking with examples for effective learning (pp. 207–235). Springer US. https://doi.org/10.1007/978-1-4899-7641-3_9
Szabó, L., Deák, B., Bíró, T., Dyke, G. J., & Szabó, S. (2020). NDVI as a proxy for estimating sedimentation and vegetation spread in artificial lakes—monitoring of spatial and temporal changes by using satellite images overarching three decades. Remote Sensing, 12(9), 1468. https://doi.org/10.3390/rs12091468
Tulbure, M. G., & Broich, M. (2019). Spatiotemporal patterns and effects of climate and land use on surface water extent dynamics in a dryland region with three decades of Landsat satellite data. The Science of the Total Environment, 658, 1574–1585. https://doi.org/10.1016/j.scitotenv.2018.11.390
Verpoorter, C., Kutser, T., Seekell, D. A., & Tranvik, L. J. (2014). A global inventory of lakes based on high-resolution satellite imagery. Geophysical Research Letters, 41(18), 6396–6402. https://doi.org/10.1002/2014GL060641
Vorpahl, P., Elsenbeer, H., Märker, M., & Schröder, B. (2012). How can statistical models help to determine driving factors of landslides? Ecological Modelling, 239, 27–39. https://doi.org/10.1016/j.ecolmodel.2011.12.007
Wangchuk, S., & Bolch, T. (2020). Mapping of glacial lakes using Sentinel-1 and Sentinel-2 data and a random forest classifier: Strengths and challenges. Science of Remote Sensing, 2, 100008. https://doi.org/10.1016/j.srs.2020.100008
Wang, Y., Li, Z., Zeng, C., Xia, G.-S., & Shen, H. (2020). An urban water extraction method combining deep learning and Google Earth Engine. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 769–782. https://doi.org/10.1109/JSTARS.2020.2971783
Wang, F., Wang, X., Zhao, Y., & Yang, Z. (2014). Temporal variations of NDVI and correlations between NDVI and hydro-climatological variables at Lake Baiyangdian, China. International Journal of Biometeorology, 58(7), 1531–1543. https://doi.org/10.1007/s00484-013-0758-4
Wang, T., Yang, D., Fang, B., Yang, W., Qin, Y., & Wang, Y. (2019). Data-driven mapping of the spatial distribution and potential changes of frozen ground over the Tibetan Plateau. The Science of the Total Environment, 649, 515–525. https://doi.org/10.1016/j.scitotenv.2018.08.369
Wu, Y., Duguay, C. R., & Xu, L. (2021). Assessment of machine learning classifiers for global lake ice cover mapping from MODIS TOA reflectance data. Remote Sensing of Environment, 253, 112206. https://doi.org/10.1016/j.rse.2020.112206
Xie, H., Luo, X., Xu, X., Pan, H., & Tong, X. (2016). Evaluation of Landsat 8 OLI imagery for unsupervised inland water extraction. International Journal of Remote Sensing, 37(8), 1826–1844. https://doi.org/10.1080/01431161.2016.1168948
Xu, H. (2006). Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. International Journal of Remote Sensing, 27(14), 3025–3033. https://doi.org/10.1080/01431160600589179
Yamazaki, D., Ikeshima, D., Neal, J. C., O’Loughlin, F., Sampson, C. C., Kanae, S., & Bates, P. D. (2017). MERIT DEM: A new high-accuracy global digital elevation model and its merit to global hydrodynamic modeling. H12C-04. https://ui.adsabs.harvard.edu/abs/2017AGUFM.H12C.04Y
Yang, Y., Yang, D., Wang, X., Zhang, Z., & Nawaz, Z. (2021). Testing accuracy of land cover classification algorithms in the Qilian Mountains based on GEE cloud platform. Remote Sensing, 13(24), 5064. https://doi.org/10.3390/rs13245064
Yang, X., Zhao, S., Qin, X., Zhao, N., & Liang, L. (2017). Mapping of urban surface water bodies from Sentinel-2 MSI imagery at 10 m resolution via NDWI-based image sharpening. Remote Sensing, 9(6), 596. https://doi.org/10.3390/rs9060596
Yasmin, F., Sarowar Sattar, A. H. M., & Kumar Paul, M. (2019). Water bodies identification in Landsat 8 OLI image using machine learning. 2019 22nd International Conference on Computer and Information Technology (ICCIT), 1–6. https://doi.org/10.1109/ICCIT48885.2019.9038562
Zandt, M. H., Liebner, S., & Welte, C. U. (2020). Roles of thermokarst lakes in a warming world. Trends in Microbiology, 28(9), 769–779. https://doi.org/10.1016/j.tim.2020.04.002
Zhai, K., Wu, X., Qin, Y., & Du, P. (2015). Comparison of surface water extraction performances of different classic water indices using OLI and TM imageries in different situations. Geo-Spatial Information Science, 18(1), 32–42. https://doi.org/10.1080/10095020.2015.1017911
Zhang, W., Tan, G., Zheng, S., Sun, C., Kong, X., & Liu, Z. (2018). Land cover change detection in urban lake areas using multi-temporary very high spatial resolution aerial images. Water, 10(2), 1. https://doi.org/10.3390/w10020001
Zhao, D., Cai, Y., Jiang, H., Xu, D., Zhang, W., & An, S. (2011). Estimation of water clarity in Taihu Lake and surrounding rivers using Landsat imagery. Advances in Water Resources, 34(2), 165–173. https://doi.org/10.1016/j.advwatres.2010.08.010
Zhao, Y., Shen, Q., Wang, Q., Yang, F., Wang, S., Li, J., Zhang, F., & Yao, Y. (2020). Recognition of water colour anomaly by using Hue Angle and Sentinel 2 image. Remote Sensing, 12(4), 716. https://doi.org/10.3390/rs12040716
Zwoliński, Z. (2004). Geoindicators. In A. S. Goudie (Ed.), Encyclopedia of Geomorphology (pp. 418–420). Routledge.
Zwoliński, Z., Kostrzewski, A., & Rachlewicz, G. (2008). In Environmental Changes and Geomorphic Hazards (pp. 23–36). Bookwell, New Delhi.
