Research Article

D-FusionNet: road extraction from remote sensing images using dilated convolutional block

Article: 2270806 | Received 27 Jun 2023, Accepted 10 Oct 2023, Published online: 25 Oct 2023

ABSTRACT

Deep learning techniques have been applied to extract road areas from remote sensing images, leveraging their efficient and intelligent advantages. However, the contradiction between the effective receptive field and coverage range, as well as the conflict between the depth of the network and the density of geographic information, hinders further improvement in extraction accuracy. To address these challenges, we propose a novel semantic segmentation network called D-FusionNet. D-FusionNet integrates the Dilated Convolutional Block (DCB) module, which serves as a technique for expanding the receptive field and mitigating feature loss, resembling a residual mechanism during encoding. We evaluate the extraction capability of D-FusionNet using GF-2 (Gaofen-2) satellite datasets and Massachusetts aerial photography datasets. The experimental results demonstrate that D-FusionNet performs well in road extraction tasks. Compared to FCN, UNet, LinkNet, D-LinkNet, and FusionNet, D-FusionNet achieves an average improvement of 5.35% in F1-score, 7.12% in IoU (Intersection over Union), and 5.61% in MCC (Matthews Correlation Coefficient) on the GF-2 dataset. For the Massachusetts dataset, there is an average improvement of 2.48% in F1-score, 3.25% in IoU, and 2.25% in MCC. This study provides valuable support for road extraction from remote sensing images.

1. Introduction

The rapid growth of road infrastructure has brought new challenges for extracting and updating road network information (Dai et al. Citation2020). The traditional human-computer interactive vectorization approach to road information extraction is time-consuming and costly (Jia et al. Citation2021) and is increasingly failing to meet production needs. This calls for the development of intelligent and efficient road network extraction algorithms (Liu et al. Citation2020). With the development of segmentation models such as FCN (Long, Shelhamer, and Darrell Citation2015), U-Net (Ronneberger, Fischer, and Brox Citation2015), LinkNet (Chaurasia and Culurciello Citation2017), and DeepLab (Chen et al. Citation2014, Citation2017, Citation2018), Convolutional Neural Networks (CNNs) have become a widely used method for road extraction from remote sensing images. These models have improved the accuracy and efficiency of road extraction and attracted wide interest.

Recently, the focus of road extraction from remote sensing images has shifted from applying various CNN models to optimizing network structures based on specific requirements of road scenes (Abdollahi et al. Citation2020): (1) The Generative Adversarial Networks (GAN) can effectively alleviate the overfitting phenomenon caused by the lack of road data (Goodfellow et al. Citation2014; Mansourifar et al. Citation2022). For example, based on the traditional GAN, Chen et al. (Citation2022) proposed an improved conditional GAN (NIGAN, neighborhood probability enhancement and improved conditional generative adversarial network) framework for extracting information from sparsely sampled mountainous roads. Alternatively, to reduce the network’s dependence on high-quality datasets during training, Chen et al. (Citation2022) put forward a semi-weak GAN (SW-GAN) approach. (2) The proper utilization of attention mechanisms in networks can provide an additional boost to the network’s focus on road areas (Vaswani et al. Citation2017). For instance, Li et al. (Citation2022) introduced a cascaded attention-enhanced architecture that utilized spatial attention residual blocks to address the problem of jagged road boundaries, which preserved and refined the boundary information while capturing road morphology more flexibly. In a similar vein, Dai et al. (Citation2023) designed a road augmentation module (RAM) based on deformable convolution and proposed the Road-Augmented Deformable Attention Network (RADANet) to accomplish road extraction tasks. (3) Graph convolution takes into account both the neighborhood and self-information of nodes, expanding the receptive field without sacrificing location information. This approach effectively addresses the challenges posed by the loss of location information and the need for global context information. For instance, Cui et al. (Citation2022) integrated the outputs of graph convolutional networks (GCNs) and deep convolutional neural networks (DCNNs) through parallel computation, effectively tackling the incomplete and discontinuous problems encountered in road extraction. This idea is also reflected in the work of Zhou et al. (Citation2021), where they cleverly designed a novel road extraction framework called a Split Depth-Wise (DW) Separable Graph Convolutional Network (SGCN). The SGCN achieved outstanding results on the Massachusetts roads dataset as well as their own dataset. (4) Additionally, optimizing road extraction results can be achieved by changing the hierarchical connectivity of the network, incorporating residual structures, or utilizing operations such as dilated convolutions. For example, Cheng et al. (Citation2017) proposed a method for automatic road detection and centerline extraction based on a cascade end-to-end convolutional neural network. Zhou et al. (Citation2018) employed pre-training and dilated convolutions to enhance the feature extraction capability of road networks. Zhang et al. (Citation2018) developed a deep residual U-Net model that reduced feature loss through the use of dense connections and residual structures. He et al. (Citation2019) improved road extraction performance by introducing the Atrous Spatial Pyramid Pooling (ASPP) module. Zhang and Wang (Citation2019) adopted the JointNet model for segmenting roads and buildings, which employed dense connections to enhance interconnectivity between different layer feature maps and utilized dilated convolutions to improve operational efficiency. Abdollahi et al. 
(Citation2022) addressed the challenge of road disconnection by combining boundary learning with a recurrent residual convolutional neural network, taking into account road connectivity. Yang et al. (Citation2022) designed a denser network (SDUNet) that introduced DULR modules to capture richer road features.

However, two challenges require further exploration in road extraction tasks. The first is the contradiction between the effective receptive field and the coverage range (Jiang, Zhong, and Zhang Citation2023). CNNs are mainly designed for general multi-class segmentation tasks, such as U-Net for biomedical image segmentation and FCN, SegNet, and DeepLab for segmentation on public datasets (e.g. Pascal VOC). When CNNs are used to extract roads from remote sensing images, the limited receptive fields for feature encoding may restrict the learning of road shape features and consequently impact extraction accuracy (Liu et al. Citation2019). The second challenge arises from the conflict between the depth of the network and the density of geographic information (Lu et al. Citation2022). The encoder may blur the distinction between foreground and background in remote sensing images. Therefore, it is necessary to employ feature-preserving methods during encoding and to select a more compact network to minimize the loss of features.

To address these two challenges, a new semantic segmentation network (D-FusionNet) is proposed to extract the road networks. The main contributions of the study are summarized as follows:

  1. The Dilated Convolutional Block (DCB) module was utilized as a feature buffering mechanism. Five DCBs were constructed with dilation rates of {1, 2, 4, 8} to fuse features from different layers and achieve a dense feature representation. By enhancing the network’s ability to detect road regions and reducing feature loss during forward propagation, the DCB module improves the performance of segmentation networks.

  2. A novel semantic segmentation network called D-FusionNet was proposed for road extraction. D-FusionNet combines the DCB and FusionNet to expand the receptive field and minimize feature loss, showing better performance than other CNN models for road extraction from remote sensing images.

  3. To evaluate the extraction performance of our network, we utilized two datasets: the GF-2 dataset and the Massachusetts dataset. Our experiments encompassed ablation and comparative studies. In the discussion, we examined four potential factors, both qualitatively and quantitatively, that could impact road extraction. Additionally, we evaluated D-FusionNet's spatial transferability.

2. Methods

2.1 Dataset

To evaluate the performance of the proposed D-FusionNet model, both an in-house GF-2 dataset and the publicly available Massachusetts dataset were collected in this study. The GF-2 satellite, which is part of the China High-resolution Earth Observation System, carries various remote sensing payloads such as panchromatic and multispectral sensors. We selected three GF-2 images from Luxian County in the south of the Sichuan Basin, China, to extract road information. The dataset was manually annotated and consisted of 325 training samples and 9 test samples (Figure 1). Detailed parameters of these images are provided in Table 1. The area includes low mountains, large hills, medium hills with narrow valleys, low hills with wide valleys, and valley terraces. The terrain is higher in the northeast and lower in the southwest, with a height difference of 539.7 meters. Luxian County has well-developed transportation facilities. The Massachusetts dataset is an aerial imagery dataset widely used in road extraction research from remote sensing images. It includes 1108 training samples, 14 validation samples, and 49 test samples. This dataset is known for its high-quality images and labels (Mnih Citation2013).

Figure 1. The GF-2 dataset (a to j represent data from the GF-2 dataset. We used white color to depict the annotated road areas by overlaying road labels onto image tiles).


Table 1. The parameters of GF-2 images used in this study.

2.2 Dilated convolutional block

Roads in remote sensing images typically appear as thin linear structures with gray or earthy-yellow colors. They exhibit consistent texture and radiation characteristics due to the use of hard surface materials such as asphalt and cement, as well as their relative flatness compared to the surrounding environment. To effectively capture road features, exploring different convolution approaches is a valuable perspective. With the advancement of CNNs, convolutional kernels have become increasingly diverse. For example, in the GCN (Global Convolutional Network), the conventional n × n convolution kernel is split into two strip kernels with sizes of 1 × n and n × 1 (Peng et al. Citation2017). MobileNet (v1, v2) (Howard et al. Citation2017; Sandler et al. Citation2018) adopts depthwise separable convolution, which preserves features while reducing the number of parameters to roughly 1/8 of those used in ordinary convolutions. Among the various kernels, dilated convolution is widely applied in segmentation tasks that require capturing large-scale targets; for instance, it is used in networks such as ConDINet++ (Yang et al. Citation2022) and D-LinkNet (Zhou, Zhang, and Wu Citation2018).
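As a quick sanity check of the roughly 1/8 parameter reduction quoted above for depthwise separable convolution, the short sketch below counts the weights of a standard 3 × 3 convolution against its depthwise separable counterpart; the 256-channel example and function names are ours, for illustration only.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel plus a 1 x 1 pointwise projection."""
    return k * k * c_in + c_in * c_out

print(standard_conv_params(3, 256, 256))        # 589824
print(depthwise_separable_params(3, 256, 256))  # 67840, roughly 1/8.7 of the standard count
```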

Dilated convolution enhances the receptive field by adjusting the spacing between the effective units of the convolutional kernel compared with regular convolution. The equivalent receptive field of the n-th dilated convolution is calculated as (Krizhevsky, Sutskever, and Hinton Citation2012):

(1) $r_n = r_{n-1} + (K - 1)\prod_{i=1}^{n-1} S_i, \quad K = (k - 1)(d - 1) + k$

where $r_n$ and $r_{n-1}$ are the receptive fields of the current and previous layers, respectively; $K$ is the size of the equivalent convolutional kernel of the current layer, which can be calculated from the convolutional kernel size $k$ and the dilation rate $d$; and $S_i$ is the stride of layer $i$. When dilated convolution is considered as a regular convolution with multiple "holes," it still adheres to the principles of convolution calculation (Simonyan and Zisserman Citation2014):

(2) $(I * K)(i, j) = \sum_{m}\sum_{n} I(i + m, j + n)\, K(m, n)$

where $I$ is the input feature map, $(i, j)$ is a position on the feature map, and $K(m, n)$ is the weight of the convolutional kernel at the corresponding position in a given dimension.

However, the features obtained solely through a single dilated convolution are discontinuous, because the "holes" in dilated convolution cannot be learned. Therefore, in this study we employ a DCB consisting of four dilated convolutions with different dilation rates. The module is structured in a series-parallel combination (Figure 2). When the feature map enters the DCB, dilated convolutions with dilation rates of {1, 2, 4, 8} and equivalent kernel sizes of {3, 5, 9, 17} generate receptive fields of {3, 7, 15, 31}, enabling the detection and capture of more extensive features. It is worth noting that the DCB was already introduced in D-LinkNet and is not a novel module.
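The kernel sizes and receptive fields quoted above can be verified directly from Eq. (1). The following short Python sketch evaluates the recursion for a stack of 3 × 3 dilated convolutions with dilation rates {1, 2, 4, 8} and unit strides; the function and variable names are ours, for illustration only.

```python
def equivalent_kernel(k, d):
    """Equivalent kernel size of a dilated convolution, K = (k - 1)(d - 1) + k."""
    return (k - 1) * (d - 1) + k

def stacked_receptive_field(kernel=3, dilations=(1, 2, 4, 8), stride=1):
    """Receptive field after each layer in the stack, following Eq. (1)."""
    r, jump = 1, 1          # jump = product of strides of all previous layers
    kernels, fields = [], []
    for d in dilations:
        K = equivalent_kernel(kernel, d)
        r = r + (K - 1) * jump
        jump *= stride
        kernels.append(K)
        fields.append(r)
    return kernels, fields

print(stacked_receptive_field())
# -> ([3, 5, 9, 17], [3, 7, 15, 31]), matching the values quoted in the text
```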

Figure 2. Dilated convolutional block (the green square represents the learning region of dilated convolution, the white square represents the “hole” of dilated convolution, which refers to the gaps between the convolutional units. The blue shadows at the bottom of the figure depict the receptive field).


To ensure that the feature maps extracted by each dilated convolution in DCB can be integrated, padding is introduced when performing the dilated convolution (Yu and Koltun Citation2015):

(3) $O = \left\lfloor \dfrac{I + 2p - k - (k - 1)(d - 1)}{s} \right\rfloor + 1$

where $I$ denotes the input size; $O$ denotes the output size; $k$ denotes the kernel size; $d$ denotes the dilation rate; $s$ denotes the stride; and $p$ denotes the padding. The output feature maps are guaranteed to have the same size as the input when the padding is set equal to the dilation rate.
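As a concrete illustration, below is a minimal PyTorch sketch of a DCB-style series-parallel block in the spirit of the D-LinkNet center block the text references: four 3 × 3 dilated convolutions with rates {1, 2, 4, 8} applied in series, with the input and every intermediate output summed in parallel, and padding set equal to the dilation rate so that Eq. (3) leaves the feature-map size unchanged. Class and argument names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Minimal sketch of a series-parallel DCB (not the authors' exact code)."""

    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        # Padding equals the dilation rate, so each branch preserves the
        # spatial size of its input (Eq. 3 with k = 3, s = 1, p = d).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          dilation=r, padding=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )

    def forward(self, x):
        out = x           # identity path, acting as residual-like feature buffering
        feat = x
        for branch in self.branches:
            feat = branch(feat)   # series path: receptive field grows 3 -> 7 -> 15 -> 31
            out = out + feat      # parallel path: every intermediate depth is fused
        return out


if __name__ == "__main__":
    block = DilatedConvBlock(channels=64)
    y = block(torch.randn(1, 64, 128, 128))
    print(y.shape)   # torch.Size([1, 64, 128, 128]) -- spatial size preserved
```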

2.3 D-FusionNet model

D-FusionNet, a CNN based on the FusionNet model, is utilized for road extraction. FusionNet is a UNet-based CNN with denser connections, which prevents feature degradation during the feature extraction process (Quan, Hildebrand, and Jeong Citation2021). In D-FusionNet, feature maps of different levels are fused directly in the feature-map dimension, without splitting and merging channel dimensions, and the DCB is embedded into D-FusionNet to integrate feature information, with dilation rates of {1, 2, 4, 8} selected to achieve dense semantic information expression.

Figure 3 illustrates the structure of D-FusionNet. The network comprises an encoder and a decoder separated by a bridge layer, which facilitate the downsampling and upsampling processes. In the encoder, the image is compressed into a smaller size through four downsampling layers and four internally embedded DCB modules (He et al. Citation2016). Each downsampling layer consists of a series of convolutional operations and a residual structure; these layers not only extract features but also compress the input image size. During the processing of feature maps, D-FusionNet feeds the features collected by the current downsampling layer into the DCB for secondary processing and integration. The resulting densely packed features are then transmitted to the next downsampling layer. The bridge layer functions similarly to the downsampling layers but only expands the channel dimension without altering the spatial size. In the decoder, image restoration is achieved through four upsampling operations, each comprising deconvolution, feature-map merging, a residual structure, and convolutions (Zeiler et al. Citation2010; Zeiler, Taylor, and Fergus Citation2011). During upsampling, deconvolution progressively increases the size of the feature maps, restoring them to a size similar to that of the input image.
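Based only on the description above, the following is a hypothetical sketch of how one encoder stage might chain a residual convolution group, the DCB (reusing the DilatedConvBlock sketch from Section 2.2), and downsampling. Channel widths, the pooling choice, and where the decoder skip is taken are our assumptions and may differ from the authors' configuration.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """Hypothetical sketch of one D-FusionNet encoder stage (not the authors' code)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_in = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.res_block = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch))
        self.dcb = DilatedConvBlock(out_ch)   # the DCB sketch from Section 2.2
        self.pool = nn.MaxPool2d(2)           # assumed 2x2 downsampling

    def forward(self, x):
        x = self.conv_in(x)
        x = torch.relu(x + self.res_block(x))   # residual structure of the downsampling layer
        skip = self.dcb(x)                       # secondary processing / feature buffering
        return self.pool(skip), skip             # compressed map for the next stage, skip for the decoder
```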

Figure 3. Structure of D-FusionNet. (The encoder-decoder in this figure is composed of different modules, represented by the rectangular blocks of various colors. The dashed lines indicate the specific operations performed by each block, while the solid lines represent the connections within the network. The rectangles of different colors correspond to the various standard operations in CNNs, and the group of gray rectangles represents the current feature map).


In theory, the DCB in D-FusionNet can optimize the network in three ways. Firstly, it gradually expands the receptive field, mitigating the gridding effect produced by conventional dilated convolutions. Secondly, the "holes" in the DCB do not possess learning ability, resulting in higher parameter efficiency compared to methods that expand the receptive field with large convolutions. Lastly, the DCB acts as a feature buffer zone and enhances the connection between feature layers through its series-parallel mechanism, which may reduce feature loss during data encoding.

2.4 Evaluation criteria

Three performance indexes, namely F1-Score, MCC, and IoU, were utilized to evaluate the effectiveness of the proposed method. The F1-score represents the harmonic mean of precision and recall, while IoU measures the intersection-over-union ratio between the ground truth and extracted results. MCC is a correlation coefficient that measures the differences between predicted results and labels. These indexes can be expressed as (Martin, Fowlkes, and Malik Citation2004; Taran et al. Citation2018):

(4) $\text{precision} = \dfrac{TP}{TP + FP}$
(5) $\text{recall} = \dfrac{TP}{TP + FN}$
(6) $F1 = \dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
(7) $IoU = \dfrac{TP}{TP + FP + FN}$
(8) $MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

where TP represents the number of pixels extracted as road and labeled as road; TN represents the number of pixels extracted as background and labeled as background; FP represents the number of pixels extracted as road but labeled as background; and FN represents the number of pixels extracted as background but labeled as road.
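For reference, the three indexes follow directly from the pixel-level confusion counts defined above; the small sketch below implements Eqs. (4)-(8), with function names and example counts of our own choosing.

```python
import math

def road_metrics(tp, fp, tn, fn):
    """F1-score, IoU, and MCC from pixel-level confusion counts (Eqs. 4-8)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, iou, mcc

# Example with arbitrary pixel counts (not taken from the paper's tables)
print(road_metrics(tp=9000, fp=1200, tn=88000, fn=1800))
```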

2.5 Experiment procedures

The study utilized PyTorch to establish the experimental environment, encompassing the network structure, data loading, training, and testing. The experiments were carried out on a GPU with CUDA support to improve computational efficiency. The server was equipped with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The overall process of performing the road extraction task with D-FusionNet is illustrated in Figure 4.

Figure 4. The flowchart of the experimental process (The sources of a-1 and b-1 are the Massachusetts dataset and the GF-2 dataset, respectively. The extraction results can be seen in a-2 and b-2. c demonstrates the spatial transferability of D-FusionNet).


The Adam optimization method (Kingma and Ba Citation2014) and the cross-entropy loss function were employed to improve the efficiency and accuracy of training in this study. The Adam optimization method is a commonly used adaptive algorithm in deep learning that combines momentum and adaptive learning rates. Its core idea is to use first and second moment estimates of the gradients to adjust the learning rates and momentum parameters for more efficient updates of the model parameters. Cross-entropy measures the difference between the dataset labels and the predictions of the network. The loss over a batch can be expressed as (Ho and Samuel Citation2020):

(9) $\mathrm{Loss} = -\dfrac{1}{M}\sum_{i=1}^{M}\left[ y_i \log a_i + (1 - y_i)\log(1 - a_i) \right]$

where $M$ is the batch size; $y_i$ is the true label; and $a_i$ is the predicted value.
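The optimization setup described above can be sketched in PyTorch as follows, pairing the Adam optimizer with a binary cross-entropy loss matching Eq. (9). The network is assumed to end with a sigmoid so its per-pixel outputs lie in [0, 1]; the learning rate and function names are placeholders, not the authors' settings.

```python
import torch
import torch.nn as nn

def make_training_objects(model: nn.Module, lr: float = 1e-4):
    """Pair the network with the BCE loss of Eq. (9) and the Adam optimizer (placeholder lr)."""
    criterion = nn.BCELoss()                                  # Eq. (9), averaged over the batch
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    return criterion, optimizer

def train_step(model, criterion, optimizer, images, labels):
    """One optimization step: forward pass, binary cross-entropy, Adam update."""
    optimizer.zero_grad()
    predictions = model(images)             # a_i: predicted road probabilities in [0, 1]
    loss = criterion(predictions, labels)   # y_i: binary road masks as float tensors
    loss.backward()
    optimizer.step()
    return loss.item()
```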

The ReLU (Rectified Linear Unit) and ELU (Exponential Linear Unit) were selected as activation functions to mitigate overfitting during training. The ReLU activation function enhances the nonlinear relationship between the network layers, realizes unilateral inhibition of data transmission, and ultimately achieves sparse activation. It is calculated as:

(10) $f(x) = \max(0, x)$

where $x$ represents the input and corresponds to the value of each unit in the feature map generated by the previous layer. ELU is an activation function proposed on the basis of ReLU and sigmoid. The left side of the function is softly saturated so that the activation mean tends to be balanced; therefore, the influence of bias can be reduced, and the gradient is closer to the natural gradient:

(11) $f(x) = \begin{cases} x, & x > 0 \\ \alpha(e^{x} - 1), & x \le 0 \end{cases}$

where the parameter α is a hyperparameter of ELU, which determines the saturation point for the negative values.

3. Results

3.1 Results of ablation experiments

Ablation experiments are commonly used in CNN research to analyze the influence of different module components. In this study, two sets of ablation experiments were conducted to assess the effectiveness of the proposed D-FusionNet, by modifying the dilation rates and the embedding positions of the DCB modules in the network.

3.1.1 Dilation rate

Varying dilation rates of the DCBs in the encoder were tested to analyze their effect on road extraction with the proposed D-FusionNet. Control models were formed by varying the dilation rates of the DCBs in D-FusionNet. Two sets of comparison models were created using dilation rates of {2, 4, 8, 16} and {4, 8, 16, 32}, respectively. Considering the hierarchical priority of the DCB data flow, we also inverted the dilation rates in these two models as well as in D-FusionNet, resulting in three additional comparison models. Table 2 presents the accuracy of the comparison models after modifying the dilation rates and specifies the degree of decline compared to D-FusionNet. Additionally, we utilized Grad-CAM (Gradient-weighted Class Activation Mapping) to illustrate the attention of D-FusionNet and the ablation models on road regions in the different datasets (Figure 5) (Selvaraju et al. Citation2017; Zhou et al. Citation2016).
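For readers unfamiliar with Grad-CAM, the sketch below shows the generic hook-based procedure for obtaining such class-activation maps from a convolutional layer: the layer's activations are weighted by the spatially averaged gradients of the target score. The target layer, model handle, and the choice of summing the road score over all pixels are illustrative assumptions, not the exact setup used for Figures 5 and 6.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Generic Grad-CAM: weight a layer's activations by its spatially averaged gradients."""
    activations, gradients = {}, {}

    def fwd_hook(_module, _inputs, output):
        activations["value"] = output

    def bwd_hook(_module, _grad_input, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    score = model(image).sum()          # aggregate road score over all pixels (assumption)
    model.zero_grad()
    score.backward()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)              # GAP of gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    h1.remove(); h2.remove()
    return cam / (cam.max() + 1e-8)     # normalized heat map in [0, 1]
```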

Figure 5. Grad-CAM analysis of ablation models in the dilation rate experiment (ablation I, II, III, IV, and V represent the ablation models obtained by replacing the dilation rates in the DCB with {2, 4, 8, 16}, {4, 8, 16, 32}, {32, 16, 8, 4}, {16, 8, 4, 2}, and {8, 4, 2, 1}, respectively).


Table 2. Ablation experiments by modifying the dilation rates.

The results reveal that the DCBs with {1, 2, 4, 8} dilation rates demonstrate the best performance, while modifying the dilation rates of the DCBs leads to a decrease in F1-score, IoU, and MCC. On the GF-2 dataset, F1-score, IoU, and MCC decrease on average by 0.73%, 1.02%, and 0.76%, respectively. The largest decrease occurs when the dilation rates of the DCBs are doubled and reversed, resulting in decreases of 1.36%, 1.91%, and 1.43% for F1-score, IoU, and MCC, respectively (the fifth row in Table 2). The smallest decrease occurs when the dilation rates of the DCBs in D-FusionNet are simply reversed without changing their values, resulting in decreases of 0.26%, 0.36%, and 0.27% for F1-score, IoU, and MCC, respectively (the last row in Table 2). Similar phenomena are also observed on the Massachusetts dataset.

3.1.2 Embedding positions

The influence of the DCB embedding position on the network's extraction capability was also analyzed. In this experiment, the dilation rates of the DCBs were set to {1, 2, 4, 8}. Initially, the DCB in the first downsampling layer of D-FusionNet was removed while preserving the DCBs in the other layers. Subsequently, the modules in the second, third, fourth, and bridge layers underwent the same removal process, creating four further control models. The accuracy of the comparison models with modified embedding positions and the degree of decline compared to D-FusionNet are illustrated in Figure 6 and Table 3.

Figure 6. The Grad-CAM analysis of ablation models in the embedding position experiment. Ablation I, II, III, IV, and V represent the ablation models obtained by removing the DCB from the first, second, third, fourth, and bridge layers, respectively.


Table 3. Ablation experiments by modifying the embedding positions.

The results indicate that embedding DCBs at all positions {1, 2, 3, 4, 5} achieves the best performance. Taking the Massachusetts dataset as an example, F1-score, IoU, and MCC decrease on average by 0.63%, 0.84%, and 0.37%, respectively. The largest drop occurs when removing the DCB in the second layer, resulting in decreases of 0.90%, 1.20%, and 0.73% for F1-score, IoU, and MCC, respectively (the third row in Table 3). The smallest decrease occurs when the DCB in the third layer is removed, resulting in decreases of 0.25%, 0.34%, and 0.16% for F1-score, IoU, and MCC, respectively (the fourth row in Table 3). Figure 6 provides a visual representation of the attention to the road area for both D-FusionNet and the control models.

3.2 Comparing D-FusionNet with other CNNs

This study compares the performance of D-FusionNet with several other CNNs, including FCN, UNet, LinkNet, D-LinkNet, and FusionNet. UNet and FusionNet are complex models derived from FCN, while LinkNet and D-LinkNet are lightweight models. Table 4 shows the comparison of D-FusionNet with the other five networks. The results indicate that D-FusionNet outperforms the other networks in all three evaluation indexes.

Table 4. Comparisons among FCN, UNet, LinkNet, D-LinkNet, FusionNet, and proposed D-FusionNet models.

It can be seen in Table 4 that D-FusionNet has the best performance in road extraction. For the GF-2 dataset, the comparisons with the other five models show average improvements of 5.35%, 7.12%, and 5.61% in F1-score, IoU, and MCC, respectively. For the Massachusetts dataset, the average improvements are 2.48%, 3.25%, and 2.25%, respectively.

Taking the GF-2 dataset as an example, the largest and smallest improvements were analyzed. The largest improvement is between D-FusionNet and FCN, with gains of 11.55%, 15.03%, and 12.08%, respectively. Conversely, the smallest improvement is between D-FusionNet and FusionNet, with gains of 0.66%, 0.93%, and 0.68%, respectively. FCN has sparse network connections, while FusionNet has dense network connections; it is therefore speculated that dense connections may enhance the learning ability of the network for road extraction. Additionally, the main difference between D-FusionNet and FusionNet lies in the use of the DCB module, so the incorporation of the DCB improves the extraction results.

4. Discussion

After comparing D-FusionNet with other CNNs, we summarized and analyzed four distinct experimental phenomena. In the following sections, we present a detailed analysis and explanation of each phenomenon to improve the overall understanding of road extraction tasks and D-FusionNet.

4.1 Impact of dense connection mode on road extraction

The comparisons between D-FusionNet and two networks with different connection densities (FCN and UNet) are discussed in this section, as shown in Figure 7. Both the highlighted roads (red rectangles) and the evaluation parameters (TP, FN, TN, and FP) indicate that denser networks achieve better road recognition. This phenomenon can be attributed to the fact that a dense connection pattern can effectively minimize feature loss and enhance extraction accuracy. FCN adopts a sparse connection mode and gradually merges features from the last two downsampling layers during the upsampling procedure. Starting from UNet, CNNs gradually adopted a dense connection pattern with multiple skip connections between the encoder and decoder. This dense connection approach, along with the symmetric encoder-decoder structure, has evolved into modern CNN frameworks, including FusionNet. Based on UNet, FusionNet introduces a residual structure, which enables connections within the encoder and decoder to reduce feature loss. The proposed D-FusionNet follows the same approach. During the encoding process, D-FusionNet expands the receptive field using DCBs, which further enhances the network's density. As mentioned in Section 2.2, the feature buffering mechanism of the DCB effectively decreases the feature loss during forward propagation, improving road integrity and feature extraction accuracy in practical applications.

Figure 7. The comparisons of road extraction among FCN, UNet and D-FusionNet models (for each model, the left shows the extracted road and the right shows the evaluation parameters: TP with green, FN with red, TN with black, and FP with blue. The red rectangles highlight the comparisons among FCN, UNet and D-FusionNet models).


4.2 Impact of parameter number on road extraction

By incorporating the DCB in D-FusionNet, the feature extraction accuracy is improved, but this also increases the number of network parameters. The number of additional parameters associated with the DCBs rises significantly with their depth, as shown in the following equation:

(12) $P = k^{2} C_i C_o$

where $C_i$ represents the number of input channels; $C_o$ represents the number of output channels; and $k$ represents the kernel size. The DCB module adds a parameter count equivalent to that of four 3 × 3 ordinary convolutions. The parameter number of the DCB can be further expressed as:

(13) $P_{DCB} = 36 C^{2}$

where $C$ represents the number of channels of the feature map generated by the preceding downsampling layer. The first DCB receives a feature map with 64 channels, which generates 36 × 64² = 147,456 extra parameters. As the network depth increases, the number of additional parameters grows by a factor of four or more. We compared the training times among different models. On the Massachusetts dataset, D-FusionNet completes one epoch in approximately 10.5 minutes, whereas D-LinkNet takes only 1.2 minutes and LinkNet finishes in just 45 seconds. On the GF-2 dataset, D-FusionNet completes one epoch in approximately 1.22 minutes, whereas D-LinkNet requires only 18 seconds and LinkNet finishes in just 11 seconds. Therefore, in order to mitigate feature loss and expand the receptive field, D-FusionNet sacrifices speed.
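The extra parameter count of Eq. (13) can be checked with a few lines of arithmetic. The 64-channel figure is the one cited in the text; the doubling of channels at deeper stages is an assumption for illustration only.

```python
def dcb_extra_params(channels, kernel=3, num_convs=4):
    """Additional parameters introduced by one DCB: num_convs * k^2 * C^2 (Eqs. 12-13)."""
    return num_convs * kernel ** 2 * channels ** 2

print(dcb_extra_params(64))            # 147456 = 36 * 64**2, the first DCB
for c in (64, 128, 256, 512):          # assumed channel doubling at deeper stages
    print(c, dcb_extra_params(c))      # each doubling multiplies the extra cost by 4
```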

The impact of the number of parameters on road extraction was analyzed by comparing D-FusionNet with two lightweight models (LinkNet and D-LinkNet), as shown in Figure 8. Both LinkNet and D-LinkNet retain the encoder-decoder structure and utilize ResNet as the encoder to reduce parameters, achieving lightweight segmentation. However, compared to the parameter-rich D-FusionNet, the performance of LinkNet and D-LinkNet on the multi-source platform datasets is not ideal. This difference may be attributed to the number of parameters. Therefore, striking a balance between training cost and accuracy improvement is crucial for road extraction from remote sensing images.

Figure 8. The comparisons of road extraction among LinkNet, D-LinkNet and D-FusionNet models (for each model, the left shows the extracted road and the right shows the evaluation parameters: TP with green, FN with red, TN with black, and FP with blue. The red rectangles highlight the comparisons among LinkNet, D-LinkNet and D-FusionNet models).


4.3 Impact of DCBs on road extraction

D-FusionNet enhances the performance of FusionNet through the introduction of the DCB module. Comparisons of road extraction between FusionNet and D-FusionNet were carried out to discuss the impact of the DCBs on road extraction (Figure 9). D-FusionNet exhibits a smaller number of FN than FusionNet, indicating that the main error of FusionNet comes from roads being misclassified as background. Analysis of Figure 9 reveals that the improvements of D-FusionNet primarily consist of enhancing road integrity and capturing a wider range of roads. Unlike FusionNet, D-FusionNet demonstrates the ability to connect road intersection points. This can be attributed to the feature buffering mechanism of the DCB, which effectively preserves detailed information in the image. FusionNet faces challenges in extracting some large continuous roads, whereas D-FusionNet excels in compensating for discontinuous roads. This is due to the larger receptive field of the DCB, which enables D-FusionNet to explore road areas from a broader perspective.

Figure 9. The comparisons of road extraction between the FusionNet and D-FusionNet models (for each model, the left shows the extracted road and the right shows the evaluation parameters: TP in green, FN in red, TN in black, and FP in blue. The yellow rectangles highlight the comparisons between FusionNet and D-FusionNet).


4.4 Impact of data quality on road extraction

An analysis of the extraction results of D-FusionNet reveals that it produces more FP and FN in areas with high vegetation or dense urban buildings, leading to a decrease in overall accuracy (Figure 10). This phenomenon can be attributed to object occlusion or the similarity in radiation characteristics between roads and surrounding objects. In such cases, D-FusionNet may struggle to distinguish between foreground and background, resulting in a failure to recognize roads in the image. Additionally, there is a possibility of incorrectly identifying the space between buildings as roads. Some studies have taken note of these issues, and current research is exploring "intervention" methods outside the network to further enhance the accuracy of road segmentation. For instance, some studies use data augmentation techniques to enrich the original feature information within the dataset. Others utilize auxiliary constraint tasks to address connectivity issues arising from image quality (Li et al. Citation2021). Furthermore, adaptive histogram equalization techniques have been used in certain studies to mitigate the influence of external factors (Xu et al. Citation2021). Therefore, it is recommended that constraints or image enhancement methods be introduced at the data level to constrain the road area in future research.

Figure 10. The roads that the D-FusionNet is difficult to extract.


4.5 Spatial transferability

To investigate the spatial transferability of the proposed approach, we performed a new road extraction task on an image of another scene using the model trained on the GF-2 dataset. To account for a wide range of potential issues that may arise in practical scenarios, our image selection criteria were as follows. Firstly, we prioritized images with high vegetation coverage. Secondly, we gave preference to images with a certain level of cloud coverage. Additionally, we took into account the practical significance of updating road network data and prioritized regions with low urbanization, where road network data may be updated slowly. Lastly, we prioritized images that included linear features other than roads, such as rivers. Ultimately, we selected a summer image from Enshi Tujia and Miao Autonomous Prefecture, Hubei Province, China (Product Serial Number: 4952204).

The image has a resolution of 7300 × 6908 pixels, and the results are shown in Figure 11. Observing the left side of the figure (Results), we can see that D-FusionNet successfully extracts the road areas in the image, and the extraction results exhibit good continuity. We further analyze the extraction results by comparing the extraction capabilities of D-FusionNet and FusionNet (Group I, a~h). It can be observed that D-FusionNet improves upon the extraction results of FusionNet: the road areas (blue) that FusionNet captures but D-FusionNet does not are almost non-existent, indicating that D-FusionNet inherits the extraction capabilities of FusionNet. Additionally, the red areas, which are the additional road areas obtained by D-FusionNet, effectively connect the locations that were fragmented in the FusionNet results, demonstrating that D-FusionNet optimizes the extraction of FusionNet to some extent.

Figure 11. Spatial transferability test of D-FusionNet (The figure consists of three parts: on the left is the extraction result of D-FusionNet in the summer imagery of Enshi Tujia and Miao Autonomous Prefecture, Hubei Province. The extracted road areas are highlighted in yellow. In group I, we randomly selected eight image tiles from the imagery and compared the extraction results of D-FusionNet and FusionNet. Yellow represents the common areas extracted by both networks, red indicates the additional road areas extracted by D-FusionNet compared to FusionNet, and the meaning of blue is opposite to that of red. In group II, we demonstrate several error sources of D-FusionNet. Among them, i shows the influence of rivers on network extraction, using the same annotation method as group I. j and k demonstrate the impact of cloud and fog cover on network extraction. l~o illustrate the impact of imaging obstruction from surrounding buildings, vegetation, and other factors on network extraction. The annotation method for j~o is the same as that on the left side of the figure).


However, we should also pay attention to the errors that occur during the extraction task (Group II). Firstly, rivers near roads pose challenges to the extraction process: the networks mistakenly identify some river areas with morphology similar to roads as road areas. This error is present in both FusionNet and D-FusionNet, but it is amplified in the results of D-FusionNet because of its larger detection range for capturing area continuity. Secondly, clouds above the roads also pose difficulties for extraction; roads that should appear in certain areas are covered by clouds, making it difficult for the network to infer the appearance of the occluded roads. Lastly, imaging obstruction from surrounding buildings, vegetation, and other factors has an impact similar to that of cloud cover.

We propose some data processing optimizations for these three types of errors. Firstly, for river areas, we recommend using masks to remove them from the image. Secondly, regarding the challenges posed by cloud coverage, we believe that cloud removal is an essential preprocessing step for road extraction from remote sensing images. Lastly, for imaging obstruction from surrounding buildings, vegetation, and other factors, we suggest a multi-source data fusion approach to eliminate them; for example, terrain factors can be incorporated by using a DSM (Digital Surface Model) obtained through lidar scans of the study area to constrain the learning process of the network (Ma et al. Citation2022).

5. Conclusions

In this study, we propose a novel semantic segmentation network called D-FusionNet to address the challenges in road extraction tasks. In contrast to other studies, we simultaneously consider the benefits of enlarging the receptive field and of buffering features during forward propagation for road extraction. D-FusionNet incorporates the DCB to reconstruct the encoding process and extract road information. With the inclusion of the DCB, D-FusionNet obtains a larger receptive field, addressing road discontinuity and enhancing its capability to capture roads. Additionally, similar to residual structures, the DCB facilitates tighter transmission of contextual information within D-FusionNet, reducing the feature loss caused by downsampling. The conclusions are summarized as follows:

  1. Considering the impact of feature loss in the network design is necessary. This study demonstrates that denser networks yield superior road extraction performance. This phenomenon can be attributed to the effective minimization of feature loss achieved through a dense connection pattern, which in turn improves extraction accuracy.

  2. It becomes crucial to introduce mechanisms that effectively expand the receptive field in the network. The DCB module, in addition to buffering feature losses, also expands the receptive field of D-FusionNet, enabling it to adapt to different datasets and possess spatial transferability in road extraction tasks.

Although D-FusionNet demonstrates good performance in various experiments, it has two limitations. Firstly, the inclusion of the DCB introduces additional parameters to D-FusionNet, which leads to increased training costs and potential limitations on its scalability. Secondly, there is a possibility of D-FusionNet confusing foreground and background, resulting in failures to accurately recognize roads in the image. Therefore, future studies will prioritize finding a balance between training cost and accuracy improvement, as well as exploring image enhancement methods for road extraction from remote sensing images. As our research moves toward practical implementation, we will further refine the task by taking into account factors such as the type of pavement and the road class (main, secondary, or vicinal).

Acknowledgments

The authors would like to express their sincere gratitude to the reviewers and editors for their constructive and high-quality revision suggestions on this manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The public dataset used in this study can be obtained from https://www.cs.toronto.edu/~vmnih/data/. The satellite imagery data involved in the study can be acquired from the China Center for Resources Satellite Data and Application at https://www.cresda.com/zgzywxyyzx/index.html.

Additional information

Funding

This research was funded by the National Key R&D Program of China (2020YFC1512001), Natural Science Foundation of China (NSFC) (Nos.42074040), Natural Science Basic Research Plan in Shaanxi Province of China (2023-JC-JQ-24), Innovative Talents Promotion Plan of Shaanxi Province (Grant No. 2022KJXX-22), and Fundamental Research Funds for the Central Universities of CHD (Nos: 300102262902, 300102263401, 300102262512, 300102263502).

References

  • Abdollahi, A., B. Pradhan, and A. Alamri. 2022. “SC-Roaddeepnet: A New Shape and Connectivity-Preserving Road Extraction Deep Learning-Based Network from Remote Sensing Data.” IEEE Transactions on Geoscience & Remote Sensing 60:15. https://doi.org/10.1109/tgrs.2022.3143855.
  • Abdollahi, A., B. Pradhan, N. Shukla, S. Chakraborty, and A. Alamri. 2020. “Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-Of-The-Art Review.” Remote Sensing 12 (9): 22. https://doi.org/10.3390/rs12091444.
  • Chaurasia, A., and E. Culurciello. 2017.”LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation, In: 2017 IEEE Visual Communications and Image Processing (VCIP).”Presented at the 2017 IEEE Visual Communications and Image Processing (VCIP), https://doi.org/10.1109/VCIP.2017.8305148.
  • Chen, H., S. Peng, C. Du, J. Li, and S. Wu. 2022. “SW-GAN: Road Extraction from Remote Sensing Imagery Using Semi-Weakly Supervised Adversarial Learning.” Remote Sensing 14 (17): 16. https://doi.org/10.3390/rs14174145.
  • Chen, L.-C., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. 2018. “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.” IEEE Transactions on Pattern Analysis & Machine Intelligence 40 (4): 834–17. https://doi.org/10.1109/tpami.2017.2699184.
  • Chen, L.-C., G. Papandreou, I. Kokkinos, K. P. Murphy, and A. L. Yuille. 2014. “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs.” CoRr. https://arxiv.org/abs/1412.7062.
  • Chen, L.-C., G. Papandreou, F. Schroff, and H. Adam. 2017. “Rethinking Atrous Convolution for Semantic Image Segmentation.” arXiv. https://arxiv.org/abs/1706.05587.
  • Chen, W., G. Zhou, Z. Liu, X. Li, X. Zheng, and L. Wang. 2022. “NIGAN: A Framework for Mountain Road Extraction Integrating Remote Sensing Road-Scene Neighborhood Probability Enhancements and Improved Conditional Generative Adversarial Network.” IEEE Transactions on Geoscience & Remote Sensing 60:15. https://doi.org/10.1109/tgrs.2022.3188908.
  • Cheng, G., Y. Wang, S. Xu, H. Wang, S. Xiang, and C. Pan. 2017. “Automatic Road Detection and Centerline Extraction via Cascaded End-To-End Convolutional Neural Network.” IEEE Transactions on Geoscience & Remote Sensing 55 (6): 3322–3337. https://doi.org/10.1109/tgrs.2017.2669341.
  • Cui, F., Y. Shi, R. Feng, L. Wang, and T. Zeng. 2022. “A Graph-Based Dual Convolutional Network for Automatic Road Extraction from High Resolution Remote Sensing Images.” IEEE International Geoscience and Remote Sensing Symposium(IGARSS), Kuala Lumpur, Malaysia: 3015–3018. https://doi.org/10.1109/IGARSS46834.2022.9883088.
  • Dai, J., Y. Wang, Y. Du, T. Zhu, S. Xie, C. Li, and X. Fang. 2020. “Development and Prospect of Road Extraction Method for Optical Remote Sensing Image.” Journal of Remote Sensing 24 (7): 804–823. https://doi.org/10.11834/jrs.20208360.
  • Dai, L., G. Zhang, and R. Zhang. 2023. “RADANet: Road Augmented Deformable Attention Network for Road Extraction from Complex High-Resolution Remote-Sensing Images.” IEEE Transactions on Geoscience & Remote Sensing 61:13. https://doi.org/10.1109/tgrs.2023.3237561.
  • Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. “Generative Adversarial Nets.” In Presented at the 2014 Neural Information Processing Systems (NIPS), 2672–2680. Montreal, Canada. https://arxiv.org/abs/1406.2661.
  • He, H., D. Yang, S. Wang, S. Wang, and Y. Li. 2019. “Road Extraction by Using Atrous Spatial Pyramid Pooling Integrated Encoder-Decoder Network and Structural Similarity Loss.” Remote Sensing 11 (9): 16. https://doi.org/10.3390/rs11091015.
  • He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition, In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, IEEE. 770–778. https://doi.org/10.1109/CVPR.2016.90.
  • Ho, Y., and W. Samuel. 2020. “The Real-World-Weight Cross-Entropy Loss Function: Modeling the Costs of Mislabeling.” IEEE Access 8:4806–4813. https://doi.org/10.1109/access.2019.2962617.
  • Howard, A. G., M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. 2017. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv preprint. https://doi.org/10.48550/arXiv.1704.04861.
  • Jia, J., H. Sun, C. Jiang, K. Karila, M. Karjalainen, E. Ahokas, E. Khoramshahi, et al. 2021. “Review on Active and Passive Remote Sensing Techniques for Road Extraction.” Remote Sensing 13 (21): 29. https://doi.org/10.3390/rs13214235.
  • Jiang, Y., C. Zhong, and B. Zhang. 2023. “AGD-Linknet: A Road Semantic Segmentation Model for High Resolution Remote Sensing Images Integrating Attention Mechanism, Gated Decoding Block and Dilated Convolution.” IEEE Access, 22585–22595. https://doi.org/10.1109/ACCESS.2023.3253289.
  • Kingma, D. P., and J. Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980. https://doi.org/10.48550/arXiv.1412.6980.
  • Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6): 84–90. https://doi.org/10.1145/3065386.
  • Li, S., C. Liao, Y. Ding, H. Hu, Y. Jia, M. Chen, B. Xu, X. Ge, T. Liu, and D. Wu. 2022. “Cascaded Residual Attention Enhanced Road Extraction from Remote Sensing Images.” ISPRS International Journal of Geo-Information 11 (1): 19. https://doi.org/10.3390/ijgi11010009.
  • Li, X., Z. Zhang, S. Lv, M. Pan, Q. Ma, and H. Yu. 2021. “Road Extraction from High Spatial Resolution Remote Sensing Image Based on Multi-Task Key Point Constraints.” IEEE Access, 95896–95910. https://doi.org/10.1109/ACCESS.2021.3094536.
  • Liu, R., Q. Miao, Y. Zhang, M. Gong, and P. Xu. 2020. “A Semi-Supervised High-Level Feature Selection Framework for Road Centerline Extraction.” IEEE Geoscience & Remote Sensing Letters 17 (5): 894–898. https://doi.org/10.1109/lgrs.2019.2931928.
  • Liu, Z., R. Feng, L. Wang, Y. Zhong, and L. Cao. 2019. “D-Resunet: Resunet and Dilated Convolution for High Resolution Satellite Imagery Road Extraction.” IEEE International Geoscience and Remote Sensing Symposium, 3927–3930. https://doi.org/10.1109/IGARSS.2019.8898392.
  • Long, J., E. Shelhamer, and T. Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation, In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, 3431–3440. https://doi.org/10.1109/CVPR.2015.7298965.
  • Lu, X., Y. Zhong, Z. Zheng, D. Chen, Y. Su, A. Ma, and L. Zhang. 2022. “Cascaded Multi-Task Road Extraction Network for Road Surface, Centerline, and Edge Extraction.” IEEE Transactions on Geoscience and Remote Sensing, 1–14. https://doi.org/10.1109/TGRS.2022.3165817.
  • Ma, H., H. Ma, L. Zhang, K. Liu, and W. Luo. 2022. “Extracting Urban Road Footprints from Airborne LiDar Point Clouds with PointNet++ and Two-Step Post-Processing.” Remote Sensing 14 (3): 789. https://doi.org/10.3390/rs14030789.
  • Mansourifar, H., A. Moskovitz, B. Klingensmith, D. Mintas, and S. J. Simske. 2022. “GAN-Based Satellite Imaging: A Survey on Techniques and Applications.” IEEE Access 10:118123–118140. https://doi.org/10.1109/access.2022.3221123.
  • Martin, D. R., C. C. Fowlkes, and J. Malik. 2004. “Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues.” IEEE Transactions on Pattern Analysis & Machine Intelligence 26 (5): 530–549. https://doi.org/10.1109/tpami.2004.1273918.
  • Mnih, V. 2013. “Machine Learning for Aerial Image Labeling.” Ph.D. dissertation, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada.
  • Peng, C., X. Zhang, G. Yu, G. Luo, and J. Sun. 2017. “Large Kernel Matters - Improve Semantic Segmentation by Global Convolutional Network, In: 2017 IEEE Conference on Computer Vision and Pattern Recognition(cvpr).” Presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 1743–1751. https://doi.org/10.1109/cvpr.2017.189.
  • Quan, T. M., D. G. C. Hildebrand, and W. K. Jeong. 2021. “FusionNet: A Deep Fully Residual Convolutional Neural Network for Image Segmentation in Connectomics.” Frontiers in Computer Science-Switz 3:12. https://doi.org/10.3389/fcomp.2021.613981.
  • Ronneberger, O., P. Fischer, and T. Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” In Medical Image Computing and Computer- Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, edited by N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, 234–241. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-24574-4_28.
  • Sandler, M., A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. 2018. “MobileNetv2: Inverted Residuals and Linear Bottlenecks, In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4510–4520. https://doi.org/10.1109/cvpr.2018.00474.
  • Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, In: 2017 IEEE International Conference on Computer Vision (ICCV).” Presented at the 2017 IEEE International Conference on Computer Vision (ICCV), 618–626.https://doi.org/10.1109/iccv.2017.74.
  • Simonyan, K., and A. Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” Available from: https://arxiv.org/abs/1409.1556.
  • Taran, V., N. Gordienko, Y. Kochura, Y. Gordienko, A. Rokovyi, O. Alienin, and S. Stirenko. 2018. “Performance Evaluation of Deep Learning Networks for Semantic Segmentation of Traffic Stereo-Pair Images.” Bulgarian International Conference on Computer Systems and Technologies (CompSysTech). Assoc Computing Machinery, Ruse, BULGARIA, 73–80. https://doi.org/10.1145/3274005.3274032.
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. “Attention is All You Need.” In Presented at the 2017 Neural Information Processing Systems (NIPS), Long Beach, CA. https://arxiv.org/abs/1706.03762.
  • Xu, Z., Z. Shen, Y. Li, L. Xia, H. Wang, S. Li, S. Jiao, and Y. Lei. 2021. “Road Extraction in Mountainous Regions from High-Resolution Images Based on DSDNet and Terrain Optimization.” Remote Sensing 13 (1): 18. https://doi.org/10.3390/rs13010090.
  • Yang, K., J. Yi, A. Chen, J. Liu, and W. Chen. 2022. “ConDinet Plus Plus: Full-Scale Fusion Network Based on Conditional Dilated Convolution to Extract Roads from Remote Sensing Images.” IEEE Geoscience & Remote Sensing Letters 19:5. https://doi.org/10.1109/lgrs.2021.3093101.
  • Yang, M., Y. Yuan, and G. Liu. 2022. “SDUNet: Road Extraction via Spatial Enhanced and Densely Connected UNet.” Pattern Recognition 126:8. https://doi.org/10.1016/j.patcog.2022.108549.
  • Yu, F., and V. Koltun. 2015. “Multi-Scale Context Aggregation by Dilated Convolutions.” Available from: https://arxiv.org/abs/1511.07122.
  • Yu, F., V. Koltun, and T. Funkhouser. 2017.”Dilated Residual Networks, In: 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 636–644. https://doi.org/10.1109/cvpr.2017.75.
  • Zeiler, M. D., D. Krishnan, G. W. Taylor, and R. Fergus. 2010. “Deconvolutional Networks, In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2528–2535. https://doi.org/10.1109/cvpr.2010.5539957.
  • Zeiler, M. D., G. W. Taylor, and R. Fergus. 2011. “Adaptive Deconvolutional Networks for Mid and High Level Feature Learning, In: 2011 IEEE International Conference on Computer Vision (ICCV).” Presented at the 2011 IEEE International Conference on Computer Vision (ICCV), 2018–2025. https://doi.org/10.1109/ICCV.2011.6126474.
  • Zhang, Z., Q. Liu, and Y. Wang. 2018. “Road Extraction by Deep Residual U-Net.” IEEE Geoscience & Remote Sensing Letters 15 (5): 749–753. https://doi.org/10.1109/lgrs.2018.2802944.
  • Zhang, Z., and Y. Wang. 2019. “JointNet: A Common Neural Network for Road and Building Extraction.” Remote Sensing 11 (6): 22. https://doi.org/10.3390/rs11060696.
  • Zhou, B., A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. 2016. “Learning Deep Features for Discriminative Localization, In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2921–2929. https://doi.org/10.1109/cvpr.2016.319.
  • Zhou, G., W. Chen, Q. Gui, X. Li, and L. Wang. 2021. “Split Depth-Wise Separable Graph-Convolution Network for Road Extraction in Complex Environments from High-Resolution Remote-Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15. https://doi.org/10.1109/TGRS.2021.3128033.
  • Zhou, L., C. Zhang, and M. Wu. 2018.”D-Linknet: LinkNet with Pretrained Encoder and Dilated Convolution for High Resolution Satellite Imagery Road Extraction, In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).” Presented at the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 192–196. https://doi.org/10.1109/cvprw.2018.00034.