Full article: Dual convolutional network based on hypergraph and multilevel feature fusion for road extraction from high-resolution remote sensing images

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

ABSTRACT

Road extraction from high-resolution remote sensing images (HRSI) is confronted with the challenge that roads are occluded by other objects, including opaque obstructions and similarly colored areas. This paper proposes a dual convolutional network based on hypergraph and multilevel feature fusion (DHM) for road extraction to address these challenges. The DHM consists of two branch networks (HGNN branch and CNN branch) and a bimodal feature fusion module (BFFM). In the HGNN branch, an algorithm is developed to exploit the shape features of roads and construct hypergraphs on the HRSI. Then, hypergraph neural networks are employed for the first time to capture the long-range context of roads to enhance road connectivity. In the CNN branch, a bottleneck fusion module integrated with an encoder-decoder network structure is built to aggregate multiscale local features. In BFFM, the long-range context from the HGNN branch and the local features from the CNN branch are fused through the designed position converter and enhanced graph reasoning module to achieve the complementary advantages of the dual-branch network. Extensive experiments on three datasets show that DHM outperforms other state-of-the-art methods, especially on the GS-Mountain road dataset. Furthermore, DHM significantly improves road extraction in occluded and similar road regions.

KEYWORDS:

1. Introduction

The wealth of feature information presented in high-resolution remote sensing images (HRSI) has prompted the employment of HRSI in numerous fields, such as crop protection (Li et al. Citation2016), land cover (Tong et al. Citation2020) and urban planning (Huang, Zhao, and Song Citation2018). Roads, as an important feature marker in HRSI, offer insight into the connectivity of different geographical locations. Currently, roads are extracted from HRSI in two main scenarios: roads in urban scenarios and roads in mountainous scenarios. Regardless of the road scenarios, roads may face the effects of shadows, occlusions, and similar areas. Compared to roads in urban scenes, roads in mountainous areas have larger shadow areas and blurred edges. Extracting accurate roads from urban scenarios allows timely updating of the road network (Bastani and Madden Citation2021) and facilitates the development of autonomous driving (Z. Xu, Sun, and Liu Citation2021). Extracting accurate roads from mountain scenarios can support emergency rescue decisions (Barrile et al. Citation2020; X. Zhang et al. Citation2022). Therefore, there is a great need for reliable extraction of roads from HRSI for various scenarios.

In recent decades, numerous road extraction methods have been proposed, broadly classified into traditional methods and deep learning-based methods. Most of the methods are for HRSI road extraction in urban scenarios. Traditional methods rely on morphological features, texture information of roads in remote sensing images or manually created features. Specifically, among the methods utilizing morphological features, Maurya, Gupta, and Shankar Shukla (Citation2011) employed the K-means algorithm and morphological operations to screen and filter road regions in remote sensing images. Since classical morphological methods cannot detect both straight and curved roads, B. Liu et al. (Citation2015) applied directional linear morphological filters to enhance the contrast between the road and the background and to reduce nonroad noise. Texture as an important feature to differentiate roads, a number of researchers have worked on extracting roads using texture features. Senthilnath, Rajeshwari, and Omkar (Citation2009) used spectral and geometric features of roads for denoise images and then utilized progressive texture analysis to extract the road network. To aggregate multiscale information, Sghaier and Lepage (Citation2015) applied texture analysis based on structural feature set and beamlet transform to integrate multiscale information for road extraction. Some methods also apply handcrafted features and specific classifiers for road extraction. For example, Movaghati, Moghaddamjoo, and Tavakoli (Citation2010) implemented multiple filters to identify and track road networks. Y. Zhang et al. (Citation2016) used morphology to optimize the road results after initial road extraction in combination with a devised tensor voting mechanism. The traditional methods can achieve good road extraction results on a small number of datasets, but they lack robustness and scalability and are easily affected by occlusions and data sources. In addition, since these methods often require a large number of human preprocessing or postprocessing steps to complete the extraction of results, they are labor intensive for handling large-scale datasets.

Benefiting from the powerful representation learning ability of deep learning (Chai et al. Citation2021), research on automatic road extraction from HRSI using deep learning methods has also attracted much attention (Abdollahi et al. Citation2020; P. Liu et al. Citation2022). These deep learning-based road extraction methods can be categorized as early, medium and recent methods according to time. At the early stage of the road extraction task using deep learning, there were some relatively simple applications of network structures and their variant structures in road extraction. Specifically, as a pioneering network for semantic segmentation, a fully convolutional network (FCN) employed convolution and deconvolution to achieve pixel-level semantic segmentation (Long, Shelhamer, and Darrell Citation2015). The fine-tuned FCN was soon applied to road and building extraction tasks, and it was demonstrated that a smaller pooling layer is helpful in the road extraction task (Zhong et al. Citation2016). The two FCN-based approaches described above do not focus on the imbalance between road and background in HRSI. X. Zhang et al. (Citation2019) designed a weighted loss function and an ensemble strategy based on spatial consistency for FCN. This approach effectively alleviates the occurrence of 'false roads' in the background. The same year that FCN was introduced, there was also UNet (Ronneberger, Fischer, and Brox Citation2015), which introduced jump connections in the encoder-decoder structure to fuse features at different scales. Similar to the UNet structure, SegNet recorded position indices in the pooling layer to increase accuracy by capturing boundary features during sampling in the decoder (Badrinarayanan, Kendall, and Cipolla Citation2017). Compared to how UNet cascades feature maps directly in the decoder, SegNet uses pooling indices and can reduce memory consumption. The above methods have achieved good results in road extraction, but the nature of the regular convolution kernel restricts the acquisition of contextual information about the road at a large distance in the image, limiting the development of accuracy.

In the middle development of deep learning, to overcome the limitations imposed by the traditional convolutional kernel, a number of road extraction networks that utilize pyramid structures or atrous convolution have emerged. In image segmentation, the DeepLab series successively employed atrous convolution to obtain larger sensory fields and atrous spatial pyramid pooling (ASPP) to obtain contextual information at multiple scales (L.-C. Chen et al. Citation2014; L.-C. Chen, Papandreou, Kokkinos et al. Citation2017; L.-C. Chen, Papandreou, Schroff et al. Citation2017; L.-C. Chen et al. Citation2018). Compared to the DeepLab series, D-LinkNet had a greater advantage in road extraction tasks due to the HRSI and the characteristics of roads (L. Zhou, Zhang, and Wu Citation2018). D-LinkNet utilizes stacked dilated convolutions to enlarge the receptive field of central features and mitigate the loss of fine details, making it more effective in road extraction tasks. Meanwhile, it was possible that neither atrous convolution nor ASPP was applied to obtain multiscale feature information. For example, Z. Zhang, Liu, and Wang (Citation2018) proposed a deep residual UNet extraction road to reduce information propagation losses using residual units. HsgNet embeds intermediate blocks in the middle of the encoder-decoder to hold global and long-range context information (Xie et al. Citation2019). These methods also incorporate multiscale features but mitigate the problem of discontinuity in atrous convolutional features compared to D-LinkNet. The above methods have achieved further improvement in the road extraction effect, but they still have some limitations. The atrous convolution tends to perform poorly on fine-grained objects (P. Wang et al. Citation2018); additionally, the method based on the rule convolution extracts many features from the Euclidean space (X. Wang et al. Citation2022), which cannot match well with the road features.

In the recent development of deep learning, several studies have attempted to overcome the above weaknesses of regular convolution by attentional mechanisms (Guo et al. Citation2022; Y. Wang et al. Citation2022; Y. Xu et al. Citation2021) or graph neural networks. In particular, Wan et al. (Citation2021) proposed DA-RoadNet, which designed dual attention modules for space and accessed shallow feature maps to reduce the effects of road shading. Dai, Zhang, and Zhang (Citation2023) proposed RADANet, which enhanced road connectivity by incorporating striped convolution in the encoder and utilizing deformable attention and spatial attention between the encoder and decoder. RADANet and DA-RoadNet both utilized attention mechanisms to capture global dependencies. DA-RoadNet emphasized addressing the issue of class imbalance in HRSI, while RADANet focused on leveraging road features for road extraction. Z. Yang et al. (Citation2023) proposed RCFSNet, which developed the coordinate dual attention mechanism to enhance road features in space and channel dimensions. Additionally, they integrated multiscale and full-phase features to optimize road extraction performance. From the transformer's perspective, Z. Yang et al. (Citation2022) applied the Swin transformer (Z. Liu et al. Citation2021) combined with a position attention module to capture contextual information from different road locations. X. Liu et al. (Citation2023) utilized the Swin transformer as an encoder, combining it with separable convolution to capture global contextual information and road structures. The attention mechanism or transformer-based approach can obtain more global information and achieve higher accuracy than its predecessors; however, it also has the obvious limitations that it tends to ignore some local information (Sun et al. Citation2022) and requires complex computational memory to train the model. Furthermore, the development of multiscale interaction fusion networks (J. Wang et al. Citation2023) or structure-optimized transfer networks (M. Zhang et al. Citation2023) using CNN or attention mechanism methods has achieved success in remote sensing image classification. This approach can also provide valuable insights into road classification for remote sensing images.

As an important representative of the recent development of deep learning, graph neural networks (GNNs) have received increasing attention in numerous fields (Kumar et al. Citation2022; Meyer et al. Citation2023; L. Yang et al. Citation2021) due to their ability to model irregular structures. In road extraction, G. Zhou et al. (Citation2022) used gradient operators to construct graphs over spatial and channel features and used graph convolution (Kipf and Welling Citation2016) to aggregate contextual information for the extraction of urban and mountainous roads. The authors also provided a mountain road dataset to compensate for the lack of a mountain road dataset. The method validated that graph convolution was useful for extracting zigzag, small and covered roads; however, it also led to the question of how to model graph structures on HRSI. In contrast to graph creation based on high-level features, Cui et al. (Citation2022) applied the SLIC algorithm (Achanta et al. Citation2012) to build graphs directly on HRSI and proposed a dual convolutional network (GDCNet) together with CNN and GCN. However, GDCNet did not explicitly utilize road a priori features to model the graph and did not make sufficient use of high-level semantic features. In the same year, to reduce the large number of computations caused by the transformer, Bandara, Maria Jose Valanarasu, and Patel (Citation2022) proposed the spatial and interaction space graph reasoning module based on global graph reasoning (Y. Chen et al. Citation2019) to infer the long-range relationship between roads. The above methods show that graph structures and GNNs are effective and achievable for road modeling and assist road extraction. However, there is still room for refinement in reasonably constructing graphs for HRSI and effectively merging local and global feature information.

Although many road extraction networks have achieved fine performance, the field, in general, still faces these challenges. First, in complex road environments, the extraction and fusion of local and remote road features may be limited by the influence of road occlusion and similar areas. Second, most existing methods that use GNNs to support road extraction lack preferable approaches for modeling graph structures directly on HRSI, and the information in the original HRSI is not fully utilized in the graph structure (G. Zhou et al. Citation2022). Finally, most of the models have demonstrated good results on the urban road dataset, but their performance on the complex mountainous road dataset is yet to be verified.

To solve the above challenges and further increase the prediction accuracy, a dual convolutional network based on hypergraph and multilevel feature fusion (DHM) is proposed. The main contributions of this work are summarized as follows.

(1)	An algorithm is created to directly construct a hypergraph structure for HRSI, incorporating the shape features of roads. The hypergraph neural network is introduced for the first time in road extraction research to extract long-range context features of roads from the preconstructed hypergraph structure.
(2)	A bimodal feature fusion module (BFFM) is developed, and an interactive road extraction network (DHM) is designed. The BFFM implements the transformation and fusion of hypergraph features and CNN features, which gives the DHM a strong contextual inference capability.
(3)	A bottleneck fusion module (BFM) is developed in the CNN branch. It is responsible for fusing CNN features of different scales in the encoder and helps to implement full-stage feature fusion in the decoder to enhance the global context extraction of the road.
(4)	Extensive experiments on the GS-Mountain Road Dataset, Massachusetts Road Dataset and CHN6-CUG Road Dataset have shown that the proposed DHM has good robustness, and the experimental results outperform most state-of-the-art methods, especially on the GS-Mountain Road Dataset.

The remainder of this paper consists of the following sections. Section 2 describes the DHM proposed in this paper in detail. Section 3 describes the dataset, the evaluation metrics, and some implementation details of the related experiments. Section 4 describes the experimental results in comparison with six different types of models and the detailed ablation experiments of DHM. Finally, Section 5 gives the summary and outlook.

2. Proposed methods

The motivation for the method proposed in this paper is the following. The road extraction task currently still faces discontinuous extraction results and misclassification. The main reason is that roads may be occluded by buildings, vehicles, shadows, etc., and there are areas in HRSI that resemble roads. It is almost impossible to identify whether the region is a road or not just by the local region features in the image. However, it is possible to identify occluded roads and similar areas through the trend analysis of road directions and the inference of the surrounding region of occluded roads and similar regions due to the usual stable trend direction and road continuity. The trend direction of a road is usually determined by its long-range context, so it is important to establish and extract the long-range road context. A hypergraph is an appropriate tool to describe long-range contextual relationships (Luo, Peng, and Liang Citation2022), and a hypergraph neural network can capture the dependence of long-range context. The inference of the surrounding region requires a high-quality representation of local features and complementing them with global features to leverage the context for inferring whether the local area is a road. Multiscale local features can be obtained through deep convolutional neural networks (DCNNs). The complementation of local features can be accomplished by incorporating graph reasoning techniques or integrating long-range contextual features to fuse global features. As shown in , if the roads in HRSI are targeted with hypergraphs and as many of the same roads as possible are included in a single hyperedge, then a single hyperedge can be a good representation of the connections within the same object. The long-range context therein is also easily extracted by the hypergraph neural network. Additionally, since different hyperedges can be used to represent connections between different feature objects, this helps to reason or filter out whether a particular hyperedge is a road from other hyperedges. For the surrounding area of the road, the design of DCNNs can extract multiscale local features. Fusing long-range contexts in hypergraphs and local features in CNNs and exploiting graph reasons can help infer local regions from a global perspective.

Figure 1. Motivation for modeling road images using hypergraphs. We use different colored hyperedges to simulate hyperedges within a hypergraph, with each hyperedge containing vertices with similar properties. In the most ideal state, a road can be represented by a hyperedge, and all vertices belonging to that road are contained within a hyperedge.

According to the above motivation, by extracting long-range road semantic features and fusing features of different modalities and scales, we design the DHM, as shown in . DHM consists of three parts: a hypergraph-based feature extraction network (HGNN branch), a CNN-based multiscale feature extraction network (CNN branch), and a BFFM. The HGNN branch and the CNN branch are two parallel feature extraction network branches, and their interactions are connected by BFFM. The HGNN branch is responsible for constructing the hypergraph structure on the input image and feeding the hypergraph data to the subsequent hypergraph convolutional layer to extract the long-range contextual road feature. The CNN branch performs multiscale local semantic feature extraction on the input image using the encoder-decoder network structure. BFFM fuses and reasons the local semantic features and long-range contextual features extracted from the middle layer of the two branches. This process helps to leverage the complementary advantages of the two branches and facilitates global semantic feature extraction in the subsequent network.

Figure 2. Overall structure of the proposed dual convolutional network based on hypergraph and multilevel feature fusion (DHM). The 'fusion' in the figure denotes the BFFM; the EGR module is also part of the bimodal feature fusion module. BFM is a bottleneck fusion module proposed to achieve full-scale feature fusion in the CNN branch.

2.1. Hypergraph-based feature extraction (HGNN branch)

In HRSI, roads often have long and curved connectivity and do not change direction abruptly. These road properties can distinguish them from other objects in the images. Hypergraphs are very suitable for depicting roads in HRSI since their advantage is that they express the long-distance connectivity and similarity of adjacent nodes. Hypergraphs can be constructed directly on the original HRSI to exploit these road properties, and the possible road directions can be considered in the construction process. We propose an algorithm to construct hypergraphs for road images and use hypergraph convolution (Bai, Zhang, and Torr Citation2021) to propagate features on the obtained hypergraph data to capture road features.

A hypergraph is defined as $G = (V, E)$ with N vertices and M hyperedges, where each hyperedge can connect multiple vertices. Unlike a normal graph, the incidence matrix $H \in R^{N \times M}$ is used to represent the hypergraph.

2.1.1. Hypergraph construction

For a given original input image $X \in R^{H \times W \times C}$ , C represents the image channel dimension, and H and W represent the height and width of the image. The parameter $s t e p (s)$ is used to regularly divide the original image into L blocks of equal size and nonintersecting superpixels, where $L = \frac{H \times W}{s^{2}} = N$ , ensuring that L is an integer. The original input image $X$ is then transformed into the new feature $\tilde{X} \in R^{L \times D}$ . Here, D denotes the feature dimension of the new feature $\tilde{X}$ (D equals $s \times s \times C$ ), and L represents the number of superpixels after division. $\tilde{X} = {{\tilde{x}}_{i}}_{i}^{L}$ , each feature ${\tilde{x}}_{i} \in R^{1 \times D}$ will be a node $v_{i}$ in the hypergraph. graphically shows the process of converting image features into hypergraph node features.

Figure 3. Example of the process of converting an image into a hypergraph vertex. s = 2 indicates that a block of pixels of size $2 \times 2 \times C$ is selected as a vertex feature.

Figure 3. Example of the process of converting an image into a hypergraph vertex. s = 2 indicates that a block of pixels of size 2×2×C is selected as a vertex feature.

Unlike most of the superpixel creation methods using the SLIC algorithm, the proposed superpixel creation method in this paper will facilitate the subsequent BFFM. If the parameter s is set to a small value, such as 2 or 3, the final result obtained by this superpixel creation method will not be much worse than that of the SLIC method. This inference has been proven experimentally.

Once the features of all nodes in the hypergraph are obtained, the Frobenius norm is utilized to calculate the similarity and determine which nodes should belong to the same hyperedge $e_{i}$ . For a matrix $A \in R^{n \times m}$ , the F-norm expression is as follows: (1) $‖ A ‖_{F} = \sqrt{\sum_{i = 1}^{m} \sum_{j = 1}^{n} {| a_{i j} |}^{2}}$ (1) To simplify the calculation, for each node $v_{i} \in V$ , its features ${\tilde{x}}_{i}$ are averaged over the channel dimensions C, and the newly obtained features $F_{i} \in R^{s \times s}$ are used as input features for the similarity calculation. The whole process can be expressed as follows: (2) $\begin{matrix} {\tilde{x}}_{i} \in R^{1 \times D} t o {\tilde{X}}_{i} \in R^{s \times s \times C} \\ {\tilde{X}}_{i} = {{\hat{X}}_{i}^{1}, \dots, {\hat{X}}_{i}^{C}}, {\hat{X}}_{i}^{j} \in R^{s \times s} \\ F_{i} = \frac{\sum_{j = 1}^{C} {\hat{X}}_{i}^{j}}{C} \end{matrix}$ (2) For each node, except for the boundary nodes, the similarity between the current node $v_{i}$ and its neighboring nodes, including the right node $v_{i + 1}$ , below node $v_{i + \frac{H}{s}}$ and below right node $v_{i + \frac{H}{s} + 1}$ , is calculated in sequence. Based on the threshold parameter $r a t e (r)$ , it is determined which nodes are grouped into the same hyperedge. The similarity is calculated by the following method: (3) $\begin{matrix} s i m i l a r i t y = 1 - \frac{F_{s u b}}{(F_{i} + F_{j}) / 2} \\ j = (i + 1, i + \frac{H}{s}, i + \frac{H}{s} + 1) \end{matrix}$ (3) $F_{s u b}$ denotes the F-norm of the difference of the features of two nodes $F_{i}$ , $F_{j}$ features, $F_{i}$ denotes the F-norm of feature $F_{i}$ of node $v_{i}$ . For nodes with similarity greater than the threshold r, these nodes are treated as the same object and join the same hyperedge $e_{i}$ ; otherwise, they are treated as a different object and wait for the next node calculation. When all the nodes in the hypergraph have been calculated, the representation of the incidence matrix $H$ is obtained as follows: (4) $H_{i j} = {\begin{matrix} 1, & v_{i} \in e_{j} \\ 0, & v_{i} \notin e_{j} \end{matrix}$ (4) The process of generating a hypergraph is shown in detail in . shows the results of visualizing the hyperedges and hypergraph vertices after generating a hypergraph from a remote sensing image.

Figure 4. Example of the results of generating and visualizing hypergraphs from remote sensing images. The visualization was generated with parameters s=2,r= 0.7. The white pixel points in the figure are the vertices in the hypergraph with hyperedges greater than 5 in length. (a) Cropped Massachusetts Road Dataset (b) CHN6-CUG Road Dataset (c) GS-Mountain Road Dataset.

Algorithm 1 Constructing hypergraphs

Display Table

2.1.2. Hypergraph convolution

Hypergraph convolution is applied to propagate features across the constructed hypergraph $G$ . The hypergraph convolution is defined as follows: (5) ${\tilde{X}}^{(l + 1)} = σ (D^{- 1 / 2} H W B^{- 1} H^{T} D^{- 1 / 2} {\tilde{X}}^{(l)} P)$ (5) where $σ ()$ is a nonlinear activation function, and ${\tilde{X}}^{(l)}$ is the embedding of all graph vertices at the layer. The weight matrix between the (l)th and $(l + 1)$ th layers is denoted by $P$ , where $W \in R^{M \times M}$ is the diagonal matrix of the positive weights assigned to each hyperedge. $D \in R^{N \times N}$ and $B \in R^{M \times M}$ are the diagonal matrix of the degrees of the vertices in the hypergraph and the diagonal matrix of the degrees of the hyperedges, respectively, which can be obtained by the following equation: (6) $\begin{matrix} D_{i i} = \sum_{ϵ = 1}^{M} W_{ϵ ϵ} H_{i ϵ} \\ B_{ϵ ϵ} = \sum_{i = 1}^{N} H_{i ϵ} \end{matrix}$ (6) In the proposed HGNN branch, after the first layer of hypergraph convolution, shallow CNN features are embedded into HGNN using the proposed bimodal feature fusion module to correct the noise in hypergraph node features. After the second layer of hypergraph convolution, two parallel hypergraph convolution layers are used to build a classifier and embed road long-distance context into the CNN branch using the BFFM.

2.2. CNN-based multiscale feature extraction (CNN branch)

Many deep convolution neural networks have achieved good results in depth feature extraction (He et al. Citation2016; Szegedy et al. Citation2015). However, some DCNNs are still ineffective in road extraction because roads can be affected by the shadows of the surrounding trees and cars. To enhance the extraction of features at different scales and mitigate the impact of road occlusion, multiscale feature extraction and full-stage multiscale feature fusion are implemented in the CNN branch. In the proposed CNN branch, the bottleneck residual blocks in ResNet-50 (He et al. Citation2016) are utilized as part of the feature encoder, and a decoder is designed to fuse all scale features. Additionally, a BFM is designed to integrate multiscale features. The prediction results of the CNN branch are fused with those of the HGNN branch to obtain the final result.

2.2.1. Feature encoder

Before the input features enter the bottleneck block, three convolution layers with $3 \times 3$ sized convolution kernels are used to replace the convolution layers with $7 \times 7$ sized convolution kernels in the original ResNet-50. Following each convolution layer, there is a batch normalization(BN) layer and a nonlinear activation function (ReLU). The three convolution layers are divided into two stages. In the first stage, there is only one convolution layer with a $3 \times 3$ kernel size and a stride size of 1. This convolution layer is used to extract features from the original image. The second stage consists of two convolution layers. The first layer has a stride size of 2, which is responsible for downsampling. The second layer, with a stride size of 1, extracts features from the downsampled features.

Bottleneck residual blocks are divided into four groups, and each group contains 3, 4, 6 and 3 bottlenecks. After the first group of bottlenecks, the features obtained are fused with the hypergraph features using the proposed BFFM and then fed to the subsequent bottleneck for deep feature extraction.

2.2.2. Bottleneck fusion module

In the feature encoder, the extracted features at different stages usually contain different information (S. Liu et al. Citation2022). Shallow features represent considerable fine-grained information in terms of edges, textures, etc. Deep features represent much more regional semantic information. The BFM is designed for the bottleneck residual block to fuse feature information from the full stage in the decoder stage.

The BFM selects the features extracted from the first three groups of bottleneck blocks as the fusion object. Since the input object of the second group of residual blocks is not directly derived from the output of the first group of residual blocks, the output of the first group of residual blocks is selected as one of the fusion objects. The first three groups of residual blocks are finally selected as the fusion objects to compensate for the missing features in the decoder.

The detailed BFM process is shown in . The features of the second and third groups are upsampled to the same size as the first group using bilinear interpolation and $3 \times 3$ convolution for each group. Then, the three groups of features are merged before using a $3 \times 3$ convolution layer to obtain the fusion result.

Figure 5. Overall structure of the bottleneck fusion module.

2.2.3. Feature decoder

In the decoder, each $3 \times 3$ convolution layer is followed by BN and ReLU operations, and a transposed convolution and two $3 \times 3$ convolution layers form a set of upsampling operations. Three sets of upsampling operations make up the decoder. The first upsampling operation is responsible for fusing the features of the last residual block with the features that have been fused by the BFM module. The last two upsampling operations are responsible for fusing the output of the previous upsampling operation with the features of the first two stages of the feature encoder.

2.3. Bimodal feature fusion module

In the HGNN branch, the long-range contextual information of the road is modeled by the hypergraph and extracted by the hypergraph convolution; however, the connectivity of the hypergraph is inevitably affected by road obstacles during the hypergraph construction. Compared with hypergraph convolution, CNN can capture local features well but is not good at extracting domain features over a long range. To fully leverage the strengths of hypergraph and CNN convolution, a BFFM is designed to fuse hypergraph features with image features and transform them into corresponding feature forms before subsequent tasks.

To enable the transformation of the two modal features, a position converter is developed in this study. This position converter is also at the heart of the module. Without the position converter, the spatial position of the two features would be corrupted, as shown in . The detailed execution process of the position converter is shown in and . Additionally, inspired by Bandara, Maria Jose Valanarasu, and Patel (Citation2022), an enhanced graph reasoning (EGR) module is established to aid the fusion in this section.

Figure 6. Example of a position converter with parameter s = 2 and the feature map size is $4 \times 4 \times 3$ . Numbers of the same color indicate that they belong to the same vertex feature.

Figure 6. Example of a position converter with parameter s = 2 and the feature map size is 4×4×3. Numbers of the same color indicate that they belong to the same vertex feature.

Algorithm 2 Position Converter: Image to Vertices

Display Table

Algorithm 3 Position Converter: Vertices to Image

Display Table

2.3.1. EGR module

Graph reasoning is performed using the following graph convolution (Kipf and Welling Citation2016) formula (Equation7(7) $\bar{X} = σ (A X W)$ (7) ): (7) $\bar{X} = σ (A X W)$ (7) where $A$ is the similarity matrix, $X \in R^{c \times l}$ is the input features, and $W$ is the learnable weights matrix.

Graph reasoning contains spatial graph reasoning and interactive graph reasoning. We upgrade spatial graph reasoning. The original spatial similarity matrix $A_{S}$ is obtained by performing matrix dot product calculation after $1 \times 1$ convolution on the input features. It is believed that $1 \times 1$ convolution, despite increasing the learnable parameters, has the potential to compromise the true similarity. Matrix distance is added to the similarity matrix $A_{S}$ to enhance the true similarity.

$P_{i}$ is defined in the ith row of $X^{T} \in R^{l \times c}$ , and the new matrix distance between $X$ and $X^{T}$ is expressed as follows: (8) $\begin{matrix} P = (\begin{matrix} {‖ P_{1} ‖}^{2} & {‖ P_{1} ‖}^{2} & \dots & {‖ P_{1} ‖}^{2} \\ {‖ P_{2} ‖}^{2} & {‖ P_{2} ‖}^{2} & \dots & {‖ P_{2} ‖}^{2} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ {‖ P_{l} ‖}^{2} & {‖ P_{l} ‖}^{2} & \dots & {‖ P_{l} ‖}^{2} \end{matrix}) \\ D t = \frac{P + P^{T} - 2 \times X^{T} X}{c} \end{matrix}$ (8) Equation (Equation8(8) $\begin{matrix} P = (\begin{matrix} {‖ P_{1} ‖}^{2} & {‖ P_{1} ‖}^{2} & \dots & {‖ P_{1} ‖}^{2} \\ {‖ P_{2} ‖}^{2} & {‖ P_{2} ‖}^{2} & \dots & {‖ P_{2} ‖}^{2} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ {‖ P_{l} ‖}^{2} & {‖ P_{l} ‖}^{2} & \dots & {‖ P_{l} ‖}^{2} \end{matrix}) \\ D t = \frac{P + P^{T} - 2 \times X^{T} X}{c} \end{matrix}$ (8) ) shows the process of calculating the matrix distance of $X^{T}$ . $D t$ denotes the distance between each feature point and all other points. (9) $\begin{matrix} A_{n e w} = m a x (D t) - D t \\ A_{S} = Softmax (ϕ_{S} (X) Λ (X) ϕ_{S} (X)^{T} + A_{n e w}) \end{matrix}$ (9) In Equation (Equation9(9) $\begin{matrix} A_{n e w} = m a x (D t) - D t \\ A_{S} = Softmax (ϕ_{S} (X) Λ (X) ϕ_{S} (X)^{T} + A_{n e w}) \end{matrix}$ (9) ), $m a x (D t)$ is the maximum value of the matrix $D t$ , $A_{n e w}$ is the added similarity matrix, and $A_{S}$ is the new spatial similarity matrix.

The fusion procedure of bimodal features is demonstrated from the CNN branch to the HGNN branch. The image features from the CNN branch are first input into the EGR module, and then the output features from the EGR module are spatially aligned using the position converter. The aligned features are finally downsampled and summed with the hypergraph features as the input of the HGNN branch.

The fusion procedure of bimodal features is illustrated from the HGNN branch to the CNN branch. The hypergraph features from the HGNN branch are first aligned according to the spatial positions by a position converter. Then, the aligned hypergraph features and the image features from the CNN branch are extracted using 1 × 1 convolution + BN + ReLU after a concatenation operation and fed into the CNN branch.

3. Experiments

3.1. Dataset and preprocessing

3.1.1. GS-mountain road dataset

This dataset, constructed from ZY-3 satellite images by G. Zhou et al. (Citation2022), covers an area of 2500 km $^{2}$ in Gansu Province, China. The dataset includes 204 images in the training set, 40 images in the validation set and 11 images in the test set, all of which are 256*256 pixels in size. The dataset focuses on roads in mountain scenes, and all images have a resolution of 2 m/pixel.

3.1.2. Massachusetts road dataset

This dataset (Mnih Citation2013) contains 1171 remote sensing images from Massachusetts. The training set contains 1108 images, the validation set contains 14 images, and the test set contains 49 images, each of which is 1500*1500 pixels in size. Images of roads in rural, urban and other regional scenes are included in this dataset. All datasets have a resolution of 1 pixel/m $^{2}$ .

3.1.3. CHN6-CUG road dataset

The dataset (Zhu et al. Citation2021) covers remote sensing images of six Chinese cities. It comprises a total of 4511 images that are 512*512 pixels in size and have a resolution of 50 cm/pixel. These images were divided into 3608 images for training and 903 images for testing. In addition to road images in urban and rural scenes, the dataset also contains road images in railway and highway scenes.

To reduce the video memory consumption during the execution of the selected models, each image and the corresponding labeled image in the Massachusetts road dataset were cropped to 48 224*224 pixel images. The image sizes in the remaining two datasets are left unchanged. The partially obscured images, which were present in the Massachusetts Road Dataset and the CHN6-CUG Road Dataset, were removed from the dataset. and show detailed information on all datasets after preprocessing.

Table 1. Size of the number of three datasets after preprocessing.

Download CSV Display Table

Table 2. Normalization criteria for the three datasets after preprocessing.

Download CSV Display Table

3.2. Implementation details

The proposed DHM is implemented using the PyTorch framework. It undergoes training for a total of 100 epochs. It should be noted that data augmentation techniques were not employed during the experiment. All experiments are conducted on two NVIDIA A10 GPUs. Binary cross entropy and Dice coefficients are incorporated into the loss function utilized in DHM. As the model optimizer, Adam's algorithm is employed with a weight decay of 0.0001 and a momentum of 0.95. Additionally, a self-defined learning rate scheduler similar to ReduceLROnPlateau is employed, with the distinction that the patience value increases linearly after each learning rate update. The initial learning rate is set at 0.001, with a reduction factor of 0.5 and a patience value of 3 for three different datasets (batch sizes are set at 16, 32, and 8). Regarding the parameters used to construct the hypergraph, step = 2 and rate = 0.7 are chosen.

3.3. Evaluation metrics

Intersection over union (IOU), mean-IOU (mIOU), precision, recall rate, f1-score, floating point operations (FLOPs) and total number of parameters (Param) are used as evaluation metrics and defined as follows: (10) $\begin{matrix} I O U = \frac{T P}{T P + F P + F N} \\ m I o U = \frac{1}{k + 1} \sum_{i = 0}^{k} {I O U}_{i} \\ R e c a l l = \frac{T P}{T P + F N} \\ P r e c i s i o n = \frac{T P}{T P + F P} \\ F 1 - S c o r e = \frac{2 T P}{2 T P + F P + F N} \end{matrix}$ (10) where k represents the number of classes. For the road extraction task, there are two classes: road (1) and background (0). TP and TN denote the number of pixels correctly predicted as road and background, respectively. FP indicates the number of pixels that incorrectly identified the background as a road. FN denotes the number of pixels that incorrectly identified the road as a background. Param and FLOPs are used to describe the number of parameters of the model and the computational complexity, respectively, with the latter being related to the input size.

4. Result

4.1. Comparative analysis and visualization

In this section, several classical deep learning models and recent road extraction models are chosen for comparison with DHM on three datasets, including UNet (Ronneberger, Fischer, and Brox Citation2015), SegNet (Badrinarayanan, Kendall, and Cipolla Citation2017), Deeplabv3+ (L.-C. Chen et al. Citation2018), D-LinkNet (L. Zhou, Zhang, and Wu Citation2018), SGCN (G. Zhou et al. Citation2022), and RCFSNet (Z. Yang et al. Citation2023). UNet and SegNet are classical semantic segmentation models in the image domain. DeepLabv3+ introduces an ASPP module to capture semantic information at different scales. D-LinkNet achieves outstanding results in the CVPR DeepGlobe 2018 Road Extraction Challenge by utilizing dilation convolution. SGCN utilizes gradient operators to construct graphs over spatial and channel features and employs graph convolution to extract long-range contextual road information. RCFSNet incorporates the coordinate dual attention mechanism to extract multiscale context and utilizes full-stage feature fusion for accurate road extraction.

4.1.1. Experimental results on the massachusetts road dataset

GDCNet (Cui et al. Citation2022) is additionally cited for the comparison experiment, and the results are directly quoted from its paper. Because the GDCNet code is not publicly available, some of the evaluation metrics are not publicly available. shows the evaluation metrics for the prediction results of DHM on the test set of the Massachusetts Road dataset. Among these models, DHM predicts results with better overall performance than the classical Models UNet, SegNet, DeepLabv3+, and D-LinkNet. This is because the DHM introduces hypergraph neural networks, which increase the DHM's context acquisition ability. Compared with the state-of-the-art models in the last two years, SGCN, GDCNet, RCFSNet, and DHM outperform other models in four metrics: IOU, mIOU, recall, and F1. This indicates that DHM can extract more complete roads compared to these models, benefiting from the fact that the shape features of the roads are incorporated into the DHM. Compared to GDCNet with the SLIC algorithm, the proposed DHM performs well in the remaining metrics, except for precision. The reason behind this is attributed to the incorporation of road shape features in the proposed hypergraph construction algorithm. This allows DHM to extract road-related contextual information effectively. Additionally, fusing CNN branch and HGNN branch features demonstrates better results compared to utilizing two independent branch networks.

Table 3. Comparison of the Performance of the Different Models on the Test Set of the Massachusetts Road Dataset.

Download CSV Display Table

As shown in , a visual comparison of all the selected methods on the Massachusetts Road Dataset is conducted. In general, the selected models can extract the main trunk of the road but have significant differences in some detailed areas. Although the red box areas of images in the first and fourth rows are roads occluded by other objects, it can be clearly seen in that RCFSNet displays the extraordinary extraction ability of these occluded targets because it employs the attention mechanism to enhance the acquisition of global features. Similarly, DHM achieved good extraction results by combining local features with long-range contextual features. Even more encouragingly, it can be observed from the red box areas of images in the second and third rows that DHM outperforms both SGCN and RCFSNet in extracting roads within these similarity regions in which road areas are similar in color or texture to the surrounding environment. The greatest difference between the DHM and both of these networks is that the DHM captures the context related to road shape features. It employs graph reasoning and global feature extraction techniques to infer road regions and extract roads that are not easily recognizable. Benefiting from the hypergraph structure applied to DHM, DHM can extract long-range contextual road features, and it is easy to see that DHM achieves good robustness when compared with other models. In summary, DHM is better equipped to extract these complex regions than other state-of-the-art models.

Figure 7. Comparison of each model visualization on the Massachusetts Road Dataset, with the red box highlighting the advantages of the proposed model in complex road areas.

4.1.2. Experimental results on the GS-mountain road dataset and CHN6-CUG road dataset

summarizes the quantitative results of the selected models on the GS-Mountain road dataset and the CHN6-CUG road dataset. For the GS-Mountain road dataset, it is easy to see that the roads in the GS-Mountain road dataset are slenderer and have a more uneven proportion than those in the other two datasets. DHM is significantly ahead of other state-of-the-art models in five metrics: IOU, mIOU, precision, recall, and F1. The extraction results of mountain roads can be significantly enhanced by using hypergraph and multiscale feature extraction. For the CHN6-CUG road dataset, DHM leads in other evaluation indicators, except for ranking second in accuracy. Both SGCN and DHM obtain better results than RCFSNet due to the introduction of graph structure for feature extraction. Compared to SGCN, better results are also obtained by constructing the hypergraph structure directly on HRSI before extracting the long-range context.

Table 4. Comparison of the Performance of the Different Models on the GS-Mountain Road Dataset and CHN6-CUG Road Dataset Test Sets.

Download CSV Display Table

shows the results of the visualization comparison of different models on the GS-Mountain Road Dataset. It can be clearly seen that UNet, SegNet and DeepLabv3+, which performed reasonably well in the Massachusetts Road Dataset, show a precipitous drop in the visualization of the GS-Mountain Road Dataset. Overall, the visualization of DHM on the GS-Mountain Road Dataset shows better continuity and finer-grained accuracy than the other six models. This indicates that extracting comprehensive road features using hypergraphs and multiscale feature fusion for roads in complex scenes will effectively enhance the connectivity and accuracy of road extraction in mountainous scenes.

Figure 8. Comparison of each model visualization on the GS-Mountain Road Dataset.

shows the visualization of the model on the CHN-CUG road dataset. The third image focuses on the complex interchange road scene, and it can be seen that none of the models were able to extract an accurate road in that scene. The first, second and fourth images are mainly roads obscured by shadows, trees and vehicles in different scenes. In the first image, DHM extracted more roads in the shaded area within the red box than the other six models. In the second image, the DHM effect has less noise compared to SegNet, DeepLabV3+, and SGCN. Moreover, the fourth image showcases improved road prediction in the occluded area by SGCN, RCFSNet, and the DHM. These observations serve as evidence of the superiority of graph structures in handling road networks. By incorporating hypergraph structures and enhancing graph reasoning, the powerful representation capability of hypergraph structures for road structures is provided. Additionally, the graph reasoning module effectively filters out noise information and significantly enhances the network's ability to predict occluded regions, thereby enhancing road connectivity.

Figure 9. Comparison of each model visualization on the CHN6-CUG Road Dataset.

4.2. Ablation study

4.2.1. Effectiveness of the EGR module

The primary purpose of ablation Experiment I is to determine the validity of the proposed EGR module, i.e. the yellow module in . shows the experimental effects of this module on the three datasets. On the Massachusetts Road Dataset, this module has little improvement effect on this dataset since the Massachusetts Road Dataset covers mostly urban roads. On the CHN6-CUG Road Dataset, which contain a greater number of road images in complex scenarios, the EGR module aids in reasoning about global information. As a result, each metric shows an average increase of approximately $0.5 %$ . On the GS-Mountain Road Dataset, with its slender roads, the module significantly optimizes the prediction results.

Table 5. Ablation experiment results on the EGR module compared on the test set of the three datasets.

Download CSV Display Table

4.2.2. Effectiveness of each module

Ablation experiment II is used to test the effectiveness of each module proposed in this paper. DHM is divided into three components: GC, BFM, and BFFM. GC means hypergraph construction and hypergraph convolution, i.e. the HGNN branch. BFM is the bottleneck fusion module, and BFFM is the bimodal feature fusion module that contains the EGR module.

Specifically, the baseline is the CNN branch. It should be noted that in the absence of the BFFM component, the output of the first bottleneck block in the CNN branch is the input of the second bottleneck block. model1 is the model with the BFM added to the baseline, Model 2 is the network composed of the baseline and the HGNN branch, and Model 3 is the model with the BFM module added to Model 2, and Model 4 is the model with the BFFM added to Model 2. Since the BFFM module cannot exist without the GC component, the ablation experiments of each module are shown in .

Table 6. Ablation experiment results compared on the test set of three datasets.

Download CSV Display Table

By analyzing , it is evident that either the BFM component or the GC component added to the baseline model leads to superior extraction performance on the three datasets. Notably, the GC component demonstrates particularly significant enhancements, as it excels in capturing semantic information within complex environments. On the GS-Mountain Road Dataset, the various metrics in Model 2 are increased compared to the baseline by $10.28 %, 5.34 %, 3.42 %, 16.6 %$ and $11.29 %$ , respectively. On the CHN6-CUG Road Dataset, although the precision of model2 decreased by $3.17 %$ compared to the baseline, the other metrics increased by $2.01 %, 1.01 %, 4.77 %, 1.67 %$ . The proposed hypergraph construction algorithm and the use of hypergraph convolution to extract long-range contextual information demonstrate their positive impact. Additionally, the BFM module responsible for fusing multiscale information in the CNN branch also plays a beneficial role compared to extracting roads using only the CNN branch. This also proves that when dealing with road extraction, multiscale feature fusion or the use of graph structure to extract long-range contextual information can be beneficial to road extraction results.

Based on the final experimental results, removing either the BFM or the BFFM module from DHM results in a varying degree of performance loss, which is especially evident in the GS-Mountain Road Dataset. This further validates the ability of the proposed BFM and BFFM modules to extract both local and global information, thus improving the road extraction performance.

4.2.3. Limitations and improvement

The proposed model has the following limitations.

Limitations of high FLOPs. As seen in and , DHM has higher FLOPs than the other models, although DHM has different levels of accuracy on all three datasets. This limitation may prevent DHM from being deployed on lightweight devices.
There are some missing road areas in the visualized forecast results. A comparison of models showing visualized prediction results in to shows that there are still some roads that cannot be fully extracted by DHM compared to real labels.

Improvements can be targeted in the follow-up to address these limitations.

FLOPs Improvement: Separable convolution and dilated convolution can effectively reduce the computational and parameter overhead of the model. Utilizing them to devise a new context extraction method may efficiently lower the model's computational burden while maintaining accuracy.
Model improvement: As mentioned in Sections 4.2.1 and 4.2.1, fusing features with different scale contexts might improve model accuracy. Additionally, incorporating some prior road knowledge into deep learning models may also help improve accuracy.

5. Conclusion

In this paper, we propose a dual convolutional network for road extraction based on hypergraph and multilevel feature fusion (DHM), which exploits the advantages of both CNN and HGNN and complements each other to acquire the rich road semantics. In the HGNN branch, hypergraph neural networks are introduced for road extraction for the first time, and a hypergraph structure construction algorithm are based on road shape features. This branch utilizes a hypergraph structure for feature propagation, significantly boosts the acquisition of long-range contextual information and enhances the connectivity of roads. In the CNN branch, a BFM is incorporated into the network architecture of the encoder-decoder. This branch achieves full-stage local multiscale feature extraction and mitigates the impact of road occlusion areas. In BFFM, position converters and an enhanced graph reasoning module (EGR) are designed to transform and fuse features from the two modalities. This achieves the interaction between the dual branches and enhances the acquisition of global features by the network. Compared with six other state-of-the-art models, extensive experiments on three datasets show that DHM achieves competitive results. Moreover, the visualized detection results show that DHM significantly improves the road extraction performance in occluded or similar regions, as well as road connectivity.

In future research, hypergraph neural networks and hypergraph construction algorithm will be further optimized to enhance the performance of road feature extraction. Simultaneously, simplifying the existing CNN branch and optimizing the context fusion method are essential steps in building lightweight models.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Additional information

Funding

This work was supported by the Guizhou Provincial Key Technology R&D Program under Grant (QKHZC(2022)YB074) and the Guizhou Provincial Major Scientific and Technological Program under Grant (QKHZDZX(2022)001).

References

Abdollahi, Abolfazl, Biswajeet Pradhan, Nagesh Shukla, Subrata Chakraborty, and Abdullah Alamri. 2020. “Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-of-the-Art Review.” Remote Sensing 12 (9): 1444. https://doi.org/10.3390/rs12091444.
Web of Science ®Google Scholar
Achanta, Radhakrishna, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11): 2274–2282. https://doi.org/10.1109/TPAMI.2012.120.
PubMed Web of Science ®Google Scholar
Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. 2017. “Segnet: A Deep Convolutional Encoder–Decoder Architecture for Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12): 2481–2495. https://doi.org/10.1109/TPAMI.34.
PubMed Web of Science ®Google Scholar
Bai, Song, Feihu Zhang, and Philip H. S. Torr. 2021. “Hypergraph Convolution and Hypergraph Attention.” Pattern Recognition 110:107637. https://doi.org/10.1016/j.patcog.2020.107637.
Web of Science ®Google Scholar
Bandara, Wele Gedara Chaminda, Jeya Maria Jose Valanarasu, and Vishal M. Patel. 2022. “SPIN Road Mapper: Extracting Roads from Aerial Images via Spatial and Interaction Space Graph Reasoning for Autonomous Driving.” In 2022 International Conference on Robotics and Automation (ICRA), 343–350.
Google Scholar
Barrile, Vincenzo, Giuliana Bilotta, Antonino Fotia, and Ernesto Bernardo. 2020. “Road Extraction for Emergencies from Satellite Imagery.” In Computational Science and Its Applications – ICCSA 2020, 767–781. Springer International Publishing.
Google Scholar
Bastani, Favyen, and Samuel Madden. 2021. “Beyond Road Extraction: A Dataset for Map Update Using Aerial Images.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11905–11914.
Google Scholar
Chai, Junyi, Hao Zeng, Anming Li, and Eric W. T. Ngai. 2021. “Deep Learning in Computer Vision: A Critical Review of Emerging Techniques and Application Scenarios.” Machine Learning with Applications 6:100134. https://doi.org/10.1016/j.mlwa.2021.100134.
Google Scholar
Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2014. “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected crfs.” arXiv preprint arXiv:1412.7062.
Google Scholar
Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2017. “Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected Crfs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4): 834–848. https://doi.org/10.1109/TPAMI.2017.2699184.
PubMed Web of Science ®Google Scholar
Chen, Liang-Chieh, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. “Rethinking Atrous Convolution for Semantic Image Segmentation.” arXiv preprint arXiv:1706.05587.
Google Scholar
Chen, Yunpeng, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. 2019. “Graph-Based Global Reasoning Networks.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 433–442. IEEE.
Google Scholar
Chen, Liang-Chieh, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” In Proceedings of the European Conference on Computer Vision (ECCV), 801–818.
Google Scholar
Cui, Fumin, Yichang Shi, Ruyi Feng, Lizhe Wang, and Tieyong Zeng. 2022. “A Graph-Based Dual Convolutional Network for Automatic Road Extraction from High Resolution Remote Sensing Images.” In IGARSS 2022 -- 2022 IEEE International Geoscience and Remote Sensing Symposium, 3015–3018.
Google Scholar
Dai, Ling, Guangyun Zhang, and Rongting Zhang. 2023. “RADANet: Road Augmented Deformable Attention Network for Road Extraction From Complex High-Resolution Remote-Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 61:1–13.
Web of Science ®Google Scholar
Guo, Meng-Hao, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, et al. 2022. “Attention Mechanisms in Computer Vision: A Survey.” Computational Visual Media 8 (3): 331–368. https://doi.org/10.1007/s41095-022-0271-y.
Google Scholar
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Google Scholar
Huang, Bo, Bei Zhao, and Yimeng Song. 2018. “Urban Land-Use Mapping Using a Deep Convolutional Neural Network with High Spatial Resolution Multispectral Remote Sensing Imagery.” Remote Sensing of Environment 214:73–86. https://doi.org/10.1016/j.rse.2018.04.050.
Web of Science ®Google Scholar
Kipf, Thomas N., and Max Welling. 2016. “Semi-Supervised Classification with Graph Convolutional Networks.” arXiv preprint arXiv:1609.02907.
Google Scholar
Kumar, V. Suresh, Ahmed Alemran, Dimitrios A. Karras, Shashi Kant Gupta, Chandra Kumar Dixit, and Bhadrappa Haralayya. 2022. “Natural Language Processing Using Graph Neural Network for Text Classification.” In 2022 International Conference on Knowledge Engineering and Communication Systems (ICKES), 1–5. IEEE.
Google Scholar
Li, Weijia, Haohuan Fu, Le Yu, and Arthur Cracknell. 2016. “Deep Learning Based Oil Palm Tree Detection and Counting for High-resolution Remote Sensing Images.” Remote Sensing 9 (1): 22. https://doi.org/10.3390/rs9010022.
Web of Science ®Google Scholar
Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.
Google Scholar
Liu, Xiangzeng, Ziyao Wang, Jinting Wan, Juli Zhang, Yue Xi, Ruyi Liu, and Qiguang Miao. 2023. “RoadFormer: Road Extraction Using a Swin Transformer Combined with a Spatial and Channel Separable Convolution.” Remote Sensing 15 (4): 1049. https://doi.org/10.3390/rs15041049.
Web of Science ®Google Scholar
Liu, Pengfei, Qing Wang, Gaochao Yang, Lu Li, and Huan Zhang. 2022. “Survey of Road Extraction Methods in Remote Sensing Images Based on Deep Learning.” PFG–Journal of Photogrammetry, Remote Sensing and Geoinformation Science 90 (2): 135–159. https://doi.org/10.1007/s41064-022-00194-z.
Web of Science ®Google Scholar
Liu, Bo, Huayi Wu, Yandong Wang, and Wenming Liu. 2015. “Main Road Extraction From ZY-3 Grayscale Imagery Based on Directional Mathematical Morphology and VGI Prior Knowledge in Urban Areas.” PLoS One 10 (9): e0138071. https://doi.org/10.1371/journal.pone.0138071.
PubMed Web of Science ®Google Scholar
Liu, Sicong, Yongjie Zheng, Qian Du, Lorenzo Bruzzone, Alim Samat, Xiaohua Tong, Yanmin Jin, and Chao Wang. 2022. “A Shallow-to-Deep Feature Fusion Network for VHR Remote Sensing Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 60:1–13.
Google Scholar
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.
Google Scholar
Luo, Xiaoyi, Jiaheng Peng, and Jun Liang. 2022. “Directed Hypergraph Attention Network for Traffic Forecasting.” IET Intelligent Transport Systems 16 (1): 85–98. https://doi.org/10.1049/itr2.v16.1.
Web of Science ®Google Scholar
Maurya, Rohit, P. R. Gupta, and Ajay Shankar Shukla. 2011. “Road Extraction Using K-Means Clustering and Morphological Operations.” In 2011 International Conference on Image Information Processing, 1–6. IEEE.
Google Scholar
Meyer, Eivind, Maurice Brenner, Bowen Zhang, Max Schickert, Bilal Musani, and Matthias Althoff. 2023. “Geometric Deep Learning for Autonomous Driving: Unlocking the Power of Graph Neural Networks With CommonRoad-Geometric.”
Google Scholar
Mnih, Volodymyr. 2013. “Machine Learning for Aerial Image Labeling.” PhD diss., University of Toronto.
Google Scholar
Movaghati, Sahar, Alireza Moghaddamjoo, and Ahad Tavakoli. 2010. “Road Extraction From Satellite Images Using Particle Filtering and Extended Kalman Filtering.” IEEE Transactions on Geoscience and Remote Sensing 48 (7): 2807–2817. https://doi.org/10.1109/TGRS.36.
Web of Science ®Google Scholar
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, 234–241. Springer.
Google Scholar
Senthilnath, J., M. Rajeshwari, and S. N. Omkar. 2009. “Automatic Road Extraction Using High Resolution Satellite Image Based on Texture Progressive Analysis and Normalized Cut Method.” Journal of the Indian Society of Remote Sensing 37 (3): 351–361. https://doi.org/10.1007/s12524-009-0043-5.
Web of Science ®Google Scholar
Sghaier, Moslem Ouled, and Richard Lepage. 2015. “Road Extraction From Very High Resolution Remote Sensing Optical Images Based on Texture Analysis and Beamlet Transform.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (5): 1946–1958. https://doi.org/10.1109/JSTARS.4609443.
Web of Science ®Google Scholar
Sun, Le, Guangrui Zhao, Yuhui Zheng, and Zebin Wu. 2022. “Spectral Spatial Feature Tokenization Transformer for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14.
Google Scholar
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. “Going Deeper with Convolutions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Google Scholar
Tong, Xin-Yi, Gui-Song Xia, Qikai Lu, Huanfeng Shen, Shengyang Li, Shucheng You, and Liangpei Zhang. 2020. “Land-cover Classification with High-resolution Remote Sensing Images Using Transferable Deep Models.” Remote Sensing of Environment 237:111322. https://doi.org/10.1016/j.rse.2019.111322.
Web of Science ®Google Scholar
Wan, Jie, Zhong Xie, Yongyang Xu, Siqiong Chen, and Qinjun Qiu. 2021. “DA-RoadNet: A Dual-Attention Network for Road Extraction From High Resolution Satellite Imagery.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:6302–6315. https://doi.org/10.1109/JSTARS.2021.3083055.
Web of Science ®Google Scholar
Wang, Panqu, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. 2018. “Understanding Convolution for Semantic Segmentation.” In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 1451–1460. IEEE.
Google Scholar
Wang, Junjie, Wei Li, Yunhao Gao, Mengmeng Zhang, Ran Tao, and Qian Du. 2023. “Hyperspectral and SAR Image Classification Via Multiscale Interactive Fusion Network.” IEEE Transactions on Neural Networks and Learning Systems 34: 10823–10837.
PubMed Web of Science ®Google Scholar
Wang, Ying, Yuexing Peng, Wei Li, George C. Alexandropoulos, Junchuan Yu, Daqing Ge, and Wei Xiang. 2022. “DDU-Net: Dual-Decoder-U-Net for Road Extraction Using High-Resolution Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60:1–12.
Web of Science ®Google Scholar
Wang, Xianghai, Keyun Zhao, Xiaoyang Zhao, and Siyao Li. 2022. “CSDBF: Dual-Branch Framework Based on Temporal–Spatial Joint Graph Attention With Complement Strategy for Hyperspectral Image Change Detection.” IEEE Transactions on Geoscience and Remote Sensing 60:1–18.
Web of Science ®Google Scholar
Xie, Yan, Fang Miao, Kai Zhou, and Jing Peng. 2019. “HsgNet: A Road Extraction Network Based on Global Perception of High-Order Spatial Information.” ISPRS International Journal of Geo-Information8 (12): 571. https://doi.org/10.3390/ijgi8120571.
Web of Science ®Google Scholar
Xu, Yingxiao, Hao Chen, Chun Du, and Jun Li. 2021. “MSACon: Mining Spatial Attention-Based Contextual Information for Road Extraction.” IEEE Transactions on Geoscience and Remote Sensing60:1–17.
Web of Science ®Google Scholar
Xu, Zhenhua, Yuxiang Sun, and Ming Liu. 2021. “icurb: Imitation Learning-based Detection of Road Curbs Using Aerial Images for Autonomous Driving.” IEEE Robotics and Automation Letters 6 (2): 1097–1104. https://doi.org/10.1109/LSP.2016..
Web of Science ®Google Scholar
Yang, Liangwei, Zhiwei Liu, Yingtong Dou, Jing Ma, and Philip S. Yu. 2021. “ConsisRec: Enhancing GNN for Social Recommendation via Consistent Neighbor Aggregation.” In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2141–2145.
Google Scholar
Yang, Zhigang, Daoxiang Zhou, Ying Yang, Jiapeng Zhang, and Zehua Chen. 2022. “TransRoadNet: A Novel Road Extraction Method for Remote Sensing Images Via Combining High-Level Semantic Feature and Context.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
Web of Science ®Google Scholar
Yang, Zhigang, Daoxiang Zhou, Ying Yang, Jiapeng Zhang, and Zehua Chen. 2023. “Road Extraction From Satellite Imagery by Road Context and Full-Stage Feature.” IEEE Geoscience and Remote Sensing Letters 20:1–5.
Web of Science ®Google Scholar
Zhang, Xinyu, Yu Jiang, Lizhe Wang, Wei Han, Ruyi Feng, Runyu Fan, and Sheng Wang. 2022. “Complex Mountain Road Extraction in High-Resolution Remote Sensing Images Via a Light Roadformer and a New Benchmark.” Remote Sensing 14 (19): 4729. https://doi.org/10.3390/rs14194729.
Web of Science ®Google Scholar
Zhang, Mengmeng, Wei Li, Yuxiang Zhang, Ran Tao, and Qian Du. 2023. “Hyperspectral and LiDAR Data Classification Based on Structural Optimization Transmission.” IEEE Transactions on Cybernetics53 (5): 3153–3164. https://doi.org/10.1109/TCYB.2022.3169773.
PubMed Web of Science ®Google Scholar
Zhang, Zhengxin, Qingjie Liu, and Yunhong Wang. 2018. “Road Extraction by Deep Residual U-Net.” IEEE Geoscience and Remote Sensing Letters 15 (5): 749–753. https://doi.org/10.1109/LGRS.2018.2802944.
Web of Science ®Google Scholar
Zhang, Xiangrong, Wenkang Ma, Chen Li, Jie Wu, Xu Tang, and Licheng Jiao. 2019. “Fully Convolutional Network-Based Ensemble Method for Road Extraction From Aerial Images.” IEEE Geoscience and Remote Sensing Letters 17 (10): 1777–1781. https://doi.org/10.1109/LGRS.8859.
Web of Science ®Google Scholar
Zhang, Yingying, Junping Zhang, Tong Li, and Ke Sun. 2016. “Road Extraction and Intersection Detection Based on Tensor Voting.” In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 1587–1590. IEEE.
Google Scholar
Zhong, Zilong, Jonathan Li, Weihong Cui, and Han Jiang. 2016. “Fully Convolutional Networks for Building and Road Extraction: Preliminary Results.” In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 1591–1594. IEEE.
Google Scholar
Zhou, Gaodian, Weitao Chen, Qianshan Gui, Xianju Li, and Lizhe Wang. 2022. “Split Depth-Wise Separable Graph-Convolution Network for Road Extraction in Complex Environments From High-Resolution Remote-Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing60:1–15.
Web of Science ®Google Scholar
Zhou, Lichen, Chuang Zhang, and Ming Wu. 2018. “D-LinkNet: LinkNet with Pretrained Encoder and Dilated Convolution for High Resolution Satellite Imagery Road Extraction.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 182–186.
Google Scholar
Zhu, Qiqi, Yanan Zhang, Lizeng Wang, Yanfei Zhong, Qingfeng Guan, Xiaoyan Lu, Liangpei Zhang, and Deren Li. 2021. “A Global Context-Aware and Batch-Independent Network for Road Extraction From VHR Satellite Imagery.” ISPRS Journal of Photogrammetry and Remote Sensing 175:353–365. https://doi.org/10.1016/j.isprsjprs.2021.03.016.
Web of Science ®Google Scholar

Dual convolutional network based on hypergraph and multilevel feature fusion for road extraction from high-resolution remote sensing images

ABSTRACT

1. Introduction

2. Proposed methods

2.1. Hypergraph-based feature extraction (HGNN branch)

2.1.1. Hypergraph construction

Algorithm 1 Constructing hypergraphs

2.1.2. Hypergraph convolution

2.2. CNN-based multiscale feature extraction (CNN branch)

2.2.1. Feature encoder

2.2.2. Bottleneck fusion module

2.2.3. Feature decoder

2.3. Bimodal feature fusion module

Algorithm 2 Position Converter: Image to Vertices

Algorithm 3 Position Converter: Vertices to Image

2.3.1. EGR module

3. Experiments

3.1. Dataset and preprocessing

3.1.1. GS-mountain road dataset

3.1.2. Massachusetts road dataset

3.1.3. CHN6-CUG road dataset

Table 1. Size of the number of three datasets after preprocessing.

Table 2. Normalization criteria for the three datasets after preprocessing.

3.2. Implementation details

3.3. Evaluation metrics

4. Result

4.1. Comparative analysis and visualization

4.1.1. Experimental results on the massachusetts road dataset

Table 3. Comparison of the Performance of the Different Models on the Test Set of the Massachusetts Road Dataset.

4.1.2. Experimental results on the GS-mountain road dataset and CHN6-CUG road dataset

Table 4. Comparison of the Performance of the Different Models on the GS-Mountain Road Dataset and CHN6-CUG Road Dataset Test Sets.

4.2. Ablation study

4.2.1. Effectiveness of the EGR module

Table 5. Ablation experiment results on the EGR module compared on the test set of the three datasets.

4.2.2. Effectiveness of each module

Table 6. Ablation experiment results compared on the test set of three datasets.

4.2.3. Limitations and improvement

5. Conclusion

Disclosure statement

Data availability statement

Additional information

Funding

References

Related research

To cite this article:

Download citation

Your download is now in progress and you may close this window

Login or register to access this feature

Information for

Open access

Opportunities

Help and information

Keep up to date