Canadian Journal of Remote Sensing
Journal canadien de télédétection
Volume 50, 2024 - Issue 1
Research Article

Dense Connected Edge Feature Enhancement Network for Building Edge Detection from High Resolution Remote Sensing Imagery

Réseau dense connecté de rehaussement de contours pour la détection des contours de bâtiments dans des images de télédétection à haute résolution

Article: 2298806 | Received 01 Jun 2023, Accepted 18 Dec 2023, Published online: 16 Jan 2024

Abstract

Deep-learning-based methods for building-edge-detection have been widely researched and applied in the field of image processing. However, these methods often emphasize the analysis of deep features, which may result in neglecting crucial shallow information representation. Furthermore, abstract features in the deep layers can potentially interfere with the accuracy of edge extraction. To address these challenges, we propose a densely connected edge-detection enhancement network (DCEFE-Net) for building-edge-detection in high-resolution remote sensing images. Firstly, by introducing spatial and channel attention modules, we effectively captured low-level spatial information and high-level semantic information from the input image. Secondly, the proposed edge-aware feature enhancement (EAFE) module emphasizes the representation of informative edge features. By iteratively generating multiple layers of edge-detection maps, it addresses the issue of edge detail loss and enhances edge-detection accuracy. Finally, the dense connectivity blocks strengthen the connections between the convolutional layers, thereby preventing the loss of edge features. Experimental results on the WHU and the Inria Aerial Image Labeling datasets validate the effectiveness of DCEFE-Net, as it consistently produces clear and reliable building-edge results.

Résumé

Les méthodes basées sur l’apprentissage profond pour la détection des contours de bâtiments ont fait l’objet de nombreuses recherches et applications dans le domaine du traitement d’images. Cependant, ces méthodes mettent souvent l’accent sur l’analyse des caractéristiques profondes, ce qui peut conduire à négliger la représentation de l’information superficielle. De plus, les caractéristiques abstraites dans les couches profondes peuvent potentiellement interférer avec la précision de l’extraction des contours. Pour relever ces défis, nous proposons un réseau densément connecté de rehaussement de la détection de contours (DCEFE-Net) pour la détection des contours des bâtiments dans les images de télédétection à haute résolution. Tout d’abord, en introduisant des modules d’attention spatiale et de canal, nous avons capturé efficacement des informations spatiales de bas niveau et des informations sémantiques de haut niveau à partir de l’image d’entrée. Deuxièmement, le module proposé de rehaussement des caractéristiques sensible aux contours (EAFE) met l’accent sur la représentation des caractéristiques informatives des contours. En générant de manière itérative plusieurs couches de cartes de détection des contours, le module résout le problème de la perte de détails aux contours et améliore la précision de leur détection. Enfin, les blocs de connectivité denses renforcent les connexions entre les couches convolutives, empêchant ainsi la perte des caractéristiques des contours. Les résultats expérimentaux sur les jeux de données WHU et Inria Aerial Image Labeling valident l’efficacité du DCEFE-Net, car il produit systématiquement des résultats clairs et fiables pour les contours de bâtiments.

Introduction

Buildings are closely related to factors such as human socioeconomic status and the ecological environment. The detection of building-edges in high-resolution remote sensing imagery is an important component of geographical information databases and plays a crucial role in urban expansion (Durieux et al. Citation2008), disaster warnings (Li et al. Citation2017), and urban planning and construction (He et al. Citation2021). Because of differences in roof materials, the presence of skylights, the adjacency of buildings to roads, and potential occlusion by tall trees, rapid and accurate building-edge-detection remains a significant focus for experts and scholars studying urban development.

With the ongoing advancements in science and technology, the quality of remote sensing imagery has steadily improved. High-resolution remote sensing imagery (HRRSI) can capture detailed textural information and the semantic context of buildings. Early edge-detection algorithms were characterized by long processing times, low efficiency, and poor accuracy (Ahmadi et al. Citation2010; Cui et al. Citation2012; Partovi et al. Citation2017; Peng et al. Citation2005; Su et al. Citation2018). Consequently, new building extraction methods based on HRRSI have both academic value and practical applications.

With the rapid development of artificial intelligence, deep-learning technology has been widely used across various subfields of image processing owing to its ability to efficiently and accurately analyze and decipher data. Remote sensing building data are characterized by massive, diverse, and complex attribute features, and deep-learning methods can automatically extract complex abstract features from images at multiple levels. Therefore, many researchers have conducted studies on deep-learning-based edge extraction. Lu et al. (Citation2018) retrained a richer convolutional features (RCF) network to create the RCF-building model and improved the edge probability map using topographic concepts. This approach effectively leverages high-level semantic information, although the post-processing stage encounters unavoidable issues with building discontinuities. Wen et al. (Citation2021) proposed a multiscale erosion network (ME-Net) to address the thickness issue of edge extraction; however, its numerous convolutional and pooling operations can cause a loss of edge pixel values. Xia et al. (Citation2021) innovatively combined semi-supervised learning with edge-detection networks to create a semi-supervised deep learning based on edge-detection (SDLED) network to minimize labeled sample requirements while achieving a more accurate extraction of rooftop boundaries in remote sensing images. Hamaguchi et al. (Citation2018) proposed a local feature extraction (LFE) module to address the segmentation of small, densely distributed objects in HRRSI.

In CNNs, the deep layers encode coarse-grained, more abstract features that tend to describe the image as a whole and carry stronger semantic information; however, the LFE approach above overlooks part of this deep information generated during network convolution. Ahmed and Byun (Citation2019) utilized the automatic feature extraction capability of convolutional neural networks (CNNs) for edge-detection of buildings with different volume sizes. In addition, the dense extreme inception network (DexiNed) (Soria et al. Citation2020) improves on holistically nested edge-detection (HED) networks to obtain thin edge maps without pre-trained weights. Although DexiNed has been successful on natural-image benchmarks, its effectiveness for remote sensing imagery, specifically for building-edge extraction, was questioned in initial results (Bousias Alexakis and Armenakis Citation2021).

The aforementioned CNN-based methods have facilitated significant progress in building-edge-detection. Nevertheless, limitations arise because multiscale features are gradually lost as the spatial resolution of the feature maps decreases through the network. Consequently, these approaches can result in incomplete boundary extraction, diminished accuracy, and low efficiency. Furthermore, traditional CNNs frequently disregard local textures and other low-level features. Recognizing the importance of both deep and shallow information, it is imperative to introduce a network structure that unifies low- and high-level features.

Fang et al. (Citation2021) utilized a feature decoder to analyze the relationships and differences between low-level edges and high-level semantic features. They devised a multi-layer edge-feature-guided network (MLEFGN) that simultaneously de-noises and predicts precise object edges. However, the applicability of this method to natural images contrasts with the distinct feature presentation of remote sensing imagery. To address these variations in remote sensing characteristics, Sun et al. (Citation2017) enhanced building detection performance by creating a two-stage CNN model that extracts low and high-level features of image objects, effectively mitigating complex background interference. Bai et al. (Citation2022) proposed an end-to-end edge-guided recurrent convolutional neural network (EGRCNN) that can generate accurate building boundaries by combining prior knowledge. Zhao et al. (Citation2023) introduced a multiscale receptive field network (MSRF-Net) that employs attention mechanisms to capture multiscale receptive field features and enhance building detection through a feature fusion module.

This study extensively considered the link between low- and high-level features. Low-level features provide accurate target localization, whereas high-level features contain rich semantic information. Notably, directly merging low- and high-level features may result in the loss of edge-feature details and cannot eliminate the background noise present in the low-level features. In addition, the consecutive convolution operations in the network architecture can lead to a loss of edge information and create discontinuities (breakpoints) in the building-edges.

To address these challenges, we present a novel approach called a densely connected edge-detection enhancement network (DCEFE-Net), which efficiently generates precise building-edges in HRRSI. The main contributions of this study are summarized as follows.

  1. We propose DCEFE-Net, which effectively considers both low-level spatial features and high-level semantic features to address the issue of blurred boundary details during building-edge extraction.

  2. We propose an edge-aware feature enhancement (EAFE) module, which addresses the scale variations among buildings by iteratively stacking edge features and up-sampling them based on the network depth.

  3. To further preserve the extracted edge information during the convolutional process, we enhance the connectivity between the network modules by adding a dense convolutional layer. This enables stronger connections between the modules and improves the robustness of the network for edge extraction.

Related work

This section describes related work in three aspects: transformer-based salient object detection (SOD), CNN-based SOD, and edge-detection methods.

Transformer-based SOD

The transformer model, proposed by Vaswani et al. (Citation2023), was initially aimed at modeling the long-range dependencies between inputs and outputs in machine translation. Leveraging its strong contextual information representation, the transformer handles crucial multiscale features using cross-attention layers. Consequently, this model has been applied across diverse visual tasks, including image segmentation (Strudel et al. Citation2021; Yan et al. Citation2022), object tracking (Shan et al. Citation2023), image recognition (Dosovitskiy et al. Citation2021; Touvron et al. Citation2021), and object detection (Carion et al. Citation2020; Chen et al. Citation2021).

In addition, transformers have proven to be effective in different applications of salient object detection (SOD) (Liu et al. Citation2021, Citation2022; Wang et al. Citation2022). For example, Wang et al. (Citation2023) created a transformer-enhanced module to expand the receptive field and enhance salient object recognition. Similarly, Min et al. (Citation2022) embedded a transformer in a network structure, facilitating the distant propagation of contextual dependencies and thereby improving the accuracy of video-based saliency detection. Pei et al. (Citation2023) introduced a salient instance segmentation framework with a transformer, emphasizing global features from the encoder and query features. Wang et al. (Citation2023) designed a transformer comprising feature enhancement and fusion modules to effectively extract valuable information across multiple scales for saliency detection in red, green, blue and depth (RGB-D) images.

Although transformer-based salient object detection (SOD) models excel, they often lack precise spatial cues and structural information, which causes challenges such as pixel confusion between salient objects and backgrounds. In addition, for remote sensing imagery, CNNs, which are adept at handling spatial features, are better suited to processing such image data.

CNN-based SOD

Researchers have introduced human visual attention (HVA) mechanisms into image-processing tasks through visual image saliency detection (Itti et al. Citation1998), which filters and screens important information in images via computer simulations. Deep-learning has significantly facilitated the application of saliency target detection research in semantic segmentation (Han et al. Citation2006; Ko and Nam Citation2006), target detection (Gao et al. Citation2012), image indexing (Zheng et al. Citation2015), image classification (Benjilali et al. Citation2019; Zhang et al. Citation2019), and super-pixels (Lin et al. Citation2018), overcoming the time-consuming nature and poor results produced by manual feature extraction using traditional methods. The main methods are outlined as follows:

  1. Salient target detection methods based on auxiliary networks.

This approach utilizes models from other domains as auxiliary networks to improve salient object-detection performance. The multiscale deep features (MDF) algorithm (Li and Yu Citation2015) was the first to apply deep neural networks to saliency detection. CNNs were used to extract features at multiple scales to enhance the spatial coherence of the saliency results. The algorithm also constructs a large saliency annotation dataset, MSRA-B, to facilitate research on saliency models. Li et al. (Citation2018) proposed a contour-to-saliency network (C2SNet) algorithm consisting of two branches. One branch predicts the contour of the target object, and the other estimates the saliency score for each pixel. This method can automatically convert contour detection in deep models into salient object detection. Wu et al. (Citation2019) proposed a mutual learning supervised learning network (MLSLNet) that combines foreground contour detection, edge-detection, and saliency object detection to effectively reduce image noise and achieve accurate foreground contour prediction.

  2. Salient target detection methods based on boundary and semantic enhancement.

High-level features contain rich semantic information, whereas low-to-middle-level features contain good spatial information. Semantic enhancement can locate salient targets better (Liu et al. Citation2018; Wu et al. Citation2019), whereas boundary enhancement can obtain clearer boundaries for salient targets (Feng et al. Citation2019; Qin et al. Citation2019). However, enhancing high-level features can cause blurring of low-to-middle-level features, and vice versa. Simultaneously enhancing both can play a more positive role in edge extraction. Wang et al. (Citation2017) introduced a feed-forward network (FFN) improvement method that utilized pyramid pooling and multi-stage refinement to extract high-level semantic and detailed underlying features. Wang et al. (Citation2019) proposed PAGE-Net to refine edge-detection information using multiscale saliency information. Zhao and Wu (Citation2019) developed the pyramid feature attention (PFA) network, which combines a context-aware pyramid feature extraction (CPFE) module to gather contextual semantic information and a spatial attention (SA) module to precisely identify salient object boundaries, resulting in enhanced performance.

  3. Salient target detection methods based on a combination of global and local information.

By using recursive operations and attention mechanisms, global information, such as color, texture, and foreground or background, is integrated with local information for boundary detection. This approach improves performance in detecting salient objects. Wang et al. (Citation2018) introduced a global recursive localization network (RLN) capable of adaptively learning image contextual information and accurately refining object boundaries. Zhang et al. (Citation2018) incorporated multi-path recursive feedback into a network and transferred top-level semantic information to shallow layers. This allowed the attention-guided network to effectively filter out background interference in the generated features. Additionally, several improved algorithms have been proposed (Liu and Han Citation2016; Luo et al. Citation2017).

Edge-detection methods

Image edge-detection is considered one of the most crucial components of both image processing and computer vision. It plays a vital role in a wide range of applications, including image improvement, restoration, and segmentation, as well as pattern recognition, feature description, and object tracking (Ma et al. Citation2018; Pirzada and Siddiqui Citation2013). By accurately identifying and extracting edge-feature information from remote sensing images, edge-detection methods can quickly locate important features. There are two main types of edge-detection algorithms: traditional and deep-learning. Traditional edge-detection methods rely on analyzing the underlying information within an image, such as color, luminance, texture, and gradient. These methods determine edges in images by analyzing the grey-scale jumps.

Traditional edge-detection methods are classified into two main categories: first- and second-order edge-detection operators. There are several first-order edge-detection operators, including the Roberts (Marr and Hildreth Citation1980), Prewitt (Chaudhuri and Chanda Citation1984), Sobel (Gao et al. Citation2010), and improved isotropic Sobel operators (Harris and Stephens Citation1988). While these operators can identify pixel locations with luminance variations, the edges they extract may be coarse and imprecise, and they can also produce pseudo-edges.

Second-order edge-detection methods, such as the Laplacian operator (Lecun et al. Citation1998), the Canny operator (Canny Citation1986), the Marr-Hildreth operator, the Laplacian of Gaussian (LoG) operator, and the Difference of Gaussians (DoG) operator (Ren Citation2008), respond more strongly to edges in textured regions. However, these methods are sensitive to noise, which is a major disadvantage.
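For illustration, the following sketch applies representative first- and second-order operators (Sobel, a Gaussian-smoothed Laplacian, and Canny) to a synthetic grey-scale image using OpenCV; the kernel sizes and thresholds are arbitrary choices, not values taken from this paper.

```python
import cv2
import numpy as np

# Synthetic grey-scale test image: a bright rectangular "roof" on a dark background.
img = np.zeros((256, 256), dtype=np.uint8)
cv2.rectangle(img, (64, 64), (192, 192), 255, -1)

# First-order operator: Sobel gradients in x and y combined into a magnitude map.
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
sobel_edges = np.sqrt(gx ** 2 + gy ** 2) > 100   # arbitrary magnitude threshold

# Second-order operator: Laplacian of a Gaussian-smoothed image (LoG-style response);
# edges correspond to zero crossings of this response.
blurred = cv2.GaussianBlur(img, (5, 5), 1.0)
log_response = cv2.Laplacian(blurred, cv2.CV_64F, ksize=3)

# Canny adds non-maximum suppression and hysteresis thresholding on top of gradients.
canny_edges = cv2.Canny(img, 50, 150)
```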

With the significant progress made by CNNs in computer vision (Chen et al. Citation2018; He et al. Citation2016), edge-detection methods have improved by incorporating their ability to automatically extract high-quality semantic information (Shen et al. Citation2015; Xie and Tu Citation2015). Popular CNNs used in edge-detection algorithms include AlexNet (Krizhevsky, Sutskever, and Hinton Citation2017), visual geometry group network (VGGNet) (Simonyan and Zisserman Citation2015), and deep residual networks (ResNet) (He et al. Citation2016).

Earlier fully convolutional network (FCN) (Shelhamer et al. Citation2017) and holistically-nested edge detection (HED) (Xie and Tu Citation2015) models were proposed as binary classification models for edges and non-edges by fusing multi-level information based on classical networks. Yu et al. (Citation2017) modified ResNet to enable the extraction of multi-category edges, achieving good segmentation results on the SBD and Cityscapes datasets. Liu et al. (Citation2019) used VGGNet as the base network to synthesize multiscale and multi-level image information and achieved accurate edge-detection using rich convolutional features (RCF). Furthermore, many datasets have been proposed to evaluate the reliability of edge-detection algorithms (Liu et al. Citation2019).

Methodology

In this paper, we propose a DCEFE-Net designed to extract precise building-edges from HRRSI. HRRSI contains abundant information, with shallow features capturing spatial structural characteristics for edge localization, and deep features accurately identifying boundary regions. Both types of features are essential for multiscale building-edge-detection. Therefore, we propose an edge-detection architecture consisting of four blocks. To extract spatial structural information, we incorporated a spatial attention (SA) module into block 1. In block 4, we introduced a channel attention (CA) module to capture channel dependencies and recalibrate feature responses, thereby enhancing the accuracy of edge-detection. Furthermore, to achieve a better representation of multiscale edge features, feature aggregation is necessary. However, fusing edge features directly from different levels can introduce noise from the lower-level features. To effectively preserve the edge features, we employed dense connections between and within the main blocks to prevent the loss of edge information. In addition, we propose an edge-aware feature enhancement (EAFE) module that enhances edge features using dilated convolutions. This module combines the enhanced features with the upsampled output from the subsequent layer, thereby reducing background noise in the shallow features. This approach enables the network to focus on extracting precise building-edge features.

Architecture overview

Based on the HED (Xie and Tu Citation2015) framework for image edge-detection, we propose an end-to-end network structure called DCEFE-Net that can extract fine building-edges from HRRSI. The overall framework is illustrated in Figure 1.

Figure 1. Overview of the DCEFE-Net architecture.


DCEFE-Net comprises four main encoder blocks, each consisting of a stack of 3 × 3 convolutional layers with batch normalization and a ReLU function. To better capture both the shallow and deep features of the image, we introduce the SA module in block 1, which extracts spatial structural information from the lower layers through three convolutional operations. Additionally, we incorporated the CA module into block 4 to capture high-level semantic information toward the end of the network structure. To address the issue of blurry edge prediction caused by pooling layers incorrectly identifying non-edge points in the background of HRRSI, we introduced dense connections between the outputs of different sections. This was achieved by incorporating 1 × 1 convolutional layers to establish dense connections, effectively addressing the potential loss of edge information during the convolutional process and improving the network’s ability to preserve edge information. The output of each main block was then passed through the EAFE module. This module enhances the building-edge features and combines the edge feature maps from different scales to produce the final building-edge-detection results. The structure of the EAFE module is depicted in Figure 2.
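To make the block layout concrete, the following minimal PyTorch sketch wires together four encoder blocks, 1 × 1 dense connections between block outputs, and one side output per block. It is a structural illustration only: the channel widths are assumptions, nn.Identity stands in for the SA and CA modules, and simple 1 × 1 convolution plus bilinear upsampling heads stand in for the EAFE module.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch, n_convs):
    """Stack of 3x3 conv + batch norm + ReLU layers, as used in each main block."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)


class DCEFENetSketch(nn.Module):
    """Structural sketch only: nn.Identity stands in for the SA and CA modules,
    and 1x1 conv + bilinear upsampling stands in for the EAFE side heads."""

    def __init__(self):
        super().__init__()
        self.block1 = conv_block(3, 64, 2)
        self.sa = nn.Identity()        # SA module would go here (block 1)
        self.block2 = conv_block(64, 128, 2)
        self.block3 = conv_block(128, 256, 3)
        self.block4 = conv_block(256, 512, 3)
        self.ca = nn.Identity()        # CA module would go here (block 4)
        self.pool = nn.MaxPool2d(2)
        # 1x1 convolutions forming dense connections between block outputs.
        self.dense12 = nn.Conv2d(64, 128, 1)
        self.dense23 = nn.Conv2d(128, 256, 1)
        self.dense34 = nn.Conv2d(256, 512, 1)
        # Placeholder side heads: one per block, each mapped back to input resolution.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 1, 1),
                          nn.Upsample(scale_factor=s, mode="bilinear", align_corners=False))
            for c, s in ((64, 1), (128, 2), (256, 4), (512, 8))
        ])
        self.fuse = nn.Conv2d(4, 1, 1)  # fuses the four side outputs

    def forward(self, x):
        f1 = self.sa(self.block1(x))
        p1 = self.pool(f1)
        f2 = self.block2(p1) + self.dense12(p1)
        p2 = self.pool(f2)
        f3 = self.block3(p2) + self.dense23(p2)
        p3 = self.pool(f3)
        f4 = self.ca(self.block4(p3) + self.dense34(p3))
        side = [head(f) for head, f in zip(self.heads, (f1, f2, f3, f4))]
        fused = self.fuse(torch.cat(side, dim=1))
        return side, fused
```

Calling DCEFENetSketch() on a 512 × 512 RGB tensor returns four full-resolution side edge maps and a fused map, mirroring the side-output-plus-fusion design described above.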

Figure 2. The structure of the EAFE module.


The EAFE module uses the output of each main block as input. Because the edge feature scales vary across different main blocks, shallow features may not capture fine building-edges, whereas deep features may disperse the edge information owing to the large number of parameters. By employing the EAFE module, we can enhance the edge features within the convolutional layers, enabling the effective utilization of multiscale feature layers. This ensures that both fine-scale and high-level edge information are preserved and utilized in the network. To handle single-scale feature inputs in the EAFE module, we incorporated N dilated convolutional layers with different dilation rates to extract the edge features at multiple scales. These features were then combined with the original feature residuals to fuse the multi-layered edge information. The dilation rate for the nth dilated convolutional layer was determined as r_n = max(1, r_0 × n), where r_0 represents the base dilation rate. Subsequently, based on the network depth at which the enhanced edge features were located, a multi-level upsampling process was performed to obtain the edge extraction results at that particular scale. Multi-level upsampling involves two subblocks, each consisting of a convolutional layer and a transposed convolutional layer. In particular, subblock 1 includes a 1 × 1 convolutional layer followed by the ReLU function, whereas the transposed convolutional layer has a size of s × s, and the number of filters is set to 16. When the difference in scale (DS) between the feature maps and the ground truth exceeds 2, the enhanced features are input into subblock 1. Through multiple iterations with the filters in subblock 1, the DS is reduced to 2 and the features are further propagated to subblock 2. The only difference between subblocks 1 and 2 is that subblock 2 has a single filter. When the DS reaches 2, the convolutional and transposed convolutional layers in subblock 2 generate an edge-enhanced layer of the same size as the ground truth, effectively enhancing the edges.
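A hedged PyTorch sketch of such an EAFE head is given below. The dilation rates, the sub-block layout, and the mapping from a block’s scale difference (DS) to the number of upsampling steps are assumptions made for illustration and do not represent the authors’ released implementation.

```python
import math

import torch
import torch.nn as nn


class EAFE(nn.Module):
    """Sketch of an edge-aware feature enhancement head (parameters are assumptions)."""

    def __init__(self, in_ch, scale_diff=4, r0=3, n_dilated=3):
        super().__init__()
        # N dilated 3x3 convolutions with rates r0, 2*r0, 3*r0, ... (e.g. 3, 6, 9).
        self.dilated = nn.ModuleList([
            nn.Conv2d(in_ch, in_ch, 3, padding=r0 * (k + 1), dilation=r0 * (k + 1))
            for k in range(n_dilated)
        ])
        # Sub-block 1 (1x1 conv + ReLU + stride-2 transposed conv with 16 filters) is
        # repeated until the scale difference (DS) to the ground truth is reduced to 2.
        up_steps = max(int(math.log2(scale_diff)) - 1, 0)
        self.sub1 = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch if i == 0 else 16, 16, 1),
                          nn.ReLU(inplace=True),
                          nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1))
            for i in range(up_steps)
        ])
        # Sub-block 2 has a single filter and produces the final edge-enhanced map.
        last_in = in_ch if up_steps == 0 else 16
        self.sub2 = nn.Sequential(nn.Conv2d(last_in, 1, 1),
                                  nn.ReLU(inplace=True),
                                  nn.ConvTranspose2d(1, 1, 4, stride=2, padding=1))

    def forward(self, x):
        # Multi-rate edge responses combined with the original features (residual fusion).
        enhanced = x + sum(conv(x) for conv in self.dilated)
        for block in self.sub1:
            enhanced = block(enhanced)
        return self.sub2(enhanced)
```

For instance, EAFE(256, scale_diff=4) applied to a quarter-resolution feature map yields a single-channel edge map at the label resolution.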

Feature extraction block

This section introduces two modules in DCEFE-Net: the CA module, which scores the channels, and the SA module, which extracts low-level spatial structural features. The SA module was introduced after two convolutional layers and a pooling layer in block 1. Its purpose is to enhance the building details while minimizing edge loss. The CA module was introduced after the convolutional operations in block 4. It ranks the importance of the features and selects effective high-level features during the edge-detection process. Finally, the dense connection module effectively combines features from different levels and generates building-edge feature maps. High-level semantic features play a crucial role in improving edge-detection accuracy (Bell et al. Citation2015; Marcu and Leordeanu Citation2016). However, in deep feature learning, the receptive field of CNNs often exceeds the effective receptive field, leading to the underutilization of global image information (Chen et al. Citation2018). To address this issue, we input the HRRSI into DCEFE-Net and performed basic feature extraction using the first two convolutional layers and the SA module. Subsequently, after obtaining multiscale deep features through ten convolutional layers, we utilized the CA module to selectively filter the acquired deep features, effectively harnessing the deep information that is relevant for accurate edge-detection.

The CA module learns the significance of each channel feature by enabling channel interaction and adjusts the strength of the channel feature representation accordingly. This module is illustrated in Figure 3. The input features were preserved to the maximum extent through global average pooling, and the resulting channel descriptors were fed into two convolutional layers to capture channel dependencies. The CA module assigned greater weights to channels that demonstrated a high response to edge information. The formula for computing the channel weights in the CA module is as follows: (1) ω = σ(F_{fj}(f))

Figure 3. CA module structure for high level information filtering, and SA module structure for extracting the underlying information.


Here, σ represents the sigmoid activation function, and F denotes the convolution operation. The input image is divided along the channel dimension into standard 1 × 1 convolution layers f_j, with normalized weight values indexed by j. (2) CA = σ(f_j2(ξ(f_j1(v_h, P1)), P2))

Here, ξ represents the ReLU function, and f_j1(·, P1) and f_j2(·, P2) correspond to Conv 1 and Conv 2, respectively. Conv 1 serves as the dimension-reduction layer, whereas Conv 2 acts as the dimension-augmentation layer. P1 and P2 represent the parameters of the CA module. By performing global average pooling on individual feature channels, the channel statistics v_h were obtained. The final output of the high-level features (denoted as C̃_high) was obtained by weighting with the CA module using the following formula: (3) C̃_high = CA · C_high
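Equations (1)–(3) describe a squeeze-and-excitation style channel weighting; a minimal PyTorch sketch, with an assumed channel-reduction ratio, could look as follows.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Sketch of the CA module: global average pooling, two 1x1 convolutions
    (dimension reduction then augmentation), and sigmoid channel weighting."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                           # v_h: channel statistics
        self.conv1 = nn.Conv2d(channels, channels // reduction, 1)    # f_j1(., P1), reduction
        self.relu = nn.ReLU(inplace=True)                             # xi
        self.conv2 = nn.Conv2d(channels // reduction, channels, 1)    # f_j2(., P2), augmentation
        self.sigmoid = nn.Sigmoid()                                   # sigma

    def forward(self, c_high):
        ca = self.sigmoid(self.conv2(self.relu(self.conv1(self.pool(c_high)))))  # Eq. (2)
        return ca * c_high                                            # Eq. (3): weighted features
```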

Remote sensing images typically contain abundant foreground and complex background information. Relying solely on the deep features extracted by the CA block may lead to a significant loss of detailed information after the upsampling and convolution operations. An imbalance between the edge and deep features may also result in suboptimal edge prediction results. To address these issues and obtain more effective spatial information, the low-level features obtained after the convolution layer in block 1 were directly fed into the SA block to capture the spatial details of the image. This helps generate more distinctive features that compensate for the loss of spatial information caused by stacked convolutions and upsampling operations. The structure of the SA block is shown in Figure 3.

First, the spatial information of the low-level feature maps was concentrated and compressed using max-pooling and average-pooling operations. The resulting features were then enhanced by applying a sharpening filter of size s × s to emphasize the building-edge features. Next, a convolution operation and a sigmoid activation function were applied to reduce the number of feature maps in each channel. Here, P3 represents a parameter of the SA module. Finally, the output C̃_low is the weighted sum of the SA module over the feature maps, as shown in the following formulas: (4) SA = σ(P3(f_s([I_avg^s; I_max^s]))) (5) C̃_low = SA · C_low
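A possible PyTorch reading of Equations (4) and (5) is sketched below. The channel-wise average/max pooling, the fixed 3 × 3 sharpening kernel, and the convolution size s are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Sketch of the SA module (assumed pooling scheme, sharpening kernel, and kernel size)."""

    def __init__(self, s=7):
        super().__init__()
        # Fixed 3x3 sharpening kernel applied to the pooled maps to emphasize edges.
        kernel = torch.tensor([[0., -1., 0.], [-1., 5., -1.], [0., -1., 0.]])
        self.register_buffer("sharpen", kernel.view(1, 1, 3, 3))
        # f_s: convolution over the concatenated (avg, max) maps, parameters P3.
        self.conv = nn.Conv2d(2, 1, s, padding=s // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, c_low):
        avg_map = torch.mean(c_low, dim=1, keepdim=True)      # I_avg
        max_map, _ = torch.max(c_low, dim=1, keepdim=True)    # I_max
        pooled = torch.cat([avg_map, max_map], dim=1)
        # Sharpen each of the two pooled maps with the fixed kernel.
        pooled = F.conv2d(pooled, self.sharpen.repeat(2, 1, 1, 1), padding=1, groups=2)
        sa = self.sigmoid(self.conv(pooled))                   # Eq. (4)
        return sa * c_low                                      # Eq. (5): weighted low-level features
```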

Loss function

To evaluate the relationship between the model effects and losses, we used the loss function proposed by He et al. (Citation2022) to optimize the network learning process. The loss function for DCEFE-Net is expressed as follows: (6) L_DCEFE = Σ_{z=1}^{Z} w_1 L(R_z, Y_z) + w_2 L(R, Y)

The variable z indexes the convolutional layers of DCEFE-Net, where 1 ≤ z ≤ Z. R_z denotes the edge-detection outcome of side-output layer z at a specific scale, while R represents the multiscale edge outcome of the fused intermediate detection results. L(·) denotes the difference function between the predicted and ground-truth edge maps, and w_1 and w_2 denote the fusion weights of the z-layer edge-detection results and the fused result, respectively. (7) Y = Σ_{z=1}^{Z} Y_z where Y denotes the ground-truth building-edge map, which can be decomposed into Z binary edge maps. Z denotes the number of scales, and Y_z denotes the ground-truth edge map corresponding to side-output layer z at a particular scale.
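A minimal PyTorch rendering of Equation (6) is sketched below, assuming that L(·) is binary cross-entropy computed on side outputs already upsampled to the label resolution; the fusion weights w1 and w2 are placeholders.

```python
import torch
import torch.nn.functional as F


def dcefe_loss(side_outputs, fused_output, target, w1=1.0, w2=1.1):
    """Sketch of Eq. (6): weighted sum of per-scale side losses and the fused loss.

    side_outputs: list of Z edge logit maps R_z, each upsampled to the target size.
    fused_output: fused multiscale edge logit map R.
    target: binary ground-truth edge map Y (float tensor with values in {0, 1}).
    """
    side_loss = sum(
        F.binary_cross_entropy_with_logits(r_z, target) for r_z in side_outputs
    )
    fused_loss = F.binary_cross_entropy_with_logits(fused_output, target)
    return w1 * side_loss + w2 * fused_loss
```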

Experiments

Datasets

We used two challenging datasets, the WHU Aerial image dataset (Ji et al. Citation2019) and the Inria Aerial image labeling dataset (Maggiori et al. Citation2017), to train and evaluate the model.

  1. WHU Aerial Image Dataset:

The WHU building dataset comprises aerial and satellite imagery, including building samples. The dataset covers 450 km2 of Christchurch, New Zealand, and consists of more than 220,000 independent buildings, as illustrated in Figure 4. We used HRRSI with a spatial resolution of 0.3 m. The dataset comprised 8189 RGB images, each with a size of 512 × 512. Out of these images, 4000 containing buildings were filtered and divided into a training dataset of 3400 images, a validation dataset of 600 images, and a test dataset of 550 images.

  2. Inria Aerial Image Labeling Dataset (IAIL):

The IAIL dataset contains 180 HRRSI of five cities, namely Western Tyrol, Chicago, Vienna, Kitsap County, and Austin, each with an image size of 5000 × 5000 pixels. Because the IAIL dataset includes images from Europe and the WHU dataset comprises images collected from the Southern Hemisphere, they complement each other. The IAIL dataset has a spatial resolution and surface coverage similar to those of the WHU dataset; however, it covers more complex landscapes and architectural styles. Therefore, the IAIL images were cropped to the same 512 × 512 pixel size as the WHU samples and filtered to include 8019 images as training data, 1419 images as validation data, and 576 images as test data. The IAIL dataset was used to evaluate the generalization ability of DCEFE-Net under different image acquisition methods, scene complexities, and building structures. Samples of the cropped images are shown in Figure 5.

Figure 4. Samples of the WHU aerial image data set.


Figure 5. Samples of the IAIL aerial image data set.


Comparative methods and evaluation metrics

Based on the WHU and IAIL datasets, DCEFE-Net’s performance was compared with that of five state-of-the-art deep models. These models are briefly described as follows:

HED (Xie and Tu Citation2015): The HED network is based on the VGGNet architecture and combines fully convolutional networks with a deeply supervised model to achieve edge-detection on natural images. This solves the problem of the difficult representation of multiscale and multilevel features present in the overall image, which ultimately improves the network’s learning efficiency.

DexiNed (Soria et al. Citation2020): The dense extreme inception network (DexiNed) is inspired by the HED and Xception networks. DexiNed consists of six main blocks, each of which is closely connected to a subblock. The output of each main block is passed through an upsampling module to obtain an intermediate edge map, and these intermediate maps are finally fused into the overall edge map.

LPCB (Deng et al. Citation2018): The learning to predict crisp boundaries (LPCB) method addresses the problem of over-predicting thick edges in the edge-detection process. It introduces a new edge-detection loss function to balance the representation of the boundary and background classes in the training data. A hierarchical network architecture enables the generation of thin edges without requiring post-processing.

RCF (Liu et al. Citation2019): The richer convolutional features (RCF) network detects image edges using a rich convolutional feature hierarchy that captures multilevel and multiscale information. This allows for holistic image-to-image edge prediction, making full use of the available information.

BDCN (He et al. Citation2022): The bidirectional cascade network (BDCN) utilizes labeled edge pixels for supervision at specific scales. The network also introduces a scale enhancement module to efficiently learn multiscale representations at each layer during the training process.

Results and discussion

In this paper, we assessed the performance of DCEFE-Net using various metrics: overall accuracy (OA), recall, precision, F-measure, and intersection over union (IoU). We used two evaluation methods, strict and relaxed, for each model tested on the WHU and IAIL datasets. Strict evaluation calculates the results based on the exact correspondence between the predicted and test data. Relaxed evaluation has two components, precision and recall. Relaxed precision refers to the proportion of predicted building-edge pixels that lie within ρ pixels of an actual building-edge pixel, while relaxed recall refers to the proportion of actual building-edge pixels that lie within ρ pixels of a predicted building-edge pixel (Arbeláez et al. Citation2011). In all of the experimental models, the relaxation parameter ρ was set to 3 pixels.
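One common way to compute such relaxed scores is to dilate each binary edge map by ρ pixels before counting matches, as in the sketch below; this illustrates the idea rather than reproducing the exact matching protocol of Arbeláez et al. (Citation2011).

```python
import numpy as np
from scipy.ndimage import binary_dilation


def relaxed_precision_recall(pred, gt, rho=3):
    """Relaxed precision/recall for binary edge maps (sketch).

    pred, gt: boolean arrays of the same shape (predicted and ground-truth edges).
    A predicted pixel counts as correct if a ground-truth edge lies within rho pixels,
    and a ground-truth pixel counts as found if a prediction lies within rho pixels.
    """
    structure = np.ones((2 * rho + 1, 2 * rho + 1), dtype=bool)
    gt_dilated = binary_dilation(gt, structure=structure)
    pred_dilated = binary_dilation(pred, structure=structure)

    precision = (pred & gt_dilated).sum() / max(pred.sum(), 1)
    recall = (gt & pred_dilated).sum() / max(gt.sum(), 1)
    return precision, recall
```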

Evaluation metrics

Inspired by Wen et al. (Citation2021), DCEFE-Net’s effectiveness in extracting building-edges is measured using the aforementioned five metrics, specifically OA, recall, precision, F-measure, and IoU. The evaluation metrics can be described using the following equations: (8) Recall = TP / (TP + FN) (9) Precision = TP / (TP + FP) (10) OA = (TP + TN) / (TP + FP + TN + FN) (11) IoU = TP / (TP + FP + FN) (12) F-measure = 2 × Precision × Recall / (Precision + Recall)

The following definitions were used to evaluate the performance of the algorithm in detecting building-edges: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). TP indicates the number of pixels detected as building-edges that agree with the ground-truth building-edges. FP indicates the number of pixels detected as building-edges that do not correspond to ground-truth building-edges. FN indicates the number of ground-truth building-edge pixels that are not detected. TN indicates the number of detected background pixels that agree with the ground-truth non-building-edge pixels. The F-measure is reported at both the optimal dataset scale (ODS) and the optimal image scale (OIS). ODS involves selecting a single fixed threshold, applied to all images, that maximizes the F-measure over the entire dataset, whereas OIS involves selecting, for each image, the threshold that maximizes the F-measure for that image.
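The strict pixel-wise metrics of Equations (8)–(12) can be computed directly from binary prediction and ground-truth maps, as in the following sketch.

```python
import numpy as np


def strict_edge_metrics(pred, gt):
    """Strict pixel-wise metrics (sketch); pred and gt are boolean edge maps."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)

    precision = tp / max(tp + fp, 1)                       # Eq. (9)
    recall = tp / max(tp + fn, 1)                          # Eq. (8)
    oa = (tp + tn) / max(tp + fp + tn + fn, 1)             # Eq. (10)
    iou = tp / max(tp + fp + fn, 1)                        # Eq. (11)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)  # Eq. (12)
    return {"OA": oa, "precision": precision, "recall": recall,
            "IoU": iou, "F": f_measure}
```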

Training details

PyTorch 1.11.0, torchvision 0.12.0, and CUDA 10.2 were used to implement DCEFE-Net in this study. All the experiments were conducted on an NVIDIA 2070 GPU with 16 GB of memory. The training data were resized to 512 × 512 pixels, and the batch size was set to 4. The AdamW optimizer was used to train all network structures because it is more effective than the commonly used Adam optimizer in terms of convergence and regularization.

To optimize the model, several hyperparameters were set: the learning rate was set to 1e-3, the betas were set to 0.9 and 0.999, eps was set to 1e-6, weight_decay was set to 0, and correct_bias was set to true. Although the model’s performance improved initially, the improvement stopped after several iterations. To address this issue, the ReduceLROnPlateau callback function was implemented, which further reduced the learning rate. This function relies on four key parameters: the monitor, set to val_accuracy to track the metric; the factor, set to 0.2, to scale the learning rate; the patience, set to 20, to determine the number of stalled epochs before activation; and min_lr, set to 0.001, to prevent unnecessary and unhelpful reductions in the learning rate.

The experiments revealed that higher epoch counts improved accuracy. However, at 120 epochs, the rate of improvement in accuracy decreased and then gradually leveled off. To reduce training costs, the final epoch count was set to 100. The implementation also set verbose to 1, mode to auto, and the threshold epsilon to 0.0001 to define the plateau region. A warm-up strategy was also employed, which starts with a smaller learning rate and gradually increases it based on changes in the input data. Once the model stabilized, a lower learning rate was used to converge to the local optimum.
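The optimizer and scheduler configuration described above corresponds roughly to the following PyTorch sketch; the placeholder model, the warm-up length, and the mapping of the callback parameters onto torch.optim.lr_scheduler.ReduceLROnPlateau are assumptions.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Conv2d(3, 1, 3, padding=1)   # placeholder; the real model is DCEFE-Net

optimizer = AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-6, weight_decay=0.0)

# Plateau scheduler tracking validation accuracy (mode="max"), mirroring the
# monitor / factor / patience / threshold / min_lr settings described in the text.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.2,
                              patience=20, threshold=1e-4, min_lr=1e-3)

base_lr, warmup_epochs, num_epochs = 1e-3, 5, 100   # warm-up length is an assumption

for epoch in range(num_epochs):
    # Linear warm-up: ramp the learning rate up to base_lr over the first epochs.
    if epoch < warmup_epochs:
        for group in optimizer.param_groups:
            group["lr"] = base_lr * (epoch + 1) / warmup_epochs

    # Run one training epoch and evaluate on the validation set here; the actual
    # training and validation routines are omitted from this sketch.
    val_accuracy = 0.0                  # placeholder; use the real validation accuracy
    scheduler.step(val_accuracy)        # reduce the LR when accuracy plateaus
```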

Comparative experiments

To quantitatively evaluate the performance of DCEFE-Net, we compared it with several existing methods on two datasets, WHU and IAIL. The results of the comparisons, including specific metrics and building-edge-detection plots, are presented in Sections 5.3.1 and 5.3.2, respectively. To ensure experimental fairness and consistency of the metrics, we set the binarization threshold to 0.5 and performed post-processing using the non-maximum suppression (NMS) method.

  1. Results on WHU Dataset

Table 1 presents the building-edge-detection metrics for the six models (LPCB, HED, RCF, DexiNed, BDCN, and DCEFE-Net) on the WHU dataset. All networks achieved strict OA values above 90%, with DCEFE-Net achieving the highest value of 98.33%, outperforming the other models. Moreover, after relaxed processing, DCEFE-Net performed well for all metrics except the recall index. Specifically, the relaxed OA and relaxed precision were 98.2% and 80.11%, respectively, with a precision improvement of 7.37% compared to DexiNed. DCEFE-Net has a lower recall than LPCB because of its thinner building-edges and a significant reduction in the number of FP samples. However, the F1 score of DCEFE-Net was 72.94%, which was 26.56% better than that of the LPCB model. This suggests that DCEFE-Net is more effective in handling non-building-edge noise constraints. In addition, DCEFE-Net outperformed DexiNed in terms of the relaxed F1 score and IoU metrics, with improvements of 5.93% and 6.71%, respectively. Moreover, the strict OA, precision, and F1 values of DCEFE-Net were 1.41%, 4.83%, and 4.38% higher than those of the recently proposed BDCN, respectively. These results indicate that DCEFE-Net effectively improves on the edge-detection performance of DexiNed.

Figure 6 shows the building-edge-detection results for the six networks on the WHU dataset. LPCB and RCF produced noisy edges and were overly sensitive to impervious surfaces such as roads. HED and DexiNed performed better, producing more accurate building contours. However, as shown in Figure 7, HED and DexiNed continued to exhibit omissions and misclassifications when detecting fine building-edges, inaccurate localization of building inflection points, and occasional misclassification of impervious surfaces as building-edges. By contrast, the proposed DCEFE-Net achieved better performance on the WHU dataset, reducing misclassification while preserving thin building-edges, accurately detecting complex building-edge inflection points, and closely matching the labeled data in the dataset.

  2. Results on the IAIL Dataset

Figure 6. Example of building-edge mapping generated by six models in the WHU dataset. From left to right are the original image, ground truth labels, DexiNed, LPCB, HED, RCF, BDCN and DCEFE-Net.


Figure 7. The plot of the final building-edge-detection results generated by DexiNed, LPCB, HED, RCF, BDCN and DCEFE-Net after NMS processing. The red in the figure indicates the building-edge-detection labels, and the green shows the detection results of each model.


Table 1. Evaluation results on the WHU test dataset.

Table 2 presents the building-edge-detection metrics for the six models (LPCB, HED, RCF, DexiNed, BDCN, and DCEFE-Net) on the IAIL dataset, which includes buildings with vegetation occlusion and more complex city types. As a result, the relaxed OA and relaxed precision are 0.37% and 0.84% lower, respectively, than those on the WHU dataset. DCEFE-Net nonetheless still achieved the best detection performance, with precision higher than that of LPCB, HED, RCF, BDCN, and DexiNed by 18.27%, 18.80%, 19.65%, 20.51%, and 19.58%, respectively, even under strict metrics. In addition, the strict and relaxed OA values of DCEFE-Net were 6.7% and 4.76% higher than those of BDCN, respectively. Although the strict recall of DCEFE-Net was 0.15% lower than that of HED, the overall results were better.

Figure 8 presents the building-edge-detection results for the six networks on the IAIL dataset. DexiNed, LPCB, and RCF are sensitive to noise, resulting in rough building-edge-detection. Although HED performs better in suppressing noise, the details of the roofs are not sufficiently expressed, and the building-edges are confused. DCEFE-Net responds better to noise and can detect clear building-edge patterns while delineating fine independent building-edges more accurately. The details are shown in Figure 9. Owing to occlusion by vegetation and the complexity of the background, the building-edges in the IAIL dataset cannot be completely represented. DexiNed, LPCB, HED, and RCF cannot clearly and accurately identify the edges of low free-standing buildings, resulting in blurred and rough building-edges. For the large independent buildings with complex boundary morphology in the first column, the correct detailed building outline is also not predicted, and there are a large number of FP pixels. By contrast, DCEFE-Net excels at accurate edge detection for low free-standing buildings and distinguishes them from neighboring buildings without merging them into larger segments. In addition, it achieves good edge-detection results for large buildings with complex edges, providing the most accurate and complete representation of the buildings’ thin edges.

Figure 8. Example of building-edge mapping generated by six models in the IAIL dataset. From left to right are the original image, ground truth labels, DexiNed, LPCB, HED, RCF, BDCN and DCEFE-Net.


Figure 9. Plot of the final building-edge-detection results generated by DexiNed, LPCB, HED, RCF, BDCN, and DCEFE-Net after NMS processing. Red indicates the building-edge-detection labels and the green shows detection results of each model.


Table 2. Evaluation results on the IAIL (Inria) test dataset.

Ablation experiments

The proposed DCEFE-Net incorporates two attention modules, namely the SA and CA modules, as well as the EAFE module. The processing of the WHU dataset using DCEFE-Net is illustrated in Figure 10. To further validate the performance of these modules, different combinations were experimentally evaluated using DCEFE-Net, as listed in Table 3. The baseline represents the performance of the backbone network without any of the three modules. Compared to the baseline, adding the SA module to block 1 significantly improved the edge-detection performance, with an increase of 3.59% in OA and 2.66% in the F1 score. Adding the CA module alone to block 4 improved the OA by 3.71% and the F1 score by 2.71%. When both the SA and CA modules were included simultaneously, the OA improved by 4.28% and the F1 score by 3.54%. When the EAFE module was added alone, an OA increase of 4.04% and an F1 increase of 3.16% were achieved. The network performs best when the dilation rates r in the EAFE module are set to (3, 6, 9), and accuracy begins to decline as r increases further. Integrating all three modules into the network structure resulted in an OA improvement of 5.41% and an F1 improvement of 4.18%. In summary, incorporating the attention modules and the EAFE module into the backbone network effectively enhances the edge-detection performance.

Figure 10. The edge mapping of DCEFE-Net tested on the WHU dataset. Block 1 to block 4 are the outputs of the four main modules after passing through the EAFE module, fusion denotes the connection and fusion of the outputs, and average is the average of all predicted values.


Table 3. The ablation study of DCEFE-Net on the WHU dataset investigates the impact of different components: SA (Spatial Attention) module, CA (Channel Attention) module, and EAFE (Edge-Aware Feature Enhancement) module.

By training DCEFE-Net on the WHU dataset, the effect of the dilation rate factor r0 on the EAFE module was determined. The experimental results, shown in Table 4, reveal that the network achieves optimal performance when r0 is set to 3, with a fixed number of three dilated convolution layers. As the value of r0 increased beyond 3, the network performance gradually declined.

Table 4. Investigating the impact of EAFE module parameters on edge detection performance on the WHU dataset.

Conclusion

We propose DCEFE-Net to address the issue of small and independent buildings being easily overlooked in building-edge-detection due to scale differences and blurred boundary details.

The main concept of DCEFE-Net is to enhance the detection of separate building structures and to separate closely connected building-edges by employing multi-level dense connectivity blocks and the EAFE module. The dense connectivity blocks were designed to prevent the loss of architectural edge features in the HRRSI. The inclusion of the SA and CA modules in the main blocks ensures a better focus on multiscale edge features. Moreover, the proposed EAFE module effectively combines edge features from different levels and suppresses background noise in the images. Validation on the WHU and IAIL datasets indicates that the proposed method enhances edge-detection performance, enabling the generation of more precise building boundary masks and capturing detailed local structural information of building-edges. DCEFE-Net fuses building-edge features at several scales; by analyzing the similarities and differences among the multiscale edge features, it effectively resolves the adhesion problem between small-scale buildings and extracts small buildings more accurately. However, it should be noted that DCEFE-Net may exhibit some sensitivity to building shadows during the edge-detection process, which warrants further investigation in future research.

Disclosure statement

The authors declare no competing financial interests.

References

  • Ahmadi, S., Zoej, M.J.V., Ebadi, H., Moghaddam, H.A., and Mohammadzadeh, A. 2010. “Automatic urban building boundary extraction from high resolution aerial images using an innovative model of active contours.” International Journal of Applied Earth Observation and Geoinformation, Vol. 12 (No. 3): pp. 150–157. doi:10.1016/j.jag.2010.02.001.
  • Ahmed, A., and Byun, Y.-C. 2019. ‘Edge-detection using CNN for roof images’. Paper presented at Proceedings of the 2019 Asia Pacific Information Technology Conference, 75–78. Jeju Island Republic of Korea: ACM. doi:10.1145/3314527.3314544.
  • Arbeláez, P., Maire, M., Fowlkes, C., and Malik, J. 2011. “Contour detection and hierarchical image segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33 (No. 5): pp. 898–916. doi:10.1109/TPAMI.2010.161.
  • Bai, B., Fu, W., Lu, T., and Li, S. 2022. “Edge-guided recurrent convolutional neural network for multitemporal remote sensing image building change detection.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 60: pp. 1–13. doi:10.1109/TGRS.2021.3106697.
  • Bell, S., Zitnick, C., Bala, K., and Girshick, R. 2015. ‘Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks’. arXiv. http://arxiv.org/abs/1512.04143.
  • Benjilali, W., Guicquero, W., Jacques, L., and Sicard, G. 2019. ‘Hardware-friendly compressive imaging based on random modulations & permutations for image acquisition and classification.” 2019 IEEE International Conference on Image Processing (ICIP), 2085–89. Taipei, Taiwan: IEEE. doi:10.1109/ICIP.2019.8803113.
  • Bousias Alexakis, E., and Armenakis, C. 2021. “Performance improvement of encoder/decoder-based CNN architectures for change detection from very high-resolution satellite imagery.” Canadian Journal of Remote Sensing, Vol. 47 (No. 2): pp. 309–336. doi:10.1080/07038992.2021.1922880.
  • Canny, J. 1986. “A computational approach to edge-detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8 (No. 6): pp. 679–698. doi:10.1109/TPAMI.1986.4767851.
  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. 2020. ‘End-to-End Object Detection with Transformers’. arXiv. http://arxiv.org/abs/2005.12872.
  • Chaudhuri, B.B., and Chanda, B. 1984. “The equivalence of best plane fit gradient with Robert’s, Prewitt’s and Sobel’s gradient for edge-detection and a 4-neighbour gradient with useful properties.” Signal Processing, Vol. 6 (No. 2): pp. 143–151. doi:10.1016/0165-1684(84)90015-X.
  • Chen, D.-J., Hsieh, H.-Y., and Liu, T.-L. 2021. ‘Adaptive Image Transformer for One-Shot Object Detection.” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12242–51. Nashville, TN, USA: IEEE. doi:10.1109/CVPR46437.2021.01207.
  • Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A.L. 2018. “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40 (No. 4): pp. 834–848. doi:10.1109/TPAMI.2017.2699184.
  • Cui, S., Yan, Q., and Reinartz, P. 2012. “Complex building description and extraction based on hough transformation and cycle detection.” Remote Sensing Letters, Vol. 3 (No. 2): pp. 151–159. doi:10.1080/01431161.2010.548410.
  • Deng, R., Shen, C., Liu, S., Wang, H., and Liu, X. 2018. “Learning to predict crisp boundaries”. arXiv. http://arxiv.org/abs/1807.10097.
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”. arXiv. http://arxiv.org/abs/2010.11929.
  • Durieux, L., Lagabrielle, E., and Nelson, A. 2008. “A method for monitoring building construction in Urban sprawl areas using object-based analysis of spot 5 images and existing GIS data.” ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 63 (No. 4): pp. 399–408. doi:10.1016/j.isprsjprs.2008.01.005.
  • Fang, F., Li, J., Yuan, Y., Zeng, T., and Zhang, G. 2021. “Multilevel edge-detections guided network for image denoising.” IEEE Transactions on Neural Networks and Learning Systems, Vol. 32 (No. 9): pp. 3956–3970. doi:10.1109/TNNLS.2020.3016321.
  • Fang, T., Zhang, M., Fan, Y., Wu, W., Gan, H., and She, Q. 2021. “Developing a feature decoder network with low-to-high hierarchies to improve edge-detection.” Multimedia Tools and Applications, Vol. 80 (No. 1): pp. 1611–1624. doi:10.1007/s11042-020-09800-x.
  • Feng, M., Lu, H., and Ding, E. 2019. “Attentive feedback network for boundary-aware salient object detection.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1623–32. Long Beach, CA, USA: IEEE. doi:10.1109/CVPR.2019.00172.
  • Gao, Y., Wang, M., Tao, D., Ji, R., and Dai, Q. 2012. “3-D object retrieval and recognition with hypergraph analysis.” IEEE Transactions on Image Processing, Vol. 21 (No. 9): pp. 4290–4303. doi:10.1109/TIP.2012.2199502.
  • Hamaguchi, R., Fujita, A., Nemoto, K., Imaizumi, T., and Hikosaka, S. 2018. ‘Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery’. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 1442–50. Lake Tahoe, NV: IEEE. doi:10.1109/WACV.2018.00162.
  • Han, J., Ngan, K.N., Li, M., and Zhang, H.-J. 2006. “Unsupervised extraction of visual attention objects in color images.” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16 (No. 1): pp. 141–145. doi:10.1109/TCSVT.2005.859028.
  • Harris, C., and Stephens, M. 1988. “A combined corner and edge detector.” Procedings of the Alvey Vision Conference 1988, 23.1–23.6. Manchester: Alvey Vision Club. doi:10.5244/C.2.23.
  • He, J., Zhang, S., Yang, M., Shan, Y., and Huang, T. 2022. “BDCN: Bi-directional cascade network for perceptual edge-detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44 (No. 1): pp. 100–113. doi:10.1109/TPAMI.2020.3007074.
  • He, K., Zhang, X., Ren, S., and Sun, J. 2016. “Deep residual learning for image recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. Las Vegas, NV, USA: IEEE. doi:10.1109/CVPR.2016.90.
  • He, X., Zhang, Z., and Yang, Z. 2021. “Extraction of Urban built-up area based on the fusion of night-time light data and point of interest data.” Royal Society Open Science, Vol. 8 (No. 8): pp. 210838. doi:10.1098/rsos.210838.
  • Itti, L., Koch, C., and Niebur, E. 1998. “A model of saliency-based visual attention for rapid scene analysis.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20 (No. 11): pp. 1254–1259. doi:10.1109/34.730558.
  • Ji, S., Wei, S., and Lu, M. 2019. “Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 57 (No. 1): pp. 574–586. doi:10.1109/TGRS.2018.2858817.
  • Ko, B.C., and Nam, J.-Y. 2006. “Object-of-interest image segmentation based on human attention and semantic region clustering.” Journal of the Optical Society of America A, Vol. 23 (No. 10): pp. 2462–2470. doi:10.1364/JOSAA.23.002462.
  • Krizhevsky, A., Sutskever, I., and Hinton, G.E. 2017. “ImageNet classification with deep convolutional neural networks.” Communications of the ACM, Vol. 60 (No. 6): pp. 84–90. doi:10.1145/3065386.
  • Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE, Vol. 86 (No. 11): pp. 2278–2324. doi:10.1109/5.726791.
  • Li, G., and Yu, Y. 2015. “Visual Saliency Based on Multiscale Deep Features.” arXiv. http://arxiv.org/abs/1503.08663.
  • Li, S., Liu, Q., Li, Z., Chen, E., and Zhang, J. 2017. “Building Height Extraction from Overlapping Airborne Images in Urban Environment Using Computer Vision Approach.” 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 5767–69. Fort Worth, TX: IEEE. doi:10.1109/IGARSS.2017.8128318.
  • Li, X., Yang, F., Cheng, H., Liu, W., and Shen, D. 2018. “Contour Knowledge Transfer for Salient Object Detection.” In Computer Vision – ECCV 2018, edited by V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Vol. 11219, 370–385. Cham: Springer International Publishing. doi:10.1007/978-3-030-01267-0_22.
  • Lin, D., Ji, Y., Lischinski, D., Cohen-Or, D., and Huang, H. 2018. “Multiscale Context Intertwining for Semantic Segmentation.” In Computer Vision – ECCV 2018: Lecture Notes in Computer Science, edited by V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, vol. 11207, 622–638. Cham: Springer International Publishing. doi:10.1007/978-3-030-01219-9_37.
  • Liu, N., and Han, J. 2016. “DHSNet: Deep hierarchical saliency network for salient object detection.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 678–86. Las Vegas, NV, USA: IEEE. doi:10.1109/CVPR.2016.80.
  • Liu, N., Han, J., and Yang, M.-H. 2018. “PiCANet: Learning Pixel-Wise Contextual Attention for Saliency Detection.” arXiv. http://arxiv.org/abs/1708.06433.
  • Liu, N., Zhang, N., Wan, K., Shao, L., and Han, J. 2021. “Visual Saliency Transformer.” arXiv. http://arxiv.org/abs/2104.12099.
  • Liu, Y., Cheng, M.-M., Hu, X., Bian, J.-W., Zhang, L., Bai, X., and Tang, J. 2019. “Richer convolutional features for edge detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41 (No. 8): pp. 1939–1946. doi:10.1109/TPAMI.2018.2878849.
  • Liu, Z., Tan, Y., He, Q., and Xiao, Y. 2022. “SwinNet: Swin transformer drives edge-aware RGB-D and RGB-T salient object detection.” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 32 (No. 7): pp. 4486–4497. doi:10.1109/TCSVT.2021.3127149.
  • Lu, T., Ming, D., Lin, X., Hong, Z., Bai, X., and Fang, J. 2018. “Detecting building edges from high spatial resolution remote sensing imagery using richer convolution features network.” Remote Sensing, Vol. 10 (No. 9): pp. 1496. doi:10.3390/rs10091496.
  • Luo, Z., Mishra, A., Achkar, A., Eichel, J., Li, S., and Jodoin, P.-M. 2017. “Non-local deep features for salient object detection.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6593–6601. Honolulu, HI: IEEE. doi:10.1109/CVPR.2017.698.
  • Ma, X., Liu, S., Hu, S., Geng, P., Liu, M., and Zhao, J. 2018. “SAR image edge detection via sparse representation.” Soft Computing, Vol. 22 (No. 8): pp. 2507–2515. doi:10.1007/s00500-017-2505-y.
  • Maggiori, E., Tarabalka, Y., Charpiat, G., and Alliez, P. 2017. “Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark.” 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 3226–29. Fort Worth, TX: IEEE. doi:10.1109/IGARSS.2017.8127684.
  • Marcu, A., and Leordeanu, M. 2016. “Dual local-global contextual pathways for recognition in aerial imagery.” arXiv. http://arxiv.org/abs/1605.05462.
  • Marr, D., and Hildreth, E. 1980. “Theory of edge detection.” Proceedings of the Royal Society of London. Series B, Biological Sciences, Vol. 207: pp. 187–217. doi:10.1098/rspb.1980.0020.
  • Min, D., Zhang, C., Lu, Y., Fu, K., and Zhao, Q. 2022. “Mutual-guidance transformer-embedding network for video salient object detection.” IEEE Signal Processing Letters, Vol. 29: pp. 1674–1678. doi:10.1109/LSP.2022.3192753.
  • Partovi, T., Bahmanyar, R., Kraus, T., and Reinartz, P. 2017. “Building outline extraction using a heuristic approach based on generalization of line segments.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 10 (No. 3): pp. 933–947. doi:10.1109/JSTARS.2016.2611861.
  • Pei, J., Cheng, T., Tang, H., and Chen, C. 2023. “Transformer-based efficient salient instance segmentation networks with orientative query.” IEEE Transactions on Multimedia, Vol. 25: pp. 1964–1978. doi:10.1109/TMM.2022.3141891.
  • Peng, J., Zhang, D., and Liu, Y. 2005. “An improved snake model for building detection from urban aerial images.” Pattern Recognition Letters, Vol. 26 (No. 5): pp. 587–595. doi:10.1016/j.patrec.2004.09.033.
  • Pirzada, S.J.H., and Siddiqui, A. 2013. “Analysis of edge detection algorithms for feature extraction in satellite images.” 2013 IEEE International Conference on Space Science and Communication (IconSpace), 238–42. Melaka, Malaysia: IEEE. doi:10.1109/IconSpace.2013.6599472.
  • Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., and Jagersand, M. 2019. “BASNet: Boundary-Aware Salient Object Detection.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7471–81. Long Beach, CA, USA: IEEE. doi:10.1109/CVPR.2019.00766.
  • Ren, X. 2008. “Multiscale improves boundary detection in natural images.” In Computer Vision – ECCV 2008: Lecture Notes in Computer Science, edited by D. Forsyth, P. Torr, and A. Zisserman, vol. 5304, 533–545. Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-540-88690-7_40.
  • Shan, J., Zhou, S., Cui, Y., and Fang, Z. 2023. “Real-time 3D single object tracking with transformer.” IEEE Transactions on Multimedia, Vol. 25: pp. 2339–2353. doi:10.1109/TMM.2022.3146714.
  • Shelhamer, E., Long, J., and Darrell, T. 2017. “Fully convolutional networks for semantic segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39 (No. 4): pp. 640–651. doi:10.1109/TPAMI.2016.2572683.
  • Shen, W., Wang, X., Wang, Y., Bai, X., and Zhang, Z. 2015. “DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3982–91. Boston, MA, USA: IEEE. doi:10.1109/CVPR.2015.7299024.
  • Simonyan, K., and Zisserman, A. 2015. “Very deep convolutional networks for large-scale image recognition.” arXiv. http://arxiv.org/abs/1409.1556.
  • Soria, X., Riba, E., and Sappa, A.D. 2020. “Dense extreme inception network: Towards a robust CNN model for edge detection.” arXiv. http://arxiv.org/abs/1909.01955.
  • Strudel, R., Garcia, R., Laptev, I., and Schmid, C. 2021. “Segmenter: Transformer for Semantic Segmentation.” arXiv. http://arxiv.org/abs/2105.05633.
  • Su, N., Yan, Y., Qiu, M., Zhao, C., and Wang, L. 2018. “Object-based dense matching method for maintaining structure characteristics of linear buildings.” Sensors, Vol. 18 (No. 4): pp. 1035. doi:10.3390/s18041035.
  • Sun, L., Tang, Y., and Zhang, L. 2017. “Rural building detection in high-resolution imagery based on a two-stage CNN Model.” IEEE Geoscience and Remote Sensing Letters, Vol. 14 (No. 11): pp. 1998–2002. doi:10.1109/LGRS.2017.2745900.
  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. 2021. “Training Data-Efficient Image Transformers & Distillation through Attention.” arXiv. http://arxiv.org/abs/2012.12877.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. 2023. “Attention is all you need.” arXiv. http://arxiv.org/abs/1706.03762.
  • Wang, T., Borji, A., Zhang, L., Zhang, P., and Lu, H. 2017. “A stagewise refinement model for detecting salient objects in images.” 2017 IEEE International Conference on Computer Vision (ICCV), 4039–48. Venice: IEEE. doi:10.1109/ICCV.2017.433.
  • Wang, T., Zhang, L., Wang, S., Lu, H., Yang, G., Ruan, X., and Borji, A. 2018. “Detect Globally, Refine Locally: A Novel Approach to Saliency Detection.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3127–35. Salt Lake City, UT: IEEE. doi:10.1109/CVPR.2018.00330.
  • Wang, W., Zhao, S., Shen, J., Hoi, S.C.H., and Borji, A. 2019. “Salient object detection with pyramid attention and salient edges.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1448–57. Long Beach, CA, USA: IEEE. doi:10.1109/CVPR.2019.00154.
  • Wang, X., Chen, S., Wei, G., and Liu, J. 2023. “TENet: Accurate light-field salient object detection with a transformer embedding network.” Image and Vision Computing, Vol. 129: pp. 104595. doi:10.1016/j.imavis.2022.104595.
  • Wang, Y., Jia, X., Zhang, L., Li, Y., Elder, J.H., and Lu, H. 2023. “A uniform transformer-based structure for feature fusion and enhancement for RGB-D saliency detection.” Pattern Recognition, Vol. 140: pp. 109516. doi:10.1016/j.patcog.2023.109516.
  • Wang, Z., Zhang, Y., Liu, Y., Wang, Z., Coleman, S., and Kerr, D. 2022. “TF-SOD: A novel transformer framework for salient object detection.” Neural Computing and Applications, Vol. 34 (No. 14): pp. 11789–11806. doi:10.1007/s00521-022-07069-9.
  • Wen, X., Li, X., Zhang, C., Han, W., Li, E., Liu, W., and Zhang, L. 2021. “ME-Net: A multiscale erosion network for crisp building edge detection from very high resolution remote sensing imagery.” Remote Sensing, Vol. 13 (No. 19): pp. 3826. doi:10.3390/rs13193826.
  • Wu, R., Feng, M., Guan, W., Wang, D., Lu, H., and Ding, E. 2019. “A mutual learning method for salient object detection with intertwined multi-supervision.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8142–51. Long Beach, CA, USA: IEEE. doi:10.1109/CVPR.2019.00834.
  • Wu, Z., Su, L., and Huang, Q. 2019. “Cascaded partial decoder for fast and accurate salient object detection.” arXiv. http://arxiv.org/abs/1904.08739.
  • Xia, L., Zhang, X., Zhang, J., Yang, H., and Chen, T. 2021. “Building extraction from very-high-resolution remote sensing images using semi-supervised semantic edge.” Remote Sensing, Vol. 13 (No. 11): pp. 2187. doi:10.3390/rs13112187.
  • Xie, S., and Tu, Z. 2015. “Holistically-nested edge detection.” arXiv. http://arxiv.org/abs/1504.06375.
  • Yan, X., Tang, H., Sun, S., Ma, H., Kong, D., and Xie, X. 2022. “AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation.” 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3270–80. Waikoloa, HI, USA: IEEE. doi:10.1109/WACV51458.2022.00333.
  • Yu, Z., Feng, C., Liu, M.-Y., and Ramalingam, S. 2017. “CASENet: Deep category-aware semantic edge detection.” arXiv. http://arxiv.org/abs/1705.09759.
  • Zhang, K., Guo, Y., Wang, X., Yuan, J., Ma, Z., and Zhao, Z. 2019. “Channel-wise and feature-points reweights densenet for image classification.” 2019 IEEE International Conference on Image Processing (ICIP), 410–14. Taipei, Taiwan: IEEE. doi:10.1109/ICIP.2019.8802982.
  • Zhang, X., Wang, T., Qi, J., Lu, H., and Wang, G. 2018. “Progressive attention guided recurrent network for salient object detection.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 714–22. Salt Lake City, UT: IEEE. doi:10.1109/CVPR.2018.00081.
  • Zhao, T., and Wu, X. 2019. “Pyramid feature attention network for saliency detection.” arXiv. http://arxiv.org/abs/1903.00179.
  • Zhao, Y., Sun, G., Zhang, L., Zhang, A., Jia, X., and Han, Z. 2023. “MSRF-Net: Multiscale receptive field network for building detection from remote sensing images.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 61: pp. 1–14. doi:10.1109/TGRS.2023.3282926.
  • Zheng, L., Wang, S., Liu, Z., and Tian, Q. 2015. “Fast image retrieval: Query pruning and early termination.” IEEE Transactions on Multimedia, Vol. 17 (No. 5): pp. 648–659. doi:10.1109/TMM.2015.2408563.