
Vectorizing planar roof structure from very high resolution remote sensing images using transformers

Pages 1-15 | Received 13 Jul 2023, Accepted 04 Dec 2023, Published online: 17 Dec 2023

ABSTRACT

Accurately predicting the geometric structure of a building's roof as a vectorized representation from a raster image is a challenging task in building reconstruction. In this paper, we propose an efficient and precise parsing method called Roof-Former, based on a vision Transformer. Our method involves three steps: (1) Image encoder and edge node initialization, (2) Image feature fusion with an enhanced segmentation refinement branch, and (3) Edge filtering and structural reasoning. Our method outperforms previous works on the vectorizing world building dataset and the Enschede dataset, with vertex and edge heat map F1-scores increasing from 87.1%, 76.2% to 89.1%, 78.1%, and from 69.7%, 68.8% to 71.2%, 69.5%, respectively. Furthermore, our method demonstrates superior performance compared to the current state-of-the-art based on qualitative evaluations, indicating its effectiveness in extracting global image information while maintaining the consistency and topological validity of the roof structure.


1. Introduction

Creation of comprehensive 3D building models requires access to roof structure information. Such models are useful in applications like building energy modeling, and urban planning (Biljecki et al. Citation2016; Lv et al. Citation2022; W. Zhou et al. Citation2020). A variety of remote sensing data types, including monocular or stereo images, point clouds, or digital surface models (DSMs) have been used to extract geometric building outlines and roof structures (Xiong, Oude Elberink, and Vosselman Citation2014; Zhao, Persello, and Stein Citation2021, Citation2022). The procurement of 3D spatial data and the creation of accompanying 3D models, however, are expensive, particularly when carried out over large areas (Lv et al. Citation2021; Zhao, Persello, and Stein Citation2023). By contrast, lower-cost and broader coverage are anticipated advantages for roof structure extraction based on optical remote sensing images.

The Level of Detail (LoD) concept serves to differentiate multi-scale representations of 3D building models in terms of their geometric and semantic details. With the refinement of LoD1 model construction workflows, an increasing number of scholars have turned their attention to automating the construction of LoD2 models. At the heart of LoD2 model construction lies the geometric structure of building roofs. As a topological collection of fine-grained geometric elements of building roofs, the roof structure combines line and junction elements with their connections. Geometric feature (or primitive) extraction from images is a fundamental task in computer vision. Conventionally, its extraction is carried out using perceptual grouping of low-level cues, namely image gradients, spatial frequency, and textures (Konishi et al. Citation2003; Le, Dabke, and Egan Citation2006; Von Gioi et al. Citation2008). The emergence of deep convolutional neural networks (CNNs) has introduced advances not only in spotting low-level primitives but also in providing the essential recognition of high-level geometric structures. Notable performance has been achieved in detecting lines, points, wireframes, and floor planes (Nauata and Furukawa Citation2020; Xue et al. Citation2020; Zhao, Persello, and Stein Citation2022). In Zhang, Nauata, and Furukawa (Citation2020) and Zhao, Persello, and Stein (Citation2022), methods based upon graph neural networks are used to identify geometric primitives and infer their connections in separated stages or in an end-to-end mode, respectively. Automated extraction of structured geometry (outline and roof structure) from optical remote sensing images has received limited research attention so far due to the challenges posed by scene complexity and the wide variety of roof configurations. Nauata and Furukawa (Citation2020) proposed to utilize CNNs for the detection of geometric primitives and their relationships, and further leveraged Integer Programming (IP) to combine this information and generate a planar graph in 2D. Similarly, Zhang, Nauata, and Furukawa (Citation2020) introduced a novel message-passing neural (MPN) architecture that employs convolutions to encode messages, effectively addressing the issue of similar line features. Additionally, J. Chen, Qian, and Furukawa (Citation2022) presented an attention-based neural network for structured reconstruction, where a 2D raster image serves as input, and a planar graph representing the underlying geometric structure is reconstructed. Existing methods for this task often suffer from the identification of false positive candidates that are not associated with buildings, as well as the exclusion of adjacency relationships among the geometric primitives (Esmaeily and Rezaeian Citation2023; M. Lin et al. Citation2023; Lu et al. Citation2023).

Recently, the Transformer's sequence-to-sequence model has been extensively used in vision tasks. The DEtection TRansformer (DETR) by Carion et al. (Citation2020) framed object detection as a sequence-to-sequence prediction problem, enabling straightforward forecasting of a set of objects based upon learned object queries and context feature sequences. This framework is then expanded to restructure a planar graph representing an underlying geometric structure, named the Holistic Edge Attention Transformer (HEAT) (J. Chen, Qian, and Furukawa Citation2022). These efforts demonstrate the promising application of Transformers in geometric structure reconstruction. Issues concerning the effective and efficient extraction of global image features persist, however, due to insufficient single-scale feature maps and high computational costs. These issues highlight the need for improved methods that can address these challenges and accurately extract structured geometry from optical remote sensing images.

To efficiently extract the roof structure from very high resolution remote sensing images, we introduce Roof-Former, a Transformer network for planar roof structure extraction. First, we add a Pyramid Transformer to the backbone. Second, we add a collaborative branch of semantic segmentation to primitive extraction. Roof-Former guarantees the consistency of spatial and topological relations of the extracted primitives within the roof structure. We sum up our contributions as follows.

(1)

We introduce Roof-Former, a Transformer-based planar roof structure extraction method built on HEAT (J. Chen, Qian, and Furukawa Citation2022). We introduce an enhanced feature pyramid module to the framework, which makes the image encoder flexible for learning multi-scale features while reducing resource consumption during training.

(2)

We add a collaborative segmentation refinement branch to the existing framework, which enhances the spatial and topological relations of the extracted primitives within the roof structure by jointly learning the building masks. Different modality features are effectively fused based upon an Attention Feature Fusion Module (AFFM).

We evaluated the proposed method on two roof structure reconstruction benchmarks. We conducted K-fold cross-validation with k = 5 to illustrate the usefulness and stability of the model and dataset during the training process. The quantitative and qualitative evaluation indicates that our method outperforms all compared methods, demonstrating more effective global structural reasoning.

2. Related work

2.1. Roof structure extraction

Most building feature extraction studies focus on footprint extraction, for which semantic segmentation methods based on deep CNNs are by now rather advanced. These methods, however, consistently yield results with irregular and noisy boundaries, necessitating manual editing for practical applications. To train deep learning models more effectively, Ji, Wei, and Lu (Citation2018) created a large, high-resolution dataset of aerial images, known as the WHU-building dataset. Subsequent studies proposed the PolygonCNN framework for generating building outlines from aerial imagery, which involves extracting the initial building contour and regularizing the shape and vertices to produce more precise polygons (Q. Chen et al. Citation2020; Zhao, Persello, and Stein Citation2021).

Extracting roof structures directly from remote sensing data is a crucial step in generating LoD2 building models. The detection of building geometric primitives such as lines, points, and planes (Y. Zhou, Qi, and Ma Citation2019; Zou et al. Citation2018) is a well-studied vision task. Initially, it was tackled using handcrafted descriptors. In recent years, however, the emergence of deep learning models with hierarchical feature learning capabilities has greatly improved the accuracy and completeness of detected building outlines and roof structures. One common approach for extracting low-level geometry is through heatmap regression or pixel-wise binary classification.

In contrast to pixel-wise inference methods for corners and edges, the detection of line primitives is more focused on the structural aspect, as a line segment is defined by two endpoints. The task of wireframe parsing, introduced by Y. Zhou, Qi, and Ma (Citation2019), has generated growing interest in inferring line segment candidates using learned junctions. As advancements in low- and mid-level primitive detection have emerged, researchers have recently directed their efforts towards structured reconstruction. In the remote sensing field, researchers encoded both internal and external feature lines to reconstruct an arbitrary topology of building roof structure (Nauata and Furukawa Citation2020; Zhang, Nauata, and Furukawa Citation2020). A method called ‘integrally attracted wireframe parsing’ was developed for reconstructing roof structures as planar graphs. This method uses the Hough transform to add global geometric lines to deep neural networks, enabling more effective extraction of linear geometric features (Y. Lin, Pintea, and van Gemert Citation2020). Zhao, Persello, and Stein (Citation2022) proposed a vectorized roofline extraction framework called the Roof Structure Graph Neural Network, which includes a multitask learning module for geometric primitive extraction and a GNN module to reconstruct the roofline structure. In this method, a deep convolutional network extracts key vertices and line segments before constructing the graph. These works are based on the idea of wireframe extraction, which aims to extract lines and junctions in man-made environments. A novel Transformer-based network for structured roof reconstruction was proposed by J. Chen, Qian, and Furukawa (Citation2022). This method treats roof structure extraction as a structured architecture reconstruction task by using a Transformer to infer the connectivity between all nodes as candidate edges. It takes a 2D raster image as input and produces a planar graph representing the underlying geometric structure.

As shown in Figure 1, the issue with existing methods is that they may identify false positive candidates among the extracted geometric primitives that lie outside a building. This leads to errors and inaccuracies in the final results. Additionally, these methods often fail to take the adjacency relationships among the primitives into account, which can further contribute to inaccuracies.

Figure 1. False detection caused by false positive vertex or edge candidates, which are marked with yellow dotted lines.

2.2. Vision transformer and feature enhancement

Convolutional neural networks (CNNs) have achieved remarkable success in computer vision, making them versatile and dominant for almost all tasks. To provide translation equivariance, the weights of convolutional kernels are shared over the entire image space. Unlike CNNs, Transformers use a sequence-to-sequence model that has been substantially adopted in vision tasks (Carion et al. Citation2020). This model denotes both the input features and output targets as visual tokens that engage in global interactions with one another via the Transformers' attention mechanism. Specifically, the DEtection TRansformer (DETR) of Carion et al. (Citation2020) framed object detection as a sequence-to-sequence prediction problem, enabling straightforward forecasting of a set of objects based upon the learned object queries in tandem with the context feature sequence. Following DETR, Xu et al. (Citation2021) proposed the LinE segment TRansformer (LETR) network to directly predict line segments from tokenized image features, thereby facilitating line segment detection.

One of the major challenges in various computer vision tasks is the issue of scale variation. To resolve this, i.e. to detect objects at multiple scales, multi-scale image pyramids have been developed and combined with non-maximum suppression techniques. Alternatively, the hierarchical feature pyramids of CNNs have been exploited to approximate image pyramids, with features from multiple layers fused to obtain high-resolution semantic features (Hu, Shen, and Sun Citation2018; T.-Y. Lin et al. Citation2017). Similarly to CNNs, Transformer networks also rely on high-resolution or multi-scale feature maps to explore richer global context representations (L.-C. Chen et al. Citation2016). The attention mechanism in deep learning, which imitates the human visual attention mechanism (Fu et al. Citation2019), was originally developed on a global scale. Recently, however, researchers have begun to consider the scale issue of attention mechanisms. To achieve multi-scale attention, they either feed multi-scale features into an attention module or combine feature contexts of multiple scales inside an attention module. Our study leverages the multi-scale attention mechanism to effectively combine local and global features from different branches, addressing the scale variation challenge in computer vision. This contributes to the development of more accurate and effective deep learning models.

3. Methodology

We introduce Roof-Former, a Transformer network for planar roof structure extraction. The overall architecture of Roof-Former is developed on the basis of HEAT. It identifies vertices and categorizes edge candidates between vertices in an end-to-end manner (Figure 2). The model infers vectorized planar graphs (i.e. vertices and edges) representing a roof structure given a 2D raster image. The proposed Roof-Former comprises three modules: (1) Image encoding and edge node initialization, (2) Image feature fusion with an enhanced segmentation refinement branch, and (3) Edge filtering and structural reasoning. The method first uses trigonometric positional encoding (Vaswani et al. Citation2017) to represent potential candidate roof lines composed of the line segments' endpoints:

$$f_{\mathrm{coord}} = M_{\mathrm{coord}}\left[\gamma(e_{1x}), \gamma(e_{1y}), \gamma(e_{2x}), \gamma(e_{2y})\right],$$
$$\gamma(t) = \left[\sin(w_{0}t), \cos(w_{0}t), \ldots, \sin(w_{31}t), \cos(w_{31}t)\right], \quad w_{i} = (1/10{,}000)^{2i/32} \;\; (i = 0, 1, \ldots, 31). \tag{1}$$

The two corners $e_1$ and $e_2$ are denoted by their x and y coordinates, $e_{1x}$, $e_{1y}$ and $e_{2x}$, $e_{2y}$. These coordinates are then linearly mapped using a 256×256 learnable matrix $M_{\mathrm{coord}}$. The function $\gamma$ is employed to encode ordinal priors: for any fixed position offset $\delta$, the positional encoding at position $i+\delta$ can be represented by a linear projection of the encoding at position $i$. The authors of Vaswani et al. (Citation2017) found that this particular form of $\gamma$ in Equation (1) is easier to learn.
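As a minimal sketch of Equation (1), the snippet below computes the trigonometric encoding of one edge candidate; the dimensions follow the text (64 sin/cos features per coordinate, 256-dimensional output), but the function and variable names are our own and not the authors' code.

```python
# Sketch of Equation (1): trigonometric positional encoding of an edge candidate.
import torch

def gamma(t: torch.Tensor, num_freqs: int = 32) -> torch.Tensor:
    """Encode a scalar coordinate t into 2*num_freqs sin/cos features (Equation 1)."""
    i = torch.arange(num_freqs, dtype=torch.float32)         # i = 0, ..., 31
    w = (1.0 / 10_000.0) ** (2.0 * i / num_freqs)             # w_i = (1/10000)^(2i/32)
    return torch.cat([torch.sin(w * t), torch.cos(w * t)])    # shape: (64,)

def edge_coord_feature(e1, e2, m_coord: torch.nn.Linear) -> torch.Tensor:
    """f_coord = M_coord [gamma(e1_x), gamma(e1_y), gamma(e2_x), gamma(e2_y)]."""
    enc = torch.cat([gamma(torch.tensor(float(c))) for c in (*e1, *e2)])  # (256,)
    return m_coord(enc)                                                    # learnable linear map

# Example: an edge candidate between pixel corners (30, 41) and (97, 150).
m_coord = torch.nn.Linear(256, 256, bias=False)   # stands in for the 256x256 matrix M_coord
f_coord = edge_coord_feature((30, 41), (97, 150), m_coord)
```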

Figure 2. The overall architecture of Roof-Former, which consists of three steps: (1) Image encoding and edge node initialization (yellow); (2) Image feature fusion with enhanced segmentation refinement branch (blue); and (3) Structural reasoning with Transformer decoders (green).


These lines are treated as nodes that are learned and discerned by the Transformer model. Secondly, multi-scale image features and the range of building masks are fused to filter edge candidates to decrease memory usage, as mask segmentation has been shown to assist geometric feature extraction and learning (Liu, Shi, and Ou Citation2022; Wang et al. Citation2021). Finally, the structural patterns of edges are learned through image and geometry decoders, ultimately enabling the classification of each edge node.

3.1. Image encoding and edge node initialization

A set of corner candidates is first detected from the input image of size H×W×3, where H and W denote the height and width of the image and 3 refers to the number of image bands. In the vertex detection network, we detect pixels as the vertex candidates of the Transformer nodes. To minimize memory costs, each 4×4 super-pixel is assigned as a node in our network instead of a pixel in the 256×256 image space. Each node's feature (fcoord) is built with an additional multilayer perceptron (MLP) by summing the coordinate features of the 16 pixels that make up a super-pixel. As shown in Figure 3, the corner detector, which is an adaptation of the edge classification architecture, is illustrated for an image size of 256×256. When the fusion of image features is complete, a CNN decoder transforms the 64×64×256 feature maps into the final 256×256 corner probability score. The CNN decoder is made up of convolution layers, upsampling layers, and a final linear layer for confidence map generation. To yield the final vertex detection results, we apply non-maximum suppression to the confidence map. As a result, each pair of vertices functions as an edge candidate and becomes a Transformer node, with the feature fcoord initialized by the 256-dimensional trigonometric positional encoding. To generate corner labels, we begin by creating a label map that matches the resolution of the input image. Subsequently, we address the issue of class imbalance by applying a Gaussian blur with a sigma value of 2 to the label map. This blur operation helps alleviate the class imbalance while preserving the overall information.
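The following is a minimal sketch of the corner label generation and the non-maximum suppression step described above. The sigma-2 Gaussian blur comes from the text; the NMS window size, the confidence threshold, and the peak normalization are assumptions for illustration, not the authors' settings.

```python
# Sketch of corner label blurring (sigma = 2) and NMS on the confidence map.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def make_corner_labels(corners, size=256, sigma=2.0):
    """Rasterize GT corners into a label map and blur it to soften class imbalance."""
    label = np.zeros((size, size), dtype=np.float32)
    for x, y in corners:
        label[int(y), int(x)] = 1.0
    blurred = gaussian_filter(label, sigma=sigma)
    return blurred / (blurred.max() + 1e-8)        # assumed normalization: keep peaks near 1

def nms_corners(conf, threshold=0.5, window=5):
    """Keep pixels that are local maxima of the confidence map above an (assumed) threshold."""
    local_max = maximum_filter(conf, size=window) == conf
    ys, xs = np.nonzero(local_max & (conf > threshold))
    return list(zip(xs, ys))
```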

Figure 3. The corner detection model adapted from edge detection architecture.


We acquire the image feature map from a backbone with reduced dimensions. Distinct from HEAT, we present an enhanced pyramid structure in the Transformer framework, namely a Feature Pyramid Transformer (FPT). Our backbone contains four stages that yield feature maps at varying scales. All stages share a consistent design, consisting of a patch embedding layer and $L_i$ Transformer encoding layers, where $i$ refers to the stage number, $i = 1, \ldots, 4$. Each of the $L_i$ encoder layers is made up of an attention layer and a feed-forward layer. We use a linear spatial reduction attention layer to replace the encoder's multi-head attention layer. It takes a query Q, a key K, and a value V as input and returns a refined feature. It uses average pooling before the attention operation to decrease the spatial dimension (i.e. $H \times W$) to a fixed size (i.e. $P_i \times P_i$). By employing linear spatial reduction attention, the FPT thus attains a linear computational and memory cost, comparable to a convolutional layer. With the feature maps produced at varying scales and channel dimensions across the stages, the model can enhance performance in downstream tasks.
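Below is a minimal sketch of the linear spatial reduction attention idea for one pyramid stage: keys and values are average-pooled to a fixed grid before standard multi-head attention, so the cost is linear in the number of spatial positions. The pooled size (7×7), head count, and channel width are assumptions; this is not the exact FPT implementation.

```python
# Sketch of linear spatial-reduction attention: pool K/V to a fixed P x P grid.
import torch
import torch.nn as nn

class LinearSRAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)           # reduce H x W to P x P
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) token sequence of one pyramid stage
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)            # back to a feature map
        kv = self.pool(kv).flatten(2).transpose(1, 2)         # (B, P*P, C) pooled tokens
        out, _ = self.attn(query=x, key=kv, value=kv)         # queries keep full resolution
        return out
```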

Following a pyramid structure, the output resolution of the four stages gradually decreases from high (4-stride) to low (32-stride). $P_i$ denotes the patch size of stage $i$. At the beginning of stage $i$, we evenly divide the input feature map $F_{i-1} \in \mathbb{R}^{H_{i-1} \times W_{i-1} \times C_{i-1}}$ into $\frac{H_{i-1} W_{i-1}}{P_i^{2}}$ patches. Afterward, each patch is flattened and projected onto a $C_i$-dimensional embedding. After the linear projection, the embedded patches have shape $\frac{H_{i-1}}{P_i} \times \frac{W_{i-1}}{P_i} \times C_i$, i.e. the height and width are $P_i$ times smaller than those of the input. To create spatial relations, positional embeddings are concatenated with the image features.
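A minimal sketch of one such patch embedding stage is given below. The strided convolution is a standard, equivalent way of "flattening each patch and projecting it"; the class name and parameter choices are ours, not the authors'.

```python
# Sketch of a pyramid-stage patch embedding: split the stage-(i-1) feature map
# into non-overlapping P_i x P_i patches and project each to C_i dimensions.
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_channels, embed_dim, patch_size):
        super().__init__()
        # A strided conv is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                        # (B, C_i, H_{i-1}/P_i, W_{i-1}/P_i)
        return x.flatten(2).transpose(1, 2)     # (B, H_{i-1}/P_i * W_{i-1}/P_i, C_i) tokens
```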

Each node is integrated with image features extracted from the backbone by adapting a deformable attention technique (Zhu et al. Citation2020). We generate sampling sites around an edge as well as their attention weights for image feature $f_{\mathrm{img}}$ aggregation at each level $l$ ($l = 1, \ldots, 3$) of the feature pyramid in the image encoder. We finally opted for an 8-way multi-head attention strategy.

3.2. Image feature fusion with enhanced segmentation refinement branch

To improve the accuracy of large-scale roof structure mapping, we include an additional semantic segmentation branch alongside the Transformer. In the building segmentation branch, we produce the binary semantic label from the building outline. After the backbone network, we convert the backbone features into the embedding features of the vertices and segmentation maps. These embedding features are then transformed to predict the vertex heatmap and the segmentation mask of the building polygons. For the mask branch, the shared feature maps extracted from the backbone network pass through two convolutional layers. The sigmoid activation function is used in the output layer to obtain the aided segmentation map.

We further introduce an attention feature fusion module (AFFM) that aggregates geometric primitive structure priors via the segmentation branch to enhance feature fusion across tasks (Figure 4). The module is constructed on a multi-scale channel attention module (MS-CAM). The central concept is that channel attention can be applied at multiple scales by adjusting the spatial pooling size. To maintain a lightweight model, we incorporate the local context into the global context within the attention module and add an efficient attention mechanism to enrich the fusion of mask and edge features at both global and local scales. A local channel context aggregator, the point-wise convolution (PWConv), relies only on point-wise channel interactions at each spatial position. Using the bottleneck structure below, we obtain the local channel context $L(X) \in \mathbb{R}^{C \times H \times W}$, hence conserving parameters:

$$L(X) = B\big(\mathrm{PWConv}(\delta(B(\mathrm{PWConv}(X))))\big) \tag{2}$$

where X refers to the input feature, B denotes Batch Normalization, and δ the Rectified Linear Unit. Given the local channel context L(X) and the global channel context G(X), the refined feature is obtained as:

$$X' = X \otimes M(X) = X \otimes \sigma\big(L(X) \oplus G(X)\big) \tag{3}$$

where M(X) indicates the attentional weights, ⊕ the broadcasting addition, σ the Sigmoid function, and ⊗ the element-wise multiplication.
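A minimal sketch of Equations (2) and (3) is given below: a point-wise-convolution bottleneck computes the local context L(X), global average pooling feeds the same bottleneck for the global context G(X), and the sigmoid of their broadcast sum re-weights the input. Channel size and reduction ratio are assumed values; the class and method names are ours.

```python
# Sketch of MS-CAM (Equations 2-3): local + global channel context attention.
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    def __init__(self, channels=256, r=4):
        super().__init__()
        mid = channels // r
        def bottleneck():
            # Equation (2): B(PWConv(delta(B(PWConv(X)))))
            return nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1),     # PWConv
                nn.BatchNorm2d(mid),
                nn.ReLU(inplace=True),                       # delta
                nn.Conv2d(mid, channels, kernel_size=1),     # PWConv
                nn.BatchNorm2d(channels),
            )
        self.local_ctx = bottleneck()                                            # L(X), keeps H x W
        self.global_ctx = nn.Sequential(nn.AdaptiveAvgPool2d(1), bottleneck())   # G(X), 1 x 1

    def weights(self, x):
        # M(X) = sigma(L(X) (+) G(X)), broadcast addition over spatial positions
        return torch.sigmoid(self.local_ctx(x) + self.global_ctx(x))

    def forward(self, x):
        return x * self.weights(x)    # Equation (3): X' = X (x) M(X)
```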

Figure 4. Illustration of the proposed AFFM. C and r denote the channel number and the channel reduction ratio, respectively. The refined feature X′ is enhanced by extracting the local channel context (L(X), blue box) and the global channel context (G(X), green box) in MS-CAM.

The embedding feature maps of the vertices and the segmentation map, X and Y, are considered next. The AFFM is expressed using the multi-scale channel attention module M:

$$Z = M(X \uplus Y) \otimes X + \big(1 - M(X \uplus Y)\big) \otimes Y \tag{4}$$

where ⊎ denotes the initial feature integration and Z the fused feature. For the sake of simplicity, this paper uses element-wise summation as the initial integration.
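A minimal sketch of Equation (4), reusing the MSCAM sketch above, is shown below; element-wise summation is used for the initial integration as stated in the text, while the module name and interface are assumptions.

```python
# Sketch of the AFFM fusion (Equation 4), building on the MSCAM class above.
class AFFM(nn.Module):
    def __init__(self, channels=256, r=4):
        super().__init__()
        self.ms_cam = MSCAM(channels, r)

    def forward(self, x, y):
        initial = x + y                          # X (union) Y as element-wise summation
        m = self.ms_cam.weights(initial)         # M(X (union) Y)
        return m * x + (1.0 - m) * y             # Equation (4): attention-weighted blend
```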

After fusing these features, the network can apply mask-level guidance to limit the scope of candidate primitives. Additionally, the AFFM is used to combine the edge candidates with the aided segmentation map in order to generate line proposals. When deciding whether or not to keep a primitive, the aided segmentation map is fused with the candidate primitives from the primitive detection branch, taking into account the primitive's location and state relative to other features in the image. Candidate primitives that fall within the range of the segmentation layer are activated, while the others are suppressed.

3.3. Edge filtering and structural reasoning

After integrating image and mask features with the edge nodes, we generate a fused feature by applying a conventional add-norm layer and a feed-forward network (FFN), as in the original Transformer. We then eliminate unsuitable candidates by passing the fused feature f through a 2-layer MLP followed by a sigmoid function to generate a probability score. The top-N candidates are kept, where N is three times the number of vertex candidates.
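A minimal sketch of this edge filtering step follows: a 2-layer MLP scores each fused edge feature and the top N = 3 × (number of vertex candidates) edges are retained. Tensor shapes and layer widths are assumptions consistent with the 256-dimensional features used above.

```python
# Sketch of edge filtering: 2-layer MLP score + top-N selection.
import torch
import torch.nn as nn

edge_scorer = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(inplace=True),
    nn.Linear(256, 1),
)

def filter_edges(fused, num_vertices):
    # fused: (E, 256) fused features of all candidate edges
    scores = torch.sigmoid(edge_scorer(fused)).squeeze(-1)   # (E,) confidence per edge
    k = min(3 * num_vertices, fused.shape[0])                # N = 3 x number of vertex candidates
    keep = torch.topk(scores, k).indices                     # indices of retained candidates
    return keep, scores
```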

Two weight-sharing Transformer decoders, namely an image-aware decoder and a geometry decoder, are used to categorize every edge candidate as either correct or not. Each edge candidate is modeled as a node and assigned the fused feature f by the decoder. Each edge candidate has one of three states: (T) a GT label is given as true; (F) a GT label is given as false; or (U) a GT label is unknown and the network needs to infer it. Concretely, we represent the state as a one-hot encoding vector, convert it to 256 dimensions with a linear layer, concatenate it with fcoord, and downscale it to 256 dimensions with another linear layer. The image-aware decoder comprises six layers, each consisting of a self-attention mechanism, an edge image feature fusion module, and a feed-forward network. Each layer uses an 8-way multi-head attention mechanism. At test time, we perform iterative label inference with the image-aware edge decoder. Edges with confidence larger than 0.5 are used to produce the final predictions. The geometry decoder has exactly the same architecture and shares the weights, but does not use image information. It enhances the global geometric reasoning and performance of the image-aware decoder. The geometry decoder is only used in training with the BCE loss, and only the image-aware decoder is used at test time. We use the same masked training and iterative inference as in the original work.
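The following is a minimal sketch of how an edge node's label state and coordinate feature are combined before entering the decoders, following the one-hot → 256-d → concatenate → 256-d recipe in the text; layer names are ours, not the authors'.

```python
# Sketch of edge-state encoding for the decoder inputs.
import torch
import torch.nn as nn

state_embed = nn.Linear(3, 256)      # one-hot state (T / F / U) -> 256-d
merge = nn.Linear(512, 256)          # concat(state, f_coord) -> 256-d decoder input

def edge_decoder_input(state_onehot, f_coord):
    # state_onehot: (E, 3) one-hot states; f_coord: (E, 256) positional features
    s = state_embed(state_onehot)
    return merge(torch.cat([s, f_coord], dim=-1))

# At test time (iterative inference): edges with confidence > 0.5 are fixed as true
# and the remaining "unknown" edges are re-classified in the next pass.
```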

4. Experiments

4.1. Dataset

We performed experiments on two datasets (Nauata and Furukawa Citation2020; Zhao, Persello, and Stein Citation2022) to verify the performance and robustness of our method. The images in the two datasets have different spatial resolutions and levels of scene complexity, and they cover vast areas in distinct locations.

  • Vectorizing world building dataset (VWB): The VWB data set, a subset of the SpaceNet public data set, has a spatial resolution of approximately 30 cm. The data set comprises 2D planar graphs of building roof structures, annotated for 1010, 670, and 321 buildings from Atlanta, Paris, and Las Vegas, respectively. Each image patch is cropped and resized to 256×256 pixels, containing a single building instance. The entire data set is divided into 1601 patches for training and 400 patches for testing.

  • The Enschede dataset (ENS): The dataset covers part of Enschede, the Netherlands (Zhao, Persello, and Stein Citation2022). The high-resolution true ortho imagery was generated by the Dutch Cadaster (Kadaster) using data obtained from the Beeldmateriaal initiative, with aerial images having a spatial resolution of 8 cm. The corresponding vectorized annotations were obtained from the publicly available BAG data set (see Note 1). The inner rooflines were manually labeled, and the rest of the dataset preparation was implemented automatically. The prepared dataset has 3648 image patches: 2924 patches of 512×512 pixels for training and 742 for testing.

4.2. Evaluation metrics

We applied two evaluation schemes: pixel-wise and vector-wise metrics. Specifically, we compute a heat map based average precision (APH) and an F1-score (FH) for each of the vertex and edge primitives. We also apply the mean structural Average Precision (msAP), a metric defined on vectorized wireframes, for both vertices (msAPV) and edges (msAPE) (Zhao, Persello, and Stein Citation2022). In addition, we report floating point operations (FLOPs) to measure the computational cost of each method.
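For reference, a minimal sketch of a pixel-wise heat-map F1-score is given below; the 0.5 binarization threshold is an assumption for illustration and not the paper's exact evaluation protocol.

```python
# Sketch of a pixel-wise heat-map F1-score for vertex or edge predictions.
import numpy as np

def heatmap_f1(pred, gt, threshold=0.5):
    """pred, gt: 2-D heat maps; returns the F1-score of the binarized prediction."""
    p = pred > threshold
    g = gt > 0.5
    tp = np.logical_and(p, g).sum()
    precision = tp / max(p.sum(), 1)
    recall = tp / max(g.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```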

Additionally, we employed K-fold cross-validation, a widely used technique in machine learning and data analysis. It provides a robust method for model evaluation, particularly when working with limited datasets. By using different validation data for each fold, the network's generalization ability is enhanced.

4.3. Experimental setup

The loss balancing weights for the three edge binary cross-entropy (BCE) losses all equal 1.0, while the vertex prediction BCE weight equals 0.05. The Adam optimizer is used to train our model. We set the initial learning rate to 2e-4 and the weight decay factor to 1e-5. For the last 25 epochs, the learning rate decays by a factor of 10. Our network is trained for 400 epochs. Roof-Former can be trained end to end without the requirement of a separate preparatory extraction phase. The PyTorch environment was used for all experiments. All training and testing were carried out on a single GTX 2080Ti GPU with 12 GB of memory.
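A minimal sketch of this optimization setup is shown below. The hyper-parameters (learning rate, weight decay, loss weights, 400 epochs, decay for the last 25 epochs) are taken from the text; the placeholder model and the exact scheduler choice are assumptions.

```python
# Sketch of the training configuration described above.
import torch

model = torch.nn.Linear(256, 1)   # placeholder standing in for the Roof-Former network

edge_bce_weight, vertex_bce_weight = 1.0, 0.05   # loss balancing weights from the text

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)
# Decay the learning rate by a factor of 10 for the last 25 of 400 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[375], gamma=0.1)

for epoch in range(400):
    # ... run one training epoch over the roof-structure dataset here ...
    scheduler.step()
```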

We compared our method against four competing methods: ConvMPN (Zhang, Nauata, and Furukawa Citation2020), HAWP (Xue et al. Citation2020), RSGNN (Zhao, Persello, and Stein Citation2022), and HEAT (J. Chen, Qian, and Furukawa Citation2022). Each model was trained and evaluated using the same split.

Furthermore, we set k = 5 for k-fold cross-validation. The dataset is divided into five equal-sized folds. In each iteration, one fold is held out as the validation set, while the remaining four folds are used for training the model. This process is repeated five times, with each fold serving as the validation set once, and the resulting metrics are recorded for each fold. Finally, the evaluation results of the five iterations are averaged, offering a comprehensive evaluation of the model's overall performance and generalization ability.
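The procedure can be sketched as below; the patch count is taken from the Enschede training split as an example, while the shuffling and random seed are assumptions.

```python
# Sketch of 5-fold cross-validation over the training patches.
import numpy as np
from sklearn.model_selection import KFold

patch_indices = np.arange(2924)                 # e.g. the Enschede training patches
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kfold.split(patch_indices)):
    print(f"fold {fold}: {len(train_idx)} training patches, {len(val_idx)} validation patches")
    # Train Roof-Former on train_idx and record vertex/edge F1 on val_idx here;
    # the per-fold metrics are then averaged over the five folds.
```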

5. Results and discussion

5.1. Quantitative analysis

Table 1 shows the experimental results. Roof-Former surpasses the competing methods on all precision and F1-scores. Specifically, compared to HEAT, our method greatly enhances the vertex and line segment outcomes: the vertex and edge heat map F1-scores rise by 2.0 and 1.9 points on the VWB dataset, respectively. The msAP for vertices and edges equals 43.1 and 42.4, respectively, which is higher than HEAT and the other methods. The results indicate that our method increases the accuracy of detecting geometric roof features.

Table 1. Quantitative results on the VWB dataset and the Enschede dataset are provided, where APH and FH represent the heatmap-based average precision and F1-score for both vertex and edge primitives.

Non-Transformer methods rely predominantly on image features and do not acquire global geometric reasoning across query nodes, resulting in a large number of false edges and building reconstructions that do not resemble buildings. Among the competing methods, HEAT achieves compelling F1-scores for the vertices and the edges. RSGNN exhibits higher edge metrics but poorer vertex precision compared to HEAT. HAWP and ConvMPN perform poorly in general. A potential reason is that these methods frequently generate an excessive number of extraneous vertices or entirely miss sections of the graphs. ConvMPN provides competitive results for the pixel-based vertex precision on the VWB dataset, but performs poorly on the other metrics. The performance gap is especially noticeable for edges, which involve high-level geometry reasoning.

As the 5-fold cross-validation results show, specific metrics improve compared to validation using the entire dataset. This improvement is attributed to the smaller training sets used in each fold. K-fold cross-validation leverages the available training data to assess accuracy by simulating independent datasets through sampling and applying the model to them. Consequently, accuracy may be overestimated with a smaller training set due to increased variance, whereas validation against the complete dataset yields more robust results. Our findings indicate that the training process, which involves fitting on randomly generated patches through extensive data augmentation and random cropping from a diverse database, is highly efficient for limited imaging data without signs of overfitting.

5.2. Qualitative analysis

Figure 5 provides qualitative comparisons. The first two rows depict the prediction results for the ENS dataset, whereas the last two rows show the prediction results for the VWB dataset. The last column of the figure shows the results predicted by Roof-Former. The reconstruction quality of Roof-Former is observed to be superior to that of the competing methods, even when dealing with massive and complex buildings, as evidenced by its results being closer to the label in the second column. Scrutinizing the structures carefully, Roof-Former is particularly effective at determining global information and maintains overall prediction consistency and geometric validity, e.g. producing fewer hanging edges and not being distracted by background buildings. The ConvMPN and HEAT algorithms exhibit limitations in establishing accurate geometric connections and ensuring complete detection of primitives. Given that our primary framework is built on HEAT, the results indicate that integrating the segmentation branch can enhance the effectiveness of roof structure extraction.

Figure 5. Sample results on the Enschede dataset (first two rows) and VWB datasets (the third and fourth row). Roof-Former has been found to produce better reconstruction results for massive and complex buildings, with its output closer to the label in the second column. In comparison, other methods such as ConvMPN and HEAT results contain geometric imperfections such as narrow triangles, self-intersecting edges, and colinear edges.


ConvMPN results contain narrow triangles, self-intersecting edges, and colinear edges. These shortcomings originate from its strategy of inferring the edges separately. Its edge metrics are significantly lower than ours, as our method entails further global structure reasoning. For methods like HAWP and RSGNN, the planar graphs are vulnerable to being disrupted by duplicate parts and missing crucial vertices. Our method is able to improve planar reconstruction accuracy by exploiting both image information and geometric patterns using an effective Transformer framework.

5.3. Ablation studies

We conducted multiple ablation experiments on the VWB validation set to demonstrate the efficacy of the proposed method. Based upon the baseline module HEAT, we added the different components progressively. Table 2 evaluates the three components on vertex and edge detection, reported as F1-scores. The metrics presented in the final row of the table demonstrate the performance of Roof-Former. The results confirm that the components contribute significantly to consistently improving the metrics.

Table 2. Ablation study for the components of Roof-Former, evaluated with vertex or edge F1 scores.

Results show that the FPT significantly reduces the computation overhead (GFLOPs) of the model by 39%, while maintaining detection accuracy. Results also show that the framework benefits from the segmentation refinement branch, resulting in more accurate predictions. Additionally, the performance improved by 0.6% and 0.5% when using AFFM. This indicates that the geometric structure learning process can be enhanced by effective segmentation and feature fusion.

5.4. Discussion

The improved Transformer-based method demonstrates effective global structural reasoning abilities, making it suitable for complex roof structure extraction tasks. Roof-Former enables accurate predictions by utilizing multi-scale encoded information and considering the global architecture while communicating with other entities. Additionally, it allows for the use of spatial data associated with nodes, and the added segmentation map reinforces the spatial structure of vertex and edge primitives.

Our method still has several shortcomings. First, it fails to recover from vertices overlooked by the vertex detector: absent vertices result in a missing incident graph structure or degraded geometry. Second, it adopts a piece-wise linear structure and cannot deal with curved buildings. Third, the effectiveness of the model may decrease when transferring it to oblique, relatively low-resolution satellite images, because the viewing angle and image quality make the relevant features and patterns harder to detect and analyze. Future research will explore generative adversarial networks (GANs) for instance data augmentation and more effective training processes to improve the proposed method.

In this study we did not incorporate the extraction of roofs of connected buildings. Preliminary tests indicated that current methods for extracting roof structures of multiple targets are insufficient. Therefore, large-scale multi-object roof structure mapping remains a challenging task. Unlike single-object roof structure extraction, multi-object mapping requires consideration of complex features of remote sensing imagery as well as the correspondence and relationship between basic geometric elements and complex building objects. Future research is necessary to explore the implementation of multi-task branches such as mask and orientation prediction, and to investigate the potential of the associative embedding technique for improving the accuracy and effectiveness of multi-object roof structure extraction.

The generated planar graph can more accurately depict the structural and topological relationships between buildings, hence facilitating subsequent tasks like 3D reconstruction and urban visualization. To illustrate this, we selected four extracted 2D roof structures from the Enschede dataset for testing. We obtained the height values of the roof primitives from the nDSM data and then performed the reconstruction with reference to Peters et al. (Citation2021). Sample results are shown in Figure 6. Based upon this, we see clear possibilities for extending Roof-Former to a wider range of modeling applications.

Figure 6. Sample reconstruction results of building objects based upon our extracted roof structure and nDSM.


6. Conclusion

This paper introduces an improved planar reconstruction method for vectorizing the 2D roof structure directly from a single image. The method is built upon HEAT and is enhanced by applying a feature pyramid Transformer and introducing a collaborative branch of semantic segmentation into primitive extraction. Compared with HEAT, the vertex and edge heat map F1-scores have increased by 2.0% and 1.9% on the VWB dataset, and by 1.5% and 0.7% on the Enschede dataset, respectively. Qualitative evaluations demonstrate that our method improves over existing state-of-the-art methods. We also conclude that applying 5-fold cross-validation yields more robust and reliable performance estimates while reducing the risk of overfitting. Moreover, high-quality vector annotations are necessary to train the keypoint detector effectively. Noisy annotations, such as misaligned or missing ones, can hinder the detection process, causing the omission of object corners during matching. Consequently, the network produces polygons that either lack certain corners or reject the entire object footprint. In the future, we will focus on improving keypoint detection. We will also continue to explore more efficient and effective training methods, such as the introduction of self-supervised learning and knowledge distillation.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The experiments conducted in this paper are based on two publicly available datasets, which can be accessed at Nauata and Furukawa (Citation2020) and Zhao, Persello, and Stein (Citation2022). Any inquiries regarding the datasets should be directed to the original authors.

Additional information

Funding

This work was supported by Foundation of Anhui Province Key Laboratory of Physical Geographic Environment, P.R. China [grant number 2022PGE012].

Notes

1 Key Register Addresses and Buildings https://www.pdok.nl

References

  • Biljecki, Filip, Hugo Ledoux, Jantien Stoter, and George Vosselman. 2016. “The Variants of an LOD of a 3D Building Model and Their Influence on Spatial Analyses.” ISPRS Journal of Photogrammetry and Remote Sensing 116:42–54. https://doi.org/10.1016/j.isprsjprs.2016.03.003.
  • Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. “End-to-End Object Detection with Transformers.” In European Conference on Computer Vision, 213–229. Springer.
  • Chen, Jiacheng, Yiming Qian, and Yasutaka Furukawa. 2022. “HEAT: Holistic Edge Attention Transformer for Structured Reconstruction.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3866–3875, New Orleans, Louisiana, USA.
  • Chen, Qi, Lei Wang, Steven L. Waslander, and Xiuguo Liu. 2020. “An End-to-End Shape Modeling Framework for Vectorized Building Outline Generation From Aerial Images.” ISPRS Journal of Photogrammetry and Remote Sensing 170:114–126. https://doi.org/10.1016/j.isprsjprs.2020.10.008.
  • Chen, Liang-Chieh, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. 2016. “Attention to Scale: Scale-Aware Semantic Image Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3640–3649, Las Vegas, Nevada, USA.
  • Esmaeily, Zahra, and Mehdi Rezaeian. 2023. “Building Roof Wireframe Extraction From Aerial Images Using a Three-Stream Deep Neural Network.” Journal of Electronic Imaging 32 (1): 013001–013001. https://doi.org/10.1117/1.JEI.32.1.013001.
  • Fu, Jun, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. “Dual Attention Network for Scene Segmentation.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3146–3154, Long Beach Convention & Entertainment Center, Los Angeles CA, USA.
  • Hu, Jie, Li Shen, and Gang Sun. 2018. “Squeeze-and-Excitation Networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141, Salt Lake City, UT, USA.
  • Ji, Shunping, Shiqing Wei, and Meng Lu. 2018. “Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set.” IEEE Transactions on Geoscience and Remote Sensing 57 (1): 574–586. https://doi.org/10.1109/TGRS.2018.2858817.
  • Konishi, Scott, Alan L. Yuille, James M. Coughlan, and Song Chun Zhu. 2003. “Statistical Edge Detection: Learning and Evaluating Edge Cues.” IEEE Transactions on Pattern Analysis and Machine Intelligence25 (1): 57–74. https://doi.org/10.1109/TPAMI.2003.1159946.
  • Le, Khoa N., Kishor P. Dabke, and Gregory K. Egan. 2006. “On Mathematical Derivations of Auto-Term Functions and Signal-to-Noise Ratios of the Choi–Williams, First-and Nth-order Hyperbolic Kernels.” Digital Signal Processing 16 (1): 84–104. https://doi.org/10.1016/j.dsp.2005.04.006.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. “Feature Pyramid Networks for Object Detection.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125, Honolulu, Hawaii, USA.
  • Lin, Moule, Weipeng Jing, Chao Li, and András Jung. 2023. “Optimized Vectorizing of Building Structures with Swap: High-Efficiency Convolutional Channel-Swap Hybridization Strategy.” arXiv preprint arXiv:2306.15035.
  • Lin, Yancong, Silvia L. Pintea, and Jan C. van Gemert. 2020. “Deep Hough-Transform Line Priors.” In European Conference on Computer Vision, 323–340. Springer.
  • Liu, Zhengyu, Qian Shi, and Jinpei Ou. 2022. “LCS: A Collaborative Optimization Framework of Vector Extraction and Semantic Segmentation for Building Extraction.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15.
  • Lu, Ziqiong, Linxi Huan, Qiyuan Ma, and Xianwei Zheng. 2023. “Holistic Geometric Feature Learning for Structured Reconstruction.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21807–21817, Paris, France.
  • Lv, Xianwei, Zhenfeng Shao, Xiao Huang, Wen Zhou, Dongping Ming, Jiaming Wang, and Chengzhuo Tong. 2022. “BTS: A Binary Tree Sampling Strategy for Object Identification Based on Deep Learning.” International Journal of Geographical Information Science 36 (4): 822–848. https://doi.org/10.1080/13658816.2021.1980883.
  • Lv, Xianwei, Zhenfeng Shao, Dongping Ming, Chunyuan Diao, Keqi Zhou, and Chengzhuo Tong. 2021. “Improved Object-Based Convolutional Neural Network (IOCNN) to Classify Very High-Resolution Remote Sensing Images.” International Journal of Remote Sensing 42 (21): 8318–8344. https://doi.org/10.1080/01431161.2021.1951879.
  • Nauata, Nelson, and Yasutaka Furukawa. 2020. “Vectorizing World Buildings: Planar Graph Reconstruction by Primitive Detection and Relationship Inference.” In European Conference on Computer Vision, 711–726. Springer.
  • Peters, Ravi, Balázs Dukai, Stelios Vitalis, Jordi van Liempt, and Jantien Stoter. 2021. “Automated 3D Reconstruction of LoD2 and LoD1 Models for all 10 Million Buildings of the Netherlands.” arXiv preprint arXiv:2201.01191.
  • Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention is All You Need.” Advances in Neural Information Processing Systems 30.
  • Von Gioi, Rafael Grompone, Jeremie Jakubowicz, Jean-Michel Morel, and Gregory Randall. 2008. “LSD: A Fast Line Segment Detector with a False Detection Control.” IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (4): 722–732. https://doi.org/10.1109/TPAMI.2008.300.
  • Wang, Jianing, Hanjiang Xiong, Jianya Gong, and Xianwei Zheng. 2021. “Structured Building Extraction from High-Resolution Satellite Images with a Hybrid Convolutional Neural Network.” In 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, 2417–2420. IEEE.
  • Xiong, Biao, S. Oude Elberink, and G. Vosselman. 2014. “A Graph Edit Dictionary for Correcting Errors in Roof Topology Graphs Reconstructed From Point Clouds.” ISPRS Journal of Photogrammetry and Remote Sensing 93:227–242. https://doi.org/10.1016/j.isprsjprs.2014.01.007.
  • Xu, Yifan, Weijian Xu, David Cheung, and Zhuowen Tu. 2021. “Line Segment Detection Using Transformers Without Edges.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4257–4266, (Virtual).
  • Xue, Nan, Tianfu Wu, Song Bai, Fudong Wang, Gui-Song Xia, Liangpei Zhang, and Philip H. S. Torr. 2020. “Holistically-Attracted Wireframe Parsing.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2788–2797.
  • Zhang, Fuyang, Nelson Nauata, and Yasutaka Furukawa. 2020. “Conv-mpn: Convolutional Message Passing Neural Network for Structured Outdoor Architecture Reconstruction.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2798–2807, (Virtual).
  • Zhao, Wufan, Claudio Persello, and Alfred Stein. 2021. “Building Outline Delineation: From Aerial Images to Polygons with an Improved End-to-End Learning Framework.” ISPRS Journal of Photogrammetry and Remote Sensing 175:119–131. https://doi.org/10.1016/j.isprsjprs.2021.02.014.
  • Zhao, Wufan, Claudio Persello, and Alfred Stein. 2022. “Extracting Planar Roof Structures From Very High Resolution Images Using Graph Neural Networks.” ISPRS Journal of Photogrammetry and Remote Sensing 187:34–45. https://doi.org/10.1016/j.isprsjprs.2022.02.022.
  • Zhao, Wufan, Claudio Persello, and Alfred Stein. 2023. “Semantic-Aware Unsupervised Domain Adaptation for Height Estimation From Single-view Aerial Images.” ISPRS Journal of Photogrammetry and Remote Sensing 196:372–385. https://doi.org/10.1016/j.isprsjprs.2023.01.003.
  • Zhou, Wen, Dongping Ming, Xianwei Lv, Keqi Zhou, Hanqing Bao, and Zhaoli Hong. 2020. “SO–CNN Based Urban Functional Zone Fine Division with VHR Remote Sensing Image.” Remote Sensing of Environment 236:111458. https://doi.org/10.1016/j.rse.2019.111458.
  • Zhou, Yichao, Haozhi Qi, and Yi Ma. 2019. “End-to-End Wireframe Parsing.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 962–971, Seoul, South Korea.
  • Zhu, Xizhou, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. “Deformable Detr: Deformable Transformers for End-to-End Object Detection.” arXiv preprint arXiv:2010.04159.
  • Zou, Chuhang, Alex Colburn, Qi Shan, and Derek Hoiem. 2018. “Layoutnet: Reconstructing the 3D Room Layout from a Single RGB Image.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2051–2059, Salt Lake City, UT, USA.