
Improvement of automatic building region extraction based on deep neural network segmentation

Pages 393-408 | Received 24 Aug 2022, Accepted 20 Mar 2023, Published online: 06 Apr 2023

ABSTRACT

This work seeks to improve the accuracy of building region extraction, in which each pixel in a scenery image is determined to be part of a building or part of the background. Specifically, UNet++ and MANet, which are state-of-the-art deep neural networks (DNNs) for segmentation, were applied to building extraction. Our experiment using 105 scenery images in the Zurich Buildings Database (ZuBuD) showed that these networks significantly improved the F-measure by at least 1.67% as compared with conventional building extraction. To address the shortcomings of segmentation networks, we also developed a method based on refinement of the building region extracted by a segmentation network. The proposed method demonstrated its effectiveness by significantly increasing the F-measure by at least 1.15%. Overall, the F-measure was improved by 3.58% as compared with conventional building extraction.

1. Introduction

A smartphone user's current position can be obtained by using a global navigation satellite system (GNSS) with a map application, thus enabling users to reach their destinations efficiently (Zangenehnejad & Gao, Citation2021). This navigation approach is useful for users to check whether they are on the right path to a destination. However, its performance is degraded by blockage or reflection of satellite signals by high-rise buildings (Tang et al., Citation2021). To overcome this issue, imaging-based positioning systems have been developed, as discussed in Fond et al. (Citation2021) and Toft et al. (Citation2020).

The systems proposed in those works use scenery images, whose main subjects are buildings, to measure the current position through building identification. Hence, the building identification accuracy will need to be improved to increase the measurement accuracy of those systems. One technique for this purpose is building extraction, in which the image pixels in a building's region are extracted and those in background regions are removed, as suggested in Fang et al. (Citation2019) and Futagami et al. (Citation2020). Such building extraction is also required for building surface inspection and 3D building reconstruction applications, as mentioned in Futagami et al. (Citation2020) and Ueno et al. (Citation2016).

The major techniques for building extraction are based on segmentation networks, which have been enthusiastically studied with the advent of deep learning technology (Femiani et al., Citation2018; Futagami et al., Citation2020). In particular, networks with an encoder-decoder structure have been applied to building extraction. One frequently used network is SegNet (Badrinarayanan et al., Citation2017), which has outperformed other networks such as UNet (Ronneberger et al., Citation2015) and fully convolutional networks (FCNs) (Long et al., Citation2015) on the task of building extraction, as discussed in Femiani et al. (Citation2018). The usual segmentation networks are trained on a building image dataset with pixel-wise annotation, which perfectly indicates the building and background regions and is generated through manual interaction. Because of this requirement, the cost of deploying accurate networks (Mo et al., Citation2022) for building extraction has increased. Hence, there is a need for low-cost yet highly reliable building extraction.

To reduce the deployment cost for segmentation networks, a small, open image dataset, CamVid (Brostow et al., Citation2008), was used in Futagami et al. (Citation2020). In addition, transfer learning (Ribani & Marengoni, Citation2019), which helps to train networks with a limited amount of data, has been applied to segmentation networks. However, the experiment in Iwai et al. (Citation2020) demonstrated that the deployed segmentation network underperformed building extraction by a clustering-based method, which does not require a building image dataset with pixel-wise annotation. Instead, the clustering-based method's algorithm leverages knowledge of the building regions in scenery images. An example of such knowledge is that the building region is typically at the centre of an image, because the building is the image's main subject. The results in Iwai et al. (Citation2020) suggest that segmentation networks cannot automatically find features that are as effective as those obtained by leveraging knowledge from a limited amount of training data. Thus, such knowledge is expected to be effective when a limited amount of training data is available.

In addition, networks based on a limited amount of training data have been reported to have lower accuracy than semi-automatic building extraction, which uses a simple manual input such as a bounding box for a building (Iwai et al., Citation2020). Because networks based on rich training data have human-level accuracy (Men et al., Citation2021), building extraction based on a limited amount of training data will need to exceed the performance of semi-automatic building extraction as a first step toward practical use.

Hence, this work aims to develop low-cost, accurate building extraction based on segmentation networks. To this end, we applied state-of-the-art segmentation networks to the task of building extraction. In addition, we propose a knowledge-based method that addresses the shortcomings of segmentation networks in building extraction, which we analyze in detail in this work.

The contributions of this paper are summarized as follows.

(1)

We apply state-of-the-art segmentation networks, which have shown improved accuracy in other fields or tasks, to building extraction.

(2)

We propose a method to deal with the observed shortcomings of segmentation networks for building extraction. The proposed method's algorithm is mainly designed to decrease the erroneous determination of parts of the actual building region as part of the background region.

The structure of this paper is as follows. In Section 2, we describe the related work on building extraction. Then, in Section 3, we outline the algorithm for the proposed method. In Section 4, we describe an experiment to evaluate our method on building extraction, before concluding in Section 5 with a summary.

2. Related work

This section outlines the related work on building extraction, which is applied to images like those shown in Figure 1. Sections 2.1 and 2.2 outline the clustering-based method and segmentation networks, respectively. Then, Section 2.3 analyzes the shortcomings of segmentation networks.

Figure 1. Examples of scenery images containing buildings.

2.1. Clustering-based method

The algorithm of the clustering-based method is based on the knowledge that a building is located at an image's centre. In the first step, colour segmentation is applied to generate colour clusters. Figure 2(b) shows the differently coloured clusters computed from the image shown in Figure 2(a).

Figure 2. Process of the clustering-based method. (a) Input image. (b) Colour clusters. (c) Candidates. (d) Extracted building.

In the second step, each colour cluster is analyzed to detect those at the top and bottom of the image. The pixels in those clusters are specified as background candidates on the basis of their coordinates. The pink and green areas in Figure 2(c) are the respective candidates for the background and the building, as detected from Figure 2(b).

In the third step, GrabCut (Rother et al., Citation2004), which can refine the shapes of both the building and background regions according to colour similarity, is used to create the final output. GrabCut iteratively applies the max-flow min-cut algorithm (Boykov & Kolmogorov, Citation2004), which is based on graph theory, by modelling each region's colour distribution. The initial classification of each region, which is required for running GrabCut, is based on the candidates obtained in the second step. Figure 2(d) depicts the building extracted by applying GrabCut with initialization on the candidates shown in Figure 2(c).
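As a minimal Python sketch of this third step, OpenCV's GrabCut implementation can be initialized with a mask built from the two candidate sets. The mapping of candidates to definite and probable labels, and the iteration count, are assumptions for illustration rather than the settings used in the original work.

```python
import cv2
import numpy as np

def refine_with_grabcut(image_bgr, building_candidate, background_candidate, iters=5):
    """Refine candidate regions with mask-initialized GrabCut (illustrative sketch)."""
    # Start from "probably background" and overwrite with the two candidate sets.
    mask = np.full(image_bgr.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)
    mask[building_candidate] = cv2.GC_PR_FGD    # green areas: probable building
    mask[background_candidate] = cv2.GC_BGD     # pink areas: background candidates
    bgd_model = np.zeros((1, 65), np.float64)   # internal GMM state used by OpenCV
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)
    # Pixels labelled definite or probable foreground form the extracted building.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```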

2.2. Segmentation networks

This subsection describes SegNet, UNet++ (Zhou et al., Citation2019), MANet (Fan et al., Citation2020), VGGNet (Simonyan & Zisserman, Citation2014), and EfficientNet (Tan & Le, Citation2019, Citation2021). The first three are reliable segmentation networks, and the latter two are encoders; all have been extensively applied to various tasks.

2.2.1. SegNet

For building extraction, SegNet has traditionally been used in the literature because of its high accuracy as compared with other networks like UNet and FCN. SegNet has an encoder-decoder structure based on downsampling and upsampling layers. The upsampling layer uses max-pooling indices, which capture the location of the maximum feature value at the downsampling layer.
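The following PyTorch fragment, included only as an illustrative sketch and not as the authors' implementation, shows how max-pooling indices recorded at a downsampling layer can drive the corresponding upsampling layer.

```python
import torch
import torch.nn as nn

# The encoder's pooling layer records where each maximum came from, and the
# decoder's unpooling layer places values back at those locations.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

features = torch.randn(1, 64, 128, 128)       # an encoder feature map
pooled, indices = pool(features)              # downsampling, indices saved
restored = unpool(pooled, indices)            # upsampling guided by the indices
print(pooled.shape, restored.shape)           # (1, 64, 64, 64) and (1, 64, 128, 128)
```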

2.2.2. UNet++

UNet++ also has an encoder-decoder structure but is based on a connection scheme in which low-level feature maps from the encoder are combined with high-level feature maps from the decoder. Compared with the original UNet, UNet++ has a more complicated connection, such that low-level feature maps from different sampling scales are densely combined.

2.2.3. MANet

MANet also has an encoder-decoder structure and uses connection like UNet, in that feature maps extracted from the encoder and decoder are combined at the same sampling scale. However, this network also contains a position-wise attention block (PAB) and a multi-scale fusion attention block (MFAB). The PAB can weight feature maps on the basis of an attention map, which indicates the informative positions in the feature map. The MFAB can weight each channel of the feature map through two squeeze-and-excitation blocks.

2.2.4. VGGNet

The Visual Geometry Group network (VGGNet) is an extensively used encoder that is mainly based on convolutional layers, each of which is followed by a rectified linear unit (ReLU) layer. The ReLU layer, which includes an activation function, applies a thresholding operation to each value in the feature maps. Because of its reliability, VGGNet is traditionally used for various networks as a baseline to evaluate efficiency.

2.2.5. EfficientNet

EfficientNet is an encoder that is mainly based on mobile inverted bottleneck convolution (MBConv) layers, and it has been applied in various fields and image segmentation tasks. MBConv forms a skip connection that combines the feature maps created before and after convolution. It also includes a squeeze-and-excitation layer to weight the feature map in each channel.
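To illustrate the channel-weighting idea shared by the squeeze-and-excitation layer in MBConv and the MFAB described above, a generic squeeze-and-excitation block can be sketched in PyTorch as follows; it is not the exact block used in either network.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Generic squeeze-and-excitation block (illustrative sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # squeeze: per-channel statistics
        return x * weights.view(b, c, 1, 1)     # excitation: reweight each channel

features = torch.randn(2, 64, 32, 32)
print(SqueezeExcitation(64)(features).shape)    # torch.Size([2, 64, 32, 32])
```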

2.3. Shortcomings of segmentation networks

In Femiani et al. (Citation2018), it was reported that SegNet outperformed other segmentation networks like FCN and UNet for building extraction. On that basis, in Futagami et al. (Citation2020), SegNet was trained on CamVid to implement low-cost building extraction. To achieve high accuracy, VGGNet was trained on the ImageNet open dataset (Deng et al., Citation2009) (containing more than 1.2 million images in 1000 categories) and used as the encoder for SegNet.

However, as reported in Futagami et al. (Citation2020), the accuracy of SegNet was lower than that of the clustering-based method. Figure 3(a) depicts the building that SegNet extracted from the image in Figure 2(a). This figure suggests that further improvement is required. Hence, we applied recently developed, state-of-the-art segmentation networks, i.e. UNet++ and MANet. Figure 3(b,c) depict the buildings extracted by UNet++ and MANet, respectively, where the encoder in each case was EfficientNet with transfer learning from ImageNet. Although these images suggest that UNet++ and MANet both tended to decrease the incorrect determination of pixels in the actual building regions as background pixels, further improvement is required for the use of segmentation networks in building extraction.

Figure 3. Building extracted by three segmentation networks. (a) SegNet. (b) UNet++. (c) MANet.

Figure 3 implies that one of the shortcomings of the segmentation networks is the unclear boundary area between the building and background. In fact, we have knowledge that the contour structure of the actual building is not complicated, unlike that of the extracted building. In addition, we can see that certain inner areas of the actual building region are erroneously determined as the background. Although some buildings are located behind objects such as trees, we also have knowledge that inner areas with a similar colour to the surrounding areas tend to be actual building regions. Hence, on the basis of the above knowledge, we propose a method that can overcome these shortcomings, which are frequently observed for various images.

3. Proposed method

This section describes our proposed method, based on segmentation networks, to address the shortcomings discussed above. A flowchart of the proposed method is shown in Figure 4. As mentioned above, we exploited knowledge-based features of scenery images and building regions to design our method's algorithm. The problem of erroneous determination of background regions is overcome by enlarging the building candidate region. In addition, the problem of unclear boundaries is overcome by applying GrabCut, which can refine the building's contour structure.

Figure 4. Flowchart of the proposed method.

The proposed method's basic concept derives from a coarse-to-fine framework that is also used in Wang et al. (Citation2018). In that work, salient objects (not buildings) were finely extracted from coarsely extracted objects. Our method here also extracts fine building regions from the coarse building region provided by a segmentation network. The following subsections explain each procedure.

3.1. Determine building candidate region

In the first step, the segmentation network processes the input image. However, the extracted building regions must be improved because of the shortcomings described in Section 2.3.

Hence, in the second step, the connected components of the extracted building regions are identified by applying the reliable Scan Array Union Find (SAUF) algorithm (Wu et al., Citation2009). Because the building is assumed to be the image's main subject, the connected component that includes the actual building regions tends to be larger than the connected components that include background regions. Thus, the largest connected component is determined as the initial building candidate. The sizes of the connected components are computed on the basis of the number of pixels in each component.
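A short Python sketch of this second step is given below; OpenCV's connected-components labelling is used here as a stand-in (it offers SAUF among its labelling algorithms), and the implementation details are assumptions rather than the authors' exact code.

```python
import cv2
import numpy as np

def largest_building_component(building_mask):
    """Keep only the largest connected component of a binary building mask."""
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(
        building_mask.astype(np.uint8), connectivity=8)
    if num_labels <= 1:                          # label 0 is the background
        return np.zeros_like(building_mask, dtype=bool)
    areas = stats[1:, cv2.CC_STAT_AREA]          # pixel count of each component
    largest = 1 + int(np.argmax(areas))          # offset past the background label
    return labels == largest                     # initial building candidate
```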

Figure 5(a) shows the connected components (in different colours) computed from the extraction result in Figure 3(b). The black areas are groups of pixels that were not determined to be in the building by the segmentation network. The pink area represents the largest connected component, which is thus determined as the main subject. Figure 5(b) depicts the initial building candidate region obtained from Figure 5(a).

Figure 5. Illustration of the proposed method's procedure. (a) Connected components. (b) Initial building candidate. (c) Convex hulls. (d) Initial classification. (e) Extracted building.

3.2. Enlarge building candidate region

As depicted in Figure 5(b), the initial building candidate still lacks inner areas that belong to the actual building region, because the segmentation network erroneously determined them as background. Thus, this procedure enlarges the initial building candidate to recover these areas.

In the first step, a convex hull, which is the smallest convex polygon containing the entire initial building candidate, is computed. For this computation, we use the algorithm by Sklansky (Citation1972) because of its reliability. The pixels in the convex hull are then determined as the updated building candidate. The green lines in Figure 5(c) represent the convex hull of the initial building candidate region in Figure 5(b).
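A minimal Python sketch of this first step is shown below; it relies on OpenCV's convexHull, which implements Sklansky's algorithm, and fills the hull to obtain the updated candidate.

```python
import cv2
import numpy as np

def convex_hull_candidate(initial_candidate):
    """Replace the initial candidate with its filled convex hull (sketch)."""
    points = cv2.findNonZero(initial_candidate.astype(np.uint8))  # (x, y) pixels
    hull = cv2.convexHull(points)                                  # hull vertices
    updated = np.zeros(initial_candidate.shape, dtype=np.uint8)
    cv2.fillConvexPoly(updated, hull, 1)                           # fill the interior
    return updated.astype(bool)                                    # updated candidate
```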

In the second step, pixels in the boundary areas around the updated building candidate are additionally specified as part of the final building candidate. Each pixel, with coordinate \(p\), that is not in the updated building candidate is assessed via its distance from the updated building candidate. This distance \(d\) is computed as follows:
\[ d = \min_{n} \lVert p, q_n \rVert_h, \tag{1} \]
where \(q_n \in \{q_0, \ldots, q_{N-1}\}\) denotes the coordinate of each pixel forming the contour structure of the updated building candidate, which comprises \(N\) pixels. Here, \(\lVert p, q_n \rVert_h\) denotes a spatial distance in a normed vector space (the Minkowski distance), whose particular definition is determined by a characteristic parameter \(h\). For example, the Manhattan distance and the Euclidean distance are used for the cases of \(h = 1\) and \(h = 2\), respectively, and the Chebyshev distance is used for the case of \(h = \infty\).

Pixels with a small distance \(d\) tend to be close to the contour of the updated building candidate. Thus, the pixels for which \(d\) is smaller than a specific value are additionally specified as part of the final building candidate. This threshold value \(\bar{d}\) is given as follows:
\[ \bar{d} = \alpha \frac{1}{N} \sum_{n} \lVert \bar{b}, q_n \rVert_h, \tag{2} \]
where \(\alpha\) is a coefficient to control the threshold value, and \(\bar{b}\) denotes the coordinate of the updated building candidate region's centroid, which is computed as follows:
\[ \bar{b} = \frac{\sum_{b \in B} b}{|B|}, \tag{3} \]
where \(B\) denotes the set of pixel coordinates in the updated building candidate. Hence, pixels with distance \(d\) smaller than \(\bar{d}\) are included in the final building candidate.

Our implementation uses the Chebyshev distance (i.e. \(h = \infty\) in Equations (1) and (2)). In computing Equation (1) for each pixel, eight-neighbour dilation from the updated building candidate is applied \(\bar{d}\) times to simplify and accelerate the procedure. The interior of the red lines in Figure 5(c) represents the final building candidate, as obtained by applying the eight-neighbour dilation to the interior of the green lines.
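The sketch below illustrates this second step in Python under the stated Chebyshev-distance setting; the value of alpha is a placeholder, not the experimentally determined coefficient.

```python
import cv2
import numpy as np

def enlarge_candidate(updated_candidate, alpha=0.1):
    """Enlarge the updated candidate by iterative eight-neighbour dilation (sketch)."""
    mask_u8 = updated_candidate.astype(np.uint8)
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = np.vstack(contours).reshape(-1, 2)        # contour pixels q_n as (x, y)
    ys, xs = np.nonzero(mask_u8)
    centroid = np.array([xs.mean(), ys.mean()])         # Equation (3)
    # Equation (2): alpha times the mean Chebyshev distance from the centroid
    # to the contour pixels.
    d_bar = int(round(alpha * np.abs(contour - centroid).max(axis=1).mean()))
    kernel = np.ones((3, 3), np.uint8)                  # eight-neighbour structuring element
    # Dilating d_bar times keeps every outside pixel whose Chebyshev distance to
    # the candidate is at most d_bar, i.e. Equation (1) with h = infinity.
    return cv2.dilate(mask_u8, kernel, iterations=max(d_bar, 1)).astype(bool)
```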

3.3. Apply GrabCut

In this procedure of applying GrabCut, the initial classification is based on the final building candidate. The enlargement procedure above is meant to ensure that no pixel of the actual building region is left outside the building candidate. Thus, all pixels that are not included in the final building candidate should be in the actual background regions. Accordingly, GrabCut is set to keep the pixels outside the final building candidate fixed as the background region. In contrast, the pixels included in the final building candidate are allowed to be determined as either the building or the background region.
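In Python, this initialization can be sketched as follows, mirroring the earlier GrabCut sketch; the iteration count is again an assumed value.

```python
import cv2
import numpy as np

def extract_building(image_bgr, final_candidate, iters=5):
    """Apply GrabCut with the final building candidate as the initial classification."""
    # Outside the candidate: definite background (never relabelled as building).
    # Inside the candidate: probable building (may still become background).
    mask = np.where(final_candidate, cv2.GC_PR_FGD, cv2.GC_BGD).astype(np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))   # extracted building region
```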

Figure 5(d) depicts GrabCut's initial classification, in which the pink and green areas represent the background and building, respectively. The extracted building is depicted in Figure 5(e). In other words, the proposed method transforms the coarse building region shown in Figure 5(d) into the finely extracted building shown in Figure 5(e).

Figure 6 compares the extraction results shown in Figures 3(b) and 5(e), which were obtained by (a) UNet++ and (b) our proposed method, respectively. The red and green pixels represent incorrectly determined parts of the background and building regions, respectively. As seen in Figure 6, the numbers of red and green pixels were markedly decreased by applying the proposed method. This comparison suggests that our method could address the shortcomings of the segmentation network for this image. In the next section, we thoroughly evaluate the proposed method's effectiveness.

Figure 6. Comparison of extraction results. (a) UNet++. (b) Proposed method.

4. Experiment

This section describes our quantitative evaluation to demonstrate the effectiveness of the proposed building extraction method. After describing the experimental conditions and results in Sections 4.1 and 4.2, respectively, we discuss the results in Section 4.3.

4.1. Experimental conditions

In this subsection, we outline the dataset and metrics that we used, and then we summarize our implementations of the methods for evaluation.

4.1.1. Dataset

We used 105 scenery images that were randomly selected from the Zurich Buildings Database (ZuBuD) (Shao et al., Citation2003), as in Futagami et al. (Citation2020) and Ueno et al. (Citation2016). Examples of these images were shown in Figure 1. Note that we applied a rotation operation to certain images so that the buildings were oriented vertically. To facilitate evaluation, we manually generated ground-truth images that perfectly indicated the actual building and background regions.

4.1.2. Metrics

The accuracy of each building extraction method was computed in terms of the precision, recall, and F-measure, which are defined as follows:
\[ \mathrm{precision} = \frac{\mathrm{true\ positive}}{\mathrm{true\ positive} + \mathrm{false\ positive}}, \tag{4} \]
\[ \mathrm{recall} = \frac{\mathrm{true\ positive}}{\mathrm{true\ positive} + \mathrm{false\ negative}}, \tag{5} \]
\[ F\textrm{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \tag{6} \]
Here, the values for true and false positives respectively indicate the numbers of pixels that were correctly and incorrectly determined to be in the building region. Likewise, the value for false negatives indicates the number of pixels that were incorrectly determined to be in the background region. The precision is decreased by incorrectly determining actual background pixels as building pixels, whereas the recall is decreased by incorrectly determining actual building pixels as background pixels. As a result, there is an unavoidable tradeoff between the precision and recall. Thus, we used the F-measure, which is the harmonic mean of the precision and recall, as the most comprehensive, discriminative metric to evaluate each method.
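As a minimal Python sketch, the three metrics can be computed pixel-wise from a predicted mask and a ground-truth mask as follows; both masks are assumed to mark building pixels with True.

```python
import numpy as np

def extraction_metrics(predicted, ground_truth):
    """Pixel-wise precision, recall, and F-measure for binary building masks."""
    tp = np.count_nonzero(predicted & ground_truth)    # building correctly extracted
    fp = np.count_nonzero(predicted & ~ground_truth)   # background labelled as building
    fn = np.count_nonzero(~predicted & ground_truth)   # building labelled as background
    precision = tp / (tp + fp)                         # Equation (4)
    recall = tp / (tp + fn)                            # Equation (5)
    f_measure = 2 * precision * recall / (precision + recall)   # Equation (6)
    return precision, recall, f_measure
```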

4.1.3. Implementations

For this evaluation, we implemented the clustering-based method, SegNet, UNet++, MANet, the proposed method, and interactive GrabCut. Each of the existing methods was implemented mainly according to the original work. All methods ran on a local computer with an Intel Core i9-10900X CPU @ 3.70 GHz, 96 GB of memory, and an NVIDIA GeForce RTX 3060 graphics card with 12 GB of VRAM.

In the clustering-based method, a variational Bayesian Gaussian mixture model (VBGMM) was used for colour segmentation on the basis of an experiment reported in Futagami et al. (Citation2020). The hue, saturation, value (HSV) colour space was also used for colour representation, and the input image's resolution was reduced during colour segmentation.
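A possible Python sketch of this colour segmentation step is shown below using scikit-learn's variational Bayesian Gaussian mixture model; the maximum number of components and the downscaling factor are assumed placeholder values, not the settings of the original experiment.

```python
import cv2
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def colour_clusters(image_bgr, max_clusters=10, scale=0.25):
    """Cluster pixel colours with a VBGMM in HSV space (illustrative sketch)."""
    small = cv2.resize(image_bgr, None, fx=scale, fy=scale)   # reduce resolution
    hsv = cv2.cvtColor(small, cv2.COLOR_BGR2HSV)
    pixels = hsv.reshape(-1, 3).astype(np.float64)
    vbgmm = BayesianGaussianMixture(n_components=max_clusters, random_state=0)
    labels = vbgmm.fit_predict(pixels)          # unneeded components get near-zero weight
    return labels.reshape(hsv.shape[:2])        # per-pixel colour-cluster indices
```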

For the segmentation networks, the training conditions were based on a previous work (Futagami et al., Citation2020). The encoder for SegNet was VGG-16, which is a type of VGGNet, and the encoder for both UNet++ and MANet was EfficientNetV2-L, which is a type of EfficientNet. Both encoders were transfer learned from ImageNet, and the decoder was trained on the 367 training images in CamVid. As discussed in Futagami et al. (Citation2020), the use of ImageNet and CamVid can reduce the deployment cost because of their public availability.
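For reference, the two state-of-the-art networks can be instantiated with the segmentation_models_pytorch library roughly as follows; the encoder identifier is an assumption (the exact name available for EfficientNetV2-L depends on the library version), and the single-class output and sigmoid activation are also assumptions rather than the authors' configuration.

```python
import segmentation_models_pytorch as smp

# Sketch of the evaluated networks; not the authors' exact configuration.
unetpp = smp.UnetPlusPlus(
    encoder_name="tu-tf_efficientnetv2_l",   # assumed name of an EfficientNetV2-L encoder
    encoder_weights="imagenet",              # transfer learning from ImageNet
    classes=1,                               # building vs. background
    activation="sigmoid",
)
manet = smp.MAnet(
    encoder_name="tu-tf_efficientnetv2_l",
    encoder_weights="imagenet",
    classes=1,
    activation="sigmoid",
)
# The decoders would then be trained on the 367 CamVid training images.
```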

In the proposed method, SegNet, UNet++, and MANet were each used as its segmentation network. These networks were implemented under the conditions described above, while the other parameters were determined experimentally. Further details were given in Section 3.

Lastly, interactive GrabCut, which was also evaluated in Futagami et al. (Citation2020), was initialized by using each building's manually provided bounding box. Unlike the other methods, interactive GrabCut is not an automatic method, but we included it for comparison with the proposed method because of its effectiveness. Note that a building's bounding box is based on the actual building region in a ground-truth image. In Figure 7(a), the green lines depict the bounding box, and Figure 7(b) shows the building extracted via interactive GrabCut.

Figure 7. Illustration of interactive GrabCut. (a) Bounding box. (b) Extracted building.

4.2. Experimental results

We compared the accuracies of each method at two stages to clarify our contributions as described in Section 1. At the first stage, the clustering-based method, SegNet, UNet++, and MANet were compared to investigate the effectiveness of the state-of-the-art segmentation networks. At the second stage, SegNet, UNet++, MANet, the proposed method, and interactive GrabCut were compared to investigate our method's effectiveness.

We used the Wilcoxon signed-rank test, which is a nonparametric paired statistical test, to find significant differences in the extraction accuracy at a significance level of 5%. We followed this approach because all the data analyzed in this experiment exhibited non-normality according to a Shapiro-Wilk test at the 5% significance level.
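The testing procedure can be sketched in Python with SciPy as follows; the per-image F-measure arrays are hypothetical placeholders, not the experimental values.

```python
import numpy as np
from scipy.stats import shapiro, wilcoxon

f_baseline = np.random.rand(105)      # placeholder per-image F-measures (method A)
f_proposed = np.random.rand(105)      # placeholder per-image F-measures (method B)

differences = f_proposed - f_baseline
_, p_normality = shapiro(differences)               # Shapiro-Wilk normality test
if p_normality < 0.05:                              # differences are non-normal
    _, p_value = wilcoxon(f_proposed, f_baseline)   # paired, nonparametric test
    significant = p_value < 0.05                    # 5% significance level
```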

Table 1 lists the extraction accuracies of each method. At the first stage, SegNet significantly decreased the F-measure by 1.42% as compared with the clustering-based method. In contrast, UNet++ and MANet significantly increased the F-measure by at least 1.67%. As compared with SegNet, UNet++ and MANet decreased the precision by at most 0.41%, but these differences were not significant. These results confirm the effectiveness of these state-of-the-art segmentation networks as compared with the conventional building extraction methods developed in the literature.

Table 1. Comparison of the extraction accuracies with each method.

In comparison with MANet, UNet++ increased the F-measure by 0.76% (not a significant difference), because the recall significantly increased by 1.57%. This result may derive from the fact that the dense connection of UNet++, as explained in Section 2.2.2, provides more flexible feature fusion at the decoder than MANet does. However, the images shown in Figure 8, which were obtained by applying UNet++ to the images shown in Figure 1, suggest that the shortcomings described in Section 2.3 still frequently occurred.

Figure 8. Buildings extracted by UNet++.

As for the second stage, in comparison with SegNet, UNet++, and MANet, the proposed method, when based on any of those networks, significantly increased the F-measure by at least 1.15%. This was because the recall significantly increased by at least 3.38%, thus demonstrating that the proposed method worked as expected. However, we note that the precision decreased by at most 0.91%, and we will need to deal with this degradation in future works to achieve further improvement.

As compared with interactive GrabCut, the proposed method when based on UNet++ or MANet increased the F-measure by at least 0.38% (not a significant difference), because its precision was significantly higher, by at least 1.99%. In addition, although the proposed method when based on SegNet could not increase the F-measure, a significant difference was not found. Whereas the automatic building extraction methods in the literature could not outperform interactive GrabCut, the proposed method achieved higher accuracy when it was based on UNet++ or MANet, thereby demonstrating its effectiveness.

Figure 9 shows two examples of the extracted buildings for each method. For both images, the clustering-based method and UNet++ tended to erroneously determine part of the actual building region as part of the background. In contrast, the proposed method (based on UNet++) could avoid this erroneous determination, like interactive GrabCut. For these images, our evaluation demonstrated that the proposed method increased the F-measure by at least 2.85%.

Figure 9. Buildings extracted by each method.

Figure 10 depicts the buildings extracted from the images shown in Figure 1 by the proposed method when based on UNet++. A comparison of Figures 8 and 10 further indicates that the proposed method increased the recall as expected. Specifically, for these images, the quantitative comparison demonstrated that the proposed method increased the recall and F-measure by at least 5.15% and 3.24%, respectively, as compared with the UNet++ baseline.

Figure 10. Buildings extracted by the proposed method when based on UNet++.

4.3. Discussion

Here, to provide more focused and useful findings, we discuss the proposed method's importance and effectiveness. The images in the evaluation dataset were classified as low-precision or low-recall images on the basis of the extraction accuracies of SegNet, UNet++, and MANet. The low-precision images were those whose precision was lower than the recall, whereas the low-recall images were those whose recall was lower than the precision.

Table 2 summarizes the results of this classification, indicating that the low-recall images accounted for at least 47% of the images. This suggests that the shortcomings of segmentation networks, as described in Section 2.3, frequently occur. Thus, the proposed method is an important step in improving the performance of building extraction.

Table 2. Numbers of low-precision and low-recall images in the dataset.

In addition, the low-recall images with SegNet accounted for 81% of the images, which was at least 19 percentage points larger than the proportions for UNet++ and MANet. This phenomenon largely accounts for the F-measure difference between SegNet and the proposed method when based on SegNet, because the proposed method was designed to improve the recall. The comparison results in Table 1 demonstrated that the difference for SegNet was 4.60%, which was at least 2.69% larger than the differences for UNet++ and MANet.

5. Conclusions

In this work, we aimed to improve the accuracy of building extraction based on a segmentation network, which can be implemented at low cost. Accordingly, we applied various state-of-the-art segmentation networks proposed in recent years to building extraction. In an experiment using 105 images in the ZuBuD dataset, these networks significantly increased the F-measure by at least 1.67% as compared with the conventional building extraction methods. However, the state-of-the-art segmentation networks require further improvement because of their shortcomings.

Hence, we proposed an algorithm to address those shortcomings. As a result, the proposed method significantly increased the F-measure by at least 1.15% as compared with the baseline segmentation networks. In addition, our method showed effectiveness in increasing the F-measure as compared with semi-automatic building extraction based on manual input. In summary, this work has demonstrated a total increase in the F-measure by 3.58% in comparison with conventional building extraction.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Noboru Hayasaka

Noboru Hayasaka received B.S., M.S., and Ph.D. degrees from Hokkaido University in 2002, 2004, and 2007, respectively. At present, he is an associate professor at Osaka Electro-Communication University. His current research interests include signal processing and speech recognition.

Yuki Shirazawa

Yuki Shirazawa received a B.S. degree from Osaka Electro-Communication University in 2022 and is now an M.S. student there. His current research interest is image processing.

Mizuki Kanai

Mizuki Kanai has been a B.S. student at Osaka Electro-Communication University since 2019. His current research interest is image processing.

Takuya Futagami

Takuya Futagami received B.S., M.S., and Ph.D. degrees from Osaka University in 2013, 2015, and 2021, respectively. At present, he is a lecturer at Aichi Gakuin University. His current research interests include image processing and image recognition.

References

  • Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495. https://doi.org/10.1109/TPAMI.34
  • Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1124–1137. https://doi.org/10.1109/TPAMI.2004.60
  • Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In European conference on computer vision (pp. 44–57).
  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceeding of IEEE conference on computer vision and pattern recognition (pp. 248–255).
  • Fan, T., Wang, G., Li, Y., & Wang, H. (2020). Ma-net: A multi-scale attention network for liver and tumor segmentation. IEEE Access, 8, 179656–179665. https://doi.org/10.1109/Access.6287639
  • Fang, W., Ding, Y., Zhang, F., & Sheng, V. S. (2019). DOG: A new background removal for object recognition from images. Neurocomputing, 361(7), 85–91. https://doi.org/10.1016/j.neucom.2019.05.095
  • Femiani, J., Para, W. R., Mitra, N., & Wonka, P. (2018). Facade segmentation in the wild. ArXiv Preprint, arXiv:1805.08634.
  • Fond, A., Berger, M. O., & Simon, G. (2021). Model-image registration of a building's facade based on dense semantic segmentation. Computer Vision and Image Understanding, 206, Article ID 103185. https://doi.org/10.1016/j.cviu.2021.103185
  • Futagami, T., Hayasaka, N., & Onoye, T. (2020). Fast and robust building extraction based on HSV color analysis using color segmentation and GrabCut. SICE Journal of Control, Measurement, and System Integration, 13(3), 97–106. https://doi.org/10.9746/jcmsi.13.97
  • Iwai, M., Futagami, T., Hayasaka, N., & Onoye, T. (2020). Acceleration of automatic building extraction via color-clustering analysis. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 103(12), 1599–1602. https://doi.org/10.1587/transfun.2020SML0004
  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).
  • Men, K., Chen, X., Yang, B., Zhu, J., Yi, J., Wang, S., Li, Y., & Dai, J. (2021). Automatic segmentation of three clinical target volumes in radiotherapy using lifelong learning. Radiotherapy and Oncology, 157, 1–7. https://doi.org/10.1016/j.radonc.2020.12.034
  • Mo, Y., Wu, Y., Yang, X., Liu, F., & Liao, Y. (2022). Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing, 493(7), 626–646. https://doi.org/10.1016/j.neucom.2022.01.005
  • Ribani, R., & Marengoni, M. (2019). A survey of transfer learning for convolutional neural networks. In 32nd SIBGRAPI conference on graphics, patterns and images tutorials (pp. 44–57).
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234–241).
  • Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314. https://doi.org/10.1145/1015706.1015720
  • Shao, H., Svoboda, T., & Van Gool, L. (2003). Zubud-zurich buildings database for image based recognition. In Computer vision lab, Swiss Federal Institute of Technology, Switzerland, Tech. Rep (p. 260).
  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. ArXiv Preprint, arXiv:1409.1556.
  • Sklansky, J. (1972). Measuring concavity on a rectangular mosaic. IEEE Transactions on Computers, 21(12), 1355–1364. https://doi.org/10.1109/T-C.1972.223507
  • Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114).
  • Tan, M., & Le, Q. (2021). Efficientnetv2: Smaller models and faster training. In International conference on machine learning (pp. 10096–10106).
  • Tang, W., Wang, Y., Zou, X., Li, Y., Deng, C., & Cui, J. (2021). Visualization of GNSS multipath effects and its potential application in IGS data processing. Journal of Geodesy, 95(9), 1–13. https://doi.org/10.1007/s00190-021-01559-9
  • Toft, C., Turmukhambetov, D., Sattler, T., Kahl, F., & Brostow, G. J. (2020). Single-image depth prediction makes feature matching easier. In European conference on computer vision (pp. 473–492).
  • Ueno, D., Yoshida, H., & Iiguni, Y. (2016). Automated GrowCut with multilevel seed strength value for building image. Transactions of the Institute of Systems, Control and Information Engineers, 29(6), 266–274. https://doi.org/10.5687/iscie.29.266. (in Japanese).
  • Wang, Y., Ren, T., Zhong, S. H., Liu, Y., & Wu, G. (2018). Adaptive saliency cuts. Multimedia Tools and Applications, 77(17), 22213–22230. https://doi.org/10.1007/s11042-018-5859-y
  • Wu, K., Otoo, E., & Suzuki, K. (2009). Optimizing two-pass connected-component labeling algorithms. Pattern Analysis and Applications, 12(2), 117–135. https://doi.org/10.1007/s10044-008-0109-y
  • Zangenehnejad, F., & Gao, Y. (2021). GNSS smartphones positioning: Advances, challenges, opportunities, and future perspectives. Satellite Navigation, 2(1), 1–23. https://doi.org/10.1186/s43020-021-00054-y
  • Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., & Liang, J. (2019). Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging, 39(6), 1856–1867. https://doi.org/10.1109/TMI.42