Research Article

Vegetation segmentation using oblique photogrammetry point clouds based on RSPT network

Article: 2310083 | Received 10 Oct 2023, Accepted 19 Jan 2024, Published online: 05 Feb 2024

ABSTRACT

Vegetation segmentation via point cloud data can provide important information for urban planning and environmental protection. Point cloud datasets are usually obtained using light detection and ranging (LiDAR) or RGB-D images, whereas oblique photogrammetry has received little attention as another important source of point cloud data. We present a pointwise annotated oblique photogrammetry point-cloud dataset that contains rich RGB information, texture, and structural features. The dataset covers five regions of Bengbu, China, and contains more than twenty thousand samples. Previous indoor point cloud semantic segmentation models are not well suited to oblique photogrammetry point clouds, so we propose a random sampling point transformer (RSPT) network to enhance vegetation segmentation accuracy. The RSPT model offers an efficient and lightweight architecture. In RSPT, random point sampling is used to downsample point clouds, and a local feature aggregation module based on self-attention is designed to extract more representative features. The network also incorporates residual and dense connections (ResiDense) to capture both local and comprehensive features. Compared to state-of-the-art models, RSPT achieves notable improvements: the intersection over union (IoU) increases from 96.0% to 96.5%, the F1-score increases from 90.8% to 97.0%, and the overall accuracy (OA) increases from 91.9% to 96.9%.

1. Introduction

Vegetation segmentation holds great significance for rapid 3D modelling of cities (Zhou and Neumann Citation2008; Wang, Peethambaran, and Chen Citation2018), urban spatial data infrastructure construction (Wu et al. Citation2017; Alonzo, Bookhagen, and Roberts Citation2014), and environmental protection (Chen et al. Citation2021). As an important technique, remote sensing with different spatial resolutions is widely used to monitor vegetation (Lin and Hyyppa Citation2012; Guan et al. Citation2016; Liang et al. Citation2016; Xu et al. Citation2018a; Citation2018b). However, 2D images are not sufficiently sensitive to the vertical structure of vegetation (Guo et al. Citation2022), which limits the extraction of vegetation structural features. Vegetation extraction using point clouds can directly yield 3D coordinates of the vegetation surface (Huo, Lindberg, and Holmgren Citation2022; Zhang et al. Citation2023) and realize a 3D representation of the vegetation (Qin et al. Citation2022; Pang et al. Citation2021). The advantage of point clouds is that they fully exploit the spatial structural features of vegetation and thereby further improve vegetation segmentation accuracy. Point clouds are acquired by light detection and ranging (LiDAR) based mobile laser scanning (MLS) (Vallet et al. Citation2015; Behley et al. Citation2019), terrestrial laser scanning (TLS) (Hackel et al. Citation2017), or RGB-D images (Firman Citation2016; Armeni et al. Citation2017). LiDAR scanning may yield incomplete or inaccurate point-cloud data and is expensive (Li et al. Citation2016). The limited measurement range of RGB-D sensors restricts their ability to capture outdoor environments (Dai et al. Citation2017). Most cities have established realistic 3D models as their development gradually transforms from digital cities to smart cities. As an intermediate product of building realistic 3D models, oblique photogrammetry point clouds offer the advantages of low cost, high efficiency, and high accuracy; such point clouds are easily derived from UAV imagery owing to recent progress in structure from motion (SfM), multiview stereo (MVS), and UAV techniques (Wu et al. Citation2011; Frahm et al. Citation2010). In this study, an oblique photogrammetry point-cloud dataset was constructed using unmanned aerial vehicle (UAV) imagery from Bengbu, China. This dataset enables comprehensive scene visualization and exhibits resilience against occlusions.

Currently, machine learning is the primary tool used for vegetation extraction, and learning-based algorithms have shown promising results in point-cloud semantic segmentation. The irregular format of point clouds makes directly applying convolutional neural networks (CNNs) challenging, so most researchers convert point clouds into regular three-dimensional (3D) voxel grids or image collections. Projection-based methods (Lawin et al. Citation2017; Su et al. Citation2015; Ma et al. Citation2020; Luo et al. Citation2019) map point clouds into images captured from various viewpoints, and a 2D CNN is then used to extract features from these images. Geometric details may be lost during projection, and these methods face challenges in fine-grained point-cloud segmentation tasks. Voxel-based methods (Maturana and Scherer Citation2015; Liu et al. Citation2019; Zou et al. Citation2017; Luo et al. Citation2018) convert irregular point clouds into regular voxels and then apply a 3D CNN to the voxel data to extract point cloud features. A low voxel resolution may lose fine-grained details, whereas a high voxel resolution leads to significant computational overhead when transforming irregular point clouds into regular voxels (Guo et al. Citation2020).

PointNet (Qi et al. Citation2017a) was designed to handle unordered point clouds directly. It utilizes a multilayer perceptron (MLP) to extract local features, which are aggregated by symmetric functions. Although PointNet has achieved good results in point cloud segmentation, it does not consider the relationships between points or local neighbourhood information. PointNet++ (Qi et al. Citation2017b), which has a hierarchical network architecture, was proposed to extract local features based on PointNet. PointNet++ effectively addresses local feature extraction but still fails to consider the relationships between points. PA-Net (Ren et al. Citation2022) combines an attention mechanism with a CNN and proposes a point attention module and a feature attention module to selectively extract local features with long-range dependencies. The output of the point attention module is an N × N matrix, which poses a considerable computational burden for large-scale semantic segmentation tasks. RandLA-Net (Hu et al. Citation2020) designs a local feature aggregation module and attentive pooling to handle semantic segmentation tasks on large-scale point cloud data; a dilated residual block is also designed to reduce feature loss during downsampling. However, this approach does not consider global feature information. The spatial contextual feature (SCF) module was presented in SCF-Net (Fan et al. Citation2021) to further learn features from large-scale scenes; both geometric and feature distances are used, enhancing the global information extraction capability. The RFCF was proposed by Gong et al. (Citation2021), who introduced a multiscale supervised approach to the point cloud segmentation problem. This approach decomposes the segmentation problem into a global context recognition task and a series of progressively receptive region encoding inference processes. Additionally, the authors propose a complementary feature densification method to provide more dynamic features for RFCF predictions. These methods share the limitations of CNNs, which struggle to capture long-range global features effectively. Regardless of the positional distance in the input sequence, the self-attention mechanism in the transformer captures the dependencies between any pair of input features, which allows it to extract long-range global features effectively. Moreover, the transformer processes the input sequence in a parallelized and order-independent manner, making it well suited for handling unordered point cloud data (Vaswani et al. Citation2017). A hierarchical structure was constructed in PT (Zhao et al. Citation2021) to extract global features from the downsampled point clouds using vector self-attention. Zhao et al. replaced multiplication with subtraction in the self-attention layer to reduce the computational burden and to consider the positional relationships of points; max pooling is also utilized in PT to aggregate the features of neighbouring points. PCT (Guo et al. Citation2021) extracts local features from neighbouring points and feeds these features into a transformer layer based on offset attention to learn global features. Offset attention is inherently permutation-invariant and more suitable for point cloud learning. However, the farthest point sampling (FPS) used for downsampling imposes a significant computational burden, especially in large-scale semantic segmentation tasks.
Although transformers have shown promising performance in indoor segmentation tasks, little research has been conducted on semantic segmentation for large-scale scenes.

Inspired by these studies, we propose the random sampling point transformer (RSPT) network for outdoor segmentation tasks. In RSPT, random point sampling is utilized to downsample point clouds, and a local feature aggregation module based on vector self-attention is designed to extract more representative global features. Attentive pooling is used to aggregate neighbouring point features. Moreover, ResiDense connections link same-level point transformer blocks so that local features and contextual information are captured simultaneously. Our main contributions are as follows.

  • We construct an oblique photogrammetry point cloud dataset for training and testing the deep learning models.

  • We propose a vegetation segmentation network called RSPT that processes oblique photogrammetry point clouds directly; it uses an effective local feature aggregation module based on vector self-attention to preserve complex local structures.

2. Materials

2.1. Data acquisition

The Bengbu dataset was constructed using structure from motion with multiview stereo (SfM-MVS) on UAV images (Westoby et al. Citation2012). First, the UAV was flown over the target area along the planned flight path to capture the image data. The imagery was captured using a Guangzhou Zhonghaida iFYL V5 vertical takeoff and landing fixed-wing drone equipped with five iCam-Q5 mini cameras. We employed Digital Photo Smart (DP-Smart) as the SfM-MVS software to perform photogrammetric processing and registration of the captured images, resulting in accurate point-cloud data generation.

Control points were carefully selected and positioned onsite prior to image acquisition, followed by meticulous marking and measurement procedures. The distribution of control points is shown in Figure 1. In the image collection process, blockwise aerial photography was applied, and the flight lines covered at least 250 m beyond the boundary of the survey area to ensure stereo imaging of edge objects. The basic parameters of the aerial photography technique design are listed in Table 1.

Figure 1. Schematic of control point deployment. The red triangular grid represents the control points. The region delineated by the blue lines represents the survey area or mapping extent.

Table 1. Basic parameters for aerial photography technique design.

2.2. Data annotation

The Bengbu dataset is annotated pointwise using a presegmentation clustering method for 3D labelling. Computer-aided design (CAD) drawings of the target areas were created by hired professionals, with the different categories depicted on separate layers. Using the CAD drawings as a reference, we segmented the point cloud with third-party software and manually annotated each category. The annotation results are shown in Figure 2. The dataset is categorized into three classes: vegetation, buildings, and roads. Figure 2(a) and (b) show that the scenes actually include more than three categories; we grouped them into these three classes for this study.

Figure 2. Data annotation process. (a) Original point data, (b) CAD drawing, and (c) annotated point cloud. Vegetation, buildings and roads represent the three categories in (c). DGX (contour lines), DLDW (independent feature), DLSS (road infrastructure), DMTZ (geomorphic features), GCD (elevation points), GXYZ (pipeline facilities), JMD (residential area), SXSS (water system facilities) and ZBTZ (vegetation features) represent the different layers in (b).

2.3. Parsing and statistics

The Bengbu dataset includes five regions of Bengbu, China: urban areas (UA), industrial parks (IP), university towns (UT), townships (TS), and countryside (CS). It contains 432 million points at a density of 33 points per square metre, with a total coverage of 13.1 km2. These point cloud data include the geographic location and RGB information of the target objects. In this paper, we combine geolocation and RGB information into one input; that is, each point is characterized as a vector containing its 3D coordinates and RGB values. The dataset is categorized into three classes: vegetation, buildings, and roads. A descriptive summary of the points in these five regions is presented in Table 2.
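To make the input representation concrete, the following minimal sketch (assuming a simple whitespace-separated export with columns x y z r g b label, which is an illustrative format rather than the dataset's actual file layout) shows how each point can be packed into a single six-dimensional input vector combining geolocation and RGB information.

```python
import numpy as np

def load_points(path):
    """Read a whitespace-separated point file with columns: x y z r g b label (assumed layout)."""
    data = np.loadtxt(path)                        # shape: (N, 7)
    xyz = data[:, 0:3].astype(np.float32)          # geolocation (3D coordinates)
    rgb = data[:, 3:6].astype(np.float32) / 255.0  # colours normalised to [0, 1]
    labels = data[:, 6].astype(np.int64)           # class index, e.g. vegetation / building / road
    features = np.concatenate([xyz, rgb], axis=1)  # (N, 6) per-point input vector
    return features, labels
```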

Table 2. Introduction to the dataset.

3. Methods

3.1. Overview

A network processing a large-scale scene with millions of points must downsample it while preserving its original features. In the RSPT network, we employ random sampling for efficient point cloud downsampling, and attentive pooling is proposed to prevent the loss of crucial features. This maintains a balance between efficiency and performance.

RSPT maintains the basic framework of a point cloud semantic segmentation model and includes three modules: a feature extraction block, a downsampling block, and an upsampling block. Figure 3 shows the basic architecture of RSPT. What distinguishes our network from other models is that it is based on self-attention, pointwise transformations, and pooling, thus departing completely from the CNN architecture. In this study, the network consists of five encoding and four decoding layers; the specific number of encoding and decoding layers can be adjusted according to the task. Each encoder layer is a feature extraction layer composed of three point transformer blocks and a downsampling block. Each decoder layer is a feature propagation layer composed of a depoint transformer block and an upsampling block. Each stage retains one-fourth of the points of the previous stage, except for the first stage, in which downsampling is not performed. The corresponding feature dimensions for the stages are [32, 64, 128, 256, 512].
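The stage configuration described above can be summarised by the following schematic sketch (the helper function and constants are illustrative, not the released implementation): five encoder stages with feature dimensions [32, 64, 128, 256, 512], where every stage after the first keeps one-fourth of the points of the previous stage.

```python
ENCODER_DIMS = [32, 64, 128, 256, 512]

def build_stage_plan(n_points):
    """Return (points kept, feature dimension) for each encoder stage of an n_points input."""
    plan = []
    kept = n_points
    for i, dim in enumerate(ENCODER_DIMS):
        if i > 0:              # the first stage performs no downsampling
            kept = kept // 4   # each later stage keeps 1/4 of the previous stage's points
        plan.append((kept, dim))
    return plan

print(build_stage_plan(40960))
# [(40960, 32), (10240, 64), (2560, 128), (640, 256), (160, 512)]
```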

Figure 3. The RSPT network for semantic segmentation. N is the number of input points.

3.2. Random sampling method

Different sampling methods significantly influence the running speed and memory consumption of the network. Table 3 introduces several commonly used sampling methods and compares their computational complexities. The computational complexity of random sampling (RS) is independent of the number of points, whereas that of farthest point sampling (FPS) (Qi et al. Citation2017b), inverse density importance sampling (IDIS) (Groh, Wieschollek, and Lensch Citation2019) and Poisson disk sampling (PDS) (Bridson Citation2019) increases rapidly with the number of points. Random sampling is therefore more suitable for RSPT.

Table 3. Comparison of representative sampling methods used for point cloud processing. The complexity O denotes the computational complexity when sampling M points from a large-scale point cloud P with N points, and K denotes the number of nearest neighbours.

A schematic illustration of the transition-down module is shown in Figure 4. Assume that the input point set of the downsampling module is P1 and that the output is P2. We perform random sampling on P1 to obtain P2 ⊂ P1. To pool the feature vectors from P1 onto P2, the k-nearest neighbours (KNN) algorithm is used to find the neighbouring points in P1 of each point in P2, and the features of P2 are then obtained through attentive pooling.
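A minimal PyTorch sketch of this transition-down step is given below, assuming a simple brute-force KNN search; the tensor shapes and function names are illustrative rather than taken from the authors' code. It randomly samples P2 from P1 and gathers, for each sampled point, the features of its K nearest neighbours so that they can be aggregated by attentive pooling (sketched in the next subsection).

```python
import torch

def transition_down(xyz, feats, ratio=4, k=16):
    """xyz: (N, 3) point coordinates of P1; feats: (N, C) point features of P1."""
    n = xyz.shape[0]
    m = n // ratio
    idx = torch.randperm(n)[:m]                      # random sampling: indices of P2 ⊂ P1
    new_xyz = xyz[idx]                               # (M, 3) coordinates of P2
    dists = torch.cdist(new_xyz, xyz)                # (M, N) pairwise distances to P1
    knn_idx = dists.topk(k, largest=False).indices   # (M, K) K nearest neighbours in P1
    grouped_feats = feats[knn_idx]                   # (M, K, C) neighbour features to be pooled
    return new_xyz, grouped_feats, knn_idx
```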

Figure 4. RS block. The top panel shows the steps of point cloud downsampling. The bottom panel shows the attentive pooling mechanism, which can weight and aggregate features adaptively.

To aggregate the features of the k points around each sampled point in the downsampling module, we face the problem of mapping an arbitrary number of elements in a set A to a single output y. In existing studies (Qi et al. Citation2017a; Zeng et al. Citation2022), the main approach for selecting representative features from neighbouring points is to use max pooling, average pooling, or a combination of both. These pooling operations are predefined, capture only partial information, and can lose fine-grained features of the neighbouring points. By contrast, we design an aggregation function f with learnable weights W, $y = f(A, W)$, that is permutation invariant and therefore suitable for unordered point clouds. A simple proof of permutation invariance is as follows:
(1) $[y_1, \ldots, y_d, \ldots, y_D] = f(\{x_1, \ldots, x_n, \ldots, x_N\}, W)$
In Equation 1, the d-th entry of the output y is computed as:
(2) $y_d = \sum_{n=1}^{N} o_{nd} = \sum_{n=1}^{N} (x_{nd} \cdot s_{nd}) = \sum_{n=1}^{N} \left( x_{nd} \cdot \frac{e^{c_{nd}}}{\sum_{j=1}^{N} e^{c_{jd}}} \right) = \sum_{n=1}^{N} \left( x_{nd} \cdot \frac{e^{x_n w_d}}{\sum_{j=1}^{N} e^{x_j w_d}} \right) = \frac{\sum_{n=1}^{N} x_{nd}\, e^{x_n w_d}}{\sum_{j=1}^{N} e^{x_j w_d}}$
In Equation 2, $w_d$ denotes the d-th column of the weight matrix W. Both the numerator and the denominator involve summations over permutation-equivariant terms, so the value $y_d$, and hence the entire vector y, remains invariant to permutations of the deep feature set $A = \{x_1, x_2, \ldots, x_n, \ldots, x_N\}$.

Attention pooling can automatically capture point-to-point positional relationships without relying on predefined neighbourhood relationships. In addition, attention pooling can automatically focus on important regions through the attention mechanism due to the sparse nature of the point cloud, expanding or narrowing the receptive field to capture more critical features.

Attention pooling is performed through the following steps.

Computing attention scores: Assume that a point and its K adjacent points form a feature set $F_i = \{f_i^1, \ldots, f_i^k, \ldots, f_i^K\}$. We use a function g(·) to learn an attention score for each feature. The function g(·) consists of a shared MLP followed by a softmax, formally defined as: (3) $s_i^k = g(f_i^k, W)$, where W is the learnable weight of the shared MLP.

Weighted summation: The learned attention scores evaluate the correlation between points and automatically select important features, as shown in Equation 4. (4) $\tilde{f}_i = \sum_{k=1}^{K} (f_i^k \cdot s_i^k)$ In summary, for a given point set P, attentive pooling learns to aggregate the geometric and descriptor features of the K nearest points of each point $p_i$ in P.
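The following sketch implements Equations 3 and 4 as a small PyTorch module, assuming the shared MLP is a single linear layer; the layer sizes are illustrative. A softmax over the K neighbours turns the learned scores into weights, and the weighted sum gives the aggregated feature.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score_fn = nn.Linear(channels, channels, bias=False)  # shared MLP g(., W)

    def forward(self, grouped_feats):
        """grouped_feats: (M, K, C) features of the K neighbours of each of M points."""
        scores = torch.softmax(self.score_fn(grouped_feats), dim=1)  # Eq. 3, normalised over K
        return (grouped_feats * scores).sum(dim=1)                   # Eq. 4, weighted sum -> (M, C)

pooled = AttentivePooling(64)(torch.randn(128, 16, 64))  # 128 points, 16 neighbours, 64 channels
```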

3.3. Local feature aggregation

ResiDense connection: Inspired by previous studies (Hu et al. Citation2020; Chen et al. Citation2017; Du et al. Citation2021), we designed a feature aggregation module consisting of three point transformer blocks and incorporated residual and dense connections to enhance its feature representation capability. Figure 5 shows the residual and dense connections between the point transformer blocks. The residual connection adds the output of the previous layer to the input of the subsequent layer, enabling feature reuse and thus improving segmentation performance. The dense connection concatenates the output of each previous block into the final output of the module, thereby alleviating the problem of vanishing or exploding gradients in deep networks. The lower-level point transformer blocks store local and neighbouring information, whereas the higher-level blocks capture features from a larger context or a wider range. By utilizing both types of connections, the ResiDense module aggregates features that simultaneously capture local structure and contextual information, improving performance in point cloud semantic segmentation tasks.
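A structural sketch of the ResiDense idea is shown below: three point transformer blocks with a residual path around each block and a dense path that concatenates every intermediate output into the module output. The point transformer block is represented by a generic `block_factory` placeholder, so the sketch shows only the connection pattern, not the authors' exact module.

```python
import torch
import torch.nn as nn

class ResiDense(nn.Module):
    def __init__(self, block_factory, dim, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([block_factory(dim) for _ in range(n_blocks)])
        self.fuse = nn.Linear(dim * n_blocks, dim)  # fuse the densely concatenated outputs

    def forward(self, x):
        outputs = []
        for block in self.blocks:
            x = block(x) + x          # residual connection: reuse the previous features
            outputs.append(x)         # keep every block output for the dense path
        return self.fuse(torch.cat(outputs, dim=-1))  # dense connection across all blocks

# Usage with a trivial stand-in block:
module = ResiDense(lambda d: nn.Sequential(nn.Linear(d, d), nn.ReLU()), dim=64)
y = module(torch.randn(1024, 64))
```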

Figure 5. Proposed ResiDense Module, ⊕: concatenate, ⊗: elementwise addition and concatenate.

Point transformer block: The point transformer block is used to extract features from the input point cloud; it consists of a point transformer layer and two linear layers, as shown in . The point transformer layer models the contextual relationships within the input sequence. The point transformer block enables information exchange among local features and generates updated feature vectors as its output. It aggregates the content of the local point cloud feature vectors and their 3D layout, allowing effective information exchange and feature generation. In each feature extraction layer, n point transformer blocks are concatenated to learn features jointly. By stacking multiple point transformer blocks, the feature extraction block fuses and combines the features of the input sequence; each block receives the feature representation from the previous layer and outputs a higher-level abstract representation.

Point transformer layer: The essence of the point transformer layer, which serves as our core layer, is self-attention. There are two types of self-attention operators: scalar attention (Vaswani et al. Citation2017) and vector attention (Zhao, Jia, and Koltun Citation2020). In this layer, we employ vector self-attention, computed as:
(5) $y_i = \sum_{x_j \in X(i)} \rho\big(\gamma(\varphi(x_i) - \psi(x_j) + \delta)\big) \odot \big(\alpha(x_j) + \delta\big)$
where $X(i) \subseteq X$ is the set of neighbouring points of $x_i$ obtained through the transition-down block, and $x_j$ is a neighbouring point of $x_i$. $y_i$ is the output of the point transformer layer, which contains semantic information and structural features. $\varphi$, $\psi$, and $\alpha$ are linear projections; $\delta$ is a position encoding function computed from the point coordinates; $\rho$ is the softmax function; and $\gamma$ is an MLP that performs a nonlinear transformation on the aggregated features. The point transformer layer computes the subtraction between the features transformed by $\varphi$ and $\psi$ and uses the result as attention weights to aggregate the features transformed by $\alpha$. In addition, the position encoding $\delta$ is added to both the attention generation branch ($\gamma$) and the feature transformation branch ($\alpha$).
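A simplified single-head PyTorch sketch of Equation 5 follows; the module sizes, the form of the MLPs for γ and δ, and the use of precomputed neighbour indices are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PointTransformerLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)     # query projection, phi
        self.psi = nn.Linear(dim, dim)     # key projection, psi
        self.alpha = nn.Linear(dim, dim)   # value projection, alpha
        self.delta = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))    # position encoding
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # attention MLP

    def forward(self, x, xyz, knn_idx):
        """x: (N, C) features, xyz: (N, 3) coordinates, knn_idx: (N, K) neighbour indices."""
        q = self.phi(x).unsqueeze(1)                          # (N, 1, C)
        k = self.psi(x)[knn_idx]                              # (N, K, C) neighbour keys
        v = self.alpha(x)[knn_idx]                            # (N, K, C) neighbour values
        pos = self.delta(xyz.unsqueeze(1) - xyz[knn_idx])     # (N, K, C) relative position encoding
        attn = torch.softmax(self.gamma(q - k + pos), dim=1)  # subtraction-based vector attention
        return (attn * (v + pos)).sum(dim=1)                  # aggregate neighbour features -> (N, C)
```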

As illustrated in Figure 6, we employ a multihead attention mechanism in the transformer layer to enhance the expressive power of the model, enabling it to learn more diverse and complex features. Under the multihead mechanism, the input sequence is divided into multiple heads, each head performs an independent computation to obtain its own output, and these outputs are then concatenated to form the final output.

Figure 6. Point transformer layer. (a) Diagram of the point transformer vector attention mechanism. (b) Diagram of the multihead attention mechanism. (c) Details of the point transformer vector attention mechanism. Q, K and V are obtained by mapping the input matrix three times. k is the number of neighbouring points. d is the feature dimension of the input. d1 is the dimension of Q, K, V. d’ is the output dimension. h represents the number of heads in the multihead attention mechanism.

Equations 6–8 give the formulas for multihead attention. In Equation 6, h represents the number of heads, $\mathrm{head}_i$ denotes the output of the i-th head, and W is the output transformation matrix. The output of each head is given by Equation 7, where $W_i^Q$, $W_i^K$, and $W_i^V$ are the query, key, and value transformation matrices of the i-th head, respectively. Attention is the attention computation function defined in Equation 8.
(6) $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W$
(7) $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
(8) $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q - K + \delta) \odot (V + \delta)$
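The multihead arrangement of Equations 6 and 7 can be sketched by splitting the feature channels across h heads, running the vector attention of Equation 8 on each slice (here reusing the single-head layer sketched above), and concatenating the head outputs before the output transformation W; the head count and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadPointAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        # one single-head point transformer layer per head, each on dim/heads channels
        self.heads = nn.ModuleList([PointTransformerLayer(dim // heads) for _ in range(heads)])
        self.proj = nn.Linear(dim, dim)  # output transformation matrix W (Eq. 6)

    def forward(self, x, xyz, knn_idx):
        chunks = x.chunk(len(self.heads), dim=-1)             # split channels across heads
        outs = [head(c, xyz, knn_idx) for head, c in zip(self.heads, chunks)]
        return self.proj(torch.cat(outs, dim=-1))             # Concat(head_1, ..., head_h) W
```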

4. Experiments

4.1. Training and test strategy

The Bengbu dataset used in this study consists of five scenes used for model training. However, because of the absence of vegetation types in CZ and the limited vegetation types in CLJ, we selected areas from CQ and GYY as test areas, referred to as Area 1 and Area 2, respectively. The test sets were not used for model training, and the division between training and test areas was determined using k-fold cross-validation.

The vegetation category in the dataset includes both high and low vegetation point clouds, and the building category includes various types of building point clouds, such as factories, high-rise apartments, low-rise apartments, and tiled houses. These diverse and representative samples of real-world scenes in the Bengbu dataset allow a robust evaluation of the model's performance in different environments and scenarios.

4.2. Segmentation on the Bengbu dataset

The main objective of this section is to investigate the effectiveness of RSPT in handling semantic segmentation tasks on oblique photogrammetry point clouds. To this end, several mainstream point cloud semantic segmentation models were applied to the Bengbu oblique photogrammetry point cloud dataset provided in this paper, and their performance was compared with that of RSPT. This comparison provides insight into the potential advantages of RSPT for point cloud semantic segmentation on oblique photogrammetry datasets.

4.2.1. Implementation and metric

The experiments were performed on a PC running Ubuntu. The model was trained using PyTorch on a GeForce RTX 3090 GPU. We used the adaptive moment estimation (Adam) optimizer with a momentum of 0.9 and an initial learning rate of 0.001. The number of nearest neighbours k was set to 16.
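For reference, the optimizer configuration described above corresponds roughly to the following PyTorch setup; the placeholder model stands in for the RSPT network, and mapping the stated momentum of 0.9 to Adam's first-moment coefficient is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(6, 3)  # placeholder for the RSPT network (6-D point input, 3 classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))  # momentum 0.9 as beta1
K_NEIGHBOURS = 16        # number of nearest neighbours used throughout the network
```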

To evaluate the effectiveness of the oblique photogrammetry point clouds and the models, we adopted the mean intersection over union (mIoU), OA, intersection over union (IoU), accuracy, F1-score, precision, and recall as evaluation metrics.
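These metrics can all be derived from a single confusion matrix; the following sketch shows one way to compute them (variable names are illustrative, and classes with no predictions would need guarding against division by zero).

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j]: number of points of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)                     # per-class IoU
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = tp.sum() / conf.sum()                    # overall accuracy
    return {"IoU": iou, "mIoU": iou.mean(), "OA": oa,
            "precision": precision, "recall": recall, "F1": f1}
```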

4.2.2. Comparative experiments

The effectiveness of the proposed feature extraction method for oblique photogrammetry point clouds was validated using several mainstream point-cloud semantic segmentation models. Tables 4 and 5 present the results of these experiments, which show the semantic segmentation performance of the models on the Bengbu oblique photogrammetry point cloud dataset. All the tested models achieved good semantic segmentation performance, with RandLA-Net obtaining the best results among them. In test set 1, the mIoU reached 96.0%, the OA reached 91.4%, and the vegetation extraction IoU reached 93.5%. In test set 2, the mIoU reached 96.1%, the OA reached 91.9%, and the vegetation extraction IoU reached 96.0%. These results indicate that the proposed feature extraction method effectively captures meaningful features from oblique photogrammetry point-cloud data and achieves high accuracy in semantic segmentation tasks.

Table 4. Vegetation segmentation results for Test Area 1 (%).

Table 5. Vegetation segmentation results for Test Area 2 (%).

Tables 4 and 5 also compare the performance of the proposed method with that of the five mainstream algorithms. Our algorithm achieved better semantic segmentation performance. In particular, in test set 2, our algorithm achieved very good results, with almost complete extraction of the vegetation information in the scene, because the building types in test set 2 are relatively simple and therefore have minimal impact on vegetation extraction. These results suggest that the proposed method can accurately extract vegetation information from oblique photogrammetry point-cloud data and outperforms mainstream algorithms in the tested scenarios.

Figure 7 shows the semantic segmentation results of PointNet, PointNet++, RandLA-Net, SCF-Net, and RSPT on test area 1. RSPT clearly extracts vegetation better than the other networks and recovers the vegetation points accurately. However, vegetation points at higher positions may still occasionally be misclassified as buildings, and vegetation points on the ground may be misclassified as roads; this situation is nevertheless significantly improved in RSPT, which further validates its effectiveness in extracting local and global distribution features. According to the segmentation results shown in Figure 8, the RSPT network also performs well on test area 2. Unlike in test area 1, however, SCF-Net recognized the walls of some buildings in test area 2 as vegetation.

Figure 7. Segmentation results on test area 1. (a) Comparison of the semantic segmentation results of the compared models. (b, c, d) Enlarged views of (a). (b: red box; c: white box; d: black box; green: trees; yellow: buildings; blue: roads)

Figure 8. Segmentation results on test area 2. (a) Comparison of the semantic segmentation results of the compared models. (b) Different views of (a).

We also verified that our algorithm performs well in a complex scene in which buildings and vegetation are mixed in the point cloud. As shown in Figure 9, the scene contained a total of 5,509,160 points, including 1,325,796 vegetation points and 3,853,009 building points. The vegetation extraction accuracy achieved by our algorithm was 76.0%. Some misrecognition remains, where the peripheral walls of the lower levels of buildings are mistaken for vegetation.

Figure 9. Tree segmentation results in complex scenes. (a) Raw point clouds and (b) segmentation results. The red box indicates where the low wall was misidentified as vegetation.

The comparative experiments conducted in this study showed that the oblique photogrammetry point cloud dataset provided here yields favourable semantic segmentation results with several mainstream models, indicating that the proposed oblique photogrammetry point-cloud feature extraction method effectively extracts point cloud features. Additionally, the semantic segmentation models based on the attention mechanism outperformed the CNN-based models in extracting features from oblique photogrammetry point clouds, suggesting that attention-based approaches are a promising direction for further improving point cloud feature extraction and semantic segmentation in the context of oblique photogrammetry.

4.2.3. Ablation study

We conducted numerous controlled experiments to examine specific design decisions in RSPT. These studies involved semantic segmentation on the Bengbu dataset and were evaluated on test area 1.

(1∼4) Position encoding. This unit studies the choice of the position encoding δ. Table 6 shows that performance decreases significantly in the absence of position encoding, and absolute position encoding performs better than no position encoding. When relative position encoding is added only to the attention generation branch or only to the feature transformation branch, performance is degraded compared with that of our full RSPT network. These results show the importance of using relative position encoding and of adding it to both the attention generation and feature transformation branches.

Table 6. The mean IoU of all ablated networks based on our full RSPT.

(5∼7) Replacing attentive pooling with max, average, or the sum of max and average pooling. Attentive pooling learns to aggregate local features automatically during downsampling. We replace attentive pooling with max pooling, average pooling, and the sum of max and average pooling. Table 6 shows that these alternatives perform worse on the test set than attentive pooling. Therefore, attentive pooling is important in RSPT.

(8∼10) ResiDense connection. To validate the effectiveness of the ResiDense connections, we quantitatively investigated the performance of different connection schemes on the test area. In this ablation, we remove the ResiDense connections or use only residual or only dense connections to connect the point transformer blocks. Table 6 shows that the network obtains rather limited improvement when the ResiDense connections are not used.

Table 7. Ablation study: number of neighbours k in the attentive pooling. (%)

In Table 7, we investigate the setting of the number of neighbours k in attentive pooling on test area 1; the best results are achieved when k is set to 16. The experimental results show that a larger k is not always better. When k is small, attentive pooling cannot learn enough effective local information; when k is large, the neighbourhood includes many irrelevant points far from the sampled point, and a larger k also increases the computational burden. A suitable k therefore balances model effectiveness and computational cost.

4.3. Segmentation on the Campus3D dataset

4.3.1. Data and metrics

The Campus3D dataset is a point cloud dataset generated by photogrammetry from UAV images of the National University of Singapore (NUS) campus. It includes six regions: FASS, YIH, RA, UCC, PGP and FOE. Each scene contains hundreds of millions of points. The dataset provides semantic annotations at five different levels and contains rich semantic categories. We followed the data preparation procedure of Li et al. (Citation2020): FASS, YIH, RA, and UCC are the training sets, PGP is the validation set, and FOE is the test set. For the evaluation metrics, we use the IoU for each category and the mIoU over all categories.

4.3.2. Performance comparison

As Table 8 shows, Li et al. conducted experiments on a variety of hierarchical learning (HL) methods, including multiclassifier (MC), MC + hierarchical ensemble (HE), multitask (MT), MCnc (MC without the consistency loss branch), and MT + HE, to demonstrate the effectiveness of their proposed MT + HE method. We compare RSPT with the test results of Li et al. to verify our model's ability to segment a wider range of datasets. The experimental results show that our model outperforms the best-performing models on the C1 and C2 levels of semantic annotation in most cases. However, our results are lower than those of DGCNN for the higher-level semantic labels. The performance of RSPT gradually decreases as the variety of semantic annotations increases, indicating that segmentation becomes more difficult as labelled instances become fewer and more sparsely distributed. Overall, our proposed model performs better than the methods of Li et al. Notably, some of the data in Table 8 are from Li et al. (Citation2020).

Table 8. Test results (class IoU%) for different methods.

5. Conclusion

The main contributions of this study include the construction of an oblique photogrammetry point-cloud dataset for training and testing deep learning models. Additionally, the RSPT network was proposed for vegetation segmentation in large scenes; this network features an effective local feature aggregation module based on vector self-attention to preserve complex local structures. Random point sampling is used in RSPT to reduce computational consumption, and a transformer-based local feature extraction module was designed to extract more representative features, in which residual and dense connections achieve feature reuse. Relying solely on attention mechanisms to segment vegetation in large scenes is a departure from traditional CNN-based methods. We evaluated the RSPT network and several mainstream models on the Bengbu dataset. Compared with the state-of-the-art models, the IoU increased from 96.0% to 96.5%, the F1-score increased from 90.8% to 97.0%, and the OA increased from 91.9% to 96.9%. The experimental results indicate that the RSPT network achieves robust and efficient vegetation segmentation. Moreover, our work is highly relevant to constructing point cloud datasets and promoting applications of oblique photogrammetry technology. Utilizing the 3D point clouds generated during real-world 3D modelling reduces the cost of constructing point-cloud datasets, and deep learning-based methods for automatically classifying oblique photogrammetry features have led to significant improvements in applications such as intelligent DLG (digital line graph) feature vectorization for oblique photogrammetry products.

Acknowledgments

We would like to express our great appreciation to the editors and two anonymous reviewers for constructive comments that helped improve the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data supporting the findings of this study are available from the Bengbu Geotechnical Engineering and Surveying Institute. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from the author ([email protected]) with the permission of the Bengbu Geotechnical Engineering and Surveying Institute.

Additional information

Funding

This research was supported by the National Natural Science Foundation of China (grant number 41971311).

References

  • Alonzo, M., B. Bookhagen, and D. Roberts. 2014. “Urban Tree Species Mapping Using Hyperspectral and Lidar Data Fusion.” Remote Sensing of Environment 148: 70–83. https://doi.org/10.1016/j.rse.2014.03.018.
  • Armeni, I., S. Sax, A. R. Zamir, and S. Savarese. 2017. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. https://arxiv.org/abs/1702.01105.
  • Behley, J., M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. 2019. “SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences.” Paper presented at the Proceedings of the International Conference on Computer Vision, Seoul, Korea, October 2019.
  • Bridson, R. 2019. “Fast Poisson Disk Sampling in Arbitrary Dimensions.” Paper presented at the Conference of the Association for Computing Machinery, New Orleans, USA, May 2019.
  • Chen, Y., J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. 2017. “Dual Path Networks.” Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1707.01629.
  • Chen, Y., R. Wu, C. Yang, and Y. Lin. 2021. “Urban Vegetation Segmentation Using Terrestrial LiDAR Point Clouds Based on Point non-Local Means Network.” International Journal of Applied Earth Observation and Geoinformation 105: 102580. https://doi.org/10.1016/j.jag.2021.102580.
  • Dai, A., A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. 2017. “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, USA, July 2017.
  • Du, J., G. Cai, Z. Wang, S. Huang, J. Su, J. M. Junior, J. Smit, and J. Li. 2021. “ResDLPS-Net: Joint Residual-Dense Optimization for Large-Scale Point Cloud Semantic Segmentation.” Isprs Journal of Photogrammetry and Remote Sensing 182: 37–51. https://doi.org/10.1016/j.isprsjprs.2021.09.024.
  • Fan, S., Q. Dong, F. Zhu, Y. Lv, P. Ye, and F.-Y. Wang. 2021. “SCF-Net: Learning Spatial Contextual Features for Large-Scale Point Cloud Segmentation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14504–14513. https://doi.org/10.1109/cvpr46437.2021.01427.
  • Firman, M. 2016. “RGBD Datasets: Past, Present and Future.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, USA, June 2016.
  • Frahm, J. M., P. Fite-Georgel, D. Gallup, T. Johnson, and M. Pollefeys. 2010. “Building Rome on a Cloudless Day.” Paper presented at the European Conference on Computer Vision, Heraklion, Greece, September 2010.
  • Gong, J., J. Xu, X. Tan, H. Song, Y. Qu, Y. Xie, and L. Ma. 2021. “Omni-supervised Point Cloud Segmentation via Gradual Receptive Field Component Reasoning.” Paper presented at the Proceedings of the Conference on Computer Vision and Pattern Recognition, Napa, USA, June 2021.
  • Groh, F., P. Wieschollek, and H. Lensch. 2019. “Flex-Convolution: Million-Scale Point-Cloud Learning Beyond Grid-Worlds.” Paper presented at the Asian Conference on Computer Vision, Munich, Germany, September 2019.
  • Guan, H., J. Li, S. Cao, and Y. Yu. 2016. “Use of Mobile LiDAR in Road Information Inventory: A Review.” International Journal of Image and Data Fusion 7 (3): 219–242. https://doi.org/10.1080/19479832.2016.1188860.
  • Guo, M.-H., J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu. 2021. “PCT: Point Cloud Transformer.” Computational Visual Media 7: 187–199. https://doi.org/10.1007/s41095-021-0229-5.
  • Guo, S., J. Li, Z. Lai, and S. Han. 2022. “CTpoint: A Novel Local and Global Features Extractor for Point Cloud.” Neurocomputing 511: 273–289. https://doi.org/10.1016/j.neucom.2022.09.056.
  • Guo, Y., H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun. 2020. “Deep Learning for 3d Point Clouds: A Survey.” Transactions on Pattern Analysis and Machine Intelligence 43: 4338–4364. https://doi.org/10.1109/tpami.2020.3005434.
  • Hackel, T., N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys. 2017. “Semantic3D.net: A new Large-Scale Point Cloud Classification Benchmark.” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, 91–98. https://doi.org/10.5194/isprs-annals-iv-1-w1-91-2017.
  • Hu, Q., B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham. 2020. “Randla-net: Efficient Semantic Segmentation of Large-Scale Point Clouds.” Paper presented at the Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, USA, June 2020.
  • Huo, L., E. Lindberg, and J. Holmgren. 2022. “Towards Low Vegetation Identification: A New Method for Tree Crown Segmentation from LiDAR Data Based on a Symmetrical Structure Detection Algorithm (SSD).” Remote Sensing of Environment 270: 112857. https://doi.org/10.1016/j.rse.2021.112857.
  • Lawin, F. J., M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg. 2017. “Deep Projective 3D Semantic Segmentation.” Paper presented at the International Conference on Computer Analysis of Images and Patterns, Faro, Portugal, August 2017.
  • Li, X., C. Li, Z. Tong, et al. 2020. “Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene.” Paper presented at the Proceedings of the 28th ACM International Conference on Multimedia, Beijing, People’s Republic of China, October 2020.
  • Li, M., L. Nan, N. Smith, and P. Wonka. 2016. “Reconstructing Building Mass Models from UAV Images.” Computers & Graphics 54: 84–93. https://doi.org/10.1016/j.cag.2015.07.004.
  • Liang, X., V. Kankara, J. Hyyppa, Y. Wang, A. Kukko, H. Haggren, X. Yu, et al. 2016. “Terrestrial Laser Scanning in Forest Inventories.” ISPRS Journal of Photogrammetry and Remote Sensing 115: 63–77. https://doi.org/10.1016/j.isprsjprs.2016.01.006.
  • Lin, Y.i., and J. Hyyppa. 2012. “Multiecho-recording Mobile Laser Scanning for Enhancing Individual Tree Crown Reconstruction.” IEEE Transactions on Geoscience and Remote Sensing 50 (11): 4323–4332. https://doi.org/10.1109/TGRS.2012.2194503.
  • Liu, Z., H. Tang, Y. Lin, and S. Han. 2019. “Point-voxel cnn for Efficient 3d Deep Learning.” Advances in Neural Information Processing Systems, 32. https://doi.org/10.1109/access.2020.3023423.
  • Luo, Z., J. Li, Z. Xiao, Z. G. Mou, X. Cai, and C. Wang. 2019. “Learning High-Level Features by Fusing Multi-View Representation of MLS Point Clouds for 3D Object Recognition in Road Environment.” Journal of Photogrammetry and Remote Sensing 150: 44–58. https://doi.org/10.1016/j.isprsjprs.2019.01.024.
  • Luo, H., C. Wang, C. Wen, Z. Chen, D. Zai, Y. Yu, and J. Li. 2018. “Semantic Labeling of Mobile LiDAR Point Clouds via Active Learning and Higher Order MRF.” Transactions on Geoscience and Remote Sensing 56 (7): 3631–3644. https://doi.org/10.1109/TGRS.2018.2802935.
  • Ma, L., Y. Li, J. Li, Y. Yu, J. Marcato, W. N. Goncalves, and M. A. Chapman. 2020. “Capsule-based Networks for Road Marking Extraction and Classification from Mobile LiDAR Point Clouds.” Transactions on Intelligent Transportation Systems. https://doi.org/10.1109/TITS.2020.2990120.
  • Maturana, D., and S. Scherer. 2015. “Voxnet: A 3d Convolutional Neural Network for Real-Time Object Recognition.” Paper presented at the International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, September 2015.
  • Pang, Y., W. Wang, L. Du, Z. Zhang, X. Liang, Y. Li, and Z. Wang. 2021. “Nyström-based Spectral Clustering Using Airborne LiDAR Point Cloud Data for Individual Tree Segmentation.” International Journal of Digital Earth 14 (10): 1452–1476. https://doi.org/10.1080/17538947.2021.1943018.
  • Qi, C. R., H. Su, K. Mo, and L. J. Guibas. 2017a. “Pointnet: Deep Learning on Point Sets for 3d Classification and Segmentation.” Paper presented at the Proceedings of the Conference on Computer Vision and Pattern Recognition, Hawaii, USA, July 2017.
  • Qi, C. R., L. Yi, H. Su, and L. J. Guibas. 2017b. “Pointnet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.” Advances in Neural Information Processing Systems, 30. https://doi.org/10.1109/cvpr.2017.16.
  • Qin, H., W. Zhou, Y. Yao, and W. Wang. 2022. “Individual Tree Segmentation and Tree Species Classification in Subtropical Broadleaf Forests Using UAV-Based LiDAR, Hyperspectral, and Ultrahigh-Resolution RGB Data.” Remote Sensing of Environment 280: 113143. https://doi.org/10.1016/j.rse.2022.113143.
  • Ran, H., J. Liu, and C. Wang. 2022. “Surface Representation for Point Clouds.” Paper presented at the Conference on Computer Vision and Pattern Recognition, Shenzhen, People’s Republic of China, October 2022.
  • Ren, D., Z. Wu, J. Li, P. Yu, J. Guo, M. Wei, and Y. Guo. 2022. “Point Attention Network for Point Cloud Semantic Segmentation.” Science China Information Sciences 65: 192104. https://doi.org/10.1007/s11432-021-3387-7.
  • Su, H., S. Maji, E. Kalogerakis, and E. Learned-Miller. 2015. “Multi-view Convolutional Neural Networks for 3d Shape Recognition.” Paper presented at the Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, December 2015.
  • Vallet, B., M. Bredif, A. Serna, B. Marcotegui, and N. Paparoditis. 2015. “TerraMobilita/IQmulus Urban Point Cloud Analysis Benchmark.” Computers & Graphics 49: 126–133. https://doi.org/10.1016/j.cag.2015.03.004.
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł Kaiser, and I. Polosukhin. 2017. “Attention is all you Need.” Advances in Neural Information Processing Systems 30. https://doi.org/10.5040/9781350101272.00000005.
  • Wang, R., J. Peethambaran, and D. Chen. 2018. “LiDAR Point Clouds to 3-D Urban Models: A Review.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (2): 606–627. https://doi.org/10.1109/JSTARS.2017.2781132.
  • Westoby, M. J., J. Brasington, N. F. Glasser, M. J. Hambrey, and J. M. Reynolds. 2012. “‘Structure-from-Motion’ Photogrammetry: A Low-Cost, Effective Tool for Geoscience Applications.” Geomorphology 179: 300–314. https://doi.org/10.1016/j.geomorph.2012.08.021.
  • Wu, C., S. Agarwal, B. Curless, and S. M. Seitz. 2011. “Multicore Bundle Adjustment.” Paper presented at the Conference on Computer Vision and Pattern Recognition, Colorado Springs, USA, June 2011.
  • Wu, B., B. Yu, Q. Wu, S. Yao, F. Zhao, W. Mao, and J. Wu. 2017. “A Graph-Based Approach for 3D Building Model Reconstruction from Airborne LiDAR Point Clouds.” Remote Sensing 9: 92. https://doi.org/10.3390/rs9010092.
  • Xu, S., S. Xu, N. Ye, and F.a. Zhu. 2018a. “Individual Stem Detection in Residential Environments with MLS Data.” Remote Sensing Letters 9 (1): 51–60. https://doi.org/10.1080/2150704X.2017.1384588.
  • Xu, S., S. Xu, N. Ye, and F. A. Zhu. 2018b. “Automatic Extraction of Street Trees’ Nonphotosynthetic Components from MLS Data.” International Journal of Applied Earth Observation and Geoinformation 69: 64–77. https://doi.org/10.1016/j.jag.2018.02.016.
  • Zeng, Z., Y. Xu, Z. Xie, W. Tang, J. Wan, and W. Wu. 2022. “LEARD-Net: Semantic Segmentation for Large-Scale Point Cloud Scene.” International Journal of Applied Earth Observation and Geoinformation 112: 102953. https://doi.org/10.1016/j.jag.2022.102953.
  • Zhang, Z., Y. Fan, Z. Jiao, B. Fan, J. Zhou, and Z. Li. 2023. “Vegetation Ecological Benefits Index (VEBI): A 3D Spatial Model for Evaluating the Ecological Benefits of Vegetation.” International Journal of Digital Earth 16 (1): 1108–1123. https://doi.org/10.1080/17538947.2023.2192527.
  • Zhao, H., J. Jia, and V. Koltun. 2020. “Exploring Self-Attention for Image Recognition.” Paper presented at the Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, USA, June 2020.
  • Zhao, H., L. Jiang, J. Jia, P. H. Torr, and V. Koltun. 2021. “Point Transformer.” Paper presented at the International Conference on Computer Vision, Sanya, People’s Republic of China, February 2021.
  • Zhou, Q.-Y., and U. Neumann. 2008. “Fast and Extensible Building Modeling from Airborne LiDAR data.” Paper presented at the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Portsmouth, UK, January 2020.
  • Zou, X., M. Cheng, C. Wang, Y. Xia, and J. Li. 2017. “Tree Classification in Complex Forest Point Clouds Based on Deep Learning.” Geoscience and Remote Sensing Letters 14 (12). https://doi.org/10.1109/LGRS.2017.2764938.