Research Article

Full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation and mesh optimization for UAV images

Article: 2317441 | Received 15 Nov 2023, Accepted 06 Feb 2024, Published online: 16 Feb 2024

ABSTRACT

Fast and high-precision urban scene 3D modeling is the foundational data infrastructure for the digital earth and smart cities. However, owing to challenges such as water-area matching difficulties, data redundancy and insufficient observations, existing full-automatic 3D modeling methods often produce models with missing water-areas, many small holes and insufficient local accuracy. To overcome these challenges, a full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation on depth maps and mesh optimization is proposed. Firstly, SfM is used to calculate image poses and PatchMatch is used to generate initial depth maps. Secondly, a simplified GAN extracts water-area masks and ray tracing is used to complete water-area depth values with high precision. Thirdly, a fully connected CRF optimizes the water-areas and their surroundings in the depth maps. Fourthly, high-precision 3D point clouds are obtained through depth-map fusion based on clustering culling and depth least squares. Then, a mesh is generated and refined using similarity measures and vertex gradients. Finally, high-precision scene 3D models without missing water-areas or holes are generated. The results show that, compared with the state-of-the-art ContextCapture, the proposed method enhances model completeness by 14.3%, raises average accuracy by 14.5% and improves processing efficiency by 63.6%.

1. Introduction

Fast and high-precision urban scene three-dimensional (3D) modeling is the foundational data infrastructure for the digital earth and smart cities (Li et al. Citation2016; Guo et al. Citation2017; Craglia et al. Citation2012; Xiao et al. Citation2015). Moreover, the completeness of 3D models is also vital for the development of digital cities and real-scene 3D representations of China (Lu et al. Citation2014). In particular, full-automatic extraction and restoration of water-areas, together with optimization of 3D point clouds and 3D models, plays an important role in bringing 3D models to a high level of completeness and in improving the aesthetics and usability of urban scene 3D models (Mishra et al. Citation2023; Duan et al. Citation2021).

Against this background, scholars have persistently dedicated their efforts to methods for enhancing the completeness, accuracy and efficiency of 3D reconstruction models. Owing to the difficulty of matching water-areas and the lack of data redundancy and observations, existing 3D modeling methods often produce models in which water-areas are missing, contain holes or lack accuracy in local areas. Water-areas present unique challenges in 3D modeling due to their weak textures, high reflectivity and repetitive imagery patterns (Jiang, Jiang, and Guo Citation2022). Traditional 3D modeling methods often struggle to extract feature points and lines from such weakly textured unmanned aerial vehicle (UAV) images (Xiao et al. Citation2015); Figure 1 shows the missing portions of the 3D reconstruction results in the water-area. In regions with weak textures, few feature points are available for extraction, making matching difficult. Additionally, dense matching is susceptible to issues such as parallax and depth breaks. These limitations can result in modeling failures and holes in the final model. Existing methods still have the following deficiencies: (1) In terms of modeling completeness, accurate matching is difficult in weakly textured regions such as water-areas. The lack of texture makes it hard for the modeling algorithm to capture and accurately represent the detailed features of the object surface, which produces holes in the model and reduces overall completeness. (2) In terms of modeling accuracy, model holes are a major factor affecting local accuracy. These holes may originate from the omission of specific regions or details during the modeling process, leading to inaccuracies in those regions and degrading the overall modeling accuracy. (3) In terms of modeling efficiency, some modeling software or algorithms run slowly when dealing with large-scale or highly complex scenes, which affects modeling efficiency. Modelers may then need to spend more time on manual model editing, limiting the practical applicability of 3D modeling techniques. In general, existing full-automatic 3D modeling methods suffer from poor overall model completeness (missing water-areas, many model holes, etc.), insufficient local model accuracy and inefficient processing.

Figure 1. Missing water-areas in the scene 3D model.


To solve the above three issues, the following solutions are proposed. Firstly, the overall model completeness of the full-automatic 3D reconstruction method is improved by a simplified GAN for intelligent water-area extraction, ray-tracing-based water-area complementation on depth maps and fully connected CRF-based completion and optimization of the entire depth map. Secondly, the 3D model accuracy of the full-automatic 3D reconstruction method is improved by fully connected CRF-based optimization of the whole depth map, depth-map fusion and mesh optimization. Lastly, the modeling efficiency of the full-automatic 3D reconstruction method is improved by a simplified GAN that quickly obtains the water-area mask and by mesh optimization.

The main contributions of this work are as follows. (1) An intelligent water-area completion method based on deep learning networks and ray tracing on initial depth maps is proposed. For undistorted images, a simplified GAN algorithm is used to extract the water-area mask quickly. By combining the precise boundaries of the water-area with the high-precision image poses computed in this paper, the water-area depth values in the initial depth map are completed with high precision. This enhances model completeness by 14.3% compared with ContextCapture. (2) A depth map optimization method based on the combination of images and a CRF is introduced. A fully connected CRF is utilized to optimize the depth maps of the water-area and its surroundings. Subsequently, depth-map fusion is performed: a depth value clustering method based on connection points is used to eliminate depth values with large errors, followed by depth least squares to calculate high-precision 3D coordinates of the point cloud. This raises average model accuracy by 14.5% compared with ContextCapture. (3) An improved mesh optimization method based on similarity-measure normals and vertex-gradient optimization is designed. By optimizing with similarity measures and vertex gradients combined with image information, the model is refined. Compared with the state-of-the-art ContextCapture, the proposed method enhances model completeness by 14.3%, raises average accuracy by 14.5% and improves processing efficiency by 63.6%.

The rest of this paper is structured as follows. Section 2 reviews the previous literature, identifies the remaining scientific issues and presents the research ideas and significance of this paper. Section 3 describes the research methodology, namely the full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation and mesh optimization for UAV images. Section 4 compares method performance and discusses the results of the better models. Section 5 performs experimental analysis and cause analysis. Finally, Section 6 concludes the study and puts forward prospects for future studies.

2. Related work

UAV oblique photogrammetry is a cost-effective and flexible data acquisition method that provides new data sources with distinct advantages: detailed scene information in the images, multiple views from different angles and significantly varying image scales (Wang et al. Citation2023). The scientific community, industry and software developers are interested in using UAV oblique images to construct 3D models and fully exploit the advantages provided by this technology (Toschi et al. Citation2017; Jiang, Jiang, and Wang Citation2022). However, existing methods perform poorly on water-area reconstruction: because water-areas are weakly textured and fluid, 3D reconstruction there is not effective. Owing to challenges such as water-area matching difficulties, data redundancy and insufficient observations, existing full-automatic 3D modeling methods often produce models with missing water-areas, many small holes and insufficient local accuracy, and they are not high in model completeness, local accuracy or efficiency.

Traditional 3D modeling methods (Kerstein et al. Citation2011; Remondino et al. Citation2011; Gallup, Frahm, and Pollefeys Citation2010) are not well suited to modeling water-areas and other similar regions. Water-areas are characterized by weak textures, which makes it difficult to extract feature points and results in modeling failures with holes. Conventional 3D modeling methods usually struggle to extract feature points and lines from these weakly textured Unmanned Aerial Vehicle (UAV) images, leading to sparse reconstruction failures, difficulties in obtaining reliable initial values and a lack of feature information for subsequent dense matching. Holes in the model also appear due to occlusion, especially in the areas under bridges, reducing model completeness. In common dense matching algorithms (Jiang, Jiang, and Guo Citation2021), depth or disparity is propagated locally or semiglobally, making it difficult to establish matches in water-areas; alternatively, in depth consistency verification, these areas may be identified as invalid values or holes. Under-bridge occlusions also prevent some algorithms (Li et al. Citation2016) from modeling effectively, making the scene 3D model less complete. While traditional full-automatic 3D software such as ContextCapture (Zhang and Murat Maga Citation2023) excels in various environments, it faces reconstruction failures, particularly in extensive water-areas. These methods can therefore lead to poor model completeness and problems such as poor local accuracy.

To improve the completeness of scene 3D models, and especially to automate the restoration of water-areas in the model, the precise boundaries of water-areas can be extracted from images using machine learning methods and then used to repair and reconstruct the water-areas in the scene 3D model. Machine learning methods can extract water-area boundaries from high-resolution remote sensing images (Murali and Govindan Citation2013; Sabri et al. Citation2019). Semantic extraction assigns a class label to each pixel in an image to support advanced semantic understanding of the image (Liu and Fang Citation2020). In DeepLabV3+ (Li et al. Citation2019), a spatial pyramid pooling module is introduced between the encoder and decoder to leverage multiscale features. Some recent methods, including SharpMask (Pinheiro et al. Citation2016), U-Net (Ronneberger, Fischer, and Brox Citation2015) and RefineNet (Lin et al. Citation2017), embed hierarchical feature representations extracted from multiple layers of the encoder into the corresponding decoder for better extraction. However, these deep learning methods require a large number of training samples. Notably, augmented images are often synthetic, with nonshadow backgrounds similar to the original training images, and adversarial neural networks can support semisupervised learning for object extraction. Networks such as the Generative Adversarial Network (GAN) (Creswell et al. Citation2018; Goodfellow et al. Citation2020; Chen et al. Citation2020) can facilitate semisupervised learning, allowing us to use them easily and improve their generalization capabilities by training with a large amount of unlabeled water-area data. In deep learning for aerial image extraction, several challenges emerge, including data imbalance, labeling complexity, the intricacy of environmental features and limited dataset availability. This paper addresses these challenges by integrating satellite and UAV images for training: the network is trained on publicly available datasets and augmented with our own partially labeled datasets. This hybrid training methodology aims to enhance the accuracy and robustness of water-area extraction and to provide prior knowledge supporting model completeness.

In the field of UAV 3D modeling, the difficulty of matching weak textures such as water-area regions leads to low accuracy of the reconstructed model; omitting depth-map fusion leads to low accuracy of the 3D dense point cloud; and omitting mesh optimization leads to low accuracy of the mesh vertices, all of which lower the accuracy of the scene 3D model. In previous studies, (Li et al. Citation2018; Tian et al. Citation2019) mitigated the matching problem in weakly textured regions by improving feature extraction, but no segmentation information was introduced, so the resulting 3D model accuracy was still not high enough. (Fuhrmann and Goesele Citation2011) introduced a depth-map fusion technique to improve the accuracy of the dense point cloud, but neither the point cloud nor the depth map was optimized, so the remaining noisy points kept the accuracy of the fused point cloud low. (Kang et al. Citation2019; Park and Lee Citation2019) generated meshes by mesh-construction methods, but no mesh optimization was performed, resulting in poor accuracy of the mesh model. Consequently, the problem of poor local model accuracy still exists in scene 3D models.

In typical application scenarios, the precise representation of water-area information is often not critical and is therefore rarely emphasized. Instead, virtual digital water-area models are commonly used to represent water-areas in the 3D environment of a scene (Mo et al. Citation2018). While the overall visual effect of virtual scenes can be guaranteed, real information about the water-areas, such as water color, floating objects and the surrounding environment, cannot be visualized; traditional 3D water-area scenes cannot meet this demand. Software tools for manual editing in 3D modeling, such as DP-Modeler and SVS-Modeler (CitationYanmei et al.; Dou and Ding Citation2020), have been developed to assist in repairing the preliminary reconstruction results of water-areas in oblique 3D models. However, using these tools requires significant manual involvement and the entire process of water-area reconstruction is time-consuming. In the context of automated water-area repair, the method proposed in (Liu et al. Citation2022) utilizes the existing 3D point cloud in the water-area for repair: an eight-neighborhood traversal algorithm extracts the boundary points of the water-area and reconstruction is performed by mesh generation. However, such semi-automated water-area restoration remains time-consuming and struggles to meet the needs of automated modeling.

Despite the continuous development of 3D modeling technology, some problems remain to be solved. In general, existing full-automatic 3D modeling methods suffer from poor overall model completeness (missing water-areas, many model holes, etc.), insufficient local model accuracy and inefficient processing. (1) Model completeness is not high, mainly because weakly textured areas such as water-areas lead to matching difficulties and occlusion easily forms local holes (especially under bridges), resulting in model holes. (2) Local accuracy is not high: weak textures such as water-areas are difficult to match, which lowers the accuracy of the reconstructed model; the absence of depth-map fusion lowers the accuracy of the 3D dense point cloud; and the absence of mesh optimization lowers the accuracy of the mesh vertices. (3) Full-automatic scene 3D modeling is not efficient: many modeling algorithms run slowly when dealing with large-scale or highly complex scenes, which affects modeling efficiency. Therefore, this study proposes a full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation and mesh optimization for UAV images. The study analyzes the reasons why model holes appear across the full 3D pipeline and finds a suitable complementation scheme, then uses optimization schemes to improve the accuracy of the depth maps and the mesh model and finally realizes full-automatic, high-precision, rapid completion of the 3D model across the whole pipeline.

3. Method

Please refer to Figure 2 for the detailed implementation process of the full-automatic water-area intelligent complementation, which involves depth map complementation and mesh optimization.

Figure 2. Flowchart of the proposed Full-automatic High-precision Scene 3D Reconstruction Method (Never-Water-Restoration Method) with Water-area Intelligent Complementation and Mesh Optimization.


This study describes a method that uses images and their corresponding positions and orientations as inputs. In the first phase, image poses are determined using Structure from Motion (SfM) (Schonberger and Frahm Citation2016) and depth maps are obtained using the PatchMatch stereo matching algorithm (Bleyer, Rhemann, and Rother Citation2011) from COLMAP. In the second phase, for undistorted images, a simplified GAN algorithm is used to quickly extract the water-area mask; by combining the precise boundary of the water-area with the high-precision image pose through the high-precision water-area automatic completion method designed in this paper, the depth values for the water-area part of the initial depth map are completed. In the third phase, a Conditional Random Field (CRF) (Liu, Lin, and Shen Citation2015) is used to optimize and enhance the completed depth map with image information; the depth maps are then fused, a depth value clustering method based on connection points removes depth values with large errors and the least squares method computes high-precision three-dimensional coordinates for the point cloud. In the fourth phase, a mesh refinement process driven by similarity-measure normals and vertex-gradient optimization refines the water-area connection areas and a texture mapping algorithm (Li et al. Citation2020) is applied to generate the final water-area model.

3.1. High-precision position and orientation solving and initial depth map generation for UAV images

This paper involves a series of processes to analyze UAV images. It begins with SfM (Schonberger and Frahm Citation2016) to extract feature points, match them and estimate camera parameters and a 3D point cloud. Subsequently, the high-precision position and orientation of the UAV are rapidly computed. Following this, the camera parameters are applied to correct the original image, resulting in an undistorted image.

In this paper, we adopt an iterative approach to assess the similarity of undistorted images using the PatchMatch stereo (Bleyer, Rhemann, and Rother Citation2011) matching method. The primary concept behind this method is to progressively enhance depth estimation accuracy by matching blocks of pixels, thereby creating an initial depth map. This initial depth map serves as the foundation for subsequent 3D reconstruction processes.
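To make the block-matching idea concrete, the following minimal sketch (not the authors' implementation) scores how well two patches agree using normalized cross-correlation, the similarity measure typically used when evaluating PatchMatch depth/plane hypotheses; the function names and window size are illustrative assumptions.

```python
import numpy as np

def ncc(patch_a: np.ndarray, patch_b: np.ndarray, eps: float = 1e-8) -> float:
    """Normalized cross-correlation of two equally sized grayscale patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + eps
    return float((a * b).sum() / denom)

def patch(img: np.ndarray, u: int, v: int, half: int = 5) -> np.ndarray:
    """Extract a (2*half+1) x (2*half+1) window centred at column u, row v."""
    return img[v - half:v + half + 1, u - half:u + half + 1]

# Toy usage: a depth/plane hypothesis that warps the source patch onto the
# reference patch well yields an NCC score close to 1.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
src = ref + 0.01 * rng.random((64, 64))
print(ncc(patch(ref, 30, 30), patch(src, 30, 30)))
```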

3.2. Water-area intelligent completion on initial depth maps

3.2.1. Fast water-area mask extraction based on simplified GAN for undistorted images

Undistorted images are utilized as input for the extraction of water-area boundaries and the generation of a water-area mask using a faster Generative Adversarial Network (GAN). To enhance the performance of water-area extraction, a multitask mean-teacher model is employed for semisupervised water-area extraction, utilizing unlabeled data to gather multiple pieces of information about the water-area. Please refer to Figure 3 for an overview of the water-area extraction network's general construction.

Figure 3. General Flowchart of a Fast Water-area Extraction Network without Water-area Count Supervision.


Unlike labeled images, numerous unlabeled water-area images can be readily sourced from aerial datasets. Consequently, when training with a limited amount of labeled data, the utilization of additional unlabeled data becomes imperative to enhance water-area segmentation performance. The activation function employed in the framework aligns with the one used in (Chen et al. Citation2020), with the primary modification involving how the network is used. In this paper, the count-supervision branch is culled in order to speed up extraction; the notable changes are therefore the elimination of count extraction and adjustments to the loss function. Refer to Figure 4 for the detailed network structure of water-area extraction.

Figure 4. Schematic of Multi-Task Adversarial Neural Network for fast water-area extraction.


Labeled and unlabeled data are integrated using the average teacher semisupervised learning technique. The approach involves a multitask convolutional neural network (MT-CNN) that addresses two tasks: water-area extraction and water-area edge extraction. Both the student network and teacher network are implemented as MT-CNN. The predicted water-area information from the student and teacher networks is compared using a multitask consistency loss. This process leverages both labeled and unlabeled data to improve the performance of the network in detecting water-areas and edges.

An input water-area image is processed using a ResNeXt-101 (Pant, Yadav, and Gaur Citation2020) convolutional neural network to generate feature maps (EF1, EF2, EF3, EF4 and EF5) at various scales. By combining the shallowest (EF1) and deepest (EF5) features, a new feature map (DF1) is created for edge prediction. Four water-area maps are predicted using DF2, DF3, DF4 and DF5 and an additional four water-area maps are generated using RF2, RF3, RF4 and RF5. Ultimately, a water-area image is produced using the fine feature map $S_f$: (1) $S_f = \mathrm{Pred}\left(\sum_{k=2}^{5} RF_k\right)$. The prediction operator $\mathrm{Pred}$ consists of three 3 × 3 convolutional layers, a 1 × 1 convolutional layer and a sigmoid activation applied to the features.
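As an illustration of the prediction operator described above, the sketch below is a minimal PyTorch rendering of Pred(·) (three 3 × 3 convolutions, a 1 × 1 convolution and a sigmoid); the channel widths, the ReLU activations between convolutions and the class name PredHead are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PredHead(nn.Module):
    """Prediction head Pred(.): three 3x3 convs, a 1x1 conv and a sigmoid.
    Channel widths and the ReLUs are illustrative, not the paper's values."""
    def __init__(self, in_ch: int = 256, mid_ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, fused_rf: torch.Tensor) -> torch.Tensor:
        # fused_rf plays the role of the summed refinement features RF2..RF5.
        return self.body(fused_rf)

# S_f = Pred(sum of RF_k): sum the refinement feature maps, then predict.
rf_feats = [torch.randn(1, 256, 128, 128) for _ in range(4)]  # toy RF2..RF5
s_f = PredHead()(torch.stack(rf_feats).sum(dim=0))
```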

Different loss functions are adopted for tasks with labeled data: (2) $L_s(x) = L_{rs} + \alpha L_{es}$ where: (3) $L_{rs} = \sum_{j=1}^{9} \Phi_{BCE}(P_r^{(j)}, G_r)$ and (4) $L_{es} = \Phi_{BCE}(P_e, G_e)$, where $P_r^{(j)}$ represents one of the nine predicted water-area maps, $P_e$ is the predicted water-area edge map and $G_r$ and $G_e$ are the corresponding ground-truth region and edge maps. $\Phi_{BCE}$ and $\Phi_{MSE}$ are the binary cross-entropy and mean squared error (MSE) loss functions, respectively. During network training, the weight is empirically set to $\alpha = 10$.

For unlabeled data, both the student and teacher networks are utilized to obtain results for the two tasks: nine water-area region maps (denoted $S_r^1$ to $S_r^9$) and a water-area edge map (denoted $S_e$).

Multitask learning and semisupervised self-integration models are applied to water-area extraction. The total consistency loss of the network is: (5) $L_c(y) = L_{rc} + L_{ec}$ where: (6) $L_{rc} = \sum_{j=1}^{9} \Phi_{MSE}(S_r^j, T_r^j)$ and (7) $L_{ec} = \Phi_{MSE}(S_e, T_e)$. The overall loss function is: (8) $L_{total} = \sum_{i=1}^{N} L_s(x_i) + \lambda \sum_{j=1}^{M} L_c(y_j)$ where N and M are the numbers of labeled and unlabeled images in the training set, respectively.
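The loss terms of Eqs. (2)-(8) can be sketched as follows; this is a hedged illustration rather than the authors' code, and the default value of λ as well as the function names are assumptions.

```python
import torch.nn.functional as F

def supervised_loss(pred_regions, pred_edge, gt_region, gt_edge, alpha=10.0):
    """L_s = L_rs + alpha * L_es (Eqs. 2-4): BCE over the nine predicted
    water-area maps plus weighted BCE over the predicted edge map."""
    l_rs = sum(F.binary_cross_entropy(p, gt_region) for p in pred_regions)
    l_es = F.binary_cross_entropy(pred_edge, gt_edge)
    return l_rs + alpha * l_es

def consistency_loss(student_regions, student_edge, teacher_regions, teacher_edge):
    """L_c = L_rc + L_ec (Eqs. 5-7): MSE between student and teacher outputs."""
    l_rc = sum(F.mse_loss(s, t) for s, t in zip(student_regions, teacher_regions))
    l_ec = F.mse_loss(student_edge, teacher_edge)
    return l_rc + l_ec

def total_loss(sup_terms, cons_terms, lam=1.0):
    """L_total = sum_i L_s(x_i) + lambda * sum_j L_c(y_j) (Eq. 8);
    lam is an assumed balancing weight."""
    return sum(sup_terms) + lam * sum(cons_terms)
```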

During the training process, the objective is to minimize the total loss $L_{total}$ to train the student network. Furthermore, the parameters of the teacher network are updated at each training step using an exponential moving average (EMA) strategy. The teacher network parameters at training iteration t are: (9) $\theta'_t = \eta\,\theta'_{t-1} + (1-\eta)\,\theta_t$ where $\theta_t$ represents the student network parameters at iteration t and $\theta'_t$ the teacher network parameters. The EMA decay $\eta$ is empirically set to 0.99.
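A minimal sketch of the EMA teacher update of Eq. (9), assuming both networks are standard PyTorch modules with identically ordered parameters:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, eta: float = 0.99):
    """theta'_t = eta * theta'_{t-1} + (1 - eta) * theta_t  (Eq. 9)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(eta).add_(s_param, alpha=1.0 - eta)
```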

3.2.2. High-precision auto-completion of water-area depth values on initial depth maps

With the extraction results and dense depth maps, geometric consistency and ray tracing are employed to accurately restore the depth map. Figure 5 illustrates the ray-tracing principle used for depth map completion after the segmentation results identify the water-area boundary. The process entails several steps: extracting the water-area boundary on the initial depth map, using the high-precision position and attitude data together with the intelligently extracted mask to obtain high-precision elevation values for the water-area boundary through projection and triangular similarity calculations, fitting a high-precision horizontal surface within the water-area region and finally generating complete depth maps from the computed values to achieve accurate depth measurements for the water-areas.

  1. The extracted mask is projected onto the initial depth map to obtain the precise boundary of the water-area on the initial depth map.

  2. From the undistorted image, the depth values of the water-area boundary are projected into the object coordinate system, yielding a series of object-space elevation values for the water-area. The inputs consist of pixel coordinates $(u_{frontier}, v_{frontier})$ paired with their respective depth values $d_{frontier}$, as well as the camera's interior and exterior orientation parameters $(K, R, t)$. The outputs are the 3D point coordinates $(X_{frontier}, Y_{frontier}, Z_{frontier})$ corresponding to those pixel coordinates in object space. The intrinsic matrix is: (10) $K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$ where $u_0, v_0$ are the coordinates of the image principal point and $f_x, f_y$ are the principal distances (focal lengths) along the two image axes.

    The pixel coordinate system is transformed into the object-space coordinate system: (11) $\begin{bmatrix} X_{frontier} \\ Y_{frontier} \\ Z_{frontier} \\ 1 \end{bmatrix} = \begin{bmatrix} KR & Kt \\ 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} d_{frontier}\,u_{frontier} \\ d_{frontier}\,v_{frontier} \\ d_{frontier} \\ 1 \end{bmatrix}$

  3. The precise elevation values of the water-area boundary are fitted to obtain a high-precision water-area plane on the object side, denoted $Z_{waterarea}$. The average elevation $H_{average}$ is calculated as: (12) $H_{average} = \frac{\sum_{i=1}^{n} Z_{frontier,i}}{n}$ and (13) $Z_{waterarea} = H_{average}$ where n is the number of boundary points.

  4. Solve for the scale factor $\lambda$ between the water-area pixel and the object-space plane, and for the three-dimensional coordinates $(X_{waterarea}, Y_{waterarea}, Z_{waterarea})$ of the water-area point: (14) $\begin{bmatrix} X_{waterarea} \\ Y_{waterarea} \\ Z_{waterarea} \end{bmatrix} = \lambda R \begin{bmatrix} u_{waterarea} - u_0 \\ v_{waterarea} - v_0 \\ f \end{bmatrix} + t$

The equation for the camera ray V through the water-area pixel $(u_{waterarea}, v_{waterarea})$ is: (15) $V = R \begin{bmatrix} u_{waterarea} - u_0 \\ v_{waterarea} - v_0 \\ f \end{bmatrix}$ and the solution is: (16) $\begin{cases} \lambda = (Z_{waterarea} - t[2]) / V[2] \\ X_{waterarea} = V[0]\,\lambda + t[0] \\ Y_{waterarea} = V[1]\,\lambda + t[1] \\ Z_{waterarea} = H_{average} \end{cases}$

Figure 5. Ray tracing principles for water-areas with incomplete depths.

  5. To determine the water-area depth $d_{waterarea}$ from the 3D intersection coordinates, the inputs are the intersection coordinates $(X_{waterarea}, Y_{waterarea}, Z_{waterarea})$ and the camera's interior and exterior orientation parameters $(K, R, t)$: (17) $d_{waterarea}\begin{bmatrix} u_{waterarea} \\ v_{waterarea} \\ 1 \end{bmatrix} = K\,[R \mid t]\begin{bmatrix} X_{waterarea} \\ Y_{waterarea} \\ Z_{waterarea} \\ 1 \end{bmatrix}$ where $u_{waterarea}$ and $v_{waterarea}$ are the pixel coordinates of the water-area point. The result $d_{waterarea}$ is the depth value at the current pixel coordinates.

By following these steps, the exact boundary of the water-area in the initial depth map can be derived and all depth values within the water-area are accurately complemented. This process enables the generation of a more complete depth map for each initial depth map.
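The following numpy sketch illustrates steps 1-5 under the pinhole convention of Eqs. (10), (11) and (17), with the ray-plane intersection of Eqs. (14)-(16) expressed in the same convention. Treating every masked water pixel that already has a depth as a boundary observation and fitting the plane by a simple mean are simplifying assumptions, and all function names are illustrative.

```python
import numpy as np

def pixel_to_object(u, v, d, K, R, t):
    """Eq. (11): back-project pixel (u, v) with depth d into object space,
    assuming the projection model d * [u, v, 1]^T = K (R X + t)."""
    xyz_cam = np.linalg.inv(K) @ (d * np.array([u, v, 1.0]))
    return R.T @ (xyz_cam - t)

def complete_water_depths(mask, depth, K, R, t):
    """Fill the water-area pixels of `depth` following Eqs. (12)-(17).
    `mask` is the binary water mask; masked pixels that already carry a
    valid depth are treated as boundary observations (a simplification)."""
    vs, us = np.nonzero(mask)
    has_depth = depth[vs, us] > 0
    boundary = [pixel_to_object(u, v, depth[v, u], K, R, t)
                for u, v in zip(us[has_depth], vs[has_depth])]
    z_water = np.mean([p[2] for p in boundary])        # Eqs. (12)-(13)

    out = depth.copy()
    cam_center = -R.T @ t                              # camera centre in object space
    K_inv = np.linalg.inv(K)
    for u, v in zip(us[~has_depth], vs[~has_depth]):
        ray = R.T @ K_inv @ np.array([u, v, 1.0])      # viewing ray direction
        lam = (z_water - cam_center[2]) / ray[2]       # Eq. (16): hit the water plane
        X = cam_center + lam * ray                     # intersection point
        out[v, u] = (R @ X + t)[2]                     # Eq. (17): last row of K gives depth
    return out

# Toy usage: a 4x4 depth map with one missing water pixel.
K = np.array([[1000.0, 0, 2.0], [0, 1000.0, 2.0], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])
depth = np.full((4, 4), 50.0)
mask = np.zeros((4, 4), bool)
mask[1:3, 1:3] = True
depth[2, 2] = 0.0                                      # missing water depth
print(complete_water_depths(mask, depth, K, R, t)[2, 2])   # ~50.0
```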

3.3. Optimization and fusion of depth maps

3.3.1. Depth-map optimization with fully connected conditional random fields (CRF) for the water-area and its surroundings

The complemented depth map and the corresponding undistorted image serve as inputs for the optimization process via the CRF. The depth map optimization method based on the fully connected CRF model considers the relationship between the depth map and the color image.

To apply a CRF for depth map optimization, defining an appropriate energy function is crucial. The energy function consists of two components: the data term and the smoothness term. By incorporating the CRF and image information, the energy function can be formulated to guide the optimization process and achieve the desired depth map results: (18) $E(D) = E_{data}(D) + E_{smooth}(D)$ where $E(D)$ represents the entire energy function and D represents the depth map.

The data term $E_{data}$ measures the consistency between the depth map and the input image data. It is defined as: (19) $E_{data}(D) = \sum_i \omega_{data}(i)\,\varphi_{data}(i, D)$ where i is the pixel index, $\omega_{data}(i)$ is the weight of pixel i and $\varphi_{data}(i, D)$ is the potential function of the data term. The potential function can be modeled from the observation data (such as image intensities) and the depth map. The data term employed here is the pixel's reprojection error, defined as the difference between the brightness value of the pixel in the input image and the brightness value obtained by reprojection with the corresponding depth value. A potential function of the following form is utilized: (20) $\varphi_{data}(i, D) = \lvert I(i) - I_{reproj}(i, D(i)) \rvert$ where $I(i)$ is the brightness value of pixel i in the input image and $I_{reproj}(i, D(i))$ is the reprojected brightness value of pixel i given its depth value $D(i)$.

The smoothness term $E_{smooth}$ accounts for the spatial continuity of the water-area depth map so that depth changes smoothly across the depth map. It is defined as: (21) $E_{smooth}(D) = \sum_{i,j} \omega_{smooth}(i, j)\,\varphi_{smooth}(D(i), D(j))$ where $i, j$ index adjacent pixel pairs, $\omega_{smooth}(i, j)$ is the weight of the pixel pair and $\varphi_{smooth}(D(i), D(j))$ is the potential function of the smoothness term.

To address the color map texture mapping challenge, a fully connected CRF model is employed. The method offers several advantages compared to existing approaches. It achieves accurate and clear geometric structure restoration of the depth map while effectively solving the issue of redundant textures.
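As a rough illustration of the energy of Eqs. (18)-(21), the sketch below evaluates a data term plus a smoothness term for a candidate depth map; it uses only 4-neighbour pairs and uniform weights instead of the fully connected pairwise term, so it is a simplified stand-in rather than the CRF inference actually used.

```python
import numpy as np

def depth_energy(depth, image, reproj_image, w_data=1.0, w_smooth=0.5):
    """Simplified Eq. (18): E(D) = E_data(D) + E_smooth(D).
    E_data follows Eq. (20) (photometric residual); E_smooth penalises depth
    differences, here only over 4-neighbours with a uniform weight."""
    e_data = w_data * np.abs(image - reproj_image).sum()     # Eqs. (19)-(20)
    dz_x = np.abs(np.diff(depth, axis=1))                    # horizontal pairs
    dz_y = np.abs(np.diff(depth, axis=0))                    # vertical pairs
    e_smooth = w_smooth * (dz_x.sum() + dz_y.sum())          # truncated Eq. (21)
    return e_data + e_smooth

# Toy usage: a smoother completed depth map yields a lower energy.
depth = np.random.rand(32, 32)
img = np.random.rand(32, 32)
print(depth_energy(depth, img, img))   # zero data term, smoothness term only
```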

3.3.2. Depth-map fusion with cluster culling and depth least squares

This process begins by clustering the depth values and removing outliers through the selection of matching points with the same name, reducing depth estimation errors. The depth values are then refined through depth least squares fitting and combined with image coordinates to produce a high-precision point cloud that represents the 3D positions of the object's surface. This process is executed repeatedly over the entire survey area, leading to a high-precision dense point cloud. Figure 6 illustrates the depth-map fusion used to obtain the dense point cloud: the red points represent gross errors, the green points are valid elevation values and the blue points are the optimal elevation values after fusion. The elevation map of each viewpoint is fused to obtain a high-precision dense point cloud for the survey area.

Figure 6. The principle of depth map fusion is employed to generate dense point clouds.


Connection point selection: Selection is performed using a high-precision sparse point cloud in SfM to select matching points with the same name from multiple images. These matching points have the same location between different images and are usually obtained by feature matching algorithms.

Depth value clustering: the depth values of the selected connection points are clustered and analyzed. This can be achieved by grouping the depth values into different clusters, where each cluster represents points with similar depth.

Error Rejection: Reject those points whose depth values deviate far from the center of the clusters. These points with large deviations may be due to depth estimation errors caused by mis-matching, occlusion or noise. By eliminating the outlier points, the accuracy of depth estimation can be improved.

Depth least squares: the depth values remaining after error rejection are fitted using depth least squares or another fitting algorithm. The goal of the fitting is to find the best depth estimate that minimizes the depth estimation error.

Generate point cloud: The calculated high-precision depth values are combined with the image coordinates of the connected points to generate high-precision point cloud data. The coordinates of each point cloud point indicate the 3D position on the object surface.

Dense point cloud generation: Repeat the above steps on the entire survey area to obtain a high-precision dense point cloud for the entire survey area.
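A minimal sketch of the per-tie-point fusion logic described above (clustering, gross-error rejection, least-squares estimate); the median-based clustering and the relative tolerance are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def fuse_tie_point_depths(depths, tol=0.05):
    """Fuse the candidate depths of one tie point observed in several images:
    (1) cluster around a robust centre, (2) reject gross errors,
    (3) least-squares estimate = mean of the surviving depths."""
    d = np.asarray(depths, dtype=float)
    center = np.median(d)                              # robust cluster centre
    inliers = d[np.abs(d - center) <= tol * center]    # cull outliers
    if inliers.size == 0:
        return None                                    # no consistent observation
    return inliers.mean()                              # least-squares depth estimate

# Toy usage: three consistent observations and one mismatch.
print(fuse_tie_point_depths([52.1, 52.3, 52.2, 75.0]))  # ~52.2, the 75.0 outlier is rejected
```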

3.4. Mesh generation and optimization

3.4.1. Generate Delaunay 3D triangles (mesh)

Starting from the fused point cloud, a classical surface reconstruction method based on Delaunay triangulation (Jancosek and Pajdla Citation2011) is used to mesh the point cloud. It includes (1) Delaunay tetrahedralization, (2) mesh edge weighting and (3) graph-cut mesh construction. (1) Delaunay tetrahedralization: construct a sufficiently large tetrahedron enclosing the point cloud, encompassing all points to be subdivided; then insert the points of the point set into the tetrahedral mesh one by one, updating the mesh with each insertion to maintain the Delaunay property. (2) Mesh edge weighting: assign weights to the edges based on the tracing of camera rays and the characteristics of the edges; weights can represent the importance, stability or other geometric properties of the edges. (3) Graph-cut mesh construction: first, construct a graph structure by transforming the weighted tetrahedral mesh into a graph, where nodes represent the vertices of the tetrahedra and edges represent the edges of the tetrahedra; then execute the graph-cut algorithm to optimize the mesh, aiming to minimize the overall energy function; finally, based on the graph-cut result, retain or remove certain tetrahedra or their faces, thereby optimizing the overall quality of the mesh. Figure 7 illustrates the principle by which the mesh is built from the dense point cloud.
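As a small illustration of step (1), the snippet below tetrahedralizes a point cloud with SciPy; the edge-weighting and graph-cut steps (2)-(3) are only indicated in comments, since they depend on the visibility rays and a max-flow solver.

```python
import numpy as np
from scipy.spatial import Delaunay

# Step (1): Delaunay tetrahedralization of the fused point cloud.
# Steps (2)-(3) (ray-based edge weighting and graph-cut surface extraction)
# are omitted here; a graph-cut solver would operate on these cells.
points = np.random.rand(200, 3)          # stand-in for the dense point cloud
tetra = Delaunay(points)                 # 3-D Delaunay -> tetrahedral cells
print(tetra.simplices.shape)             # (n_tetrahedra, 4) vertex indices
```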

Figure 7. The algorithm for generating the mesh passes through the point cloud.


3.4.2. Mesh refinement based on similarity measure normal and vertex gradient optimization

The initial mesh is optimized to obtain the final refined mesh. The essence of 3D model surface mesh optimization is to use the grayscale information of the image: the gradient of the grayscale with respect to the vertex coordinates is computed and the vertices are moved along the direction of the normal vectors of the triangulated surface to obtain optimal vertex positions. As shown in Figure 8, the method uses triangular mesh optimization driven by similarity-measure normals and vertex-gradient optimization. The normalized cross-correlation (NCC) (Barnes et al. Citation2009) coefficient and gradient change values are computed for object-space points; based on their barycentric coordinates, weights are assigned to the vertices of the triangular surface and the vertices are then adjusted in the direction of the normal vector for optimization.

Figure 8. The principle of mesh refinement using similarity measure normal and vertex gradient optimization.


Figure 9 illustrates the process of improving the 3D reconstruction by reprojecting the water-area image onto other visible images via the object-space triangular patch. The object-space vertex coordinates are updated using gradient descent based on the normalized cross-correlation coefficient of the images, which yields the gradient change value. The vertices of the object-space triangulation are adjusted in the direction of the normal vector according to this change value, minimizing the matching cost of the region. This optimization of the triangular mesh model enhances the accuracy of the 3D reconstruction.

Figure 9. Showcasing details of vertex optimization in the mesh.


The algorithm presented in this paper consists of two primary components. The first part calculates the gradient change values of the object-space vertices. The second part constructs an energy function and solves it. The gradient change values of the object-space vertices are derived by considering the weighted barycentric coordinates. Minimizing the matching cost guides the movement of the triangular mesh vertices in the direction of the normal vector, minimizing the energy function and optimizing the triangular mesh.

  1. Solving for vertex change values

To achieve image-information-driven triangular mesh optimization, we calculate the inter-image normalized correlation coefficient for the object-space vertices, which provides the gradient change value for each vertex. The calculation algorithm for obtaining the gradient change value of an object-space vertex is as follows:

All the object points within the triangular face contribute gradient change values according to their barycentric coordinates. These values are used to assign weights to the vertices of the triangular face, facilitating the adjustment of the object-space vertices in the direction of the normal vector to achieve optimization.

  2. Constructing the energy function

An energy function is used to drive the optimization of the initial triangular mesh. The energy function consists of two parts: the data term $E_{data}$ and the smoothness term $E_{smooth}$: (22) $E(S) = E_{data}(S) + \omega E_{smooth}(S)$. The data term is (23) $E_{data}(S) = \sum_n E_{data}(X) = \int_{\Omega_S} h(I_A, I_{AB})(x)\,\mathrm{d}x$. The smoothness term $E_{smooth}$ ensures the regular shape and smooth surface of the triangular faces; the first-order Laplacian $\tilde{\Delta}V_{Laplacian}$ and the second-order Laplacian $\tilde{\Delta}(\tilde{\Delta}V_{Laplacian})$ are used as smoothing terms: (24) $E_{smooth} = w_1\,\tilde{\Delta}(\tilde{\Delta}V_{Laplacian}) - w_2\,\tilde{\Delta}V_{Laplacian}$. The purpose is to ensure that the shape of the triangular faces is regular and the surface is smooth, while avoiding local minima of the energy function. $w_1 = 0.18$ and $w_2 = 0.02$ denote the soft weight and the rigid weight, respectively, in the smoothing term. When the energy function is minimized, the triangular mesh is the optimal solution.

The approach utilizes various data items to form the energy function, including the correlation coefficient between feature points, image gradient information and projection geometric information. Laplace smoothing serves as the smoothing item, driving the movement of object space triangulation network vertices toward the normal vector direction to minimize the matching cost. By minimizing the energy function, the positions of the object space vertices are optimized, enabling grid optimization and reducing the error between the triangular mesh and the actual surface. This optimization process enhances the accuracy of 3D reconstruction results for water-shore connection areas.
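The sketch below shows one plausible gradient-descent sweep combining the photometric term and the Laplacian smoothing of Eq. (24); the step size, the sign conventions and the umbrella-operator Laplacian are illustrative assumptions, with w1 = 0.18 and w2 = 0.02 taken from the text.

```python
import numpy as np

def refine_vertices(verts, normals, photo_grad, neighbors, step=0.1,
                    w_soft=0.18, w_rigid=0.02):
    """One gradient-descent sweep of the mesh refinement idea in Section 3.4.2:
    move each vertex along its normal against the photometric gradient, then
    apply first- and second-order (umbrella) Laplacian smoothing."""
    # Data-driven move along the normal (sign and step size are illustrative).
    new_verts = verts - step * photo_grad[:, None] * normals

    # Umbrella Laplacian: average of the neighbours minus the vertex itself.
    lap = np.zeros_like(new_verts)
    for i, nbrs in enumerate(neighbors):
        lap[i] = new_verts[nbrs].mean(axis=0) - new_verts[i]
    lap2 = np.zeros_like(new_verts)
    for i, nbrs in enumerate(neighbors):
        lap2[i] = lap[nbrs].mean(axis=0) - lap[i]

    # Smoothing step mixing the bi-Laplacian (soft) and Laplacian (rigid) terms.
    return new_verts + w_soft * lap2 + w_rigid * lap

# Toy usage on a degenerate 3-vertex "mesh".
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
normals = np.tile([0., 0., 1.], (3, 1))
grads = np.array([0.2, -0.1, 0.0])
nbrs = [[1, 2], [0, 2], [0, 1]]
print(refine_vertices(verts, normals, grads, nbrs))
```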

3.5. 3D texture mapping based on the fully automatic plane segmentation

This section details the steps involved in texture mapping the optimized 3D mesh model generated from UAV images (Li et al. Citation2020). These steps encompass tasks such as creating texture coordinates, applying the texture image to the model, making any required texture adjustments and culminating in rendering and exporting the 3D model with texture mapping.

4. Experiments

4.1. Dataset and evaluation indicators

4.1.1. Training parameters

To mitigate the risk of overfitting and expedite the training process, a pretraining technique is employed for the parameters of MT-CNN. Specifically, the parameters of the student network are initialized using ResNeXt (Pant, Yadav, and Gaur Citation2020), a well-trained model for image classification on the ImageNet dataset. The remaining parameters of the MT-CNN are initialized randomly. Stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 0.0005 is used for optimizing the entire network, with a total of 10,000 iterations. The learning rate is controlled using a polynomial strategy, starting at 0.005 with a power of 0.9. All images, including labeled and unlabeled ones, are resized to 1024 × 1024 to train the network on a single NVIDIA TITAN Xp GPU.
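A minimal sketch of the optimisation settings just described (SGD with momentum 0.9, weight decay 0.0005, 10,000 iterations and a polynomial learning-rate policy with base rate 0.005 and power 0.9); the model and data pipeline are placeholders, not the MT-CNN itself.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)             # stand-in for the MT-CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
max_iters, power = 10_000, 0.9

def poly_lr(base_lr, it, max_it, power):
    """Polynomial learning-rate policy: lr = base_lr * (1 - it/max_it)^power."""
    return base_lr * (1.0 - it / max_it) ** power

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(0.005, it, max_iters, power)
    # ... forward pass on a 1024x1024 batch, loss.backward(), optimizer.step() ...
    break  # placeholder: stop immediately in this sketch
```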

4.1.2. Experimental dataset

Training and test data: the model for the experiments presented in this paper is trained using a combination of remotely sensed satellite imagery from the remote sensing water-area dataset (https://doi.org/10.6084/m9.figshare.24559744) and an aerial dataset; annotation is performed on a smaller number of images containing water-areas, which greatly reduces labor costs. All images in the small labeled batches are annotated with the LabelMe tool, with the target extraction class defined as the water-area. Water-area labeling does not require a very precise coastline, because the 3D reconstruction near the coastline is already good and it is only necessary to roughly mark the outline of the water-areas. Many images containing unlabeled water-areas are also prepared, in accordance with the semi-supervised learning characteristics of the network. Ultimately, 50,175 water-area images with dimensions of 1024*1024 pixels are used for offline training of the aforementioned network.

The UAV photogrammetry data used in this paper are as follows: the first set is urban water-area data, the second set is multi-water-area terrace data and the third set is airborne laser data with supporting aerial image data. The detailed parameters are listed in Table 1. The experimental environment consists of a CPU: i7-9700K (3.60 GHz) and GPU: NVIDIA TITAN Xp.

Table 1. Aerial photogrammetry data used in this paper.

As shown in Figure 10, the validation data for the third set of images are laser scanning data with a resolution of 0.02338 m.

Figure 10. Laser scanning of water-area data.


4.1.3. Evaluation indicators

In this paper, the mean Intersection over Union (mIoU) is used to evaluate the quantitative results of water-area identification. It is the average of the intersection-over-union ratios of each class in the dataset: (25) $mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ where $p_{ij}$ denotes the number of pixels of true class i that are predicted as class j, $k+1$ is the number of categories (including the empty category), $p_{ii}$ is the number of correctly predicted pixels and $p_{ij}$ and $p_{ji}$ denote false positives and false negatives, respectively.
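For reference, a small sketch of Eq. (25) computed from a confusion matrix; the toy label maps are illustrative.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Eq. (25): mIoU from the confusion matrix p, where p[i, j] counts pixels
    of true class i predicted as class j."""
    p = np.zeros((num_classes, num_classes), dtype=np.int64)
    for i in range(num_classes):
        for j in range(num_classes):
            p[i, j] = np.sum((gt == i) & (pred == j))
    ious = []
    for i in range(num_classes):
        denom = p[i, :].sum() + p[:, i].sum() - p[i, i]
        ious.append(p[i, i] / denom if denom > 0 else 0.0)
    return float(np.mean(ious))

# Toy usage on a 2-class (background / water) label map.
gt = np.array([[0, 0, 1], [1, 1, 1]])
pred = np.array([[0, 1, 1], [1, 1, 0]])
print(mean_iou(pred, gt, num_classes=2))
```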

The accuracy of the model in this paper is evaluated with the Root Mean Square Error (RMSE): (26) $RMSE(X) = \sqrt{\frac{\sum_{i=1}^{m}\left(X_{model,i} - X_{true,i}\right)^{2}}{m}}$ where $X_{model,i}$ indicates the elevation value of the control point on the model and $X_{true,i}$ indicates the true elevation value of the control point.

The model completeness evaluation formula $S_{completeness}$ in this paper is: (27) $S_{completeness} = \frac{A_{model} - A_{missing\,waterareas}}{A_{model}} \times 100\%$ where $A_{model}$ is the total area of the model and $A_{missing\,waterareas}$ is the missing area of the water-area reconstruction.

The efficiency evaluation formula $E_{modeling}$ in this paper is: (28) $E_{modeling} = \frac{T_{A\_modeling} - T_{B\_modeling}}{T_{A\_modeling}} \times 100\%$ where $T_{A\_modeling}$ is the modeling time of method A and $T_{B\_modeling}$ is the modeling time of method B.
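The three metrics of Eqs. (26)-(28) can be computed as in the following sketch; the usage line reuses the Urban-water-area-1 modeling times reported later in Table 3 purely as an example.

```python
import numpy as np

def rmse(model_vals, true_vals):
    """Eq. (26): root mean square error between model and reference elevations."""
    model_vals, true_vals = np.asarray(model_vals), np.asarray(true_vals)
    return float(np.sqrt(np.mean((model_vals - true_vals) ** 2)))

def completeness(area_model, area_missing_water):
    """Eq. (27): completeness as the non-missing fraction of the model area (%)."""
    return (area_model - area_missing_water) / area_model * 100.0

def efficiency_gain(time_a, time_b):
    """Eq. (28): relative time saving of method B over method A (%)."""
    return (time_a - time_b) / time_a * 100.0

# Toy usage with illustrative elevations/areas and the Table 3 times 8.65 h vs 3.38 h.
print(rmse([10.2, 10.4], [10.0, 10.5]),
      completeness(100.0, 3.0),
      efficiency_gain(8.65, 3.38))
```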

4.2. Water-area extraction experiments

4.2.1. Comparison of extraction effects

The test dataset considered in this study comes from a UAV-borne five-lens oblique aerial camera with an image size of 6000*4000 pixels, which was used to acquire image data from different viewpoints. The water-area extraction results are obtained by segmenting each frame of the test dataset with the semantic extraction model trained on the network described above. This work compares and analyzes the method presented in this paper with the following encoder-decoder (codec) networks in terms of extraction results and efficiency, as shown in Figure 11.

Figure 11. Comparison of the water-area extraction effect.


The images depicted in Figure 11 illustrate the extraction results of the various methods for different viewpoints. Certain encoder-decoder networks, such as SegNet and U-Net, have been observed to misidentify shadow edges as water-areas during image extraction, and they may also erroneously identify other shoreline features as water. In contrast, the network employed in this paper exhibits greater accuracy in water-area segmentation and produces clearer and more complete contour edges, owing to its adaptability.

4.2.2. Comparison of extraction efficiency

The extraction rates and mIoU values of the method in this study and several other networks for water-area extraction are shown in Table 2.

Table 2. Comparison of the extraction rate and accuracy of the water-area.

Table 2 indicates the speed and accuracy of the different methods. Specifically, for SegNet, UNet, DeepLabV3+, MTMT and the proposed method, the extraction times are 4.07, 3.74, 2.08, 2.56 and 1.43 s, respectively, and the Mean Intersection over Union (MIoU) values are 0.779, 0.801, 0.822, 0.841 and 0.835. In practical work, the efficiency of DeepLabV3+ was also compared with this paper's method on the first set of data: the proposed method took 5 min 46 s, while DeepLabV3+ took 10 min 44 s. In the real-time test, the time taken for water-area extraction of shore images using this method is only about one-half of the time needed by the encoder-decoder networks when the image size is 6000*4000 pixels. These results show that the real-time performance is greatly improved by this method; the method employed in this paper enhances extraction efficiency by 44.1%.

4.3. Depth map completion and optimization experiments

4.3.1. Depth map completion experiments

The water-area depth map is complemented and expanded according to the extraction results and the ray tracing algorithm, as shown in Figure 12.

Figure 12. Depth map complementation results: (a) and (e) are original images, (b) and (f) are segmentation masks, (c) and (g) are original depth maps, and (d) and (h) are the completed depth maps.


The visualization results demonstrate that the ray-tracing-based depth value calculation effectively fills in the missing portions of the water-area. Combining the segmented image also gives a clear picture of where the depth map has been complemented. This method proves successful in accurately estimating depth values for areas where information is lacking or incomplete.

4.3.2. Depth map optimization experiments

The depth map obtained after complementation may still exhibit issues such as edge mismatch, missing depth information and resulting holes. Traditional optimization algorithms often suffer from poor real-time performance. The proposed method addresses these challenges by introducing CRF depth map optimization based on image fusion. Figure 13 compares the experimental results before and after depth map optimization.

Figure 13. Depth map optimization results: (a) and (c) are the complemented depth maps; (b) and (d) are the optimized depth maps.


By employing CRF based optimization, the method effectively addresses the edge mismatch problem and fills in missing depth information, thereby reducing the presence of holes in the depth map. Furthermore, the proposed approach considers both real-time performance and visual quality, ensuring a balance between efficiency and accuracy during the depth map optimization process.

4.4. 3D water-area mesh optimization experiments

Please refer to Figure 14 for the experimental results of mesh refinement based on similarity-measure normals and vertex-gradient optimization.

Figure 14. Mesh refinement results based on similarity measure normal and vertex gradient optimization: (a) and (d) show the original mesh, (b) and (e) show the optimized mesh, and (c) and (f) show the 3D model after texture mapping.


The proposed optimization method offers significant benefits in terms of accurately reconstructing topological information and water-area boundary information, as shown in Figure 14. By incorporating smoothing terms into the algorithm, the object surface is constrained effectively, resulting in more comprehensive repairs of weakly textured or blurred areas. The accuracy of the method is demonstrated in Tables 4 and 5, providing quantitative evidence of its effectiveness in achieving precise reconstruction.

4.5. Full-automatic scene 3D reconstruction experiments

Figure 15 displays the rendered water-area elevation and demonstrates the 3D model texture, the white (untextured) model and the mesh structure derived from the experimental results.

Figure 15. Water-area elevation rendering and 3D model: (a) is a rendering of the interior elevation of a single tile block, where the red circled area is the water-area elevation point; (b) is the scene 3D model with the water-area restoration result; (c) is the water-area white model/grayscale model; and (d) is the triangular mesh model.


After the mesh is refined based on similarity-measure normals and vertex-gradient optimization, the model shown in Figure 15 is obtained, where the elevation is rendered to obtain panel (a). The water-area elevation is relatively flat, the water-area mesh is relatively natural and the optimized triangular mesh is convenient for the corresponding texture mapping.

4.5.1. Comparison of 3D modeling effect

For three sets of typical experimental data of urban and rural paddy field scenes, (1) Urban-water-area-1, (2) Multi-water-areas and (3) Urban-water-area-2 with Laser Verification, the state-of-the-art ContextCapture and the proposed method are used to perform full-automatic, high-precision 3D modeling of the scenes, respectively.

The results of the full-automatic high-precision scene 3D modeling with water-area restoration can be observed in Figures 16, 17 and 18.

Figure 16. Scene 3D modeling with water-area restoration results (Urban-water-area-1): (a) is ContextCapture 3D model, and (b) is the model generated by our method.


Figure 17. Scene 3D modeling with water-area restoration results (multi-water-areas): (a) is ContextCapture 3D model, and (b) is the model generated by our method.


Figure 18. Scene 3D modeling with water-area restoration results (Urban-water-area-2 with Laser Verification): (a) is ContextCapture 3D model, (b) is the model generated by our method.


According to Figure 16, for the Urban-water-area-1 data, the ContextCapture model exhibits noticeable water-area omissions and gaps, while the model created using the method proposed in this article shows significant improvements, with better coverage and texture in the water-area. In the zoomed-in areas, which are the focal comparison regions, our method significantly outperforms ContextCapture in terms of geometric completion. The first two sets of data employ manually redrawn areas as the ground truth for accuracy verification, while the third set of data utilizes laser scan data for accuracy evaluation. Since laser returns are missing in the center of the water-area, the range selected in this paper for accuracy assessment lies at the boundary of the water-area.

4.5.2. Comparison of 3D modeling efficiency

As shown in Table 3, our method is on average more than twice as fast as the ContextCapture modeling method. To ensure the scientific rigor and objectivity of our evaluation, we employed identical input data and data volumes, as well as the same processing equipment. Furthermore, the methods and resulting data of image feature matching and aerial triangulation were also identical. This single-variable control enabled us to conduct an accurate end-to-end efficiency comparison between ContextCapture and the method proposed in this paper.

Table 3. Comparison of modeling time.

According to the data in Table 3, for Urban-water-area-1, Multi-water-areas and Urban-water-area-2 with Laser Verification, the modeling times for ContextCapture are 8.65, 14.76 and 17.60 h, respectively, while the modeling times for the method proposed in this article are 3.38, 5.12 and 6.23 h for the same data. The analysis shows: (1) Urban-water-area-1 data: our algorithm demonstrates a substantial efficiency gain of 60.9% compared with ContextCapture; (2) Multi-water-areas data: our algorithm improves overall modeling efficiency by 65.3% compared with ContextCapture; (3) Urban-water-area-2 with Laser Verification data: our algorithm again outperforms ContextCapture, by 64.6%. These results underscore the effectiveness of our approach to scene 3D modeling with water-area restoration, achieving an average efficiency improvement of 63.6%.

4.5.3. Comparison of 3D modeling accuracy

The results are summarized in Tables 4 and 5, distinguishing between the RMSE of the overall modeling accuracy and that of the scene 3D modeling with water-area restoration. Our method is compared with ContextCapture modeling and both use the same SfM results. Table 4 shows the accuracy of the water-areas in Urban-water-area-1, Multi-water-areas and Urban-water-area-2 with Laser Verification. For the water-area elevation accuracy, the Urban-water-area-1 and Multi-water-areas data were used to calculate the RMSE of the water-area elevation for both methods based on manually drawn control points along the shore, while Urban-water-area-2 with Laser Verification uses water-area laser control point data to calculate the RMSE of the water-area elevation. Table 5 shows the comparison of the overall model accuracy; for the Multi-water-areas and Urban-water-area-2 with Laser Verification data, the planimetric and elevation errors of our method are better than those of the ContextCapture model. For the overall model accuracy, the Multi-water-areas data were used to calculate the RMSE of the overall model based on measured control points, while Urban-water-area-2 with Laser Verification used laser control point data to calculate the RMSE of the overall model.

Table 4. Accuracy evaluation of the RMSE of the water-area elevation.

Table 5. Accuracy evaluation of the RMSE of the overall model.

For the water-area elevation accuracy, the first dataset used manually sketched areas derived from the aerial triangulation results, with 47 hand-drawn points for evaluation. The second dataset, covering multiple water-areas, used 271 hand-drawn points. For the third dataset, laser scanning data was used, restricted to the point cloud near the shore; after thinning, 331 points were used. For the overall model accuracy, the second dataset contained 11 model control points, evenly distributed across the measurement area to ensure representativeness and comprehensiveness. For the third dataset, the laser point cloud was used directly: to ensure accuracy, sample data was extracted uniformly within the survey area and 387 points from the thinned laser point cloud were used for the calculations. The thinned point cloud ensured that the verification points were evenly distributed and covered all major terrain and landform types, ensuring the effectiveness and reliability of the assessment.
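As an illustration of how such point-based checks can be computed (a minimal sketch in Python; the nearest-neighbour matching and variable names are our own assumptions, not the authors' implementation), the RMSE of the water-area elevation against hand-drawn or laser control points can be evaluated as follows:

```python
import numpy as np
from scipy.spatial import cKDTree

def elevation_rmse(control_points, model_vertices):
    """RMSE of model elevation against control points.

    control_points : (N, 3) array of checked points (x, y, z), e.g. hand-drawn
                     shore points or thinned laser returns.
    model_vertices : (M, 3) array of reconstructed mesh/point-cloud vertices.
    """
    # Match each control point to the planimetrically nearest model vertex.
    tree = cKDTree(model_vertices[:, :2])
    _, idx = tree.query(control_points[:, :2])
    dz = model_vertices[idx, 2] - control_points[:, 2]
    return float(np.sqrt(np.mean(dz ** 2)))
```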

From Table 4, for Urban-water-area-1, Multi-water-areas and Urban-water-area-2 with Laser Verification, the water-area model accuracy of ContextCapture is 0.082 m, 0.074 m and 0.044 m, respectively, while that of the proposed method is 0.075 m, 0.067 m and 0.043 m for the same data. From Table 5, for the Multi-water-areas and Urban-water-area-2 with Laser Verification data, the ContextCapture model has planimetric accuracies of 0.113 m and 0.034 m and elevation accuracies of 0.126 m and 0.051 m, respectively; the proposed method achieves planimetric accuracies of 0.096 m and 0.027 m and elevation accuracies of 0.107 m and 0.047 m. The analysis shows: (1) Multi-water-areas data: our algorithm improves the overall model accuracy by 15.0% in plane and 20% in elevation compared with ContextCapture; (2) Urban-water-area-2 with Laser Verification data: our algorithm improves the overall model accuracy by 15% in plane and 7.8% in elevation. Together, these results show that our water-area model substantially improves the overall accuracy, with an average enhancement of 14.5%.

In fact, the actual accuracy improvement is greater than 14.5%. Because the water-area is missing from the ContextCapture model, the water-area contributes no error when its model accuracy is computed, whereas the proposed method fully and automatically completes the water-area and therefore still carries a small residual error there despite the relatively high accuracy of the completion. The accuracy calculation used in this article thus places the proposed method at a disadvantage, so in terms of scene modeling accuracy the average improvement over ContextCapture exceeds 14.5%.

4.5.4. Comparison of 3D model completeness

Model completeness is evaluated in this section using the completeness of the model, i.e. the percentage of the overall reconstruction area that is not missing. Table 6 reports the completeness of the water-area reconstruction; our method outperforms the ContextCapture model on all datasets. After the aerial triangulation data is imported into the modeling algorithm, the reconstruction area is automatically delineated and subdivided into dense tile blocks for reconstruction. We kept the same reconstruction area as the reference model, thereby ensuring the consistency and uniqueness of the evaluation metric.

Table 6. Comparison of model completeness.

Based on Table 6, for Urban-water-area-1, Multi-water-areas and Urban-water-area-2 with Laser Verification, the completeness of the ContextCapture model is 81%, 78% and 86%, respectively, while that of the proposed model is 95%, 97% and 96% for the same data. The analysis shows: (1) Urban-water-area-1 data: our model completeness improves by 14% compared with ContextCapture; (2) Multi-water-areas data: our algorithm improves the overall model completeness by 19%; (3) Urban-water-area-2 with Laser Verification data: our algorithm outperforms ContextCapture by 10%. These results give an average completeness improvement of 14.3% for the proposed scene 3D modeling with water-area restoration.

In fact, the actual completeness of the reconstructed model is higher than the values reported above. When computing the overall area, the denominator includes the area of all reconstructed tile blocks. At the model's edges, however, there are gaps and irregularities; that is, the real boundary of the scene is smaller than the regular bounding box formed by the tile blocks. Because these edge tiles are only partially covered by the true scene yet are fully counted in the denominator, the computed completeness is somewhat underestimated. The completeness achieved by the proposed full-automatic scene 3D reconstruction method is therefore slightly better than the reported 95% to 97%.
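A minimal sketch of this completeness metric under our reading of the text (the tile and hole areas are illustrative inputs, not the authors' implementation):

```python
def model_completeness(tile_areas, missing_areas):
    """Completeness = 1 - (missing area / total area of all reconstruction tiles).

    tile_areas    : planimetric areas of all tile blocks in the delineated
                    reconstruction region.
    missing_areas : areas of holes / missing regions detected in the model
                    (e.g. unmeshed water-area patches), in the same units.
    """
    total = sum(tile_areas)
    missing = sum(missing_areas)
    return 1.0 - missing / total

# Example: 100 tiles of 2,500 m^2 each with 12,500 m^2 of holes -> 95% complete.
print(model_completeness([2500.0] * 100, [12500.0]))
```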

5. Discussion

This study presents a full-automatic scene 3D reconstruction method with deep-learning-based water-area completion for UAV images. The method consists of three main steps: water-area extraction from UAV oblique images, completion of the depth map using prior knowledge, and optimization of the depth map and water-area mesh using a fully connected CRF together with similarity-measure normals and vertex-gradient optimization. The innovations of the paper can be summarized as follows: (1) Depth map intelligent completion: a simplified GAN is employed to rapidly extract the water-area mask from undistorted images. By combining the precise water-area boundaries with the high-precision image poses, the depth values of the water-area are completed using the automatic, high-precision completion method designed in this paper. (2) Depth map optimization: the depth maps of the water-area and its surroundings are optimized using a fully connected CRF, which improves the accuracy of the depth information in the water-area region. Depth values with significant errors are then eliminated by depth-value clustering on connecting points, i.e. matched points with the same name in multiple images, and depth least squares is applied to compute high-precision 3D coordinates for the point cloud. (3) Mesh optimization: the mesh is optimized using image information combined with gradient changes. This refinement of the mesh vertices yields an accurate mesh that closely aligns with the underlying 3D structure, enhancing the overall fidelity of the model.

In the water-area extraction experiments, it can be seen that, for water-area scenes with small differences in viewpoint, several methods extract the overall extent of the water-area in the original images well. However, the network used in this paper is semi-supervised and adapts well to images captured at different times, overcoming the interference of reflections and shadows on the water surface reported in the literature. At the same time, extraction efficiency is improved while preserving extraction quality; the proposed method improves extraction efficiency by 44.1%.

In the depth map completion and optimization experiments, the results show that the ray-tracing-based depth computation effectively fills the gaps within the water-area and provides accurate completed depth values. In addition, the depth map optimization refines the entire missing depth region. One challenge we address is tall vegetation along the waterfront, which can cause contour-edge fluctuations and affect the accuracy of the 3D reconstruction of the water surface. In large-scale water-surface reconstruction experiments, the mesh optimization effectively corrected these fluctuations and provided a more accurate representation of the actual terrain contours. The water-surface scene was verified against ContextCapture, and high-precision water-surface restoration and modeling accuracy were confirmed using laser point cloud data. Compared with traditional 3D reconstruction methods, our approach improves modeling completeness by 14.3%, average accuracy by 14.5% and reconstruction efficiency by 63.6%.
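To illustrate the kind of ray-tracing completion described here (a minimal sketch assuming a locally horizontal water plane at the elevation of the extracted shoreline; this simplification and the variable names are ours, not the paper's exact procedure), the depth of a masked water pixel can be obtained by intersecting its viewing ray with that plane:

```python
import numpy as np

def water_depth_from_ray(pixel, K, R, C, water_z):
    """Depth (along the optical axis) of a water pixel.

    pixel   : (u, v) image coordinates inside the water mask.
    K       : 3x3 camera intrinsic matrix.
    R, C    : world-to-camera rotation and camera centre (world frame).
    water_z : elevation of the horizontal water plane (e.g. shoreline height).
    Assumes the viewing ray is not parallel to the water plane.
    """
    # Viewing ray direction in world coordinates.
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    d_world = R.T @ d_cam
    # Intersect C + t * d_world with the plane z = water_z.
    t = (water_z - C[2]) / d_world[2]
    p = C + t * d_world                  # 3D point on the water plane
    depth = (R @ (p - C))[2]             # z-component in the camera frame
    return depth
```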

The main causes of the improvements in model completeness, average accuracy and processing efficiency of the proposed method are analyzed as follows:

  1. Analysis of the Reasons for the Model Completeness Improvement of the Proposed Method. The intelligent depth map completion uses a simplified GAN for fast, high-precision water-area extraction and jointly exploits the water-area boundary and the high-precision image pose to complete the water-area in the initial depth map with high precision, which plays the key role in improving model completeness. Depth map optimization, by jointly solving over the image and a fully connected CRF, performs a secondary completion for the few depth defects that remain after water-area completion, which further contributes to the improvement in completeness.

  2. Analysis of the Reasons for the Model Average Accuracy Improvement of the Proposed Method. Depth map optimization, depth-value clustering, gross-error removal and least-squares depth estimation yield high-precision point cloud data (more accurate depth values for water-areas and their connecting areas; see the sketch after this list), which partially accounts for the improvement in model accuracy. Mesh optimization that integrates similarity-measure normals and vertex-gradient optimization leads to a more refined mesh, which plays the major role in improving model accuracy.

  3. Analysis of the Reasons for the Model Production Efficiency Improvement of the Proposed Method. During intelligent depth map completion, a simplified GAN algorithm was designed, significantly improving water-area extraction efficiency. For depth map generation and fusion, PatchMatch stereo matching, depth-value clustering, gross-error removal and least-squares estimation improve the efficiency of generating high-precision dense point clouds. In mesh optimization in particular, because the initial mesh is already highly accurate and is then refined by similarity-measure normals and vertex-gradient optimization, the model is fine-tuned quickly, which plays an important role in improving the efficiency of 3D model production.
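As an illustration of the least-squares step mentioned above (a minimal sketch; the linear DLT-style formulation and variable names are our assumptions, not the authors' exact solver), the 3D coordinates of a connecting (tie) point observed in several images can be estimated by solving an over-determined linear system built from the projection matrices:

```python
import numpy as np

def triangulate_least_squares(observations, proj_matrices):
    """Linear least-squares triangulation of one connecting (tie) point.

    observations  : list of (u, v) pixel coordinates of the same point.
    proj_matrices : list of 3x4 camera projection matrices P = K [R | t].
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for (u, v), P in zip(observations, proj_matrices):
        rows.append(u * P[2] - P[0])   # u * (p3 . X) = p1 . X
        rows.append(v * P[2] - P[1])   # v * (p3 . X) = p2 . X
    A = np.asarray(rows)
    # Homogeneous solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```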

Regarding water-area extraction, extraction efficiency still needs to improve to suit automated production, and the mapping of water-area textures remains a future challenge. 3D point cloud fusion via depth map completion can be applied not only to water-area restoration but also to the reconstruction of other weakly textured regions, and this work may help to further improve the accuracy of weak-texture 3D modeling.

6. Conclusion

Accurate 3D modeling of real-world water-areas has always been a challenge in the field. Complete and accurate reconstruction of water surfaces is especially crucial for applications such as digital stereo models and hydraulic engineering projects. In this study, data captured by an airborne five-lens oblique camera, which images the scene from multiple angles, are used to reconstruct 3D models of water-areas. Deep learning semantic extraction is employed to identify the water surface, the depth map is then completed using ray-tracing principles, and image-based optimization is applied to refine the depth map and the mesh. The experimental results clearly show that our approach yields substantial improvements: modeling completeness by 14.3%, accuracy by 14.5% and reconstruction efficiency by a remarkable 63.6%.

The method in this paper enhances the overall completeness, accuracy and efficiency of full-automatic scene 3D reconstruction by addressing the weak textures and surface fluctuations of water-areas, which make accurate matching and 3D modeling difficult. It achieves markedly higher completeness, precision and efficiency for city scenes containing numerous water-areas, contributing to full-automatic, high-completeness and high-precision 3D reconstruction of the real world. The main work and contributions of this study are as follows:

  1. An intelligent water-area completion method based on deep learning networks and ray tracing on initial depth maps was proposed. For undistorted images, a simplified GAN algorithm extracts the water-area mask quickly. By combining the precise water-area boundaries with the high-precision image poses, the water-area depth values in the initial depth map are completed with high precision. This method plays a crucial role in improving the completeness of the 3D model.

  2. A depth map optimization method based on the combination of images and CRF was introduced. A fully connected CRF is used to optimize the depth maps of the water-area and its surroundings. Depth map fusion is then performed: the depth-value clustering method based on connecting points eliminates depth values with large errors, and depth least squares computes high-precision 3D coordinates of the point cloud. This approach partially enhances model completeness and average accuracy.

  3. An improved mesh optimization method based on similarity-measure normals and vertex-gradient optimization was designed. Because the initial mesh already has relatively high precision, optimizing the vertices with gradients combined with image information refines the mesh much faster. This method significantly improves both the model's average accuracy and the production efficiency; the simplified GAN also obtains the water-area mask quickly, which further speeds up model generation. A minimal sketch of such gradient-based vertex refinement is given below.
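As an illustration only, the vertex refinement can be pictured as gradient descent on a combined data-plus-smoothness energy; the Laplacian regularizer and the generic data-term gradient below are our simplified stand-ins, not the paper's similarity-measure energy:

```python
import numpy as np

def refine_vertices(vertices, neighbors, data_grad, lam=0.5, step=0.1, iters=10):
    """Gradient-based vertex refinement with Laplacian smoothing.

    vertices  : (N, 3) array of mesh vertex positions.
    neighbors : list of index lists, neighbors[i] = 1-ring of vertex i.
    data_grad : callable (N, 3) -> (N, 3), gradient of an image-based data
                term at each vertex (placeholder for a photo-consistency term).
    lam       : weight of the smoothness (Laplacian) term.
    """
    v = vertices.copy()
    for _ in range(iters):
        lap = np.array([v[nbrs].mean(axis=0) - v[i]          # umbrella Laplacian
                        for i, nbrs in enumerate(neighbors)])
        v -= step * (data_grad(v) - lam * lap)                # descend the combined energy
    return v
```

In the paper's setting the data term would come from multi-view image similarity measures, whereas here it is left as a placeholder callable.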

In future research, we will address the few remaining parts of the scene 3D model that are missing or insufficiently accurate (about 3% of the model is missing with this study's method). For example, the underside of the bridge in the reconstructed model is partly missing because data in a small number of areas were missing or insufficient during UAV oblique photogrammetry acquisition. To mitigate this, a suitable method could automatically detect deficient and inaccurate areas within the reconstructed model, and optimal flight paths could then be planned so that the UAV performs close-range photogrammetry and fill-in shots, enhancing the overall completeness and accuracy of the scene 3D model.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Additional information

Funding

This work is supported by the National Natural Science Foundation of China [grant number 42101449], the Natural Science Foundation of Hubei Province, China [grant number 2022CFB773], the Key Research and Development Project of Jinzhong City, China [grant number Y211006], the Science and Technology Program of Southwest China Research Institute of Electronic Equipment [grant number JS20200500114], the Chutian Scholar Program of Hubei Province, the Yellow Crane Talent Scheme, the Research Program of Wuhan University-Huawei Geoinformatics Innovation Laboratory [grant number K22-4201-011], the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources [grant number KF-2022-07-003] and the CRSRI Open Research Program [grant number CKWV20231167/KF].

References

  • Barnes, Connelly, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. 2009. “PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing.” ACM Trans. Graph 28 (3): 24.
  • Bleyer, Michael, Christoph Rhemann, and Carsten Rother. 2011. “PatchMatch Stereo - Stereo Matching with Slanted Support Windows.” Paper Presented at the Bmvc, https://doi.org/10.5244/C.25.14.
  • Chen, Zhihao, Lei Zhu, Liang Wan, Song Wang, Wei Feng, and Pheng-Ann Heng. 2020. “A Multi-Task Mean Teacher for Semi-Supervised Shadow Detection.” Paper Presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, https://doi.org/10.1109/CVPR42600.2020.00565.
  • Craglia, Max, Kees de Bie, Davina Jackson, Martino Pesaresi, Gábor Remetey-Fülöpp, Changlin Wang, Alessandro Annoni, Ling Bian, Fred Campbell, and Manfred Ehlers. 2012. “Digital Earth 2020: Towards the Vision for the Next Decade.” International Journal of Digital Earth 5 (1): 4–21. https://doi.org/10.1080/17538947.2011.638500.
  • Creswell, Antonia, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. “Generative Adversarial Networks: An Overview.” IEEE Signal Processing Magazine 35 (1): 53–65. https://doi.org/10.1109/MSP.2017.2765202.
  • Dou, S. Q., and S. Y. Ding. 2020. “Construction of Smart Community Based on GIS and Tilt Photogrammetry.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, https://doi.org/10.5194/isprs-archives-XLII-3-W10-547-2020.
  • Duan, Ping, Mingguo Wang, Yayuan Lei, and Jia Li. 2021. “Research on Estimating Water Storage of Small Lake Based on Unmanned Aerial Vehicle 3D Model.” Water Resources 48: 690–700. https://doi.org/10.1134/S0097807821050109.
  • Fuhrmann, Simon, and Michael Goesele. 2011. “Fusion of Depth Maps with Multiple Scales.” ACM Transactions on Graphics 30 (6): 1–8. https://doi.org/10.1145/2070781.2024182.
  • Gallup, David, Jan-Michael Frahm, and Marc Pollefeys. 2010. “Piecewise Planar and non-Planar Stereo for Urban Scene Reconstruction.” Paper Presented at the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2010.5539804
  • Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. “Generative Adversarial Networks.” Communications of the ACM 63 (11): 139–144. https://doi.org/10.1145/3422622.
  • Guo, Huadong, Zhen Liu, Hao Jiang, Changlin Wang, Jie Liu, and Dong Liang. 2017. “Big Earth Data: A new Challenge and Opportunity for Digital Earth’s Development.” International Journal of Digital Earth 10 (1): 1–12. https://doi.org/10.1080/17538947.2016.1264490.
  • Jancosek, Michal, and Tomas Pajdla. 2011. “Multi-View Reconstruction Preserving Weakly-Supported Surfaces.” Paper Presented at the CVPR 2011. https://doi.org/10.1109/CVPR.2011.5995693
  • Jiang, San, Wanshou Jiang, and Bingxuan Guo. 2022a. “Leveraging Vocabulary Tree for Simultaneous Match Pair Selection and Guided Feature Matching of UAV Images.” ISPRS Journal of Photogrammetry and Remote Sensing 187: 273–293. https://doi.org/10.1016/j.isprsjprs.2022.03.006.
  • Jiang, San, Wanshou Jiang, and Lizhe Wang. 2022b. “Unmanned Aerial Vehicle-Based Photogrammetric 3d Mapping: A Survey of Techniques, Applications, and Challenges.” IEEE Geoscience and Remote Sensing Magazine 10 (2): 135–171. https://doi.org/10.1109/MGRS.2021.3122248.
  • Jiang, San, Qingquan Li, Wanshou Jiang, and Wu Chen. 2022c. “Parallel Structure from Motion for UAV Images via Weighted Connected Dominating set.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–13. https://doi.org/10.1109/TGRS.2022.3222776.
  • Kang, Jihun, Sewon Lee, Yeon Sunghyun, and Seongjin Park. 2019. “Improving 3d Mesh Quality Using Multi-Directional UAV Images.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, https://doi.org/10.5194/isprs-archives-XLII-2-W13-419-2019.
  • Kerstein, Thomas, Martin Laurowski, Philipp Klein, Michael Weyrich, Hubert Roth, and Jürgen Wahrburg. 2011. “Optical 3d-Surface Reconstruction of Weak Textured Objects Based on an Approach of Disparity Stereo Inspection.” Paper Presented at the Proceedings of International Conference on Pattern Recognition and Computer Vision (ICPRCV 2011).
  • Li, Yunsong, Yinlin Hu, Rui Song, Peng Rao, and Yangli Wang. 2018a. “Coarse-to-fine PatchMatch for Dense Correspondence.” IEEE Transactions on Circuits and Systems for Video Technology 28 (9): 2233–2245. https://doi.org/10.1109/TCSVT.2017.2720175.
  • Li, Ziyao, Rui Wang, Wen Zhang, Fengmin Hu, and Lingkui Meng. 2019. “Multiscale Features Supported DeepLabV3+ Optimization Scheme for Accurate Water Semantic Segmentation.” IEEE Access 7: 155787–155804. https://doi.org/10.1109/ACCESS.2019.2949635.
  • Li, Deren, Xiongwu Xiao, Bingxuan Guo, Wanshou Jiang, and Yueru Shi. 2016. “Oblique Image Based Automatic Aerotriangulation and Its Application in 3D City Model Reconstruction.” Geomatics and Information Science of Wuhan University 41 (6): 711–721. https://doi.org/10.13203/j.whugis20160099.
  • Li, Shenhong, Xiongwu Xiao, Bingxuan Guo, and Lin Zhang. 2020. “A Novel OpenMVS-Based Texture Reconstruction Method Based on the Fully Automatic Plane Segmentation for 3D Mesh Models.” Remote Sensing 12 (23): 3908. https://doi.org/10.3390/rs12233908.
  • Li, Lincheng, Shunli Zhang, Xin Yu, and Li Zhang. 2018b. “PMSC: PatchMatch-Based Superpixel cut for Accurate Stereo Matching.” IEEE Transactions on Circuits and Systems for Video Technology 28 (3): 679–692. https://doi.org/10.1109/TCSVT.2016.2628782.
  • Lin, Guosheng, Anton Milan, Chunhua Shen, and Ian Reid. 2017. “Refinenet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation.” Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, https://doi.org/10.48550/arXiv.1611.06612.
  • Liu, Fangfang, and Ming Fang. 2020. “Semantic Segmentation of Underwater Images Based on Improved Deeplab.” Journal of Marine Science and Engineering 8 (3): 188. https://doi.org/10.3390/jmse8030188.
  • Liu, Fayao, Guosheng Lin, and Chunhua Shen. 2015. “CRF Learning with CNN Features for Image Segmentation.” Pattern Recognition 48 (10): 2983–2992. https://doi.org/10.1016/j.patcog.2015.04.019.
  • Liu, Xiaoxia, Fengbao Yang, Hong Wei, and Min Gao. 2022. “Shadow Removal from UAV Images Based on Color and Texture Equalization Compensation of Local Homogeneous Regions.” Remote Sensing 14 (11), https://doi.org/10.3390/rs14112616.
  • Lu, Dengsheng, Guiying Li, Wenhui Kuang, and Emilio Moran. 2014. “Methods to Extract Impervious Surface Areas from Satellite Images.” International Journal of Digital Earth 7 (2): 93–112. https://doi.org/10.1080/17538947.2013.866173.
  • Mishra, Vishal, Ram Avtar, A. P. Prathiba, Prabuddh Kumar Mishra, Anuj Tiwari, Surendra Kumar Sharma, Chandra Has Singh, Bankim Chandra Yadav, and Kamal Jain. 2023. “Uncrewed Aerial Systems in Water Resource Management and Monitoring: A Review of Sensors, Applications, Software, and Issues.” Advances in Civil Engineering, https://doi.org/10.1155/2023/3544724.
  • Mo, Nan, Ruixi Zhu, Li Yan, and Zhan Zhao. 2018. “Deshadowing of Urban Airborne Imagery Based on Object-Oriented Automatic Shadow Detection and Regional Matching Compensation.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (2): 585–605. https://doi.org/10.1109/JSTARS.2017.2787116.
  • Murali, Saritha, and V. K. Govindan. 2013. “Shadow Detection and Removal from a Single Image Using LAB Color Space.” Cybernetics and Information Technologies 13 (1): 95–103. https://doi.org/10.2478/cait-2013-0009.
  • Pant, Gaurav, D. P. Yadav, and Ashish Gaur. 2020. “ResNeXt Convolution Neural Network Topology-Based Deep Learning Model for Identification and Classification of Pediastrum.” Algal Research 48: 101932. https://doi.org/10.1016/j.algal.2020.101932.
  • Park, Haekyung, and Dongkun Lee. 2019. “Comparison Between Point Cloud and Mesh Models Using Images from an Unmanned Aerial Vehicle.” Measurement 138: 461–466. https://doi.org/10.1016/j.measurement.2019.02.023.
  • Pinheiro, Pedro O., Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. 2016. “Learning to Refine Object Segments.” Paper Presented at the Computer Vision–ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. https://doi.org/10.1007/978-3-319-46448-0_5.
  • Remondino, Fabio, Luigi Barazzetti, Francesco Nex, Marco Scaioni, and Daniele Sarazzi. 2011. “UAV photogrammetry for mapping and 3d modeling – current status and future perspectives.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XXXVIII-1/C22: 25–31. https://doi.org/10.5194/isprsarchives-XXXVIII-1-C22-25-2011.
  • Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional Networks for Biomedical Image Segmentation. Paper Presented at the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. https://doi.org/10.1007/978-3-319-24574-4_28.
  • Sabri, My Abdelouahed, Siham Aqel, and Abdellah Aarab. 2019. “A Multiscale Based Approach for Automatic Shadow Detection and Removal in Natural Images.” Multimedia Tools and Applications 78 (9): 11263–11275. https://doi.org/10.1007/s11042-018-6678-x.
  • Schonberger, Johannes L., and Jan-Michael Frahm. 2016. “Structure-from-motion Revisited.” Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, https://doi.org/10.1109/CVPR.2016.445.
  • Tian, Mao, Bisheng Yang, Chi Chen, Ronggang Huang, and Liang Huo. 2019. “HPM-TDP: An Efficient Hierarchical PatchMatch Depth Estimation Approach Using Tree Dynamic Programming.” ISPRS Journal of Photogrammetry and Remote Sensing 155: 37–57. https://doi.org/10.1016/j.isprsjprs.2019.06.015.
  • Toschi, Isabella, M. M. Ramos, Erica Nocerino, F. Menna, F. Remondino, K. Moe, D. Poli, K. Legat, and Francesco Fassi. 2017. “Oblique Photogrammetry Supporting 3D Urban Reconstruction of Complex Scenarios.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-1/W1: 519–526. https://doi.org/10.5194/isprs-archives-XLII-1-W1-519-2017.
  • Wang, Ruoqi, Guiying Li, Yagang Lu, and Dengsheng Lu. 2023. “A comparative analysis of grid-based and object-based modeling approaches for poplar forest growing stock volume estimation in plain regions using airborne LIDAR data.” Geo-spatial Information Science, 1–19. https://doi.org/10.1080/10095020.2023.2169199.
  • Xiao, Yong, Cheng Wang, Jing Li, Wuming Zhang, Xiaohuan Xi, Changlin Wang, and Pinliang Dong. 2015. “Building Segmentation and Modeling from Airborne LiDAR Data.” International Journal of Digital Earth 8 (9): 694–709. https://doi.org/10.1080/17538947.2014.914252.
  • Yanmei, Yang, Wang Ying, Shi Lei, Tao Sirong, and Li Hong. 2021. “Construction of Refined 3D Model Based on DP-Modeler.” Bulletin of Surveying and Mapping 5: 106. https://doi.org/10.13474/j.cnki.11-2246.2021.0152.
  • Zhang, Chi, and A. Murat Maga. 2023. “An Open-Source Photogrammetry Workflow for Reconstructing 3D Models.” Integrative Organismal Biology 5 (1): obad024. https://doi.org/10.1093/iob/obad024.