Review Article

UAV image matching from handcrafted to deep local features

Article: 2307619 | Received 25 Sep 2023, Accepted 16 Jan 2024, Published online: 21 Feb 2024

ABSTRACT

Local feature matching between images is a challenging task, particularly when there are significant appearance variations, such as extreme viewpoint changes. In this work, we present LoFTRS, a deep learning-based image matching framework that integrates semantic constraints into the matching process. Our key insight is that a local feature matcher with deep layers can capture more human-intuitive and easier-to-match features. In addition to the image segmentation module, we propose a detector-free Transformer module. It uses vector-based attention to model the relevance among all features and achieves efficient and effective long-range context aggregation. The Transformer module applies a relative position encoding to explicitly encode relative distance information, further improving the feature representation. We evaluate the performance of LoFTRS compared to various popular handcrafted and deep learning-based methods, and we investigate the relationship between matching quality and the performance of subsequent processing steps, such as the accuracy and completeness of the model generated by SfM. The experimental results show that the proposed LoFTRS achieves equal or better image matching performance in terms of matching score, average track length, RMSE, and the number of 3D points.

Introduction

Unmanned Aerial Vehicles (UAVs) have been used on a large scale as new remote sensing platforms for oblique photogrammetry and computer vision (Pajares, Citation2015). With the advantages of flexible data acquisition and low economic cost, high-resolution and multi-view images can be provided by carrying various non-surveying cameras for different application areas (Daakir et al., Citation2017), including object recognition (Chiabrando et al., Citation2018), route inspection (Jiang et al., Citation2017), and agricultural planning (Habib et al., Citation2016). Once the data have been acquired, local feature matching between images is a critical step in many computer vision tasks (Jiang & Jiang, Citation2017), for example Structure from Motion (SfM) and Multi-View Stereo (MVS) (Radenovic et al., Citation2016; Schonberger & Frahm, Citation2016; Schonberger et al., Citation2015; Schönberger et al., Citation2016), image retrieval (Philbin et al., Citation2007; Sattler et al., Citation2016; Tolias et al., Citation2016; Torii et al., Citation2015), and image-based localization (Sattler et al., Citation2011, Citation2016; Zeisl et al., Citation2015).

When performing 3D reconstruction of real scenes without known camera poses or scene structure, a variety of problems can degrade the quality and accuracy of the result. Real-world data usually contain noise, illumination variations, and distortions, which lead to inaccuracies in the generated 3D models. Occlusion between objects can leave certain regions missing, affecting the completeness of the reconstruction. Illumination changes across acquisition times or conditions introduce color and luminance differences between images, which in turn affect the consistency of the reconstruction (Vijayanarasimhan et al., Citation2017). Some objects lack sufficient texture, resulting in a lack of detail in the reconstructed model. Camera or object motion during shooting can blur images, degrading depth estimation and matching (Huang et al., Citation2018). Inaccurate camera parameter estimation leads to geometric transformation errors that affect the accuracy of the reconstruction. Finally, processing large-scale real-scene datasets demands substantial computational resources, and data management and processing can become complex.

Over the last decades, the Scale-Invariant Feature Transform (SIFT) (Lowe, Citation2004) and its variants, such as RootSIFT (Arandjelović & Zisserman, Citation2012) and DSP-SIFT (Dong & Soatto, Citation2015), have been the most widely used local features in image matching due to their invariance to changes in scale, rotation, and illumination, as well as their high robustness to viewpoint changes. To handle different data and improve the performance of local features, several methods have been developed for specific applications in photogrammetry and remote sensing, such as L2-SIFT (Sun et al., Citation2016) and AB-SIFT (Sedaghat & Ebadi, Citation2015).

Recently, neural networks have been widely applied in computer vision, e.g. in target detection and recognition (Krizhevsky et al., Citation2017; Sharif Razavian et al., Citation2014). They have also been applied to feature descriptor learning (Simo-Serra et al., Citation2015) to obtain more discriminative representations for local features, e.g. SuperPoint (DeTone et al., Citation2018) and D2Net.

The results show a significant improvement over traditional handcrafted methods such as SIFT, Speeded Up Robust Features (SURF) (Bay et al., Citation2006), and DAISY (Tola et al., Citation2009). However, these methods are usually trained and evaluated on the Brown benchmarks for patch verification and classification, as described in (Schonberger et al., Citation2017) and (Jin et al., Citation2021), so their intrinsic objective is not the same as that of image matching. Meanwhile, SfM-based workflows include many modular steps, such as feature extraction, descriptor matching, outlier elimination, and Bundle Adjustment (BA). Local feature matching is only a precursory step for subsequent image processing, and good results in this step do not guarantee a substantial improvement in the final reconstructed model. Therefore, evaluating its performance within the full SfM pipeline is reasonable and necessary (Jin et al., Citation2021).

However, to the best of our knowledge, no such evaluation has been performed for feature extraction and matching in SfM-based orientation and MVS-based reconstruction of UAV images. Although comprehensive evaluations have been performed in (Jin et al., Citation2021) and (Schonberger et al., Citation2017), they use community-sourced imagery with poor spatial resolution, some of the newer learning-based features were not evaluated (Schonberger et al., Citation2017), and the amount of data used for SfM-based orientation in (Jin et al., Citation2021) is small, so these studies are not adapted to the practical needs of UAV imagery.

In this paper, we conduct a comprehensive experimental evaluation of learned and handcrafted feature descriptors to better understand their performance. Specifically, the paper makes the following contributions:

We perform a comprehensive evaluation of classical handcrafted and deep learning-based methods for feature matching and point cloud reconstruction.

Compared to previous evaluations, we conduct a more detailed study of the matching performance of different descriptors using a wider range of evaluation criteria and scenarios. As illustrated in , we propose a deep learning-based image matching framework called LoFTRS, which incorporates semantic constraints into the deep learning-based matching process. We analyze the performance of the matching process and compare the impact of the Semantic Deep Matcher on image-based reconstruction against various popular handcrafted and deep learning-based methods. In particular, we also explore the connection between matching quality and the performance of subsequent processing steps, e.g. the accuracy and completeness of the model generated by SfM.

Related work

This study deals with the performance evaluation of hand-crafted and deep learning-based features in the context of SfM-based image reconstruction. Figure 1 illustrates the overall flow of the reconstruction. In this section, we therefore review work related to local feature matching, covering both hand-crafted and deep learning-based features.

Figure 1. Reconstruction process.


Hand-crafted features

Before deep learning-based features emerged, traditional local features, also known as hand-crafted features, played an important part in image matching. Even today, in some specific environments, deep learning methods are not as effective as hand-crafted features. The advantages and disadvantages of the hand-crafted features (Tareen & Saleem, Citation2018) used in this article are summarized in Table 1.

A good local feature detector should detect distinguishable features and satisfy the covariance constraint, i.e. repeatedly detect consistent features under different transformations. Typical representative algorithms include the following:

Harris operator (Harris & Stephens, Citation1988): a classical method for accurate detection of corner features that analyzes the local second-moment matrix of the image and constructs a function describing the change in information under window translation. The Harris operator is rotation and contrast invariant but lacks scale invariance and is limited by the choice of threshold value.

Scale-invariant feature transform (SIFT) (Lowe, Citation2004): uses the Difference of Gaussians (DoG) as a detector to select keypoints in a scale-space pyramid. SIFT features are scale invariant and can be generated even for small objects. However, SIFT requires a significant amount of memory, especially when working with high-resolution images or large databases, which can be a challenge in resource-constrained environments.

Speeded Up Robust Features (SURF): Bay's SURF (Bay et al., Citation2006) extracts and describes features in a more efficient way. SURF uses an approximation of the determinant of the Hessian matrix of the image; where the Hessian response attains a local maximum, the current point is brighter or darker than its surrounding neighborhood, and the keypoint position is located there. However, SURF is not robust enough to adapt to changes in scale and rotation.

Oriented FAST and Rotated BRIEF (ORB): ORB (Rublee et al., Citation2011) has a much better runtime than SIFT and SURF and can be used for real-time feature detection. ORB is based on the FAST corner detector and the BRIEF descriptor and is robust to noise and affine viewpoint changes. Its disadvantage is a relatively low ability to cope with scale changes.

AKAZE: Alcantarilla (Alcantarilla & Solutions, 2011) uses Fast Explicit Diffusion (FED) to construct the nonlinear scale space faster than any other nonlinear model at the time, while being more accurate than AOS schemes. An efficient Modified Local Difference Binary descriptor (M-LDB) is introduced, which exploits the gradient information of the FED-built scale space to add distinctiveness. The AKAZE algorithm is faster than SIFT and SURF, while its repeatability and robustness are greatly improved over ORB.

Based on the comparison of SIFT, SURF, ORB, and AKAZE (Tareen & Saleem, Citation2018), the following conclusions can be drawn:

The quantitative comparison shows that the order of feature-detector-descriptors for detecting a high quantity of features is:

ORB>SURF>SIFT>AKAZE>KAZE.

The order of computational efficiency of feature detection and description per feature-point is:

ORB>ORB1000>SURF64D>SURF128D>AKAZE>SIFT>KAZE.

The order of efficient feature-matching per feature-point is:

ORB1000>AKAZE>KAZE>SURF64D>ORB>SIFT>SURF128D.

The feature detector descriptors can be ranked according to their speed of total image matching as follows:

ORB1000>AKAZE>KAZE>SURF64D>SIFT>ORB>SURF128D.
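For reference, this kind of detector comparison can be reproduced on one's own UAV imagery with OpenCV. The sketch below is only a minimal illustration, not the evaluation protocol of this paper: it counts keypoints and measures per-keypoint detection-plus-description time for SIFT, ORB, and AKAZE, and adds SURF only when an opencv-contrib build is available. The image path is a placeholder.

```python
import time

import cv2

# Placeholder input: any UAV image, loaded in grayscale.
img = cv2.imread("uav_frame.jpg", cv2.IMREAD_GRAYSCALE)

detectors = {
    "SIFT": cv2.SIFT_create(),
    "ORB": cv2.ORB_create(nfeatures=5000),
    "AKAZE": cv2.AKAZE_create(),
}
# SURF is patented and only shipped in opencv-contrib builds.
if hasattr(cv2, "xfeatures2d"):
    detectors["SURF"] = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

for name, det in detectors.items():
    t0 = time.perf_counter()
    kpts, desc = det.detectAndCompute(img, None)
    dt = time.perf_counter() - t0
    # Report the two quantities ranked above: feature count and per-feature cost.
    print(f"{name}: {len(kpts)} keypoints, "
          f"{1e6 * dt / max(len(kpts), 1):.1f} microseconds per keypoint")
```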

Deep learning-based local features

Learning descriptors with deep networks is often framed as a supervised learning problem. The goal is to learn a representation that brings matching features as close together as possible in the measurement space while keeping mismatched features apart.

Descriptor learning is also known as patch matching because it often uses local patches that have been cropped and centered on keypoints. In general, existing methods take two forms, namely metric learning (Wang et al., Citation2017) and descriptor learning (Luo et al., Citation2019), and the two are usually used together. Using original patches or precomputed descriptors as input, metric learning methods train a discriminative metric for similarity evaluation. In contrast, descriptor learning typically creates descriptor representations from raw images or patches.

Deep learning-based descriptors can be seen as an extension of classical learned descriptors (Schonberger et al., Citation2017). For example, recent deep methods have adopted and modified the Siamese structure (Chopra et al., Citation2005) and often use loss functions such as hinge, Siamese, triplet, ranking, and contrastive losses. More precisely, Zagoruyko and Komodakis (Citation2015) introduced deep patch comparison and showed how to learn a general patch similarity function directly from raw image pixels. In this case, the similarity function is encoded with CNN models (Krizhevsky et al., Citation2017) of different Siamese types, which are then trained to distinguish between positive and negative image patches. Siamese networks with shared or unshared weights and a central-surround form are two of several architectures that have been tried. To learn both descriptors and metrics, MatchNet (Han et al., Citation2015) was proposed, implemented with a Siamese-like description network and a fully convolutional decision network, which greatly improved description performance. Jiang et al. (Citation2014) proposed a deep ranking method to discover fine-grained image similarities. The model uses a triplet hinge loss and a ranking function to define fine-grained image similarity relations, and a multi-scale network design captures both global visual features and image semantics; Siamese and triplet networks were used for implementation, combining triplet sampling with a global training loss. TFeat (Balntas et al., Citation2016) suggested the use of triplets of training samples for CNN-based patch description and matching.

The method was implemented with shallow convolutional networks and fast hard-negative mining. Tian et al. (Citation2017) used a progressive sampling technique in L2Net to improve a relative distance-based loss function in Euclidean space; to improve efficiency, they also considered the compactness of descriptors and intermediate feature maps. By combining a direct hinge triplet loss with "hardest-in-batch" mining, HardNet (Mishchuk et al., Citation2017) improves performance over L2Net, and PN-Net (Balntas et al., Citation2016) trains with both positive and negative constraints, utilizing the concepts of online augmentation and distance metric learning; compared to a hinge loss or SoftMax ratio, its SoftPN loss function converges faster. Generalized ranking has been introduced through descriptor learning based on average precision, in which true matches should be ranked above all false matches; this requirement is specified as a constraint and optimized for binary and real-valued local feature descriptors. Generative adversarial networks (Mirza & Osindero, Citation2014) have been used to train discriminative but compact binary representations of image patches. In contrast, by exploiting a pool of kernelized subspaces, Wei et al. (Citation2018) learned a discriminative deep descriptor without relying on specific loss functions, network structures, regularization, or hard negative mining. SOSNet (Yi et al., Citation2016) trained the network by combining local patch similarity constraints with spatial geometric constraints on the points of interest, a more recent technique that significantly enhances matching.

As with the CNN-based detectors mentioned above, an increasing number of end-to-end learning methods integrate feature description and detection into a complete matching pipeline. These methods are similar to those designed specifically for description reviewed above; the main differences lie in the training procedure and the design of the overall network, and the core challenge is to make the whole process differentiable and trainable. For example, LIFT uses an end-to-end CNN to perform keypoint detection, orientation estimation, and description simultaneously. SuperPoint (DeTone et al., Citation2018) proposes a self-supervised framework for training interest point detectors and descriptors for multi-view geometric problems; its fully convolutional model operates on full-size images and simultaneously computes pixel-level interest point locations and their associated descriptors.

LF-Net (Ono et al., Citation2018) confines the end-to-end pipeline to a single branch to optimize the entire process in a differentiable manner. It also employs a fully convolutional network that operates on full-sized images to produce a feature-rich score map. This map can then be used to extract keypoint locations as well as their feature attributes, such as scale and orientation. Additionally, it performs a differentiable form of Non-Maximum Suppression (NMS), namely, softargmax, to improve subpixel location accuracy and enhance keypoint saliency. Similar to LF-Net, RF-Net (Shen et al., Citation2019) selects high-response pixels as keypoints on multiple scales, but the response maps are constructed by receptive feature maps.

However, Bhowmik et al. (Citation2020) suggest that improving low-level matching scores does not necessarily lead to better performance in high-level vision tasks. They therefore embedded the feature detector in a complete vision pipeline in which the learnable parameters are trained end-to-end, addressing the discrete nature of keypoint selection and descriptor matching with principles from reinforcement learning. In 2020, Luo et al. (Citation2020) presented ASLFeat, which jointly learns local feature detectors and descriptors to exploit the local shape information of feature points and improve the accuracy of point detection. Another learning-based approach to detection involves estimating orientation (Yi et al., Citation2016), while the spatial transformer network (STN) (Jaderberg et al., Citation2015) can also serve as a valuable reference for deep learning-based detectors with rotation invariance (Ono et al., Citation2018; Yi et al., Citation2016).

Materials and methods

As shown in Figure 2, we present a detector-free local feature matching framework with transformers and semantics, called LoFTRS, which incorporates semantic constraints into the matching process. LoFTRS consists of two main branches: an image segmentation module and a detector-free Transformer matching module.

Figure 2. Incorporating semantic constraints to Transformer-based local feature matching.


Image segmentation module

In the image segmentation module, we follow Mask DINO (Li et al., Citationn.d.), which uses the same detection architecture as DINO with only minimal modifications. In the Transformer decoder, Mask DINO adds a mask branch for segmentation and extends several key components of DINO for the segmentation task. Mask DINO predicts boxes and masks with two parallel heads in a loosely coupled manner, like some traditional models; however, this can result in inconsistent predictions. To address this issue, a mask prediction loss is added to the original box and classification losses in the bipartite matching, which encourages more accurate and consistent matching results for a query.
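The text above does not spell out how the segmentation output constrains the matcher, so the following is only a hedged sketch of one plausible mechanism: rejecting putative correspondences whose endpoints carry different semantic labels. The function name `filter_by_semantics` and the array layouts are illustrative assumptions, not the actual LoFTRS implementation.

```python
import numpy as np

def filter_by_semantics(kpts_a, kpts_b, matches, seg_a, seg_b):
    """Keep only matches whose endpoints share the same semantic label.

    kpts_a, kpts_b : (N, 2) arrays of (x, y) keypoint coordinates.
    matches        : (M, 2) array of index pairs into kpts_a / kpts_b.
    seg_a, seg_b   : (H, W) integer label maps from the segmentation branch.
    """
    kept = []
    for ia, ib in matches:
        xa, ya = np.round(kpts_a[ia]).astype(int)
        xb, yb = np.round(kpts_b[ib]).astype(int)
        # Reject correspondences that cross semantic classes (e.g. roof vs. road).
        if seg_a[ya, xa] == seg_b[yb, xb]:
            kept.append((ia, ib))
    return np.asarray(kept)
```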

Detector-free transformer module

In this work, we present PE-LoftrMatcher, a Transformer-based deep learning module that builds upon our investigation of local feature matching in detector-free methods. As shown in Figure 3, PE-LoftrMatcher uses a deep-narrow Transformer design to capture more human-intuitive and easier-to-match features. Additionally, position encoding (PE) is integrated into each Transformer layer to convey position information in the deep layers, and a network-based refinement block is proposed to extract more precise matches. Our key insight is that local feature matching with deep layers can capture more human-intuitive and easier-to-match features. The detailed network architecture is shown in Figures 3 and 4: PE-LoftrMatcher interleaves the Slimming Transformer (SlimFormer) L times to perform long-range context aggregation. The Slimming Transformer leverages vector-based attention to model the relevance among all keypoints and achieves long-range context aggregation efficiently and effectively.

Figure 3. PE-LoftrMatcher framework flowchart.


Figure 4. Slimming transformer framework diagram.


We flatten the updated enhanced features $F_A^{ftm}$ and $F_B^{ftm}$ into the input sequences for deep feature aggregation, obtaining $F_A^{seq}, F_B^{seq} \in \mathbb{R}^{N \times C}$. Following (Sarlin et al., Citationn.d.), we view the keypoints with features $F_A^{seq}, F_B^{seq}$ in the image pair as nodes of a GNN, in which global intra-/inter-image context aggregation is performed.

Vector-based attention (VAtt) layer

Instead of approximating self-attention in a context-independent way, we transform the query vector into a global query context and use element-wise products to model the correlation among all keypoints. Technically, during each feature enhancement, we utilize self- or cross-attention to aggregate long-range contextual information. For self-attention, the input features $U$ and $R$ are identical (either $(F_A^{seq}, F_A^{seq})$ or $(F_B^{seq}, F_B^{seq})$). For cross-attention, the input features $U$ and $R$ come from different images (either $(F_A^{seq}, F_B^{seq})$ or $(F_B^{seq}, F_A^{seq})$). First, SlimFormer converts the input features $U$ and $R$ into queries, keys, and values $Q, K, V \in \mathbb{R}^{N \times \hat{C}}$:

$$Q = UW^Q, \quad K = RW^K, \quad V = RW^V \tag{1}$$

where $W^Q, W^K, W^V \in \mathbb{R}^{\hat{C} \times \hat{C}}$ denote the learnable weights of the feature transformation. Then, we encode the relative positions of the query $Q$ and the key $K$:

$$\tilde{Q} = \mathrm{DPE}(Q), \quad \tilde{K} = \mathrm{DPE}(K) \tag{2}$$

where DPE denotes the relative position encoding operation. Next, modeling the contextual information of the input features based on the interactions among $\tilde{Q}$, $\tilde{K}$, and $V$ is a key issue in Transformer-like architectures. In the original Transformer, the dot-product attention mechanism leads to quadratic complexity, making it impractical to build a deep Transformer layer. One potential way to reduce the computational complexity is to summarize the attention matrix before modelling its interactions.

We introduce vector-based attention, which effectively models long-range interactions between pixel tokens, to alleviate this bottleneck. Instead of computing a quadratic attention map $Q K^{T}$ that encodes all possible interactions between candidate matches, we form a compact representation of the query-key interactions by computing the correlation between a global query vector and each key vector. Specifically, we first use an MLP to compute a weight $\tilde{Q}_{imp} \in \mathbb{R}^{1 \times N}$ over the query vectors:

$$\tilde{Q}_{imp} = \mathrm{Softmax}\big(\mathrm{MLP}(\tilde{Q})\big) \tag{3}$$

where Softmax denotes the softmax operation.

The global query vector $\bar{Q} \in \mathbb{R}^{1 \times \hat{C}}$ is set to a linear combination of $\tilde{Q}$:

$$\bar{Q} = \tilde{Q}_{imp}\, \tilde{Q} \tag{4}$$

where the product denotes matrix multiplication.

We then use element-wise multiplication between the global query vector $\bar{Q}$ and each key vector to model their interaction and obtain the context-aware key vectors $\tilde{K}_{\bar{Q}} \in \mathbb{R}^{N \times \hat{C}}$, which are in turn summarized into a global key vector $K_{\bar{Q}}$ and combined with the values:

$$\tilde{K}_{\bar{Q}} = \bar{Q} \odot \tilde{K}$$
$$K_{\bar{Q}} = \tilde{K}_{\bar{Q},imp}\, \tilde{K}_{\bar{Q}}$$
$$\Lambda = K_{\bar{Q}} \odot V \tag{5}$$

where $\odot$ denotes element-wise multiplication and $\tilde{K}_{\bar{Q},imp}$ denotes importance weights computed from $\tilde{K}_{\bar{Q}}$ analogously to Eq. (3).

Subsequently, we use MLP and shortcut structures to derive the global message $M \in \mathbb{R}^{N \times \hat{C}}$:

$$M = \mathrm{MLP}(\Lambda) + \tilde{Q} \tag{6}$$

For convenience, we define the process of the vector-based attention layer as follows:

$$M = \mathrm{VAtt}(U, R) \tag{7}$$
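For concreteness, a minimal PyTorch sketch of the vector-based attention in Eqs. (1)-(7) is given below. It is an interpretation of the equations rather than the released implementation: the relative position encoding DPE is left as an identity placeholder, the MLPs are single linear layers, and the key importance weights in Eq. (5) are assumed to be computed analogously to Eq. (3).

```python
import torch
import torch.nn as nn

class VectorAttention(nn.Module):
    """Linear-complexity vector-based attention, following Eqs. (1)-(7)."""

    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)  # W^Q in Eq. (1)
        self.wk = nn.Linear(dim, dim, bias=False)  # W^K
        self.wv = nn.Linear(dim, dim, bias=False)  # W^V
        self.q_score = nn.Linear(dim, 1)           # MLP producing query importances
        self.k_score = nn.Linear(dim, 1)           # assumed analogue for the keys
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.dpe = nn.Identity()                   # placeholder for the relative PE

    def forward(self, u: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # u, r: (B, N, C); self-attention uses r = u, cross-attention the other image.
        q, k, v = self.wq(u), self.wk(r), self.wv(r)        # Eq. (1)
        q, k = self.dpe(q), self.dpe(k)                     # Eq. (2)
        q_imp = torch.softmax(self.q_score(q), dim=1)       # Eq. (3): (B, N, 1)
        q_global = q_imp.transpose(1, 2) @ q                # Eq. (4): (B, 1, C)
        k_ctx = q_global * k                                # context-aware keys, (B, N, C)
        k_imp = torch.softmax(self.k_score(k_ctx), dim=1)
        k_global = k_imp.transpose(1, 2) @ k_ctx            # global key vector, (B, 1, C)
        lam = k_global * v                                  # Eq. (5): (B, N, C)
        return self.out(lam) + q                            # Eq. (6): global message M
```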

Feed-forward network (FFN)

Inspired by conventional Transformers, we apply a feed-forward network to $M$ to extract discriminative features for effective deep feature aggregation. The feed-forward network consists of two fully connected layers and a GELU activation function; the hidden dimension between the two fully connected layers is expanded by a scale rate $\gamma$ to learn a rich feature representation. This process is formulated as:

$$\mathrm{FFN}(U, M) = \mathrm{MLP}_{1/\gamma}\Big(\mathrm{GELU}\big(\mathrm{MLP}_{\gamma/3}([U \,\|\, M])\big)\Big) \tag{8}$$

Layer scale strategy

Intuitively, people obtain different information each time they observe an image, which inspires us to propose a layer-scale strategy. Specifically, following ResNet (He et al., Citationn.d.), we utilize a shortcut structure and design a learnable scaling factor $\xi$ to adaptively balance the original features $U$ and the enhanced message $\tilde{M}$, which is formulated as:

$$\tilde{U} = U + \xi \tilde{M} \tag{9}$$

By incorporating $\xi$ into SlimFormer, the network can easily simulate the human behavior of acquiring different matching cues each time an image pair is scanned.

Self-/cross-SlimFormer

In summary, SlimFormer is formulated as:

$$\mathrm{Slim}(U, R) = U + \xi\, \mathrm{FFN}\big(U, \mathrm{VAtt}(U, R)\big) \tag{10}$$

We apply SlimFormer $L$ times for feature enhancement. During the $l$-th enhancement, the self-/cross-attention mechanism integrates intra-/inter-image information.
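The sketch below shows how Eqs. (8)-(10) could wrap the attention layer above into one SlimFormer block, and how L self-/cross-attention rounds might be interleaved over the two feature sequences. It reuses the `VectorAttention` class from the previous sketch; the FFN expansion rate, the initial value of the layer-scale factor ξ, and the `aggregate` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SlimFormer(nn.Module):
    """One enhancement step: U <- U + xi * FFN(U, VAtt(U, R)), Eq. (10)."""

    def __init__(self, dim: int, expand: float = 2.0):
        super().__init__()
        self.attn = VectorAttention(dim)  # from the previous sketch
        hidden = int(dim * expand)
        # Eq. (8): two linear layers with GELU, applied to the concatenation [U || M].
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # Eq. (9): learnable layer-scale factor xi, initialised to a small value.
        self.xi = nn.Parameter(1e-2 * torch.ones(dim))

    def forward(self, u: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        m = self.attn(u, r)                                       # Eq. (7)
        return u + self.xi * self.ffn(torch.cat([u, m], dim=-1))  # Eqs. (8)-(10)

def aggregate(fa, fb, layers):
    """Interleave L self-/cross-attention rounds over the two feature sequences."""
    for self_blk, cross_blk in layers:  # list of (SlimFormer, SlimFormer) pairs
        fa, fb = self_blk(fa, fa), self_blk(fb, fb)    # intra-image context
        fa, fb = cross_blk(fa, fb), cross_blk(fb, fa)  # inter-image context
    return fa, fb
```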

Evaluation pipeline

This study follows the basic workflow of SfM-based image reconstruction, which is divided into two parts. In the first part, feature extraction is performed on each UAV image, initial matching is obtained from each pair of images, and then RANSAC is used to remove outliers. According to the workflow of feature matching, the feature evaluation is divided into two steps, which are feature extraction and feature matching. In the second part, the verified matches are fed into the classical SfM pipeline for sparse reconstruction to obtain accurate camera poses and scene structures. Subsequently, the results from sparse reconstruction are used to perform dense reconstruction.
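The first stage of this pipeline can be illustrated with a short OpenCV sketch: SIFT extraction, ratio-test matching, and RANSAC verification on the fundamental matrix; the verified matches would then be handed to an SfM tool for the second stage. File names are placeholders, and the thresholds are common defaults rather than the settings used in this evaluation.

```python
import cv2
import numpy as np

img1 = cv2.imread("uav_0001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("uav_0002.jpg", cv2.IMREAD_GRAYSCALE)

# 1) Feature extraction.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2) Initial matching with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.8 * n.distance]

# 3) Outlier removal: RANSAC on the fundamental matrix.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
inliers = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
print(f"{len(inliers)} verified matches out of {len(good)} putative matches")
```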

Table 1 presents a summary of the advantages and disadvantages of the classical handcrafted methods used in this paper.

Table 1. Comparison of commonly used handcrafted methods: SIFT, SURF, ORB, and AKAZE.

According to this pipeline, key metrics are chosen to evaluate the performance of detectors and descriptors, as outlined in Table 2. These metrics are categorized into two groups: matching metrics cover the feature extraction and matching procedure, and reconstruction metrics cover the SfM-based sparse reconstruction and the MVS-based dense reconstruction.

Table 2. Description of evaluation metrics.

Feature extraction and matching

Not only did we want to use a comprehensive set of metrics that allows our analysis to be related to existing work, but we also wanted to evaluate the parameters associated with feature-dependent algorithms. Therefore, we use three metrics: matching score, matching time, and the interval distribution of the number of matching pairs, as shown in Figure 5.

Figure 5. Main methods and evaluation metrics of feature extraction and matching.


Matching score

The matching score is the ratio between the number of verified inliers and the number of initial features extracted from the images. It describes how many of the initial features lead to a correct match; the matching score may be affected by ambiguous descriptors and by the matching criteria:

$$\text{Matching score} = \frac{\text{Correct Matches}}{\text{Features}} \tag{11}$$

Overall, the matching score describes the performance of the descriptor and is influenced by the robustness of the descriptor to transformations present in the data.

Interval distribution of the number of matching pairs: statistics on the distribution intervals of the number of matched pairs obtained by different methods on the same data reflect the feature extraction and matching performance of the methods more intuitively.

Matching time

The time spent on an image pair in the matching process is the sum of the time spent on feature extraction and on feature matching. The total time spent on all matching pairs is recorded here.
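For clarity, the three matching metrics can be expressed as small helper functions. The sketch below assumes that inlier counts, feature counts, and the matcher callable are supplied by whichever method is being evaluated; the interval edges follow the 0-100, 100-500, 500-1000, and >1000 buckets used later in the results.

```python
import time
from bisect import bisect_right

def matching_score(num_correct_matches: int, num_features: int) -> float:
    """Eq. (11): verified inliers divided by the number of extracted features."""
    return num_correct_matches / max(num_features, 1)

def interval_bucket(num_pairs: int) -> str:
    """Assign a match count to the intervals used in this evaluation."""
    edges = [100, 500, 1000]
    labels = ["0-100", "100-500", "500-1000", ">1000"]
    return labels[bisect_right(edges, num_pairs)]

def timed(match_fn, *args):
    """Matching time: wall-clock cost of feature extraction plus matching."""
    t0 = time.perf_counter()
    result = match_fn(*args)
    return result, time.perf_counter() - t0
```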

Sparse and dense reconstruction

After image matching, the sparse and dense reconstruction processes are evaluated as well. As shown in Figure 6, six separate metrics are adopted to evaluate the accuracy and completeness of the reconstruction results obtained with the different image matching methods.

Figure 6. Sparse reconstruction process and evaluation indicators.


Re-projection error

For sparse point cloud reconstruction, we judge the accuracy of relative positioning by calculating the reprojection error. The smaller the reprojection error, the more accurate the image matching algorithm and the relative positioning, which allows a model of higher accuracy to be reconstructed. After calibration, there is always a distance between the image position computed from the camera's projection matrix and the actual image position of each 3D point. We sum these errors to construct a least-squares problem and then find camera poses that minimize it:

$$\min_{B_j, A_i} \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij}\, \big\| P(B_j, A_i) - a_{ij} \big\|^2 \tag{12}$$

where $A_i$ and $B_j$ denote a 3D point and a camera, respectively; $P(B_j, A_i)$ is the predicted projection of point $A_i$ in camera $B_j$; $a_{ij}$ is the observed image point; $\|\cdot\|$ denotes the L2 norm; and $p_{ij}$ is an indicator with $p_{ij} = 1$ if point $A_i$ is visible in camera $B_j$ and $p_{ij} = 0$ otherwise.

The reprojection error accounts not only for the error of the estimated geometric transformation (e.g. the homography) but also for the measurement error of the image, making it suitable for evaluating sparse reconstruction results.
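As an illustration of Eq. (12), the sketch below computes the reprojection RMSE for a single camera under a pinhole model with a 3x4 projection matrix. The visibility indicator p_ij is handled implicitly by passing only the observed points, and the function name is hypothetical.

```python
import numpy as np

def reprojection_rmse(P, points3d, observations):
    """RMSE form of Eq. (12) for one camera.

    P            : (3, 4) projection matrix of camera B_j.
    points3d     : (n, 3) coordinates of the 3D points A_i.
    observations : (n, 2) observed image points a_ij (only points with p_ij = 1).
    """
    homog = np.hstack([points3d, np.ones((len(points3d), 1))])  # homogeneous coords
    proj = (P @ homog.T).T
    proj = proj[:, :2] / proj[:, 2:3]                           # perspective division
    residuals = np.linalg.norm(proj - observations, axis=1)     # ||P(B_j, A_i) - a_ij||
    return float(np.sqrt(np.mean(residuals ** 2)))
```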

Average track length

The average number of images in which a 3D point is observed (its track length) is often used to assess the quality and reliability of a 3D reconstruction: the more images in which a 3D point is observed, the longer its track. A point with observations in many different images can be triangulated with higher accuracy and stability. The average track length also reflects the sparsity of the 3D point cloud; if points are observed in only a few images, certain areas may lack sufficient viewpoints.
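Given a sparse model, the average track length is straightforward to compute; the sketch below assumes a COLMAP-style mapping from each 3D point to the set of images that observe it.

```python
def average_track_length(tracks):
    """tracks: dict mapping a 3D point id to the list of image ids observing it."""
    if not tracks:
        return 0.0
    return sum(len(images) for images in tracks.values()) / len(tracks)

# Example: three 3D points seen in 4, 2, and 3 images -> average track length 3.0.
print(average_track_length({0: [1, 2, 3, 4], 1: [2, 5], 2: [1, 3, 6]}))
```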

Number of registered images

This number affects the quality of the results, the efficiency of the computation, and the applicability of the algorithm. When multi-view images are used for 3D reconstruction, the number of registered images determines from how many viewpoints a particular 3D point can be observed. More registered images mean more information for reconstruction, which usually leads to a more accurate and complete 3D model.

Number of sparse point clouds

For sparse reconstruction, the number of sparse points is directly affected by the accuracy and completeness of feature matching and therefore directly reflects the performance of the feature detector and descriptor. The completeness of the reconstruction is thus evaluated by the number of sparse points.

Number of dense point clouds

For dense reconstruction, the number of dense points will be affected by the accuracy and completeness of the sparse reconstruction, which is further affected by the performance of the feature detectors and descriptors. Therefore, the number of dense points is used in this evaluation.

Reconstruction time

The time spent by the image in the reconstruction process, including the sparse reconstruction time and the dense reconstruction time.

By using the above metrics, the efficiency of the feature methods and the accuracy and completeness of the 3D reconstruction can be comprehensively assessed.

Results

Datasets

All four datasets were collected with a UAV platform; the scenes are shown in Figure 7.

Figure 7. Images of the four UAV scenes: (a) scene 1; (b) scene 2; (c) scene 3; (d) scene 4.


Six configurations were constructed for the four scenes. Data 1 and Data 4, consisting mainly of buildings, are repetitive textured data generated from Scene 1. The former is down-sampled data, while the latter is image patch data. Data 2, an ISPRS public dataset (Rottensteiner, Citationn.d.), was generated from Scene 2 and contains a large number of plants and several buildings. Data 3 is generated from Scene 3. It contains a library building with relatively consistent lighting conditions and small rotation angles between images. Data 5 and Data 6 are generated from Scene 4, the public SWJTU_BLD (Zhu et al., Citationn.d.) dataset. Data 5 is Ground-UAV data, whereas Data 6 is mixed Ground-UAV and UAV-UAV data. The specific data are described in Table 3.

Table 3. Description of the UAV test data.

For performance evaluation, all reconstruction experiments were performed on a Windows PC with a 3.6 GHz Intel Core i9-9900K CPU and a 4GB NVIDIA GeForce RTX 2080 Ti graphics card.

The deep learning matching method was instead performed on an Ubuntu PC with a 2.2 GHz Intel Xeon E5–2630 CPU and a 12GB NVIDIA GTX 1080 Ti graphics card.

In general, learned descriptors can be further enhanced by training them on UAV data. In this evaluation, we did not do so for two main reasons. First, existing benchmark datasets, such as Brown, contain wide-baseline images with viewpoint and illumination variations, which also represent common characteristics of UAV images. Second, some CNN models rely on synthetic datasets for pre-training intermediate sub-models, and different learning models require different data formats, e.g. patch-based and image-based datasets. For comparability, the pre-trained models and their released source code were therefore used without retraining, which also allows their adaptation to UAV images to be verified. It is worth noting that the default parameters of all the chosen methods were used to make the evaluation as fair as possible. There are many possible reasons for a reconstruction failure, such as non-optimized parameter configurations and unsuitable datasets; in other words, we only consider performance on UAV images without further parameter tuning.

Also, to ensure data consistency, we use full images as input for all methods instead of the traditional patch-based inputs, and we resized all datasets so that every method works properly. The methods used are listed in Table 4.

Table 4. Detailed information on the assessment methodology.

Comparison and evaluation of matching results

To verify the robustness and accuracy of the compared matching methods, we use the six datasets derived from the four UAV scenes described above for point matching with the 12 methods listed in Table 4. The matching scores are also visualized as histograms, as shown in Figure 8.

Figure 8. Comparison of matching scores datasets.


Table 5. Matching score.

Matching score

The performance of detectors and descriptors is first evaluated by feature matching, using the matching score as the comparison metric. The results are displayed in Table 5, where the first-, second-, and third-ranked values are shown in red, blue, and purple, respectively. The table shows that different methods cope with complex, realistic scenarios in different ways under different data conditions, leading to different matching scores; this illustrates the importance of selecting a method appropriate for the specific data being analysed. In Data 1, DogAffnet-Hard, CapsSuperPoint, and CapsSIFT achieved better matching scores of 0.4146, 0.2, and 0.2819, respectively. This suggests that the deep learning approaches are better at feature detection for a simple single building than the classical approaches, resulting in more match pairs; however, further investigation is needed to determine whether this translates into better reconstruction results. For Data 2, with a high amount of vegetation coverage, CapsSuperPoint, D2Net, and CapsSIFT achieved the highest scores with 0.2402, 0.2255, and 0.2188, respectively. It is evident that for highly repetitive textures, all methods exhibit varying degrees of score decline, which highlights the limitations of existing feature detection methods on repetitive textures; compared to the deep learning-based methods, the classical methods are less effective at detecting features on such textures.

The limitations of current feature detection methods on repetitive textures are thus highlighted, with the deep learning methods outperforming the classical methods in terms of detection results and feature pairs. In the case of Data 3, which contains numerous buildings and road networks, all detection methods struggle with the same issue of repetitive texture, suggesting that existing feature detection methods lack the necessary discrimination for large building and road networks. Nevertheless, CapsSIFT, CapsSuperPoint, and D2Net achieved higher scores on Data 3, with 0.2477, 0.2451, and 0.2084, respectively. For Data 5 and Data 6, which combine ground and aerial views, the effectiveness of all detection methods decreased, although the deep learning methods performed better. Comparing the matching results across these datasets shows that the deep learning approaches outperform the classical approaches in scenes of higher complexity and are better suited to realistic scenes.

Average track length

The statistical results for average track length are listed in Table 6, and a visualized histogram is shown in Figure 9. Considering the Tie_pt metric, SIFT and its two variants, RootSIFT and DSP-SIFT, are the best descriptors among all evaluated methods, meaning that these three methods provide the largest and strongest sets of matches for the subsequent SfM reconstruction. Combined with the number of sparse 3D points, however, this shows that SIFT does not reconstruct more sparse points even though it provides the largest and strongest matches. AKAZE, in contrast, provides an essentially complete sparse reconstruction with more points than SIFT as well as large and strong matches, since it produces more sparse 3D points and a long average track length on Data 1, Data 2, Data 3, and Data 6. Considering the interval distribution of the number of match pairs, SIFT has the largest number of corresponding feature points in the 0-100 interval, followed by the 100-500 interval, whereas AKAZE has the largest number in the 100-500 interval, followed by the 0-100 interval. For the images in the UAV datasets used in this paper, matching results with 100-500 corresponding feature points have the greatest impact on the reconstruction: 100-500 correspondences between images both ensure enough 2D correspondences and allow a more complete point cloud model to be built without consuming excessive computational resources.

Figure 9. Average track length.


Table 6. Average track length.

Qualitative image-matching results

Figures 10 and 11 show the qualitative image matching results for Data 1 and Data 6, which visualize the matching performance of the different methods. The first and second columns in Figures 10 and 11 represent different image pairs, showing the same and different sides of the building, respectively. The qualitative analysis shows that the classical methods produce fewer matching lines, concentrated on the main building, and therefore fewer outliers; however, the low number of feature correspondences makes it difficult to obtain more sparse 3D points. LoFTRS, on the other hand, produces a large number of feature correspondences with high accuracy. Compared to CapsSuperPoint, D2Net, and LoFTR, LoFTRS produces fewer but more accurate correspondences; compared to SIFT, SURF, ORB, AKAZE, and R2D2, it generates significantly more correspondences. This enables more sparse 3D points to be obtained and leads to a more complete 3D model during reconstruction. Taking both accuracy and robustness into consideration, LoFTRS achieves the best performance among both handcrafted and deep learning-based approaches.

Figure 10. Data 1 qualitative image matching results.


Figure 11. Data 6 qualitative image matching results.


Distribution of match pairs

The distribution of matching pairs for the different methods can be obtained by counting the number of matching pairs produced by each method. Most of the classical methods are concentrated below 500 pairs, while most of the learning-based methods are concentrated above 500 pairs; deep learning methods obtain deeper image features through convolution and therefore find more matching features than the classical methods. Depending on the number of corresponding feature points finally matched, we categorize the results into four intervals: 0-100, 100-500, 500-1000, and above 1000.

Data 1 and Data 4 are characterised by consistent lighting conditions, large rotation angles between images, less than 50% overlap, high noise, and extreme similarity between different sides of the building, which can produce more outliers during feature matching. Based on the experimental results in Table 7 and Figure 12, AKAZE, LoFTRS, SURF, SIFT, Patch2Pix, and LoFTR are advantageous methods for UAV data with low overlap. In addition, the LoFTRS matches are evenly distributed across every interval, showing that LoFTRS is robust for UAV building data with consistent lighting conditions, large rotation angles, and extreme similarity between different sides of the building.

Table 7. Interval distribution of the number of match pairs.

Figure 12. Distribution of intervals for each data.



Data 2 comprises a significant number of plants and multiple buildings. Matching features in image datasets that contain green plants has proven to be a challenging task. AKAZE, LoFTRS, SURF, SIFT, Patch2Pix, and LoFTR are methods that obtain corresponding feature points in different intervals of the matching result. Based on the experimental results, LoFTRS is evenly distributed across every interval, indicating its robustness for UAV building data with many plants and several buildings. Additionally, LoFTRS produces the highest number of correspondences in the 100-500 and 500-1000 intervals simultaneously, which is advantageous for 3D reconstruction.

Data 3 contains a library building captured under relatively consistent lighting conditions, with small rotation angles between images and over 80% overlap. Based on the experimental results, LoFTRS is evenly distributed across every interval, indicating its robustness for UAV building data with consistent lighting conditions and small rotation angles between images.

Data 5 is Ground-UAV data, whereas Data 6 is mixed Ground-UAV and UAV-UAV data. It is important to note that the SWJTU ground-view images were not acquired on the ground; they were also captured by a low-cost drone in vertical flight. However, because they have similar characteristics to other ground-view images, they are treated as ground-view images. Based on the experimental results, only LoFTRS is evenly distributed across every interval, indicating its robustness for Ground-UAV and UAV-UAV data. Additionally, LoFTRS produces the highest number of correspondences in the 100-500 and 500-1000 intervals simultaneously, which is advantageous for 3D reconstruction.

Efficiency

When analysing the matching results, it is essential to consider the time spent. Table 8 lists the matching time statistics for the six datasets, and Figure 13 visualizes the results. Generally, as the data volume increases, the time spent by each method also increases; however, the time spent is also affected by factors other than the amount of data, such as image size, parameter settings, and the operating platform. Compared to the traditional classical methods, the deep learning methods generally require more time, likely due to the time-consuming process of acquiring feature points and performing convolutions over pixels to obtain features. For small volumes of single-building data, AKAZE and ORB achieve the best times among the traditional methods, possibly due to the acceleration schemes they employ, whereas in the joint ground-air scenario, ORB and SURF are more time-efficient. Among the deep learning methods, LoFTRS requires the least amount of time and is more efficient than SIFT for Data 1.

Figure 13. Comparison of matching times datasets.


Table 8. Matching time.

Comparison of sparse reconstruction

Number of sparse points

The results are presented in Table 9 and visualized in Figure 14, with the top three values highlighted in red, blue, and purple. For the simple monolithic building in Data 1, R2D2, AKAZE, and SURF obtained the largest numbers of sparse points, with 25,577, 12,788, and 10,188, respectively. In general, the handcrafted methods are better suited to obtaining sparse points on this type of data, linking point pairs once a sufficiently large number of sparse points has been obtained; the matched pairs acquired by the deep learning methods, on the other hand, lack sufficient relevance, resulting in fewer sparse points. For Data 2, with high vegetation coverage and repetitive textures, SIFT, AKAZE, and LoFTRS obtain more sparse points than the other methods, with 28,759, 21,949, and 18,843, respectively. For Data 4, LoFTRS, LoFTR, and SIFT obtain the largest numbers of sparse points, with 76,516, 63,184, and 31,602, respectively.

Table 9. Number of sparse points and reprojection error.

Figure 14. Comparison of the number of sparse points in each data.


Reprojection error

Table 9 displays the reprojection error results for each method, with smaller values indicating greater accuracy, and Figure 15 visualizes the results. For Data 1, LoFTR, CapsSuperPoint, and SURF achieve more accurate results; for Data 3, ORB, D2Net, and R2D2; and for Data 4, LoFTR, ORB, and SURF. As the complexity of the data scenario increases, the deep learning-based approaches obtain feature points more accurately, resulting in more matches to some extent. In general, LoFTRS achieves a more accurate sparse reconstruction than SIFT.

Figure 15. Comparison of reprojection errors for each data.


Number of registered images

The number of registered images determines how much of the image data is included in the sparse point cloud; for the same dataset, a greater number of registered images yields a more comprehensive sparse reconstruction. The visualization is shown in Figure 16. Table 10 demonstrates that the handcrafted methods perform well for Data 1, while among the deep learning-based methods only LoFTRS and R2D2 can reconstruct a complete point cloud. The results indicate that the deep learning methods generate more matching pairs for texture-similar data, but most of them are incorrect, resulting in poor completeness of the sparse reconstruction. Additionally, Data 4 registers more images than Data 1 for the same scene; comparing the numbers of sparse points shows that image patch data retain more image features than down-sampled data, particularly for the deep learning-based LoFTRS. LoFTRS generally registers the same number of images as SIFT or more, as demonstrated on Data 1, Data 2, Data 3, and Data 4.

Figure 16. Number of registered images.


Table 10. Number of registered images.

Efficiency

Table 11 lists the time spent on sparse reconstruction, and Figure 17 shows the corresponding visualization. For the handcrafted methods, the reconstruction of the monolithic building data takes significantly longer than the matching process. Compared to the handcrafted methods, LoFTRS requires less time for single-building reconstruction. In general, LoFTRS achieves the best efficiency among the deep learning-based methods and is more efficient than AKAZE.

Figure 17. Comparison of sparse reconstruction time cost datasets.


Table 11. Sparse reconstruction time.

Comparison of dense reconstruction

Number of dense points

The number of dense points is significantly affected by the number of sparse points, as well as by other factors such as the number of isolated points. It is important to note that some learned methods, such as Patch2Pix and LoFTR, do not perform well on sparse points, resulting in no output for dense reconstruction. Table 12 presents the dense point results for all methods, with the top three values highlighted in red, blue, and purple, and Figure 18 visualizes one of the results. Considering Data 1, Data 2, Data 3, and Data 4, SIFT and AKAZE achieve greater numbers of dense points, while among the deep learning-based methods only LoFTRS achieves a large number of dense points.

Figure 18. Comparison of the number of dense points for each data.


Table 12. Number of dense points.

Efficiency

Dense reconstruction is a major time cost for every method. The dense reconstruction results for all methods are reported in Table 13 and visualized in Figure 19; methods with no dense results are not counted in the timing. As can be seen from the table, the traditional methods spend more time in this step, which may be because they acquire more isolated points and feature points, and these take more time in dense reconstruction. The time spent by LoFTRS is lower than that of the classical methods, excluding the learning methods that cannot obtain dense results.

Figure 19. Comparison of time cost of dense reconstruction for each data.


Table 13. Dense reconstruction time.

Conclusions

In this work, we present LoFTRS, a deep learning-based image matching framework that integrates semantic constraints into the matching process. We evaluate the performance of LoFTRS compared to various popular handcrafted and deep learning-based methods, and we investigate the relationship between matching quality and the performance of subsequent processing steps, such as the accuracy and completeness of the model generated by SfM. The experimental results show that the proposed LoFTRS achieves equal or better image matching performance in terms of matching score, average track length, RMSE, and the number of 3D points. In the future, we aim to investigate whether other deep learning-based methods can achieve comparable or superior performance to handcrafted methods after retraining.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Due to the commercial nature of the research, the supporting data are not available.

References

  • Arandjelović, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 2911–23). IEEE.
  • Balntas, V., Johns, E., Tang, L., & Mikolajczyk, K. (2016). PN-Net: Conjoined triple deep network for learning local image descriptors. ArXiv preprint arXiv:1601.05030. https://doi.org/10.48550/arXiv.1601.05030
  • Balntas, V., Riba, E., Ponsa, D., & Mikolajczyk, K. (2016). Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference 2016, 1, 3.
  • Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: Speeded up robust features. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9 (pp. 404–417). Springer.
  • Bhowmik, A., Gumhold, S., Rother, C., & Brachmann, E. (2020). Reinforced feature points: Optimizing feature detection and description for a high-level task. Proceedings of the IEEE/CVF Conference on Computer Vision and pattern recognition (pp. 4948–4957).
  • Chiabrando, F., D’Andria, F., Sammartano, G., & Spanò, A. (2018). UAV photogrammetry for archaeological site survey. 3D models at the Hierapolis in Phrygia (Turkey). Virtual Archaeology Review, 9(18), 28–43. https://doi.org/10.4995/var.2018.5958
  • Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR’05) (Vol. 1, pp. 539–546). IEEE.
  • Daakir, M., Pierrot-Deseilligny, M., Bosser, P., Pichard, F., Thom, C., Rabot, Y., & Martin, O. (2017). Lightweight UAV with on-board photogrammetry and single-frequency GPS positioning for metrology applications. ISPRS Journal of Photogrammetry and Remote Sensing, 127, 115–126. https://doi.org/10.1016/j.isprsjprs.2016.12.007
  • DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). Superpoint: Self-supervised interest point detection and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 224–236).
  • Dong, J., & Soatto, S. (2015). Domain-size pooling in local descriptors: DSP-SIFT. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5097–5106).
  • Habib, A., Han, Y., Xiong, W., He, F., Zhang, Z., & Crawford, M. (2016). Automated ortho-rectification of UAV-based hyperspectral data over an agricultural field using frame RGB imagery. Remote Sensing, 8(10), 796.
  • Han, X., Leung, T., Jia, Y., Sukthankar, R., & Berg, A. C. (2015). Matchnet: Unifying feature and metric learning for patch-based matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3279–3286).
  • Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Alvey Vision Conference (Vol. 15, pp. 10–5244). Citeseer.
  • He, K., Zhang, X., Ren, S., & Sun, J. (n.d). Deep residual learning for image recognition. Retrieved from http://image-net.org/challenges/LSVRC/2015/
  • Huang, P.-H., Matzen, K., Kopf, J., Ahuja, N., & Huang, J.-B. (2018). Deepmvs: Learning multi-view stereopsis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2821–2830).
  • Hu, X., & Mordohai, P. (2012). Least commitment, viewpoint-based, multi-view stereo. Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission (pp. 531–538). IEEE.
  • Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial transformer networks. Advances in Neural Information Processing Systems, 28. https://doi.org/10.48550/arXiv.1506.02025
  • Jegou, H., Douze, M., & Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part I 10 (pp. 304–317). Springer.
  • Jiang, S., & Jiang, W. (2017). Efficient structure from motion for oblique UAV images based on maximal spanning tree expansion. ISPRS Journal of Photogrammetry and Remote Sensing, 132, 140–161. https://doi.org/10.1016/j.isprsjprs.2017.09.004
  • Jiang, S., Jiang, W., Huang, W., & Yang, L. (2017). UAV-based oblique photogrammetry for outdoor data acquisition and offsite visual inspection of transmission line. Remote Sensing, 9(3), 278. https://doi.org/10.3390/rs9030278
  • Jiang, W., Song, Y., Leung, T., Rosenberg, C., & Wang, J. (2014). Learning fine-grained image similarity with deep ranking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1386–1393).
  • Jin, Y., Mishkin, D., Mishchuk, A., Matas, J., Fua, P., Yi, K. M., & Trulls, E. (2021). Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2), 517–547. https://doi.org/10.1007/s11263-020-01385-0
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
  • Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L. M., & Shum, H.-Y. (n.d.). Mask DINO: Towards a unified transformer-based framework for object detection and segmentation. Retrieved from https://github.com/IDEA-
  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  • Luo, Z., Shen, T., Zhou, L., Zhang, J., & Yao, Y. (2019). Contextdesc: Local descriptor augmentation with cross-modality context. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2527–2536).
  • Luo, Z., Zhou, L., Bai, X., Chen, H., & Zhang, J. (2020). Aslfeat: Learning local features of accurate shape and localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6589–6598).
  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. https://doi.org/10.48550/arXiv.1411.1784
  • Mishchuk, A., Mishkin, D., Radenovic, F., & Matas, J. (2017). Working hard to know your neighbor’s margins: Local descriptor learning loss. Advances in Neural Information Processing Systems, 30. https://doi.org/10.48550/arXiv.1705.10872
  • Ono, Y., Trulls, E., Fua, P., & Yi, K. M. (2018). LF-Net: Learning local features from images. Advances in Neural Information Processing Systems, 31. https://doi.org/10.48550/arXiv.1805.09662
  • Pajares, G. (2015). Overview and current status of remote sensing applications based on unmanned aerial vehicles (UAVs). Photogrammetric Engineering & Remote Sensing, 81(4), 281–330. https://doi.org/10.14358/PERS.81.4.281
  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). IEEE.
  • Radenovic, F., Schonberger, J. L., Ji, D., Frahm, J.-M., Chum, O., & Matas, J. (2016). From dusk till dawn: Modeling in the dark. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5488–5496).
  • Rottensteiner, F. (n.d.). ISPRS test project on urban classification and 3D building reconstruction: Evaluation of building reconstruction results.
  • Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. Proceedings of the 2011 International Conference on Computer Vision (pp. 2564–2571). IEEE.
  • Sarlin, P.-E., DeTone, D., Malisiewicz, T., & Rabinovich, A. (2020). SuperGlue: Learning feature matching with graph neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4938–4947). https://doi.org/10.1109/cvpr42600.2020.00499
  • Sattler, T., Havlena, M., Schindler, K., & Pollefeys, M. (2016). Large-scale location recognition and the geometric burstiness problem. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1582–1590).
  • Sattler, T., Leibe, B., & Kobbelt, L. (2011). Fast image-based localization using direct 2d-to-3d matching. Proceedings of the 2011 International Conference on Computer Vision (pp. 667–674). IEEE.
  • Schonberger, J. L., & Frahm, J.-M. (2016). Structure-from-motion revisited. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4104–4113).
  • Schonberger, J. L., Hardmeier, H., Sattler, T., & Pollefeys, M. (2017). Comparative evaluation of hand-crafted and learned local features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1482–1491).
  • Schonberger, J. L., Radenovic, F., Chum, O., & Frahm, J.-M. (2015). From single image query to detailed 3d reconstruction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5126–5134).
  • Schönberger, J. L., Zheng, E., Frahm, J.-M., & Pollefeys, M. (2016). Pixelwise view selection for unstructured multi-view stereo. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14 (pp. 501–518). Springer.
  • Sedaghat, A., & Ebadi, H. (2015). Remote sensing image matching based on adaptive binning SIFT descriptor. IEEE Transactions on Geoscience and Remote Sensing, 53(10), 5283–5293. https://doi.org/10.1109/TGRS.2015.2420659
  • Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 806–813).
  • Shen, X., Wang, C., Li, X., Yu, Z., & Li, J. (2019). RF-Net: An end-to-end image matching network based on receptive field. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8132–8140).
  • Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-Noguer, F. (2015). Discriminative learning of deep convolutional feature point descriptors. Proceedings of the IEEE International Conference on Computer Vision (pp. 118–126).
  • Sun, Y., Sun, H., Yan, L., Fan, S., & Chen, R. (2016). RBA: Reduced Bundle Adjustment for oblique aerial photogrammetry. ISPRS Journal of Photogrammetry and Remote Sensing, 121, 128–142. https://doi.org/10.1016/j.isprsjprs.2016.09.005
  • Tareen, S. A. K., & Saleem, Z. (2018). A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK. In 2018 International Conference on Computing, Mathematics and Engineering Technologies: Invent, Innovate and Integrate for Socioeconomic Development, iCoMET 2018 - Proceedings (Vol. 2018-January, pp. 1–10). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICOMET.2018.8346440
  • Tian, Y., Fan, B., & Wu, F. (2017). L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 661–669).
  • Tola, E., Lepetit, V., & Fua, P. (2009). Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 815–830. https://doi.org/10.1109/TPAMI.2009.77
  • Tolias, G., Avrithis, Y., & Jégou, H. (2016). Image search with selective match kernels: Aggregation across single and multiple images. International Journal of Computer Vision, 116(3), 247–261. https://doi.org/10.1007/s11263-015-0810-4
  • Torii, A., Arandjelovic, R., Sivic, J., Okutomi, M., & Pajdla, T. (2015). 24/7 place recognition by view synthesis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1808–1817).
  • Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., & Fragkiadaki, K. (2017). SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804. https://doi.org/10.48550/arXiv.1704.07804
  • Wang, J., Zhou, F., Wen, S., Liu, X., & Lin, Y. (2017). Deep metric learning with angular loss. Proceedings of the IEEE International Conference on Computer Vision (pp. 2593–2601).
  • Wei, X., Zhang, Y., Gong, Y., & Zheng, N. (2018). Kernelized subspace pooling for deep local descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1867–1875).
  • Yi, K. M., Trulls, E., Lepetit, V., & Fua, P. (2016). LIFT: Learned invariant feature transform. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14 (pp. 467–483). Springer International Publishing.
  • Yi, K. M., Verdie, Y., Fua, P., & Lepetit, V. (2016). Learning to assign orientations to feature points. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 107–116).
  • Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4353–4361).
  • Zeisl, B., Sattler, T., & Pollefeys, M. (2015). Camera pose voting for large-scale image-based localization. Proceedings of the IEEE International Conference on Computer Vision (pp. 2704–2712).
  • Zhu, Q., Wang, Z., Hu, H., Xie, L., Ge, X., & Zhang, Y. (2020). Leveraging photogrammetric mesh models for aerial-ground feature point matching toward integrated 3D reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 166, 26–40. https://doi.org/10.1016/j.isprsjprs.2020.05.024