Review Article

UAV image matching from handcrafted to deep local features

Article: 2307619 | Received 25 Sep 2023, Accepted 16 Jan 2024, Published online: 21 Feb 2024

ABSTRACT

Local feature matching between images is a challenging task, particularly when there are significant appearance variations, such as extreme viewpoint changes. In this work, we present LoFTRS, a deep learning-based image matching framework that integrates semantic constraints into the matching process. Our key insight is that a local feature matcher with deep layers can capture more human-intuitive and easier-to-match features. In addition to the image segmentation module, we propose a detector-free Transformer module. It uses vector-based attention to model the relevance among all features and achieves efficient and effective long-range context aggregation. The Transformer module applies a relative position encoding to explicitly encode relative distance information, further improving the feature representation. We evaluate the performance of LoFTRS compared to various popular handcrafted and deep learning-based methods, and we investigate the relationship between matching quality and the performance of subsequent processing steps, such as the accuracy and completeness of the model generated by SfM. The experimental results show that the proposed LoFTRS achieves equal or better image matching performance in terms of matching score, average track length, RMSE, and the number of 3D points.

Introduction

Unmanned Aerial Vehicles (UAVs) have been used on a large scale as new remote sensing platforms for oblique photogrammetry and computer vision (Pajares, Citation2015). With the advantages of flexible data acquisition and low economic cost, high-resolution and multi-view images can be provided by carrying various non-surveying cameras for different application areas (Daakir et al., Citation2017), including object recognition (Chiabrando et al., Citation2018), route inspection (Jiang et al., Citation2017), and agricultural planning (Habib et al., Citation2016). Once the data have been acquired, local feature matching between images is a critical step in many computer vision tasks (Jiang & Jiang, Citation2017), for example Structure from Motion (SfM) and Multi-View Stereo (MVS) (Radenovic et al., Citation2016; Schonberger & Frahm, Citation2016; Schonberger et al., Citation2015; Schönberger et al., Citation2016), image retrieval (Philbin et al., Citation2007; Sattler et al., Citation2016; Tolias et al., Citation2016; Torii et al., Citation2015), and image-based localization (Sattler et al., Citation2011, Citation2016; Zeisl et al., Citation2015).

When performing 3D reconstruction of real scenes without known camera poses or scene structure, a variety of problems can degrade the quality and accuracy of the result. Real-world data usually contain noise, illumination variations, and distortions, which lead to inaccuracies in the generated 3D models. Occlusion between objects can leave certain regions missing, affecting the completeness of the reconstruction. Illumination changes across acquisition times or conditions introduce color and luminance differences between images, which in turn affect the consistency of the reconstruction (Vijayanarasimhan et al., Citation2017). Some objects lack sufficient texture, resulting in a lack of detail in the reconstructed model. Camera or object motion during shooting can blur images, degrading depth estimation and matching (Huang et al., Citation2018). Inaccurate camera parameter estimation leads to geometric transformation errors that affect the accuracy of the reconstruction. Finally, processing large-scale real-scene datasets demands substantial computational resources, and data management and processing can become complex.

Over the last decades, the Scale-Invariant Feature Transform (SIFT) (Lowe, Citation2004) and its variants, such as RootSIFT (Arandjelović & Zisserman, Citation2012) and DSP-SIFT (Dong & Soatto, Citation2015), have been the most widely used local features in image matching due to their invariance to changes in scale, rotation, and illumination, as well as their high robustness to viewpoint changes. To handle different data and improve the performance of local features, several methods have been developed for specific applications in photogrammetry and remote sensing, such as L2-SIFT (Sun et al., Citation2016) and AB-SIFT (Sedaghat & Ebadi, Citation2015).

Recently, neural networks have been widely applied in computer vision, e.g. in target detection and recognition (Krizhevsky et al., Citation2017; Sharif Razavian et al., Citation2014). They have also been applied to feature descriptor learning (Simo-Serra et al., Citation2015) to obtain more discriminative representations for local features, e.g. SuperPoint (DeTone et al., Citation2018) and D2Net.

The results show a significant improvement over traditional handcrafted methods such as SIFT, Speeded Up Robust Features (SURF) (Bay et al., Citation2006), and DAISY (Tola et al., Citation2009). However, these methods are usually trained and evaluated on the Brown benchmarks for patch verification and classification, as described in (Schonberger et al., Citation2017) and (Jin et al., Citation2021), so their intrinsic objective is not the same as that of image matching. Meanwhile, SfM-based workflows include many modular steps, such as feature extraction, descriptor matching, outlier elimination, and Bundle Adjustment (BA). Local feature matching is only a precursory step for subsequent image processing, and good results in this step do not guarantee a substantial improvement in the final reconstructed model. Therefore, evaluating its performance within the full SfM pipeline is reasonable and necessary (Jin et al., Citation2021).

However, to the best of our knowledge, no such evaluation has been performed for feature extraction and matching in SfM-based orientation and MVS-based reconstruction of UAV images. Although comprehensive evaluations have been performed in (Jin et al., Citation2021) and (Schonberger et al., Citation2017), they use community-sourced imagery with poor spatial resolution, some of the newer learning-based features were not evaluated (Schonberger et al., Citation2017), and the amount of data used for SfM-based orientation in (Jin et al., Citation2021) is small, so these studies are not adapted to the practical needs of UAV imagery.

In this paper, we conduct a comprehensive experimental evaluation of learned and handcrafted feature descriptors to better understand their performance. Specifically, the paper makes the following contributions:

We perform a comprehensive evaluation of classical handcrafted and deep learning-based methods for feature matching and point cloud reconstruction.

Compared to previous evaluations, we conduct a more detailed study of the matching performance of different descriptors using a wider range of evaluation criteria and scenarios. As illustrated in , we propose a deep learning-based image matching framework called LoFTRS, which incorporates semantic constraints into the deep learning-based matching process. We analyze the performance of the matching process and compare the impact of the Semantic Deep Matcher on image-based reconstruction against various popular handcrafted and deep learning-based methods. In particular, we also explore the connection between matching quality and the performance of subsequent processing steps, e.g. the accuracy and completeness of the model generated by SfM.

Related work

This study deals with the performance evaluation of hand-crafted and deep learning-based features in the context of SfM-based image reconstruction. Figure 1 illustrates the overall flow of the reconstruction. In this section, we therefore review work related to local feature matching, covering both hand-crafted and deep learning-based features.

Figure 1. Reconstruction process.


Hand-crafted features

Before deep learning-based features emerged, traditional local features, also known as hand-crafted features, played an important part in image matching. Even today, in some specific environments, deep learning methods are not as effective as hand-crafted features. The advantages and disadvantages of the hand-crafted features (Tareen & Saleem, Citation2018) used in this article are summarized in Table 1.

A good local feature detector should detect distinguishable features and satisfy the covariance constraint, i.e. repeatedly detect consistent features under different transformations. Typical representative algorithms include the following:

Harris operator (Harris & Stephens, Citation1988): a classical method for accurate detection of corner features that analyzes the local second-moment matrix of the image and constructs a function describing the change in information under window translation. The Harris operator is rotation and contrast invariant but lacks scale invariance and is limited by the choice of threshold value.

Scale-invariant feature transform (SIFT) (Lowe, Citation2004): uses the Difference of Gaussians (DoG) as a detector to select keypoints in a scale-space pyramid. SIFT features are scale invariant and can be generated even for small objects. However, SIFT requires a significant amount of memory, especially when working with high-resolution images or large databases, which can be a challenge in resource-constrained environments.

Speeded Up Robust Features (SURF): Bay's SURF (Bay et al., Citation2006) extracts and describes features in a more efficient way. SURF uses an approximation of the determinant of the Hessian matrix of the image; where the Hessian response attains a local maximum, the current point is brighter or darker than its surrounding neighborhood, and the keypoint position is located there. However, SURF is not robust enough to adapt to changes in scale and rotation.

Oriented FAST and Rotated BRIEF (ORB): ORB (Rublee et al., Citation2011) has a much better runtime than SIFT and SURF and can be used for real-time feature detection. ORB is based on the FAST corner detector and the BRIEF descriptor and is robust to noise and affine viewpoint changes. Its disadvantage is a relatively low ability to cope with scale changes.

AKAZE: Alcantarilla (Alcantarilla & Solutions, 2011) uses Fast Explicit Diffusion (FED) to construct the nonlinear scale space faster than any other nonlinear model at the time, while being more accurate than AOS schemes. An efficient Modified Local Difference Binary descriptor (M-LDB) is introduced, which exploits the gradient information of the FED-built scale space to add distinctiveness. The AKAZE algorithm is faster than SIFT and SURF, while its repeatability and robustness are greatly improved over ORB.

Based on the comparison of SIFT, SURF, ORB, and AKAZE (Tareen & Saleem, Citation2018), the following conclusions can be drawn:

The quantitative comparison shows that the order of feature-detector-descriptors for detecting a high quantity of features is:

ORB>SURF>SIFT>AKAZE>KAZE.

The order of computational efficiency of feature detection and description per feature-point is:

ORB>ORB1000>SURF64D>SURF128D>AKAZE>SIFT>KAZE.

The order of efficient feature-matching per feature-point is:

ORB1000>AKAZE>KAZE>SURF64D>ORB>SIFT>SURF128D.

The feature detector descriptors can be ranked according to their speed of total image matching as follows:

ORB1000>AKAZE>KAZE>SURF64D>SIFT>ORB>SURF128D.
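For reference, this kind of detector comparison can be reproduced on one's own UAV imagery with OpenCV. The sketch below is only a minimal illustration, not the evaluation protocol of this paper: it counts keypoints and measures per-keypoint detection-plus-description time for SIFT, ORB, and AKAZE, and adds SURF only when an opencv-contrib build is available. The image path is a placeholder.

```python
import time

import cv2

# Placeholder input: any UAV image, loaded in grayscale.
img = cv2.imread("uav_frame.jpg", cv2.IMREAD_GRAYSCALE)

detectors = {
    "SIFT": cv2.SIFT_create(),
    "ORB": cv2.ORB_create(nfeatures=5000),
    "AKAZE": cv2.AKAZE_create(),
}
# SURF is patented and only shipped in opencv-contrib builds.
if hasattr(cv2, "xfeatures2d"):
    detectors["SURF"] = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

for name, det in detectors.items():
    t0 = time.perf_counter()
    kpts, desc = det.detectAndCompute(img, None)
    dt = time.perf_counter() - t0
    # Report the two quantities ranked above: feature count and per-feature cost.
    print(f"{name}: {len(kpts)} keypoints, "
          f"{1e6 * dt / max(len(kpts), 1):.1f} microseconds per keypoint")
```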

Deep learning-based local features

Learning descriptors with deep networks is often framed as a supervised learning problem. The goal is to learn a representation that brings matching features as close together as possible in the measurement space while keeping mismatched features apart.

Descriptor learning is also known as patch matching because it often uses local patches that have been cropped and centered on keypoints. In general, existing methods take two forms, namely metric learning (Wang et al., Citation2017) and descriptor learning (Luo et al., Citation2019), and the two are usually used together. Using original patches or precomputed descriptors as input, metric learning methods train a discriminative metric for similarity evaluation. In contrast, descriptor learning typically creates descriptor representations from raw images or patches.

Deep learning-based descriptors can be seen as an extension of classical learned descriptors (Schonberger et al., Citation2017). For example, recent deep methods have adopted and modified the Siamese structure (Chopra et al., Citation2005) and often use loss functions such as hinge, Siamese, triplet, ranking, and contrastive losses. More precisely, Zagoruyko and Komodakis (Citation2015) introduced deep patch comparison and showed how to learn a general patch similarity function directly from raw image pixels. In this case, the similarity function is encoded with CNN models (Krizhevsky et al., Citation2017) of different Siamese types, which are then trained to distinguish between positive and negative image patches. Siamese networks with shared or unshared weights and a central-surround form are two of several architectures that have been tried. To learn both descriptors and metrics, MatchNet (Han et al., Citation2015) was proposed, implemented with a Siamese-like description network and a fully convolutional decision network, which greatly improved description performance. Jiang et al. (Citation2014) proposed a deep ranking method to discover fine-grained image similarities. The model uses a triplet hinge loss and a ranking function to define fine-grained image similarity relations, and a multi-scale network design captures both global visual features and image semantics; Siamese and triplet networks were used for implementation, combining triplet sampling with a global training loss. TFeat (Balntas et al., Citation2016) suggested the use of triplets of training samples for CNN-based patch description and matching.

The method was implemented with shallow convolutional networks and fast hard-negative mining. Tian et al. (Citation2017) used a progressive sampling technique in L2Net to improve a relative distance-based loss function in Euclidean space; to improve efficiency, they also considered the compactness of descriptors and intermediate feature maps. By combining a direct hinge triplet loss with "hardest-in-batch" mining, HardNet (Mishchuk et al., Citation2017) improves performance over L2Net, and PN-Net (Balntas et al., Citation2016) trains with both positive and negative constraints, utilizing the concepts of online augmentation and distance metric learning; compared to a hinge loss or SoftMax ratio, its SoftPN loss function converges faster. Generalized ranking has been introduced through descriptor learning based on average precision, in which true matches should be ranked above all false matches; this requirement is specified as a constraint and optimized for binary and real-valued local feature descriptors. Generative adversarial networks (Mirza & Osindero, Citation2014) have been used to train discriminative but compact binary representations of image patches. In contrast, by exploiting a pool of kernelized subspaces, Wei et al. (Citation2018) learned a discriminative deep descriptor without relying on specific loss functions, network structures, regularization, or hard negative mining. SOSNet (Yi et al., Citation2016) trained the network by combining local patch similarity constraints with spatial geometric constraints on the points of interest, a more recent technique that significantly enhances matching.

As with the CNN-based detectors mentioned above, an increasing number of end-to-end learning methods integrate feature description and detection into a complete matching pipeline. These methods are similar to those designed specifically for description reviewed above; the main differences lie in the training procedure and the design of the overall network, and the core challenge is to make the whole process differentiable and trainable. For example, LIFT uses an end-to-end CNN to perform keypoint detection, orientation estimation, and description simultaneously. SuperPoint (DeTone et al., Citation2018) proposes a self-supervised framework for training interest point detectors and descriptors for multi-view geometric problems; its fully convolutional model operates on full-size images and simultaneously computes pixel-level interest point locations and their associated descriptors.

LF-Net (Ono et al., Citation2018) confines the end-to-end pipeline to a single branch to optimize the entire process in a differentiable manner. It also employs a fully convolutional network that operates on full-sized images to produce a feature-rich score map. This map can then be used to extract keypoint locations as well as their feature attributes, such as scale and orientation. Additionally, it performs a differentiable form of Non-Maximum Suppression (NMS), namely, softargmax, to improve subpixel location accuracy and enhance keypoint saliency. Similar to LF-Net, RF-Net (Shen et al., Citation2019) selects high-response pixels as keypoints on multiple scales, but the response maps are constructed by receptive feature maps.

However, Bhowmik et al. (Citation2020) suggest that improving low-level matching scores does not necessarily lead to better performance in high-level vision tasks. They therefore embedded the feature detector in a complete vision pipeline in which the learnable parameters are trained end-to-end, addressing the discrete nature of keypoint selection and descriptor matching with principles from reinforcement learning. In 2020, Luo et al. (Citation2020) presented ASLFeat, which jointly learns local feature detectors and descriptors to exploit the local shape information of feature points and improve the accuracy of point detection. Another learning-based approach to detection involves estimating orientation (Yi et al., Citation2016), while the spatial transformer network (STN) (Jaderberg et al., Citation2015) can also serve as a valuable reference for deep learning-based detectors with rotation invariance (Ono et al., Citation2018; Yi et al., Citation2016).

Materials and methods

As shown in Figure 2, we present a detector-free local feature matching framework with transformers and semantics, called LoFTRS, which incorporates semantic constraints into the matching process. LoFTRS consists of two main branches: an image segmentation module and a detector-free Transformer matching module.

Figure 2. Incorporating semantic constraints to Transformer-based local feature matching.


Image segmentation module

In the image segmentation module, we follow Mask DINO (Li et al., Citationn.d.), which uses the same detection architecture as DINO with only minimal modifications. In the Transformer decoder, Mask DINO adds a mask branch for segmentation and extends several key components of DINO for the segmentation task. Mask DINO predicts boxes and masks with two parallel heads in a loosely coupled manner, like some traditional models; however, this can result in inconsistent predictions. To address this issue, a mask prediction loss is added to the original box and classification losses in the bipartite matching, which encourages more accurate and consistent matching results for a query.
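The text above does not spell out how the segmentation output constrains the matcher, so the following is only a hedged sketch of one plausible mechanism: rejecting putative correspondences whose endpoints carry different semantic labels. The function name `filter_by_semantics` and the array layouts are illustrative assumptions, not the actual LoFTRS implementation.

```python
import numpy as np

def filter_by_semantics(kpts_a, kpts_b, matches, seg_a, seg_b):
    """Keep only matches whose endpoints share the same semantic label.

    kpts_a, kpts_b : (N, 2) arrays of (x, y) keypoint coordinates.
    matches        : (M, 2) array of index pairs into kpts_a / kpts_b.
    seg_a, seg_b   : (H, W) integer label maps from the segmentation branch.
    """
    kept = []
    for ia, ib in matches:
        xa, ya = np.round(kpts_a[ia]).astype(int)
        xb, yb = np.round(kpts_b[ib]).astype(int)
        # Reject correspondences that cross semantic classes (e.g. roof vs. road).
        if seg_a[ya, xa] == seg_b[yb, xb]:
            kept.append((ia, ib))
    return np.asarray(kept)
```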

Detector-free transformer module

In this work, we present PE-LoftrMatcher, a Transformer-based deep learning module that builds upon our investigation of local feature matching in detector-free methods. As shown in Figure 3, PE-LoftrMatcher uses a deep-narrow Transformer design to capture more human-intuitive and easier-to-match features. Additionally, position encoding (PE) is integrated into each Transformer layer to convey position information in the deep layers, and a network-based refinement block is proposed to extract more precise matches. Our key insight is that local feature matching with deep layers can capture more human-intuitive and easier-to-match features. The detailed network architecture is shown in Figures 3 and 4: PE-LoftrMatcher interleaves the Slimming Transformer (SlimFormer) L times to perform long-range context aggregation. The Slimming Transformer leverages vector-based attention to model the relevance among all keypoints and achieves long-range context aggregation efficiently and effectively.

Figure 3. PE-LoftrMatcher framework flowchart.


Figure 4. Slimming transformer framework diagram.


We flatten the updated enhanced features $F_A^{ftm}$ and $F_B^{ftm}$ into the input sequences for deep feature aggregation, obtaining $F_A^{seq}, F_B^{seq} \in \mathbb{R}^{N \times C}$. Following (Sarlin et al., Citationn.d.), we view the keypoints with features $F_A^{seq}, F_B^{seq}$ in the image pair as nodes of a GNN, in which global intra-/inter-image context aggregation is performed.

Vector-based attention (VAtt) layer

Instead of approximating self-attention in a context-independent way, we transform the query vector into a global query context and use element-wise products to model the correlation among all keypoints. Technically, during each feature enhancement, we utilize self- or cross-attention to aggregate long-range contextual information. For self-attention, the input features $U$ and $R$ are identical (either $(F_A^{seq}, F_A^{seq})$ or $(F_B^{seq}, F_B^{seq})$). For cross-attention, the input features $U$ and $R$ come from different images (either $(F_A^{seq}, F_B^{seq})$ or $(F_B^{seq}, F_A^{seq})$). First, SlimFormer converts the input features $U$ and $R$ into queries, keys, and values $Q, K, V \in \mathbb{R}^{N \times \hat{C}}$:

$$Q = UW^Q, \quad K = RW^K, \quad V = RW^V \tag{1}$$

where $W^Q, W^K, W^V \in \mathbb{R}^{\hat{C} \times \hat{C}}$ denote the learnable weights of the feature transformation. Then, we encode the relative positions of the query $Q$ and the key $K$:

$$\tilde{Q} = \mathrm{DPE}(Q), \quad \tilde{K} = \mathrm{DPE}(K) \tag{2}$$

where DPE denotes the relative position encoding operation. Next, modeling the contextual information of the input features based on the interactions among $\tilde{Q}$, $\tilde{K}$, and $V$ is a key issue in Transformer-like architectures. In the original Transformer, the dot-product attention mechanism leads to quadratic complexity, making it impractical to build a deep Transformer layer. One potential way to reduce the computational complexity is to summarize the attention matrix before modelling its interactions.

We introduce vector-based attention, which effectively models long-range interactions between pixel tokens, to alleviate this bottleneck. Instead of computing a quadratic attention map $Q K^{T}$ that encodes all possible interactions between candidate matches, we form a compact representation of the query-key interactions by computing the correlation between a global query vector and each key vector. Specifically, we first use an MLP to compute a weight $\tilde{Q}_{imp} \in \mathbb{R}^{1 \times N}$ over the query vectors:

$$\tilde{Q}_{imp} = \mathrm{Softmax}\big(\mathrm{MLP}(\tilde{Q})\big) \tag{3}$$

where Softmax denotes the softmax operation.

The global query vector $\bar{Q} \in \mathbb{R}^{1 \times \hat{C}}$ is set to a linear combination of $\tilde{Q}$:

$$\bar{Q} = \tilde{Q}_{imp}\, \tilde{Q} \tag{4}$$

where the product denotes matrix multiplication.

We then use element-wise multiplication between the global query vector $\bar{Q}$ and each key vector to model their interaction and obtain the context-aware key vectors $\tilde{K}_{\bar{Q}} \in \mathbb{R}^{N \times \hat{C}}$, which are in turn summarized into a global key vector $K_{\bar{Q}}$ and combined with the values:

$$\tilde{K}_{\bar{Q}} = \bar{Q} \odot \tilde{K}$$
$$K_{\bar{Q}} = \tilde{K}_{\bar{Q},imp}\, \tilde{K}_{\bar{Q}}$$
$$\Lambda = K_{\bar{Q}} \odot V \tag{5}$$

where $\odot$ denotes element-wise multiplication and $\tilde{K}_{\bar{Q},imp}$ denotes importance weights computed from $\tilde{K}_{\bar{Q}}$ analogously to Eq. (3).

Subsequently, we use MLP and shortcut structures to derive the global message $M \in \mathbb{R}^{N \times \hat{C}}$:

$$M = \mathrm{MLP}(\Lambda) + \tilde{Q} \tag{6}$$

For convenience, we define the process of the vector-based attention layer as follows:

$$M = \mathrm{VAtt}(U, R) \tag{7}$$
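For concreteness, a minimal PyTorch sketch of the vector-based attention in Eqs. (1)-(7) is given below. It is an interpretation of the equations rather than the released implementation: the relative position encoding DPE is left as an identity placeholder, the MLPs are single linear layers, and the key importance weights in Eq. (5) are assumed to be computed analogously to Eq. (3).

```python
import torch
import torch.nn as nn

class VectorAttention(nn.Module):
    """Linear-complexity vector-based attention, following Eqs. (1)-(7)."""

    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)  # W^Q in Eq. (1)
        self.wk = nn.Linear(dim, dim, bias=False)  # W^K
        self.wv = nn.Linear(dim, dim, bias=False)  # W^V
        self.q_score = nn.Linear(dim, 1)           # MLP producing query importances
        self.k_score = nn.Linear(dim, 1)           # assumed analogue for the keys
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.dpe = nn.Identity()                   # placeholder for the relative PE

    def forward(self, u: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # u, r: (B, N, C); self-attention uses r = u, cross-attention the other image.
        q, k, v = self.wq(u), self.wk(r), self.wv(r)        # Eq. (1)
        q, k = self.dpe(q), self.dpe(k)                     # Eq. (2)
        q_imp = torch.softmax(self.q_score(q), dim=1)       # Eq. (3): (B, N, 1)
        q_global = q_imp.transpose(1, 2) @ q                # Eq. (4): (B, 1, C)
        k_ctx = q_global * k                                # context-aware keys, (B, N, C)
        k_imp = torch.softmax(self.k_score(k_ctx), dim=1)
        k_global = k_imp.transpose(1, 2) @ k_ctx            # global key vector, (B, 1, C)
        lam = k_global * v                                  # Eq. (5): (B, N, C)
        return self.out(lam) + q                            # Eq. (6): global message M
```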

Feed-forward network (FFN)

Inspired by conventional Transformers, we apply a feed-forward network to $M$ to extract discriminative features for effective deep feature aggregation. The feed-forward network consists of two fully connected layers and a GELU activation function; the hidden dimension between the two fully connected layers is expanded by a scale rate $\gamma$ to learn a rich feature representation. This process is formulated as:

$$\mathrm{FFN}(U, M) = \mathrm{MLP}_{1/\gamma}\Big(\mathrm{GELU}\big(\mathrm{MLP}_{\gamma/3}([U \,\|\, M])\big)\Big) \tag{8}$$

Layer scale strategy

Intuitively, people obtain different information each time they observe an image, which inspires us to propose a layer-scale strategy. Specifically, following ResNet (He et al., Citationn.d.), we utilize a shortcut structure and design a learnable scaling factor $\xi$ to adaptively balance the original features $U$ and the enhanced message $\tilde{M}$, which is formulated as:

$$\tilde{U} = U + \xi \tilde{M} \tag{9}$$

By incorporating $\xi$ into SlimFormer, the network can easily simulate the human behavior of acquiring different matching cues each time an image pair is scanned.

Self-/cross-SlimFormer

In summary, SlimFormer is formulated as:

$$\mathrm{Slim}(U, R) = U + \xi\, \mathrm{FFN}\big(U, \mathrm{VAtt}(U, R)\big) \tag{10}$$

We apply SlimFormer $L$ times for feature enhancement. During the $l$-th enhancement, the self-/cross-attention mechanism integrates intra-/inter-image information.
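The sketch below shows how Eqs. (8)-(10) could wrap the attention layer above into one SlimFormer block, and how L self-/cross-attention rounds might be interleaved over the two feature sequences. It reuses the `VectorAttention` class from the previous sketch; the FFN expansion rate, the initial value of the layer-scale factor ξ, and the `aggregate` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SlimFormer(nn.Module):
    """One enhancement step: U <- U + xi * FFN(U, VAtt(U, R)), Eq. (10)."""

    def __init__(self, dim: int, expand: float = 2.0):
        super().__init__()
        self.attn = VectorAttention(dim)  # from the previous sketch
        hidden = int(dim * expand)
        # Eq. (8): two linear layers with GELU, applied to the concatenation [U || M].
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # Eq. (9): learnable layer-scale factor xi, initialised to a small value.
        self.xi = nn.Parameter(1e-2 * torch.ones(dim))

    def forward(self, u: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        m = self.attn(u, r)                                       # Eq. (7)
        return u + self.xi * self.ffn(torch.cat([u, m], dim=-1))  # Eqs. (8)-(10)

def aggregate(fa, fb, layers):
    """Interleave L self-/cross-attention rounds over the two feature sequences."""
    for self_blk, cross_blk in layers:  # list of (SlimFormer, SlimFormer) pairs
        fa, fb = self_blk(fa, fa), self_blk(fb, fb)    # intra-image context
        fa, fb = cross_blk(fa, fb), cross_blk(fb, fa)  # inter-image context
    return fa, fb
```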

Evaluation pipeline

This study follows the basic workflow of SfM-based image reconstruction, which is divided into two parts. In the first part, feature extraction is performed on each UAV image, initial matching is obtained from each pair of images, and then RANSAC is used to remove outliers. According to the workflow of feature matching, the feature evaluation is divided into two steps, which are feature extraction and feature matching. In the second part, the verified matches are fed into the classical SfM pipeline for sparse reconstruction to obtain accurate camera poses and scene structures. Subsequently, the results from sparse reconstruction are used to perform dense reconstruction.
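The first stage of this pipeline can be illustrated with a short OpenCV sketch: SIFT extraction, ratio-test matching, and RANSAC verification on the fundamental matrix; the verified matches would then be handed to an SfM tool for the second stage. File names are placeholders, and the thresholds are common defaults rather than the settings used in this evaluation.

```python
import cv2
import numpy as np

img1 = cv2.imread("uav_0001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("uav_0002.jpg", cv2.IMREAD_GRAYSCALE)

# 1) Feature extraction.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2) Initial matching with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.8 * n.distance]

# 3) Outlier removal: RANSAC on the fundamental matrix.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
inliers = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
print(f"{len(inliers)} verified matches out of {len(good)} putative matches")
```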

Table 1 presents a summary of the advantages and disadvantages of the classical handcrafted methods used in this paper.

Table 1. Comparison of commonly used handcrafted methods: SIFT, SURF, ORB, and AKAZE.

According to this pipeline, key metrics are chosen to evaluate the performance of detectors and descriptors, as outlined in Table 2. These metrics are categorized into two groups: matching metrics cover the feature extraction and matching procedure, and reconstruction metrics cover the SfM-based sparse reconstruction and the MVS-based dense reconstruction.

Table 2. Description of evaluation metrics.

Feature extraction and matching

Not only did we want to use a comprehensive set of metrics that allows our analysis to be related to existing work, but we also wanted to evaluate the parameters associated with feature-dependent algorithms. Therefore, we use three metrics: matching score, matching time, and the interval distribution of the number of matching pairs, as shown in Figure 5.

Figure 5. Main methods and evaluation metrics of feature extraction and matching.


Matching score

The matching score is the ratio between the number of verified inliers and the number of initial features extracted from the images. It describes how many of the initial features lead to a correct match; the matching score may be affected by ambiguous descriptors and by the matching criteria:

$$\text{Matching score} = \frac{\text{Correct Matches}}{\text{Features}} \tag{11}$$

Overall, the matching score describes the performance of the descriptor and is influenced by the robustness of the descriptor to transformations present in the data.

Interval distribution of the number of matching pairs: statistics on the distribution intervals of the number of matched pairs obtained by different methods on the same data reflect the feature extraction and matching performance of the methods more intuitively.

Matching time

The time spent on an image pair in the matching process is the sum of the time spent on feature extraction and on feature matching. The total time spent on all matching pairs is recorded here.
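For clarity, the three matching metrics can be expressed as small helper functions. The sketch below assumes that inlier counts, feature counts, and the matcher callable are supplied by whichever method is being evaluated; the interval edges follow the 0-100, 100-500, 500-1000, and >1000 buckets used later in the results.

```python
import time
from bisect import bisect_right

def matching_score(num_correct_matches: int, num_features: int) -> float:
    """Eq. (11): verified inliers divided by the number of extracted features."""
    return num_correct_matches / max(num_features, 1)

def interval_bucket(num_pairs: int) -> str:
    """Assign a match count to the intervals used in this evaluation."""
    edges = [100, 500, 1000]
    labels = ["0-100", "100-500", "500-1000", ">1000"]
    return labels[bisect_right(edges, num_pairs)]

def timed(match_fn, *args):
    """Matching time: wall-clock cost of feature extraction plus matching."""
    t0 = time.perf_counter()
    result = match_fn(*args)
    return result, time.perf_counter() - t0
```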

Sparse and dense reconstruction

After image matching, the sparse and dense reconstruction processes are evaluated as well. As shown in Figure 6, six separate metrics are adopted to evaluate the accuracy and completeness of the reconstruction results obtained with the different image matching methods.

Figure 6. Sparse reconstruction process and evaluation indicators.


Re-projection error

For sparse point cloud reconstruction, we judge the accuracy of relative positioning by calculating the reprojection error. The smaller the reprojection error, the more accurate the image matching algorithm and the relative positioning, which allows a model of higher accuracy to be reconstructed. After calibration, there is always a distance between the image position computed from the camera's projection matrix and the actual image position of each 3D point. We sum these errors to construct a least-squares problem and then find camera poses that minimize it:

$$\min_{B_j, A_i} \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij}\, \big\| P(B_j, A_i) - a_{ij} \big\|^2 \tag{12}$$

where $A_i$ and $B_j$ denote a 3D point and a camera, respectively; $P(B_j, A_i)$ is the predicted projection of point $A_i$ in camera $B_j$; $a_{ij}$ is the observed image point; $\|\cdot\|$ denotes the L2 norm; and $p_{ij}$ is an indicator with $p_{ij} = 1$ if point $A_i$ is visible in camera $B_j$ and $p_{ij} = 0$ otherwise.

The reprojection error accounts not only for the error of the estimated geometric transformation (e.g. the homography) but also for the measurement error of the image, making it suitable for evaluating sparse reconstruction results.
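As an illustration of Eq. (12), the sketch below computes the reprojection RMSE for a single camera under a pinhole model with a 3x4 projection matrix. The visibility indicator p_ij is handled implicitly by passing only the observed points, and the function name is hypothetical.

```python
import numpy as np

def reprojection_rmse(P, points3d, observations):
    """RMSE form of Eq. (12) for one camera.

    P            : (3, 4) projection matrix of camera B_j.
    points3d     : (n, 3) coordinates of the 3D points A_i.
    observations : (n, 2) observed image points a_ij (only points with p_ij = 1).
    """
    homog = np.hstack([points3d, np.ones((len(points3d), 1))])  # homogeneous coords
    proj = (P @ homog.T).T
    proj = proj[:, :2] / proj[:, 2:3]                           # perspective division
    residuals = np.linalg.norm(proj - observations, axis=1)     # ||P(B_j, A_i) - a_ij||
    return float(np.sqrt(np.mean(residuals ** 2)))
```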

Average track length

The average number of images in which a 3D point is observed (its track length) is often used to assess the quality and reliability of a 3D reconstruction: the more images in which a 3D point is observed, the longer its track. A point with observations in many different images can be triangulated with higher accuracy and stability. The average track length also reflects the sparsity of the 3D point cloud; if points are observed in only a few images, certain areas may lack sufficient viewpoints.
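Given a sparse model, the average track length is straightforward to compute; the sketch below assumes a COLMAP-style mapping from each 3D point to the set of images that observe it.

```python
def average_track_length(tracks):
    """tracks: dict mapping a 3D point id to the list of image ids observing it."""
    if not tracks:
        return 0.0
    return sum(len(images) for images in tracks.values()) / len(tracks)

# Example: three 3D points seen in 4, 2, and 3 images -> average track length 3.0.
print(average_track_length({0: [1, 2, 3, 4], 1: [2, 5], 2: [1, 3, 6]}))
```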

Number of registered images

This number affects the quality of the results, the efficiency of the computation, and the applicability of the algorithm. When multi-view images are used for 3D reconstruction, the number of registered images determines from how many viewpoints a particular 3D point can be observed. More registered images mean more information for reconstruction, which usually leads to a more accurate and complete 3D model.

Number of sparse point clouds

For sparse reconstruction, the number of sparse points is directly affected by the accuracy and completeness of feature matching and therefore directly reflects the performance of the feature detector and descriptor. The completeness of the reconstruction is thus evaluated by the number of sparse points.

Number of dense point clouds

For dense reconstruction, the number of dense points will be affected by the accuracy and completeness of the sparse reconstruction, which is further affected by the performance of the feature detectors and descriptors. Therefore, the number of dense points is used in this evaluation.

Reconstruction time

The time spent by the image in the reconstruction process, including the sparse reconstruction time and the dense reconstruction time.

By using the above metrics, the efficiency of the feature methods and the accuracy and completeness of the 3D reconstruction can be comprehensively assessed.

Results

Datasets

All four datasets were collected with a UAV platform; the scenes are shown in Figure 7.

Figure 7. Images of the four UAV scenes: (a) scene 1; (b) scene 2; (c) scene 3; (d) scene 4.


Six configurations were constructed for the four scenes. Data 1 and Data 4, consisting mainly of buildings, are repetitive textured data generated from Scene 1. The former is down-sampled data, while the latter is image patch data. Data 2, an ISPRS public dataset (Rottensteiner, Citationn.d.), was generated from Scene 2 and contains a large number of plants and several buildings. Data 3 is generated from Scene 3. It contains a library building with relatively consistent lighting conditions and small rotation angles between images. Data 5 and Data 6 are generated from Scene 4, the public SWJTU_BLD (Zhu et al., Citationn.d.) dataset. Data 5 is Ground-UAV data, whereas Data 6 is mixed Ground-UAV and UAV-UAV data. The specific data are described in Table 3.

Table 3. Description of the UAV test data.

For performance evaluation, all reconstruction experiments were performed on a Windows PC with a 3.6 GHz Intel Core i9-9900K CPU and a 4GB NVIDIA GeForce RTX 2080 Ti graphics card.

The deep learning matching method was instead performed on an Ubuntu PC with a 2.2 GHz Intel Xeon E5–2630 CPU and a 12GB NVIDIA GTX 1080 Ti graphics card.

In general, learned descriptors can be further enhanced by training them on UAV data. In this evaluation, we did not do so for two main reasons. First, existing benchmark datasets, such as Brown, contain wide-baseline images with viewpoint and illumination variations, which also represent common characteristics of UAV images. Second, some CNN models rely on synthetic datasets for pre-training intermediate sub-models, and different learning models require different data formats, e.g. patch-based and image-based datasets. For comparability, the pre-trained models and their released source code were therefore used without retraining, which also allows their adaptation to UAV images to be verified. It is worth noting that the default parameters of all the chosen methods were used to make the evaluation as fair as possible. There are many possible reasons for a reconstruction failure, such as non-optimized parameter configurations and unsuitable datasets; in other words, we only consider performance on UAV images without further parameter tuning.

Also, to ensure data consistency, we use full images as input for all methods instead of the traditional patch-based inputs, and we resized all datasets so that every method works properly. The methods used are listed in Table 4.

Table 4. Detailed information on the assessment methodology.

Comparison and evaluation of matching results

To verify the robustness and accuracy of the compared matching methods, we use the six datasets derived from the four UAV scenes described above for point matching with the 12 methods listed in Table 4. The matching scores are also visualized as histograms, as shown in Figure 8.

Figure 8. Comparison of matching scores datasets.


Table 5. Matching score.

Matching score

The performance of detectors and descriptors is first evaluated by feature matching, using the matching score as the comparison metric. The results are displayed in Table 5, where the first-, second-, and third-ranked values are shown in red, blue, and purple, respectively. The table shows that different methods cope with complex, realistic scenarios in different ways under different data conditions, leading to different matching scores; this illustrates the importance of selecting a method appropriate for the specific data being analysed. In Data 1, DogAffnet-Hard, CapsSuperPoint, and CapsSIFT achieved better matching scores of 0.4146, 0.2, and 0.2819, respectively. This suggests that the deep learning approaches are better at feature detection for a simple single building than the classical approaches, resulting in more match pairs; however, further investigation is needed to determine whether this translates into better reconstruction results. For Data 2, with a high amount of vegetation coverage, CapsSuperPoint, D2Net, and CapsSIFT achieved the highest scores with 0.2402, 0.2255, and 0.2188, respectively. It is evident that for highly repetitive textures, all methods exhibit varying degrees of score decline, which highlights the limitations of existing feature detection methods on repetitive textures; compared to the deep learning-based methods, the classical methods are less effective at detecting features on such textures.

The limitations of current feature detection methods on repetitive textures are thus highlighted, with the deep learning methods outperforming the classical methods in terms of detection results and feature pairs. In the case of Data 3, which contains numerous buildings and road networks, all detection methods struggle with the same issue of repetitive texture, suggesting that existing feature detection methods lack the necessary discrimination for large building and road networks. Nevertheless, CapsSIFT, CapsSuperPoint, and D2Net achieved higher scores on Data 3, with 0.2477, 0.2451, and 0.2084, respectively. For Data 5 and Data 6, which combine ground and aerial views, the effectiveness of all detection methods decreased, although the deep learning methods performed better. Comparing the matching results across these datasets shows that the deep learning approaches outperform the classical approaches in scenes of higher complexity and are better suited to realistic scenes.

Average track length

The statistical results for average track length are listed in Table 6, and a visualized histogram is shown in Figure 9. Considering the Tie_pt metric, SIFT and its two variants, RootSIFT and DSP-SIFT, are the best descriptors among all evaluated methods, meaning that these three methods provide the largest and strongest sets of matches for the subsequent SfM reconstruction. Combined with the number of sparse 3D points, however, this shows that SIFT does not reconstruct more sparse points even though it provides the largest and strongest matches. AKAZE, in contrast, provides an essentially complete sparse reconstruction with more points than SIFT as well as large and strong matches, since it produces more sparse 3D points and a long average track length on Data 1, Data 2, Data 3, and Data 6. Considering the interval distribution of the number of match pairs, SIFT has the largest number of corresponding feature points in the 0-100 interval, followed by the 100-500 interval, whereas AKAZE has the largest number in the 100-500 interval, followed by the 0-100 interval. For the images in the UAV datasets used in this paper, matching results with 100-500 corresponding feature points have the greatest impact on the reconstruction: 100-500 correspondences between images both ensure enough 2D correspondences and allow a more complete point cloud model to be built without consuming excessive computational resources.

Figure 9. Average track length.


Table 6. Average track length.

Qualitative image-matching results

Figures 10 and 11 show the qualitative image matching results for Data 1 and Data 6, which visualize the matching performance of the different methods. The first and second columns in Figures 10 and 11 represent different image pairs, showing the same and different sides of the building, respectively. The qualitative analysis shows that the classical methods produce fewer matching lines, concentrated on the main building, and therefore fewer outliers; however, the low number of feature correspondences makes it difficult to obtain more sparse 3D points. LoFTRS, on the other hand, produces a large number of feature correspondences with high accuracy. Compared to CapsSuperPoint, D2Net, and LoFTR, LoFTRS produces fewer but more accurate correspondences; compared to SIFT, SURF, ORB, AKAZE, and R2D2, it generates significantly more correspondences. This enables more sparse 3D points to be obtained and leads to a more complete 3D model during reconstruction. Taking both accuracy and robustness into consideration, LoFTRS achieves the best performance among both handcrafted and deep learning-based approaches.

Figure 10. Data 1 qualitative image matching results.


Figure 11. Data 6 qualitative image matching results.


Distribution of match pairs

The distribution of matching pairs for the different methods can be obtained by counting the number of matching pairs produced by each method. Most of the classical methods are concentrated below 500 pairs, while most of the learning-based methods are concentrated above 500 pairs; deep learning methods obtain deeper image features through convolution and therefore find more matching features than the classical methods. Depending on the number of corresponding feature points finally matched, we categorize the results into four intervals: 0-100, 100-500, 500-1000, and above 1000.

Data 1 and Data 4 are characterised by consistent lighting conditions, large rotation angles between images, less than 50% overlap, high noise, and extreme similarity between different sides of the building, which can produce more outliers during feature matching. Based on the experimental results in Table 7 and Figure 12, AKAZE, LoFTRS, SURF, SIFT, Patch2Pix, and LoFTR are advantageous methods for UAV data with low overlap. In addition, the LoFTRS matches are evenly distributed across every interval, showing that LoFTRS is robust for UAV building data with consistent lighting conditions, large rotation angles, and extreme similarity between different sides of the building.

Table 7. Interval distribution of the number of match pairs.

Figure 12. Distribution of intervals for each data.



Data 2 comprises a significant number of plants and multiple buildings. Matching features in image datasets that contain green plants has proven to be a challenging task. AKAZE, LoFTRS, SURF, SIFT, Patch2Pix, and LoFTR are methods that obtain corresponding feature points in different intervals of the matching result. Based on the experimental results, LoFTRS is evenly distributed across every interval, indicating its robustness for UAV building data with many plants and several buildings. Additionally, LoFTRS produces the highest number of correspondences in the 100-500 and 500-1000 intervals simultaneously, which is advantageous for 3D reconstruction.

Data 3 contains a library building captured under relatively consistent lighting conditions, with small rotation angles between images and over 80% overlap. Based on the experimental results, LoFTRS is evenly distributed across every interval, indicating its robustness for UAV building data with consistent lighting conditions and small rotation angles between images.

Data 5 is Ground-UAV data, whereas Data 6 is mixed Ground-UAV and UAV-UAV data. It is important to note that the SWJTU ground-view images were not acquired on the ground; they were also captured by a low-cost drone in vertical flight. However, because they have similar characteristics to other ground-view images, they are treated as ground-view images. Based on the experimental results, only LoFTRS is evenly distributed across every interval, indicating its robustness for Ground-UAV and UAV-UAV data. Additionally, LoFTRS produces the highest number of correspondences in the 100-500 and 500-1000 intervals simultaneously, which is advantageous for 3D reconstruction.

Efficiency

When analysing the matching results, it is essential to consider the time spent. Table 8 lists the matching time statistics for the six datasets, and Figure 13 visualizes the results. Generally, as the data volume increases, the time spent by each method also increases; however, the time spent is also affected by factors other than the amount of data, such as image size, parameter settings, and the operating platform. Compared to the traditional classical methods, the deep learning methods generally require more time, likely due to the time-consuming process of acquiring feature points and performing convolutions over pixels to obtain features. For small volumes of single-building data, AKAZE and ORB achieve the best times among the traditional methods, possibly due to the acceleration schemes they employ, whereas in the joint ground-air scenario, ORB and SURF are more time-efficient. Among the deep learning methods, LoFTRS requires the least amount of time and is more efficient than SIFT for Data 1.

Figure 13. Comparison of matching times datasets.


Table 8. Matching time.

Comparison of sparse reconstruction

Number of sparse points

The results are presented in Table 9 and visualized in Figure 14, with the top three values highlighted in red, blue, and purple. For the simple monolithic building in Data 1, R2D2, AKAZE, and SURF obtained the largest numbers of sparse points, with 25,577, 12,788, and 10,188, respectively. In general, the handcrafted methods are better suited to obtaining sparse points on this type of data, linking point pairs once a sufficiently large number of sparse points has been obtained; the matched pairs acquired by the deep learning methods, on the other hand, lack sufficient relevance, resulting in fewer sparse points. For Data 2, with high vegetation coverage and repetitive textures, SIFT, AKAZE, and LoFTRS obtain more sparse points than the other methods, with 28,759, 21,949, and 18,843, respectively. For Data 4, LoFTRS, LoFTR, and SIFT obtain the largest numbers of sparse points, with 76,516, 63,184, and 31,602, respectively.

Table 9. Number of sparse points and reprojection error.

Figure 14. Comparison of the number of sparse points in each data.


Reprojection error

Table 9 displays the reprojection error results for each method, with smaller values indicating greater accuracy, and Figure 15 visualizes the results. For Data 1, LoFTR, CapsSuperPoint, and SURF achieve more accurate results; for Data 3, ORB, D2Net, and R2D2; and for Data 4, LoFTR, ORB, and SURF. As the complexity of the data scenario increases, the deep learning-based approaches obtain feature points more accurately, resulting in more matches to some extent. In general, LoFTRS achieves a more accurate sparse reconstruction than SIFT.

Figure 15. Comparison of reprojection errors for each data.


Number of registered images

The number of registered images determines how much of the image data is included in the sparse point cloud; for the same dataset, a greater number of registered images yields a more comprehensive sparse reconstruction. The visualization is shown in Figure 16. Table 10 demonstrates that the handcrafted methods perform well for Data 1, while among the deep learning-based methods only LoFTRS and R2D2 can reconstruct a complete point cloud. The results indicate that the deep learning methods generate more matching pairs for texture-similar data, but most of them are incorrect, resulting in poor completeness of the sparse reconstruction. Additionally, Data 4 registers more images than Data 1 for the same scene; comparing the numbers of sparse points shows that image patch data retain more image features than down-sampled data, particularly for the deep learning-based LoFTRS. LoFTRS generally registers the same number of images as SIFT or more, as demonstrated on Data 1, Data 2, Data 3, and Data 4.

Figure 16. Number of registered images.


Table 10. Number of registered images.

Efficiency

Table 11 lists the time spent on sparse reconstruction, and Figure 17 shows the corresponding visualization. For the handcrafted methods, the reconstruction of the monolithic building data takes significantly longer than the matching process. Compared to the handcrafted methods, LoFTRS requires less time for single-building reconstruction. In general, LoFTRS achieves the best efficiency among the deep learning-based methods and is more efficient than AKAZE.

Figure 17. Comparison of sparse reconstruction time cost datasets.


Table 11. Sparse reconstruction time.

Comparison of dense reconstruction

Number of dense points

The number of dense points is significantly affected by the number of sparse points, as well as by other factors such as the number of isolated points. It is important to note that some learned methods, such as Patch2Pix and LoFTR, do not perform well on sparse points, resulting in no output for dense reconstruction. Table 12 presents the dense point results for all methods, with the top three values highlighted in red, blue, and purple, and Figure 18 visualizes one of the results. Considering Data 1, Data 2, Data 3, and Data 4, SIFT and AKAZE achieve greater numbers of dense points, while among the deep learning-based methods only LoFTRS achieves a large number of dense points.

Figure 18. Comparison of the number of dense points for each data.


Table 12. Number of dense points.

Efficiency

Dense reconstruction is a major time cost for every method. The dense reconstruction results for all methods are reported in Table 13 and visualized in Figure 19; methods with no dense results are not counted in the timing. As can be seen from the table, the traditional methods spend more time in this step, which may be because they acquire more isolated points and feature points, and these take more time in dense reconstruction. The time spent by LoFTRS is lower than that of the classical methods, excluding the learning methods that cannot obtain dense results.

Figure 19. Comparison of time cost of dense reconstruction for each data.


Table 13. Dense reconstruction time.

Conclusions

In this work, we present LoFTRS, a deep learning-based image matching framework that integrates semantic constraints into the matching process. We evaluate the performance of LoFTRS compared to various popular handcrafted and deep learning-based methods, and we investigate the relationship between matching quality and the performance of subsequent processing steps, such as the accuracy and completeness of the model generated by SfM. The experimental results show that the proposed LoFTRS achieves equal or better image matching performance in terms of matching score, average track length, RMSE, and the number of 3D points. In the future, we aim to investigate whether other deep learning-based methods can achieve comparable or superior performance to handcrafted methods after retraining.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Due to the commercial nature of the research, the supporting data are not available.

References

  • Arandjelović, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 2911–23). IEEE.
  • Balntas, V., Johns, E., Tang, L., & Mikolajczyk, K. (2016). PN-Net: Conjoined triple deep network for learning local image descriptors. ArXiv preprint arXiv:1601.05030. https://doi.org/10.48550/arXiv.1601.05030
  • Balntas, V., Riba, E., Ponsa, D., & Mikolajczyk, K. (2016). Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference 2016, 1, 3.
  • Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: Speeded up robust features. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9 (pp. 404–417). Springer.
  • Bhowmik, A., Gumhold, S., Rother, C., & Brachmann, E. (2020). Reinforced feature points: Optimizing feature detection and description for a high-level task. Proceedings of the IEEE/CVF Conference on Computer Vision and pattern recognition (pp. 4948–4957).
  • Chiabrando, F., D’Andria, F., Sammartano, G., & Spanò, A. (2018). UAV photogrammetry for archaeological site survey. 3D models at the Hierapolis in Phrygia (Turkey). Virtual Archaeology Review, 9(18), 28–43. https://doi.org/10.4995/var.2018.5958
  • Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR’05) (Vol. 1, pp. 539–546). IEEE.
  • Daakir, M., Pierrot-Deseilligny, M., Bosser, P., Pichard, F., Thom, C., Rabot, Y., & Martin, O. (2017). Lightweight UAV with on-board photogrammetry and single-frequency GPS positioning for metrology applications. ISPRS Journal of Photogrammetry and Remote Sensing, 127, 115–126. https://doi.org/10.1016/j.isprsjprs.2016.12.007
  • DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). Superpoint: Self-supervised interest point detection and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 224–236).
  • Dong, J., & Soatto, S. (2015). Domain-size pooling in local descriptors: DSP-SIFT. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5097–5106).
  • Habib, A., Han, Y., Xiong, W., He, F., Zhang, Z., & Crawford, M. (2016). Automated ortho-rectification of UAV-based hyperspectral data over an agricultural field using frame RGB imagery. Remote Sensing, 8(10), 796.
  • Han, X., Leung, T., Jia, Y., Sukthankar, R., & Berg, A. C. (2015). Matchnet: Unifying feature and metric learning for patch-based matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3279–3286).
  • Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Alvey Vision Conference (Vol. 15, pp. 10–5244). Citeseer.
  • He, K., Zhang, X., Ren, S., & Sun, J. (n.d). Deep residual learning for image recognition. Retrieved from http://image-net.org/challenges/LSVRC/2015/
  • Huang, P.-H., Matzen, K., Kopf, J., Ahuja, N., & Huang, J.-B. (2018). Deepmvs: Learning multi-view stereopsis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2821–2830).
  • Hu, X., & Mordohai, P. (2012). Least commitment, viewpoint-based, multi-view stereo. Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission (pp. 531–538). IEEE.
  • Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial transformer networks. Advances in Neural Information Processing Systems, 28. https://doi.org/10.48550/arXiv.1506.02025
  • Jegou, H., Douze, M., & Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part I 10 (pp. 304–317). Springer.
  • Jiang, S., & Jiang, W. (2017). Efficient structure from motion for oblique UAV images based on maximal spanning tree expansion. ISPRS Journal of Photogrammetry and Remote Sensing, 132, 140–161. https://doi.org/10.1016/j.isprsjprs.2017.09.004
  • Jiang, S., Jiang, W., Huang, W., & Yang, L. (2017). UAV-based oblique photogrammetry for outdoor data acquisition and offsite visual inspection of transmission line. Remote Sensing, 9(3), 278. https://doi.org/10.3390/rs9030278
  • Jiang, W., Song, Y., Leung, T., Rosenberg, C., & Wang, J. (2014). Learning fine-grained image similarity with deep ranking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1386–1393).
  • Jin, Y., Mishkin, D., Mishchuk, A., Matas, J., Fua, P., Yi, K. M., & Trulls, E. (2021). Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2), 517–547. https://doi.org/10.1007/s11263-020-01385-0
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
  • Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L. M., & Shum, H.-Y. (n.d.). Mask DINO: Towards a unified transformer-based framework for object detection and segmentation. Retrieved from https://github.com/IDEA-
  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  • Luo, Z., Shen, T., Zhou, L., Zhang, J., & Yao, Y. (2019). Contextdesc: Local descriptor augmentation with cross-modality context. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2527–2536).
  • Luo, Z., Zhou, L., Bai, X., Chen, H., & Zhang, J. (2020). Aslfeat: Learning local features of accurate shape and localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6589–6598).
  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. https://doi.org/10.48550/arXiv.1411.1784
  • Mishchuk, A., Mishkin, D., Radenovic, F., & Matas, J. (2017). Working hard to know your neighbor’s margins: Local descriptor learning loss. Advances in Neural Information Processing Systems, 30. https://doi.org/10.48550/arXiv.1705.10872
  • Ono, Y., Trulls, E., Fua, P., & Yi, K. M. (2018). LF-Net: Learning local features from images. Advances in Neural Information Processing Systems, 31. https://doi.org/10.48550/arXiv.1805.09662
  • Pajares, G. (2015). Overview and current status of remote sensing applications based on unmanned aerial vehicles (UAVs). Photogrammetric Engineering & Remote Sensing, 81(4), 281–330. https://doi.org/10.14358/PERS.81.4.281
  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). IEEE.
  • Radenovic, F., Schonberger, J. L., Ji, D., Frahm, J.-M., Chum, O., & Matas, J. (2016). From dusk till dawn: Modeling in the dark. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5488–5496).
  • Rottensteiner, F. (n.d.). ISPRS test project on urban classification and 3D building reconstruction: Evaluation of building reconstruction results.
  • Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. Proceedings of the 2011 International Conference on Computer Vision (pp. 2564–2571). IEEE.
  • Sarlin, P.-E., DeTone, D., Malisiewicz, T., & Rabinovich, A. (2020). SuperGlue: Learning feature matching with graph neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4938–4947). https://doi.org/10.1109/cvpr42600.2020.00499
  • Sattler, T., Havlena, M., Schindler, K., & Pollefeys, M. (2016). Large-scale location recognition and the geometric burstiness problem. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1582–1590).
  • Sattler, T., Leibe, B., & Kobbelt, L. (2011). Fast image-based localization using direct 2d-to-3d matching. Proceedings of the 2011 International Conference on Computer Vision (pp. 667–674). IEEE.
  • Schonberger, J. L., & Frahm, J.-M. (2016). Structure-from-motion revisited. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4104–4113).
  • Schonberger, J. L., Hardmeier, H., Sattler, T., & Pollefeys, M. (2017). Comparative evaluation of hand-crafted and learned local features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1482–1491).
  • Schonberger, J. L., Radenovic, F., Chum, O., & Frahm, J.-M. (2015). From single image query to detailed 3d reconstruction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5126–5134).
  • Schönberger, J. L., Zheng, E., Frahm, J.-M., & Pollefeys, M. (2016). Pixelwise view selection for unstructured multi-view stereo. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14 (pp. 501–518). Springer.
  • Sedaghat, A., & Ebadi, H. (2015). Remote sensing image matching based on adaptive binning SIFT descriptor. IEEE Transactions on Geoscience and Remote Sensing, 53(10), 5283–5293. https://doi.org/10.1109/TGRS.2015.2420659
  • Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 806–813).
  • Shen, X., Wang, C., Li, X., Yu, Z., & Li, J. (2019). RF-Net: An end-to-end image matching network based on receptive field. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8132–8140).
  • Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-Noguer, F. (2015). Discriminative learning of deep convolutional feature point descriptors. Proceedings of the IEEE International Conference on Computer Vision (pp. 118–126).
  • Sun, Y., Sun, H., Yan, L., Fan, S., & Chen, R. (2016). RBA: Reduced Bundle Adjustment for oblique aerial photogrammetry. ISPRS Journal of Photogrammetry and Remote Sensing, 121, 128–142. https://doi.org/10.1016/j.isprsjprs.2016.09.005
  • Tareen, S. A. K., & Saleem, Z. (2018). A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK. In 2018 International Conference on Computing, Mathematics and Engineering Technologies: Invent, Innovate and Integrate for Socioeconomic Development, iCoMET 2018 - Proceedings (Vol. 2018-January, pp. 1–10). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICOMET.2018.8346440
  • Tian, Y., Fan, B., & Wu, F. (2017). L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 661–669).
  • Tola, E., Lepetit, V., & Fua, P. (2009). Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 815–830. https://doi.org/10.1109/TPAMI.2009.77
  • Tolias, G., Avrithis, Y., & Jégou, H. (2016). Image search with selective match kernels: Aggregation across single and multiple images. International Journal of Computer Vision, 116(3), 247–261. https://doi.org/10.1007/s11263-015-0810-4
  • Torii, A., Arandjelovic, R., Sivic, J., Okutomi, M., & Pajdla, T. (2015). 24/7 place recognition by view synthesis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1808–1817).
  • Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., & Fragkiadaki, K. (2017). SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804. https://doi.org/10.48550/arXiv.1704.07804
  • Wang, J., Zhou, F., Wen, S., Liu, X., & Lin, Y. (2017). Deep metric learning with angular loss. Proceedings of the IEEE International Conference on Computer Vision (pp. 2593–2601).
  • Wei, X., Zhang, Y., Gong, Y., & Zheng, N. (2018). Kernelized subspace pooling for deep local descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1867–1875).
  • Yi, K. M., Trulls, E., Lepetit, V., & Fua, P. (2016). LIFT: Learned invariant feature transform. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14 (pp. 467–483). Springer International Publishing.
  • Yi, K. M., Verdie, Y., Fua, P., & Lepetit, V. (2016). Learning to assign orientations to feature points. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 107–116).
  • Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4353–4361).
  • Zeisl, B., Sattler, T., & Pollefeys, M. (2015). Camera pose voting for large-scale image-based localization. Proceedings of the IEEE International Conference on Computer Vision (pp. 2704–2712).
  • Zhu, Q., Wang, Z., Hu, H., Xie, L., Ge, X., & Zhang, Y. (2020). Leveraging photogrammetric mesh models for aerial-ground feature point matching toward integrated 3D reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 166, 26–40. https://doi.org/10.1016/j.isprsjprs.2020.05.024