Research Article

Remote sensing image feature matching via graph classification with local motion consistency

Article: 2308713 | Received 07 Sep 2023, Accepted 17 Jan 2024, Published online: 06 Feb 2024

ABSTRACT

Feature matching is a classic challenge in the computer vision field. In this paper, we propose an innovative graph classification method based on neighborhood motion consistency to eliminate erroneous matches. Specifically, we transform the coordinates of feature matching points into vectors on a unified scale. For a given match, we construct a graph centered around the match and incorporating neighboring matches. Node attributes are designed to represent the similarity between the vector of the central node and those of its neighbors. To facilitate this, we develop a lightweight graph attention neural network dedicated to graph property classification, thereby predicting whether the match under consideration is correct. To effectively train the model, we employ a random cropping strategy to generate a plethora of diverse graphs for classifier training. We evaluate our method on datasets encompassing translational remote sensing data, rotational and scaled remote sensing imagery produced via random cropping, and nonrigid fisheye datasets. Our algorithm demonstrates superior performance to current state-of-the-art methods.

1. Introduction

Feature matching, which establishes correspondences between the same scenes or objects across two or more images, is a fundamental and challenging task in the computer vision field. Robust feature matching enables various subsequent operations, such as image stitching (Nie et al. Citation2021), 3D reconstruction (Ham, Wesley, and Hendra Citation2019), and pose estimation (Engel, Koltun, and Cremers Citation2018; Forster, Pizzoli, and Scaramuzza Citation2014), and has widespread applications across numerous scenarios in the remote sensing imagery domain. Many researchers have made contributions to feature matching, aiming to obtain more numerous and reliable matches.

To enhance accuracy and robustness, a plethora of feature matching algorithms have been devised. However, real-world situations are inherently intricate, and the expressiveness of matching schemes remains constrained. This often results in a significant number of outliers during matching, so sole reliance on the matching technique seldom yields optimal outcomes. While methods such as state estimation (Barfoot Citation2017) and RANSAC (Horn and Schunck Citation1981) can augment matching accuracy, they introduce additional computational overhead. Thus, feature matching without geometric constraints emerges as a practical alternative. To obtain more robust inliers, many filter schemes have been proposed, such as those based on geometric stability and geometric transformation models. These models perform well under rigid image transformations, such as planar scenes, but are not applicable in situations with occlusion or changes in viewpoint.

To mitigate these issues, we introduce an image matching algorithm based on local displacement similarity. In this algorithm, we regard the vectors composed of match pairs as displacements. Typically, a correct match is surrounded by matches exhibiting consistent displacement. By leveraging the nearest neighbor method, we identify the neighbors of the match under examination. Due to the relatively stable point relationships in remote sensing images (Lin et al. Citation2010), altering the scale and rotation does not excessively disrupt the stability between neighbors. We normalize the displacement vector of each image point relative to the image size and assess the motion consistency of the displacement relationships between neighboring feature matches.

By constructing an undirected graph where match points act as nodes, we train a graph property classification network through a supervised learning scheme, incorporating graph attention networks (GAT) (Velickovic et al. Citation2017), an attention pooling layer, and a classifying layer. The model evaluates the graph's properties to determine whether the central node is a correct match; it can also be regarded as a binary classifier. To train this model, we also developed a random image cropping algorithm to generate a reliable and precise dataset.

In summary, this work offers two primary contributions. First, we develop a novel feature matching approach that leverages displacement vector consistency as edge weights to construct an undirected graph composed of a central node of interest and its nearest neighbors. We build a model predominantly structured by the GAT to classify the characteristics of a graph and consequently determine the correctness of the central node match via supervised learning. Second, to adequately train this model, we adopt a random cropping strategy to generate multiple diverse graphs for classifier training. This research provides innovative solutions and methodologies that contribute significantly to the domain of remote sensing image matching.

2. Related works

The process of high-accuracy feature matching usually comprises two steps (Jiang et al. Citation2021). The first step involves the use of feature point detection and matching algorithms, such as brute force matching, bundle adjustment (Triggs et al. Citation2000) or optical flow (Horn and Schunck Citation1981), to obtain rough matching results. The second step involves the use of a match filter that screens the results of the previous step to yield better matching results.

Many descriptor schemes have been proposed for the first step, including the following.

Spot Detection Type:

Scale-Invariant Feature Transform (SIFT) (Lowe Citation2004): A scale-invariant feature transformation method that detects and describes local features in multiple scale spaces, exhibiting invariance to scale, rotation, and illumination. Speeded Up Robust Features (SURF) (Bay, Tuytelaars, and Van Gool Citation2006): Similar to SIFT, but uses integral images to accelerate computation, resulting in higher speed.

Focus Detection Type:

The Harris (Mikolajczyk and Schmid Citation2004) corner detection algorithm: Detects corner features in images by calculating local grayscale variations and corner response functions. FAST (Features from Accelerated Segment Test): A corner detection algorithm that rapidly examines the pixels surrounding a candidate point, providing high speed and robustness.

Binary Type:

Binary Robust Independent Elementary Features (BRIEF) (Calonder et al. Citation2010): Generates binary descriptors by comparing pairs of pixels in an image for feature matching. Binary Robust Invariant Scalable Keypoints (BRISK) (Leutenegger, Chli, and Siegwart Citation2011): Combines scale-invariant features and binary descriptors, exhibiting good robustness and scalability. Oriented FAST and Rotated BRIEF (ORB) (Rublee et al. Citation2011): Combines the advantages of FAST and BRIEF, providing high speed matching and some rotational invariance; however, it lacks scale invariance.

Recent advances in the removal of false matches have enabled the development of various sophisticated techniques and algorithms, including both traditional and learning-based approaches, which are employed to enhance the accuracy and robustness of feature correspondence in images.

Resampling: Random Sample Consensus (RANSAC) repeatedly fits a given model to minimal sample sets and keeps the model with the largest consensus (inlier) set. Subsequent models, such as MAGSAC (Barath, Matas, and Noskova Citation2019) and MLESAC (Torr and Zisserman Citation2000), have improved the efficiency and accuracy of the original RANSAC. However, these methods’ runtimes grow exponentially with an increasing number of outliers, and the methods may fail in nonrigid situations.

Voting: Grid-based motion statistics (GMS) (Bian et al. Citation2017) uses a motion smoothness assumption and grid-based method to reduce algorithmic complexity and is suitable for dense feature point matching. Locality preserving matching (LPM) (Ma et al. Citation2019) utilizes a neighborhood topological consistency distance calculation method, and local global similarity consistency (LGSC) (Jiang et al. Citation2022) builds upon LPM with a local graph structure for graph consistency verification and matching.

Parameter Optimization: Inlier cluster finding (ICF) (Li and Hu Citation2010) uses SVM regression techniques to find corresponding function pairs, eliminating outliers accordingly. Vector field consensus (VFC) (Jiayi et al. Citation2014) performs robust vector field fitting in a reproducing kernel Hilbert space. These methods assume a globally smooth motion field, which is not always accurate due to depth discontinuities or independent motion in the scenes.

Feature matching with bounded distortion utilizes a methodology for ensuring consistency in matching by constraining deformations, focusing on finding correspondences between feature points while preserving local and global shape characteristics. In the realm of consistency-based techniques, the CODE (Lin et al. Citation2017) method improves feature correspondence by assessing the similarities and consistencies among features, yielding more accurate feature matching in images. This technique underlines the importance of evaluating both individual feature characteristics and their collective behavior for optimal matching.

Graph-based methods, such as MCDM (Ma et al. Citation2022) and mTopKRP (Jiang et al. Citation2019), are other significant contributions. MCDM employs probabilistic graphical models, treating each correspondence as a hypothetical node within a probabilistic graph model and focusing on the motion consistency of true correspondences to distinguish matches. mTopKRP selects the K-nearest neighbors for each feature point, accounting for the spatial relations and ranking order of local structures, which enhances the similarity description between feature points.

LOGO (Xia and Ma Citation2022) combines graph-based optimization techniques with a consideration of both local and global factors, using a fixed-point iterative method for more precise optimization. This method is particularly notable for its incorporation of local geometric consistency with global topological information.

Learning approaches have become increasingly prominent in feature matching. Techniques such as LMR (Yi et al. Citation2018), Learning to Find Good Correspondences (LFGC) (Zhang et al. Citation2019), OANet (Zhang et al. Citation2019), and Superglue (Sarlin et al. Citation2020) represent this trend. LMR trains a binary classifier to categorize matching pairs, focusing on eliminating mismatches to enhance matching quality. LFGC trains a network on epipolar geometry constraints and computes camera pose, sometimes at the expense of correct matches. OANet is a structured approach that effectively combines local and global contexts, thereby enhancing the accuracy and reliability of feature point correspondences. Finally, Superglue employs graph neural networks to generate more accurate correspondences, incorporating feature points’ positional information and using attention mechanisms, albeit with high computational demands.

3. Methods

In this section, we present our precise feature matching strategy, which comprises two stages. The first stage, referred to as coarse matching, utilizes established methods such as SIFT or ORB to derive a set of preliminary match results consisting of a reasonable number of inliers. The implementation of this stage is fairly straightforward given the maturity of the existing methods.

For the second stage, which we term the filtering phase or false removal (Jiang et al. Citation2021), we start by explaining graph property classification, the construction of nodes and edges on the graph, and the operations involved in pooling and classification.

We filter the coarse match results of the first stage to extract as accurate an inlier estimation as possible. To achieve this, we construct a graph based on the local motion consistency between a match and its neighboring matches. Subsequently, we establish a binary classifier over the properties of this graph using graph neural networks.

Our constructed undirected graph can be represented as a tuple g(V,E,z), where the following hold:

V is the set of all nodes, $V=\{V_1, V_2, \ldots, V_n\}$, where $n$ is the number of nodes in the graph, i.e. the central node plus its neighbors. In our model, node attributes represent the consistency of motion between matches.

E is the set of all edges, $E=\{e_{11}, e_{12}, \ldots, e_{mn}\}$, where $e_{mn}$ denotes an undirected edge from node $m$ to node $n$.

z is the node attribute mapping function, $z: x \rightarrow h$, which represents the node attribute update function. In a given pair of images, a graph is composed of each match and its surrounding match neighbors. These individual graphs can be collectively represented as a set $G$. Likewise, the attribute class (label) of each graph can be represented as a set $Y$, where $y_i$ represents the class label of the $i$th graph $g_i$.

$$G = \{g_1, g_2, \ldots, g_N\} \qquad (1)$$
$$Y = \{y_1, y_2, \ldots, y_N\} \qquad (2)$$

The fundamental objective of graph classification is to develop a precise mapping function f through a trained algorithm or machine learning model that can accurately associate the attribute representation X of an unlabeled graph with its corresponding class label yi. In this way, during the prediction phase, the learned mapping function f can be utilized to predict the class of an unknown graph.

3.1. Construction of the motion consistency graph

First, feature points (key points) and feature description vectors are obtained from two images using feature extraction algorithms such as SIFT, and initial matching results are obtained through methods such as brute force matching.

$$S = \{(p_i, q_i)\}_{i=1}^{N} \qquad (3)$$

Here, $p_i$ denotes the coordinates of a feature point in the original image and $q_i$ denotes the coordinates of the corresponding feature point in the destination image. $N$ represents the total number of matched points between the two images. The pair $(p_i, q_i)$ represents a match, and our task is to determine whether this match is correct.
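As a concrete illustration of this coarse-matching step, the sketch below uses OpenCV's SIFT detector and brute-force matcher (the tools named in Section 4.2) to build the set $S$ of Equation (3); the file names and variable names are placeholders, not part of the original implementation.

```python
import cv2
import numpy as np

# Illustrative sketch of the coarse-matching stage (Equation 3): detect SIFT
# keypoints in both images and pair them by brute-force descriptor matching.
img_a = cv2.imread("image_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("image_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, desc_a = sift.detectAndCompute(img_a, None)
kp_b, desc_b = sift.detectAndCompute(img_b, None)

# Brute-force matching on L2 distance; crossCheck keeps only mutual best matches.
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
raw_matches = bf.match(desc_a, desc_b)

# S = {(p_i, q_i)}: pixel coordinates of the putative correspondences.
S = np.array([(kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt) for m in raw_matches])
p = S[:, 0, :]  # points in the original image
q = S[:, 1, :]  # points in the destination image
```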

Given a set of matched point pairs, points $p_i$ that are close to each other in the original image tend to exhibit similar motion toward their counterparts $q_i$ in the destination image, including proximity of initial coordinates and similar motion vector magnitudes and directions. The converse principle also applies. The central node of the graph is the match whose inlier status is being verified. Images (c) and (d) of Figure 1 each comprise two graphs with central nodes in blue and neighboring nodes in black. The left graph in each image illustrates the normalized motion displacement pattern of matched node pairs, while the right graph depicts their motion consistency map, with matches represented by nodes and motion consistency by edge thickness. The thicker the edge, the larger the weight, and thus the tighter the connection between its two ends. The motion similarity between the central node and its neighboring nodes is used as the attribute of the graph nodes.

Figure 1. The central node’s matching status, across the four images, transitions from negative, in the context of random or low consistency movements in neighboring nodes (as depicted in Images (a) and (b)), to positive, when the central node and its neighbors exhibit high motion consistency (as illustrated in Images (c) and (d)).


Edges in the graph connect the central node to the nearest surrounding noncentral nodes, but no connections exist between noncentral nodes, as shown in Figure 1. These noncentral nodes represent the matches neighboring the central node in the original image. To reduce the scale of the graph during construction, we use a nearest-neighbor KD-tree to select the noncentral nodes rather than including all matches.
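A minimal sketch of this neighbor selection, assuming SciPy's cKDTree; the neighbor count K = 30 follows the setting reported in Section 4.2, and the array names continue from the previous sketch.

```python
from scipy.spatial import cKDTree
import numpy as np

# Sketch: pick the K nearest putative matches (in the original image) around
# each central match; these become the noncentral nodes of its graph.
K = 30

tree = cKDTree(p)                      # p: (N, 2) original-image coordinates
# query returns each point itself as its closest neighbor (distance 0),
# which matches the paper's convention of a "zeroth" node.
_, neighbor_idx = tree.query(p, k=K + 1)

# neighbor_idx[i] lists the graph nodes (center first) of the i-th match's graph.
```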

3.1.1. Construction of node attributes and adjacency matrix

In the context of input attributes on graphs, we replicate the matched central features to all neighboring nodes. The central node, regarded as the source node, disseminates the central matching information, as described in (Harel, Koch, and Perona Citation2006). This strategy aims to amplify the similarity and consistency among nodes within the graph, furnishing more contextual information for subsequent processing.

Normalization is performed on the coordinates of each to-be-determined feature match $(p_i, q_i)$. The normalization can be expressed as $o_{i,x} = p_{i,x}/w_{\max}$ and $o_{i,y} = p_{i,y}/h_{\max}$, where $w_{\max}$ and $h_{\max}$ represent the maximum width and height of the original and destination images. The same operation is applied to $q_i$ to obtain $d_i$. This normalization scales all coordinates onto the same plane, defined by the maximum width and height, and ultimately yields values between 0 and 1 for further processing by the neural network.

Let $X$ represent the feature matrix of the input graph's nodes.

$$X = [x_1\; x_2\; \ldots\; x_n] \qquad (4)$$

Here, $x_i$ represents the attributes of the $i$th node $V_i$ in the graph.

$$c = d_c - o_c \qquad (5)$$

$$n_i = d_{c_i} - o_{c_i}, \quad c_i \in N_c \qquad (6)$$

In Equations (5) and (6), the subscript $c$ denotes the central node, while $c_i$ represents the $i$th neighbor of the central node. Vector $c$ is the motion vector of the central node, and vector $n_i$ is the motion vector of the $i$th neighbor.

$$m_i = \frac{\min(|c|, |n_i|)}{\max(|c|, |n_i|)} \qquad (7)$$

$$k_i = \mathrm{ReLU}(\cos(c, n_i)) = \mathrm{ReLU}\!\left(\frac{c \cdot n_i}{|c|\,|n_i|}\right) \qquad (8)$$

In Equations (7) and (8), $m_i$ represents the consistency of the magnitudes of the motion vectors, while $k_i$ signifies the consistency of direction between the central node and the $i$th neighbor. Angles exceeding 90 degrees are set to zero through the ReLU. $|\cdot|$ represents the magnitude of a vector.

$$u_i = \mathrm{ReLU}\!\left(\frac{c}{\max(|c|, |n_i|)} \cdot \frac{n_i}{\max(|c|, |n_i|)}\right) \qquad (9)$$

$$r_i = \frac{\min(|o_{c_i} - o_c|, |d_{c_i} - d_c|)}{\max(|o_{c_i} - o_c|, |d_{c_i} - d_c|)}, \quad c_i \in N_c \qquad (10)$$

In Equation (9), the term $u_i$ represents the joint consistency of the vectors' magnitude and direction. By dividing both the central vector and the neighboring vector by the greater of their magnitudes, we normalize them to the scale of the longer vector, which measures the difference in the motion distance of the feature points. The difference in angle is then captured by the dot product. The result falls between −1 and 1, and applying the ReLU function sets results that indicate reverse motion to zero. This yields a scalar measure of the similarity of two vectors.

In Equation (10), the term $r_i$ denotes the ratio of the Euclidean distances from the $i$th neighbor to the central matching point across the two images.

$$x_i = [o_c,\; o_{c_i},\; d_c,\; d_{c_i},\; c,\; n_i,\; m_i,\; k_i,\; u_i,\; r_i]^{T} \qquad (11)$$

In this context, each node is characterized by the attributes of the central match and the neighboring match, and these attributes are subsequently aggregated via a graph neural network to create a consistent representation for each node.

These attributes include the starting coordinates $(o_c, o_{c_i})$ and ending coordinates $(d_c, d_{c_i})$ of the motion vectors, the motion vectors themselves $(c, n_i)$, the consistency of the motion vector magnitudes $m_i$, the direction consistency $k_i$, the unified representation of magnitude and direction $u_i$, and the ratio of the Euclidean distances $r_i$ between the matching point and its neighbors across the two images.

The attributes $m_i$, $k_i$, and $u_i$ all tend toward 1 for correct feature matches, while the $(d_c, d_{c_i})$, $(c, n_i)$, and $r_i$ values converge. This design is intended to help the GAT discern the inherent relationships among the nodes. All node attributes fall between 0 and 1, making them suitable input parameters for the neural network.

The closest neighbor to the central node is the node itself, which is denoted as the zeroth node. The ri value for the central node is manually set to 1.
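Read literally, Equations (5)-(11) translate into a few lines of numpy; the function below is our interpretation of those formulas for a single neighbor, with all names illustrative rather than taken from the released code.

```python
import numpy as np

def node_attributes(o_c, d_c, o_n, d_n):
    """Attribute vector x_i (Eq. 11) for one neighbor of a central match.

    o_c, d_c: normalized source/destination coordinates of the central match.
    o_n, d_n: normalized coordinates of the neighboring match.
    This is a sketch of our reading of Equations (5)-(11), not the authors' code.
    """
    o_c, d_c, o_n, d_n = map(np.asarray, (o_c, d_c, o_n, d_n))
    c = d_c - o_c                      # Eq. (5): motion vector of the center
    n = d_n - o_n                      # Eq. (6): motion vector of the neighbor
    relu = lambda v: np.maximum(v, 0.0)
    eps = 1e-12

    len_c, len_n = np.linalg.norm(c), np.linalg.norm(n)
    m = min(len_c, len_n) / (max(len_c, len_n) + eps)            # Eq. (7)
    k = relu(np.dot(c, n) / (len_c * len_n + eps))               # Eq. (8)
    u = relu(np.dot(c / (max(len_c, len_n) + eps),
                    n / (max(len_c, len_n) + eps)))              # Eq. (9)
    d_o = np.linalg.norm(o_n - o_c)
    d_d = np.linalg.norm(d_n - d_c)
    r = min(d_o, d_d) / (max(d_o, d_d) + eps)                    # Eq. (10)

    # Eq. (11): concatenate coordinates, motion vectors, and similarity terms.
    return np.concatenate([o_c, o_n, d_c, d_n, c, n, [m, k, u, r]])
```

For the zeroth node (the central match paired with itself), this computation would give an undefined $r_i$, which is why its value is set to 1 by hand, as noted above.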

When the motion vectors represented by the nodes demonstrate consistency, we add an undirected edge $e_{ci}$ between the central node and the neighboring node during initialization; the elements at positions $(c, i)$ and $(i, c)$ in the adjacency matrix are then both set to 1. In this context, $u_i$ could serve as the weight of the edge; however, since we employ GAT, we do not utilize edge weights.

$$e_{ci} \in E \qquad (12)$$

$$e_{ci} = \begin{cases} 1 & \text{if } u_i \geq \varepsilon \\ 0 & \text{if } u_i < \varepsilon \end{cases} \qquad (13)$$

To minimize the scale of the graph, we use the ReLU function in Equation (9) to adjust the formation of edges, discarding edges connected to neighboring nodes whose motion vectors are in complete opposition. If the scale and rotation angle of the image are determined to be relatively small, we can further increase the threshold $\varepsilon$ to reduce the number of edges in the graph.

3.2. Node attributes aggregation and graph classification

In our approach, we employ GAT to identify and establish connections between nodes demonstrating similar motion consistency. This allows us to obtain a more comprehensive representation of the graph. Incorporating a multi-head mechanism yields a more thorough understanding of the various implications of node attributes.

$$h_i = \mathrm{ReLU}\!\left(\mathrm{BN}\!\left(\sum_{j \in N(i)} \alpha_{ij}^{l} \left(W^{l} h_j\right)\right)\right) \qquad (14)$$

In Equation (14), $h_i$ represents the new attributes of the node after aggregation, and $l$ is the layer index. The first step in this process is to individually compute the attention coefficients between node $i$ and its neighbors $j$. Given that the attribute updates are oriented toward each individual node, we opt not to employ $N_c$. In the second step, based on the calculated attention coefficients, the attributes are aggregated through a weighted summation. Here, $\alpha_{ij}^{l}$ refers to

$$\alpha_{ij}^{l} = \frac{\exp(\sigma(e_{ij}^{l}))}{\sum_{k \in N(i)} \exp(\sigma(e_{ik}^{l}))} \qquad (15)$$

$$e_{ij}^{l} = a^{T}\left([W^{l} h_i \,\Vert\, W^{l} h_j]\right) \qquad (16)$$

In Equation (16), the attention coefficient (Velickovic et al. Citation2017) is introduced, where $^{T}$ denotes transposition and $\Vert$ is the concatenation operation. First, node attributes are upsampled using a linear mapping with shared parameters $W^{l}$. $a^{T}$ then maps the concatenated high-dimensional attributes to a real number $e_{ij}^{l}$ after a nonlinearity $\sigma$ (i.e. LeakyReLU) is applied, and a softmax operation yields the relative attention coefficients $\alpha_{ij}^{l}$ for each node, reflecting the degree of closeness between node $i$ and node $j$. When the layer $l$ is zero, $h_i$ represents the input $x_i$.

$$h_i^{(K)} = \big\Vert_{k=1}^{K}\, \sigma\!\left(\sum_{j \in N(i)} \alpha_{ij}^{k} \left(W^{k} h_j\right)\right) \qquad (17)$$

$$z_g = \mathrm{Readout}\left(\{h_i,\; i \in V\}\right) \qquad (18)$$

Moreover, the attributes of the node are obtained by concatenating the outputs of the multi-head mechanism in Equation (17), where $K$ represents the number of attention heads. Subsequently, we employ graph pooling (Readout) to acquire a graph-level property representation $z_g$ from the set of node attribute representations, as shown in Figure 2. This aids in class label prediction for unknown graphs, with $y = f(z_g)$.
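The aggregation of Equations (14)-(17) corresponds closely to stacked multi-head GAT layers; the sketch below uses PyTorch Geometric's GATConv with three layers, matching the depth reported in Section 4.2, while the hidden sizes are placeholders of our own.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GATEncoder(nn.Module):
    """Three multi-head GAT layers with BN and ReLU (our sketch of Eqs. 14-17)."""

    def __init__(self, in_dim=16, hidden=64, heads=4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads, concat=True)
        self.bn1 = nn.BatchNorm1d(hidden * heads)
        self.conv2 = GATConv(hidden * heads, hidden, heads=heads, concat=True)
        self.bn2 = nn.BatchNorm1d(hidden * heads)
        self.conv3 = GATConv(hidden * heads, hidden, heads=heads, concat=True)
        self.bn3 = nn.BatchNorm1d(hidden * heads)

    def forward(self, x, edge_index):
        # Each layer attends over a node's neighbors (Eqs. 15-16), aggregates by
        # the attention-weighted sum (Eq. 14), and concatenates the heads (Eq. 17).
        x = torch.relu(self.bn1(self.conv1(x, edge_index)))
        x = torch.relu(self.bn2(self.conv2(x, edge_index)))
        x = torch.relu(self.bn3(self.conv3(x, edge_index)))
        return x
```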

Figure 2. The diagram depicts the architecture of our model. Initially, the model processes the graph and its attributes. The data then pass through GAT layers with multilayer and multi-head features. Nodes of similar colors have received attention, and their attributes aggregated, as shown by adjacent squares. The data then enter the attention pooling layer. Different node sizes indicate the contribution of the nodes to the graph’s properties, forming the graph embedding. Finally, traditional machine learning is employed to classify the graph properties.


Considering that each node’s contribution to the overall graph property is inconsistent in our case and that nodes more consistent with the center node are more important, we allow the network to learn the importance rankings first and then perform pooling.

The attention pooling layer in graph neural networks (Itoh, Kubo, and Ikeda Citation2022) is used to aggregate the information of nodes in a graph into a global representation. It determines the contribution of node properties to the global representation by learning the attention weight of each node, thereby aggregating nodes based on their importance. The attention pooling layer calculates the weighted property representation according to the attributes of the nodes and the attention weight, where the attention weight reflects the importance or relevance between the nodes.

After concatenating the multi-head attributes, we use the attention pooling (AP) mechanism rather than pooling each head’s attributes separately. This allows attention pooling to balance multiple heads, understand more attributes, and learn which attributes are more important.

$$\mathrm{AP}(h_i) = \sigma(b^{T} h_i) \qquad (19)$$

$$\beta_i = \frac{\exp(\mathrm{AP}(h_i))}{\sum_{j \in V} \exp(\mathrm{AP}(h_j))} \qquad (20)$$

Similar to the attention mechanism employed in the aforementioned GAT, the primary components of the AP mechanism include a linear layer $b^{T}$ that maps high-dimensional attributes to real-valued scalars (i.e. a fully connected layer), a nonlinearity, and a softmax operation, which are used to compute the attention score $\beta_i$ for each node in the graph. These attention scores are then normalized and combined through a weighted sum to obtain the overall attribute representation of the graph.

$$z_g = \sum_{i \in V} \beta_i h_i \qquad (21)$$

$$p = \mathrm{Sigmoid}\!\left((\mathrm{FC} \circ \mathrm{BN} \circ \mathrm{ReLU})^{L}(z_g)\right) \qquad (22)$$

Here, $z_g$ represents the output obtained from the pooling layer. The network uses batch normalization and the ReLU activation function to enhance training stability and accelerate convergence. The final layer employs a sigmoid activation function, allowing the network’s output to be interpreted as the probability of the category. $\circ$ represents the composition of the fully connected (FC) layer, batch normalization (BN), and ReLU activation, $L$ denotes the number of such composite layers, and $p_i$ denotes the prediction of the graph properties.

$$\hat{y}_i = \begin{cases} 1 & \text{if } p_i \geq \gamma \\ 0 & \text{if } p_i < \gamma \end{cases} \qquad (23)$$

$$I = \{\, i \mid \hat{y}_i = 1 \,\} \qquad (24)$$

In this context, $\gamma$ is a threshold between zero and one used to adjust the tendency of the network output, favoring either recall or precision. $\hat{y}_i$ represents the predicted graph property of the $i$th match in a pair of images, and $I$ denotes the set of prediction results given by the algorithm.
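Equations (19)-(24) amount to a learned per-node score, a per-graph softmax, a weighted sum, and a small FC/BN/ReLU head with a sigmoid output. The module below is a sketch under those assumptions; it expects PyTorch Geometric-style batching, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.utils import softmax

class AttentionPoolingHead(nn.Module):
    """Attention pooling (Eqs. 19-21) and the FC/BN/ReLU + sigmoid head (Eq. 22).
    A sketch under our assumptions, not the released implementation."""

    def __init__(self, node_dim=256, hidden=64, num_blocks=2):
        super().__init__()
        self.score = nn.Linear(node_dim, 1)              # b^T h_i in Eq. (19)
        blocks, dim = [], node_dim
        for _ in range(num_blocks):                      # (FC -> BN -> ReLU)^L, Eq. (22)
            blocks += [nn.Linear(dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU()]
            dim = hidden
        self.mlp = nn.Sequential(*blocks, nn.Linear(dim, 1))

    def forward(self, h, batch):
        # h: (total_nodes, node_dim); batch[i] gives the graph id of node i.
        beta = softmax(self.score(h).squeeze(-1), batch)            # Eq. (20), per graph
        num_graphs = int(batch.max()) + 1
        z_g = torch.zeros(num_graphs, h.size(1), device=h.device)
        z_g.index_add_(0, batch, beta.unsqueeze(-1) * h)            # Eq. (21)
        p = torch.sigmoid(self.mlp(z_g)).squeeze(-1)                # Eq. (22)
        return p

# Eq. (23): a match is accepted when p_i >= gamma (0.5 in the experiments).
```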

4. Experiment and results

In this section, we present an accurate and reliable graph dataset with labels for training and evaluating our model. These data originate from several large remote sensing images that we randomly cropped and performed rotation and scale adjustments. We subsequently performed feature extraction, and the coordinates on the source images were traced back to determine accurate feature matches and construct the graph data. These data were then divided into training and testing datasets. We also sought other types of datasets for testing. We compared our method with GMS, LPM, LMR, mTopKRP, LGSC, LOGO, MCDM and OANet.

4.1. Dataset generation

Given the lack of preexisting graph datasets suitable for our remote image feature point matching and graph classification tasks, we generated our own training graph datasets to emulate real-world image correspondences.

4.1.1. Training dataset generation

The first portion of our training set images is derived from remote sensing images mentioned in the paper (Ma, Zhong, and Zhang Citation2015). This dataset includes four 10,000 * 9,000-pixel images (only one is actually published), with a spatial resolution of 0.61 m. These images, captured from the USGS (United States Geological Survey), cover Montgomery, Ohio, USA. The dataset was collected and produced by the RSIDEA research group of Wuhan University.

The second portion of the dataset is sourced from MtS-WH (Wu, Zhang, and Du Citation2017), mainly comprising two 7200 * 6000-pixel high-resolution remote sensing images, which encompass Hanyang district in Wuhan, China. These images were obtained in February 2002 and June 2009, respectively, with a resolution of 1 m. The dataset was generated by the SIGMA research group of Wuhan University.

As shown in Figure 3, we performed random cropping on these large remote sensing images to generate image pairs (‘A’ and ‘B’ boxes in Figure 3), each including an image with a random width and height ranging from 300 to 800 pixels. This range was determined by referencing the standard dimensions of widely-used image matching datasets: for example, the ADE (Wong et al. Citation2011) dataset, which typically includes image widths between 480 and 900 pixels; the VGG (Kadir, Zisserman, and Brady Citation2004) dataset, with widths ranging from 640 to 1000 pixels; and the Daisy (Tola, Lepetit, and Fua Citation2009) dataset, which provides images at a consistent resolution of 768 by 512 pixels. The dimensions of the two images can differ, adding difficulty to the dataset. Among the generated images, 50% were randomly subjected to scale transformations between 0.4 and 1, and another 50% underwent rotations of up to ±120 degrees. Images could undergo both scaling and rotation. These transformations ensure a variety of image pairs under different scales and orientations for model training. The overlap area between the two images had to be greater than 10% of the smaller image’s area to ensure sufficient matching inliers.

Figure 3. The random cropping dataset generation process. ‘A’ and ‘B’ boxes represent a pair of cropping regions. The blue and red dots indicate correct and incorrect matches, respectively, along with their corresponding points on the source image.


For each pair of overlapping images, we applied SIFT for feature extraction and brute force matching. For each feature match, we used the images’ cropping start positions, rotation matrices, and scaling factors to trace the points back to their original locations in the source large-scale image and determine whether they are correct matches. In Figure 3, correct matches $(p_1, q_1)$ are highlighted in blue (points whose coordinates on the source large-scale image are close or identical), and incorrect matches $(p_2, q_2)$ are highlighted in red (points that are far apart on the source large-scale image). Given that feature point coordinates can be subpixel and the rotation matrix can lead to coordinate truncation, we set an offset error of 3 pixels (Euclidean distance) for effective and accurate match determination. Correct and incorrect match pairs were labeled as 1 and 0, respectively.

$$err_i = \left| \left(\tfrac{1}{s_o} R_o^{T} p_i + l_o\right) - \left(\tfrac{1}{s_d} R_d^{T} q_i + l_d\right) \right| \qquad (25)$$

In Equation (25), $R_o$ and $R_d$ denote the rotation matrices of the original and destination crops, respectively, where an identity matrix represents the absence of rotation; $s_o$ and $s_d$ denote the scale factors; and $l_o$ and $l_d$ denote the coordinates of the starting positions of the cropped images on the source large-scale image, which correspond to the upper-left corners of rectangular areas A and B in Figure 3. The norm $|\cdot|$ gives the error $err_i$ in pixel distance between the matched points traced back to the large-scale image.
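Under this reading of Equation (25), labeling one putative match reduces to undoing each crop's scale and rotation, adding the crop origin, and thresholding the distance between the two recovered source-image locations; the variable names in the sketch below are illustrative.

```python
import numpy as np

def traceback_label(p, q, R_o, R_d, s_o, s_d, l_o, l_d, tol=3.0):
    """Label one putative match by tracing both points back to the source image.

    Our sketch of Equation (25): undo the scale and rotation of each crop, add
    the crop's top-left offset, and compare the recovered locations.
    R_o, R_d: 2x2 rotation matrices of the two crops (identity if unrotated).
    s_o, s_d: scale factors; l_o, l_d: crop origins in the source mosaic.
    """
    src_from_p = (1.0 / s_o) * R_o.T @ np.asarray(p) + np.asarray(l_o)
    src_from_q = (1.0 / s_d) * R_d.T @ np.asarray(q) + np.asarray(l_d)
    err = np.linalg.norm(src_from_p - src_from_q)
    return 1 if err <= tol else 0      # 1 = correct match, 0 = incorrect match
```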

In total, we generated 4,500 pairs of images; approximately 3 million match pairs derived from 4,300 of these image pairs were used for training, comprising 1,481,403 incorrect matches and 1,776,543 correct matches. We downsampled the correct matches to approximately the number of incorrect matches to prevent model training bias. The dataset was shuffled, and the final training set consisted of approximately 2.96 million graphs. The remaining 200 image pairs, constituting approximately 140 thousand graphs, were used for the test set. The number of inliers in the test set is described in the following sections.

The second part of the training data comes from the UZH-FPV Drone Racing dataset (Delmerico et al. Citation2019), which is specifically designed for state estimation in drone racing. This dataset uses a fisheye camera model, and severe distortion is present in the original images. We selected the scene categorized as ‘Outdoor 45-degree downward-facing’ and restricted our focus to the lawn area. Images separated by no more than 5 frames constitute a pair of matching images A and B. The pixel coordinates then underwent fisheye distortion correction, with the distortion parameters sourced from the dataset. Subsequently, brute force matching and RANSAC were used to compute the homography matrix H from Image B to Image A, with a reprojection error of 7 pixels, yielding a rough matching result.

Next, epipolar constraints were utilized. The dataset’s ground truth consists of camera poses obtained through optical imaging. We constructed the fundamental matrix F from the ground-truth rotation matrix R, the translation vector t, and the intrinsic parameters of the camera, and used it for error measurement, as referenced in (Bian et al. Citation2019). Since the F matrix cannot fully determine the correctness of points, a manual secondary check was performed to ensure that as many correct matching points as possible were retained.

The epipolar residual is defined in Equations (26) and (27):

$$l_i = F p_i \qquad (26)$$

$$e_i = \frac{|l_i^{T} q_i|}{\sqrt{l_{i,1}^{2} + l_{i,2}^{2}}} \qquad (27)$$

Here, $(p_i, q_i)$ represents a pair of matches expressed in homogeneous pixel coordinates, and $l_i = [l_{i,1}, l_{i,2}, l_{i,3}]^{T}$ gives the homogeneous coordinates of the epipolar line, so that $e_i$ is the point-to-line distance. The epipolar distance threshold was set to 3 pixels to further filter correct matches.
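A sketch of this epipolar check is given below. It assumes the standard construction $F = K_2^{-T} [t]_\times R\, K_1^{-1}$ from the ground-truth pose and camera intrinsics, which the paper does not spell out, and then applies the point-to-line distance of Equations (26)-(27).

```python
import numpy as np

def skew(t):
    # Cross-product matrix [t]_x of a 3-vector t.
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_from_pose(R, t, K1, K2):
    # Standard construction F = K2^{-T} [t]_x R K1^{-1} (our assumption about
    # how F is built from the ground-truth pose and intrinsics).
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def epipolar_distance(F, p, q):
    """Point-to-epipolar-line distance of Eqs. (26)-(27) for one match (p, q),
    given in pixel coordinates of the two undistorted images."""
    p_h = np.array([p[0], p[1], 1.0])
    q_h = np.array([q[0], q[1], 1.0])
    l = F @ p_h                                        # Eq. (26)
    return abs(l @ q_h) / np.sqrt(l[0] ** 2 + l[1] ** 2)   # Eq. (27)

# Matches with epipolar_distance(...) <= 3 pixels are kept as correct candidates.
```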

In our investigation, we used the Random Cropping and FPV datasets to individually train two distinct models, each aimed at addressing the unique challenges presented by rigid and nonrigid image scenes, respectively.

4.1.2. Test dataset

Swimming Pool and Car Detection: This dataset is derived from the open Kaggle database and comprises approximately six thousand images with a resolution of 224*224 pixels across the training and testing sets. Two hundred pairs of overlapping images were selected. We used RANSAC to calculate the homography matrix H between the two images of each pair and conducted manual verification. Since POOL consists of simple planar translational scenes, RANSAC is effective on this dataset. The projection error was set to 3 pixels. The H matrix gives the mapping of pixel points between the two images. This formed our POOL dataset.

In the POOL dataset, H transforms the pixel coordinates as shown in Equation (28). By directly subtracting pixel values as in Equation (29), the overlapping region is represented by the pixel-value differences. A zero value, i.e. a black area in the difference image, indicates that the content is exactly the same in both images, confirming the correctness of the H matrix and thereby the accuracy of our matches; otherwise, the difference image exhibits significant noise.

Here, $x = [x, y, 1]^{T}$ and $x' = [x', y', 1]^{T}$ represent the homogeneous coordinates before and after the transformation $x' = Hx$. In Equation (28), $B(x, y)$ denotes the pixel value at coordinates $(x, y)$ of image B, and $|\cdot|$ represents the absolute value used to generate image D.

$$B'(x, y) = B(x', y') \qquad (28)$$

$$D(x, y) = |A(x, y) - B'(x, y)| \qquad (29)$$

We also used the positive classes of our Random Cropping and FPV datasets to calculate the H matrix by least squares. The black areas indicate equal pixel values and represent the intersections of images A and B in Figure 4.
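In OpenCV terms, this validation can be sketched as warping image B into image A's frame with H and taking the absolute per-pixel difference; the direction of H, the file names, and the matched-point arrays below are our assumptions.

```python
import cv2
import numpy as np

# Sketch of the difference-image check (Eqs. 28-29). We assume H maps image-B
# pixel coordinates into image A's frame; invert H if the estimated homography
# goes the other way.
img_a = cv2.imread("pool_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("pool_b.png", cv2.IMREAD_GRAYSCALE)

# pts_a, pts_b: (N, 2) float32 arrays of matched pixel coordinates (placeholders).
H, _ = cv2.findHomography(pts_b, pts_a, cv2.RANSAC, ransacReprojThreshold=3.0)

# Warp B into A's frame, then take the absolute per-pixel difference (image D).
b_warped = cv2.warpPerspective(img_b, H, (img_a.shape[1], img_a.shape[0]))
D = cv2.absdiff(img_a, b_warped)

# Near-zero (black) regions of D indicate that H explains the overlap correctly.
```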

After distortion removal and re-addition, the fisheye pixel coordinates incur certain truncation errors, and exposure differences between frames mean the difference image is not purely black, as shown in Figure 4.

Figure 4. Sample images from the POOL, random cropping, and FPV datasets. Each section contains three images, with the first two being a pair of matching images, image A and image B. The third image, containing black shapes, is formed by subtracting the pixel values of the images, resulting in image D.


For model evaluation, we applied the POOL and Random Cropping datasets to scrutinize the efficacy of the model developed for rigid environments, and we employed the FPV dataset to systematically assess the performance of the model under nonrigid conditions. Random Cropping dataset: 200 pairs of randomly generated images form the random cropping test set. In addition, 100 pairs of images from the FPV dataset were used for testing.

4.2. Implementation details

All experiments were conducted on a machine equipped with an Intel i7 9700 processor, 64 GB of RAM, and an NVIDIA GeForce RTX3060 GPU with 12 GB of memory. The datasets were generated using SIFT feature extraction and brute-force matching from OpenCV to obtain initial feature match pairs. Random numbers used in the dataset generation process were generated by the C++ random library.

The number of neighbor nodes was set to 30, but adjusting this number based on the density of feature points is recommended. For the initialization of the adjacency matrix, the parameter ε in Equation (13) was set to 0.3.

For the standard sigmoid output function in binary classification tasks, 0.5 is typically chosen as the threshold, corresponding to the adjustable parameter γ in Equation (23). We tested different values of γ to observe their performance, exploring the model’s balance between prioritizing high precision or high recall. These results are presented in Table 1.

Table 1. Performance on the Random Cropping Dataset under different γ values expressed as percentages, with all other parameters held constant. Values in parentheses indicate standard deviation.

As shown in Table 1, our algorithm demonstrates some adjustability; 0.5 was thus used as the experimental value for γ.

In our model structure, the features of nodes and the connections of edges were designed specifically for GAT. However, we also considered a variety of graph neural networks (GNN) that are widely used for comparative purposes to determine the most suitable model for our graph classification task. These include graph convolutional networks (GCN) (Zhang et al. Citation2019), which operate convolutions on the nodes of the graph and their neighbors; graph isomorphism networks (GIN) (Xu et al. Citation2018), which address graph isomorphism; and GraphSAGE (Hamilton, Ying, and Leskovec Citation2017), which generates node embeddings by sampling and aggregating features of local neighborhoods of nodes.

For the pooling layer following the graph neural network, we examined several approaches. Sum Pooling (Xu et al. Citation2018) has been proven to be more expressive than Max and Mean Pooling. Top-K Pooling (Gao and Ji Citation2019) selects the top K nodes based on node scores, or a fixed number of nodes from the entire graph, for pooling, which is similar to our attention pooling, except that our method considers the attention of all nodes. Set2Set Pooling (Vinyals, Bengio, and Kudlur Citation2015) performs pooling over sequences using an internal LSTM. Finally, our Attention Pooling accounts for the attention scores of all nodes.

We compared these methods with our GAT model. The hidden layer dimension, denoted as ‘middle’, was set to be the same for all models, and the total number of graph neural network layers was three. The model underwent training with the Adam optimizer (Kingma and Ba Citation2014), set at a learning rate of 0.001 and incorporating a weight decay parameter of 5e-4. A binary cross-entropy loss function was then employed to quantify the discrepancy between the model’s predicted output and the actual labels. Finally, in the later part of the network, all methods employed the same fully connected layer (FC) and sigmoid activation function for classification.
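The optimization setup described above corresponds to a standard training loop; the sketch below assumes PyTorch Geometric data loading and treats the model and the graph list as placeholders (for example, the GAT encoder and attention-pooling head sketched earlier, combined into one module).

```python
import torch
import torch.nn as nn
from torch_geometric.loader import DataLoader

def train(model, train_graphs, epochs=160):
    """Sketch of the training setup described above: Adam (lr 1e-3, weight
    decay 5e-4) with binary cross-entropy on the per-graph predictions.
    `model` and `train_graphs` are placeholders (a graph classifier returning a
    probability per graph, and a list of PyG Data objects with .x, .edge_index,
    and a 0/1 label .y)."""
    loader = DataLoader(train_graphs, batch_size=256, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
    criterion = nn.BCELoss()

    model.train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            # Per-graph probability that the central match is correct.
            p = model(batch.x, batch.edge_index, batch.batch)
            loss = criterion(p, batch.y.float())
            loss.backward()
            optimizer.step()
    return model
```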

We selected 131,080 graphs from the dataset for testing, containing the two classes in equal proportions. After a total of 160 training epochs, we monitored the variation in the F score to assess the performance of the various graph neural networks combined with different pooling layers on our graph classification task, with evaluations based on the total number of graphs in the dataset. All other experimental procedures remained unchanged.

The results in Figure 5(a) demonstrate that all graph neural networks possess robust learning capabilities. Although there were no significant differences among the GNNs, distinctions were still observable due to the large size of the dataset. The GCN exhibited good adaptability under different pooling schemes. GIN, which focuses more on the structural aspects of graphs, performed less effectively on our simpler graphs. GraphSAGE provided positive assistance by selecting neighbors. GAT recognized the importance of neighbors but potentially lost further graph information when combined with Top-K pooling. Our attention pooling performed well across the various GNNs. Set2Set demonstrated consistent performance, likely due to its larger number of parameters. Sum pooling also showed impressive expressiveness, while Top-K Pooling performed less favorably for our task.

$$x_{i,\,\text{without center prior propagation}} = [o_{c_i}, d_{c_i}, n_i, m_i, k_i, u_i, r_i]^{T} \qquad (30)$$

$$x_{i,\,\text{without vector similarity comparison}} = [o_c, o_{c_i}, d_c, d_{c_i}, c, n_i]^{T} \qquad (31)$$

$$x_{i,\,\text{neither}} = [o_{c_i}, d_{c_i}, n_i]^{T} \qquad (32)$$

Figure 5. (a) Classification results of different graph neural networks combined with various pooling layers on our graph dataset, with a particular focus on the impact of attention pooling on different GNNs with varying node attributes. The highest values are marked with diamond blocks. Given the large size of the dataset, percentages are shown to five decimal places.


Consequently, we chose attention pooling combined with GNNs. Figure 5(b) presents a further analysis of node attributes, which we conducted to identify which aspects are more prominent and more likely to be noticed by the network. We restructured the node features of Equation (11), producing variants without center prior propagation ($o_c$, $d_c$, $c$) in Equation (30), without the vector similarity comparison ($m_i$, $k_i$, $u_i$, $r_i$) in Equation (31), and with both removed in Equation (32).

The experiments revealed that complete node attributes in Equation (11) performed the best, with vector similarity attributes contributing the most to the model, followed by central connectivity. The absence of these two attributes made network fitting more challenging.

GCN combined with Set2Set exhibited a slight advantage in some respects, but its parameter count was significantly higher than that of attention pooling. In Figure 5(a), GCN with attention pooling and GAT with attention pooling performed almost identically. However, considering the interpretability of GAT in our model, we continued to use GAT combined with attention pooling for the subsequent tests.

For our experimental comparison, we selected multiple algorithms, including GMS, LPM, LMR, OANet, MCDM, LGSC, LOGO and mTopKRP. These algorithms are based on similar mismatch removal methods. Our test dataset included a POOL dataset with 200 image pairs, a random cropping dataset with 200 image pairs, and an FPV dataset with 100 image pairs. We believe that a larger number of images covers a broader range of scenarios, thus yielding more stable and representative experimental results. Figure 6 depicts a representative subset of our experimental results.

Figure 6. Examples of the experimental results from our feature matching algorithm on the POOL, random cropping and FPV datasets. The images from the POOL dataset displayed were upsampled by ESRGAN (Wang et al. Citation2018), but the images used in the experiments were not altered. The blue lines connect the coordinates of pi and qi on one white image. The color scheme is as follows: blue lines indicate true positives, black lines indicate true negatives, green lines indicate false negatives, and red lines indicate false positives.


As shown in Figure 6, in the case of the POOL dataset, which has smaller image dimensions, the detected feature points are comparatively scarce. The Random Cropping dataset, however, exhibits a substantial quantity of inliers. The FPV dataset, marked by significant repeating textures, contains a larger number of detected feature points. Figure 7 portrays the distribution of the inlier ratios and counts, covering an extensive range of scenarios, with inlier ratios spanning the majority of the 0–100% interval and inlier counts varying from a few dozen to thousands, reflecting the diversity of the scenes.

Figure 7. Cumulative distribution for inlier ratio, total matches, and correct matches in POOL, Random Cropping and FPV datasets. The inlier ratio is on the left y-axis as a percentage; total matches and correct matches are on the right y-axis as counts.


Figure 8. The quantitative analysis results. From top to bottom, the cumulative distribution graphs of each column represent the precision, recall, and F score. From left to right, the datasets used are POOL, Random Cropping, and FPV.


As depicted in Figure 8, the precision and recall in testing were calculated based on the matches in each pair of matching images, rather than over the entire set or batch as a unit.

All algorithms achieved admirable precisions on the POOL dataset. This dataset, consisting of relatively simple and rigid images, exhibits a coherent motion pattern across all matches. The LGSC, LOGO, MCDM, mTopKRP, and GMS algorithms, as well as our method, displayed extraordinary precision, performing nearly perfectly across the series of 200 image pairs. This performance demonstrates that our algorithm has effectively learned the scene’s patterns. On the other datasets, the performance remained consistent, with LMR falling slightly behind, possibly because its model was not specifically trained for remote sensing scenarios. Notably, on both the Random Cropping and FPV datasets, LGSC, LOGO, mTopKRP and GMS achieved performances closest to that of our algorithm. However, our algorithm once again achieved the highest performance.

In the context of recall, our algorithm was marginally outperformed by other methods, although it still maintained a high recall score. The primary contributing factor is our use of a KD-tree for swift neighbor selection based on minimal Euclidean distance, which can falter in scenarios with an absolute lack of correct matches or under extreme rotation. Meanwhile, the mTopKRP algorithm, specifically designed for remote sensing images and leveraging multiple iterations, achieved a recall close to 100%, the highest recall rate. Both LMR and LPM also performed impressively, especially considering their time efficiency.

Concerning the F score, our algorithm excelled by balancing precision and recall, thereby achieving the highest F score. Although the advantage was not as conspicuous in entirely rigid images, such as those in the POOL dataset, it was particularly apparent in images with rotation and scale changes, as well as in nonrigid images.

For runtime analysis, we excluded descriptor calculation and brute-force matching from all comparisons. As shown in Figure 9, the GMS algorithm completed all experiments on the POOL dataset in the shortest timeframe. This is attributed to its grid-based strategy, which reduces the time complexity to O(1). LPM also exhibited commendable time efficiency due to its implementation of two rounds of iterations, where the first round’s output feeds into the second round to augment the algorithm’s performance. In contrast, mTopKRP and LGSC utilize three iterations, which incurred a significantly larger computational cost. Our algorithm achieved a middle-range performance in computational time.

Figure 9. Cumulative distribution of computational time for matching each pair of images in the POOL, random cropping and FPV datasets.


We analyzed the predictive errors of our algorithm by comparing the root mean square error (RMSE) and the maximum error (MAE). For the POOL dataset, the function $f$ is the reprojection given by the H matrix.

$$\mathrm{RMSE} = \sqrt{\frac{1}{|I|}\sum_{i \in I} \left\| p_i - f(q_i) \right\|^{2}} \qquad (33)$$

$$\mathrm{MAE} = \max\left\{ \left\| p_i - f(q_i) \right\|,\; i \in I \right\} \qquad (34)$$

For the FPV dataset, the error comes from the reprojection of the undistorted points. The error for the Random Cropping dataset comes from the disparity between points traced back to the original images, calculated according to Equation (25), with $\| p_i - f(q_i) \|$ replaced by $err_i$ in Equation (33). Incorrect matches incur significant numerical penalties, as wrongly matched points can be far apart in the original images. The results are presented in Table 2.

Table 2. The average RMSE and MAE performance on three datasets, with standard deviations in parentheses.

These two metrics reflect the precision of the algorithms to a certain extent. On the POOL dataset, our algorithm was slightly behind GMS and LGSC, partly because the POOL dataset was not included in our training set; nevertheless, it maintained an average RMSE within 1 pixel. Our algorithm performed well on the Random Cropping and FPV (fisheye) datasets, especially in terms of MAE, which is crucial for applications built on feature matching, and it performed similarly to GMS, LGSC, and mTopKRP. The advantages of our method in terms of RMSE and MAE remain evident.
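For reference, Equations (33)-(34) reduce to a few lines of numpy once the reprojected coordinates $f(q_i)$ are available; the function below is our reading of those formulas, with the reprojection left to whichever mapping applies to the dataset.

```python
import numpy as np

def rmse_and_max_error(p, q_reprojected):
    """RMSE and maximum error (Eqs. 33-34) over the predicted inlier set I.

    p: (|I|, 2) reference coordinates; q_reprojected: (|I|, 2) coordinates of
    f(q_i), the reprojection appropriate to the dataset (H matrix, undistortion,
    or crop trace-back). A sketch of our reading of the formulas.
    """
    d = np.linalg.norm(np.asarray(p) - np.asarray(q_reprojected), axis=1)
    rmse = np.sqrt(np.mean(d ** 2))
    max_err = d.max()
    return rmse, max_err
```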

5. Discussion

In the previous section, we analyzed the strengths and primary contributions of our model, namely, the vector consistency assessment in the node features and the central prior propagation. Our combination of node feature engineering and a highly capable graph neural network classifier gives the algorithm a high F score compared with the voting-based and graph-based algorithms.

Precision, recall, and the F score are three core metrics for evaluating the performances of binary classification models. For instance, in certain scenarios, we might prioritize recall, such as in remote sensing where texture is extremely sparse and detectable feature points are limited, necessitating the maintenance of a few match points for subsequent tasks. In scenarios with abundant feature points (though too many points can increase processing time), we can prioritize precision to yield better initial values for other operations.

Our graph structure, which observes only a portion of the image’s matching neighbors, fails to fully utilize global information. When a correct feature match completely lacks similar matches around it, the algorithm cannot aggregate useful information from neighboring nodes, leading to a lower recall rate. Using a sigmoid function as the output, we can adjust the algorithm’s output tendency through γ, as discussed in Section 4.2. However, this adjustment is limited and might reduce model performance in other aspects. For the binary classification of feature matching, points in the ambiguous zone are the most challenging to differentiate, and the algorithm must make trade-offs.

One balancing approach is to design a lenient first pass that classifies ambiguous points as correct matches to preserve recall, and then to iterate on the results, using the improved inlier rate as input. However, this approach can reduce precision and increase processing time. A potential solution is to adjust the loss function’s preference during training or to directly adjust the γ of the trained weights to achieve higher recall rates; these results are then iterated as model inputs to achieve a more robust output.

In model comparisons, due to the multitude of variables, we could not test every variable in detail and conducted tests only on a limited number of models. The model performance on different GNNs also shows that GAT is suitable for our constructed graph classification problem, but there is more in-depth information worth exploring.

For 3D point cloud matching, a similar assumption of motion consistency exists, and our method might be equally applicable, especially since our graph structure and node features can be designed similarly.

Notably, our model performs well on the FPV dataset, indicating its suitability for distorted scenarios. However, this does not mean it will be effective for all types of distortion. For different camera intrinsics and distortion models, retraining the model on specific camera images is recommended.

6. Conclusions

In this paper, we proposed a graph property binary classifier for robust selection of feature point matches in remote sensing images. Our core idea is that correct matching points have similar motion displacements, including direction and magnitude. Based on this, we built an undirected graph with designed node features and initial edge connections. Using a graph attention network, we learned the internode correlations, enabling the correctness of the central node’s match to be classified. We also designed a training dataset, generating a large amount of reliable data with accurate labels through random cropping and source point tracing. The results showed that our method is applicable to both rigid and nonrigid remote sensing image matching tasks, demonstrating high precision and robustness in tests. However, we have not yet proven its effectiveness in perspective scenes or more generalized F-matrix scenarios. Moreover, relying solely on nearest neighbor methods fails to fully utilize global information, and the model may fail when similar matches are completely lacking or scarce in the vicinity. These issues await further verification and improvement and will be the focus of our future work.

Acknowledgments

We sincerely thank the authors of OANet, GMS, LPM, LMR, and mTopKRP for providing their algorithm codes, which facilitated the comparative experiments.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The code, trained weights and experimental data that support the findings of this study are openly available in GitHub at https://github.com/codingbadbad/RS-Image-Feature-Matching-via-GCwith-LMC. The test dataset is publicly available in figshare at https://figshare.com/articles/dataset/RS_Image_Feature_Matching_via_GCwith_LMC/23995008.

Additional information

Funding

This work was supported by Key scientific and technological innovation projects of Fujian Province: [grant number 2022G02008]; The Education and Scientific Research Project of Fujian Provincial Department of Finance: [grant number KY030346].

References

  • Barath, Daniel, Jiri Matas, and Jana Noskova. 2019. “MAGSAC: Marginalizing Sample Consensus.” Paper presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Barfoot, Timothy D. 2017. State Estimation for Robotics. Cambridge: Cambridge University Press.
  • Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. 2006. “Surf: Speeded up Robust Features.” Paper presented at the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7–13, 2006. Proceedings, Part I 9.
  • Bian, JiaWang, Wen-Yan Lin, Yasuyuki Matsushita, Sai-Kit Yeung, Tan-Dat Nguyen, and Ming-Ming Cheng. 2017. “Gms: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Bian, Jia-Wang, Yu-Huan Wu, Ji Zhao, Yun Liu, Le Zhang, Ming-Ming Cheng, and Ian Reid. 2019. “An Evaluation of Feature Matchers for Fundamental Matrix Estimation.” arXiv preprint arXiv:1908.09474.
  • Calonder, Michael, Vincent Lepetit, Christoph Strecha, and Pascal Fua. 2010. “Brief: Binary Robust Independent Elementary Features.” Paper presented at the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5–11, 2010, Proceedings, Part IV 11.
  • Delmerico, Jeffrey, Titus Cieslewski, Henri Rebecq, Matthias Faessler, and Davide Scaramuzza. 2019. “Are we Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset.” Paper presented at the 2019 International Conference on Robotics and Automation (ICRA).
  • Engel, J., V. Koltun, and D. Cremers. 2018. “Direct Sparse Odometry.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3): 611–625. https://doi.org/10.1109/TPAMI.2017.2658577.
  • Forster, Christian, Matia Pizzoli, and Davide Scaramuzza. 2014. “SVO: Fast Semi-Direct Monocular Visual Odometry.” Paper presented at the 2014 IEEE International Conference on Robotics and Automation (ICRA).
  • Gao, Hongyang, and Shuiwang Ji. 2019. “Graph u-Nets.” Paper presented at the International Conference on Machine Learning.
  • Ham, Hanry, Julian Wesley, and Hendra Hendra. 2019. “Computer Vision Based 3D Reconstruction: A Review.” International Journal of Electrical and Computer Engineering 9 (4): 2394.
  • Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. “Inductive Representation Learning on Large Graphs.” Advances in Neural Information Processing Systems 30: 1025–1035.
  • Harel, Jonathan, Christof Koch, and Pietro Perona. 2006. “Graph-based Visual Saliency.” Advances in Neural Information Processing Systems 19: 545–552.
  • Horn, Berthold KP, and Brian G Schunck. 1981. “Determining Optical Flow.” Artificial Intelligence 17 (1-3): 185–203. https://doi.org/10.1016/0004-3702(81)90024-2.
  • Itoh, Takeshi D, Takatomi Kubo, and Kazushi Ikeda. 2022. “Multi-level Attention Pooling for Graph Neural Networks: Unifying Graph Representations with Multiple Localities.” Neural Networks 145: 356–373. https://doi.org/10.1016/j.neunet.2021.11.001.
  • Jiang, Xingyu, Junjun Jiang, Aoxiang Fan, Zhongyuan Wang, and Jiayi Ma. 2019. “Multiscale Locality and Rank Preservation for Robust Feature Matching of Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 57 (9): 6462–6472. https://doi.org/10.1109/tgrs.2019.2906183.
  • Jiang, Xingyu, Jiayi Ma, Guobao Xiao, Zhenfeng Shao, and Xiaojie Guo. 2021. “A Review of Multimodal Image Matching: Methods and Applications.” Information Fusion 73: 22–71. https://doi.org/10.1016/j.inffus.2021.02.012.
  • Jiang, Xingyu, Yifan Xia, Xiao-Ping Zhang, and Jiayi Ma. 2022. “Robust Image Matching via Local Graph Structure Consensus.” Pattern Recognition 126. https://doi.org/10.1016/j.patcog.2022.108588.
  • Jiayi, Ma, Zhao Ji, Tian Jinwen, A. L. Yuille, and Tu Zhuowen. 2014. “Robust Point Matching via Vector Field Consensus.” IEEE Transactions on Image Processing 23 (4): 1706–1721. https://doi.org/10.1109/TIP.2014.2307478.
  • Kadir, Timor, Andrew Zisserman, and Michael Brady. 2004. “An Affine Invariant Salient Region Detector.” Paper presented at the Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11–14, 2004. Proceedings, Part I 8.
  • Kingma, Diederik P, and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980.
  • Leutenegger, Stefan, Margarita Chli, and Roland Y Siegwart. 2011. “BRISK: Binary Robust Invariant Scalable Keypoints.” Paper presented at the 2011 International Conference on Computer Vision.
  • Li, Xiangru, and Zhanyi Hu. 2010. “Rejecting Mismatches by Correspondence Function.” International Journal of Computer Vision 89: 1–17. https://doi.org/10.1007/s11263-010-0318-x.
  • Lin, Hui, Peijun Du, Weichang Zhao, Lianpeng Zhang, and Huasheng Sun. 2010. “Image Registration Based on Corner Detection and Affine Transformation.” Paper presented at the 2010 3rd International Congress on Image and Signal Processing.
  • Lin, Wen-Yan, Fan Wang, Ming-Ming Cheng, Sai-Kit Yeung, Philip HS Torr, Minh N Do, and Jiangbo Lu. 2017. “CODE: Coherence Based Decision Boundaries for Feature Correspondence.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1): 34–47.
  • Lowe, David G. 2004. “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision 60: 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94.
  • Ma, Jiayi, Aoxiang Fan, Xingyu Jiang, and Guobao Xiao. 2022. “Feature Matching via Motion-Consistency Driven Probabilistic Graphical Model.” International Journal of Computer Vision 130 (9): 2249–2264. https://doi.org/10.1007/s11263-022-01644-2.
  • Ma, Jiayi, Ji Zhao, Junjun Jiang, Huabing Zhou, and Xiaojie Guo. 2019. “Locality Preserving Matching.” International Journal of Computer Vision 127: 512–531. https://doi.org/10.1007/s11263-018-1117-z.
  • Ma, Ailong, Yanfei Zhong, and Liangpei Zhang. 2015. “Adaptive Multiobjective Memetic Fuzzy Clustering Algorithm for Remote Sensing Imagery.” IEEE Transactions on Geoscience and Remote Sensing 53 (8): 4202–4217. https://doi.org/10.1109/TGRS.2015.2393357.
  • Mikolajczyk, Krystian, and Cordelia Schmid. 2004. “Scale & Affine Invariant Interest Point Detectors.” International Journal of Computer Vision 60: 63–86. https://doi.org/10.1023/B:VISI.0000027790.02288.f2.
  • Nie, Lang, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. 2021. “Unsupervised Deep Image Stitching: Reconstructing Stitched Features to Images.” IEEE Transactions on Image Processing 30: 6184–6197. https://doi.org/10.1109/TIP.2021.3092828.
  • Rublee, Ethan, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. “ORB: An Efficient Alternative to SIFT or SURF.” Paper presented at the 2011 International Conference on Computer Vision.
  • Sarlin, Paul-Edouard, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. “Superglue: Learning Feature Matching with Graph Neural Networks.” Paper presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Tola, Engin, Vincent Lepetit, and Pascal Fua. 2009. “Daisy: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo.” IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (5): 815–830. https://doi.org/10.1109/TPAMI.2009.77.
  • Torr, Philip HS, and Andrew Zisserman. 2000. “MLESAC: A new Robust Estimator with Application to Estimating Image Geometry.” Computer Vision and Image Understanding 78 (1): 138–156. https://doi.org/10.1006/cviu.1999.0832.
  • Triggs, Bill, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. 2000. “Bundle Adjustment—A Modern Synthesis.” Paper presented at the Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, September 21–22, 1999 Proceedings.
  • Velickovic, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. “Graph Attention Networks.” arXiv preprint arXiv:1710.10903.
  • Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. 2015. “Order Matters: Sequence to Sequence for Sets.” arXiv preprint arXiv:1511.06391.
  • Wang, Xintao, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. 2018. “Esrgan: Enhanced Super-Resolution Generative Adversarial Networks.” Paper presented at the Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  • Wong, Hoi Sim, Tat-Jun Chin, Jin Yu, and David Suter. 2011. “Dynamic and Hierarchical Multi-Structure Geometric Model Fitting.” Paper presented at the 2011 International Conference on Computer Vision.
  • Wu, Chen, Liangpei Zhang, and Bo Du. 2017. “Kernel Slow Feature Analysis for Scene Change Detection.” IEEE Transactions on Geoscience and Remote Sensing 55 (4): 2367–2384. https://doi.org/10.1109/TGRS.2016.2642125.
  • Xia, Y., and J. Ma. 2022. “Locality-Guided Global-Preserving Optimization for Robust Feature Matching.” IEEE Transactions on Image Processing 31: 5093–5108. https://doi.org/10.1109/TIP.2022.3192993.
  • Xu, Keyulu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. “How Powerful are Graph Neural Networks?” arXiv preprint arXiv:1810.00826.
  • Yi, Kwang Moo, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. 2018. “Learning to Find Good Correspondences.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhang, Jiahui, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. 2019. “Learning two-View Correspondences and Geometry Using Order-Aware Network.” Paper presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Zhang, S., H. Tong, J. Xu, and R. Maciejewski. 2019. “Graph Convolutional Networks: A Comprehensive Review.” Computational Social Networks 6 (1): 11. https://doi.org/10.1186/s40649-019-0069-y.