Local feature matching system based on keypoint detection

By sharing weights among Transformer layers and combining this with multi-scale keypoint detection, the invention addresses the large model size and high computational cost of existing methods, achieving efficient and robust image feature matching in complex scenes.

WO2026129560A1 · PCT designated stage · Publication Date: 2026-06-25 · YANGTZE RIVER DELTA HIT ROBOT TECH RES INST

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: YANGTZE RIVER DELTA HIT ROBOT TECH RES INST
Filing Date: 2025-05-30
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Existing Transformer-based local feature matching methods struggle to achieve robust and accurate image correspondences when faced with challenges such as texture-poor regions, motion blur, changes in lighting and viewpoint, and repetitive patterns. Furthermore, these methods have large model sizes and high computational cost.

Method used

We employ a weight reuse technique to share task parameters between consecutive Transformer layers and combine it with multi-scale keypoint detection. We use a deep Transformer network for feature aggregation and matching, comprising a dense feature aggregation module, a sparse feature aggregation module, and a sparse-to-dense feature aggregation module, to reduce redundant information propagation and improve feature representation and model efficiency.

Benefits of technology

It effectively reduces the model size, improves the accuracy and robustness of feature matching, and maintains good matching results, especially in complex scenarios, thus enhancing the model's stability under harsh conditions.

✦ Generated by Eureka AI based on patent content.

[Figure: abstract drawing, CN2025098354_25062026_PF_FP_ABST]

Abstract

Disclosed in the present invention is a local feature matching system based on keypoint detection. The system comprises: an encoder, a deep Transformer model and a matching module. An image pair to be matched (IA, IB) is input into the encoder, and fine features (formula I), coarse features (formula II), keypoints PA and PB of the image pair (IA, IB) are extracted; the coarse features (formula II) and the image pair (IA, IB) are input into the deep Transformer model, and the deep Transformer model performs feature aggregation on an image IA and an image IB to obtain keypoint features (formula III); and the matching module converts the keypoint features (formula III) into a confidence matrix C, performs matching between the keypoint PA and the keypoint PB on the basis of the confidence matrix C, and then performs matching enhancement on the basis of the fine features (formula I), so as to complete the matching of the image pair (IA, IB). A weight (parameter) reuse technique is used to share task parameters between consecutive Transformer layers on the basis of task requirements, such that a model can maintain a feature expression capability, improving the model performance, and can also effectively reduce the model size. In addition, the use of a multi-scale keypoint detector reduces the propagation of redundant information and enhances feature specificity, thereby improving the model efficiency.

Description

Local Feature Matching System Based on Keypoint Detection

Technical Field

[0001] This invention belongs to the field of image matching technology, and more specifically, this invention relates to a local feature matching system based on keypoint detection.

Background Technology

[0002] Local feature matching is a crucial step in many computer vision tasks, such as SLAM, pose estimation, and visual localization, which aim to establish accurate correspondences between pairs of images. However, obtaining robust and accurate correspondences is challenging when faced with texture-poor regions, motion blur, changes in lighting and viewpoint, and repetitive patterns.

[0003] Transformer-based methods leverage the global receptive field of the Transformer to aggregate long-range visual and geometric information across keypoints. LoFTR is a prominent example, using a linear Transformer with self-attention and cross-attention layers to simultaneously encode keypoint features from two images and compute a soft assignment matrix for coarse-to-fine matching. The design and architecture of the Transformer blocks play a crucial role in determining matching performance in these methods. For instance, Quadtree proposes a Transformer layer capable of building a token pyramid and computing hierarchical attention, significantly improving performance. ASpanFormer introduces a novel attention mechanism that adaptively adjusts the attention span, achieving state-of-the-art performance on multiple evaluation benchmarks.

[0004] However, the above methods mainly focus on designing complex Transformer architectures, which increases the model size and consequently increases storage and computation costs.

Summary of the Invention

[0005] This invention provides a local feature matching method based on weight reuse, aiming to address the above-mentioned problems.

[0006] This invention is implemented as follows: a local feature matching system based on key point detection, the system comprising:

[0007] an encoder, a deep Transformer model, and a matching module;

[0008] The image pair to be matched (I_A, I_B) is input into the encoder, which extracts the fine features, coarse features, and keypoints P_A and P_B of the image pair (I_A, I_B);

[0009] The coarse features and the image pair (I_A, I_B) are input into the deep Transformer model, which performs feature aggregation on image I_A and image I_B to obtain keypoint features;

[0010] The matching module converts the keypoint features into a confidence matrix C, matches the keypoints P_A with the keypoints P_B based on the confidence matrix C, and then performs matching enhancement based on the fine features to complete the matching of the image pair (I_A, I_B).

[0011] Furthermore, the deep Transformer model consists of s Transformer layers connected sequentially. The (k-1)-th Transformer layer passes the shared parameters θ_s to the k-th Transformer layer. If the loss corresponding to the model parameters θ_k of the k-th Transformer layer is lower than that of the shared parameters θ_s, the shared parameters are updated to the model parameters θ_k; if the loss corresponding to the model parameters θ_k is higher than that of the shared parameters θ_s, the shared parameters θ_s are not updated. The shared parameters θ_s are then passed to the (k+1)-th Transformer layer.

[0012] Furthermore, the Transformer layer consists of a dense feature aggregation module, a sparse feature aggregation module, and a sparse-to-dense feature aggregation module. The dense feature aggregation module is connected to the sparse feature aggregation module, and both the dense feature aggregation module and the sparse feature aggregation module are connected to the sparse-to-dense feature aggregation module.

[0013] The dense feature aggregation module aggregates the dense coarse features output by the encoder, separately for each image, to obtain features F_AD and F_BD. The sparse feature aggregation module extracts representative keypoints from features F_AD and F_BD to form features F_AK and F_BK; after features F_AK and F_BK are aggregated, enhanced features F_AE and F_BE are formed. The sparse-to-dense feature aggregation module passes the representative keypoint features in the enhanced features F_AE and F_BE to features F_AD and F_BD, thereby generating the aggregated features of the two images.

[0014] Furthermore, the initial visual descriptions F_A and F_B are input into the dense feature aggregation module, which performs T1 recursive feature aggregations, expressed as follows: [formula]

[0015] where the enhanced features formed from the initial visual descriptions F_A and F_B after i recursions are computed from the enhanced features formed after i-1 recursions; θ_i represents the model parameters of the current Transformer layer after i recursions; and θ_s represents the shared parameters passed from the previous Transformer layer.

[0016] Furthermore, the representative keypoints of image I_A are extracted as follows:

[0017] The feature F_AD is converted into a 2D feature map, and the sum along the channel dimension is calculated to generate a single-channel map;

[0018] For the pixel with coordinates (i, j) in image I_A, a local window of size w×w centered on that pixel is cropped, and a confidence score is generated for that pixel: [formula]

[0019] where the formula is written in terms of the confidence score and the feature of the pixel at coordinates (i, j) of image I_A, with i ∈ {0, 1, 2, ..., H/8 - 1} and j ∈ {0, 1, 2, ..., W/8 - 1};

[0020] Multi-scale local windows are cropped using w = 3, 5, 7 to generate the confidence score of pixel (i, j), which gives the confidence score S_A: [formula]

[0021] The keypoints with the top K confidence scores in S_A are retained as the representative keypoints P_AK of image I_A.

[0022] Furthermore, the sparse feature aggregation module performs T2 recursive updates on features F_AK and F_BK, and the aggregation process can be represented as follows: [formula]

[0023] where, with F_AK and F_BK as the initial values, the enhanced features formed after i recursions are computed from the enhanced features formed after i-1 recursions; θ_i represents the model parameters of the current Transformer layer after i recursions; and θ_s represents the shared parameters passed from the previous Transformer layer.

[0024] Furthermore, the representative keypoint features in the features F_AE and F_BE output by the sparse feature aggregation module are passed to the features F_AD and F_BD output by the dense feature aggregation module. The sparse-to-dense feature aggregation module performs T3 recursive updates, and the update process is represented as follows: [formula]

[0025] where the enhanced features F_AD and F_BD output by the dense feature aggregation module and the representative keypoint features F_AE and F_BE output by the sparse feature aggregation module serve as the initial values.

[0026] Furthermore, the confidence matrix C is formed as follows:

[0027] A score matrix S between the representative keypoints of the aggregated features is determined: [formula]

[0028] The score matrix S is converted into the confidence matrix C, where C(i, j) = Softmax(S(i, ·)) · Softmax(S(·, j)); S(i, ·) and S(·, j) represent all elements in the i-th row and the j-th column of the score matrix S, respectively, and C(i, j) represents the confidence of the match between the i-th keypoint and the j-th keypoint.

[0029] Furthermore, images I_A and I_B are divided using an 8×8 grid, and the center pixels of all grid cells are used as the keypoints P_A and P_B of the corresponding image.

[0030] This invention employs a weight (parameter) reuse technique to share task parameters between consecutive Transformer layers according to task requirements, enabling the model to not only maintain the expressive power of features and improve model performance, but also effectively reduce the model size. In addition, the use of a multi-scale keypoint detector reduces the propagation of redundant information, enhances the specificity of features, and improves model efficiency.

Attached Figure Description

[0031] Figure 1 is a schematic diagram of the structure of the local feature matching system based on keypoint detection provided in an embodiment of the present invention.

Detailed Implementation

[0032] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings, so as to help those skilled in the art to have a more complete, accurate and in-depth understanding of the inventive concept and technical solution of the present invention.

[0033] An image pair (I_A, I_B) consisting of the images to be matched is input into the encoder. Based on a CNN encoder, low-resolution coarse features are first generated for coarse-level matching, along with high-resolution fine features for refining coarse matches into fine-level matches. The coarse features are flattened into a one-dimensional sequence and input into a deep Transformer, where the model performs long-range feature aggregation across and within images. Specifically, the model employs the proposed weight reuse technique, which continuously shares weight parameters between Transformer layers, reducing the model size while effectively aggregating features. Furthermore, a multi-scale keypoint detection module is introduced to identify representative keypoints, which replace the full set of keypoints in feature aggregation, thereby reducing the transmission of unimportant information while keeping computational costs under control. Following the deep Transformer model, a coarse-level dense matching module is used to generate coarse-level matches. Finally, a matching refinement module refines the matches to the sub-pixel level, establishing accurate and reliable matches between images.

[0034] Figure 1 is a schematic diagram of the local feature matching system based on key point detection provided in an embodiment of the present invention. For ease of explanation, only the parts related to the embodiment of the present invention are shown. The system includes:

[0035] an encoder, a deep Transformer model, and a matching module. The deep Transformer model is composed of s Transformer layers connected in sequence.

[0036] The image pair to be matched (I_A, I_B) is input into the encoder, which extracts the fine features, coarse features, and keypoints P_A and P_B of the image pair (I_A, I_B). The coarse features and the image pair (I_A, I_B) are input into the deep Transformer model, which performs feature aggregation on image I_A and image I_B to obtain keypoint features. The matching module converts the keypoint features into a confidence matrix C, matches the keypoints P_A with the keypoints P_B based on the confidence matrix C, and then performs matching enhancement based on the fine features to complete the matching of the image pair (I_A, I_B).

[0037] (1) The encoder extracts fine features, coarse features and key points;

[0038] The encoder is constructed as a convolutional neural network (CNN) with a feature pyramid network (FPN). The image pair to be matched (I_A, I_B) is input into the encoder, which extracts fine features at 1/2 downsampling and coarse features at 1/8 downsampling. Images I_A and I_B are divided using an 8×8 grid, and the center pixels of all grid cells are used as the keypoints P_A, P_B ∈ R^{N×2} of the corresponding image. The coarse features serve as the initial visual descriptions F_A and F_B, where N = H/8 × W/8.
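The grid-based keypoint layout in [0038] can be illustrated with a short NumPy sketch. This is only an illustration under assumptions: the function name `grid_keypoints`, the (x, y) coordinate order, and the use of `cell // 2` as the "center" of an even-sized 8×8 cell are choices made here, not details stated in the patent text.

```python
import numpy as np

def grid_keypoints(height, width, cell=8):
    """Return an (N, 2) array of (x, y) cell centers, with N = height/cell * width/cell.

    Assumption: the center of an even-sized cell is taken as offset cell // 2.
    """
    rows, cols = height // cell, width // cell
    ys = np.arange(rows) * cell + cell // 2
    xs = np.arange(cols) * cell + cell // 2
    grid_x, grid_y = np.meshgrid(xs, ys)  # each has shape (rows, cols)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)

pts = grid_keypoints(480, 640)  # N = 60 * 80 keypoints
```

For a 480×640 image this yields one keypoint per coarse feature location, which matches the relation N = H/8 × W/8 given above.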

[0039] (2) The deep Transformer model aggregates features;

[0040] The deep Transformer model consists of s Transformer layers connected sequentially. The (k-1)-th Transformer layer passes the shared parameters θ_s to the k-th Transformer layer. If the loss corresponding to the model parameters θ_k of the k-th Transformer layer is lower than that of the shared parameters θ_s, the shared parameters are updated to the model parameters θ_k; if the loss corresponding to the model parameters θ_k is higher than that of the shared parameters θ_s, the shared parameters θ_s are not updated. The shared parameters θ_s are then passed to the (k+1)-th Transformer layer, and the above process is repeated.
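The update rule in [0040] can be sketched as a simple loop. This is a minimal illustration, not the patented training procedure: the names `reuse_weights` and `loss_fn` are hypothetical, parameters are reduced to plain numbers, and how the loss for a parameter set is actually evaluated is not specified in the text. The case of equal losses is also not covered by the text; the sketch keeps the shared parameters in that case.

```python
def reuse_weights(layer_params, loss_fn, shared):
    """Carry shared parameters through the layers, adopting a layer's own
    parameters only when they give a strictly lower loss (assumed tie rule)."""
    history = []
    for theta_k in layer_params:
        if loss_fn(theta_k) < loss_fn(shared):
            shared = theta_k          # update shared parameters to theta_k
        history.append(shared)        # value passed on to the next layer
    return shared, history

# Toy example: scalar "parameters" and a quadratic loss with minimum at 3.
toy_loss = lambda theta: (theta - 3.0) ** 2
final, passed_on = reuse_weights([1.0, 2.5, 5.0, 3.2], toy_loss, shared=0.0)
```

In the toy run, the third layer's parameters (5.0) have a higher loss than the current shared value, so the shared parameters pass through that layer unchanged.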

[0041] The impact of the number of Transformer layers on matching performance can be viewed from two perspectives. On the one hand, stacking more Transformer layers extracts deeply aggregated features, thereby improving matching accuracy. On the other hand, when people match images, they scan back and forth repeatedly; the more times they scan, the easier it is to remember easily matched features. However, building a deep Transformer architecture causes the model size to grow linearly with the number of Transformer layers. To address this issue, this invention introduces a weight reuse technique, which reuses the weights of consecutive Transformer layers, effectively controlling the model size.

[0042] In this embodiment of the invention, the Transformer layer consists of a linear self-attention layer and a cross-attention layer. First, the linear attention technique used to construct the Transformer layer is introduced, as it is the foundation of the weight reuse technique. The input features of the Transformer layer are denoted U and R. Given U and R, a query Q, key K, and value V are first generated through linear projection, each a matrix with N rows, where N is the number of keypoints and the number of columns is the dimension of the query, key, and value. This process can be represented as:

Q = U W_Q, K = R W_K, V = R W_V (1)

[0043] where W_Q, W_K, and W_V are learnable parameters implemented using three multilayer perceptrons (MLPs). Linear attention is applied to model the global dependencies between keypoints, as follows:

M = φ(Q)(φ(K)^T V), φ(·) = elu(·) + 1 (2)
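Formulas (1) and (2) can be written out in NumPy as a rough sketch. Two simplifications are assumed here and are not taken from the patent: each projection is a single weight matrix rather than an MLP, and all shapes and random inputs are illustrative. The point of the grouping φ(K)^T V is that the d×d product is formed first, so the cost grows linearly with the number of keypoints rather than quadratically.

```python
import numpy as np

def elu(x):
    # ELU with alpha = 1; the minimum avoids overflow in exp for large positive x.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def phi(x):
    return elu(x) + 1.0  # feature map from formula (2), always positive

def linear_attention(U, R, W_q, W_k, W_v):
    Q, K, V = U @ W_q, R @ W_k, R @ W_v   # formula (1)
    context = phi(K).T @ V                # (d, d): independent of keypoint count
    return phi(Q) @ context               # formula (2): M = phi(Q)(phi(K)^T V)

rng = np.random.default_rng(0)
U = rng.standard_normal((6, 4))           # 6 query keypoints, 4 channels
R = rng.standard_normal((5, 4))           # 5 source keypoints
W_q, W_k, W_v = (rng.standard_normal((4, 3)) for _ in range(3))
M = linear_attention(U, R, W_q, W_k, W_v)
```

Setting R = U corresponds to the self-attention case, and taking U and R from different images corresponds to the cross-attention case.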

[0044] Here, M represents the retrieved global information, and elu(·) represents the nonlinear activation function. Another MLP, with parameters W_O, is then used to process the retrieved global information: [formula]. Finally, a feedforward network (FFN) is applied after the linear attention layer to extract deep aggregation features. The FFN consists of two linear transformation layers with a nonlinear activation function between them, and can be expressed as: [formula]

[0045] Here, W1 and W2 are the parameter matrices of the two linear transformation layers, [·||·] represents concatenation along the channel dimension, and σ represents a nonlinear activation function, such as GELU.

[0046] Long-range feature aggregation within images is performed using Transformer layers with self-attention layers, while long-range feature aggregation between images is performed using Transformer blocks with cross-attention layers. Specifically, for Transformer blocks with self-attention layers, the input features U and R are the same, such as (U = F_A, R = F_A). For a Transformer block with a cross-attention layer, the input features U and R come from the two images (U = F_A, R = F_B, or U = F_B, R = F_A). The complete Transformer layer is represented as:

U′, R′ = Trans(U, R; θ) (4)

[0047] where U′ and R′ represent the features aggregated by the Transformer layer, and θ denotes the parameters of the Transformer layer with self-/cross-attention, i.e., θ = {W_Q, W_K, W_V, W1, W2, W_O}. The main parameters of the model are composed of θ.

[0048] Based on the above attention mechanism, a deep Transformer layer with self-attention and cross-attention is constructed. To further reduce the computational overhead caused by excessively deep Transformer networks, the deep Transformer architecture is decomposed into: (1) a dense feature aggregation module, (2) a sparse feature aggregation module, and (3) a sparse-to-dense feature aggregation module.

[0049] (21) Dense Feature Aggregation Module:

[0050] The initial visual descriptions F_A and F_B are input into a Transformer layer, and the weight reuse technique is then applied to the Transformer layer to perform T1 recursive feature aggregations, which can be represented as: [formula]

[0051] where the enhanced features formed from the initial visual descriptions F_A and F_B after i recursions are computed using θ_i, the model parameters of the current Transformer layer after i recursions, and θ_s, the shared parameters passed from the previous Transformer layer. The enhanced features formed after the T1-th recursion are denoted F_AD and F_BD, respectively.
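Since the recursion formula itself is not reproduced in this text, the following is only a loose sketch of the idea of applying one parameter set recursively. Everything here is an assumption made for illustration: `toy_layer` is a stand-in for Trans(U, R; θ) and is not the patent's layer, a single fixed `theta` is used, and the interplay between θ_i and θ_s is not modeled.

```python
import numpy as np

def toy_layer(U, R, theta):
    # Hypothetical stand-in for Trans(U, R; theta): mixes U with a summary of R.
    return np.tanh(U @ theta + R.mean(axis=0, keepdims=True))

def recursive_aggregate(F_A, F_B, theta, steps):
    """Apply the same layer parameters `steps` times to both feature sets."""
    for _ in range(steps):
        # The right-hand side is evaluated first, so both updates read the
        # features produced by the previous recursion.
        F_A, F_B = toy_layer(F_A, F_B, theta), toy_layer(F_B, F_A, theta)
    return F_A, F_B

rng = np.random.default_rng(1)
F_A0 = rng.standard_normal((6, 4))
F_B0 = rng.standard_normal((5, 4))
theta = rng.standard_normal((4, 4)) * 0.1
F_AD, F_BD = recursive_aggregate(F_A0, F_B0, theta, steps=3)
```

The number of stored parameters (`theta.size`) stays the same whether the loop runs once or many times, which is the property the weight reuse technique relies on to keep the model small.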

[0052] (22) Sparse Feature Aggregation Module: To reduce the computational cost caused by excessively deep Transformer layers, this invention proposes a multi-scale keypoint detection module that identifies representative keypoints, which replace all other keypoints in further feature aggregation. Using representative keypoints for feature aggregation also reduces the propagation of redundant information. The multi-scale keypoint detection module is motivated by the observation that keypoints with rich visual information can be distinguished from their neighboring keypoints at different scales. More specifically, the features updated after T1 recursions (for example, F_AD) are converted into a 2D feature map, and the sum along the channel dimension is calculated to generate a single-channel map. Then, for the pixel with coordinates (i, j) in image I_A, a local window of size w×w centered on this pixel is cropped, and the confidence score of this pixel is generated: [formula (6)]

[0053] where the formula is written in terms of the confidence score and the feature of the pixel at coordinates (i, j) of image I_A, with i ∈ {0, 1, 2, ..., H/8 - 1} and j ∈ {0, 1, 2, ..., W/8 - 1}. Formula (6) compares the central element with its neighboring elements through subtraction. Pixels that have high values in the feature map and differ strongly from the surrounding pixels receive high confidence scores and can be regarded as representative keypoints. To improve the robustness of representative keypoint detection, multi-scale local windows are cropped using w = 3, 5, 7 to generate the confidence score of pixel (i, j). The confidence score S_A is then represented as: [formula]
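Because the scoring formulas are not reproduced in this text, the sketch below is one plausible reading of the description rather than the patent's formula. It assumes that the per-window score is the sum of (center minus neighbor) over the w×w window, that the three scales are averaged, and that borders are handled by edge padding. The function names are hypothetical.

```python
import numpy as np

def window_score(fmap, w):
    """Sum of (center - neighbor) over a w x w window (assumed form of the score)."""
    pad = w // 2
    padded = np.pad(fmap, pad, mode="edge")   # assumption: replicate border values
    H, W = fmap.shape
    total = np.zeros((H, W), dtype=float)
    for dy in range(w):
        for dx in range(w):
            total += fmap - padded[dy:dy + H, dx:dx + W]
    return total

def multiscale_score(features, scales=(3, 5, 7)):
    """features: (H/8, W/8, C) coarse feature map; returns an (H/8, W/8) score map."""
    fmap = features.sum(axis=-1)              # sum along the channel dimension
    return np.mean([window_score(fmap, w) for w in scales], axis=0)

feats = np.zeros((9, 9, 1))
feats[4, 4, 0] = 1.0                          # a single isolated peak
S_A = multiscale_score(feats)
```

With this reading, a flat feature map scores zero everywhere, and an isolated peak gets the highest score because it exceeds its neighbors at every scale.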

[0054] Here, S_A ∈ R^{H/8×W/8} denotes the confidence score mask. Based on S_A, the keypoints with the top K confidence scores are retained as the representative keypoints P_AK ∈ R^{K×2} of image I_A, whose descriptors F_AK are used for subsequent feature aggregation. The same process is applied to image I_B to obtain the representative keypoints P_BK and their descriptors F_BK. The proposed weight reuse technique is applied to the Transformer layer, performing T2 recursive updates to promote deep feature aggregation among the representative keypoints. Each complete Transformer layer is represented as: [formula]

[0055] where F_AK and F_BK serve as the initial values. After the sparse feature aggregation module, the updated representative keypoint features are obtained and denoted F_AE and F_BE, respectively.
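The top-K selection step can be sketched as follows. The function name `top_k_keypoints`, the (x, y) output order, and the mapping from coarse grid indices back to image pixels through 8×8 cell centers are assumptions made for this illustration.

```python
import numpy as np

def top_k_keypoints(score, feats, k, cell=8):
    """score: (h, w) confidence map; feats: (h, w, C) coarse features.

    Returns pixel coordinates (k, 2) in (x, y) order and descriptors (k, C),
    sorted from highest to lowest confidence.
    """
    h, w = score.shape
    flat = np.argsort(score.ravel())[::-1][:k]   # indices of the k largest scores
    rows, cols = np.unravel_index(flat, (h, w))
    coords = np.stack([cols * cell + cell // 2, rows * cell + cell // 2], axis=1)
    return coords, feats[rows, cols]

score = np.arange(12, dtype=float).reshape(3, 4)  # toy confidence map
feats = np.random.default_rng(2).standard_normal((3, 4, 5))
P_AK, F_AK = top_k_keypoints(score, feats, k=2)
```

Only the K selected descriptors take part in the sparse aggregation, so the cost of the T2 recursive updates depends on K rather than on the full keypoint count N.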

[0056] (23) Sparse-to-dense feature aggregation module: Once the updated representative keypoint features F_AE and F_BE are obtained, a sparse-to-dense feature aggregation module is introduced to pass the features of the representative keypoints to all keypoints in the feature map, thereby generating the features used to produce coarse matches. The weight reuse technique is used to perform T3 recursive updates, which can be represented as: [formula]

[0057] where the enhanced features F_AD and F_BD output by the dense feature aggregation module and the representative keypoint features F_AE and F_BE output by the sparse feature aggregation module serve as the initial values. The enhanced features formed after T3 recursions are the features passed to the matching module.

[0058] (3) The matching module performs key point matching between images.

[0059] Coarse match extraction: given the enhanced features of the two images, a score matrix S ∈ R^{N×N} between selected representative keypoints is defined as: [formula]

[0060] Then, double softmax is applied to transform the score matrix S into a confidence matrix C:

[0061] C(i, j) = Softmax(S(i, ·)) · Softmax(S(·, j)) (11)

[0062] where S(i, ·) and S(·, j) represent all elements in the i-th row and the j-th column of the score matrix S, respectively, i.e., the similarity scores between the i-th keypoint and all keypoints, and between all keypoints and the j-th keypoint. C(i, j) represents the confidence of the match between the i-th keypoint and the j-th keypoint. Based on the confidence matrix C, a threshold λ and the mutual nearest neighbor (MNN) criterion are used to extract coarse matching points from the keypoints P_A and P_B, where n denotes the number of predicted matches.
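Formula (11) and the selection rule in [0062] can be sketched in NumPy. The score matrix is taken as given, since its defining formula is not reproduced in this text; the function names and the toy values are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(S):
    # Formula (11): row-wise softmax times column-wise softmax, element by element.
    return softmax(S, axis=1) * softmax(S, axis=0)

def mutual_matches(C, threshold):
    """Index pairs (i, j) that are the best in both their row and column and exceed the threshold."""
    row_best = C == C.max(axis=1, keepdims=True)
    col_best = C == C.max(axis=0, keepdims=True)
    return np.argwhere(row_best & col_best & (C > threshold))

S = np.array([[5.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 0.0, 0.0]])
C = dual_softmax(S)
matches = mutual_matches(C, threshold=0.2)    # plays the role of lambda
```

In the toy matrix, the third keypoint pair is a mutual nearest neighbor but has a confidence of only about 0.11, so the threshold removes it and two matches remain.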

[0063] Fine-level match extraction: After the coarse matches are extracted, fine-level matches are generated. Local windows of size w×w are cropped around the coarse matching points from the fine features extracted by the CNN encoder, and the cropped features are enhanced. The inner product of the transformed features is then calculated, and a softmax function is applied to generate a probability distribution map. Finally, the expectation of the probability distribution is calculated as the offset δ ∈ R^{n×2}. The final fine matches can be represented as: [formula]
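The softmax-and-expectation step for a single match can be sketched as follows. This assumes the similarity map for one w×w window is already available, measures the offset in window pixels relative to the window center, and returns it in (dx, dy) order; these conventions and the name `expected_offset` are choices made here, and how the offset is combined with the coarse match is left to the formula that is not reproduced in this text.

```python
import numpy as np

def expected_offset(logits):
    """logits: (w, w) similarity map for one window; returns (dx, dy) from the center."""
    w = logits.shape[0]
    p = np.exp(logits - logits.max())
    p /= p.sum()                               # softmax over the whole window
    coords = np.arange(w) - (w - 1) / 2.0      # pixel positions relative to center
    dx = (p.sum(axis=0) * coords).sum()        # column marginal gives the x distribution
    dy = (p.sum(axis=1) * coords).sum()        # row marginal gives the y distribution
    return np.array([dx, dy])

flat = np.zeros((5, 5))                        # uniform similarity: no preferred shift
peaked = np.zeros((5, 5))
peaked[0, 4] = 50.0                            # strong response at top-right corner
delta_flat = expected_offset(flat)
delta_peak = expected_offset(peaked)
```

Because the expectation is a weighted average of positions, the offset can fall between pixels, which is what allows refinement to the sub-pixel level.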

[0064] The matching of the two images is completed based on the final fine-matching results.

[0065] The local feature matching system based on key point detection provided by this invention has the following beneficial technical effects:

[0066] (1) Reduce model parameters and improve model efficiency: This invention adopts weight (parameter) reuse technology to share task parameters between consecutive Transformer layers according to task requirements, so that the model can not only maintain the expressive power of features and improve model performance, but also effectively reduce the model size; in addition, the use of multi-scale key point detectors reduces the propagation of redundant information, enhances the specificity of features, and improves model efficiency.

[0067] (2) Enhanced Feature Representation: This invention establishes a closer correlation between features at different levels and scales by building a deeper Transformer network and combining multiple feature aggregation methods, thereby enhancing the expressive power of features. Especially in complex scenarios, it can capture more detailed information, effectively improving the discriminativeness and expressiveness of features, thus enhancing the model's ability to identify and classify fine-grained features.

[0068] (3) Improved feature matching effect: This invention strengthens the local and global dependencies of features through a deep Transformer network, which effectively improves the accuracy and robustness of feature matching. Especially when the image content is complex, the target is occluded or deformed, it can still maintain a good matching effect.

[0069] (4) Enhanced robustness: This invention significantly enhances the robustness of the model in complex environments through deep Transformer networks and multi-scale keypoint detection. The model maintains stable performance effectively under adverse conditions such as low light, noise interference, or target occlusion. The feature transformation module enhances feature diversity, enabling the model to better cope with various changes in different scenarios and reducing the negative impact of external environmental fluctuations.

[0070] The present invention has been described by way of example. Obviously, the specific implementation of the present invention is not limited to the above-described manner. Any non-substantial improvements made using the inventive concept and technical solution of the present invention, or the direct application of the inventive concept and technical solution of the present invention to other occasions without modification, are all within the protection scope of the present invention.

Claims

1. A local feature matching system based on keypoint detection, characterized in that the system comprises an encoder, a deep Transformer model, and a matching module; an image pair (I_A, I_B) to be matched is input into the encoder, which extracts fine features, coarse features, and keypoints P_A and P_B of the image pair (I_A, I_B); the coarse features and the image pair (I_A, I_B) are input into the deep Transformer model, which performs feature aggregation on image I_A and image I_B to obtain keypoint features; the matching module converts the keypoint features into a confidence matrix C, matches the keypoints P_A with the keypoints P_B based on the confidence matrix C, and then performs matching enhancement based on the fine features to complete the matching of the image pair (I_A, I_B).

2. The local feature matching system based on keypoint detection of claim 1, wherein the deep Transformer model consists of s Transformer layers connected sequentially; the (k-1)-th Transformer layer passes shared parameters θ_s to the k-th Transformer layer; if the loss corresponding to the model parameters θ_k of the k-th Transformer layer is lower than that of the shared parameters θ_s, the shared parameters are updated to the model parameters θ_k; if the loss corresponding to the model parameters θ_k is higher than that of the shared parameters θ_s, the shared parameters θ_s are not updated; the shared parameters θ_s are then passed to the (k+1)-th Transformer layer.

3. The local feature matching system based on keypoint detection of claim 2, wherein the Transformer layer is composed of a dense feature aggregation module, a sparse feature aggregation module, and a sparse-to-dense feature aggregation module; the dense feature aggregation module is connected to the sparse feature aggregation module, and both the dense feature aggregation module and the sparse feature aggregation module are connected to the sparse-to-dense feature aggregation module; the dense feature aggregation module is configured to aggregate the dense coarse features output by the encoder, separately for each image, to obtain features F_AD and F_BD; the sparse feature aggregation module is configured to extract representative keypoints from the features F_AD and F_BD to form features F_AK and F_BK, and, after the features F_AK and F_BK are aggregated, enhanced features F_AE and F_BE are formed; the sparse-to-dense feature aggregation module passes the representative keypoint features in the enhanced features F_AE and F_BE to the features F_AD and F_BD, thereby generating the aggregated features of the two images.

4. The local feature matching system based on keypoint detection of claim 3, wherein the initial visual descriptions F_A and F_B are input into the dense feature aggregation module, and the dense feature aggregation module performs T1 recursive feature aggregations, denoted as: [formula], wherein the enhanced features formed from the initial visual descriptions F_A and F_B after i recursions are computed from the enhanced features formed after i-1 recursions, θ_i represents the model parameters of the current Transformer layer after i recursions, and θ_s represents the shared parameters passed from the previous Transformer layer.

5. The local feature matching system based on keypoint detection of claim 3, wherein the representative keypoints of image I_A are extracted as follows: the feature F_AD is converted into a 2D feature map, and the sum along the channel dimension is computed to generate a single-channel map; for the pixel with coordinates (i, j) in image I_A, a local window of size w×w centered on that pixel is cropped, and the confidence score of the pixel is generated: [formula], wherein the formula is written in terms of the confidence score and the feature of the pixel at coordinates (i, j) of image I_A, i ∈ {0, 1, 2, ..., H/8 - 1}, j ∈ {0, 1, 2, ..., W/8 - 1}; multi-scale local windows are cropped using w = 3, 5, 7 to generate the confidence score of pixel (i, j), which gives the confidence score S_A: [formula]; the keypoints with the top K confidence scores in S_A are retained as the representative keypoints P_AK of image I_A.

6. The local feature matching system based on keypoint detection of claim 5, wherein the sparse feature aggregation module performs T2 recursive updates on the features F_AK and F_BK, and the aggregation process is represented as: [formula], wherein, with F_AK and F_BK as the initial values, the enhanced features formed after i recursions are computed from the enhanced features formed after i-1 recursions, θ_i represents the model parameters of the current Transformer layer after i recursions, and θ_s represents the shared parameters passed from the previous Transformer layer.

7. The local feature matching system based on keypoint detection of claim 3, wherein the representative keypoint features in the features F_AE and F_BE output by the sparse feature aggregation module are passed to the features F_AD and F_BD output by the dense feature aggregation module; the sparse-to-dense feature aggregation module performs T3 recursive updates, and the update process is represented as: [formula], wherein the enhanced features F_AD and F_BD output by the dense feature aggregation module and the representative keypoint features F_AE and F_BE output by the sparse feature aggregation module serve as the initial values.

8. The local feature matching system based on keypoint detection of claim 1, wherein the confidence matrix C is formed as follows: a score matrix S between the representative keypoints of the aggregated features is determined: [formula]; the score matrix S is converted into the confidence matrix C, C(i, j) = Softmax(S(i, ·)) · Softmax(S(·, j)), wherein S(i, ·) and S(·, j) respectively represent all elements of the i-th row and the j-th column of the score matrix S, and C(i, j) represents the confidence of the match between the i-th keypoint and the j-th keypoint.

9. The local feature matching system based on keypoint detection of claim 1, wherein images I_A and I_B are divided using an 8×8 grid, and the center pixels of all grid cells are taken as the keypoints P_A and P_B of the corresponding image.