Feature extraction model training method and device, electronic equipment and program product
A pre-trained feature extraction model is fine-tuned on annotation labels from which the image feature points in the background region have been removed. Semantic segmentation and self-supervised learning are used to improve the accuracy and robustness of the feature extraction model. The technology applies to image feature extraction, in particular for AR navigation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2022-03-14
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional image feature extraction methods suffer from a small number of feature points, uneven distribution, and poor robustness. Furthermore, deep learning-based network feature extraction methods suffer from poor semantic distribution, which affects the subsequent feature point tracking performance.
By fine-tuning the pre-trained feature extraction model, removing image feature points from background regions, and using semantic segmentation and self-supervised learning methods, the model's ability to recognize background regions of images is improved. Further fine-tuning is performed using images without background regions, and a self-supervised learning process is formed by combining homography transformation and geometric shape training.
It improves the accuracy and robustness of the feature extraction model, avoids extracting feature points at infinity from the background region, reduces the complexity of image feature point extraction, and improves the recognition ability of image feature points.
Figure CN114898170B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of image technology, specifically to a training method, apparatus, electronic device, and program product for a feature extraction model.
Background Technology
[0002] With the development of AR (Augmented Reality) technology, its applications are becoming increasingly widespread, and AR navigation is one of the main applications of image-based AR technology. Image-based AR technology typically requires local feature extraction from images. The inventors of this disclosure have found that some traditional image feature extraction methods extract few feature points, distribute them unevenly, and are not robust. Deep learning-based feature extraction methods improve on these issues, but the semantic distribution of their feature points is poor: they tend to extract feature points at infinity in the background region, which degrades subsequent feature point tracking and other computations. Overcoming inaccurate image feature point extraction caused by poor semantic distribution is therefore one of the main technical problems to be solved.
Summary of the Invention
[0003] This disclosure provides a method, apparatus, electronic device, and program product for training a feature extraction model.
[0004] In a first aspect, embodiments of this disclosure provide a method for training a feature extraction model, comprising:
[0005] Obtain fine-tuning training images;
[0006] Extract image feature points of the fine-tuning training image using a pre-trained feature extraction model;
[0007] Remove image feature points belonging to the background region of the fine-tuning training image, and use the remaining image feature points as the annotation labels of the fine-tuning training image;
[0008] Fine-tune the feature extraction model using the fine-tuning training image and the annotation labels.
[0009] Furthermore, the method also includes:
[0010] Obtain fine-tuning training images that contain no background region;
[0011] Use the feature extraction model to extract image feature points from the background-free fine-tuning training images, and use these image feature points as annotation labels for the background-free fine-tuning training images;
[0012] Fine-tune the feature extraction model using the background-free fine-tuning training images and their annotation labels.
[0013] Furthermore, the method also includes:
[0014] Obtain unlabeled training sample images;
[0015] The training sample images are subjected to homography transformation to obtain transformed sample images;
[0016] Image feature points are extracted from the transformed sample image and the training sample image using a pre-trained base model.
[0017] After the inverse homography transformation is applied to the image feature points extracted from the transformed sample image, the inverse-transformed image feature points are combined with the image feature points extracted from the training sample image to serve as the annotation labels for the training sample image.
[0018] The base model is trained using the training sample images and the annotation labels to obtain the feature extraction model.
[0019] Furthermore, the method also includes:
[0020] Obtain the geometric shape image and the image feature points of the geometric shape image;
[0021] The base model is trained using the geometric shape image and its image feature points.
[0022] Further, removing image feature points belonging to the background region of the fine-tuning training image, and using the retained image feature points as annotation labels for the fine-tuning training image, includes:
[0023] The fine-tuning training image is semantically segmented to obtain a semantic segmentation result; the semantic segmentation result includes the location information of the background region of the fine-tuning training image.
[0024] The semantic segmentation result is used to remove image feature points belonging to the background region of the fine-tuning training image, and the retained image feature points are used as annotation labels for the fine-tuning training image.
[0025] In a second aspect, embodiments of this disclosure provide an image feature extraction method, comprising: extracting image feature points from an image using a feature extraction model trained by the method described in the first aspect.
[0026] In a third aspect, embodiments of this disclosure provide an AR navigation method, comprising: extracting image feature points from a real-world image using the method described in the second aspect, and providing AR navigation services to the navigated object based on the extracted image feature points.
[0027] In a fourth aspect, embodiments of this disclosure provide a training apparatus for a feature extraction model, comprising:
[0028] The first acquisition module is configured to acquire fine-tuning training images;
[0029] The first extraction module is configured to extract image feature points of the fine-tuning training image using a pre-trained feature extraction model.
[0030] The removal module is configured to remove image feature points belonging to the background region of the fine-tuning training image, and use the retained image feature points as annotation labels for the fine-tuning training image.
[0031] The first training module is configured to fine-tune the feature extraction model using the fine-tuning training images and the annotation labels.
[0032] The functions described above can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to those functions.
[0033] In one possible design, the above-described device includes a memory and a processor. The memory stores one or more computer instructions that support the device in performing the corresponding methods described above, and the processor is configured to execute the computer instructions stored in the memory. The device may also include a communication interface for communicating with other devices or communication networks.
[0034] In a fifth aspect, embodiments of this disclosure provide an electronic device including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method described in any of the preceding aspects.
[0035] In a sixth aspect, embodiments of this disclosure provide a computer-readable storage medium for storing computer instructions used by any of the above-described devices, which, when executed by a processor, are used to implement the methods described in any of the above aspects.
[0036] In a seventh aspect, embodiments of this disclosure provide a computer program product comprising computer instructions which, when executed by a processor, are used to implement the methods described in any of the preceding aspects.
[0037] The technical solutions provided in this disclosure can include the following beneficial effects:
[0038] In this embodiment, during the training of the feature extraction model, the pre-trained feature extraction model undergoes further fine-tuning training. During this fine-tuning, the feature extraction model extracts image feature points from the training image, and after removing feature points belonging to the background region, the remaining image feature points are used as annotations for the training image. The training image and the annotations are then used to further fine-tune the feature extraction model. This approach overcomes the problem of traditionally trained feature extraction models easily extracting feature points at infinite distances in the background region due to poor semantic distribution. The feature extraction model trained using the fine-tuning method of this embodiment incorporates semantic supervision of the background region during the fine-tuning training process, enabling the model to recognize image feature points in the image background region. Consequently, the extracted image feature points no longer contain image feature points from the background region, improving the performance of the feature extraction model.
[0039] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0040] Other features, objects, and advantages of this disclosure will become more apparent from the following detailed description of non-limiting embodiments, taken in conjunction with the accompanying drawings. In the drawings:
[0041] Figure 1 shows a flowchart of a training method for a feature extraction model according to an embodiment of the present disclosure.
[0042] Figure 2 shows a schematic diagram of the pre-training process of the feature extraction model in an embodiment of this disclosure.
[0043] Figure 3 shows a schematic diagram of the fine-tuning process of a feature extraction model according to an embodiment of the present disclosure.
[0044] Figure 4 shows the application of a feature extraction model according to an embodiment of the present disclosure in an AR navigation scenario.
[0045] Figure 5 shows a structural block diagram of a training apparatus for a feature extraction model according to an embodiment of the present disclosure.
[0046] Figure 6 is a schematic diagram of the structure of an electronic device suitable for implementing a training method for a feature extraction model, an image feature extraction method, and/or an AR navigation method according to an embodiment of the present disclosure.
Detailed Implementation
[0047] In the following, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings to enable those skilled in the art to readily implement them. Furthermore, for clarity, portions unrelated to the description of the exemplary embodiments have been omitted from the drawings.
[0048] In this disclosure, it should be understood that terms such as “comprising” or “having” are intended to indicate the presence of features, figures, steps, behaviors, components, parts or combinations thereof disclosed in this specification, and do not preclude the possibility of the presence or addition of one or more other features, figures, steps, behaviors, components, parts or combinations thereof.
[0049] It should also be noted that, unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other. This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.
[0050] The details of the embodiments of this disclosure are described in detail below through specific examples.
[0051] Figure 1 shows a flowchart of a training method for a feature extraction model according to an embodiment of the present disclosure. As shown in Figure 1, the training method for this feature extraction model includes the following steps:
[0052] In step S101, a fine-tuning training image is acquired;
[0053] In step S102, the image feature points of the fine-tuning training image are extracted using a pre-trained feature extraction model;
[0054] In step S103, image feature points belonging to the background region of the fine-tuning training image are removed, and the remaining image feature points are used as annotation labels for the fine-tuning training image.
[0055] In step S104, the feature extraction model is fine-tuned using the fine-tuning training image and the annotation labels.
[0056] In this embodiment, the training method of the feature extraction model can be executed on a server to fine-tune the pre-trained feature extraction model so that the feature extraction model can extract more accurate feature points from the image, thereby overcoming the defect of extracting feature points at infinity from the background region.
[0057] It is understood that the feature extraction model can be pre-trained and has the ability to extract image feature points. In view of the problem that the feature extraction model pre-trained by traditional methods is insufficient, this disclosure proposes a fine-tuning training method. After fine-tuning training according to this disclosure, the feature extraction model can overcome the problem of extracting feature points at infinity from the background region of the image.
[0058] In some embodiments, the feature extraction model can be a deep learning-based neural network model.
[0059] In order to fine-tune the feature extraction model, this embodiment first collects fine-tuning training images, which may be unlabeled, that is, the labels of the fine-tuning training images are unknown.
[0060] Subsequently, the feature extraction model can be used to extract image feature points from the fine-tuning training image, and feature points belonging to the background region of the fine-tuning training image can be removed from the extracted image feature points. The retained image feature points are used as the annotation labels of the fine-tuning training image.
[0061] By further training the feature extraction model using the fine-tuning training image and its annotation labels, the model learns not to output feature points in the image background. This avoids the problem of traditional methods extracting feature points from background regions, such as extracting sky features from outdoor images. It should be noted that traditional methods require a separate semantic segmentation network at extraction time to remove background feature points from the extracted image features. In this embodiment, by contrast, the feature extraction model is fine-tuned so that the feature points it finally extracts no longer include background feature points, which eliminates the need for an additional semantic segmentation network during extraction and reduces the complexity of image feature point extraction.
[0062] In this embodiment, during the training of the feature extraction model, the pre-trained feature extraction model undergoes further fine-tuning training. During this fine-tuning, the feature extraction model extracts image feature points from the training image, and after removing feature points belonging to the background region, the remaining image feature points are used as annotations for the training image. The training image and the annotations are then used to further fine-tune the feature extraction model. This approach overcomes the problem of traditionally trained feature extraction models easily extracting feature points at infinite distances in the background region due to poor semantic distribution. The feature extraction model trained using the fine-tuning method of this embodiment incorporates semantic supervision of the background region during the fine-tuning training process, enabling the model to recognize image feature points in the image background region. Consequently, the extracted image feature points no longer contain image feature points from the background region, improving the performance of the feature extraction model.
[0063] In an optional implementation of this embodiment, the method further includes the following steps:
[0064] Obtain fine-tuning training images that contain no background region;
[0065] Use the feature extraction model to extract image feature points from the background-free fine-tuning training images, and use these image feature points as annotation labels for the background-free fine-tuning training images;
[0066] Fine-tune the feature extraction model using the background-free fine-tuning training images and their annotation labels.
[0067] In this optional implementation, to improve the fine-tuning training effect, fine-tuning training images of background-free regions, such as regions without sky, can be collected. These background-free fine-tuning training images can then be used to further fine-tune the feature extraction model, thereby improving its ability to recognize background regions. In this implementation, the feature extraction model can extract image feature points from the background-free fine-tuning training images. These extracted feature points are used as labels for the background-free fine-tuning training images, and the labels and corresponding fine-tuning training images are then used to further fine-tune the feature extraction model. This approach improves the ability of the fine-tuned feature extraction model to recognize various types of images, thereby increasing the accuracy of the image feature points extracted by the model.
[0068] In an optional implementation of this embodiment, the method further includes the following steps:
[0069] Obtain unlabeled training sample images;
[0070] The training sample images are subjected to homography transformation to obtain transformed sample images;
[0071] Image feature points are extracted from the transformed sample image and the training sample image using a pre-trained base model.
[0072] After the inverse homography transformation is applied to the image feature points extracted from the transformed sample image, the inverse-transformed image feature points are combined with the image feature points extracted from the training sample image to serve as the annotation labels for the training sample image.
[0073] The base model is trained using the training sample images and the annotation labels to obtain the feature extraction model.
[0074] In this optional implementation, the present disclosure proposes a pre-training method for a feature extraction model. This pre-training method is a self-supervised learning process. In this self-supervised learning process, the training sample images do not need to be manually labeled, that is, the training sample images do not have corresponding labels.
[0075] In the self-supervised learning process of the feature extraction model, the first step is to obtain unlabeled training sample images, and then perform various homography transformations on these training sample images to obtain the corresponding transformed sample images.
[0076] In some embodiments, the training sample images used in the pre-training process and the fine-tuning training images used in the fine-tuning training process may be the same or different.
[0077] Before the self-supervised training of the feature extraction model, a base model trained with simple geometric shapes can be obtained. Because it has only been pre-trained with simple geometric shapes, this base model has limited recognition ability: it can extract only relatively obvious image feature points from images.
[0078] The homography-transformed sample image is input into a pre-trained base model. The base model extracts image feature points from the homography-transformed sample image, and the extracted image feature points are then subjected to the inverse homography transformation and used as the annotation labels for the training sample image. Additionally, image feature points from the training sample image itself are extracted using the base model and also used as annotation labels for the training sample image. The base model is then trained using the training sample images and these annotation labels. After training with a large number of training sample images, a self-supervised pre-trained feature extraction model can be obtained.
[0079] In some embodiments, homography transformation may include, but is not limited to, image transformations such as cropping, scaling, rotation, and translation of training sample images.
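As an illustration of how scaling, rotation, and translation can be expressed as a single homography, the following is a minimal sketch in plain Python. The function names `compose_homography` and `warp_point` and all parameter values are assumptions made for this example and do not come from the disclosure. Cropping is not modeled here.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul3(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def compose_homography(scale: float, angle_rad: float,
                       tx: float, ty: float) -> Matrix:
    """Build H = T @ R @ S: scale first, then rotate, then translate."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    scaling = [[scale, 0.0, 0.0], [0.0, scale, 0.0], [0.0, 0.0, 1.0]]
    rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    translation = [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]
    return matmul3(translation, matmul3(rotation, scaling))


def warp_point(h: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map (x, y) through H using homogeneous coordinates."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical parameters: double the size, rotate 90 degrees, shift right by 10.
H = compose_homography(scale=2.0, angle_rad=math.pi / 2, tx=10.0, ty=0.0)
px, py = warp_point(H, 1.0, 0.0)  # approximately (10.0, 2.0)
```

These transformations are special cases of a homography whose last row is (0, 0, 1). A general homography also allows perspective terms in that row, which is why `warp_point` divides by `w`.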
[0080] In some embodiments, the same training sample image can undergo multiple homography transformations to obtain multiple transformed sample images. The base model is then used to extract image feature points from these transformed sample images. The extracted image feature points, after inverse transformation, can all be used as annotation labels for the training sample image. Because the base model has only a simple ability to recognize images, it can extract just the relatively obvious image feature points, such as prominent corner points. By feeding several homography-transformed versions of the training sample image into the base model, this embodiment enables the base model to extract more image feature points across the transformed images than it would from the original image alone.
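The self-labeling step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: `invert3`, `warp_point`, and `self_label` are hypothetical names, detections are passed in as plain lists instead of coming from a real base model, and rounding to the nearest pixel is an assumed way of merging coincident points.

```python
from typing import List, Sequence, Set, Tuple

Matrix = List[List[float]]
Point = Tuple[int, int]


def invert3(m: Matrix) -> Matrix:
    """Invert a 3x3 matrix with the adjugate formula."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is not invertible")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def warp_point(hm: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map (x, y) through a homography using homogeneous coordinates."""
    w = hm[2][0] * x + hm[2][1] * y + hm[2][2]
    return ((hm[0][0] * x + hm[0][1] * y + hm[0][2]) / w,
            (hm[1][0] * x + hm[1][1] * y + hm[1][2]) / w)


def self_label(original_points: Sequence[Point],
               warped_detections: Sequence[Tuple[Matrix, Sequence[Point]]]
               ) -> List[Point]:
    """Merge points found in the original image with points found in warped copies.

    Each entry of warped_detections is (H, points), where H maps original
    coordinates to warped coordinates and points were detected in the warped image.
    """
    merged: Set[Point] = set(original_points)
    for hm, points in warped_detections:
        inverse = invert3(hm)
        for x, y in points:
            bx, by = warp_point(inverse, x, y)
            merged.add((round(bx), round(by)))  # snap back to the pixel grid
    return sorted(merged)


# Hypothetical data: H scales by 2 and shifts x by 5.
H = [[2.0, 0.0, 5.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
labels = self_label(original_points=[(1, 1)],
                    warped_detections=[(H, [(7, 2), (9, 6)])])
```

In this toy case the warped image contributes one point the original detection missed, which is the effect the paragraph above describes. A fuller version would also discard back-projected points that fall outside the image bounds.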
[0081] In an optional implementation of this embodiment, the method further includes the following steps:
[0082] Obtain the geometric shape image and the image feature points of the geometric shape image;
[0083] The base model is trained using the geometric shape image and its image feature points.
[0084] In this optional implementation, as mentioned above, the base model is a model trained with simple geometric shapes, and the model structure of the base model can adopt a neural network model structure based on deep learning.
[0085] For the base model, simple geometric shapes can be used for training. First, images containing simple geometric shapes are acquired, such as images of triangles, squares, and other polygons. Feature points in these simple geometric shape images can be pre-labeled; for example, the vertices of the simple geometric shape can be used as feature points in the image. These feature points are then used as labels for the image. The base model is trained using these simple geometric shape images and their corresponding labels. After training with a large number of simple geometric shape images, a pre-trained base model can be obtained. This base model has a certain ability to recognize image feature points, such as corner points, and can extract these feature points.
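A training sample of this kind can be generated synthetically. The sketch below, in plain Python, draws a polygon outline on a small binary grid and returns its vertices as the labels. The function names, the grid size, and the use of random vertices are assumptions for illustration only. The disclosure does not specify how the geometric shape images are produced.

```python
import random
from typing import List, Tuple

Point = Tuple[int, int]
Image = List[List[int]]


def draw_line(img: Image, p: Point, q: Point) -> None:
    """Rasterise the segment p-q onto img with a simple DDA."""
    (x0, y0), (x1, y1) = p, q
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for t in range(steps + 1):
        x = round(x0 + (x1 - x0) * t / steps)
        y = round(y0 + (y1 - y0) * t / steps)
        img[y][x] = 1


def make_polygon_sample(size: int, n_vertices: int,
                        rng: random.Random) -> Tuple[Image, List[Point]]:
    """Return (image, vertices): a polygon outline and its vertices as labels."""
    vertices = [(rng.randrange(size), rng.randrange(size))
                for _ in range(n_vertices)]
    img = [[0] * size for _ in range(size)]
    for k in range(n_vertices):
        draw_line(img, vertices[k], vertices[(k + 1) % n_vertices])
    return img, vertices


rng = random.Random(0)  # fixed seed so the sample is reproducible
image, corner_labels = make_polygon_sample(size=32, n_vertices=3, rng=rng)
```

Because the vertices are known at generation time, no manual labeling is needed for this stage. With more than three random vertices the outline may self-intersect, so a fuller generator would constrain the vertex layout.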
[0086] Using this base model, image feature points can be extracted from the homography-transformed versions of unlabeled training sample images, and the annotation labels of the unlabeled training sample images can then be obtained based on these image feature points.
[0087] Figure 2 illustrates the pre-training process of the feature extraction model in an embodiment of this disclosure. As shown in Figure 2, the pre-training process of the feature extraction model consists of three parts: pre-training of the base model, self-labeling of training sample images, and training of the feature extraction model.
[0088] During the pre-training process of the base model, the deep learning-based neural network model is trained using simple geometric shapes and their corner points to obtain the base model; that is, the training samples in this process are images containing simple geometric shapes, and the labels are the vertices and corner points of the simple geometric shapes.
[0089] In the self-labeling process, the unlabeled training sample images are homography transformed to obtain transformed sample images. Then, the pre-trained base model from the previous process is used to extract image feature points from the training sample images and the transformed sample images. The image feature points extracted from the transformed sample images are then subjected to inverse homography transformation. The image feature points after inverse transformation and the image feature points extracted from the training sample images are used as the labeling labels for the training sample images, thus completing the self-labeling process.
[0090] During the pre-training process of the feature extraction model, the model is trained using the labels of the training sample images. During the training process, the training sample images can be input into the feature extraction model to obtain the predicted values of feature points. Taking corner points as an example, the first corner point loss function is constructed based on the difference between the predicted corner point position and the corner point position in the label.
[0091] In addition to extracting the location information of corner points, the feature extraction model can also form corner point descriptors. To train the feature extraction model's ability to form descriptors, the training sample images can be transformed, such as homography transformation. The transformed images are then input into the feature extraction model to obtain the predicted values of the corner points in the transformed images. A second corner point loss function is constructed based on the difference between the predicted value and the position of the corner point in the label (which can be obtained after the corresponding image transformation). A descriptor loss function is constructed based on the difference between the predicted value of the corner point descriptor and the predicted value of the descriptor in the training sample image without image transformation.
[0092] The feature extraction model is trained using the first corner loss function, the second corner loss function, and the descriptor loss function, and a pre-trained feature extraction model can be obtained.
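The combination of the three loss terms can be sketched as below. The disclosure only states that each loss is built from a difference. The squared-error form, the one-to-one pairing of predictions with labels, and the `descriptor_weight` parameter are all assumptions made for this example.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Descriptor = Sequence[float]


def corner_loss(predicted: Sequence[Point], labelled: Sequence[Point]) -> float:
    """Mean squared distance between paired predicted and labelled corners."""
    total = sum((px - lx) ** 2 + (py - ly) ** 2
                for (px, py), (lx, ly) in zip(predicted, labelled))
    return total / len(predicted)


def descriptor_loss(warped: Sequence[Descriptor],
                    original: Sequence[Descriptor]) -> float:
    """Mean squared difference between descriptors of corresponding corners."""
    total = sum(sum((a - b) ** 2 for a, b in zip(dw, do))
                for dw, do in zip(warped, original))
    return total / len(warped)


def total_loss(first: float, second: float, descriptor: float,
               descriptor_weight: float = 1.0) -> float:
    """Sum the two corner losses and the weighted descriptor loss."""
    return first + second + descriptor_weight * descriptor


# Hypothetical values for one training sample.
first = corner_loss([(1.0, 1.0), (4.0, 5.0)], [(1.0, 1.0), (4.0, 3.0)])
second = corner_loss([(2.0, 2.0)], [(2.0, 3.0)])
desc = descriptor_loss([[1.0, 0.0]], [[0.0, 0.0]])
loss = total_loss(first, second, desc, descriptor_weight=0.5)
```

Here `first` is computed on the untransformed image, `second` on the transformed image against the correspondingly transformed labels, and `desc` compares descriptors of the same corners across the two images.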
[0093] In an optional implementation of this embodiment, step S103, which involves removing image feature points belonging to the background region of the fine-tuning training image and using the retained image feature points as annotation labels for the fine-tuning training image, further includes the following steps:
[0094] The fine-tuning training image is semantically segmented to obtain a semantic segmentation result; the semantic segmentation result includes the location information of the background region of the fine-tuning training image.
[0095] The semantic segmentation result is used to remove image feature points belonging to the background region of the fine-tuning training image, and the retained image feature points are used as annotation labels for the fine-tuning training image.
[0096] In this optional implementation, a pre-trained feature extraction model is used to extract image feature points in the fine-tuning training image. Furthermore, existing semantic segmentation methods can be used to perform semantic segmentation on the fine-tuning training image to obtain a semantic segmentation result. This semantic segmentation result may include the location information of the background region and/or foreground region in the fine-tuning training image.
[0097] Using the semantic segmentation results, after removing the image feature points belonging to the background region from the image feature points extracted in the fine-tuning training image, the remaining image feature points are used as the annotation labels for the fine-tuning training image.
[0098] Then, the feature extraction model is fine-tuned using the fine-tuning training image together with annotation labels that contain only image feature points in the foreground region. This enables the feature extraction model to extract image feature points belonging to the foreground region from an image without extracting image feature points in the background region. This overcomes the defect of traditional image feature extraction models that extract image feature points at infinity, such as in the sky, and improves the feature extraction performance of the model.
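The label-filtering step can be sketched as follows, assuming the semantic segmentation result is available as a boolean background mask indexed as `background_mask[y][x]`. The function name and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def remove_background_points(
    points: Sequence[Point],
    background_mask: Sequence[Sequence[bool]],
) -> List[Point]:
    """Keep only the points whose pixel is not marked as background.

    background_mask[y][x] is True where the segmentation result says the
    pixel belongs to the background region (for example, sky).
    """
    height = len(background_mask)
    width = len(background_mask[0]) if height else 0
    kept = []
    for x, y in points:
        inside = 0 <= x < width and 0 <= y < height
        if inside and not background_mask[y][x]:
            kept.append((x, y))
    return kept


# Toy 4x4 image whose top two rows are "sky".
mask = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]
detected = [(0, 0), (3, 1), (1, 2), (2, 3)]
labels = remove_background_points(detected, mask)
```

The retained points in `labels` would then serve as the annotation labels for fine-tuning. The segmentation mask is needed only when the labels are built, not when the fine-tuned model is later used for extraction.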
[0099] Figure 3 shows a schematic diagram of the fine-tuning process of a feature extraction model according to an embodiment of this disclosure. As shown in Figure 3, the unlabeled fine-tuning training image is semantically segmented to obtain the semantic segmentation result. The fine-tuning training image is also processed by the pre-trained feature extraction model to extract corner points. Based on the semantic segmentation result, the corner points belonging to the background region are removed. The retained corner points are used as the annotation labels of the fine-tuning training image. The feature extraction model is further fine-tuned using the fine-tuning training image and the annotation labels, yielding the improved feature extraction model.
[0100] According to an embodiment of the present disclosure, the image feature extraction method includes: extracting image feature points from an image using a feature extraction model trained by the training method described above.
[0101] In this embodiment, the image feature extraction method is executed on the user terminal and applied to the visual odometry process of visual SLAM. The terminal can be a mobile phone, iPad, computer, smartwatch, vehicle, etc. In this embodiment, for images acquired by an image acquisition device on the terminal, image feature points are extracted using a feature extraction model. Image feature matching or optical flow tracking methods are then used to track these extracted feature points in subsequently acquired images. A three-dimensional spatial model is then built based on the successfully tracked feature points, and this model is fused with map data to obtain an AR navigation map. This AR navigation map is presented on the user terminal's navigation interface, providing AR navigation for the user.
[0102] According to an embodiment of the present disclosure, the AR navigation method includes: extracting image feature points in a real-world image using the image feature extraction method described above, and providing AR navigation services to the navigated object based on the extracted image feature points.
[0103] In this embodiment, the AR navigation method is executed on a navigation device and can be applied during AR navigation. The navigation device can be a mobile phone, iPad, computer, smartwatch, vehicle, AR glasses, or other wearable device. In this embodiment, a visual odometry module can be pre-deployed on the navigation device, and an image acquisition device can also be installed on the navigation device. While the vehicle or other navigated object carrying the navigation device is moving, the image acquisition device can acquire real-world images of the three-dimensional space in real time. These real-world images are passed to the visual odometry module in real time, so that it can extract image feature points from them using the feature extraction model, and use image feature matching or optical flow tracking methods to match or track the extracted image feature points in subsequently acquired real-world images. A three-dimensional spatial model is then built based on the tracked image feature points and fused with map data to obtain an AR navigation map. This AR navigation map is presented on the navigation interface of the navigation device, providing AR navigation services to the navigated object.
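One simple form of image feature matching between consecutive frames is mutual nearest-neighbor matching of descriptors, sketched below in plain Python. This is one possible matching method chosen for illustration. The disclosure does not say which matching method the visual odometry uses, and the function names and descriptor values are hypothetical.

```python
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def sq_dist(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: Descriptor, candidates: Sequence[Descriptor]) -> int:
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)),
               key=lambda j: sq_dist(query, candidates[j]))


def mutual_matches(prev_desc: Sequence[Descriptor],
                   next_desc: Sequence[Descriptor]) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where prev i and next j are each other's nearest neighbour."""
    if not prev_desc or not next_desc:
        return []
    matches = []
    for i, descriptor in enumerate(prev_desc):
        j = nearest(descriptor, next_desc)
        if nearest(next_desc[j], prev_desc) == i:
            matches.append((i, j))
    return matches


# Hypothetical 2-D descriptors from two consecutive frames.
prev_frame = [[0.0, 1.0], [1.0, 0.0]]
next_frame = [[0.9, 0.1], [0.1, 0.9], [5.0, 5.0]]
tracks = mutual_matches(prev_frame, next_frame)
```

The third descriptor in `next_frame` has no counterpart in the previous frame and is left unmatched. Only the matched pairs would be passed on as successfully tracked feature points.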
[0104] Figure 4 illustrates the application of a feature extraction model according to an embodiment of the present disclosure in an AR navigation scenario. For example, as shown in Figure 4, the server can pre-train a base model based on simple geometric shapes and their corner points. After receiving training sample images for pre-training the feature extraction model, the server can use the pre-trained base model to self-annotate the training sample images, obtaining annotation labels for them. The server then uses the training sample images and their annotation labels to train the base model, resulting in the pre-trained feature extraction model.
[0105] To improve the feature extraction model's ability to extract image feature points and to avoid extracting feature points at infinity from the background region, the server can randomly select a subset of the training sample images. The server performs semantic segmentation on this subset and uses the feature extraction model to extract image feature points from the same images. Based on the semantic segmentation results, image feature points belonging to the background region are removed. The retained image feature points, together with the selected training sample images, are then used to fine-tune the feature extraction model. The server can then push the fine-tuned feature extraction model to the navigation device.
[0106] During AR navigation, the navigation device activates visual odometry and acquires real-world images in real time. These images are input into the visual odometry, which then feeds them into a feature extraction model for image feature extraction. The extracted image feature points are used by the visual odometry for feature point tracking, and a three-dimensional spatial model is built based on the successfully tracked image feature points. This three-dimensional spatial model is fused with map data and then rendered before being output to the navigation interface of the navigation device, providing AR navigation services to the navigated object.
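The feature point tracking step described above can be illustrated with a minimal sketch. It assumes that each extracted feature point carries a descriptor vector, which the disclosure does not specify, and it uses mutual nearest-neighbour matching as one possible image feature matching method. The function name `match_features` and the `max_dist` threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _nearest(query: Sequence[float], pool: List[Sequence[float]]) -> int:
    """Index of the descriptor in `pool` that is closest to `query`."""
    return min(range(len(pool)), key=lambda j: _dist(query, pool[j]))


def match_features(
    desc_prev: List[Sequence[float]],
    desc_next: List[Sequence[float]],
    max_dist: float = 0.5,  # assumed threshold, not taken from the disclosure
) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where desc_prev[i] and desc_next[j] are mutual
    nearest neighbours and no farther apart than `max_dist`."""
    if not desc_prev or not desc_next:
        return []
    matches = []
    for i, d in enumerate(desc_prev):
        j = _nearest(d, desc_next)
        symmetric = _nearest(desc_next[j], desc_prev) == i
        if symmetric and _dist(d, desc_next[j]) <= max_dist:
            matches.append((i, j))
    return matches
```

Feature points that find no mutual match are treated as lost, so only successfully tracked points would be passed on to the three-dimensional model building step.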
[0107] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein.
[0108] Figure 5 is a structural block diagram of a training apparatus for a feature extraction model according to an embodiment of the present disclosure. This apparatus can be implemented as part or all of an electronic device through software, hardware, or a combination of both. As shown in Figure 5, the training apparatus for the feature extraction model includes:
[0109] The first acquisition module 501 is configured to acquire fine-tuning training images;
[0110] The first extraction module 502 is configured to extract image feature points of the fine-tuning training image using a pre-trained feature extraction model;
[0111] The removal module 503 is configured to remove image feature points belonging to the background region of the fine-tuning training image, and use the retained image feature points as annotation labels for the fine-tuning training image;
[0112] The first training module 504 is configured to fine-tune the feature extraction model using the fine-tuning training image and the annotation labels.
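The four modules listed above can be pictured as one small pipeline. The following is a minimal sketch, not the disclosed implementation: the class name `FineTuningApparatus`, its field names, and the callable signatures are all hypothetical, and the model, the background test, and the training routine are injected as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates
Sample = Tuple[object, List[Point]]  # (image, annotation labels)


@dataclass
class FineTuningApparatus:
    """Hypothetical wiring of modules 501 to 504 as injected callables."""

    acquire: Callable[[], Sequence[object]]         # first acquisition module 501
    extract: Callable[[object], List[Point]]        # first extraction module 502
    is_background: Callable[[object, Point], bool]  # test used by removal module 503
    train: Callable[[List[Sample]], None]           # first training module 504

    def run(self) -> List[Sample]:
        samples: List[Sample] = []
        for image in self.acquire():
            points = self.extract(image)
            # Retained (non-background) points become the annotation labels.
            labels = [p for p in points if not self.is_background(image, p)]
            samples.append((image, labels))
        self.train(samples)
        return samples
```

Injecting the callables keeps the sketch independent of any particular network or segmentation method, neither of which is fixed by the module descriptions.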
[0113] In this embodiment, the training device for the feature extraction model can be executed on a server to fine-tune the pre-trained feature extraction model so that the feature extraction model can extract more accurate feature points from the image, thereby overcoming the defect of extracting feature points at infinity from the background region.
[0114] It is understood that the feature extraction model can be pre-trained and has the ability to extract image feature points. In view of the problem that the feature extraction model pre-trained by traditional devices is insufficient, this disclosure proposes a fine-tuning training device. After fine-tuning training by this disclosure, the feature extraction model can overcome the problem of extracting feature points at infinity from the background region of the image.
[0115] In some embodiments, the feature extraction model can be a deep learning-based neural network model.
[0116] In order to fine-tune the feature extraction model, this embodiment first collects fine-tuning training images, which may be unlabeled, that is, the labels of the fine-tuning training images are unknown.
[0117] Subsequently, the feature extraction model can be used to extract image feature points from the fine-tuning training image, and feature points belonging to the background region of the fine-tuning training image can be removed from the extracted image feature points. The retained image feature points are used as the annotation labels of the fine-tuning training image.
[0118] By further training the feature extraction model using the fine-tuning training image and its annotation labels, the model learns to exclude feature points in the image background. This avoids the problem of traditional devices extracting feature points from background regions, such as sky features in outdoor images. It should be noted that traditional devices require a separate semantic segmentation network to remove background feature points from the extracted image features. In this embodiment, by contrast, the feature extraction model is fine-tuned so that the image feature points it finally extracts no longer include background feature points, eliminating the need for an additional semantic segmentation network at extraction time and reducing the complexity of image feature point extraction.
[0119] In this embodiment, during the training of the feature extraction model, the pre-trained feature extraction model undergoes further fine-tuning training. During this fine-tuning, the feature extraction model extracts image feature points from the training image, and after removing feature points belonging to the background region, the remaining image feature points are used as annotations for the training image. The training image and the annotations are then used to further fine-tune the feature extraction model. This approach overcomes the problem that feature extraction models trained by traditional devices often extract feature points at infinite distances in the background region due to poor semantic distribution. The feature extraction model trained by the fine-tuning training device of this embodiment incorporates semantic supervision of the background region during the fine-tuning training process, enabling the model to recognize image feature points in the image background region. This ensures that the extracted image feature points no longer contain image feature points from the background region, thus improving the performance of the feature extraction model.
[0120] An optional implementation of this embodiment adds the following modules to the apparatus.
[0121] The device further includes:
[0122] The second acquisition module is configured to acquire fine-tuning training images without a background region;
[0123] The second extraction module is configured to use the feature extraction model to extract image feature points of the fine-tuning training images without a background region, and use the image feature points as annotation labels of those images;
[0124] The second training module is configured to fine-tune the feature extraction model using the annotation labels of the fine-tuning training images without a background region.
[0125] In this optional implementation, to improve the fine-tuning training effect, fine-tuning training images of background-free regions, such as regions without sky, can be collected. These background-free fine-tuning training images can then be used to further fine-tune the feature extraction model, thereby improving its ability to recognize background regions. In this implementation, the feature extraction model can extract image feature points from the background-free fine-tuning training images. These extracted feature points are used as labels for the background-free fine-tuning training images, and the labels and corresponding fine-tuning training images are then used to further fine-tune the feature extraction model. This approach improves the ability of the fine-tuned feature extraction model to recognize various types of images, thereby increasing the accuracy of the image feature points extracted by the model.
[0126] In an optional implementation of this embodiment, the apparatus further includes:
[0127] The third acquisition module is configured to acquire training sample images without labels;
[0128] The transformation module is configured to perform homography transformation on the training sample image to obtain a transformed sample image;
[0129] The third extraction module is configured to extract image feature points from the transformed sample image and the training sample image using a pre-trained base model.
[0130] The inverse transformation module is configured to perform the inverse transformation of the homography transformation on the image feature points extracted from the transformed sample image, and then use them together with the image feature points extracted from the training sample image as the annotation labels of the training sample image.
[0131] The third training module is configured to train the base model using the training sample images and the annotation labels to obtain a feature extraction model.
[0132] In this optional implementation, the present disclosure proposes a pre-training device for a feature extraction model. This pre-training device implements a self-supervised learning process in which the training sample images do not need to be manually labeled, that is, the training sample images do not have corresponding labels.
[0133] In the self-supervised learning process of the feature extraction model, the first step is to obtain unlabeled training sample images, and then perform various homography transformations on these training sample images to obtain the corresponding transformed sample images.
[0134] In some embodiments, the training sample images used in the pre-training process and the fine-tuning training images used in the fine-tuning training process may be the same or different.
[0135] Before the self-supervised training of the feature extraction model, a base model trained with simple geometric shapes can be obtained. This base model has a basic ability to recognize images and can extract relatively obvious image feature points from them. In other words, the base model has been pre-trained with simple geometric shapes and has limited recognition ability.
[0136] The homography-transformed sample image is input into a pre-trained base model. The base model extracts image feature points from the homography-transformed sample image and performs an inverse homography transformation on these feature points, which are then used as labels for the training sample image. Additionally, the base model extracts image feature points from the training sample image itself, which are also used as labels for the training sample image. The base model is then trained using the training sample images and these labels. After training with a large number of training sample images, a self-supervised pre-trained feature extraction model is obtained.
[0137] In some embodiments, homography transformation may include, but is not limited to, image transformations such as cropping, scaling, rotation, and translation of training sample images.
[0138] In some embodiments, the same training sample image can undergo multiple homography transformations to obtain multiple transformed sample images. The base model is then used to extract image feature points from these transformed sample images, and the extracted feature points, after inverse transformation, can all be used as annotation labels for the training sample image. Because the base model has only a basic ability to recognize images, it extracts relatively obvious image feature points, such as prominent corner points. Feeding multiple homography-transformed versions of the same training sample image into the base model therefore allows it to extract more image feature points than it would from the original image alone.
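The self-annotation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the image warping function `warp` and the base-model detector `detect` are hypothetical callables supplied by the caller, the homographies are 3x3 matrices given as nested lists, and points that coincide after rounding are merged, a choice the disclosure does not specify.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float]


def invert_3x3(h: Matrix) -> Matrix:
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, hh, i) = h
    det = a * (e * i - f * hh) - b * (d * i - f * g) + c * (d * hh - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is not invertible")
    return [
        [(e * i - f * hh) / det, (c * hh - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * hh - e * g) / det, (b * g - a * hh) / det, (a * e - b * d) / det],
    ]


def apply_homography(h: Matrix, p: Point) -> Point:
    """Map point p = (x, y) through homography h using homogeneous coordinates."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (
        (h[0][0] * x + h[0][1] * y + h[0][2]) / w,
        (h[1][0] * x + h[1][1] * y + h[1][2]) / w,
    )


def homography_labels(
    image: object,
    homographies: Sequence[Matrix],
    warp: Callable[[object, Matrix], object],  # hypothetical image warp
    detect: Callable[[object], List[Point]],   # hypothetical base-model detector
    ndigits: int = 3,
) -> List[Point]:
    """Union of points detected on the original image and points detected on
    each transformed image after mapping them back with the inverse homography."""
    labels = {(round(x, ndigits), round(y, ndigits)) for x, y in detect(image)}
    for h in homographies:
        h_inv = invert_3x3(h)
        for p in detect(warp(image, h)):
            x, y = apply_homography(h_inv, p)
            labels.add((round(x, ndigits), round(y, ndigits)))
    return sorted(labels)
```

A point that the detector misses in the original image but finds in a transformed image still ends up in the label set, which is how multiple transformations can yield more annotation labels.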
[0139] In an optional implementation of this embodiment, the apparatus further includes:
[0140] The fourth acquisition module is configured to acquire a geometric shape image and image feature points of the geometric shape image;
[0141] The fourth training module is configured to train a base model using the geometric shape image and the image feature points of the geometric shape image.
[0142] For the base model, simple geometric shapes can be used for training. First, images containing simple geometric shapes are acquired, such as images of triangles, squares, and other polygons. Feature points in these simple geometric shape images can be pre-labeled; for example, the vertices of the simple geometric shape can be used as feature points in the image. These feature points are then used as labels for the image. The base model is trained using these simple geometric shape images and their corresponding labels. After training with a large number of simple geometric shape images, a pre-trained base model can be obtained. This base model has a certain ability to recognize image feature points, such as corner points, and can extract these feature points.
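As an illustration of how labeled geometric shape images might be produced, the sketch below draws a polygon outline on a small binary canvas and returns its vertices as the feature point labels. The function names, the canvas size, the radius range, and the use of an outline rather than a filled shape are all assumptions made for the example.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def random_polygon(n_vertices: int, size: int, rng: random.Random) -> List[Point]:
    """Vertices placed at jittered, increasing angles around the canvas centre."""
    centre = size / 2
    vertices = []
    for k in range(n_vertices):
        angle = 2 * math.pi * (k + rng.uniform(0.0, 0.5)) / n_vertices
        radius = rng.uniform(0.25, 0.45) * size  # keeps vertices inside the canvas
        vertices.append(
            (int(centre + radius * math.cos(angle)),
             int(centre + radius * math.sin(angle)))
        )
    return vertices


def draw_outline(vertices: List[Point], size: int) -> List[List[int]]:
    """Rasterize the closed polygon outline onto a size x size binary image."""
    image = [[0] * size for _ in range(size)]
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for t in range(steps + 1):
            x = round(x0 + (x1 - x0) * t / steps)
            y = round(y0 + (y1 - y0) * t / steps)
            image[y][x] = 1
    return image


def make_shape_sample(
    n_vertices: int, size: int, rng: random.Random
) -> Tuple[List[List[int]], List[Point]]:
    """Return (image, labels), where the labels are the polygon vertices."""
    vertices = random_polygon(n_vertices, size, rng)
    return draw_outline(vertices, size), vertices
```

Because the vertices are known at generation time, no manual annotation is needed for these images.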
[0143] Using this base model, image feature points can be extracted from the homography-transformed versions of the unlabeled training sample images, and the annotation labels of the unlabeled training sample images can then be obtained from these image feature points.
[0144] In an optional implementation of this embodiment, the removal module includes:
[0145] The segmentation submodule is configured to perform semantic segmentation on the fine-tuning training image to obtain a semantic segmentation result; the semantic segmentation result includes the location information of the background region of the fine-tuning training image.
[0146] The removal submodule is configured to use the semantic segmentation result to remove image feature points belonging to the background region of the fine-tuning training image, and use the retained image feature points as annotation labels for the fine-tuning training image.
[0147] In this optional implementation, a pre-trained feature extraction model is used to extract image feature points in the fine-tuning training image. In addition, an existing semantic segmentation device can be used to perform semantic segmentation on the fine-tuning training image to obtain a semantic segmentation result. This semantic segmentation result may include the location information of the background region and / or foreground region in the fine-tuning training image.
[0148] Using the semantic segmentation results, after removing the image feature points belonging to the background region from the image feature points extracted in the fine-tuning training image, the remaining image feature points are used as the annotation labels for the fine-tuning training image.
[0149] The feature extraction model is then fine-tuned using the fine-tuning training image and the annotation labels, which contain only image feature points in the foreground region. This enables the feature extraction model to extract image feature points belonging to the foreground region of an image without extracting image feature points in the background region. It overcomes the defect of traditional image feature extraction models, which extract image feature points at infinity, such as in the sky, and improves the feature extraction effect of the image feature extraction model.
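The removal step can be sketched with a per-pixel mask standing in for the semantic segmentation result. The mask encoding (1 for background, 0 for foreground), the function name `remove_background_points`, and the decision to drop points that fall outside the mask are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates

BACKGROUND = 1  # assumed mask value for background pixels such as sky


def remove_background_points(
    points: Sequence[Point],
    mask: Sequence[Sequence[int]],
    background_value: int = BACKGROUND,
) -> List[Point]:
    """Keep only the feature points whose mask pixel is not background.

    `mask[y][x]` holds the segmentation class of pixel (x, y). Points that
    lie outside the mask are dropped as well.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    kept = []
    for x, y in points:
        inside = 0 <= x < width and 0 <= y < height
        if inside and mask[y][x] != background_value:
            kept.append((x, y))
    return kept
```

The returned list corresponds to the retained image feature points that serve as annotation labels for the fine-tuning training image.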
[0150] According to an embodiment of the present disclosure, the image feature extraction apparatus is configured to extract image feature points from an image using a feature extraction model trained by the training apparatus for the feature extraction model described above.
[0151] In this embodiment, the image feature extraction device is located on the user terminal and is applied in the visual odometry process of visual SLAM. The terminal can be a mobile phone, iPad, computer, smartwatch, vehicle, etc. In this embodiment, for images acquired by the image acquisition device on the terminal, image feature points are extracted using a feature extraction model. The extracted image feature points are then tracked in subsequently acquired images using an image feature matching device or an optical flow tracking device. A three-dimensional spatial model is then built based on the successfully tracked image feature points, and this three-dimensional spatial model is fused with map data to finally obtain an AR navigation map. This AR navigation map is presented on the navigation interface of the user terminal to provide AR navigation for the user.
[0152] According to an embodiment of the present disclosure, the AR navigation device is configured to extract image feature points from a real-world image using the image feature extraction device described above, and to provide AR navigation services to the navigated object based on the extracted image feature points.
[0153] In this embodiment, the AR navigation device is located within the navigation equipment and can be used during AR navigation. The navigation equipment can be a mobile phone, iPad, computer, smartwatch, vehicle, AR glasses, or other wearable devices. In this embodiment, a visual odometry calculation method can be pre-deployed on the navigation equipment, and an image acquisition device can also be installed on the navigation equipment. During the movement of the vehicle carrying the navigation equipment or other navigated objects, the image acquisition device can acquire real-world images in three-dimensional space in real time. The real-world images acquired by the image acquisition device are transmitted to the visual odometry in real time, so that the visual odometry can use a feature extraction model to extract image feature points in the real-world images, and use an image feature matching device or an optical flow tracking device to match or track the extracted image feature points in subsequently acquired real-world images. Then, a three-dimensional spatial model is built based on the tracked image feature points, and the three-dimensional spatial model is fused with map data to finally obtain an AR navigation map. This AR navigation map is presented on the navigation interface of the navigation equipment, providing AR navigation services to the navigated objects.
[0154] Figure 6 is a schematic diagram of the structure of an electronic device suitable for implementing a training method, an image feature extraction method, and / or an AR navigation method for a feature extraction model according to embodiments of the present disclosure.
[0155] As shown in Figure 6, the electronic device 600 includes a processing unit 601, which can be implemented as a CPU, GPU, FPGA, NPU, or other processing unit. The processing unit 601 can execute various processes according to any of the methods described above in this disclosure, based on a program stored in the read-only memory (ROM) 602 or a program loaded from the storage section 608 into the random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing unit 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input / output (I / O) interface 605 is also connected to the bus 604.
[0156] The following components are connected to I / O interface 605: an input section 606 including a keyboard, mouse, etc.; an output section 607 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card, modem, etc. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to I / O interface 605 as needed. A removable medium 611, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 610 as needed so that computer programs read from it can be installed into storage section 608 as needed.
[0157] In particular, according to embodiments of this disclosure, any of the methods described above in the embodiments of this disclosure can be implemented as a computer software program. For example, embodiments of this disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing any of the methods in the embodiments of this disclosure. In such an embodiment, the computer program can be downloaded and installed from a network via communication section 609, and / or installed from removable medium 611.
[0158] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0159] The units or modules described in the embodiments of this disclosure can be implemented in software or hardware. The described units or modules can also be located in a processor, and the names of these units or modules do not necessarily constitute a limitation on the unit or module itself.
[0160] In another aspect, this disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the apparatus described in the above embodiments; or it may be a standalone computer-readable storage medium not assembled into a device. The computer-readable storage medium stores one or more programs that are used by one or more processors to perform the methods described in this disclosure.
[0161] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the inventive concept. For example, it covers technical solutions formed by substituting the above-described features with (but not limited to) technical features disclosed in this disclosure that have similar functions.
Claims
1. A method for fine-tuning training of a feature extraction model, comprising: acquiring fine-tuning training images, which are unlabeled; extracting image feature points of the fine-tuning training images using a pre-trained feature extraction model; removing image feature points belonging to the background region of the fine-tuning training images, and using the retained image feature points as annotation labels of the fine-tuning training images; and fine-tuning the pre-trained feature extraction model using the fine-tuning training images and the annotation labels, so that the feature extraction model has the ability to recognize image feature points in the background region of an image.
2. The method according to claim 1, further comprising: acquiring fine-tuning training images without a background region; extracting image feature points from the fine-tuning training images without a background region using the feature extraction model, and using the image feature points as annotation labels of the fine-tuning training images without a background region; and fine-tuning the feature extraction model using the annotation labels of the fine-tuning training images without a background region.
3. The method according to claim 1 or 2, further comprising: acquiring unlabeled training sample images; performing homography transformation on the training sample images to obtain transformed sample images; extracting image feature points from the transformed sample images and the training sample images using a pre-trained base model; performing the inverse transformation of the homography transformation on the image feature points extracted from the transformed sample images, and using the result together with the image feature points extracted from the training sample images as annotation labels of the training sample images; and training the base model using the training sample images and the annotation labels to obtain the feature extraction model.
4. The method according to claim 3, further comprising: acquiring a geometric shape image and image feature points of the geometric shape image; and training the base model using the geometric shape image and the image feature points of the geometric shape image.
5. The method according to any one of claims 1-2 and 4, wherein removing image feature points belonging to the background region of the fine-tuning training images, and using the retained image feature points as annotation labels of the fine-tuning training images, comprises: performing semantic segmentation on the fine-tuning training images to obtain a semantic segmentation result, the semantic segmentation result including location information of the background region of the fine-tuning training images; and using the semantic segmentation result to remove image feature points belonging to the background region of the fine-tuning training images, and using the retained image feature points as annotation labels of the fine-tuning training images.
6. An image feature extraction method, comprising: extracting image feature points from an image using a feature extraction model trained by the method according to any one of claims 1-5.
7. An AR navigation method, comprising: extracting image feature points from real-world images using the method according to claim 6, and providing AR navigation services to a navigated object based on the extracted image feature points.
8. A fine-tuning training device for a feature extraction model, comprising: a first acquisition module configured to acquire fine-tuning training images, which are unlabeled; a first extraction module configured to extract image feature points of the fine-tuning training images using a pre-trained feature extraction model; a removal module configured to remove image feature points belonging to the background region of the fine-tuning training images, and use the retained image feature points as annotation labels of the fine-tuning training images; and a first training module configured to fine-tune the pre-trained feature extraction model using the fine-tuning training images and the annotation labels, so that the feature extraction model has the ability to recognize image feature points in the background region of an image.
9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method according to any one of claims 1-7.
10. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the method according to any one of claims 1-7.