An object recognition method, apparatus, electronic device, and storage medium

A multi-view matching strategy and a feature extraction network are used to suppress background interference and to calculate the multi-view feature matching similarity. This addresses the loss of object recognition accuracy caused by angle changes and occlusion, and achieves efficient object recognition.

CN119229210B · Active · Publication Date: 2026-06-19 · SHENZHEN LINKRIC TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN LINKRIC TECH CO LTD
Filing Date
2024-10-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies are not accurate enough in object recognition, especially when the object angle changes greatly, the viewpoint is random, or there is partial occlusion.

Method used

A multi-view matching strategy is adopted. By acquiring multi-view images of the object to be identified, multi-view features are extracted. An explicit supervised spatial attention module and an automatic alignment cross attention module are used to suppress background interference. The matching similarity between the multi-view features and the multi-view template features of the objects in the retrieval database is calculated, and the recognition result is finally determined.

Benefits of technology

It effectively solves the problem of accurate object recognition when the object has a large angle change, random viewpoint, or partial occlusion in the image.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN119229210B_ABST

Abstract

This invention discloses an object recognition method, apparatus, electronic device, and storage medium. It acquires multi-view images of the object to be recognized, extracts multi-view features from these images, and employs a multi-view matching strategy to determine the multi-view similarity between the multi-view features and the multi-view template features of various objects in a search database. Based on these multi-view similarity scores, the recognition result of the object to be recognized is determined. This invention utilizes a multi-view matching strategy, matching the multi-view features of the object to be recognized with the multi-view template features of various objects in a search database. This effectively solves the problem of accurately recognizing the object when the angle of the object changes significantly, the viewpoint is random, and the object is partially occluded in the image.

Description

Technical Field

[0001] This invention relates to the field of object recognition technology, and more specifically, to an object recognition method, apparatus, electronic device, and storage medium.

Background Technology

[0002] In daily life and production, object recognition (such as that for business processing, security, and other purposes) is required. Current methods typically rely on acquired images for object recognition. However, when the acquired image contains objects with large variations in angle, random viewing angles, or even partial occlusion, object recognition can easily fail.

Summary of the Invention

[0003] In view of this, the present invention discloses an object recognition method, apparatus, electronic device and storage medium, so as to achieve accurate recognition of the object to be recognized even when the angle of the object to be recognized in the image changes greatly and the viewing angle is random, especially when the object to be recognized is partially occluded.

[0004] An object recognition method, comprising:

[0005] Acquire multi-view images of the object to be identified;

[0006] Extract multi-view features from the multi-view images;

[0007] A multi-view matching strategy is adopted to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library;

[0008] The recognition result of the object to be identified is determined based on each multi-view matching similarity.

[0009] Optionally, extracting multi-view features from the multi-view images includes:

[0010] The multi-view features are extracted from the multi-view images using a feature extraction network;

[0011] The feature extraction network includes: an explicit supervised spatial attention module and an automatic alignment cross-attention module;

[0012] The explicit supervised spatial attention module is used to perform masking operations during feature extraction to suppress background region features of the multi-view image;

[0013] The automatic alignment cross-attention module is used to perform feature segmentation and align local region features in the multi-view image to obtain the multi-view features.

[0014] Optionally, the explicit supervised spatial attention module takes the intermediate features extracted from the multi-view image by the backbone network as input and, after channel compression and reshaping, obtains an attention vector with local and global perception. The attention vector is then processed by a multilayer perceptron and reshaped to obtain the spatial attention.

[0015] The automatic alignment cross-attention module is used to take the image features processed by the spatial attention as input, divide the image features into grid blocks to obtain multiple local region features of the multi-view image, calculate the cross-attention between the multiple local region features, and aggregate local features with similar semantics to the corresponding query to perform local region feature alignment to obtain the multi-view features.

[0016] Optionally, the step of employing a multi-view matching strategy to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database includes:

[0017] A multi-view matching strategy is used to determine the similarity between the multi-view features and the multi-view template features of each object in the retrieval library, thereby obtaining a set of similarity values for each object;

[0018] The average similarity of the set of similarity values is taken, and the resulting average similarity is used as the multi-view matching similarity between the object to be identified and the corresponding object.

[0019] Optionally, the step of extracting multi-view features from the multi-view images and employing a multi-view matching strategy to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database includes:

[0020] The multi-view features are extracted from the multi-view images using a pre-trained object recognition model, and a multi-view matching strategy is adopted to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library.

[0021] Optionally, the training process of the object recognition model includes:

[0022] The first stage, self-supervised pre-training of the object recognition model, is carried out on the ImageNet-1K dataset to obtain the backbone network parameters and the explicit supervised spatial attention module parameters;

[0023] Obtain an object recognition dataset, which includes multiple individuals of each object and multi-view images of each individual, with each multi-view image labeled with an object bounding box and a mask label;

[0024] Using the backbone network parameters and the explicit supervised spatial attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, the second-stage task head training of the object recognition model is performed to obtain the automatic alignment cross-attention module parameters.

[0025] Using the backbone network parameters, the explicit supervised spatial attention module parameters, and the automatic alignment cross attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, the third stage of end-to-end training of the object recognition model is performed to obtain the object recognition model.

[0026] Optionally, determining the recognition result of the object to be identified based on each multi-view matching similarity includes:

[0027] Each of the multi-view matching similarities is used as the final probability score of the object to be identified as the object corresponding to the multi-view matching similarity, and the maximum value of the final probability score is determined from all the final probability scores.

[0028] The target object corresponding to the maximum final probability score is determined as the recognition result of the object to be identified.

[0029] Optionally, determining the recognition result of the object to be identified based on each multi-view matching similarity includes:

[0030] The multi-view features are input into the customized classification module to obtain the classification score of the object to be identified as each object in the retrieval library;

[0031] The weighted average of the multi-view matching similarity and the classification score corresponding to each object is used to obtain the final probability score of the object to be identified as the object.

[0032] Determine the maximum final probability score from all the final probability scores;

[0033] The target object corresponding to the maximum final probability score is determined as the recognition result of the object to be identified.

[0034] Optionally, determining the target object corresponding to the maximum final probability score as the recognition result of the object to be identified includes:

[0035] Determine whether the maximum value of the final probability score is greater than the similarity threshold;

[0036] If so, the target object corresponding to the maximum value of the final probability score is determined as the recognition result of the object to be identified.

[0037] Optionally, the training process of the customized classification module includes:

[0038] Obtain a customized fine-tuning dataset of images to be identified, the customized fine-tuning dataset including: unlabeled images and labeled images;

[0039] When performing customized training using a semi-supervised training method, a pseudo-label is assigned to each of the unlabeled images, and the true label is determined for each of the labeled images;

[0040] The target multi-view features are extracted from the customized fine-tuning dataset, and the target multi-view features are classified in a customized manner to obtain the target classification score of each image to be identified in the customized fine-tuning dataset.

[0041] For each unlabeled image's pseudo-label, each labeled image's real label, and each target classification score, the cross-entropy loss function is used to calculate the loss function value, and the parameters of the customized classification module are trained using the backpropagation algorithm to obtain the customized classification module.
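The semi-supervised loss computation in the preceding steps can be illustrated with a minimal sketch. The text does not state how pseudo-labels are assigned, so the confidence rule in `assign_pseudo_label` and its `min_confidence` parameter are assumptions made only for illustration, as are all function names. Backpropagation is not shown; the returned loss value is what a training framework would differentiate.

```python
import math

def softmax(scores):
    """Convert raw classification scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, label_index):
    """Cross-entropy loss of one score vector against one label index."""
    return -math.log(softmax(scores)[label_index])

def assign_pseudo_label(scores, min_confidence=0.9):
    """Assumed rule: use the top prediction as pseudo-label if it is confident."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= min_confidence else None

def batch_loss(labeled, unlabeled_scores, min_confidence=0.9):
    """Mean cross-entropy over true labels and accepted pseudo-labels.

    labeled: list of (scores, true_label_index) pairs.
    unlabeled_scores: list of score vectors for unlabeled images.
    """
    losses = [cross_entropy(scores, label) for scores, label in labeled]
    for scores in unlabeled_scores:
        pseudo = assign_pseudo_label(scores, min_confidence)
        if pseudo is not None:  # low-confidence images are skipped (assumption)
            losses.append(cross_entropy(scores, pseudo))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch an unlabeled image contributes to the loss only when its pseudo-label is accepted; other pseudo-labeling schemes would fit the same structure.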

[0042] An object recognition device, comprising:

[0043] The image acquisition unit is used to acquire multi-view images of the object to be identified.

[0044] The feature extraction unit is used to extract multi-view features from the multi-view image;

[0045] The similarity matching unit is used to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library by adopting a multi-view matching strategy.

[0046] The identification unit is used to determine the identification result of the object to be identified based on the multi-view matching similarity of each object.

[0047] An electronic device, comprising: a memory and a processor;

[0048] The memory is used to store at least one instruction;

[0049] The processor is used to execute the at least one instruction to implement the object recognition method described above.

[0050] A computer-readable storage medium storing at least one instruction that, when executed by a processor, implements the object recognition method described above.

[0051] As can be seen from the above technical solution, this invention discloses an object recognition method, apparatus, electronic device, and storage medium. It acquires multi-view images of the object to be recognized, extracts multi-view features from the multi-view images, and determines the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database using a multi-view matching strategy. This allows for the determination of the recognition result of the object to be recognized based on the multi-view matching similarity. This invention employs a multi-view matching strategy, effectively solving the problem of accurate object recognition when the angle of the object to be recognized in the image changes significantly, the viewpoint is random, and especially when the object is partially occluded.

Description of the Attached Figures

[0052] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the published drawings without creative effort.

[0053] Figure 1 is a flowchart of an object recognition method disclosed in an embodiment of the present invention;

[0054] Figure 2 is a structural diagram of a feature extraction network disclosed in an embodiment of the present invention;

[0055] Figure 3 is a schematic diagram of an object recognition process disclosed in an embodiment of the present invention;

[0056] Figure 4 is a flowchart of a customized training, inference, and data acquisition process disclosed in an embodiment of the present invention;

[0057] Figure 5(a) is a schematic diagram of the first stage of self-supervised pre-training of an object recognition model disclosed in an embodiment of the present invention;

[0058] Figure 5(b) is a schematic diagram of the second-stage task head training of an object recognition model disclosed in an embodiment of the present invention;

[0059] Figure 5(c) is a schematic diagram of the third stage end-to-end training of an object recognition model disclosed in an embodiment of the present invention;

[0060] Figure 6 is a schematic diagram of the structure of an object recognition device disclosed in an embodiment of the present invention;

[0061] Figure 7 is a schematic diagram of the structure of an electronic device disclosed in an embodiment of the present invention.

Detailed Description

[0062] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0063] This invention discloses an object recognition method, apparatus, electronic device, and storage medium. It acquires multi-view images of the object to be recognized, extracts multi-view features from the multi-view images, and determines the multi-view similarity between the multi-view features and the multi-view template features of each object in a search database using a multi-view matching strategy. Based on these multi-view similarity scores, the recognition result of the object to be recognized is determined. This invention employs a multi-view matching strategy, effectively solving the problem of accurate object recognition when the object's angle changes significantly, the viewing angle is random, and the object is partially occluded in the image.

[0064] Referring to Figure 1, which shows a flowchart of an object recognition method disclosed in an embodiment of the present invention, the method includes:

[0065] Step S101: Obtain multi-view images of the object to be identified.

[0066] The objects to be identified in this application include, but are not limited to, animals, people, and vehicles.

[0067] Multi-view images refer to multiple images of an object to be identified in the same scene captured from different perspectives. The pose of the object to be identified in the same scene can be fixed or variable.

[0068] Step S102: Extract multi-view features from the multi-view images.

[0069] This application extracts the corresponding view feature from each view image of the multi-view images, thereby obtaining the multi-view features of the object to be identified. This ensures that the extracted features still represent the object well when its angle changes greatly, the viewpoint is random, or the object is partially occluded in the image.

[0070] Taking an animal as an example, the multi-view features extracted from multi-view images can be color features, texture features, and shape features.

[0071] Color characteristics: Different animals may exhibit different color characteristics from different perspectives. For example, some animals may show a sharp color contrast from a certain perspective, while their colors may appear dull or indistinguishable from other perspectives.

[0072] Texture features: An animal's skin, fur, or scales will exhibit different textures under different viewing angles. These texture features can be used to distinguish different species of animals or to identify the same animal from different perspectives.

[0073] Shape features, such as the size of an animal's body, the shape of its head, and the length of its limbs, can vary from different perspectives, but some basic shape features (such as the outline of the head and the shape of the tail) may remain consistent across multiple perspectives.

[0074] Step S103: Using a multi-view matching strategy, determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library.

[0075] This application pre-stores multi-view template features of different objects in the retrieval database. For example, the retrieval database stores multi-view template features of different breeds of cats, multi-view template features of different breeds of dogs, and so on.

[0076] In practical applications, multi-view images of various objects can be entered into the retrieval library, and then a feature extraction network can be used to extract the corresponding multi-view template features from the multi-view images of each object and store them.
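As a minimal sketch of the enrollment step just described, the retrieval library can be modeled as a mapping from an object identifier to a list of per-view template features. The names `enroll` and `toy_extract` are hypothetical, and `toy_extract` is only a stand-in for the feature extraction network, which is not reproduced here.

```python
from typing import Callable, Dict, List, Sequence

Feature = List[float]
Library = Dict[str, List[Feature]]

def enroll(library: Library, object_id: str, view_images: Sequence,
           extract: Callable[[object], Feature]) -> Library:
    """Extract one template feature per view image and store it under object_id."""
    templates = library.setdefault(object_id, [])
    for image in view_images:
        templates.append(extract(image))
    return library

def toy_extract(image) -> Feature:
    """Placeholder extractor (assumption): NOT the network from the text."""
    return [float(sum(image)), float(len(image))]

# Example: enroll two views of one hypothetical object.
library: Library = {}
enroll(library, "cat_A", [[1, 2, 3], [4, 5]], toy_extract)
```

Calling `enroll` again with the same identifier appends further view templates, so an object can be enrolled incrementally.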

[0077] In this application, the multi-view features of the object to be identified contain multiple different view features, and the multi-view template features of each object in the retrieval library contain multiple different view template features. By matching the multiple different view features of the object to be identified with the multiple different view template features of each object in the retrieval library using a multi-view matching strategy, the multi-view matching similarity between the object to be identified and each object in the retrieval library can be obtained.

[0078] By matching the multi-view features of the object to be identified with the multi-view template features of each object in the retrieval database using a multi-view matching strategy, the problem of identifying the object can be effectively solved when the angle of the object to be identified changes greatly and the viewpoint is random, especially when the object to be identified is partially occluded.

[0079] Step S104: Determine the recognition result of the object to be identified based on the multi-view matching similarity of each object.

[0080] Specifically, the recognition result of the object to be identified can be determined based on the numerical value of the multi-view matching similarity, including: taking the target object corresponding to the maximum value of the multi-view matching similarity as the recognition result of the object to be identified.

[0081] In summary, this invention discloses an object recognition method that acquires multi-view images of the object to be recognized, extracts multi-view features from the multi-view images, and determines the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database by employing a multi-view matching strategy. This allows for the determination of the recognition result of the object to be recognized based on the multi-view matching similarity. This invention employs a multi-view matching strategy, effectively solving the problem of accurate object recognition when the object to be recognized has large angle variations and random viewpoints in the image, especially when the object is partially occluded.

[0082] In one embodiment, step S102 may specifically include:

[0083] A feature extraction network is used to extract multi-view features from multi-view images.

[0084] The feature extraction network includes: an explicit supervised spatial attention module and an automatic alignment cross-attention module;

[0085] The explicit supervised spatial attention module is used to perform masking operations during feature extraction to suppress background region features of the multi-view image.

[0086] The automatic alignment cross-attention module is used to perform feature segmentation and align local region features in the multi-view image to obtain the multi-view features.

[0087] For details, see Figure 2, which illustrates the structure of the feature extraction network. As can be seen from Figure 2, the explicit supervised spatial attention module takes the intermediate features extracted from the multi-view images by the backbone network as input and, after channel compression and reshaping, obtains an attention vector with local and global perception. The attention vector is then processed by a multilayer perceptron and reshaped to obtain the spatial attention.

[0088] During the training of the feature extraction network, the learning objective of the explicitly supervised spatial attention module is to assign high weights to the pixel positions of the target region and low weights to the background region. Therefore, spatial attention can enable the feature extraction network to suppress the interference of the background region.
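The data flow of the spatial attention module (channel compression, reshaping, multilayer perceptron, reshaping) and its mask-based learning objective can be sketched as follows. Several details are assumptions not stated in the text: channel compression is taken to be a mean over channels, the perceptron has two layers with a ReLU, the output is squashed with a sigmoid, and the supervision loss is binary cross-entropy against the mask label. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(feature_map, w1, b1, w2, b2):
    """Map an H x W x C feature map (nested lists) to an H x W attention map."""
    height, width = len(feature_map), len(feature_map[0])
    # Channel compression (assumed: channel mean), reshaped to a vector of length H*W.
    vector = [sum(pixel) / len(pixel) for row in feature_map for pixel in row]
    # Two-layer perceptron over the flattened vector (assumed structure).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vector)) + b)
              for row, b in zip(w1, b1)]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
    # Reshape back to H x W; sigmoid keeps each weight in (0, 1).
    return [[sigmoid(out[y * width + x]) for x in range(width)]
            for y in range(height)]

def apply_attention(feature_map, attention):
    """Scale every pixel's channel vector by its spatial attention weight."""
    return [[[value * attention[y][x] for value in pixel]
             for x, pixel in enumerate(row)]
            for y, row in enumerate(feature_map)]

def mask_supervision_loss(attention, mask):
    """Binary cross-entropy between the attention map and a 0/1 object mask."""
    eps = 1e-7
    total, count = 0.0, 0
    for att_row, mask_row in zip(attention, mask):
        for a, m in zip(att_row, mask_row):
            a = min(max(a, eps), 1.0 - eps)
            total -= m * math.log(a) + (1 - m) * math.log(1.0 - a)
            count += 1
    return total / count

# Example: 2 x 2 feature map with 3 channels and identity perceptron weights.
features = [[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
            [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [0.0] * 4
attention = spatial_attention(features, identity, zeros, identity, zeros)
attended = apply_attention(features, attention)
```

Minimizing a loss of this kind pushes the attention weights toward 1 inside the mask and toward 0 in the background, which is the behavior the learning objective describes.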

[0089] The automatic alignment cross-attention module is used to take the image features processed by the spatial attention as input, divide the image features into grid blocks to obtain multiple local region features of the multi-view image, calculate the cross-attention between the multiple local region features, and aggregate local features with similar semantics to the corresponding query to perform local region feature alignment to obtain the multi-view features.

[0090] Specifically, the automatic alignment cross-attention module takes the image features processed by spatial attention as input. It first divides the image features according to a grid into n local region features of the multi-view image. To align these local region features when calculating the matching similarity, the module holds a built-in parameter, namely the m queries q used for cross-attention shown in Figure 2, and uses the n local region features as the keys k and values v of the cross-attention. It calculates the cross-attention over these n local region features, aggregating local features with similar semantics to the corresponding query q. The built-in queries q are randomly initialized at the start of training and are optimized together with the other network parameters during training.

[0091] Preferably, the number of local region features n is 16, and the number of queries m is 4.
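The grid split and query-based aggregation can be sketched as follows, using 16 local region features and 4 queries. This is a simplified illustration under stated assumptions: each grid block is average-pooled into one local feature, the attention is plain scaled dot-product attention without learned projection matrices, and the queries are only randomly initialized (no training is shown). The function names are hypothetical.

```python
import math
import random

def split_into_grid(feature_map, rows, cols):
    """Split an H x W x C map into rows*cols blocks; average-pool each block."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    bh, bw = height // rows, width // cols
    local_features = []
    for r in range(rows):
        for c in range(cols):
            pooled = [0.0] * channels
            for y in range(r * bh, (r + 1) * bh):
                for x in range(c * bw, (c + 1) * bw):
                    for ch in range(channels):
                        pooled[ch] += feature_map[y][x][ch]
            local_features.append([v / (bh * bw) for v in pooled])
    return local_features

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, local_features):
    """Each query attends over the local features (used as keys and values)."""
    dim = len(local_features[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in local_features]
        weights = softmax(logits)
        outputs.append([sum(w * v[ch] for w, v in zip(weights, local_features))
                        for ch in range(dim)])
    return outputs

# Example: an 8 x 8 map with 8 channels, a 4 x 4 grid (n = 16), and m = 4 queries.
rng = random.Random(0)
feature_map = [[[rng.random() for _ in range(8)] for _ in range(8)]
               for _ in range(8)]
local_features = split_into_grid(feature_map, 4, 4)
queries = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
aligned = cross_attention(queries, local_features)
```

Because each output is a softmax-weighted combination of local features, a query that has learned to respond to one semantic part gathers that part's features wherever they fall in the grid, which is what makes the resulting features comparable across views.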

[0092] The explicit supervised spatial attention module and automatic alignment cross attention module designed in this invention effectively suppress problems caused by large changes in object posture, especially severe background interference (e.g., severe background interference within the bounding box due to large movements of the animal's tail and limbs) and misalignment of different parts of the object.

[0093] The three-stage training strategy for the feature extraction network designed in this invention effectively improves the alignment effect of the automatic alignment cross-attention module.

[0094] The performance of the explicit supervised spatial attention module and the automatic alignment cross-attention module used in the feature extraction network of this invention can be further improved by stacking the modules or by increasing the channel dimension.

[0095] In one embodiment, step S103 may specifically include:

[0096] A multi-view matching strategy is used to determine the similarity between the multi-view features and the multi-view template features of each object in the retrieval library, thereby obtaining a set of similarity values for each object;

[0097] The average similarity of the set of similarity values is taken, and the resulting average similarity is used as the multi-view matching similarity between the object to be identified and the corresponding object.

[0098] See Figure 3 for a schematic diagram of the object recognition process. The retrieval database stores the multi-view template features of each object. A multi-view matching strategy is adopted, and cosine similarity is used to determine the similarity between the multi-view features of the object to be identified and the multi-view template features of each object in the retrieval database, giving a set of similarity values for each object.

[0099] For example, referring to Figure 3, cosine similarity is used under the multi-view matching strategy to calculate the similarity between the multi-view features of the object to be identified and the multi-view template features of object 1, giving a set of similarity values $S_{11}, S_{12}, \ldots, S_{1n}$, where n is the number of multi-view template features of object 1. The values $S_{11}, S_{12}, \ldots, S_{1n}$ are averaged, and the resulting average similarity serves as the multi-view matching similarity between the object to be identified and object 1.

[0100] Likewise, cosine similarity is used under the multi-view matching strategy to calculate the similarity between the multi-view features of the object to be identified and the multi-view template features of object 2, giving a set of similarity values $S_{21}, S_{22}, \ldots, S_{2n}$, where n is the number of multi-view template features of object 2. The values $S_{21}, S_{22}, \ldots, S_{2n}$ are averaged, and the resulting average similarity serves as the multi-view matching similarity between the object to be identified and object 2.

[0101] The multi-view matching strategy designed in this invention uses cosine similarity to measure the degree of matching between the object to be identified and each object in the retrieval database. Cosine similarity can be replaced by other similarity indices or distance metrics, such as Euclidean distance, Mahalanobis distance, or Bhattacharyya distance.
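The cosine-similarity matching and averaging described above can be sketched as follows. The text does not spell out how several query views are paired with the template views, so averaging over every (query view, template view) pair is an assumption, as are the function names and the choice to score a zero vector as 0.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def multi_view_matching_similarity(query_views, template_views):
    """Average the similarity values over all query/template view pairs."""
    sims = [cosine_similarity(q, t)
            for q in query_views for t in template_views]
    return sum(sims) / len(sims)

def match_against_library(query_views, library):
    """Return {object_id: multi-view matching similarity} for every object."""
    return {object_id: multi_view_matching_similarity(query_views, templates)
            for object_id, templates in library.items()}

# Example with two query views and two objects holding one template view each.
query = [[1.0, 0.0], [0.0, 1.0]]
library = {"object_1": [[1.0, 0.0]], "object_2": [[-1.0, 0.0]]}
scores = match_against_library(query, library)
```

Swapping in another metric, such as a negated Euclidean distance, only requires replacing `cosine_similarity`, provided that larger values still mean a better match.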

[0102] In one embodiment, step S104 may specifically include:

[0103] Each of the multi-view matching similarities is used as the final probability score of the object to be identified as the object corresponding to the multi-view matching similarity, and the maximum value of the final probability score is determined from all the final probability scores.

[0104] The target object corresponding to the maximum final probability score is determined as the recognition result of the object to be identified.

[0105] As can be seen from Figure 3, adopting a multi-view matching strategy yields the multi-view matching similarity between the object to be identified and each object in the retrieval database. Each multi-view matching similarity can be used as the final probability score between the object to be identified and the corresponding object. The target object corresponding to the maximum value of the final probability score is determined as the identification result of the object to be identified. That is, the identification result is that the object to be identified is the target object.

[0106] To improve the accuracy of identifying the object to be identified, a similarity threshold th can be set. It is determined whether the maximum value of the final probability score is greater than the similarity threshold. If so, the target object corresponding to the maximum value of the final probability score is determined as the identification result of the object to be identified, and the identification of the object to be identified is confirmed to be successful.

[0107] Conversely, if the maximum value of the final probability score is not greater than the similarity threshold, it is determined that the object to be identified has failed to be identified, that is, it is determined that the object to be identified does not exist in the search database.
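The maximum-score selection with the similarity threshold th can be sketched as a short decision function. The function name `decide` and the use of `None` to signal a failed identification are illustrative choices, not part of the source.

```python
from typing import Dict, Optional, Tuple

def decide(final_scores: Dict[str, float], th: float) -> Tuple[Optional[str], float]:
    """Pick the object with the maximum final probability score.

    Returns (object_id, score) when the score is greater than th, and
    (None, score) otherwise, meaning the object is treated as absent
    from the retrieval database.
    """
    best_id = max(final_scores, key=final_scores.get)
    best_score = final_scores[best_id]
    if best_score > th:
        return best_id, best_score
    return None, best_score
```

For instance, `decide({"object_1": 0.9, "object_2": 0.4}, th=0.6)` selects `object_1`, while a best score equal to or below the threshold is reported as a failed identification, matching the "not greater than" condition in the text.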

[0108] In one embodiment, this application also provides a customized fine-tuning strategy that significantly improves the model's ability to identify key objects of interest. When the objects of interest are relatively fixed, the model can continuously learn their appearance features and adapt to changes in their appearance.

[0109] Based on this, step S104 may specifically include:

[0110] The multi-view features are input into a customized classification module to obtain the classification score of each object in the retrieval library for the object to be identified.

[0111] The weighted average of the multi-view matching similarity and the classification score corresponding to each object is used to obtain the final probability score of the object to be identified as the object.

[0112] Determine the maximum final probability score from all the final probability scores;

[0113] The target object corresponding to the maximum final probability score is determined as the recognition result of the object to be identified.

[0114] For details, see Figure 3. The multi-view features of the object to be identified are input into the customized classification module, which outputs the classification score of the object to be identified being each object in the retrieval database. For example, the module outputs one classification score for the object to be identified being object 1 in the retrieval database and another for it being object 2.

[0115] The weighted average of the multi-view matching similarity and classification score corresponding to each object is used to obtain the final probability score of the object to be identified.

[0116] Assume that the weight of the multi-view matching similarity is α and the weight of the classification score is (1-α). For object 1, the product of α and the multi-view matching similarity of object 1 is added to the product of (1-α) and the classification score of object 1, which yields the final probability score that the object to be identified is object 1.

[0117] Likewise, for object 2, the product of α and the multi-view matching similarity of object 2 is added to the product of (1-α) and the classification score of object 2, which yields the final probability score that the object to be identified is object 2.

[0118] The target object corresponding to the maximum of all final probability scores is determined as the recognition result of the object to be identified.

[0119] It should be noted that the weight α is set to 1 before the customized fine-tuning is completed, which means that the object to be identified is recognized from the multi-view matching results alone. After the customized fine-tuning is completed, α is set to 0.5.

[0120] To improve the accuracy of identifying the object to be identified, a similarity threshold th can be set. It is determined whether the maximum value of the final probability score is greater than the similarity threshold. If so, the target object corresponding to the maximum value of the final probability score is determined as the identification result of the object to be identified, and the identification of the object to be identified is confirmed to be successful.

[0121] Conversely, if the maximum value of the final probability score is not greater than the similarity threshold, it is determined that the object to be identified has failed to be identified, that is, it is determined that the object to be identified does not exist in the search database.
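The score fusion and threshold check described in paragraphs [0114] to [0121] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `fuse_scores` and `recognize`, the dictionary layout, and the example scores and threshold value are hypothetical. Only the weights α = 1 (before customized fine-tuning) and α = 0.5 (after it) come from the text.

```python
from typing import Dict, Optional, Tuple


def fuse_scores(
    match_sim: Dict[str, float],
    cls_score: Dict[str, float],
    alpha: float,
) -> Dict[str, float]:
    """Final score per object: alpha * matching similarity + (1 - alpha) * classification score."""
    return {
        obj: alpha * match_sim[obj] + (1.0 - alpha) * cls_score.get(obj, 0.0)
        for obj in match_sim
    }


def recognize(
    match_sim: Dict[str, float],
    cls_score: Dict[str, float],
    alpha: float,
    th: float,
) -> Tuple[Optional[str], float]:
    """Return (object, score) for the best match, or (None, score) if it does not exceed th."""
    final = fuse_scores(match_sim, cls_score, alpha)
    best = max(final, key=final.get)
    if final[best] > th:
        return best, final[best]
    # Not greater than the threshold: treated as "not in the retrieval database".
    return None, final[best]


# Hypothetical scores for two objects in the retrieval database.
match_sim = {"object_1": 0.82, "object_2": 0.40}
cls_score = {"object_1": 0.60, "object_2": 0.90}

before_finetune = recognize(match_sim, cls_score, alpha=1.0, th=0.5)
after_finetune = recognize(match_sim, cls_score, alpha=0.5, th=0.5)
rejected = recognize(match_sim, cls_score, alpha=0.5, th=0.75)
```

With α = 1 the classification scores have no influence, so the same function covers the period before a customized classification module is available.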

[0122] It should be noted that the customized fine-tuning strategy further designed in this invention enables the entire object recognition model to continuously learn the features of individual objects that require long-term attention, thereby strengthening the model's ability to distinguish the identities of specific individual objects. The customized training, inference, and data acquisition process is shown in Figure 4. During the algorithm's operation, the customized fine-tuning process is executed repeatedly at certain time intervals. Each round proceeds in the following order: customized data collection is performed first, then the customized training process, and once customized training is completed, the new customized model is used to execute the customized inference process.

[0123] The specific process for a round of customized fine-tuning is as follows:

[0124] (1) Obtain a customized fine-tuning dataset of the image to be identified, wherein the customized fine-tuning dataset includes: unlabeled images and labeled images.

[0125] Specifically, during algorithm execution, images to be identified are randomly sampled and randomly pushed to users at a certain ratio for object information annotation (e.g., animal identification). The "image-annotation" pairs are then stored as labeled images in a customized fine-tuning dataset. Considering feasibility, the number of images to be identified pushed to users for annotation should be very small, with the majority of the customized fine-tuning data consisting of unlabeled images.

[0126] (2) When performing customized training using a semi-supervised training method, a pseudo-label is assigned to each of the unlabeled images, and the real label of each of the labeled images is determined.

[0127] Specifically, once the amount of data in the acquired customized fine-tuning dataset reaches the required level, the customized training process begins. Each customized training session consists of 60 training epochs with a fixed learning rate of 0.01. Since the customized fine-tuning dataset contains a large number of unlabeled images and a small number of labeled images, a semi-supervised training approach is adopted. In each iteration, training data is sampled from both labeled and unlabeled images at a preset ratio. For labeled images, the real labels are used directly; for unlabeled images, pseudo-labels are first assigned, and then training is performed using these pseudo-labels.

[0128] To assign pseudo-labels to unlabeled images, feature extraction and customized classification are first performed on the unlabeled images to obtain scores for each category. When the maximum score is greater than the score threshold ts, the pseudo-label corresponding to the maximum score is assigned to the unlabeled image. If all scores are less than the score threshold ts, the unlabeled image is considered invalid and will not participate in the current training iteration. In this invention, the customized classification module uses a multilayer perceptron consisting of three fully connected layers, and the score threshold ts is set to 0.7.

[0129] (3) Extract the target multi-view features from the customized fine-tuning dataset, and perform customized classification on the target multi-view features to obtain the target classification score of each image to be identified in the customized fine-tuning dataset.

[0130] (4) Based on the pseudo-label of each unlabeled image, the real label of each labeled image, and each target classification score, the cross-entropy loss function is used to calculate the loss value, and the parameters of the customized classification module are trained by the backpropagation algorithm to obtain the customized classification module.

[0131] The parameters of the feature extraction module used to extract multi-view features of the target from the customized fine-tuning dataset are fixed.
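The pseudo-label assignment and cross-entropy loss of steps (2) to (4) can be sketched as below. This is a hedged illustration only: it assumes the customized classification module outputs raw logits that are turned into scores by a softmax, that a score exactly equal to ts is treated as invalid, and that the helper names (`assign_pseudo_label`, `semi_supervised_loss`) are hypothetical. The threshold ts = 0.7 is the value stated in the text. The multilayer perceptron itself and the backpropagation step are omitted.

```python
import math
from typing import List, Optional, Sequence

SCORE_THRESHOLD_TS = 0.7  # score threshold ts stated in the text


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (assumption: scores are softmax probabilities)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_pseudo_label(
    scores: Sequence[float], ts: float = SCORE_THRESHOLD_TS
) -> Optional[int]:
    """Return the top-scoring class if its score exceeds ts, otherwise None (image skipped)."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best if scores[best] > ts else None


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[label])


def semi_supervised_loss(
    labeled_logits: Sequence[Sequence[float]],
    real_labels: Sequence[int],
    unlabeled_logits: Sequence[Sequence[float]],
) -> float:
    """Mean cross-entropy over labeled samples and confidently pseudo-labeled samples."""
    losses = [cross_entropy(l, y) for l, y in zip(labeled_logits, real_labels)]
    for logits in unlabeled_logits:
        pseudo = assign_pseudo_label(softmax(logits))
        if pseudo is not None:  # invalid images do not take part in this iteration
            losses.append(cross_entropy(logits, pseudo))
    return sum(losses) / len(losses) if losses else 0.0
```

Because the feature extraction parameters are fixed, only the classification module would receive gradients from this loss in an actual training loop.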

[0132] In practical applications, once the customized classification module has been trained, it can be used for object recognition. The complete inference process combines the multi-view matching similarity and the classification score, weighted by α and (1-α) respectively, to obtain the final probability score of the object to be identified. The maximum value of the final probability score is determined from all the final probability scores, and the target object corresponding to the maximum value of the final probability score is determined as the identification result of the object to be identified.

[0133] It should be noted that the customized fine-tuning of this invention adopts a semi-supervised training method, which requires only a small number of annotations to effectively and continuously learn the appearance features of the key target, thereby further improving the performance of the entire object recognition model.

[0134] The semi-supervised training method used in the customized fine-tuning strategy of this invention can be replaced by other semi-supervised training methods, such as FixMatch, MixMatch, and FreeMatch.

[0135] In one embodiment, step S103 may specifically include:

[0136] Using a pre-trained object recognition model and the aforementioned multi-view matching strategy, the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database is determined.

[0137] The training process of the object recognition model is divided into three stages: self-supervised pre-training stage, task head training stage, and end-to-end training stage.

[0138] The three training phases are illustrated in Figures 5(a), 5(b), and 5(c), respectively. The overall process is as follows:

[0139] (1) The first stage of self-supervised pre-training of the object recognition model is carried out based on the ImageNet1K dataset to obtain the backbone network parameters and explicit supervised spatial attention module parameters.

[0140] The self-supervised pre-training method used in this invention is the BarlowTwins algorithm, which requires only positive sample pairs and does not need negative sample pairs for training. The BarlowTwins algorithm uses a feature dimension of 8192 to calculate the contrastive loss, with a batch size of 256 samples per iteration and a total of 300 training rounds. In one training iteration, the same input image is first augmented twice. The features of each augmented image are then extracted through a backbone network, an explicitly supervised spatial attention module, global average pooling, and a fully connected layer. Positive sample pairs are constructed from the features of the two differently augmented versions of the same image, and the contrastive loss over all positive sample pairs in this iteration is calculated. The model is then trained with the backpropagation algorithm. For a more detailed description of the training method, refer to the existing BarlowTwins self-supervised pre-training technique.

[0141] The BarlowTwins algorithm is a self-supervised learning algorithm that aims to learn a universal representation that is invariant to changes in the input by reducing redundancy between output representations.
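The redundancy-reduction objective of BarlowTwins can be sketched in a few lines. The sketch follows the published Barlow Twins formulation (a cross-correlation matrix between the two augmented views, pushed toward the identity matrix), which is what the text calls the contrastive loss. The off-diagonal weight `lambd` and the tiny feature dimension are illustrative assumptions, not values from this document, which uses a feature dimension of 8192 and a batch size of 256.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(z: Matrix) -> Matrix:
    """Normalize each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [z[i][j] for i in range(n)]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) + 1e-12
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out


def barlow_twins_loss(z_a: Matrix, z_b: Matrix, lambd: float = 5e-3) -> float:
    """Sum of (1 - C_ii)^2 plus lambd * sum of C_ij^2 (i != j), C being the cross-correlation."""
    n, d = len(z_a), len(z_a[0])
    a, b = standardize(z_a), standardize(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2      # invariance term: views should agree
            else:
                loss += lambd * c_ij ** 2      # redundancy term: dimensions should decorrelate
    return loss


# Two "views" with identical, mutually decorrelated features: loss is close to zero.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_decorrelated = barlow_twins_loss(decorrelated, decorrelated)

# Two feature dimensions that carry the same information are penalized.
redundant = [[1.0, 1.0], [-1.0, -1.0]]
loss_redundant = barlow_twins_loss(redundant, redundant)
```

The comparison of the two toy batches shows why no negative pairs are needed: the off-diagonal term alone prevents all dimensions from collapsing onto the same signal.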

[0142] The BarlowTwins self-supervised pre-training method can be replaced by other pre-training methods, such as fully supervised pre-training, MOCO self-supervised pre-training, SimCLR self-supervised pre-training, BYOL self-supervised pre-training, etc.

[0143] The ImageNet1K dataset has a hierarchical directory structure, with each category corresponding to a subdirectory containing all images belonging to that category. This organization makes data access and processing relatively simple and efficient.

[0144] (2) Obtain an object recognition dataset, which includes multiple individuals of each object and multi-view images of each individual, with each multi-view image labeled with an object bounding box and a mask label.

[0145] Current training data for object recognition is limited in both quantity and quality. Therefore, before starting subsequent training, data should be collected to suit the application scenario. The dataset should cover multiple objects (e.g., multiple animals), with multiple individuals for each object (e.g., multiple individuals for each animal), and each individual should have images from multiple angles and poses. Furthermore, bounding boxes and mask labels should be annotated for the objects in the images. Since mask annotation is labor-intensive, this invention employs an interactive annotation method: several points are annotated uniformly on the object, and the mask is then generated automatically by a large segmentation model such as SAM (Segment Anything Model).

[0146] (3) Using the backbone network parameters and the explicit supervised spatial attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, perform the second-stage task head training of the object recognition model to obtain the automatic alignment cross attention module parameters.

[0147] Specifically, the parameters of the backbone network and the explicit supervised spatial attention module obtained in the first stage of training are used as initialization parameters, and the acquired object recognition dataset is used as the training dataset to begin the second stage of task head training. During task head training, the parameters of the backbone network and the explicit supervised spatial attention module are kept fixed, and only the parameters of the automatic alignment cross-attention module are adjusted. In each training iteration, N anchor point samples are first sampled, and then a positive samples and b negative samples are sampled for each anchor point sample to form the batch data for this iteration. Features are then extracted from these samples, positive and negative sample pairs are constructed, triplet loss is calculated, and the network is trained under supervision. In the feature space, positive sample pairs are brought closer together, and negative sample pairs are pushed away.

[0148] The triplet loss used in this invention is defined as:

[0149]

[0150] The formula involves the following quantities: the triplet loss; the cosine distances between the i-th anchor sample and its a sampled positive samples in this iteration, which form a real matrix with 1 row and a columns; the cosine distances between the i-th anchor sample and its b sampled negative samples, which form a real matrix with 1 row and b columns; and N, the number of anchor samples.

[0151] In this invention, N is 256, a is 5, b is 15, the training rounds are 10, the stochastic gradient descent method is used for training, the learning rate is 0.2, and the triplet loss weight is 0.001.
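The batch structure of the second stage (N anchors, each with a positive and b negative samples, compared by cosine distance) can be illustrated as below. The exact triplet loss formula is not reproduced in this text, so the sketch substitutes one common variant, a hardest-positive versus hardest-negative hinge with a margin. The margin value, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Cosine distance, defined here as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Sequence[Vector]],  # positives[i] holds the a positives of anchor i
    negatives: Sequence[Sequence[Vector]],  # negatives[i] holds the b negatives of anchor i
    margin: float = 0.3,                    # hypothetical margin, not given in the text
) -> float:
    """Assumed variant: mean over anchors of max(0, max(d_pos) - min(d_neg) + margin)."""
    total = 0.0
    for anc, pos, neg in zip(anchors, positives, negatives):
        d_pos: List[float] = [cosine_distance(anc, p) for p in pos]  # 1 x a distances
        d_neg: List[float] = [cosine_distance(anc, n) for n in neg]  # 1 x b distances
        total += max(0.0, max(d_pos) - min(d_neg) + margin)
    return total / len(anchors)


# Well-separated case: positives coincide with the anchor, the negative points away.
easy = triplet_loss([[1.0, 0.0]], [[[1.0, 0.0], [2.0, 0.0]]], [[[-1.0, 0.0]]])

# Violating case: the positive is orthogonal to the anchor, the negative coincides with it.
hard = triplet_loss([[1.0, 0.0]], [[[0.0, 1.0]]], [[[1.0, 0.0]]])
```

Minimizing such a loss pulls positive pairs together and pushes negative pairs apart in the feature space, which matches the behavior described for the task head training.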

[0152] (4) Using the backbone network parameters, the explicit supervised spatial attention module parameters and the automatic alignment cross attention module parameters as initialization parameters, and using the object recognition dataset as the training dataset, perform the third stage end-to-end training of the object recognition model to obtain the object recognition model.

[0153] The backbone network parameters and explicit supervised spatial attention module parameters obtained in the first stage of training, as well as the automatic alignment cross-attention module parameters obtained in the second stage of training, are used as initialization parameters. The acquired object recognition dataset is used as the training dataset to begin the third stage of end-to-end training. During end-to-end training, all network parameters are trained simultaneously, and mask label information is added to the loss function of the second stage: an MSE (Mean Squared Error) loss supervises the spatial attention module's ability to focus on the target region and suppress background interference. In this invention, the weight of the MSE loss is 1.
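As a rough sketch of the third-stage objective, the mask supervision can be written as a mean squared error between the spatial attention map and the mask label, added to the weighted triplet loss. The weights 0.001 (triplet) and 1 (MSE) are the values given in the text. The simple additive combination, the assumption that the mask has already been resized to the resolution of the attention map, and the function names are hypothetical.

```python
from typing import Sequence

Grid = Sequence[Sequence[float]]


def mse_loss(attention: Grid, mask: Grid) -> float:
    """Mean squared error between a spatial attention map and a same-sized binary mask."""
    flat_a = [v for row in attention for v in row]
    flat_m = [float(v) for row in mask for v in row]
    return sum((a - m) ** 2 for a, m in zip(flat_a, flat_m)) / len(flat_a)


def stage3_loss(
    triplet: float,
    attention: Grid,
    mask: Grid,
    triplet_weight: float = 0.001,  # triplet loss weight stated in the text
    mse_weight: float = 1.0,        # MSE loss weight stated in the text
) -> float:
    """Assumed combination: weighted triplet loss plus weighted mask-supervision loss."""
    return triplet_weight * triplet + mse_weight * mse_loss(attention, mask)


# Toy 2x2 attention map that misses one foreground cell of the mask.
attention_map = [[1.0, 0.0], [0.0, 0.0]]
mask_label = [[1, 0], [0, 1]]
total = stage3_loss(2.0, attention_map, mask_label)
```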

[0154] After the above three-stage training is completed, the object recognition model designed in this invention has the ability to recognize objects. After inputting the multi-view image of the object to be recognized, the object recognition result can be obtained through the multi-view matching steps (i.e. feature extraction and cosine similarity calculation).

[0155] Corresponding to the above method embodiments, the present invention also discloses an object recognition device.

[0156] Referring to Figure 6, which shows a schematic structural diagram of an object recognition device disclosed by the present invention, the device may include:

[0157] Image acquisition unit 201 is used to acquire multi-view images of the object to be identified;

[0158] The objects to be identified in this application include, but are not limited to, animals, people, and vehicles.

[0159] Multi-view images refer to multiple images of an object to be identified in the same scene captured from different perspectives. The pose of the object to be identified in the same scene can be fixed or variable.

[0160] Feature extraction unit 202 is used to extract multi-view features from the multi-view image;

[0161] This application extracts the corresponding viewpoint features from each viewpoint image of the multi-view images, thereby obtaining the multi-view features of the object to be identified. This ensures that the extracted features still represent the object well even when the object's angle changes greatly, the viewpoint is random, or the object is partially occluded in the image.

[0162] Taking an animal as an example, the multi-view features extracted from multi-view images can be color features, texture features, and shape features.

[0163] The similarity matching unit 203 is used to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library by adopting a multi-view matching strategy.

[0164] This application pre-stores multi-view template features of different objects in the retrieval database. For example, the retrieval database stores multi-view template features of different breeds of cats, multi-view template features of different breeds of dogs, and so on.

[0165] In practical applications, multi-view images of various objects can be entered into the retrieval library, and then a feature extraction network can be used to extract the corresponding multi-view template features from the multi-view images of each object and store them.

[0166] In this application, the multi-view features of the object to be identified contain multiple different view features, and the multi-view template features of each object in the retrieval library contain multiple different view template features. By matching the multiple different view features of the object to be identified with the multiple different view template features of each object in the retrieval library using a multi-view matching strategy, the multi-view matching similarity between the object to be identified and each object in the retrieval library can be obtained.

[0167] By matching the multi-view features of the object to be identified with the multi-view template features of each object in the retrieval database using a multi-view matching strategy, the problem of identifying the object can be effectively solved when the angle of the object to be identified changes greatly and the viewpoint is random, especially when the object to be identified is partially occluded.

[0168] The identification unit 204 is used to determine the identification result of the object to be identified based on the multi-view matching similarity of each object.

[0169] Specifically, the recognition result of the object to be identified can be determined based on the numerical value of the multi-view matching similarity, including: taking the target object corresponding to the maximum value of the multi-view matching similarity as the recognition result of the object to be identified.

[0170] In summary, this invention discloses an object recognition device that acquires multi-view images of an object to be recognized, extracts multi-view features from the multi-view images, and determines the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database using a multi-view matching strategy. This allows for the determination of the recognition result of the object to be recognized based on the multi-view matching similarity. This invention employs a multi-view matching strategy, effectively solving the problem of accurate object recognition when the object to be recognized exhibits large angle variations and random perspectives in the image, especially when the object is partially occluded.

[0171] In one embodiment, the feature extraction unit 202 is further configured to:

[0172] The multi-view features are extracted from the multi-view images using a feature extraction network;

[0173] The feature extraction network includes: an explicit supervised spatial attention module and an automatically aligned cross attention module;

[0174] The explicit supervised spatial attention module is used to perform masking operations during feature extraction to suppress background region features of the multi-view image;

[0175] The automatic alignment cross-attention module is used to perform feature segmentation and align local region features in the multi-view image to obtain the multi-view features.

[0176] The explicit supervised spatial attention module takes the intermediate features extracted from the multi-view image by the backbone network as input, and after channel compression and deformation recombination, obtains the attention vector of local and global perception. The attention vector is then processed by a multilayer perceptron and deformation recombination to obtain spatial attention.

[0177] The automatic alignment cross-attention module is used to take the image features processed by the spatial attention as input, divide the image features into grid blocks to obtain multiple local region features of the multi-view image, calculate the cross-attention between the multiple local region features, and aggregate local features with similar semantics to the corresponding query to perform local region feature alignment to obtain the multi-view features.
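The grid division and query-based aggregation performed by the automatic alignment cross-attention module can be approximated by the following sketch. It is illustrative only: average pooling inside each grid block, scaled dot-product attention with a softmax, and the notion of a fixed set of query vectors are assumptions used to make the idea concrete, and the function names are hypothetical. The actual module's projections and parameters are not described at this level of detail in the text.

```python
import math
from typing import List, Sequence

Vector = List[float]


def split_into_grid(feature_map: Sequence[Sequence[Vector]], grid: int) -> List[Vector]:
    """Divide an H x W x D feature map into grid x grid blocks, average-pooling each block."""
    h, w, d = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    bh, bw = h // grid, w // grid
    blocks: List[Vector] = []
    for gi in range(grid):
        for gj in range(grid):
            acc = [0.0] * d
            for i in range(gi * bh, (gi + 1) * bh):
                for j in range(gj * bw, (gj + 1) * bw):
                    for k in range(d):
                        acc[k] += feature_map[i][j][k]
            blocks.append([v / (bh * bw) for v in acc])
    return blocks


def cross_attention(queries: Sequence[Vector], local_feats: Sequence[Vector]) -> List[Vector]:
    """For each query, aggregate local features weighted by softmax(q . f / sqrt(D))."""
    d = len(queries[0])
    aligned: List[Vector] = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, f)) / math.sqrt(d) for f in local_feats]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        weights = [x / total for x in weights]
        aligned.append(
            [sum(weights[i] * local_feats[i][k] for i in range(len(local_feats))) for k in range(d)]
        )
    return aligned


# A toy 2 x 2 feature map with 2 channels, split into a 2 x 2 grid of local features.
toy_map = [[[10.0, 0.0], [0.0, 10.0]], [[10.0, 0.0], [0.0, 10.0]]]
local_features = split_into_grid(toy_map, grid=2)
# A hypothetical query that is semantically close to the first channel.
aligned = cross_attention([[1.0, 0.0]], local_features)
```

Because each query collects the local regions most similar to it regardless of where they sit in the grid, the same semantic part lands in the same output slot across views, which is the alignment effect the module is meant to provide.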

[0178] In one embodiment, the similarity matching unit 203 is further configured to:

[0179] A multi-view matching strategy is used to determine the similarity between the multi-view features and the multi-view template features of each object in the retrieval library, thereby obtaining a set of similarity values for each object;

[0180] The average similarity of the set of similarity values is taken, and the resulting average similarity is used as the multi-view matching similarity between the object to be identified and the corresponding object.
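The averaging step can be sketched as follows. The sketch assumes that the set of similarity values for an object is formed by comparing every view feature of the object to be identified with every view template feature of that object using cosine similarity. The pairing scheme, the function names, and the toy library are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def multi_view_similarity(query_views: Sequence[Vector], template_views: Sequence[Vector]) -> float:
    """Average of the similarity set built from all query-view / template-view pairs."""
    sims: List[float] = [
        cosine_similarity(q, t) for q in query_views for t in template_views
    ]
    return sum(sims) / len(sims)


def match_against_library(
    query_views: Sequence[Vector], library: Dict[str, Sequence[Vector]]
) -> Dict[str, float]:
    """Multi-view matching similarity between the query object and each library object."""
    return {name: multi_view_similarity(query_views, tmpl) for name, tmpl in library.items()}


# Hypothetical two-view query and a two-object retrieval library.
query = [[1.0, 0.0], [0.0, 1.0]]
library = {
    "cat": [[1.0, 0.0], [0.0, 1.0]],
    "dog": [[-1.0, 0.0], [0.0, -1.0]],
}
scores = match_against_library(query, library)
best_match = max(scores, key=scores.get)
```

Averaging over all view pairs means that a single occluded or unusual view lowers the score only partially, instead of deciding the match on its own.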

[0181] In one embodiment, the feature extraction unit 202 is further configured to:

[0182] The multi-view features are extracted from the multi-view images using a pre-trained object recognition model.

[0183] In one embodiment, the similarity matching unit 203 is further configured to:

[0184] Using a pre-trained object recognition model and a multi-view matching strategy, the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database is determined, including:

[0185] The multi-view features are extracted from the multi-view images using a pre-trained object recognition model, and a multi-view matching strategy is adopted to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library.

[0186] In one embodiment, the object recognition device further includes a recognition model training module.

[0187] The recognition model training module is used for:

[0188] The first stage of self-supervised pre-training of the object recognition model was carried out based on the ImageNet1K dataset to obtain the backbone network parameters and the explicit supervised spatial attention module parameters.

[0189] Obtain an object recognition dataset, which includes multiple individuals of each object and multi-view images of each individual, with each multi-view image labeled with an object bounding box and a mask label;

[0190] Using the backbone network parameters and the explicit supervised spatial attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, the second-stage task head training of the object recognition model is performed to obtain the automatic alignment cross-attention module parameters.

[0191] Using the backbone network parameters, the explicit supervised spatial attention module parameters, and the automatic alignment cross attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, the third stage of end-to-end training of the object recognition model is performed to obtain the object recognition model.

[0192] In one embodiment, the identification unit 204 is further configured to:

[0193] Each of the multi-view matching similarities is used as the final probability score of the object to be identified as the object corresponding to the multi-view matching similarity, and the maximum value of the final probability score is determined from all the final probability scores.

[0194] The target object corresponding to the maximum final probability score is determined as the recognition result of the object to be identified.

[0195] In one embodiment, the identification unit 204 is further configured to:

[0196] The multi-view features are input into the customized classification module to obtain the classification score of the object to be identified as each object in the retrieval library;

[0197] The weighted average of the multi-view matching similarity and the classification score corresponding to each object is used to obtain the final probability score of the object to be identified as the object.

[0198] Determine the maximum final probability score from all the final probability scores;

[0199] The target object corresponding to the maximum final probability score is determined as the recognition result of the object to be identified.

[0200] In one embodiment, the identification unit 204 is further configured to:

[0201] Determine whether the maximum value of the final probability score is greater than the similarity threshold;

[0202] If so, the target object corresponding to the maximum value of the final probability score is determined as the recognition result of the object to be identified.

[0203] In one embodiment, the object recognition device may further include a customized classification module training unit.

[0204] The customized classification module training unit is also used for:

[0205] Obtain a customized fine-tuning dataset of images to be identified, the customized fine-tuning dataset including: unlabeled images and labeled images;

[0206] When performing customized training using a semi-supervised training method, a pseudo-label is assigned to each of the unlabeled images, and the true label is determined for each of the labeled images;

[0207] The target multi-view features are extracted from the customized fine-tuning dataset, and the target multi-view features are classified in a customized manner to obtain the target classification score of each image to be identified in the customized fine-tuning dataset.

[0208] For each unlabeled image's pseudo-label, each labeled image's real label, and each target classification score, the cross-entropy loss function is used to calculate the loss function value, and the parameters of the customized classification module are trained using the backpropagation algorithm to obtain the customized classification module.

[0209] Corresponding to the above embodiments, as shown in Figure 7, the present invention also discloses an electronic device, which may include a processor 1 and a memory 2;

[0210] The processor 1 and memory 2 communicate with each other via the communication bus 3.

[0211] Processor 1 is used to execute at least one instruction;

[0212] Memory 2 is used to store at least one instruction;

[0213] Processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.

[0214] Memory 2 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.

[0215] The processor executes at least one instruction to implement the steps shown in the embodiment of the object recognition method.

[0216] Corresponding to the above embodiments, the present invention also discloses a computer-readable storage medium that stores at least one instruction, which, when executed by a processor, implements the steps shown in the embodiments of the object recognition method.

[0217] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0218] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0219] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An object recognition method, characterized in that it comprises: acquiring multi-view images of the object to be identified; extracting multi-view features from the multi-view images using a feature extraction network based on a pre-trained object recognition model, wherein the feature extraction network comprises an explicit supervised spatial attention module and an automatic alignment cross-attention module; the explicit supervised spatial attention module takes intermediate features extracted from the multi-view images by the backbone network as input, performs channel compression and deformation recombination to obtain a local-global attention vector, and then applies a multilayer perceptron and deformation recombination to obtain spatial attention; the automatic alignment cross-attention module takes the image features output by the backbone network and processed by the spatial attention as input, divides the image features into grid blocks to obtain multiple local region features of the multi-view images, calculates the cross-attention between these multiple local region features, and aggregates local features with similar semantics to the corresponding queries for local region feature alignment, thus obtaining the multi-view features; adopting a multi-view matching strategy to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library; and determining the recognition result of the object to be identified based on each multi-view matching similarity.

2. The object recognition method according to claim 1, characterized in that, The explicit supervised spatial attention module is used to perform masking operations during feature extraction to suppress background region features of the multi-view image; The automatic alignment cross-attention module is used to perform feature segmentation and align local region features in the multi-view image to obtain the multi-view features.

3. The object recognition method according to claim 1, characterized in that, The multi-view matching strategy is adopted to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval database, including: A multi-view matching strategy is used to determine the similarity between the multi-view features and the multi-view template features of each object in the retrieval library, thereby obtaining a set of similarity values for each object. The average similarity of the set of similarity values is taken, and the resulting average similarity is used as the multi-view matching similarity between the object to be identified and the corresponding object.

4. The object recognition method according to claim 1, characterized in that, The training process of the object recognition model includes: The first stage of self-supervised pre-training of the object recognition model was carried out based on the ImageNet1K dataset to obtain the backbone network parameters and the explicit supervised spatial attention module parameters. Obtain an object recognition dataset, which includes multiple individuals of each object and multi-view images of each individual, with each multi-view image labeled with an object bounding box and a mask label; Using the backbone network parameters and the explicit supervised spatial attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, the second-stage task head training of the object recognition model is performed to obtain the automatic alignment cross-attention module parameters. Using the backbone network parameters, the explicit supervised spatial attention module parameters, and the automatic alignment cross attention module parameters as initialization parameters, and the object recognition dataset as the training dataset, the third stage of end-to-end training of the object recognition model is performed to obtain the object recognition model.

5. The object recognition method according to claim 1, characterized in that determining the recognition result of the object to be identified based on the multi-view matching similarity comprises: using each multi-view matching similarity as the final probability score that the object to be identified is the object corresponding to that multi-view matching similarity; determining the maximum final probability score from all the final probability scores; and determining the target object corresponding to the maximum final probability score as the recognition result of the object to be identified.
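
A minimal sketch of this selection step, assuming the multi-view matching similarities are held in a dictionary keyed by object name (the dictionary layout and the function name `recognize_by_max_score` are hypothetical):

```python
def recognize_by_max_score(similarity_by_object):
    """Use each matching similarity as the final probability score and
    return the object with the maximum score together with that score."""
    if not similarity_by_object:
        raise ValueError("the retrieval library is empty")
    best = max(similarity_by_object, key=similarity_by_object.get)
    return best, similarity_by_object[best]


if __name__ == "__main__":
    scores = {"cup": 0.91, "mug": 0.47, "bowl": 0.12}
    print(recognize_by_max_score(scores))
```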

6. The object recognition method according to claim 1, characterized in that determining the recognition result of the object to be identified based on the multi-view matching similarity comprises: inputting the multi-view features into a customized classification module to obtain the classification score that the object to be identified is each object in the retrieval library; for each object in the retrieval library, taking the weighted average of the multi-view matching similarity and the classification score corresponding to that object to obtain the final probability score that the object to be identified is that object; determining the maximum final probability score from all the final probability scores; and determining the target object corresponding to the maximum final probability score as the recognition result of the object to be identified.
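
The weighted-average fusion can be sketched as below. The claim does not give the weights, so the single `weight` parameter with a default of 0.5 is an assumption, as are the dictionary layout and the name `fuse_scores`.

```python
def fuse_scores(matching_similarity, classification_score, weight=0.5):
    """Weighted average of matching similarity and classification score.

    Both inputs map each library object to a score. `weight` is applied to
    the matching similarity and (1 - weight) to the classification score.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return {
        obj: weight * matching_similarity[obj]
        + (1.0 - weight) * classification_score[obj]
        for obj in matching_similarity
    }


if __name__ == "__main__":
    matching = {"cup": 0.8, "mug": 0.6}
    classification = {"cup": 0.2, "mug": 0.9}
    final = fuse_scores(matching, classification)
    print(max(final, key=final.get))
```

In this example the classification score changes the outcome: "cup" has the higher matching similarity, but "mug" has the higher fused score.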

7. The object recognition method according to claim 5 or 6, characterized in that the step of determining the target object corresponding to the maximum final probability score as the recognition result of the object to be identified comprises: determining whether the maximum final probability score is greater than a similarity threshold; and if so, determining the target object corresponding to the maximum final probability score as the recognition result of the object to be identified.
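
The threshold check can be sketched as follows. The claim only states what happens when the maximum score exceeds the threshold; returning `None` otherwise is an assumption, and the function name and the threshold values are hypothetical.

```python
def recognize_with_threshold(final_scores, similarity_threshold):
    """Return the best-scoring object only if its score exceeds the threshold.

    Returns None when no score is strictly greater than the threshold
    (an assumed convention for "no recognition result").
    """
    if not final_scores:
        return None
    best = max(final_scores, key=final_scores.get)
    if final_scores[best] > similarity_threshold:
        return best
    return None


if __name__ == "__main__":
    scores = {"cup": 0.9, "mug": 0.4}
    print(recognize_with_threshold(scores, 0.8))
    print(recognize_with_threshold(scores, 0.95))
```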

8. The object recognition method according to claim 6, characterized in that the training process of the customized classification module comprises: obtaining a customized fine-tuning dataset of images to be identified, the customized fine-tuning dataset including unlabeled images and labeled images; when performing customized training using a semi-supervised training method, assigning a pseudo-label to each unlabeled image and determining the true label of each labeled image; extracting target multi-view features from the customized fine-tuning dataset and performing customized classification on the target multi-view features to obtain the target classification score of each image to be identified in the customized fine-tuning dataset; and calculating the loss value with the cross-entropy loss function from the pseudo-label of each unlabeled image, the true label of each labeled image, and each target classification score, and training the parameters of the customized classification module using the backpropagation algorithm to obtain the customized classification module.
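
The loss computation in this claim can be sketched in plain Python. The claim does not say how pseudo-labels are assigned, so taking the highest-scoring class is an assumption; the function names are hypothetical, and the backpropagation step is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of classification scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy loss of one sample against an integer class label."""
    return -math.log(softmax(logits)[label])


def assign_pseudo_label(logits):
    """Assumed rule: the pseudo-label is the highest-scoring class."""
    return max(range(len(logits)), key=logits.__getitem__)


def semi_supervised_loss(labeled, unlabeled_logits):
    """Mean cross-entropy over labeled and pseudo-labeled samples.

    labeled: list of (logits, true_label) pairs.
    unlabeled_logits: list of logits for images without a label.
    """
    targets = list(labeled)
    targets += [(z, assign_pseudo_label(z)) for z in unlabeled_logits]
    return sum(cross_entropy(z, y) for z, y in targets) / len(targets)


if __name__ == "__main__":
    labeled = [([2.0, 0.5, 0.1], 0)]
    unlabeled = [[0.2, 1.5, 0.3]]
    print(semi_supervised_loss(labeled, unlabeled))
```

In a real training loop the pseudo-labels would typically come from an earlier prediction pass rather than from the same scores being optimized, and the loss would be minimized with an automatic-differentiation framework.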

9. An object recognition device, characterized by comprising: an image acquisition unit, used to acquire multi-view images of the object to be identified; a feature extraction unit, used to extract multi-view features from the multi-view images using the feature extraction network of a pre-trained object recognition model, wherein the feature extraction network has an explicit supervised spatial attention module and an automatic alignment cross-attention module; the explicit supervised spatial attention module takes the intermediate features extracted from the multi-view images by the backbone network as input, performs channel compression and deformation recombination to obtain a local-global perception attention vector, and then performs deformation recombination through a multilayer perceptron to obtain spatial attention; the automatic alignment cross-attention module takes the image features output by the backbone network after spatial attention processing as input, divides the image features into grid blocks to obtain multiple local region features of the multi-view images, calculates the cross-attention between the multiple local region features, and aggregates the local features that are semantically similar to the corresponding query to perform local region feature alignment and obtain the multi-view features; a similarity matching unit, used to determine the multi-view matching similarity between the multi-view features and the multi-view template features of each object in the retrieval library by adopting a multi-view matching strategy; and a recognition unit, used to determine the recognition result of the object to be identified based on the multi-view matching similarity of each object.
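
The cross-attention aggregation over local region features can be illustrated with a plain scaled dot-product sketch. This is not the claimed module: it omits learned projections and the grid-block division, and the names `cross_attention` and `softmax` are hypothetical. It only shows how each query aggregates the local features most similar to it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over local region features.

    queries: region features of one view, each a list of floats.
    keys, values: region features of another view (same count each).
    Each output row is a weighted sum of `values`, weighted by how
    similar the corresponding key is to the query.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[2.0, 0.0], [0.0, 4.0]]
    print(cross_attention(q, k, v))
```

Because the first key matches the query more closely than the second, its value receives the larger attention weight in the aggregated output.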

10. An electronic device, characterized in that the electronic device includes a memory and a processor; the memory is used to store at least one instruction; and the processor is used to execute the at least one instruction to implement the object recognition method as described in any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction which, when executed by a processor, implements the object recognition method as described in any one of claims 1 to 8.