Training method and device of object recognition model and computer device

By constructing an object recognition model, using encoding networks and feature extraction models to encode and extract features from sample images and text descriptions, calculating feature distances and training decision thresholds, the problem of strong subjectivity in manual evaluation is solved, and higher recognition accuracy is achieved.

CN115358345B · Active · Publication Date: 2026-06-19 · TSINGHUA UNIVERSITY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2022-09-09
Publication Date: 2026-06-19


IPC: G06V10/774; G06V20/62; G06T9/00; G06F40/126

Application Domain: Character and pattern recognition; Image coding


Abstract

This application relates to a training method, apparatus, and computer device for an object recognition model. The method includes: acquiring a training sample set; encoding the sample set based on an encoding network to obtain a first mixed code corresponding to positive sample objects, a second mixed code corresponding to negative sample objects, and a standard mixed code; inputting the mixed codes into a feature extraction model to obtain first object features, second object features, and standard object features; obtaining a first feature distance and a second feature distance through a feature adaptation network, the first object features, the second object features, and the standard object features; training a distance decision threshold and a feature adaptation network based on the first and second feature distances; and constructing an object recognition model based on the encoding network, the feature extraction model, the trained feature adaptation network, and the distance decision threshold. This method can improve the accuracy of object recognition.


Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a training method, apparatus, computer device, storage medium, and computer program product for an object recognition model.

Background Technology

[0002] With the development of artificial intelligence technology, image description technology is becoming increasingly mature. Image description is a method for assessing language expression ability, primarily serving as a supplement to formal language assessment methods.

[0003] Currently, language expression ability is tested by professionals who manually evaluate the target subject. The evaluation mainly focuses on the target subject's description of the image. The professionals then assess the target subject's language expression ability based on their experience and the standard description of the image.

[0004] However, because human evaluation of target objects is highly subjective, the accuracy of identification is low.

Summary of the Invention

[0005] Therefore, it is necessary to provide a training method, apparatus, computer device, computer-readable storage medium, and computer program product for an object recognition model to address the aforementioned technical problems.

[0006] Firstly, this application provides a method for training an object recognition model. The method includes:

[0007] Obtain a training sample set, which includes sample groups. Each sample group includes a sample image, positive text description information of the sample image for positive sample objects, negative text description information of the sample image for negative sample objects, and standard text information corresponding to the sample image.

[0008] The sample group is encoded based on the coding network to obtain the first hybrid code corresponding to the positive sample object, the second hybrid code corresponding to the negative sample object, and the standard hybrid code.

[0009] The first hybrid encoding, the second hybrid encoding, and the standard hybrid encoding are respectively input into the feature extraction model to obtain the first object feature corresponding to the positive sample object, the second object feature corresponding to the negative sample object, and the standard object feature.

[0010] The first feature distance and the second feature distance are obtained by using the feature adaptation network, the first object feature, the second object feature, and the standard object feature;

[0011] Based on the first feature distance and the second feature distance, a distance decision threshold and the feature adaptation network are trained. Based on the encoding network, the feature extraction model, the trained feature adaptation network, and the distance decision threshold, an object recognition model is constructed. The object recognition model is used to identify the type of the target object based on the text description information provided by the target object.

[0012] In one embodiment, the encoding network includes a text encoder and an image encoder; the encoding process based on the encoding network to obtain a first hybrid code corresponding to the positive sample object, a second hybrid code corresponding to the negative sample object, and a standard hybrid code includes:

[0013] The positive text description information, negative text description information, and standard text information are respectively input into the text encoder to obtain the positive text code corresponding to the positive text description information, the negative text code corresponding to the negative text description information, and the standard text code corresponding to the standard text information.

[0014] The sample image is input into the image encoder to obtain the image code;

[0015] The positive text code is concatenated with the image code to obtain the first hybrid code; the negative text code is concatenated with the image code to obtain the second hybrid code; and the standard text code is concatenated with the image code to obtain the standard hybrid code.

[0016] In one embodiment, obtaining the first feature distance and the second feature distance through a feature adaptation network, the first object feature, the second object feature, and the standard object feature includes:

[0017] The first object feature, the second object feature, and the standard object feature are respectively input into the feature adaptation network to obtain the first object adaptation feature, the second object adaptation feature, and the standard object adaptation feature;

[0018] According to the preset distance algorithm, the first feature distance between the first object adaptation feature and the standard object adaptation feature, and the second feature distance between the second object adaptation feature and the standard object adaptation feature are calculated.

[0019] In one embodiment, training the distance decision threshold based on the first feature distance and the second feature distance includes:

[0020] The distance decision threshold is obtained by training a support vector machine based on the first feature distance and the second feature distance.

[0021] Secondly, this application provides an object recognition method. The method includes:

[0022] Obtain the target image corresponding to the target object, the text description information of the target object to the target image, and the standard text information corresponding to the target image;

[0023] The target image, the text description information, and the standard text information are input into the object recognition model to obtain the category of the target object;

[0024] The object recognition model is trained using the object recognition model training method described in any one of claims 1 to 4.

[0025] In one embodiment, the object recognition model includes the encoding network, the feature extraction model, and the feature adaptation network;

[0026] The step of inputting the target image, the text description information, and the standard text information into the object recognition model to obtain the category of the target object includes:

[0027] The target image, the text description information, and the standard text information are encoded using an encoding network to obtain the target hybrid encoding and the target standard hybrid encoding corresponding to the target object.

[0028] The target hybrid encoding and target standard hybrid encoding are respectively input into the feature extraction model to obtain the target object features and target standard object features corresponding to the target object;

[0029] The target feature distance is obtained by using a feature adaptation network, the target object features, and the target standard object features;

[0030] The target feature distance is compared with the distance decision threshold in the object recognition model, and the type of the target object is determined based on the comparison result.
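The recognition steps in [0027] to [0030] can be read as a small pipeline. The following is a minimal sketch, not the claimed implementation: the callables `encode`, `extract`, `adapt`, and `distance` are hypothetical stand-ins for the encoding network, feature extraction model, feature adaptation network, and preset distance algorithm, and the rule that a distance at or below the threshold maps to the positive type is an assumption.

```python
from typing import Callable, List

Vector = List[float]


def recognize(
    target_image: object,
    description: str,
    standard_text: str,
    encode: Callable[[str, object], Vector],    # hypothetical encoding network
    extract: Callable[[Vector], Vector],        # hypothetical feature extraction model
    adapt: Callable[[Vector], Vector],          # hypothetical feature adaptation network
    distance: Callable[[Vector, Vector], float],
    threshold: float,
) -> str:
    """Return the type of the target object from its description of the image."""
    # Encode the target description and the standard text with the same image.
    target_code = encode(description, target_image)
    standard_code = encode(standard_text, target_image)

    # Extract features, then pass them through the adaptation network.
    target_feature = adapt(extract(target_code))
    standard_feature = adapt(extract(standard_code))

    # Compare the target feature distance with the trained threshold.
    # Assumption: positive sample objects lie closer to the standard description.
    d = distance(target_feature, standard_feature)
    return "positive" if d <= threshold else "negative"


# Toy stand-ins so the sketch can be executed end to end.
toy_encode = lambda text, image: [float(len(text))]
identity = lambda v: v
abs_diff = lambda a, b: abs(a[0] - b[0])

result_far = recognize(None, "A boy.", "A boy flies a kite.",
                       toy_encode, identity, identity, abs_diff, threshold=5.0)
result_same = recognize(None, "A boy flies a kite.", "A boy flies a kite.",
                        toy_encode, identity, identity, abs_diff, threshold=5.0)
```

Passing the components in as callables keeps the sketch independent of any particular encoder or Transformer implementation.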

[0031] Thirdly, this application also provides a training apparatus for an object recognition model. The apparatus includes:

[0032] The acquisition module is used to acquire a training sample set, which includes sample groups. Each sample group includes a sample image, positive text description information of the positive sample object to the sample image, negative text description information of the negative sample object to the sample image, and standard text information corresponding to the sample image.

[0033] The encoding module is used to encode the sample group based on the encoding network to obtain a first hybrid code corresponding to the positive sample object, a second hybrid code corresponding to the negative sample object, and a standard hybrid code.

[0034] The extraction module is used to input the first hybrid encoding, the second hybrid encoding, and the standard hybrid encoding into the feature extraction model respectively to obtain the first object feature corresponding to the positive sample object, the second object feature corresponding to the negative sample object, and the standard object feature;

[0035] The adaptation module is used to obtain the first feature distance and the second feature distance through the feature adaptation network, the first object feature, the second object feature, and the standard object feature;

[0036] The training module is used to train a distance decision threshold and the feature adaptation network based on the first feature distance and the second feature distance, and to construct an object recognition model based on the encoding network, the feature extraction model, the trained feature adaptation network and the distance decision threshold. The object recognition model is used to identify the type of the target object based on the text description information provided by the target object.

[0037] In one embodiment, the encoding module is specifically used for:

[0038] The positive text description information, negative text description information, and standard text information are respectively input into the text encoder to obtain the positive text code corresponding to the positive text description information, the negative text code corresponding to the negative text description information, and the standard text code corresponding to the standard text information.

[0039] The sample image is input into the image encoder to obtain the image code;

[0040] The positive text code is concatenated with the image code to obtain the first hybrid code; the negative text code is concatenated with the image code to obtain the second hybrid code; and the standard text code is concatenated with the image code to obtain the standard hybrid code.

[0041] In one embodiment, the adaptation module is specifically used for:

[0042] The first object feature, the second object feature, and the standard object feature are respectively input into the feature adaptation network to obtain the first object adaptation feature, the second object adaptation feature, and the standard object adaptation feature;

[0043] According to the preset distance algorithm, the first feature distance between the first object adaptation feature and the standard object adaptation feature, and the second feature distance between the second object adaptation feature and the standard object adaptation feature are calculated.

[0044] In one embodiment, the training module is specifically used for:

[0045] The distance decision threshold is obtained by training a support vector machine based on the first feature distance and the second feature distance.

[0046] Fourthly, this application also provides an object recognition device. The device includes:

[0047] The acquisition module is used to acquire the target image corresponding to the target object, the text description information of the target object to the target image, and the standard text information corresponding to the target image;

[0048] The recognition module is used to input the target image, the text description information, and the standard text information into the object recognition model to obtain the category of the target object.

[0049] In one embodiment, the recognition module is specifically used for:

[0050] The target image, the text description information, and the standard text information are encoded using an encoding network to obtain the target hybrid encoding and the target standard hybrid encoding corresponding to the target object.

[0051] The target hybrid encoding and target standard hybrid encoding are respectively input into the feature extraction model to obtain the target object features and target standard object features corresponding to the target object;

[0052] The target feature distance is obtained by using a feature adaptation network, the target object features, and the target standard object features;

[0053] The target feature distance is compared with the distance decision threshold in the object recognition model, and the type of the target object is determined based on the comparison result.

[0054] Fifthly, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the method described in the first or second aspect.

[0055] Sixthly, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program thereon, which, when executed by a processor, implements the steps of the method described in the first or second aspect.

[0056] In a seventh aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the steps of the method described in the first or second aspect.

[0057] The above-mentioned object recognition model training method, apparatus, computer device, storage medium, and computer program product acquire a training sample set, which includes sample groups. Each sample group contains a sample image, positive text description information of the sample image from positive sample objects, negative text description information of the sample image from negative sample objects, and standard text information corresponding to the sample image. The sample groups are encoded using an encoding network to obtain a first mixed code corresponding to the positive sample objects, a second mixed code corresponding to the negative sample objects, and a standard mixed code. The first mixed code, the second mixed code, and the standard mixed code are input into a feature extraction model to obtain first object features corresponding to the positive sample objects, second object features corresponding to the negative sample objects, and standard object features. A first feature distance and a second feature distance are obtained through a feature adaptation network, the first object features, the second object features, and the standard object features. A distance decision threshold and the feature adaptation network are trained based on the first and second feature distances. Based on the encoding network, the feature extraction model, the trained feature adaptation network, and the distance decision threshold, an object recognition model is constructed. This object recognition model can identify the type of a target object based on the text description information provided by the target object. This eliminates the need for manual evaluation of the target object, thus improving the accuracy of target object identification.

Attached Figure Description

[0058] Figure 1 is a flowchart of the training method of an object recognition model in one embodiment;

[0059] Figure 2 is a schematic diagram of the feature adaptation network structure in one embodiment;

[0060] Figure 3 is a flowchart of a hybrid encoding method in one embodiment;

[0061] Figure 4 is a flowchart of a feature adaptation method in one embodiment;

[0062] Figure 5 is a flowchart of an object recognition method in one embodiment;

[0063] Figure 6 is a schematic diagram of the specific process of an object recognition method in one embodiment;

[0064] Figure 7 is a flowchart of the application of an object recognition model in one embodiment;

[0065] Figure 8 is a structural block diagram of a training device for an object recognition model in one embodiment;

[0066] Figure 9 is a structural block diagram of an object recognition device in one embodiment;

[0067] Figure 10 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0068] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0069] The object recognition model training method provided in this application can be applied to a terminal, a server, or a system including both a terminal and a server, in which case it is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. The server can be a standalone server or a server cluster consisting of multiple servers.

[0070] In one embodiment, as shown in Figure 1, a training method for an object recognition model is provided. Taking the application of the method to a terminal as an example, the method includes the following steps:

[0071] Step 102: Obtain the training sample set.

[0072] The training sample set contains sample groups, each of which contains a sample image, positive text descriptions of the sample image for positive sample objects, negative text descriptions of the sample image for negative sample objects, and standard text information corresponding to the sample image.

[0073] Optionally, this training sample set consists of multiple sample groups; the sample images contain information with different degrees of salience, such as people, objects, and behaviors that can be described using words from multiple semantic categories; or multiple objects that allow the target object to be introduced in the description and can be referred to using pronouns; or events with causal and temporal relationships; or events and behaviors that can be described using psychological state language; and rich multi-distributed scene information.

[0074] Standard text information can be pre-annotated by technical personnel. This standard text information should describe all content and details in the image as comprehensively as possible, and should meet the following conditions: describe important information first, then background information; important information should stand out from the rest; use vocabulary from multiple semantic categories, including both high-frequency and low-frequency words; achieve referential cohesion through anaphora in the description; indicate causal relationships between people and events in the image; attribute cognitive and emotional states to different events and behaviors in the image; demonstrate cognitive and perceptual skills, and use structured language in the description.

[0075] In this embodiment, the terminal acquires recordings of positive and negative sample objects describing the sample image. Automatic speech recognition converts the recorded speech information into text information, thereby obtaining positive text descriptions of the sample image from positive sample objects and negative text descriptions of the sample image from negative sample objects. Positive and negative sample objects belong to different types of objects (e.g., language ability types). In one example, a positive sample object can be a reference object, such as an object whose language expression ability meets normal conditions; a negative sample object can be a special object relative to the reference object, such as an object whose language expression ability does not meet normal conditions. It is understood that the above example is merely one implementation method provided by this application embodiment, and the selection method of positive and negative sample objects can be determined by those skilled in the art according to actual needs; this application embodiment does not limit this method.
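The composition of a sample group described in [0072] to [0075] can be written down as a simple record type. This is only an illustrative sketch: the class name `SampleGroup`, its field names, the file names, and the example sentences are all hypothetical, and the filtering rule is an assumption rather than part of the described method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SampleGroup:
    """One training sample group (field names are illustrative)."""
    image_path: str     # the sample image
    positive_text: str  # transcribed description from a positive sample object
    negative_text: str  # transcribed description from a negative sample object
    standard_text: str  # pre-annotated standard text information


def build_training_set(groups: List[SampleGroup]) -> List[SampleGroup]:
    """Keep only groups whose three text fields are all non-empty.

    Assumption: an empty transcription (for example, a failed speech
    recognition result) makes the group unusable for training.
    """
    return [
        g for g in groups
        if g.positive_text.strip()
        and g.negative_text.strip()
        and g.standard_text.strip()
    ]


training_set = build_training_set([
    SampleGroup("img_001.png", "A boy is flying a kite in the park.",
                "Kite. Boy.", "A boy flies a red kite while his dog watches."),
    SampleGroup("img_002.png", "", "Dog.", "A dog sleeps on a sofa."),
])
```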

[0076] Step 104: Encode the sample group based on the coding network to obtain the first mixed code corresponding to the positive sample object, the second mixed code corresponding to the negative sample object, and the standard mixed code.

[0077] In this embodiment, the terminal inputs the sample group into the encoding network, and encodes and splices the positive text description information, negative text description information, standard text information and sample images in the sample group to obtain the first hybrid code, the second hybrid code and the standard hybrid code. The specific encoding process will be described in detail later.

[0078] Step 106: Input the first hybrid encoding, the second hybrid encoding, and the standard hybrid encoding into the feature extraction model to obtain the first object feature corresponding to the positive sample object, the second object feature corresponding to the negative sample object, and the standard object feature.

[0079] The feature extraction model can be a Transformer model, which uses a Transformer module trained in the image generator. It consists of 64 attention layers, with 62 attention heads per layer and 64 dimensions per attention head. Each attention layer uses three types of sparse attention: row attention mask, column attention mask, and convolutional attention mask.

[0080] In this embodiment, the terminal inputs the first hybrid code, the second hybrid code, and the standard hybrid code into the Transformer model, respectively. The first hybrid code, the second hybrid code, and the standard hybrid code are predicted by the three sparse attention methods in the Transformer model: row attention mask, column attention mask, and convolutional attention mask, to obtain the first object feature vector, the second object feature vector, and the standard object feature vector.

[0081] Step 108: Obtain the first feature distance and the second feature distance through the feature adaptation network, the first object feature, the second object feature, and the standard object feature.

[0082] Wherein, the first feature distance is the distance from the first object feature to the standard object feature, and the second feature distance is the distance from the second object feature to the standard object feature.

[0083] In this embodiment of the application, the terminal inputs the first object feature, the second object feature and the standard object feature into the feature adaptation network to obtain the first adaptation feature corresponding to the first object feature, the second adaptation feature corresponding to the second object feature and the standard adaptation feature corresponding to the standard object feature. Then, the terminal calculates the first feature distance based on the first adaptation feature and the standard adaptation feature, and calculates the second feature distance based on the second adaptation feature and the standard adaptation feature.

[0084] As shown in Figure 2, in one embodiment, the feature adaptation network includes an average pooling layer and a fully connected layer.

Step 110: Train a distance decision threshold and the feature adaptation network based on the first feature distance and the second feature distance, and construct an object recognition model based on the encoding network, the feature extraction model, the trained feature adaptation network, and the distance decision threshold. The object recognition model is used to identify the type of the target object based on the text description information provided by the target object.

[0085] The distance decision threshold is used to distinguish between positive and negative samples. That is, the distance decision threshold is the dividing line between positive and negative samples, so that the type of the target object to be identified (such as language ability type) can be identified in the subsequent process. The identification result indicates whether the target object belongs to the type corresponding to the positive sample object or the type corresponding to the negative sample object.
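Elsewhere the document states that the distance decision threshold is obtained by training a support vector machine on the first and second feature distances. For one-dimensional, linearly separable distances, a hard-margin linear SVM places its boundary midway between the two closest points of opposite classes, which the sketch below computes directly. The function name, the example distances, and the assumption that positive samples lie closer to the standard description than negative samples are illustrative and not taken from the source.

```python
from typing import Sequence


def fit_distance_threshold(
    positive_distances: Sequence[float],
    negative_distances: Sequence[float],
) -> float:
    """Max-margin threshold for separable one-dimensional distances.

    Assumes every first feature distance (positive samples) is smaller than
    every second feature distance (negative samples). In that case the
    hard-margin SVM boundary is the midpoint of the gap between the classes.
    """
    if not positive_distances or not negative_distances:
        raise ValueError("both distance lists must be non-empty")
    largest_positive = max(positive_distances)
    smallest_negative = min(negative_distances)
    if largest_positive >= smallest_negative:
        # Overlapping classes would need a soft-margin SVM instead.
        raise ValueError("distances are not separable by a single threshold")
    return (largest_positive + smallest_negative) / 2.0


first_distances = [0.75, 1.0, 1.5]    # hypothetical positive-sample distances
second_distances = [2.5, 3.0, 3.25]   # hypothetical negative-sample distances
threshold = fit_distance_threshold(first_distances, second_distances)
```

When the two sets of distances overlap, a soft-margin SVM from a library would be the more realistic choice; the midpoint rule is used here only because it is exact for the separable case.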

[0086] In this embodiment, the terminal trains a distance decision threshold based on the first feature distance and the second feature distance. The terminal also trains the feature adaptation network based on the first and second feature distances to obtain the trained feature adaptation network. Specifically, for each input sample group, the terminal obtains a first feature distance and a second feature distance. It then calculates a loss value based on the first and second feature distances, adjusts the parameters of the feature adaptation network based on this loss value, and inputs a new sample group, repeating until a preset number of training iterations is reached, which yields the trained feature adaptation network. Optionally, any loss function suitable for calculating the loss value can be applied in this embodiment, and this embodiment does not limit the choice.

[0087] The terminal constructs an object recognition model based on the encoding network, feature extraction model, trained feature adaptation network, and distance decision threshold. This object recognition model is used to identify the type of the target object based on the text description information provided by the target object.

[0088] In the above object recognition model training method, an encoding network is used to perform mixed coding on the sample images and corresponding text descriptions in the sample group, and features are extracted from this mixed coding. Then, the first feature distance and the second feature distance are obtained through the first object feature, the second object feature, and the standard object feature. A distance decision threshold and the feature adaptation network are trained to construct the object recognition model. The distance decision threshold can effectively represent the boundary between positive and negative samples. Therefore, the object recognition model can identify target objects based on the distance decision threshold, improving recognition accuracy.

[0089] As shown in Figure 3, in one embodiment, the encoding network includes a text encoder and an image encoder; encoding the sample group based on the encoding network to obtain a first hybrid code corresponding to positive sample objects, a second hybrid code corresponding to negative sample objects, and a standard hybrid code includes:

[0090] Step 302: Input the positive text description information, negative text description information, and standard text information into the text encoder respectively to obtain the positive text code corresponding to the positive text description information, the negative text code corresponding to the negative text description information, and the standard text code corresponding to the standard text information.

[0091] In this embodiment, the terminal encodes each positive text description using a BPE (Byte Pair Encoding) text encoder to obtain the corresponding positive text encoding, which can be a text encoding with a maximum of 256 codes. Similarly, the terminal encodes each negative text description using the BPE text encoder to obtain the corresponding negative text encoding, and encodes each standard text description using the BPE text encoder to obtain the corresponding standard text encoding; each of these can also be a text encoding with a maximum of 256 codes.

[0092] Step 304: Input the sample image into the image encoder to obtain the image code.

[0093] In this embodiment, the terminal compresses the sample image using a dVAE (discrete variational autoencoder) encoder, thereby encoding and serializing the image into 1024 codes.

[0094] Step 306: Concatenate the positive text code with the image code to obtain the first mixed code; concatenate the negative text code with the image code to obtain the second mixed code; and concatenate the standard text code with the image code to obtain the standard mixed code.

[0095] In this embodiment, the terminal concatenates the positive text codes first and the image codes last to obtain a first mixed code, which has 1280 codes; the terminal concatenates the negative text codes first and the image codes last to obtain a second mixed code, which also has 1280 codes; the terminal concatenates the standard text codes first and the image codes last to obtain a standard mixed code, which also has 1280 codes.
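The lengths in this step add up as 256 text codes plus 1024 image codes, giving 1280 codes per hybrid code. The sketch below illustrates the text-first concatenation. Because the text encoding is described as having at most 256 codes while the hybrid code always has 1280, the sketch assumes that shorter text encodings are right-padded; the padding code `PAD_ID`, the function names, and the example code values are hypothetical.

```python
from typing import List

TEXT_LEN = 256    # maximum number of text codes (from the embodiment)
IMAGE_LEN = 1024  # number of image codes (from the embodiment)
PAD_ID = 0        # hypothetical padding code, not specified in the source


def pad_text_codes(codes: List[int]) -> List[int]:
    """Truncate or right-pad text codes to exactly TEXT_LEN entries."""
    clipped = codes[:TEXT_LEN]
    return clipped + [PAD_ID] * (TEXT_LEN - len(clipped))


def hybrid_code(text_codes: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text codes first and image codes last."""
    if len(image_codes) != IMAGE_LEN:
        raise ValueError(f"expected {IMAGE_LEN} image codes, got {len(image_codes)}")
    return pad_text_codes(text_codes) + list(image_codes)


image_codes = list(range(1000, 1000 + IMAGE_LEN))  # stand-in for dVAE output
positive_codes = [11, 12, 13]                      # stand-in for BPE output
negative_codes = [21, 22]
standard_codes = [31, 32, 33, 34]

first_hybrid = hybrid_code(positive_codes, image_codes)
second_hybrid = hybrid_code(negative_codes, image_codes)
standard_hybrid = hybrid_code(standard_codes, image_codes)
```

All three hybrid codes share the same image segment, so any difference between them comes only from the text segment.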

[0096] In this embodiment, the sample group is encoded and spliced by an encoding network. The feature extraction model can extract features from the sample image and the corresponding descriptive text information in a bimodal manner, thereby training the feature adaptation network and training a distance decision threshold based on the support vector machine, thus improving the recognition accuracy.

[0097] As shown in Figure 4, in one embodiment, obtaining the first feature distance and the second feature distance through a feature adaptation network, a first object feature, a second object feature, and a standard object feature includes:

[0098] Step 402: Input the first object feature, the second object feature, and the standard object feature into the feature adaptation network to obtain the first object adaptation feature, the second object adaptation feature, and the standard object adaptation feature.

[0099] Referring to Figure 2, the feature adaptation network consists of an average pooling layer and a fully connected layer.

[0100] In this embodiment, the terminal performs average pooling on the first hybrid encoding of the sample group obtained in step 308 to obtain a first feature vector, which is a 1280-dimensional vector. Then, the fully connected layer reduces the dimensionality of the first feature vector to 256 dimensions to obtain the first object adaptation feature. The terminal also performs average pooling on the second hybrid encoding of the sample group obtained in step 308 to obtain a second feature vector, which is a 1280-dimensional vector. Then, the fully connected layer reduces the dimensionality of the second feature vector to 256 dimensions to obtain the second object adaptation feature. Finally, the terminal performs average pooling on the standard hybrid encoding of the sample group obtained in step 308 to obtain a standard feature vector, which is a 1280-dimensional vector. Then, the fully connected layer reduces the dimensionality of the standard feature vector to 256 dimensions to obtain the standard object adaptation feature.
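A minimal sketch of such a feature adaptation network follows, using only the Python standard library. The dimensions 1280 and 256 come from the text. It is an assumption here that the extractor output is a sequence of 1280-dimensional vectors and that pooling averages over sequence positions; the function names and the randomly initialised, untrained weights are hypothetical.

```python
import random

FEATURE_DIM = 1280  # pooled feature size stated in the text
ADAPTED_DIM = 256   # fully connected output size stated in the text


def average_pool(token_features):
    """Average a sequence of feature vectors into a single vector."""
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(vec[d] for vec in token_features) / n for d in range(dim)]


def fully_connected(vector, weights, bias):
    """Apply y = W x + b, where weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def adapt(token_features, weights, bias):
    """Feature adaptation: average pooling followed by a fully connected layer."""
    return fully_connected(average_pool(token_features), weights, bias)


rng = random.Random(0)
# Hypothetical extractor output: 4 positions, each a 1280-dimensional feature.
features = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(4)]
# Untrained layer parameters, for illustration only.
weights = [[rng.gauss(0, 0.02) for _ in range(FEATURE_DIM)]
           for _ in range(ADAPTED_DIM)]
bias = [0.0] * ADAPTED_DIM

adapted = adapt(features, weights, bias)
print(len(average_pool(features)), len(adapted))  # 1280 256
```

The same `adapt` call would be applied to the first, second, and standard object features, so that all three adaptation features share one set of layer parameters.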

[0101] Step 404: Calculate the first feature distance between the first object adaptation feature and the standard object adaptation feature, and the second feature distance between the second object adaptation feature and the standard object adaptation feature, according to the preset distance algorithm.

[0102] The calculation formula for the distance algorithm is as follows:

[0103] Here, Loss is the loss value, $X^{+}$ is the first adaptation feature, $X^{-}$ is the second adaptation feature, and $X_{std}$ is the standard adaptation feature.

[0104] In this embodiment, the terminal calculates and obtains the first feature distance and the second feature distance according to the distance algorithm.

[0105] In this embodiment, by calculating the first feature distance and the second feature distance, the object recognition model can be trained and a distance decision threshold can be trained. This distance decision threshold can represent the boundary between positive and negative samples, thereby improving the recognition accuracy of the object recognition model.
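The preset distance algorithm is not reproduced in this text, so the following is only an illustrative sketch under stated assumptions: it uses the Euclidean distance between adaptation features and a triplet-style margin loss, neither of which is confirmed by the source. It also assumes that the positive sample should lie closer to the standard feature than the negative sample. The function names and the toy three-dimensional vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distance between two adaptation features of equal length (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin_loss(first_distance, second_distance, margin=1.0):
    """Assumed triplet-style loss: zero once the first distance is
    smaller than the second distance by at least the margin."""
    return max(0.0, first_distance - second_distance + margin)


x_std = [0.0, 0.0, 0.0]   # standard object adaptation feature (toy values)
x_pos = [0.0, 3.0, 4.0]   # first object adaptation feature (toy values)
x_neg = [6.0, 8.0, 0.0]   # second object adaptation feature (toy values)

first_distance = euclidean_distance(x_pos, x_std)    # 5.0
second_distance = euclidean_distance(x_neg, x_std)   # 10.0
print(first_distance, second_distance,
      margin_loss(first_distance, second_distance))  # 5.0 10.0 0.0
```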

[0106] In one embodiment, training a distance decision threshold based on a first feature distance and a second feature distance includes:

[0107] The distance decision threshold is obtained by training a support vector machine based on the first feature distance and the second feature distance.

[0108] In this embodiment, the terminal uses a support vector machine model to classify the first feature distance and the second feature distance, obtaining a distance decision threshold. The decision function used is: $y = \mathrm{sign}(\omega^{T} L + b)$.

[0109] Where L is the feature distance, ω and b are model parameters, y represents the decision category, which is the type to which the positive or negative sample object belongs (such as language ability type), and T represents transpose.

[0110] Specifically, obtaining the distance decision threshold through support vector machine training includes the following steps:

[0111] First, the model parameters are initialized as $\omega = 0$, $b = 0$. Then the terminal randomly selects a training sample $i$, where $L_i$ is the feature distance of sample $i$ and $y_i$ is its decision type, i.e., the type to which the positive or negative sample object belongs (e.g., language ability type). The loss is calculated with the hinge loss function $Loss_i = \max[0,\ 1 - y_i(\omega^{T} L_i + b)]$, and $\omega$ and $b$ are updated by gradient descent. This step is repeated until the preset number of training iterations is reached, and the trained $\omega$ and $b$ are then saved.

[0112] Then, the terminal inputs a preset test sample, performs the calculation according to $y = \mathrm{sign}(\omega^{T} L + b)$, and uses the result $-b/\omega$ as the distance decision threshold.
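The training steps above can be sketched as follows for scalar feature distances. This is a minimal illustration, not the patented implementation. The learning rate, iteration count, seed, function name, and toy distances are all assumptions, and no regularization term is included because the text does not mention one. The threshold is read off as the point where ω·L + b changes sign, which is also an assumption about the expression in the text.

```python
import random


def train_threshold(distances, labels, lr=0.1, iterations=5000, seed=0):
    """Fit y = sign(w * L + b) on scalar distances with stochastic hinge-loss updates."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # initial parameters, as in the text
    for _ in range(iterations):
        i = rng.randrange(len(distances))  # randomly select a training sample
        L_i, y_i = distances[i], labels[i]
        # The hinge loss max(0, 1 - y_i (w L_i + b)) has a non-zero
        # subgradient only when the margin is violated.
        if 1.0 - y_i * (w * L_i + b) > 0.0:
            w += lr * y_i * L_i
            b += lr * y_i
    return w, b


first_distances = [0.8, 1.0, 1.2]    # toy distances for positive samples (label +1)
second_distances = [2.8, 3.0, 3.2]   # toy distances for negative samples (label -1)
distances = first_distances + second_distances
labels = [1] * 3 + [-1] * 3

w, b = train_threshold(distances, labels)
threshold = -b / w  # point where w * L + b changes sign
print(round(threshold, 3))
```

On this separable toy data the learned threshold falls between the largest positive distance and the smallest negative distance, so every training sample is classified correctly.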

[0113] In this embodiment, by training a distance decision threshold, the target object can be judged based on the distance decision threshold. This distance decision threshold can accurately represent the boundary between positive and negative samples, thereby improving the accuracy of the object recognition model.

[0114] As shown in Figure 5, the embodiments of this application also provide an object recognition method, which includes the following steps:

[0115] Step 502: Obtain the target image corresponding to the target object, the text description information of the target object on the target image, and the standard text information corresponding to the target image.

[0116] In this embodiment, the terminal can record the description of the target using a recording device, and convert the recorded speech information into text information through automatic speech recognition to obtain the text description information of the target object on the target image.

[0117] Step 504: Input the target image, text description information, and standard text information into the object recognition model to obtain the category of the target object.

[0118] In this embodiment, the terminal inputs the text description information of the target object and the target image into the trained object recognition model to obtain the recognition result of the target object.

[0119] In this embodiment, by training the model with bimodal hybrid encoding of the input object, non-invasive recognition of the target object can be achieved. Figure 6 shows the specific process for identifying the type of the target object provided in the embodiments of this application.

[0120] In one embodiment, as shown in Figure 7, the object recognition model includes an encoding network, a feature extraction model, and a feature adaptation network. Inputting the target image, the text description information, and the standard text information into the object recognition model to obtain the category of the target object includes:

[0121] Step 702: Encode the target image, text description information and standard text information based on the coding network to obtain the target hybrid code and standard hybrid code corresponding to the target object.

[0122] In this embodiment, the terminal inputs the text description information of the target object to the target image, the target image, and the corresponding standard text information into the encoding network. The encoding network encodes the text description information of the target object to the target image to obtain text encoding; the encoding network encodes the target image to obtain image encoding; the encoding network encodes the standard text information to obtain standard text encoding; then the encoding network concatenates the text encoding first and the image encoding last to obtain target hybrid encoding; the encoding network concatenates the standard text encoding first and the image encoding last to obtain standard hybrid encoding. The specific processing procedure of this step can be found in the explanations of steps 302, 304, and 306 above.

[0123] Step 704: Input the target hybrid encoding and the standard hybrid encoding into the feature extraction model respectively to obtain the target object features and standard object features corresponding to the target object.

[0124] In this embodiment, the terminal inputs the target hybrid encoding into the Transformer model to obtain the target object features, and inputs the standard hybrid encoding into the Transformer model to obtain the standard object features.

[0125] Step 706: Obtain the target feature distance through the feature adaptation network, target object features, and target standard object features.

[0126] In this embodiment, the terminal inputs the target object features and standard object features into the feature adaptation network to obtain the distance between the target object features and the standard object features, i.e., the target feature distance.

[0127] Step 708: Compare the target feature distance with the distance decision threshold in the object recognition model, and determine the type of the target object based on the comparison result.

[0128] In this embodiment, the terminal compares the target feature distance with the distance decision threshold to obtain the comparison result, and uses this result to determine the type of the target object.
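The comparison in step 708 can be sketched in two equivalent forms: evaluating the decision function y = sign(ω·L + b) directly, or comparing the distance with the threshold. The parameter values below are hypothetical, and the mapping of small distances to the +1 category (that is, ω < 0) is an assumption about how the trained model is oriented.

```python
def recognize(target_distance, w, b):
    """Return +1 or -1 according to y = sign(w * L + b)."""
    return 1 if w * target_distance + b > 0 else -1


def recognize_by_threshold(target_distance, threshold, positive_below=True):
    """Equivalent comparison against the threshold -b / w.

    positive_below=True corresponds to w < 0, i.e. small distances map to +1.
    """
    below = target_distance < threshold
    return 1 if below == positive_below else -1


w, b = -1.25, 2.5      # hypothetical trained parameters
threshold = -b / w     # 2.0
for distance in (0.9, 3.1):
    print(distance, recognize(distance, w, b),
          recognize_by_threshold(distance, threshold))
```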

[0129] In one embodiment, this application also provides an example of an object recognition model training method, specifically including:

[0130] The terminal acquires a training sample set, which contains sample groups. Each sample group contains a sample image, positive text description information of the positive sample object to the sample image, negative text description information of the negative sample object to the sample image, and standard text information corresponding to the sample image.
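As a rough illustration of the sample group structure just described, the sketch below models one group as a small immutable record. The class name, field names, the use of `bytes` for the image, and the placeholder strings are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SampleGroup:
    """One training sample group as described in the text."""
    sample_image: bytes   # raw image data (representation is an assumption)
    positive_text: str    # description given by the positive sample object
    negative_text: str    # description given by the negative sample object
    standard_text: str    # standard text information for the image


def text_triplet(group: SampleGroup) -> List[str]:
    """Return the three texts that are each paired with the same image."""
    return [group.positive_text, group.negative_text, group.standard_text]


training_set = [
    SampleGroup(b"<image bytes>", "positive description",
                "negative description", "standard description"),
]
print(len(training_set), text_triplet(training_set[0]))
```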

[0131] Input the positive text description information, negative text description information, and standard text information into the text encoder respectively to obtain the positive text code corresponding to the positive text description information, the negative text code corresponding to the negative text description information, and the standard text code corresponding to the standard text information.

[0132] Input the sample image into the image encoder to obtain the image code;

[0133] The first mixed code is obtained by concatenating the positive text code with the image code; the second mixed code is obtained by concatenating the negative text code with the image code; and the standard mixed code is obtained by concatenating the standard text code with the image code.

[0134] The first hybrid encoding, the second hybrid encoding, and the standard hybrid encoding are input into the feature extraction model to obtain the first object feature corresponding to the positive sample object, the second object feature corresponding to the negative sample object, and the standard object feature.

[0135] The first object feature, the second object feature, and the standard object feature are input into the feature adaptation network to obtain the first object adaptation feature, the second object adaptation feature, and the standard object adaptation feature.

[0136] According to the preset distance algorithm, calculate the first feature distance between the first object adaptation feature and the standard object adaptation feature, and the second feature distance between the second object adaptation feature and the standard object adaptation feature.

[0137] The feature adaptation network is trained based on the first feature distance and the second feature distance, and the distance decision threshold is obtained through support vector machine training. An object recognition model is then constructed from the encoding network, the feature extraction model, the trained feature adaptation network, and the distance decision threshold. The object recognition model is used to identify the type of the target object based on the text description information provided by the target object.

[0138] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0139] Based on the same inventive concept, this application also provides a training apparatus for an object recognition model, which implements the object recognition model training method described above. The solution provided by this apparatus is similar to the solution described in the above method; therefore, for the specific limitations in the one or more apparatus embodiments provided below, refer to the limitations of the object recognition model training method described above, which are not repeated here.

[0140] In one embodiment, as shown in Figure 8, a training device for an object recognition model is provided, comprising: an acquisition module 810, an encoding module 820, an extraction module 830, an adaptation module 840, and a training module 850, wherein:

[0141] The acquisition module 810 is used to acquire a training sample set, which includes sample groups. Each sample group includes a sample image, positive text description information of the positive sample object to the sample image, negative text description information of the negative sample object to the sample image, and standard text information corresponding to the sample image.

[0142] The encoding module 820 is used to encode the sample group based on the encoding network to obtain a first hybrid code corresponding to the positive sample object, a second hybrid code corresponding to the negative sample object, and a standard hybrid code.

[0143] Extraction module 830 is used to input the first hybrid code, the second hybrid code and the standard hybrid code into the feature extraction model respectively to obtain the first object feature corresponding to the positive sample object, the second object feature corresponding to the negative sample object and the standard object feature;

[0144] The adaptation module 840 is used to obtain the first feature distance and the second feature distance through the feature adaptation network, the first object feature, the second object feature and the standard object feature;

[0145] The training module 850 is used to train a distance decision threshold and the feature adaptation network based on the first feature distance and the second feature distance, and to construct an object recognition model based on the encoding network, the feature extraction model, the trained feature adaptation network and the distance decision threshold. The object recognition model is used to identify the type of the target object based on the text description information provided by the target object.

[0146] In one embodiment, the encoding module is specifically used for:

[0147] The positive text description information, negative text description information, and standard text information are respectively input into the text encoder to obtain the positive text code corresponding to the positive text description information, the negative text code corresponding to the negative text description information, and the standard text code corresponding to the standard text information.

[0148] The sample image is input into the image encoder to obtain the image code;

[0149] The positive text code is concatenated with the image code to obtain the first hybrid code; the negative text code is concatenated with the image code to obtain the second hybrid code; and the standard text code is concatenated with the image code to obtain the standard hybrid code.

[0150] In one embodiment, the adaptation module is specifically used for:

[0151] The first object feature, the second object feature, and the standard object feature are respectively input into the feature adaptation network to obtain the first object adaptation feature, the second object adaptation feature, and the standard object adaptation feature;

[0152] According to the preset distance algorithm, the first feature distance between the first object adaptation feature and the standard object adaptation feature, and the second feature distance between the second object adaptation feature and the standard object adaptation feature are calculated.

[0153] In one embodiment, the training module is specifically used for:

[0154] The distance decision threshold is obtained by training a support vector machine based on the first feature distance and the second feature distance.

[0155] In one embodiment, as shown in Figure 9, an object recognition apparatus is provided, comprising: an acquisition module 910 and a recognition module 920, wherein:

[0156] The acquisition module 910 is used to acquire the target image corresponding to the target object, the text description information of the target object to the target image, and the standard text information corresponding to the target image;

[0157] The recognition module 920 is used to input the target image, the text description information and the standard text information into the object recognition model to obtain the category of the target object.

[0158] In one embodiment, the recognition module is specifically used for:

[0159] The target image, the text description information, and the standard text information are encoded using an encoding network to obtain the target hybrid encoding and the target standard hybrid encoding corresponding to the target object.

[0160] The target hybrid encoding and target standard hybrid encoding are respectively input into the feature extraction model to obtain the target object features and target standard object features corresponding to the target object;

[0161] The target feature distance is obtained by using a feature adaptation network, the target object features, and the target standard object features;

[0162] The target feature distance is compared with the distance decision threshold in the object recognition model, and the type of the target object is determined based on the comparison result.

[0163] Each module in the aforementioned object recognition device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0164] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 10. The computer device includes a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. The computer program is executed by the processor to implement the above methods. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.

[0165] Those skilled in the art will understand that the structure shown in Figure 10 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0166] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the above-described method steps.

[0167] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the above-described method steps.

[0168] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the above-described method steps.

[0169] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties.

[0170] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0171] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0172] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A training method for an object recognition model, characterized in that the method includes: obtaining a training sample set, which includes sample groups, each sample group including a sample image, positive text description information of a positive sample object for the sample image, negative text description information of a negative sample object for the sample image, and standard text information corresponding to the sample image; encoding the sample group based on an encoding network to obtain a first hybrid code corresponding to the positive sample object, a second hybrid code corresponding to the negative sample object, and a standard hybrid code; inputting the first hybrid code, the second hybrid code, and the standard hybrid code respectively into a feature extraction model to obtain a first object feature corresponding to the positive sample object, a second object feature corresponding to the negative sample object, and a standard object feature; obtaining a first feature distance and a second feature distance through a feature adaptation network, the first object feature, the second object feature, and the standard object feature; training a distance decision threshold and the feature adaptation network based on the first feature distance and the second feature distance; and constructing an object recognition model based on the encoding network, the feature extraction model, the trained feature adaptation network, and the distance decision threshold; wherein the object recognition model is used to identify the type of a target object based on the text description information provided by the target object, and the distance decision threshold is used to identify the language ability type of the target object.

2. The method according to claim 1, characterized in that, The encoding network includes a text encoder and an image encoder; the encoding process based on the encoding network to obtain the first hybrid code corresponding to the positive sample object, the second hybrid code corresponding to the negative sample object, and the standard hybrid code includes: The positive text description information, the negative text description information, and the standard text information are respectively input into the text encoder to obtain the positive text code corresponding to the positive text description information, the negative text code corresponding to the negative text description information, and the standard text code corresponding to the standard text information; The sample image is input into the image encoder to obtain the image code; The positive text code is concatenated with the image code to obtain the first mixed code; the negative text code is concatenated with the image code to obtain the second mixed code; and the standard text code is concatenated with the image code to obtain the standard mixed code.

3. The method according to claim 1, characterized in that, The process of obtaining the first feature distance and the second feature distance through the feature adaptation network, the first object feature, the second object feature, and the standard object feature includes: The first object feature, the second object feature, and the standard object feature are respectively input into the feature adaptation network to obtain the first object adaptation feature, the second object adaptation feature, and the standard object adaptation feature; According to a preset distance algorithm, the first feature distance between the first object adaptation feature and the standard object adaptation feature, and the second feature distance between the second object adaptation feature and the standard object adaptation feature are calculated.

4. The method according to claim 1, characterized in that training the distance decision threshold based on the first feature distance and the second feature distance includes: obtaining the distance decision threshold through support vector machine training based on the first feature distance and the second feature distance.

5. An object recognition method, characterized in that, The method further includes: Obtain the target image corresponding to the target object, the text description information of the target object to the target image, and the standard text information corresponding to the target image; The target image, the text description information, and the standard text information are input into the object recognition model to obtain the category of the target object; The object recognition model is trained using the object recognition model training method described in any one of claims 1 to 4.

6. The method according to claim 5, characterized in that, The object recognition model includes an encoding network, a feature extraction model, and a feature adaptation network; The step of inputting the target image, the text description information, and the standard text information into the object recognition model to obtain the category of the target object includes: The target image, the text description information, and the standard text information are encoded using an encoding network to obtain the target hybrid encoding and the target standard hybrid encoding corresponding to the target object. The target hybrid encoding and target standard hybrid encoding are respectively input into the feature extraction model to obtain the target object features and target standard object features corresponding to the target object; The target feature distance is obtained by using a feature adaptation network, the target object features, and the target standard object features; The target feature distance is compared with the distance decision threshold in the object recognition model, and the type of the target object is determined based on the comparison result.

7. A training device for an object recognition model, characterized in that, The device includes: The acquisition module is used to acquire a training sample set, which includes sample groups. Each sample group includes a sample image, positive text description information of the positive sample object to the sample image, negative text description information of the negative sample object to the sample image, and standard text information corresponding to the sample image. The encoding module is used to encode the sample group based on the encoding network to obtain a first hybrid code corresponding to the positive sample object, a second hybrid code corresponding to the negative sample object, and a standard hybrid code. The extraction module is used to input the first hybrid encoding, the second hybrid encoding, and the standard hybrid encoding into the feature extraction model respectively to obtain the first object feature corresponding to the positive sample object, the second object feature corresponding to the negative sample object, and the standard object feature; An adaptation module is used to obtain a first feature distance and a second feature distance through a feature adaptation network, the first object feature, the second object feature, and the standard object feature; The training module is used to train a distance decision threshold and the feature adaptation network based on the first feature distance and the second feature distance, and to construct an object recognition model based on the encoding network, the feature extraction model, the trained feature adaptation network and the distance decision threshold. The object recognition model is used to identify the type of the target object based on the text description information provided by the target object; the distance decision threshold is used to identify the language ability type of the target object.

8. An object recognition device, characterized in that, The device includes: The acquisition module is used to acquire the target image corresponding to the target object, the text description information of the target object to the target image, and the standard text information corresponding to the target image; The recognition module is used to input the target image, the text description information, and the standard text information into the object recognition model to obtain the category of the target object; The object recognition model is trained using the object recognition model training method described in any one of claims 1 to 4.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 4 or 5 to 6.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 4 or 5 to 6.