Training of image recognition model, image recognition method, device and storage medium

By adjusting the parameters of the image recognition model based on the localization accuracy and format fit of the sample recognition results, the method addresses the low accuracy and inconsistent output format of image recognition models, achieving a more efficient image recognition effect.

CN122244602A · Pending · Publication Date: 2026-06-19 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing image recognition models have low recognition accuracy, and the format of the recognition results does not conform to the standard data format.

Method used

Sample images are recognized using an image recognition model. The localization accuracy of the sample recognition results and the degree of fit between the predicted data format and the standard data format are determined and combined into a composite reward value. The model parameters are adjusted based on the composite reward value to improve the accuracy and format conformity of the recognition results.

Benefits of technology

This improved the accuracy and format consistency of the recognition results output by the image recognition model, achieving more efficient image recognition results.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122244602A_ABST
Patent Text Reader

Abstract

This application discloses an image recognition model training method, apparatus, electronic device, and storage medium, relating to the field of electronic information technology. The training method includes: recognizing sample images using an image recognition model to obtain sample recognition results; determining a composite reward value based on a first reward value and a second reward value of the sample recognition results; the first reward value indicating the localization accuracy of at least one predicted sample image region; the second reward value indicating the fit between the predicted data format and a preset standard data format, wherein the predicted data format is the format indicating at least one predicted sample image region; and adjusting the parameters of the image recognition model based on the composite reward value. According to the method of this application, the trained image recognition model has strong recognition ability and high recognition accuracy, and the output image recognition results have high accuracy.

Description

Technical Field

[0001] This application relates to the field of electronic information technology, and more specifically, to a training method for an image recognition model, an image recognition method, an apparatus, an electronic device, and a storage medium.

Background Technology

[0002] With the development of technology, visual recognition technology has become increasingly mature. For example, image recognition models can be used to identify whether an image contains animals, objects, or tampered content. However, a problem exists in related technologies: the recognition accuracy of image recognition models is relatively low.

Summary of the Invention

[0003] In view of this, embodiments of this application propose an image recognition model training method, an image recognition apparatus, an electronic device, and a storage medium.

[0004] In a first aspect, embodiments of this application provide a training method for an image recognition model. The method includes: recognizing a sample image using the image recognition model to obtain a sample recognition result for the sample image; the sample recognition result is used to indicate at least one predicted sample image region containing a target object in the sample image predicted by the image recognition model; determining a composite reward value based on a first reward value and a second reward value of the sample recognition result; the first reward value indicates the positioning accuracy of at least one predicted sample image region; the second reward value indicates the degree of fit between at least one predicted data format and a preset standard data format, wherein the predicted data format is the format of data indicating the predicted sample image region; the first reward value is determined based on the positional differences and / or overlap relationships between at least one predicted sample image region and the region where the target object is located in the sample image; and adjusting the parameters of the image recognition model based on the composite reward value.

[0005] In a second aspect, embodiments of this application provide an image recognition method, the method comprising: acquiring an image to be recognized; recognizing the image to be recognized using an image recognition model to obtain an image recognition result of the image to be recognized; the image recognition result being used to indicate whether the image to be recognized includes a target object and / or a predicted image region in the image to be recognized; wherein the image recognition model is trained according to the method in the first aspect.

[0006] In a third aspect, embodiments of this application provide a training apparatus for an image recognition model. The apparatus includes: a first recognition module, used to recognize a sample image through the image recognition model to obtain a sample recognition result of the sample image; the sample recognition result is used to indicate at least one predicted sample image region containing a target object in the sample image predicted by the image recognition model; a determination module, used to determine a composite reward value based on a first reward value and a second reward value of the sample recognition result; the first reward value indicates the positioning accuracy of at least one predicted sample image region; the second reward value indicates the degree of fit between at least one predicted data format and a preset standard data format, wherein the predicted data format is the format of data indicating the predicted sample image region; the first reward value is determined based on the positional differences and / or overlap relationships between at least one predicted sample image region and the region where the target object is located in the sample image; and an adjustment module, used to adjust the parameters of the image recognition model based on the composite reward value.

[0007] Optionally, the determining module is further configured to obtain the annotation results of the sample image; the annotation results of the sample image indicate at least one annotated image region in the sample image that includes the target object; match at least one predicted sample image region and at least one annotated image region to match a corresponding predicted sample image region for each annotated image region; determine a first reward value based on the positional difference and / or overlap relationship between each annotated image region and the corresponding predicted sample image region; determine a second reward value based on the format difference between the annotation data format of each annotated image region and the prediction data format of the corresponding predicted sample image region; the annotation data format of the annotated image region is the format indicating the data of the annotated image region; and determine a composite reward value based on the first reward value and the second reward value.
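
The matching and combination steps described above can be sketched as follows. This is a minimal illustration, not the method of this application: the `(x1, y1, x2, y2)` box layout, the greedy one-to-one matching by overlap, the function names, and the weights `w_loc` and `w_fmt` are all assumptions introduced for the example, since the text does not fix how regions are matched or how the two reward values are combined.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_regions(annotated: List[Box], predicted: List[Box]) -> Dict[int, int]:
    """Greedily give each annotated box its best still-unused predicted box.

    Returns a mapping annotated index -> predicted index; annotated boxes
    with no overlapping prediction are left out of the mapping.
    """
    pairs = sorted(
        ((iou(a, p), i, j) for i, a in enumerate(annotated) for j, p in enumerate(predicted)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, i, j in pairs:
        if score > 0 and i not in matches and j not in used:
            matches[i] = j
            used.add(j)
    return matches


def composite_reward(first: float, second: float,
                     w_loc: float = 0.8, w_fmt: float = 0.2) -> float:
    """Weighted sum of the localization and format rewards (assumed form)."""
    return w_loc * first + w_fmt * second
```

A weighted sum is only one possible combination; a product or a gated form (format reward as a precondition for any localization reward) would fit the same description.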

[0008] Optionally, the determining module is further configured to, for each annotated image region, determine a region localization offset reward value corresponding to the annotated image region based on the overlap relationship between the annotated image region and the corresponding predicted sample image region; determine a region quantity offset reward value corresponding to the annotated image region based on the positional difference between the annotated image region and the corresponding predicted sample image region; determine a single-region localization reward value corresponding to the annotated image region based on the region localization offset reward value and / or the region quantity offset reward value corresponding to the annotated image region; and determine the first reward value based on the single-region localization reward values corresponding to multiple annotated image regions.
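
As a rough sketch of how a per-region localization reward could be built from an overlap term and a positional-difference term and then aggregated, consider the following. The concrete choices here are assumptions for illustration only: intersection over union for the overlap term, a centre-distance score normalised by the annotated box diagonal for the positional term, the mixing weight `alpha`, averaging as the aggregation, and a zero reward for annotated regions without a matched prediction.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Overlap term: intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_offset_reward(annotated: Box, predicted: Box) -> float:
    """Positional term: 1 when centres coincide, falling to 0 with distance."""
    ax, ay = (annotated[0] + annotated[2]) / 2, (annotated[1] + annotated[3]) / 2
    px, py = (predicted[0] + predicted[2]) / 2, (predicted[1] + predicted[3]) / 2
    diag = math.hypot(annotated[2] - annotated[0], annotated[3] - annotated[1])
    if diag <= 0:
        return 0.0
    return max(0.0, 1.0 - math.hypot(ax - px, ay - py) / diag)


def first_reward(pairs: List[Tuple[Box, Optional[Box]]], alpha: float = 0.5) -> float:
    """Average the single-region rewards over all annotated regions.

    Each pair is (annotated box, matched predicted box or None).
    """
    if not pairs:
        return 0.0
    total = 0.0
    for annotated, predicted in pairs:
        if predicted is None:
            continue  # unmatched annotated region contributes 0 (assumption)
        total += alpha * iou(annotated, predicted) \
            + (1 - alpha) * center_offset_reward(annotated, predicted)
    return total / len(pairs)
```

Dividing by the number of annotated regions rather than the number of matches means missed regions lower the reward, which is one way to penalise under-detection.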

[0009] Optionally, the determining module is further configured to, for each annotated image region, if the annotation data format of the annotated image region is consistent with the prediction data format of the corresponding predicted sample image region, obtain a first value as the single-region format reward value of the annotated image region; if the annotation data format of the annotated image region is inconsistent with the prediction data format of the corresponding predicted sample image region, obtain a second value as the single-region format reward value of the annotated image region; the first value is greater than the second value; and the second reward value is determined based on the single-region format reward values of multiple annotated image regions.
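
A minimal sketch of such a two-valued format reward is shown below. The standard data format is not specified in this passage, so the example assumes a hypothetical one: a JSON object with a single `bbox` key holding four numbers. The constants `FIRST_VALUE` and `SECOND_VALUE` and the use of a mean for aggregation are likewise illustrative assumptions.

```python
import json
from typing import List

FIRST_VALUE = 1.0   # reward when the format is consistent (assumed value)
SECOND_VALUE = 0.0  # reward when the format is inconsistent (assumed value)


def is_standard_format(text: str) -> bool:
    """Check a prediction against the assumed format {"bbox": [x1, y1, x2, y2]}."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != {"bbox"}:
        return False
    box = data["bbox"]
    return (
        isinstance(box, list)
        and len(box) == 4
        # bool is a subclass of int in Python, so exclude it explicitly
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box)
    )


def second_reward(predicted_texts: List[str]) -> float:
    """Average the single-region format rewards over all regions."""
    if not predicted_texts:
        return SECOND_VALUE
    values = [FIRST_VALUE if is_standard_format(t) else SECOND_VALUE
              for t in predicted_texts]
    return sum(values) / len(values)
```

For example, a batch containing one well-formed prediction and one free-text prediction such as `box at 1,2,3,4` would receive a format reward halfway between the two values.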

[0010] Optionally, there are multiple sample images; the adjustment module is further configured to recognize the multiple sample images separately using the parameter-adjusted image recognition model, and obtain the verification recognition result for each sample image; the verification recognition result of the sample image is used to indicate the verification image region containing the target object in the sample image predicted by the parameter-adjusted image recognition model; based on the difference between the verification recognition result and the annotation result corresponding to each sample image, candidate sample images are determined from the multiple sample images; the annotation result of the sample image indicates the annotation image region containing the target object in the sample image; based on the verification recognition result corresponding to the candidate sample image, the composite reward value of the candidate sample image is determined; based on the composite reward value of the candidate sample image, the parameters of the parameter-adjusted image recognition model are adjusted.

[0011] Optionally, each annotated image region corresponds to a verification image region; the adjustment module is also used to determine, for each sample image, the image intersection-over-union (IoU) of the sample image based on the overlap relationship between each annotated image region of the sample image and its corresponding verification image region; if the image IoU of the sample image is less than a preset IoU threshold, the sample image is taken as a candidate sample image.
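
The candidate selection described above amounts to keeping the samples the adjusted model still localizes poorly. The sketch below illustrates this under stated assumptions: boxes use a hypothetical `(x1, y1, x2, y2)` layout, the image-level value is taken as the mean of the per-region overlaps, and the threshold of 0.5 and all names are placeholders.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def image_iou(pairs: List[Tuple[Box, Box]]) -> float:
    """Mean overlap between annotated regions and their verification regions."""
    if not pairs:
        return 0.0
    return sum(iou(annotated, verified) for annotated, verified in pairs) / len(pairs)


def select_candidates(samples: Dict[str, List[Tuple[Box, Box]]],
                      threshold: float = 0.5) -> List[str]:
    """Return the ids of sample images whose image-level overlap is below the threshold."""
    return [name for name, pairs in samples.items() if image_iou(pairs) < threshold]


samples = {
    "easy": [((0, 0, 10, 10), (0, 0, 10, 10))],   # perfectly localized
    "hard": [((0, 0, 10, 10), (5, 5, 15, 15))],   # poorly localized
}
hard_ids = select_candidates(samples)
```

Only the selected samples would then be scored again with the composite reward and used for the further round of parameter adjustment.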

[0012] Optionally, there are multiple sample images; each sample image corresponds to its own composite reward value; the adjustment module is also used to determine the relative advantage value of each sample image based on the composite reward values corresponding to the multiple sample images; the relative advantage value is used to indicate the difference between the composite reward value of the corresponding sample image and the average level of the composite reward values of the multiple sample images; the parameters of the image recognition model are adjusted based on the relative advantage values of the multiple sample images.
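
A relative advantage of this kind can be sketched as the reward minus the group mean. The optional division by the group standard deviation follows common practice in group-relative policy optimization and is an assumption here, as are the function name and the `eps` stabiliser; the text itself only requires the difference from the average level.

```python
from statistics import mean, pstdev
from typing import List


def relative_advantages(rewards: List[float], normalize: bool = True,
                        eps: float = 1e-8) -> List[float]:
    """Centre each composite reward on the group mean, optionally scaling by the std."""
    if not rewards:
        return []
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize:
        return centered
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [c / (sigma + eps) for c in centered]
```

Samples whose composite reward exceeds the group average receive a positive advantage and are reinforced; those below it receive a negative advantage. If every sample in the group has the same reward, all advantages are zero and the group contributes no update.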

[0013] Optionally, the image recognition model includes a visual encoder and a result predictor; the first recognition module is further configured to extract image features from the sample image through the visual encoder to obtain the sample image features corresponding to the sample image; and to perform prediction processing on the sample image based on the sample image features through the result predictor to obtain the sample recognition result.

[0014] Optionally, the adjustment module is also used to freeze the model parameters of the visual encoder and adjust the parameters of the result predictor based on the composite reward value, thereby adjusting the parameters of the image recognition model.

[0015] Optionally, the device further includes a pre-training module, used to extract image features from the pre-training sample images through the initial visual encoder to obtain the pre-training sample image features corresponding to the pre-training sample images; for each target sample region in multiple target sample regions in the pre-training sample images, the result predictor performs prediction processing on the target sample region based on the pre-training sample image features to obtain the region prediction probability of the target sample region; each target sample region includes a target object, and the region prediction probability is used to indicate the probability that the result predictor predicts that the corresponding target sample region includes a target object; based on the region prediction probabilities of multiple target sample regions, the parameters of the initial visual encoder are adjusted to obtain the parameter-adjusted initial visual encoder, which serves as the visual encoder.

[0016] Optionally, the pre-training module is further configured to sort multiple target sample regions based on the positional relationship between multiple target sample regions in the pre-training sample image to obtain a target sample region sequence; and through the result predictor, according to the arrangement order of each target sample region in the target sample region sequence, perform prediction processing on each target sample region based on the features corresponding to each target sample region in the features of the pre-training sample image to obtain the region prediction probability of each target sample region.

[0017] Optionally, the pre-training module is also used to obtain the reference point corresponding to each target sample region in the pre-training sample image; each reference point is located within the corresponding target sample region; based on the positional relationship between the reference points corresponding to each of the multiple target sample regions in the pre-training sample image, the multiple target sample regions are sorted to obtain a target sample region sequence.
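
One way to read the sorting step is as a reading-order sort over reference points. The sketch below assumes, purely for illustration, that the reference point is the box centre, that boxes use a hypothetical `(x1, y1, x2, y2)` layout, and that the order is top to bottom and then left to right; the passage does not specify the reference point or the ordering rule.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def reference_point(box: Box) -> Tuple[float, float]:
    """Use the box centre as the reference point (it always lies inside the box)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def sort_regions(boxes: List[Box]) -> List[Box]:
    """Order regions by reference point: smaller y first, then smaller x."""
    def key(box: Box) -> Tuple[float, float]:
        cx, cy = reference_point(box)
        return (cy, cx)
    return sorted(boxes, key=key)


regions = [(50, 0, 60, 10), (0, 20, 10, 30), (0, 0, 10, 10)]
sequence = sort_regions(regions)
```

A fixed ordering of this kind gives the result predictor a deterministic target sequence, so that the same set of regions always maps to the same output order.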

[0018] Optionally, the visual encoder includes multiple key visual feature extraction layers, each of which is located before the last feature extraction layer of the visual encoder; the result predictor includes an associated prediction layer corresponding to each key visual feature extraction layer; the first recognition module is further configured to input sample image features into the first prediction layer of the result predictor and input the features extracted by each key visual feature extraction layer into the corresponding associated prediction layer, so that the result predictor performs prediction processing on the sample image based on the sample image features and the features extracted by each key visual feature extraction layer to obtain the sample recognition result.

[0019] In a fourth aspect, embodiments of this application provide an image recognition device, the device comprising: an acquisition module for acquiring an image to be recognized; a second recognition module for recognizing the image to be recognized using an image recognition model to obtain an image recognition result of the image to be recognized; the image recognition result is used to indicate whether the image to be recognized includes a target object and / or a predicted image region in the image to be recognized; wherein the image recognition model is trained according to the method in the first aspect.

[0020] In a fifth aspect, embodiments of this application provide an electronic device, including a processor and a memory; the memory stores computer-readable instructions, which, when executed by the processor, implement the above-described method.

[0021] In a sixth aspect, embodiments of this application provide a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the above-described method.

[0022] In a seventh aspect, embodiments of this application provide a computer program product, including computer instructions, which, when executed by a processor, implement the method described above.

[0023] This application provides a training method for an image recognition model, an apparatus, an electronic device, and a storage medium. A sample image is first recognized using the image recognition model to obtain a sample recognition result. A composite reward value is then determined from a first reward value, which indicates the localization accuracy of the predicted sample image region in the sample image, and a second reward value, which indicates the degree of fit between the predicted data format of the predicted sample image region and a preset standard data format. The composite reward value therefore reflects both how accurately the image recognition model recognizes the sample image and whether the format of its output conforms to the standard data format. Consequently, after the parameters of the image recognition model are adjusted based on the composite reward value, the recognition results output by the model are more accurate in content and conform more closely to the standard data format, improving the accuracy of both the content and the format of the output.

Description of the Drawings

[0024] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0025] Figure 1 is a schematic diagram of an application scenario applicable to the embodiments of this application; Figure 2 is a flowchart of a training method for an image recognition model according to an embodiment of this application; Figure 3 is a flowchart of a training method for an image recognition model according to another embodiment of this application; Figure 4 is a flowchart of an image recognition method according to yet another embodiment of this application; Figure 5 is a schematic diagram of the training process of an image recognition model according to an embodiment of this application; Figure 6 is a block diagram of a training apparatus for an image recognition model according to an embodiment of this application; Figure 7 is a block diagram of an image recognition device according to an embodiment of this application; Figure 8 is a structural block diagram of an electronic device for performing an image recognition method according to an embodiment of this application.

Detailed Implementation

[0026] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0027] In the following description, the terms "first" and "second" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0028] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit the application. It should be noted that "multiple" as used herein refers to two or more. "And / or" describes the association relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship.

[0029] The definitions of proper nouns and abbreviations used in this application are as follows: DITL (Document Image Tampering Localization): A technique that aims to locate manipulated areas within a document image. In this application, tampered content in a document image can be one kind of target object.

[0030] Native MLLM (Native Multimodal Large Language Model): A multimodal large language model that directly builds an end-to-end generative mapping from raw pixels to spatial coordinates, without relying on external expert detectors. In this application, a native MLLM typically includes a visual encoder, an aligner, and a result predictor, with the result predictor usually being a large language model.

[0031] Hybrid Expert MLLMs: Existing joint architectures that utilize specialized visual networks to extract low-level features and use fine-tuned MLLMs to generate interpretive insights based on the detector's discrimination results.

[0032] FFT (Full Fine-Tuning): Full fine-tuning (also called full adjustment) involves completely updating the parameters of the visual encoder to enable it to capture forensic features and high-frequency signals specific to the task. In this application, FFT is mainly used in the pre-training process of the visual encoder.

[0033] GRPO (Group Relative Policy Optimization): An alignment method used in reinforcement learning. It explores spatial alignment by generating multiple rollouts from the model's own distribution, in order to stabilize training and refine output boundaries with high precision. In this application, the GRPO strategy is primarily used in the training process of the result predictor.

[0034] Negative Scaling Effect: In document forensics tasks, the counterintuitive phenomenon that the accuracy of localization decreases as the model parameter size increases.

[0035] Semantic inertia: A phenomenon in large models where strong pre-existing semantic priors hinder the transformation of the model from general object recognition to pixel-level forgery artifact detection.

[0036] Plasticity Inversion: Smaller models are able to reconstruct deeper layers to extract forensic features, while larger models exhibit a similarity phenomenon at deeper layers, bouncing back to pre-trained semantic features.

[0037] Please refer to Figure 1, which illustrates an application scenario applicable to the embodiments of this application. This application scenario includes a terminal 110, a server 120, and a database 130.

[0038] Terminal 110 can be a smartphone, tablet, e-book reader, music player, wearable device, smart home device, in-vehicle terminal, etc. Terminal 110 has an image recognition client installed. This client can recognize the image to be recognized to obtain the image recognition result, or it can send an image recognition request so that server 120 can recognize the image to be recognized based on the request, obtain the image recognition result, and return the result to the client in terminal 110.

[0039] The server 120 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.

[0040] Database 130 can be an independent storage device, a cluster or distributed system consisting of multiple storage devices, or a cloud storage device providing cloud services. Database 130 can store sample data, which may include sample images and pre-trained sample images, and may also include annotation results of sample images, etc.

[0041] In some implementations, server 120 can obtain sample images from database 130, train an image recognition model based on the sample images, and deploy the trained image recognition model locally for image recognition.

[0042] Of course, when the terminal 110 implements the image recognition process of this application, after the server 120 obtains the trained image recognition model, it can send the trained image recognition model to the terminal 110, and the terminal 110 stores the trained image recognition model so that the terminal can implement image recognition.

[0043] It is easy to understand that the terminal 110 can also obtain sample images from the database 130 through the server 120 and train the image recognition model based on the sample images.

[0044] It is worth mentioning that, in this application, the visual encoder and result predictor in the image recognition model can also be trained in the terminal 110 or in the server 120 according to the aforementioned process, which will not be elaborated here.

[0045] In some other implementations, database 130 may be a storage module integrated within server 120 to store sample data.

[0046] In addition, terminal 110 can generate and send an image recognition request to server 120 (through an image recognition client). Server 120 recognizes the image to be recognized based on the image recognition request to obtain the image recognition result of the image to be recognized. Finally, server 120 returns the image recognition result of the image to be recognized to terminal 110 so that terminal 110 can display the image recognition result of the image to be recognized.

[0047] In some implementations, the server 120 can directly respond to local image recognition requests (which may be triggered by an image recognition client installed locally) to recognize the image to be recognized, so as to obtain the image recognition result of the image to be recognized.

[0048] Alternatively, the terminal 110 itself can directly respond to a local image recognition request (which may be triggered by an image recognition client installed locally) to recognize the image to be recognized, so as to obtain the image recognition result of the image to be recognized.

[0049] To explain the solution of this application more clearly, the following embodiments are all described with an electronic device as the execution subject of both the training method of the image recognition model and the image recognition method.

[0050] Please refer to Figure 2, which shows a flowchart of a training method for an image recognition model according to an embodiment of this application. The method is applied to an electronic device, which may be the terminal 110 or the server 120 in Figure 1. The method may include:

S110. The sample image is recognized using an image recognition model to obtain the sample recognition result.

[0051] The sample recognition result is used to indicate at least one predicted sample image region in the sample image predicted by the image recognition model that includes the target object. The target object can refer to the object identified by the image recognition model. The target object can be a concrete object, such as a specific animal, person, building, vehicle, or smart home device. The target object can also be an abstract object, such as tampered content in a document image, AI (Artificial Intelligence) generated content in a document image, or infringing content in a document image.

[0052] Sample images can refer to images that include the target object. Thus, when training an image recognition model, the model can learn to locate the target object. Alternatively, sample images can also include images with and without the target object. Thus, when training an image recognition model, the model can more accurately distinguish between the target object and non-target objects, and its ability to locate the target object is stronger.

[0053] The image recognition model can be the visual encoder of a CLIP (Contrastive Language-Image Pre-training, a multimodal pre-trained model) model, the encoder of a variational autoencoder, a BERT model, a convolutional neural network, etc.

[0054] Accordingly, S110 may include: inputting the sample image into the image recognition model for recognition, so that the image recognition model predicts directly from the sample image and outputs a prediction result, which serves as the sample recognition result. In this case, if the sample image includes a target object, the region (usually at least one region) in the sample image that the image recognition model recognizes as including the target object is used as the predicted sample image region. The predicted sample image region can be indicated by a rectangular selection box, which in turn can be indicated by its corner coordinates and side lengths. Of course, the sample recognition result may also include the prediction probability of each predicted sample image region, where the prediction probability is the probability that a target object exists in the predicted sample image region.
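
For concreteness, a predicted sample image region described by a corner coordinate, side lengths, and a prediction probability could be represented as below. The class name, the field names, and the choice of the top-left corner as the anchor are assumptions made for this sketch; the text only states that corner coordinates and side lengths indicate the box.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PredictedRegion:
    """Rectangular selection box given by one corner plus side lengths."""
    x: float                    # x coordinate of the top-left corner (assumed anchor)
    y: float                    # y coordinate of the top-left corner (assumed anchor)
    width: float                # horizontal side length
    height: float               # vertical side length
    probability: float = 1.0    # probability that the region contains the target object

    def corners(self) -> Tuple[float, float, float, float]:
        """Return the box as (x1, y1, x2, y2) corner coordinates."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def area(self) -> float:
        """Area of the rectangular selection box."""
        return self.width * self.height


region = PredictedRegion(x=10, y=20, width=30, height=40, probability=0.9)
```

Converting to corner form is convenient when overlap with an annotated region has to be computed later.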

[0055] Of course, image recognition models can also be composite models (models composed of multiple models). For example, an image recognition model can be a composite model consisting of a visual encoder and a result predictor. The visual encoder recognizes the image, and the result predictor predicts the recognition result based on the features extracted by the visual encoder.

[0056] The visual encoder can be the visual encoder of a CLIP model, the encoder of a variational autoencoder, FFDN, ADCD-Net, ASCForme, etc. The result predictor can include a fully connected network, a classifier, a large language model, etc.

[0057] Large Language Models (LLMs) are deep learning models trained on massive amounts of text data, enabling them to generate natural language text or understand the meaning of language text. These models can provide in-depth knowledge and language production on a wide range of topics through training on large datasets. Their core idea is to learn patterns and structures of natural language through large-scale unsupervised training, thus mimicking human language cognition and generation processes to some extent.

[0058] Accordingly, S110 may include: extracting image features from the sample image using a visual encoder to obtain sample image features corresponding to the sample image; and performing prediction processing on the sample image based on the sample image features using a result predictor to obtain the sample recognition result. In this case, when the sample image includes a target object, the region in the sample image recognized by the image recognition model that includes the target object is used as the predicted sample image region.

[0059] Generally, the image recognition model is a pre-trained native MLLM, which includes a visual encoder, an aligner, and a result predictor composed of a large language model. The input of the large language model is usually text tokens, whereas the output of the visual encoder is image tokens (each image token is the feature of one primitive in the image, where a primitive is a pixel region of the image). The aligner therefore performs feature alignment on the sample image features output by the visual encoder to obtain text tokens that the result predictor can process, which serve as the sample alignment features. The result predictor then performs prediction processing on the sample image based on the sample alignment features to obtain the sample recognition result.

[0060] Correspondingly, a prompt can be constructed based on the text token produced by the aligner. This prompt is then input into the large language model, which performs prediction processing based on the prompt to obtain the primitives identified as containing the target object. Primitives within a connected component that are identified as containing the target object constitute a predicted sample image region. The mean or weighted sum of the probabilities that the primitives within the same predicted sample image region contain the target object is used as the prediction probability of that predicted sample image region. Typically, a primitive whose probability of containing the target object is greater than a preset threshold (e.g., 0.8 or 0.7) is considered a primitive containing the target object.
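The grouping of above-threshold primitives into regions can be illustrated with a minimal sketch. It assumes the primitive probabilities are already available as a 2D grid, uses 4-connectivity (the text does not say which connectivity is used), and takes the plain mean as the region probability. The function name `predicted_regions` is hypothetical.

```python
from collections import deque

def predicted_regions(prob_grid, threshold=0.8):
    """Group above-threshold primitives into 4-connected regions.

    Returns a list of (cells, region_probability), where region_probability
    is the mean of the primitive probabilities inside the region.
    """
    rows, cols = len(prob_grid), len(prob_grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or prob_grid[r][c] <= threshold:
                continue
            # Breadth-first search over neighbouring primitives above the threshold.
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and not seen[nr][nc] and prob_grid[nr][nc] > threshold:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            mean_prob = sum(prob_grid[i][j] for i, j in cells) / len(cells)
            regions.append((sorted(cells), mean_prob))
    return regions

# Illustrative grid of primitive probabilities (made-up values).
grid = [
    [0.90, 0.85, 0.10],
    [0.20, 0.10, 0.10],
    [0.10, 0.10, 0.95],
]
regions = predicted_regions(grid)
```

Here the two adjacent primitives in the top row form one region and the isolated primitive in the bottom-right corner forms another.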

[0061] In some implementations, the visual encoder includes multiple key visual feature extraction layers, each located before the last feature extraction layer of the visual encoder; the result predictor includes a corresponding association prediction layer for each key visual feature extraction layer; accordingly, the aforementioned prediction processing of the sample image based on the sample image features by the result predictor to obtain the sample recognition result includes: inputting the sample image features into the first prediction layer of the result predictor and inputting the features extracted by each key visual feature extraction layer into the corresponding association prediction layer, so that the result predictor performs prediction processing on the sample image based on the sample image features and the features extracted by each key visual feature extraction layer to obtain the sample recognition result.

[0062] In other words, a portion of the network layers of the visual encoder and a portion of the network layers of the result predictor are connected to achieve cross-layer connection between the visual encoder and the result predictor. In this application, this cross-layer connection can be achieved through residual connections or other cross-layer connection methods. Of course, the sample image features and the features extracted by each key visual feature extraction layer can also be aligned by an aligner before being input into the first prediction layer and the corresponding associated prediction layer of the result predictor.

[0063] Therefore, instead of simply inputting the sample image features output from the last feature extraction layer of the visual encoder into the first prediction layer of the result predictor, the features extracted by the key visual feature extraction layer in the middle are input into the corresponding associated prediction layer. This achieves cross-layer feature fusion, enabling gradients to propagate back more effectively, mitigating the gradient vanishing problem, supporting deep network training, and reducing information loss. This allows deep models to better utilize the outputs of each layer, improving training efficiency and convergence speed. Furthermore, by fusing low-level details with high-level semantics, the model's ability to capture complex patterns is enhanced, along with its feature representation capabilities. Moreover, the fusion of visual and textual modalities also improves multimodal understanding capabilities.

[0064] In addition, to improve the recognition performance of the image recognition model, pre-training can be performed before the image recognition model is actually trained. This process can include: A1. Extract image features from the pre-training sample images using the initial visual encoder to obtain the pre-training sample image features corresponding to the pre-training sample images.

[0065] A2. For each target sample region in multiple target sample regions in the pre-training sample image, the result predictor performs prediction processing on the target sample region based on the features of the pre-training sample image to obtain the region prediction probability of the target sample region; each target sample region includes the target object, and the region prediction probability is used to indicate the probability that the corresponding target sample region includes the target object.

[0066] A3. Based on the region prediction probability of multiple target sample regions, the parameters of the initial visual encoder are adjusted to obtain the parameter-adjusted initial visual encoder, which is used as the visual encoder.

[0067] The pre-training sample images can refer to images that include the target object. Thus, the image recognition model obtained through pre-training can learn the ability to find the target object. Alternatively, the pre-training sample images can also include images with and without the target object. Thus, the image recognition model obtained through pre-training can more accurately distinguish between the target object and non-target objects, and has a higher ability to find the target object.

[0068] The target sample region refers to the region in the pre-training sample image that includes the target object. It can be obtained by the user annotating the pre-training sample image. The target sample region can be in the form of a selection box, etc.

[0069] First, an initial visual encoder and a result predictor (and optionally an aligner) are obtained as the image recognition model to be pre-trained. Then, the initial visual encoder extracts image features from the pre-training sample images to obtain the pre-training sample image features corresponding to the pre-training sample images. After that, the result predictor performs prediction processing on each target sample region based on the pre-training sample image features to obtain the region prediction probability of each target sample region in the pre-training sample images.

[0070] If the image recognition model to be pre-trained also includes an aligner, the pre-training sample image features can be aligned using the aligner to obtain aligned pre-training sample image features. Then, the result predictor can predict each target sample region based on the aligned pre-training sample image features to obtain the region prediction probability of each target sample region in the pre-training sample image.

[0071] Typically, pre-training sample images can be segmented into multiple primitives, each representing a pixel region within the image. An initial visual encoder extracts features from these primitives, resulting in individual primitive features for each primitive. These features are then aggregated to form the pre-training sample image features. Next, a result predictor uses these features to predict the probability that each primitive includes the target object—the primitive prediction probability. For each target sample region, the region prediction probability—the probability that the target sample region includes the target object—is determined based on the primitive prediction probabilities of each primitive within that region. For example, the weighted sum or average of the primitive prediction probabilities of each primitive within the target sample region can be used as the region prediction probability. The set of region prediction probabilities for all target sample regions serves as the pre-training sample recognition result. Finally, a visual loss value is constructed based on the region prediction probabilities for each target sample region. This visual loss value is then used to adjust the parameters of the initial visual encoder, resulting in a parameter-adjusted initial visual encoder, which serves as the final visual encoder. The visual loss value here can be determined using the cross-entropy loss function, the mean squared error loss function, or the absolute value error loss function.
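As a minimal sketch of the aggregation and loss described above, the following assumes the primitive prediction probabilities of each target sample region are given as plain lists, and picks the cross-entropy option with every target sample region treated as a positive example (so the loss reduces to a sum of negative log probabilities). The function names and the example numbers are illustrative only.

```python
import math

def region_probability(primitive_probs, weights=None):
    """Average (or weighted sum) of the primitive prediction probabilities."""
    if weights is None:
        return sum(primitive_probs) / len(primitive_probs)
    return sum(w * p for w, p in zip(weights, primitive_probs))

def visual_loss(region_probs, eps=1e-12):
    """Cross-entropy with all target regions labelled positive: sum of -log(p)."""
    return -sum(math.log(max(p, eps)) for p in region_probs)

# Two target sample regions with made-up primitive probabilities.
regions = [[0.9, 0.8, 0.7], [0.6, 0.4]]
probs = [region_probability(r) for r in regions]
loss = visual_loss(probs)
```

The loss shrinks toward zero as every region prediction probability approaches 1, which is the behaviour the pre-training objective relies on.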

[0072] Specifically, when the result predictor performs prediction processing on the pre-training sample image based on its features to obtain the prediction probability of each primitive, it actually performs prediction processing on each primitive separately, based on the features corresponding to that primitive in the pre-training sample image features. Similarly, when the result predictor performs prediction processing on the pre-training sample image based on its features to obtain the region prediction probability of each target sample region, it actually performs prediction processing on each target sample region separately, based on the features corresponding to that target sample region in the pre-training sample image features (that is, the features corresponding to the primitives included in the target sample region).

[0073] In some implementations, prior to step A2, the method may further include: A4. Based on the positional relationship between multiple target sample regions in the pre-training sample image, sort the multiple target sample regions to obtain a target sample region sequence. Correspondingly, A2 includes: using the result predictor to predict each target sample region according to the arrangement order of the target sample regions in the target sample region sequence, based on the features corresponding to each target sample region in the pre-training sample image features, to obtain the region prediction probability of each target sample region.

[0074] First, a region arrangement order can be set. For example, the region arrangement order can be from left to right and from top to bottom, or from right to left and from top to bottom, or from right to left and from bottom to top.

[0075] Next, based on the positional relationships of multiple target sample regions in the pre-training sample image, the target sample regions are sorted according to their arrangement order to obtain a target sample region sequence. For example, the region arrangement order can be from left to right or from top to bottom. Correspondingly, in the same row of the pre-training sample image, the target sample region on the left is placed before the target sample region on the right in the target sample region sequence, and in the same column of the pre-training sample image, the target sample region above is placed before the target sample region below in the target sample region sequence.

[0076] However, when the sizes of the target sample regions are inconsistent, it is difficult to determine whether they are in the same row or column. Therefore, A4 may also include: obtaining the reference point corresponding to each target sample region in the pre-training sample image; each reference point being located within its corresponding target sample region; and sorting the multiple target sample regions based on the positional relationship between the reference points corresponding to each target sample region in the pre-training sample image to obtain a sequence of target sample regions. For example, the reference point can be the center point, the upper left corner point, or the upper right corner point of the target sample region.

[0077] In other words, each target sample region is represented by a reference point, and the positional relationship between multiple target sample regions is indicated by the positional relationship between these reference points. For example, the regions can be arranged from left to right or from top to bottom. For any two target sample regions $b_i$ and $b_j$ whose reference points have coordinates $(x_i, y_i)$ and $(x_j, y_j)$: if $x_i < x_j$, then $b_i$ is ranked before $b_j$; or, if $y_i < y_j$, then $b_i$ is ranked before $b_j$.
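Sorting by reference points can be sketched as follows. The sketch assumes rectangular regions given as (x1, y1, x2, y2) in image coordinates (y increasing downward), uses the center point as the reference point, and combines the two orders into a single row-major order (top to bottom first, then left to right). That combination is an assumption; the text only names the individual orders.

```python
def center(box):
    """Reference point of a region: the center of its bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def sort_regions(boxes):
    """Order regions by reference point: smaller y first, then smaller x."""
    return sorted(boxes, key=lambda b: (center(b)[1], center(b)[0]))

# Three regions of different sizes (made-up coordinates).
boxes = [(50, 40, 70, 60), (10, 10, 30, 30), (60, 10, 80, 30)]
ordered = sort_regions(boxes)
```

Because only one point per region is compared, regions of different sizes are ordered unambiguously even when they do not share an exact row or column.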

[0078] In practice, the result predictor performs prediction processing on each target sample region in the order of their arrangement in the target sample region sequence. For each target sample region in the prediction process, the result predictor performs prediction processing on the target sample region based on the primitive features of the primitives in the target sample region (that is, the features in the pre-trained sample image features in step A2 above that correspond to the target sample region) to obtain the probability that each primitive in the target sample region includes the target object (primitive prediction probability). Then, based on the probability that each primitive in the target sample region includes the target object, the probability that the target sample region includes the target object is further determined (region prediction probability).

[0079] Next, the visual loss value is determined based on the region prediction probabilities corresponding to the multiple target sample regions. Then, all parameters of the initial visual encoder are adjusted using the visual loss value (at this time, the model parameters of the result predictor and the aligner can be kept fixed) to obtain the parameter-adjusted initial visual encoder, which serves as the visual encoder.

[0080] For example, the visual loss value can be determined from the region prediction probabilities corresponding to the multiple target sample regions according to Formula 1:

$$L_{vis} = -\sum_{k=1}^{K} \log p(y_k) \tag{1}$$

where $L_{vis}$ is the visual loss value, $y = (y_1, \dots, y_K)$ is the target sample region sequence formed by the $K$ target sample regions, and $p(y_k)$ is the region prediction probability, i.e., the probability that the target sample region $y_k$ includes the target object.

[0081] Given the visual loss value determined according to Formula 1, all parameters of the initial visual encoder can be adjusted with the goal of minimizing the visual loss value. During training, the learning rate (LR) can be set to 1e-4 with a cosine decay scheduler, the warmup ratio (the fraction of training steps at the start of training over which the learning rate ramps up) set to 0.01, and the batch size set to 1 (i.e., one pre-training sample image constitutes one batch).
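The schedule described above can be sketched in a few lines. The sketch assumes linear warmup and a decay to zero, neither of which is stated in the text; only the base learning rate of 1e-4, the cosine decay, and the warmup ratio of 0.01 come from the paragraph. The function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(step, total_steps, base_lr=1e-4, warmup_ratio=0.01):
    """Linear warmup followed by cosine decay (warmup shape is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly to base_lr over the warmup steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# With 1000 steps, the first 10 steps are warmup.
schedule = [learning_rate(s, 1000) for s in range(1000)]
```

With a batch size of 1, each step in this schedule corresponds to one pre-training sample image.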

[0082] Therefore, by pre-training the visual encoder, the shallow feature space of the visual encoder is reconstructed into a high-frequency signal amplifier, which enables it to capture micro-forensic features such as image compression artifacts and resampling interpolation that are smoothed out in the general pre-training stage. This weakens the negative scaling effect, improves the feature extraction capability of the visual encoder, and makes the features extracted by the visual encoder more accurate.

[0083] Furthermore, the constructed target sample region sequence is used to constrain the generation order of the target sample regions by the model (mainly constraining the result predictor). Thus, during the training process, the target sample region sequence constrains the scanning order of the image tokens of the primitives when the result predictor performs prediction processing on the sample images. This ensures that the result predictor can comprehensively cover the image tokens of each primitive when performing prediction processing, avoiding the omission of image tokens and making the prediction results of the result predictor more accurate.

[0084] S120. Based on the first reward value and the second reward value of the sample recognition result, determine the composite reward value.

[0085] The first reward value indicates the localization accuracy of at least one predicted sample image region; the second reward value indicates the degree of fit between the predicted data format and the preset standard data format, wherein the predicted data format is the format of the data indicating at least one predicted sample image region; the first reward value is determined based on the positional differences and / or overlap between at least one predicted sample image region and the region where the target object is located in the sample image. The region where the target object is located in the sample image refers to a region in the sample image that includes the entire target object or a region that includes a portion of the target object.

[0086] In other words, the content of the predicted sample image region can be analyzed to determine the overlap and / or positional difference between the predicted sample image region and the region where the target object is located in the sample image. Based on this overlap and / or positional difference, a first reward value can be determined. The overlap relationship here can refer to whether there is overlap, the size of the overlap area, and the proportion of the overlap area (the ratio of the overlap area to the predicted sample image region). The positional difference refers to the difference in pixel coordinates between the predicted sample image region and the region where the target object is located in the sample image. The coordinate difference can refer to the difference in coordinates between the center (or other reference point, such as a corner point) of the predicted sample image region and the center (or other reference point, such as a corner point) of the region where the target object is located in the sample image.

[0087] Simultaneously, analysis can be performed based on the data format of the data indicating the predicted sample image region (i.e., the predicted data format) to determine whether the predicted data format of the data indicating the predicted sample image region matches the preset standard data format, thereby determining the second reward value.

[0088] The preset standard data format refers to the standardized data format configured for the data indicating the predicted sample image region. For example, if a rectangle is used to indicate the predicted sample image region, the standard data format can be [x1, y1, x2, y2], where (x1, y1) are the coordinates of the upper left corner of the predicted sample image region, and (x2, y2) are the coordinates of the lower right corner of the predicted sample image region.

[0089] In some implementations, the content of the predicted sample image region can be analyzed to obtain the overlap area between the predicted sample image region and the region where the target object is located in the sample image, as well as the coordinate difference of the pixel coordinates between the two regions. If the ratio of the overlap area to the predicted sample image region is greater than a specified ratio (e.g., 0.6 or 0.7) and the coordinate difference is less than a specified coordinate difference (e.g., 8 pixels or 10 pixels), the content reward value of the predicted sample image region is determined to be a first preset value. If exactly one of these two conditions is satisfied (the ratio is not greater than the specified ratio, or the coordinate difference is not less than the specified coordinate difference, but not both), the content reward value is determined to be a second preset value. If neither condition is satisfied (the ratio is not greater than the specified ratio and the coordinate difference is not less than the specified coordinate difference), the content reward value is determined to be a third preset value. The first preset value is greater than the second preset value, and the second preset value is greater than the third preset value. Subsequently, the content reward values of the multiple predicted sample image regions are combined (e.g., by weighted sum or averaging), and the result is used as the first reward value of the sample image.
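A minimal sketch of this tiered content reward follows. It reads the middle tier as "exactly one of the two conditions holds", measures the coordinate difference between region centers, and uses 1.0 / 0.5 / 0.0 as the three preset values. All three choices are assumptions for illustration, as are the function names.

```python
def overlap_ratio(pred, target):
    """Overlap area divided by the area of the predicted region."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    return inter / pred_area if pred_area > 0 else 0.0

def center_distance(a, b):
    """Euclidean distance between the centers of two boxes, in pixels."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def content_reward(pred, target, min_ratio=0.6, max_diff=10, values=(1.0, 0.5, 0.0)):
    """Return the first, second, or third preset value for one predicted region."""
    ok_overlap = overlap_ratio(pred, target) > min_ratio
    ok_position = center_distance(pred, target) < max_diff
    satisfied = int(ok_overlap) + int(ok_position)
    return values[2 - satisfied]  # 2 satisfied -> first value, 0 -> third value

target = (0, 0, 100, 100)
rewards = [
    content_reward((2, 2, 98, 98), target),                  # both conditions hold
    content_reward((0, 0, 10, 10), target),                  # overlap ok, center too far
    content_reward((0, 0, 100, 100), (50, 0, 150, 100)),     # neither condition holds
]
first_reward = sum(rewards) / len(rewards)  # averaging is one of the listed options
```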

[0090] Accordingly, if the predicted data format of the data indicating the predicted sample image region is consistent with the preset standard data format, the format reward value of the predicted sample image region is determined to be a fourth preset value; if the predicted data format is inconsistent with the preset standard data format, the format reward value of the predicted sample image region is determined to be a fifth preset value, wherein the fourth preset value is greater than the fifth preset value. Subsequently, the format reward values of the multiple predicted sample image regions are combined (e.g., by weighted sum or averaging), and the result is used as the second reward value of the sample image.
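The format check can be sketched with a regular expression. The sketch assumes the model emits each region as a text string, that the standard format is `[x1, y1, x2, y2]` with non-negative integer pixel coordinates, and that a well-formed box also needs its upper-left corner above and to the left of its lower-right corner. The last requirement and the values 1.0 / 0.0 are assumptions, not taken from the text.

```python
import re

# Hypothetical pattern for the standard format "[x1, y1, x2, y2]".
BOX_PATTERN = re.compile(r"^\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]$")

def format_reward(text, match_value=1.0, mismatch_value=0.0):
    """Fourth preset value if the text follows the standard format, else the fifth."""
    m = BOX_PATTERN.match(text.strip())
    if m is None:
        return mismatch_value
    x1, y1, x2, y2 = map(int, m.groups())
    # Assumed sanity check: the corners must describe a non-empty rectangle.
    return match_value if x1 < x2 and y1 < y2 else mismatch_value

outputs = ["[10, 20, 110, 220]", "(10, 20, 110, 220)", "[10, 20, 110]"]
format_rewards = [format_reward(o) for o in outputs]
second_reward = sum(format_rewards) / len(format_rewards)  # averaging option
```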

[0091] In some implementations, S120 may include: B1. Obtain the annotation results of the sample image; the annotation results of the sample image indicate at least one annotated image region in the sample image that includes the target object.

[0092] B2. Match at least one predicted sample image region and at least one labeled image region to match a corresponding predicted sample image region for each labeled image region.

[0093] B3. Determine the first reward value based on the positional difference and / or overlap between each labeled image region and the corresponding predicted sample image region.

[0094] B4. Determine the second reward value based on the format difference between the labeled data format of each labeled image region and the predicted data format of the corresponding predicted sample image region; the labeled data format of a labeled image region is the format of the data indicating that labeled image region.

[0095] B5. Determine the composite reward value based on the first reward value and the second reward value.

[0096] The annotation results can be user-annotated or annotated by a large language model with certain capabilities; this application does not impose any limitations. Similarly, the annotated image region can also be a region indicated by a selection box. For example, the annotated image region can be a region indicated by a rectangular selection box, and the rectangular selection box is indicated by the coordinates of the upper left corner and the lower right corner.

[0097] The data format used to indicate the labeled image region is the label data format, and the label data format for the labeled image region is a preset standard data format.

[0098] A corresponding predicted sample image region can be determined for each labeled image region based on the differences between the multiple predicted sample image regions and the multiple labeled image regions. For example, the predicted sample image region with the smallest difference from a labeled image region can be taken as the predicted sample image region corresponding to that labeled image region.

[0099] In some implementations, the Hungarian matching algorithm can be introduced to match multiple predicted sample image regions and multiple labeled image regions, so that each labeled region is matched with a predicted sample image region.
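The matching step can be illustrated with a small sketch. Note that this is not the Hungarian algorithm itself: it is a brute-force search over assignments that finds the same minimum-cost one-to-one matching for small inputs. The cost `1 - IoU` and the requirement of at least one prediction per labeled region are assumptions made for the example.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_regions(labeled, predicted):
    """For each labeled region, return the index of its matched predicted region."""
    if len(predicted) < len(labeled):
        raise ValueError("this sketch needs at least one prediction per labeled region")
    best, best_cost = (), float("inf")
    # Try every one-to-one assignment and keep the one with the lowest total cost.
    for perm in permutations(range(len(predicted)), len(labeled)):
        cost = sum(1.0 - iou(lab, predicted[j]) for lab, j in zip(labeled, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)

labeled = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(21, 21, 31, 31), (50, 50, 60, 60), (1, 1, 11, 11)]
matches = match_regions(labeled, predicted)
```

The brute-force search grows factorially with the number of regions; for larger inputs a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the more typical choice.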

[0100] Then, for each predicted sample image region, a first reward value is determined based on the positional difference and / or overlap between the labeled image region and the corresponding predicted sample image region.

[0101] Optionally, the process of determining the first reward value may further include: for each labeled image region, determining the region positioning offset reward value corresponding to the labeled image region based on the overlap relationship between the labeled image region and the corresponding predicted sample image region; determining the region quantity offset reward value corresponding to the labeled image region based on the positional difference between the labeled image region and the corresponding predicted sample image region; determining the single region positioning reward value corresponding to the labeled image region based on the region positioning offset reward value and / or the region quantity offset reward value corresponding to the labeled image region; and determining the first reward value based on the single region positioning reward values corresponding to multiple labeled image regions.

[0102] The overlap relationship here can be either the Intersection over Union (IoU) or the Generalized Intersection over Union (GIoU). In other words, the region positioning offset reward value corresponding to the labeled image region can be determined based on the IoU or the GIoU between the labeled image region and the corresponding predicted sample image region.

[0103] The positional difference between the labeled image region and the corresponding predicted sample image region can refer to the difference in coordinates between the labeled image region and the corresponding predicted sample image region. For example, a rectangular labeled image region is indicated by the coordinates of its top-left and bottom-right corners, while a rectangular predicted sample image region is indicated by the coordinates of its top-left and bottom-right corners. The positional difference between the labeled image region and the corresponding predicted sample image region refers to the difference in the coordinates of the top-left and bottom-right corners between the labeled image region and the corresponding predicted sample image region.

[0104] Then, the single region positioning reward value corresponding to the labeled image region can be taken as the region positioning offset reward value alone, as the region quantity offset reward value alone, or as the weighted sum of the region positioning offset reward value and the region quantity offset reward value corresponding to the labeled image region.

[0105] Finally, the first reward value can be obtained by summing the single region positioning reward values corresponding to the multiple labeled image regions.

[0106] For example, the first reward value can be calculated using Formula 2:

$$R_{loc} = \sum_{i=1}^{n}\left[\lambda_1\,\mathrm{GIoU}(b_i, \hat{b}_i) + \lambda_2\, r_{pos}(b_i, \hat{b}_i)\right] \tag{2}$$

where $R_{loc}$ is the first reward value, $n$ is the total number of labeled image regions, $b_i$ is the location information (coordinates) of the i-th labeled image region, $\hat{b}_i$ is the location information (coordinates) of the predicted sample image region corresponding to the i-th labeled image region, $\mathrm{GIoU}(b_i, \hat{b}_i)$ is the generalized intersection over union (GIoU) between the i-th labeled image region and its corresponding predicted sample image region, $r_{pos}(b_i, \hat{b}_i)$ is the region quantity offset reward value determined from the positional difference between $b_i$ and $\hat{b}_i$, and $\lambda_1$ and $\lambda_2$ are the preset weights.
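A minimal sketch of a first reward of this kind is given below. It combines a GIoU term with a positional term per matched pair and sums over the labeled regions. The positional term used here (the negative mean absolute difference of the corner coordinates, with coordinates normalized to [0, 1]) and the default weights of 1.0 are assumptions chosen for illustration; the text does not fix either.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both regions.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def position_reward(label, pred):
    """Hypothetical positional term: negative mean absolute corner difference."""
    return -sum(abs(l - p) for l, p in zip(label, pred)) / 4

def first_reward(labeled, matched_pred, w_overlap=1.0, w_position=1.0):
    """Sum of the single region positioning reward values over all labeled regions."""
    return sum(
        w_overlap * giou(l, p) + w_position * position_reward(l, p)
        for l, p in zip(labeled, matched_pred)
    )

label = (0.1, 0.1, 0.5, 0.5)
perfect = first_reward([label], [(0.1, 0.1, 0.5, 0.5)])
shifted = first_reward([label], [(0.2, 0.2, 0.6, 0.6)])
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: two disjoint boxes receive a negative value that grows more negative the farther apart they are.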

[0107] The first reward value pushes the model toward high-precision localization by penalizing both the position offset (GIoU indicator) and the mismatch of the detection boxes (location information indicator). It constrains the accuracy of the prediction results output by the image recognition model more precisely, so that the trained image recognition model achieves higher image recognition accuracy.

[0108] The second reward value can be determined based on the difference or magnitude of the difference between the labeled data format of each labeled image region and the predicted data format of the corresponding predicted sample image region.

[0109] Optionally, for each labeled image region, if the labeled data format of the labeled image region is consistent with the prediction data format of the corresponding prediction sample image region, a first value is obtained as the single-region format reward value of the labeled image region; if the labeled data format of the labeled image region is inconsistent with the prediction data format of the corresponding prediction sample image region, a second value is obtained as the single-region format reward value of the labeled image region; the first value is greater than the second value; a second reward value is determined based on the single-region format reward values of multiple labeled image regions. The first value can be, for example, 1, and the second value can be, for example, 0.

[0110] Therefore, if the labeled data format of the labeled image region is consistent with the predicted data format of the corresponding predicted sample image region, that is, the predicted data format conforms to the preset standard data format, the higher reward value (the first value) is given. If the two formats are inconsistent, that is, the predicted data format does not conform to the preset standard data format, the lower reward value (the second value) is given. Thus, by giving high rewards to data in the standard data format and low rewards to data in a non-standard data format, the predicted data format is constrained toward the labeled data format, making the format of the recognition results output by the image recognition model more accurate.

[0111] After obtaining the first and second reward values, a weighted sum of the two can be calculated to obtain the composite reward value. That is, the composite reward value is $R = w_1 R_{loc} + w_2 R_{fmt}$, where $R_{loc}$ is the first reward value, $R_{fmt}$ is the second reward value, and $w_1$ and $w_2$ are the preset weights.

[0112] S130. Based on the composite reward value, adjust the parameters of the image recognition model.

[0113] After obtaining the composite reward value, the parameters of the image recognition model can be adjusted based on the composite reward value to obtain an image recognition model with stronger image recognition capabilities after parameter adjustment.

[0114] Optionally, there are multiple sample images; each sample image corresponds to its own composite reward value; accordingly, S130 may include: determining a relative advantage value for each sample image based on the composite reward values corresponding to the multiple sample images; the relative advantage value is used to indicate the difference between the composite reward value of the corresponding sample image and the average level of the composite reward values of the multiple sample images; and adjusting the parameters of the image recognition model based on the relative advantage values of the multiple sample images.

[0115] For example, the mean of the composite reward values of the sample images can be determined, and the ratio of the composite reward value of each sample image to the mean of the composite reward values can be determined as the relative advantage value of the sample images.

[0116] Alternatively, the relative advantage value of the sample image can be determined using Formula 3:

$$A_j = \frac{R_j - \mu_R}{\sigma_R} \tag{3}$$

where $A_j$ is the relative advantage value of the j-th sample image, $R_j$ is the composite reward value of the j-th sample image, $\mu_R$ is the mean of the composite reward values of the multiple sample images, and $\sigma_R$ is the standard deviation of the composite reward values.
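The composite reward and the normalized relative advantage can be sketched together. The sketch assumes equal weights of 0.5 for the two reward values, uses the population standard deviation, and adds a small `eps` to the denominator so that identical rewards do not cause a division by zero. All three are assumptions for illustration.

```python
import statistics

def composite_reward(first, second, w1=0.5, w2=0.5):
    """Weighted sum of the first (localization) and second (format) reward values."""
    return w1 * first + w2 * second

def relative_advantages(rewards, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of the rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# (first reward, second reward) for three sample images; made-up values.
pairs = [(0.9, 1.0), (0.5, 1.0), (0.1, 0.0)]
rewards = [composite_reward(f, s) for f, s in pairs]
advantages = relative_advantages(rewards)
```

Samples whose composite reward is above the mean receive a positive advantage and those below receive a negative one, so the update favours the better-than-average recognition results.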

[0117] After determining the relative advantage value, the model loss value can be determined based on the relative advantage, and then the parameters of the image recognition model can be adjusted based on the model loss value.

[0118] Typically, the parameter tuning process of an image recognition model includes M iterations; M is a positive integer greater than 1. For the q-th iteration, the model loss value is determined based on the relative advantage value of the q-th iteration, the model parameters of the image recognition model before the q-th iteration, and the model parameters of the image recognition model before parameter tuning (the image recognition model before the 1st iteration). Then, the parameters of the image recognition model are tuned with the goal of minimizing the model loss value.

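[0119] Optionally, in this embodiment, the model loss value can be determined with reference to Formula 4:

$$L(\theta) = -\frac{1}{N}\sum_{j=1}^{N}\left[\min\left(\rho_j(\theta)\,A_j,\ \mathrm{clip}\left(\rho_j(\theta),\,1-\epsilon,\,1+\epsilon\right)A_j\right) - \beta\, D_{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta_{ref}}\right)\right], \qquad \rho_j(\theta) = \frac{\pi_{\theta}(o_j \mid x_j)}{\pi_{\theta_{old}}(o_j \mid x_j)} \tag{4}$$

where $L(\theta)$ is the model loss value, $N$ is the total number of sample images, $A_j$ is the relative advantage value of the j-th sample image, $o_j$ is the sample recognition result for the j-th sample image $x_j$, $\theta$ denotes the model parameters of the image recognition model in the q-th iteration, $\theta_{old}$ denotes the model parameters of the image recognition model before the q-th iteration, $\theta_{ref}$ denotes the model parameters of the image recognition model before the first iteration, $\epsilon$ and $\beta$ are preset parameters on which this application imposes no limitations, and $D_{KL}$ is the KL divergence.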
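As a rough sketch of a GRPO-style loss of this kind, the following assumes the standard clipped-ratio objective with a KL penalty toward the model before the first iteration. It works on one sequence-level log-probability per sample image, which is a simplification, and estimates the KL term per sample with the non-negative estimator r - log r - 1 (with r the reference-to-current probability ratio) that is commonly used with GRPO. The defaults `epsilon=0.2` and `beta=0.04` are illustrative, not values from the text.

```python
import math

def grpo_loss(logp_new, logp_old, logp_ref, advantages, epsilon=0.2, beta=0.04):
    """Clipped policy-gradient loss with a KL penalty, averaged over samples.

    logp_new: log-probability of each sampled output under the current parameters
    logp_old: same outputs under the parameters before this iteration
    logp_ref: same outputs under the parameters before the first iteration
    """
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
        surrogate = min(ratio * adv, clipped * adv)
        # Per-sample KL estimate: r - log(r) - 1 with r = p_ref / p_new.
        log_r = lr - ln
        kl = math.exp(log_r) - log_r - 1
        total += surrogate - beta * kl
    # Negate so that minimizing the loss maximizes the clipped objective.
    return -total / len(advantages)

# When all three parameter sets agree, the loss is minus the mean advantage.
loss_same = grpo_loss([-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0], [1.0, 0.5])
# A large probability increase on a positive-advantage sample is clipped at 1 + epsilon.
loss_clipped = grpo_loss([0.0], [-1.0], [0.0], [1.0])
```

The clipping limits how far a single update can move the model away from the parameters used to generate the samples, while the KL term keeps it close to the model before the first iteration.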

[0120] Here, the image recognition model is trained using the GRPO strategy expressed by Formula 4. Optimizing against the reward value further tightens the boundary accuracy of the regions predicted by the image recognition model, which improves the training effect and gives the trained image recognition model higher image recognition accuracy.

[0121] In some implementations, the image recognition model includes a visual encoder and a result predictor; accordingly, S130 may include: fixing the model parameters of the visual encoder, adjusting the parameters of the result predictor based on the composite reward value, so as to adjust the parameters of the image recognition model.

[0122] In other words, when training an image recognition model using composite reward values, only the parameters of the result predictor are actually adjusted, not the model parameters of the visual encoder (and aligner).

[0123] Of course, the relative advantage value can first be determined based on the composite reward value, and the parameters of the result predictor can then be adjusted based on the relative advantage value.

[0124] In this embodiment, the sample image is first recognized by the image recognition model to obtain the sample recognition result, and a composite reward value is then determined based on the first reward value and the second reward value of the sample recognition result. The first reward value indicates the localization accuracy of the predicted sample image region in the sample image, and the second reward value indicates the degree of fit between the predicted data format of the predicted sample image region and the preset standard data format. The composite reward value therefore reflects both how accurately the image recognition model recognizes the sample image and whether the format of its output conforms to the standard data format. Consequently, after the parameters of the image recognition model are adjusted based on the composite reward value, the recognition result output by the image recognition model is both accurate in content and highly consistent with the standard data format, which improves the accuracy of both the content and the format of the recognition result.

[0125] In some implementations, as shown in Figure 3, after S130 the method further includes: S210. The parameter-adjusted image recognition model recognizes multiple sample images separately to obtain the verification recognition result of each sample image.

[0126] The verification recognition result of the sample image is used to indicate the verification image region containing the target object in the sample image predicted by the parameter-adjusted image recognition model. In other words, the parameter-adjusted image recognition model re-identifies the sample image, and the result is used as the verification recognition result, which includes the region containing the target object identified by the parameter-adjusted image recognition model—the verification image region.

[0127] S220. Based on the difference between the verification recognition result and the annotation result corresponding to each sample image, candidate sample images are determined from the multiple sample images.

[0128] The annotation results of the sample images indicate the annotated image regions in the sample images that include the target object.

[0129] In this embodiment, the aim is to filter out sample images that are difficult for the image recognition model to recognize after parameter adjustment from multiple sample images. Therefore, based on the difference between the verification recognition result and the annotation result corresponding to each sample image, candidate sample images with a difference greater than a preset difference are determined from the sample images.

[0130] In some implementations, the difference between the verification recognition result and the annotation result of the sample image can be determined based on factors such as the overlap, area, and quantity differences between the verification image regions and the labeled image regions of the sample image.

[0131] For example, the area difference between the verification image region and the labeled image region of the sample image can be taken as the difference between the verification recognition result and the labeling result of the sample image. Alternatively, the difference between the number of verification image regions included in the verification recognition result and the number of labeled image regions included in the labeling result can be taken as that difference. As a further alternative, it can be determined whether the verification image region and the labeled image region of the sample image overlap, and the ratio of the overlapping area to the total area of the labeled image region can be calculated as the difference between the verification recognition result and the labeling result of the corresponding sample image.

[0132] In some other implementations, each labeled image region corresponds to a verification image region. For each sample image, the image intersection over union (IoU) of the sample image is determined based on the overlap relationship between each labeled image region of the sample image and its corresponding verification image region; if the image IoU of the sample image is less than a preset IoU threshold, the sample image is taken as a candidate sample image.

[0133] The method for matching labeled image regions and verification image regions can be to determine the labeled image region and verification image region with the smallest difference as the matched labeled image region and verification image region, or to perform Hungarian matching on multiple labeled image regions and multiple verification image regions to match one verification image region for each labeled image region.

[0134] The overlap relationship here can be the intersection over union (IoU) or the generalized IoU (GIoU). The image IoU of a sample image can be calculated by averaging the IoU values of the multiple labeled image regions included in the sample image and their corresponding verification image regions. If the image IoU of the sample image is less than a preset IoU threshold (e.g., 0.8 or 0.7), it is determined that the parameter-adjusted image recognition model has low recognition accuracy for the sample image, and the sample image is taken as a candidate sample image. Otherwise, it is determined that the parameter-adjusted image recognition model has high recognition accuracy for the sample image, and the sample image is not taken as a candidate sample image.
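The matching and thresholding steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: regions are axis-aligned boxes `(x1, y1, x2, y2)`, there are at least as many verification regions as labeled regions, and the one-to-one matching is found by brute force over permutations (which gives the same optimal assignment as Hungarian matching on small inputs; a library routine such as SciPy's `linear_sum_assignment` would be the usual choice at scale). All function names are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_regions(labeled, verified):
    """Assign one verification box to each labeled box, maximizing total IoU."""
    if len(verified) < len(labeled):
        raise ValueError("sketch assumes at least one verification box per labeled box")
    best, best_score = None, -1.0
    for perm in permutations(range(len(verified)), len(labeled)):
        score = sum(iou(box, verified[i]) for box, i in zip(labeled, perm))
        if score > best_score:
            best, best_score = perm, score
    return [(box, verified[i]) for box, i in zip(labeled, best)]

def image_iou(pairs):
    """Mean IoU over the matched (labeled, verification) pairs of one image."""
    return sum(iou(a, v) for a, v in pairs) / len(pairs)

def is_candidate(pairs, threshold=0.8):
    """True if the image is a hard sample (image IoU below the threshold)."""
    return image_iou(pairs) < threshold

labeled = [(0, 0, 10, 10), (20, 20, 30, 30)]
verified = [(21, 21, 30, 30), (0, 0, 10, 9)]
pairs = match_regions(labeled, verified)
```

In this made-up example the matched pairs have IoU 0.9 and 0.81, so the image IoU is 0.855 and the image is not selected with a threshold of 0.8, but would be with a stricter threshold of 0.9.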

[0135] S230. Based on the verification recognition results corresponding to the candidate sample images, determine the composite reward value of the candidate sample images; based on the composite reward value of the candidate sample images, adjust the parameters of the parameter-adjusted image recognition model.

[0136] After determining the candidate sample images, the composite reward value is determined for the candidate sample images in accordance with the aforementioned step S120. Then, in accordance with step S130, the parameters of the parameter-adjusted image recognition model are adjusted based on the composite reward value of the candidate sample images, so as to further adjust the parameters of the parameter-adjusted image recognition model and improve the image recognition accuracy of the image recognition model.

[0137] It is easy to understand that, when the initial visual encoder is trained using pre-training sample images, the trained visual encoder can likewise be used to recognize multiple pre-training sample images separately, obtaining a pre-training verification recognition result for each pre-training sample image. The pre-training verification recognition result of a pre-training sample image indicates the pre-training verification image region, predicted by the parameter-adjusted image recognition model, that contains the target object in the pre-training sample image. Based on the difference between the pre-training verification recognition result and the pre-training annotation result corresponding to each pre-training sample image, candidate pre-training sample images are determined from the multiple pre-training sample images. The pre-training annotation result of a pre-training sample image indicates the pre-training annotation image region containing the target object in the pre-training sample image. Based on the pre-training verification recognition result corresponding to a candidate pre-training sample image, the composite reward value of the candidate pre-training sample image is determined, and based on that composite reward value, the parameters of the visual encoder are adjusted.

[0138] In this embodiment, by using explicit copying technology, the frequency of candidate sample images with slightly lower recognition accuracy in the training stream is doubled (i.e., repeated sampling is incorporated into the training set). By increasing the gradient update frequency, the model is forced to fit difficult samples (i.e., candidate sample images) with weak visual traces such as "blurred edges" or "high transparency coverage", which improves the recognition ability of the image recognition model and makes the image recognition results output by the image recognition model after parameter readjustment have a higher accuracy.
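The explicit copying of hard samples described above can be sketched as follows. This is a minimal illustration: the function name and the `copies=2` default (reflecting the doubling mentioned in the text) are assumptions, and samples are represented by identifiers only.

```python
def build_training_stream(sample_ids, candidate_ids, copies=2):
    """Repeat each candidate (hard) sample `copies` times; other samples appear once."""
    stream = []
    for sample_id in sample_ids:
        repeat = copies if sample_id in candidate_ids else 1
        stream.extend([sample_id] * repeat)
    return stream

# "b" was selected as a candidate sample image, so it appears twice in the stream
stream = build_training_stream(["a", "b", "c"], {"b"})
```

Because each copy contributes its own gradient update, the candidate sample images influence the parameters more often than the easy samples do.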

[0139] Please refer to Figure 4, which illustrates a flowchart of an image recognition method according to an embodiment of the present application. The method is applied to an electronic device, which may be the terminal 110 or the server 120 in Figure 1. The method may include: S310. Obtain the image to be recognized.

[0140] The image to be recognized is the image in which the target object is to be recognized. It can be a captured image, a screenshot, or an AI-generated image.

[0141] S320. The image to be recognized is recognized by the image recognition model to obtain the image recognition result of the image to be recognized.

[0142] The image recognition result is used to indicate whether the image to be recognized includes the target object and / or the predicted image region where the target object is located in the image to be recognized. If the target object is not recognized in the image to be recognized, the image recognition result indicates that the image to be recognized does not include the target object. If the target object is recognized in the image to be recognized, the image recognition result indicates that the image to be recognized includes the target object. Furthermore, the image recognition result can also further indicate the predicted image region where the target object is located in the image to be recognized. The predicted image region can be, for example, the region indicated by a rectangular selection box.

[0143] The image recognition model is trained according to any of the methods in the foregoing embodiments, which will not be repeated here.

[0144] As mentioned above, the image to be recognized can be input into the image recognition model, which will then perform image recognition to obtain at least one initial predicted image region and the probability that each initial predicted image region includes the target object. If the probability that the initial predicted image region includes the target object is higher than a preset threshold (e.g., 0.6 or 0.65), the initial predicted image region is obtained as the predicted image region, thereby obtaining the image recognition result.
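The probability filtering step described above can be sketched as follows. This is an illustrative sketch: the function names and the result dictionary layout are hypothetical, and the default threshold of 0.6 is one of the example values given in the text.

```python
def select_regions(initial_regions, threshold=0.6):
    """Keep the boxes whose target-object probability is higher than the threshold.

    initial_regions: list of (box, probability) pairs output by the model.
    """
    return [box for box, prob in initial_regions if prob > threshold]

def recognition_result(initial_regions, threshold=0.6):
    """Build a simple result: whether a target object was found, and where."""
    regions = select_regions(initial_regions, threshold)
    return {"contains_target": bool(regions), "regions": regions}

result = recognition_result([((10, 10, 50, 40), 0.91), ((200, 300, 260, 340), 0.35)])
```

Here only the first region is kept, so the result indicates that the image includes the target object and gives one predicted image region.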

[0145] It is not difficult to understand that, if the image recognition model is a composite model (a model composed of multiple models) consisting of a visual encoder and a result predictor, the visual encoder can extract features from the image to be recognized to obtain image features, and the image features are then input into the result predictor for prediction to obtain the image recognition result.

[0146] Optionally, if the image recognition model is a composite model that also includes an aligner, the visual encoder first extracts features from the image to be recognized to obtain image features. The aligner then aligns the image features to obtain aligned image features. Finally, the aligned image features are input into the result predictor for prediction to obtain the image recognition result.

[0147] In this embodiment, the trained image recognition model has strong recognition capabilities, resulting in a high recognition accuracy when the image recognition model identifies the image to be recognized, and the obtained image recognition result has a high accuracy.

[0148] To provide a more intuitive understanding of the solution presented in this application, an example is provided below to explain the image recognition method. In this embodiment, the image recognition model can be a model using the Qwen3-VL-2B architecture, and the target object is the tampered content in the document image.

[0149] As shown in Figure 5, first, the input pre-training sample images (including document images with tampered content) and sample images (including document images with tampered content) are preprocessed to improve their resolution, so that the images used to train the image recognition model have high resolution and carry clearer and more accurate information, thereby improving the training effect of the image recognition model and giving the trained image recognition model higher image recognition accuracy.

[0150] Then, the pixel coordinates of the preprocessed images are mapped to the integer space [0, 1000] to normalize them, so that the preprocessed images and their pixel coordinates are in a form that the image recognition model can directly receive.
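The coordinate normalization to the integer space [0, 1000] can be sketched as follows. This is an assumption-laden illustration: boxes are taken to be `(x1, y1, x2, y2)` in pixels, each axis is scaled by the image width or height respectively, and values are rounded and clamped. The application does not specify these details.

```python
def normalize_coord(value, size, scale=1000):
    """Map a pixel coordinate in [0, size] to an integer in [0, scale]."""
    return max(0, min(scale, round(value / size * scale)))

def normalize_box(box, width, height, scale=1000):
    """Normalize a pixel-space box (x1, y1, x2, y2) for an image of the given size."""
    x1, y1, x2, y2 = box
    return (
        normalize_coord(x1, width, scale),
        normalize_coord(y1, height, scale),
        normalize_coord(x2, width, scale),
        normalize_coord(y2, height, scale),
    )

# A box in a 400 x 200 pixel image
normalized = normalize_box((100, 50, 200, 100), width=400, height=200)
```

Normalizing in this way makes the coordinate vocabulary independent of the image resolution, so the same integer range describes regions in images of any size.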

[0151] Subsequently, the target sample regions containing the target object (tampered information) in the preprocessed sample image (the image after the aforementioned normalization process) are sorted "from top to bottom and from left to right" according to the center point coordinates of the target sample regions to obtain the target sample region sequence. This target sample region sequence serves as the ground truth for autoregressive generation when training the visual encoder, which stabilizes the spatial sequence and keeps the spatial location distribution consistent with the token sequence generation logic of the large language model in the image recognition model.
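The "top to bottom, left to right" ordering by center point can be sketched as follows. This is a simplified sketch: it sorts strictly by the vertical center coordinate and breaks ties by the horizontal one, whereas a practical implementation might first group regions into rows with a tolerance. The function names are hypothetical.

```python
def center(box):
    """Center point (cx, cy) of a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def sort_regions(boxes):
    """Order boxes top to bottom, then left to right, by center point."""
    return sorted(boxes, key=lambda b: (center(b)[1], center(b)[0]))

boxes = [(500, 100, 600, 200), (100, 100, 200, 200), (100, 0, 200, 50)]
sequence = sort_regions(boxes)
```

A deterministic ordering like this gives every training image exactly one valid target sequence, which avoids penalizing the model for emitting correct regions in a different order.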

[0152] In the specific training process, first, the model parameters of the aligner and the result predictor (usually a large language model) in the image recognition model are frozen. The image recognition model is then used to predict the pre-training sample images, and based on the prediction results for the pre-training sample images, the model parameters of the visual encoder in the image recognition model are adjusted. This adjustment can be full fine-tuning (FFT), which reconstructs the shallow feature space of the visual encoder into a high-frequency signal amplifier, enabling it to capture micro-forensic features such as JPEG compression artifacts and resampling interpolation that are smoothed out in the general pre-training stage, effectively weakening the negative scaling effect.

[0153] After that, the pre-training sample images can be recognized by the image recognition model to obtain the pre-training verification recognition results of the pre-training sample images. Based on the pre-training verification recognition results, candidate pre-training sample images are selected from the pre-training sample images. The visual encoder is then trained again on the candidate pre-training sample images (with the model parameters of the aligner and the result predictor in the image recognition model kept fixed) to further improve the visual encoder's feature extraction capability for difficult samples.

[0154] After the visual encoder completes full fine-tuning, the sample image can be predicted using the image recognition model to obtain the sample prediction result. The sample prediction result includes the predicted sample image region—the prediction box. Based on the difference between the predicted sample image region and the ground truth box (the labeled image region), a first reward value and a second reward value are determined. Then, the first reward value and the second reward value are combined to obtain a composite reward value. Then, the parameters of the aligner and the visual encoder in the image recognition model are fixed. Based on the GRPO loss value constructed from the composite reward value (that is, the aforementioned model loss value), the model parameters of the result predictor are adjusted to achieve parameter adjustment of the image recognition model and obtain the trained image recognition model.

[0155] After that, the sample images can be recognized by the image recognition model to obtain the verification recognition results of the sample images. Based on the verification recognition results, candidate sample images are selected from the sample images. The result predictor is then trained again on the candidate sample images (with the model parameters of the aligner and the visual encoder in the image recognition model kept fixed) to further improve the prediction ability of the result predictor for difficult samples.

[0156] Finally, after completing the aforementioned training, a trained image recognition model is obtained, which can then locate the area where the tampered content is located in the document image to be recognized.

[0157] This example also quantifies mathematically how the "negative scaling effect" of large models in DITL tasks is overcome. The feature extraction performance of the visual encoder can be analyzed by performing spectral analysis on the features it extracts. Specifically, the two-dimensional discrete Fourier transform (2D-DFT) can be used to quantify the energy distribution of the extracted feature map f(x, y):

$$F(u, v) = \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)}$$

where F(u, v) is the transformed feature map and M × N is the size of the feature map. The resulting energy distribution shows that as the model depth increases, the high-pass filtering characteristics in the features weaken (i.e., high-frequency energy decays). In this application, this decay is compensated by full fine-tuning (FFT) of the visual encoder and cross-layer fusion.
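The spectral analysis described above can be illustrated with a small, dependency-free sketch that computes the 2D-DFT of a feature map directly from its definition and reports the share of energy at high frequencies. The `cutoff` parameter and the definition of "high frequency" (wrapped frequency index above the cutoff on either axis) are assumptions made for illustration. The direct summation is only practical for small maps; a library FFT routine would normally be used instead.

```python
import cmath

def dft2(f):
    """Direct 2D discrete Fourier transform of a list-of-lists feature map."""
    rows, cols = len(f), len(f[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2 * cmath.pi * (u * x / rows + v * y / cols)
                    acc += f[x][y] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out

def high_freq_ratio(f, cutoff=1):
    """Fraction of spectral energy |F(u, v)|^2 above the cutoff frequency."""
    rows, cols = len(f), len(f[0])
    spectrum = dft2(f)
    total = high = 0.0
    for u in range(rows):
        for v in range(cols):
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            # Indices above the midpoint correspond to negative frequencies
            if max(min(u, rows - u), min(v, cols - v)) > cutoff:
                high += energy
    return high / total if total else 0.0

flat = [[1.0] * 4 for _ in range(4)]                          # no detail at all
checker = [[(-1.0) ** (x + y) for y in range(4)] for x in range(4)]  # finest detail
```

A constant map has all of its energy at the zero frequency, while a checkerboard concentrates its energy at the highest frequency, so comparing this ratio across layers gives one way to observe the high-frequency decay the text describes.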

[0158] Centered kernel alignment (CKA) can also be introduced to quantify the correlation between interlayer features, so as to analyze the plasticity inversion of the image recognition model through the CKA value. Specifically, the CKA value is determined according to the following formula:

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}$$

Here, K and L correspond to the image recognition model before and after training, respectively, and HSIC refers to the Hilbert-Schmidt Independence Criterion. Analysis using this CKA value shows that the larger the image recognition model, the stronger the plasticity inversion; conversely, the smaller the model, the weaker the plasticity inversion. The 2B model chosen in this example exhibits "representation rebound," enabling the deep network to successfully recover its sensitivity to low-level pixel edges while processing complex semantics.
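A CKA value of this kind can be computed as in the following sketch, which assumes a linear kernel and takes as input two feature matrices (one row per input, one column per feature) extracted from the same inputs by the two models being compared. The linear-kernel choice and the function names are assumptions. With a linear kernel, HSIC reduces (up to a constant that cancels in the ratio) to a squared Frobenius norm of a cross-covariance matrix.

```python
import math

def center_columns(X):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for col_x in zip(*X):
        for col_y in zip(*Y):
            total += sum(a * b for a, b in zip(col_x, col_y)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with the same number of rows.

    Assumes neither matrix has all-constant features (nonzero denominator).
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    hsic_xy = cross_frobenius_sq(Xc, Yc)
    hsic_xx = cross_frobenius_sq(Xc, Xc)
    hsic_yy = cross_frobenius_sq(Yc, Yc)
    return hsic_xy / math.sqrt(hsic_xx * hsic_yy)
```

Values near 1 indicate that the two sets of features are nearly identical up to rotation and scaling, and values near 0 indicate that they are unrelated.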

[0159] The image recognition model in this application avoids representation misalignment and semantic interference through a unified architecture. Compared to the state-of-the-art (SOTA) methods in the current document authentication field, the image recognition model in this application achieves extremely high pixel-level overall IoU on both the DocTamper and FSTS datasets, reaching the new SOTA level, while also demonstrating excellent generalization ability on the OOD dataset.

[0160] For example, for the DocTamper dataset, the performance of the image recognition model in this application is compared with that of prior-art image recognition models in Table 1, as follows: Table 1 The model categories involved include SPECIALIZED VISION-BASED MODELS, HYBRID EXPERT MLLMS, and pure multimodal large language models. The specialized vision-based models include the ASCFORMER, ADCD-NET, DITL, DTD, and FFDN models. The hybrid expert MLLMs include the FAKESHIELD and SIDA models. The pure multimodal large language model refers to the OURS model in this application. P-oIoU refers to the pixel-level overall IoU, used to measure segmentation accuracy, and I-f1 refers to the instance F1 score.

[0161] As shown in Table 1, the image recognition model of this application achieves high values on the DocTamper testing set, the first cross-domain set (FCD), the second cross-domain set (SCD), and the mean (MEAN, the arithmetic mean of the first three P-oIoU values), which means that the image recognition model of this application has good recognition performance.

[0162] The performance indicators of the image recognition model of this application are compared with those of existing image recognition models in Table 2, as follows: Table 2 Here, instance precision (I-prec) refers to the proportion of the instances predicted as positive by the model that are truly positive, and instance recall (I-rec) refers to the proportion of all real positive instances that are correctly identified by the model.
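The instance-level indicators can be computed from counts of true positives, false positives and false negatives, as in this short sketch using the standard definitions of precision, recall and F1. How a predicted instance is judged to be correct (for example, by an IoU threshold) is not specified here and is left as an assumption outside the function.

```python
def instance_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from instance-level counts.

    tp: predicted instances that are correct
    fp: predicted instances that are wrong
    fn: real instances that the model missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 8 correct predictions, 2 spurious predictions, 2 missed instances (made-up counts)
precision, recall, f1 = instance_metrics(tp=8, fp=2, fn=2)
```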

[0163] As shown in Table 2, the image recognition model of this application scores highest on every indicator in Table 2. This means that, compared with the other 6 existing models shown in Table 2, the image recognition model of this application has the highest recognition accuracy and the best recognition effect.

[0164] By training the visual encoder separately with full fine-tuning as described in this application, the model can overcome the perceptual limitations of the pre-trained weights and reduce its reliance on language priors, thereby alleviating hallucination during localization to some extent. Table 3 compares the full fine-tuning and parameter-efficient fine-tuning (LoRA) strategies for image recognition models with different architectures. Table 3 Here, VIT refers to the visual encoder, ALG refers to the aligner, and LLM is the large language model that acts as the result predictor.

[0165] As shown in Table 3, the full fine-tuning training strategy for the visual encoder alone can achieve high performance across all metrics of the image recognition model, ensuring its recognition effectiveness. This means that compared to the joint training of multiple networks and efficient fine-tuning methods in the image recognition model, the full fine-tuning training strategy for the visual encoder adopted in this application has better training results.

[0166] Furthermore, thanks to the in-depth understanding of the "negative scaling effect," this application demonstrates that smaller models outperform larger models in document forensics tasks. Using smaller models not only surpasses the detection performance of larger-scale parameter models, but also reduces computational costs in practical applications due to the fewer parameters. Table 4 shows the comparison of various metrics for image recognition models of different sizes when performing document image tampering detection. Table 4 Among them, family refers to the family to which the model belongs, including LLAVA-OV family, INTERNVL3 family, Qwen2.5-VL family and Qwen3-VL family. The number of parameters of the model (PARAMS) indicates the size of the model.

[0167] As shown in Table 4, for different families of models, models with smaller parameter volumes have higher performance indicators than models with larger parameter volumes. This means that image recognition models are not necessarily better the larger the parameter volume is. On the contrary, choosing a model with a smaller parameter volume can achieve better results, and the "negative scaling effect" of models with smaller parameter volumes is weaker.

[0168] The method described in this application eliminates the structural information bottleneck and perception gap inherent in decoupled designs, establishes an end-to-end unified architecture that directly generates a mapping from raw pixels to spatial coordinates, ensures that high-frequency forensic signals are preserved and passed to the language model layers, and, through full fine-tuning and the optimization strategies above, overcomes the forensic blind spots of the native model in the zero-shot setting.

[0169] Please refer to Figure 6, which shows a block diagram of a training apparatus for an image recognition model according to an embodiment of this application. The training apparatus 1300 for the image recognition model includes: a first recognition module 1310, used to recognize the sample image through an image recognition model to obtain the sample recognition result of the sample image, where the sample recognition result indicates at least one predicted sample image region, predicted by the image recognition model, that contains the target object in the sample image; a determination module 1320, used to determine a composite reward value based on a first reward value and a second reward value of the sample recognition result, where the first reward value indicates the positioning accuracy of the at least one predicted sample image region, the second reward value indicates the degree of fit between at least one predicted data format and a preset standard data format, the predicted data format is the format of the data indicating the predicted sample image region, and the first reward value is determined based on the positional differences and / or overlap between the at least one predicted sample image region and the region where the target object is located in the sample image; and an adjustment module 1330, used to adjust the parameters of the image recognition model based on the composite reward value.

[0170] Optionally, the determining module 1320 is further configured to obtain the annotation results of the sample image; the annotation results of the sample image indicate at least one annotated image region in the sample image that includes the target object; match at least one predicted sample image region and at least one annotated image region to match a corresponding predicted sample image region for each annotated image region; determine a first reward value based on the positional difference and / or overlap relationship between each annotated image region and the corresponding predicted sample image region; determine a second reward value based on the format difference between the annotation data format of each annotated image region and the prediction data format of the corresponding predicted sample image region; the annotation data format of the annotated image region is the format indicating the data of the annotated image region; and determine a composite reward value based on the first reward value and the second reward value.

[0171] Optionally, the determining module 1320 is further configured to, for each labeled image region, determine a region positioning offset reward value corresponding to the labeled image region based on the overlap relationship between the labeled image region and the corresponding predicted sample image region; determine a region quantity offset reward value corresponding to the labeled image region based on the positional difference between the labeled image region and the corresponding predicted sample image region; determine a single region positioning reward value corresponding to the labeled image region based on the region positioning offset reward value and / or the region quantity offset reward value corresponding to the labeled image region; and determine a first reward value based on the single region positioning reward values corresponding to multiple labeled image regions.

[0172] Optionally, the determining module 1320 is further configured to, for each labeled image region, if the labeled data format of the labeled image region is consistent with the predicted data format of the corresponding predicted sample image region, obtain a first value as the single-region format reward value of the labeled image region; if the labeled data format of the labeled image region is inconsistent with the predicted data format of the corresponding predicted sample image region, obtain a second value as the single-region format reward value of the labeled image region; the first value is greater than the second value; and the second reward value is determined based on the single-region format reward values of multiple labeled image regions.
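The per-region format reward can be sketched as follows. The application does not specify the standard data format or how the single-region values are combined, so everything concrete here is a hypothetical choice for illustration: the standard format is assumed to be a JSON list of four integers in [0, 1000], the first and second values are assumed to be 1.0 and 0.0, and the second reward value is taken as the average of the single-region values.

```python
import json

def matches_standard_format(text):
    """Hypothetical check: a JSON list of four integers in [0, 1000]."""
    try:
        value = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return (
        isinstance(value, list)
        and len(value) == 4
        and all(isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 1000
                for v in value)
    )

def format_reward(predicted_texts, first_value=1.0, second_value=0.0):
    """Average the single-region format rewards (averaging is an assumption)."""
    per_region = [first_value if matches_standard_format(t) else second_value
                  for t in predicted_texts]
    return sum(per_region) / len(per_region) if per_region else second_value

# One well-formed region and one that uses parentheses instead of brackets
reward = format_reward(["[10, 20, 300, 400]", "(10, 20, 300, 400)"])
```

Because the first value is greater than the second, outputs that follow the expected format earn a higher second reward value, which pushes the model toward format-conformant output during training.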

[0173] Optionally, there are multiple sample images; the adjustment module 1330 is further configured to recognize the multiple sample images separately using the parameter-adjusted image recognition model, and obtain the verification recognition result for each sample image; the verification recognition result of the sample image is used to indicate the verification image region containing the target object in the sample image predicted by the parameter-adjusted image recognition model; based on the difference between the verification recognition result and the annotation result corresponding to each sample image, candidate sample images are determined from the multiple sample images; the annotation result of the sample image indicates the annotation image region containing the target object in the sample image; based on the verification recognition result corresponding to the candidate sample image, the composite reward value of the candidate sample image is determined; based on the composite reward value of the candidate sample image, the parameters of the parameter-adjusted image recognition model are adjusted.

[0174] Optionally, each labeled image region corresponds to a verification image region; the adjustment module 1330 is also used to determine, for each sample image, the image intersection over union (IoU) of the sample image based on the overlap relationship between each labeled image region of the sample image and its corresponding verification image region; if the image IoU of the sample image is less than a preset IoU threshold, the sample image is taken as a candidate sample image.

[0175] Optionally, there are multiple sample images; each sample image corresponds to its own composite reward value; the adjustment module is also used to determine the relative advantage value of each sample image based on the composite reward values corresponding to the multiple sample images; the relative advantage value is used to indicate the difference between the composite reward value of the corresponding sample image and the average level of the composite reward values of the multiple sample images; the parameters of the image recognition model are adjusted based on the relative advantage values of the multiple sample images.

[0176] Optionally, the image recognition model includes a visual encoder and a result predictor; the first recognition module 1310 is further configured to extract image features from the sample image through the visual encoder to obtain sample image features corresponding to the sample image; and to perform prediction processing on the sample image based on the sample image features through the result predictor to obtain the sample recognition result.

[0177] Optionally, the adjustment module 1330 is also used to fix the model parameters of the visual encoder and adjust the parameters of the result predictor based on the composite reward value in order to adjust the parameters of the image recognition model.
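The idea of fixing the visual encoder while adjusting only the result predictor can be illustrated with a toy update step. This is a sketch under strong assumptions: parameters are modeled as named scalars, the `encoder.` name prefix, the learning rate, and the function name are all hypothetical, and the gradients are taken as given rather than derived from the composite reward value.

```python
from typing import Dict


def apply_update(
    params: Dict[str, float],
    grads: Dict[str, float],
    lr: float = 0.1,                    # hypothetical learning rate
    frozen_prefix: str = "encoder.",    # hypothetical naming convention
) -> Dict[str, float]:
    """One gradient-ascent step on a reward objective that skips frozen parameters."""
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen_prefix):
            updated[name] = value  # visual encoder parameters stay fixed
        else:
            updated[name] = value + lr * grads.get(name, 0.0)
    return updated


params = {"encoder.w": 1.0, "predictor.w": 2.0}
grads = {"encoder.w": 5.0, "predictor.w": 1.0}
print(apply_update(params, grads, lr=0.5))  # {'encoder.w': 1.0, 'predictor.w': 2.5}
```

In a real deep learning framework the same effect is usually achieved by excluding the encoder parameters from the optimizer or disabling their gradients.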

[0178] Optionally, the device further includes a pre-training module, used to extract image features from the pre-training sample images through the initial visual encoder to obtain the pre-training sample image features corresponding to the pre-training sample images; for each target sample region in multiple target sample regions in the pre-training sample images, the result predictor performs prediction processing on the target sample region based on the pre-training sample image features to obtain the region prediction probability of the target sample region; each target sample region includes a target object, and the region prediction probability is used to indicate the probability that the result predictor predicts that the corresponding target sample region includes a target object; based on the region prediction probabilities of multiple target sample regions, the parameters of the initial visual encoder are adjusted to obtain the parameter-adjusted initial visual encoder, which serves as the visual encoder.

[0179] Optionally, the pre-training module is further configured to sort multiple target sample regions based on the positional relationship between multiple target sample regions in the pre-training sample image to obtain a target sample region sequence; and through the result predictor, according to the arrangement order of each target sample region in the target sample region sequence, perform prediction processing on each target sample region based on the features corresponding to each target sample region in the features of the pre-training sample image to obtain the region prediction probability of each target sample region.

[0180] Optionally, the pre-training module is also used to obtain the reference point corresponding to each target sample region in the pre-training sample image; each reference point is located within the corresponding target sample region; based on the positional relationship between the reference points corresponding to each of the multiple target sample regions in the pre-training sample image, the multiple target sample regions are sorted to obtain a target sample region sequence.
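The sorting step in paragraphs [0179] and [0180] could look like the sketch below. The application only requires that each reference point lie within its target sample region and that the regions be sorted by the positional relationship between reference points; the choice of the box centre as the reference point, the top-to-bottom then left-to-right order, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def reference_point(box: Box) -> Tuple[float, float]:
    """Centre of the box, which always lies inside the box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def sort_regions(boxes: List[Box]) -> List[Box]:
    """Order regions top-to-bottom, then left-to-right, by reference point."""
    def key(box: Box) -> Tuple[float, float]:
        x, y = reference_point(box)
        return (y, x)
    return sorted(boxes, key=key)


regions = [(10, 10, 20, 20), (0, 0, 4, 4), (6, 0, 10, 4)]
print(sort_regions(regions))  # [(0, 0, 4, 4), (6, 0, 10, 4), (10, 10, 20, 20)]
```

A fixed ordering of this kind gives the result predictor a deterministic target sample region sequence to process during pre-training.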

[0181] Optionally, the visual encoder includes multiple key visual feature extraction layers, each of which is located before the last feature extraction layer of the visual encoder; the result predictor includes an associated prediction layer corresponding to each key visual feature extraction layer; the first recognition module 1310 is further configured to input sample image features into the first prediction layer of the result predictor and input the features extracted by each key visual feature extraction layer into the corresponding associated prediction layer, so that the result predictor performs prediction processing on the sample image based on the sample image features and the features extracted by each key visual feature extraction layer to obtain the sample recognition result.

[0182] Referring to Figure 7, a block diagram of an image recognition device according to an embodiment of this application is shown. The image recognition device 1400 includes: an acquisition module 1410, used to acquire the image to be recognized; and a second recognition module 1420, used to recognize the image to be recognized through an image recognition model to obtain the image recognition result of the image to be recognized; the image recognition result is used to indicate whether the image to be recognized includes the target object and / or the predicted image region where the target object is located in the image to be recognized; wherein the image recognition model is trained according to the method in the aforementioned embodiment.

[0183] It should be noted that the device embodiments in this application correspond to the aforementioned method embodiments. The specific principles in the device embodiments can be found in the content of the aforementioned method embodiments, and will not be repeated here.

[0184] Figure 8 shows a structural block diagram of an electronic device for performing an image recognition method according to an embodiment of this application. The electronic device may be the terminal 110 or the server 120 shown in Figure 1, etc. It should be noted that the computer system 1200 of the electronic device shown in Figure 8 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0185] As shown in Figure 8, the computer system 1200 includes a Central Processing Unit (CPU) 1201, which can perform various appropriate actions and processes, such as executing the methods described in the above embodiments, based on programs stored in Read-Only Memory (ROM) 1202 or programs loaded from storage section 1208 into Random Access Memory (RAM) 1203. The RAM 1203 also stores various programs and data required for system operation. The CPU 1201, ROM 1202, and RAM 1203 are interconnected via a bus 1204. An Input / Output (I / O) interface 1205 is also connected to the bus 1204.

[0186] The following components are connected to I / O interface 1205: an input section 1206 including a keyboard, mouse, etc.; an output section 1207 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1208 including a hard disk, etc.; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card, modem, etc. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to I / O interface 1205 as needed. Removable media 1211, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., are installed on drive 1210 as needed so that computer program products read from them can be installed into storage section 1208 as needed.

[0187] Specifically, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program product can be downloaded and installed from a network via communication section 1209, and / or installed from removable media 1211. When the computer program product is executed by the central processing unit (CPU) 1201, it performs various functions defined in the system of this application.

[0188] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such transmitted data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.

[0189] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. Each block in a flowchart or block diagram may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0190] The units described in the embodiments of this application can be implemented in software or hardware, and the described units can also be located in a processor. The names of these units do not necessarily limit the specific unit itself.

[0191] In another aspect, this application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or it may exist independently and not assembled into the electronic device. The aforementioned computer-readable storage medium carries computer-readable instructions that, when executed by a processor, implement the methods in any of the above embodiments.

[0192] According to one aspect of the embodiments of this application, a computer program product is provided, the computer program product including computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, causing the electronic device to perform the methods of any of the above embodiments.

[0193] In the embodiments of this application, the terms "module" or "unit" refer to a part of a computer program product with a predetermined function, which works together with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (e.g., processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that functions as a whole.

[0194] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0195] Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause an electronic device (such as a personal computer, server, touch terminal, or network device, etc.) to execute the methods according to the embodiments of this application.

[0196] Those skilled in the art will readily conceive of other embodiments of this application upon consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. It should be understood that this application is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

[0197] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A training method for an image recognition model, characterized in that the method comprises: recognizing a sample image through an image recognition model to obtain a sample recognition result of the sample image; the sample recognition result is used to indicate at least one predicted sample image region including a target object in the sample image predicted by the image recognition model; determining a composite reward value based on a first reward value and a second reward value of the sample recognition result; the first reward value indicates positioning accuracy of the at least one predicted sample image region; the second reward value indicates a degree of fit between a predicted data format and a preset standard data format, the predicted data format being a format of data indicating the at least one predicted sample image region; the first reward value is determined based on a position difference and / or an overlapping relationship between the at least one predicted sample image region and a region where the target object is located in the sample image; and adjusting parameters of the image recognition model based on the composite reward value.

2. The method of claim 1, wherein the method comprises: obtaining a label result of the sample image; the label result of the sample image indicates at least one labeled image region including a target object in the sample image; matching the at least one predicted sample image region and the at least one labeled image region to match each labeled image region with a corresponding predicted sample image region; determining the first reward value based on a position difference and / or an overlapping relationship between each labeled image region and the corresponding predicted sample image region; determining the second reward value based on a format difference between a label data format of each labeled image region and a predicted data format of the corresponding predicted sample image region; the label data format of the labeled image region being a format of data indicating the labeled image region; and determining the composite reward value based on the first reward value and the second reward value.

3. The method of claim 2, wherein the method comprises: for each labeled image region, determining a region positioning offset reward value corresponding to the labeled image region based on an overlapping relationship between the labeled image region and the corresponding predicted sample image region; determining a region number offset reward value corresponding to the labeled image region based on a position difference between the labeled image region and the corresponding predicted sample image region; determining a single-region positioning reward value corresponding to the labeled image region based on the region positioning offset reward value and / or the region number offset reward value corresponding to the labeled image region; and determining the first reward value based on single-region positioning reward values corresponding to a plurality of labeled image regions.

4. The method of claim 2, wherein determining the second reward value based on a format difference between a label data format of each labeled image region and a predicted data format of the corresponding predicted sample image region, the label data format of the labeled image region being a format of data indicating the labeled image region, comprises: for each labeled image region, if the label data format of the labeled image region is consistent with the predicted data format of the corresponding predicted sample image region, obtaining a first value as a single-region format reward value of the labeled image region; if the label data format of the labeled image region is inconsistent with the predicted data format of the corresponding predicted sample image region, obtaining a second value as the single-region format reward value of the labeled image region, the first value being greater than the second value; and determining the second reward value based on the single-region format reward values of the plurality of labeled image regions.

5. The method of claim 1, wherein there are multiple sample images; after adjusting the parameters of the image recognition model based on the composite reward value, the method further includes: recognizing the multiple sample images separately by the parameter-adjusted image recognition model to obtain a verification recognition result for each sample image, the verification recognition result of the sample image being used to indicate the verification image region containing the target object in the sample image predicted by the parameter-adjusted image recognition model; determining candidate sample images from the plurality of sample images based on the difference between the verification recognition result and the annotation result corresponding to each sample image, the annotation result of the sample image indicating the annotated image region in the sample image that includes the target object; determining the composite reward value of the candidate sample images based on the verification recognition results corresponding to the candidate sample images; and adjusting the parameters of the parameter-adjusted image recognition model based on the composite reward value of the candidate sample images.

6. The method of claim 5, wherein each labeled image region corresponds to a verification image region; and the step of determining candidate sample images from the plurality of sample images based on the difference between the verification recognition result and the annotation result corresponding to each sample image includes: for each of the sample images, determining the image intersection-over-union ratio of the sample image based on the overlap relationship between each labeled image region of the sample image and its corresponding verification image region; and if the image intersection-over-union ratio of the sample image is less than a preset intersection-over-union ratio threshold, obtaining the sample image as the candidate sample image.

7. The method of claim 1, wherein there are multiple sample images, and each sample image corresponds to its own composite reward value; the step of adjusting the parameters of the image recognition model based on the composite reward value includes: determining the relative advantage value of each sample image based on the composite reward values corresponding to the multiple sample images, the relative advantage value being used to indicate the difference between the composite reward value of the corresponding sample image and the average level of the composite reward values of the plurality of sample images; and adjusting the parameters of the image recognition model based on the relative advantage values of the multiple sample images.

8. The method of claim 1, wherein, The image recognition model includes a visual encoder and a result predictor; The step of recognizing the sample image using an image recognition model to obtain the sample recognition result includes: The visual encoder extracts image features from the sample image to obtain the sample image features corresponding to the sample image. The result predictor performs prediction processing on the sample image based on the features of the sample image to obtain the sample recognition result.

9. The method of claim 8, wherein, The step of adjusting the parameters of the image recognition model based on the composite reward value includes: By fixing the model parameters of the visual encoder and adjusting the parameters of the result predictor based on the composite reward value, the parameters of the image recognition model are adjusted.

10. The method of claim 8, wherein, Before extracting image features from the sample image using the visual encoder to obtain the sample image features corresponding to the sample image, the method further includes: The image features of the pre-training sample images are obtained by extracting image features from the pre-training sample images through an initial visual encoder; For each target sample region in a pre-training sample image, the result predictor performs prediction processing on the target sample region based on the features of the pre-training sample image to obtain the region prediction probability of the target sample region; each target sample region includes the target object, and the region prediction probability is used to indicate the probability that the result predictor predicts that the corresponding target sample region includes the target object. Based on the region prediction probabilities of the multiple target sample regions, the parameters of the initial visual encoder are adjusted to obtain the parameter-adjusted initial visual encoder, which is used as the visual encoder.

11. The method of claim 10, wherein before the result predictor performs prediction processing on each target sample region in the pre-training sample image based on the pre-training sample image features to obtain the region prediction probability of the target sample region, the method further includes: sorting the multiple target sample regions based on the positional relationship between the multiple target sample regions in the pre-training sample image to obtain a target sample region sequence; and the step of, for each target sample region of the multiple target sample regions in the pre-training sample image, performing prediction processing on the target sample region by the result predictor based on the pre-training sample image features to obtain the region prediction probability of the target sample region includes: performing, by the result predictor, prediction processing on each target sample region according to the arrangement order of each target sample region in the target sample region sequence, based on the features corresponding to each target sample region in the pre-training sample image features, to obtain the region prediction probability of each target sample region.

12. The method of claim 11, wherein the step of sorting the multiple target sample regions based on the positional relationship between the multiple target sample regions in the pre-training sample image to obtain a target sample region sequence includes: obtaining, in the pre-training sample image, the reference point corresponding to each target sample region, each reference point being located within its corresponding target sample region; and sorting the multiple target sample regions based on the positional relationship between the reference points corresponding to the multiple target sample regions in the pre-training sample image to obtain the target sample region sequence.

13. The method of claim 8, wherein, The visual encoder includes multiple key visual feature extraction layers, each of which is located before the last feature extraction layer of the visual encoder; the result predictor includes an association prediction layer corresponding to each of the key visual feature extraction layers. The step of performing prediction processing on the sample image based on the sample image features by the result predictor to obtain the sample recognition result includes: The sample image features are input into the first prediction layer of the result predictor, and the features extracted by each of the key visual feature extraction layers are input into the corresponding associated prediction layer, so that the result predictor performs prediction processing on the sample image based on the sample image features and the features extracted by each of the key visual feature extraction layers, and obtains the sample recognition result.

14. An image recognition method, characterized in that the method includes: acquiring an image to be recognized; and recognizing the image to be recognized by an image recognition model to obtain an image recognition result of the image to be recognized; the image recognition result is used to indicate whether the image to be recognized includes a target object and / or the predicted image region in the image to be recognized; wherein the image recognition model is trained according to the method described in any one of claims 1-13.

15. A training device for an image recognition model, characterized in that, The device includes: The first recognition module is used to recognize the sample image through an image recognition model to obtain the sample recognition result of the sample image; the sample recognition result is used to indicate that the image recognition model predicts that at least one predicted sample image region in the sample image includes the target object; The determination module is used to determine a composite reward value based on a first reward value and a second reward value of the sample recognition result; the first reward value indicates the localization accuracy of the at least one predicted sample image region; the second reward value indicates the degree of fit between the predicted data format and a preset standard data format, wherein the predicted data format is a format indicating the data of the at least one predicted sample image region; the first reward value is determined based on the positional differences and / or overlap between the at least one predicted sample image region and the region where the target object is located in the sample image. The adjustment module adjusts the parameters of the image recognition model based on the composite reward value.

16. An image recognition device, characterized in that the device includes: an acquisition module, used to acquire the image to be recognized; and a second recognition module, used to recognize the image to be recognized by an image recognition model to obtain an image recognition result of the image to be recognized; the image recognition result is used to indicate whether the image to be recognized includes a target object and / or the predicted image region in the image to be recognized; wherein the image recognition model is trained according to the method described in any one of claims 1-13.

17. An electronic device, comprising: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the method as described in any one of claims 1-13, or the method as described in claim 14.

18. A computer-readable storage medium, characterized in that it stores computer-readable instructions that, when executed by a processor, implement the method as described in any one of claims 1-13, or implement the method as described in claim 14.

19. A computer program product, characterized in that it includes computer instructions that, when executed by a processor, implement the method as described in any one of claims 1-13, or implement the method as described in claim 14.