Image understanding method, device, and medium for training with artificial intelligence data

By employing bidirectional retrieval verification and alternating training, the problem of insufficient utilization of semantic conflict samples in image understanding technology is solved, thereby improving the stability of cross-modal semantic representation and semantic discrimination ability.

CN122241230A · Pending · Publication Date: 2026-06-19 · SUZHOU JITIAN INFORMATION TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SUZHOU JITIAN INFORMATION TECHNOLOGY CO LTD
Filing Date: 2026-03-26
Publication Date: 2026-06-19

Patent Timeline

26 Mar 2026: Application
19 Jun 2026: Publication (CN122241230A)

IPC: G06F18/214; G06F18/22; G06F18/213; G06F40/30; G06F16/53; G06F16/33; G06V10/40


Similar Technology Patents

Text and image bidirectional alignment method and system based on multi-hop parallel reasoning
CN121301909A
Image-Text Bidirectional Retrieval Method Based on Multi-View Joint Embedding Space
CN107330100B
Image-text bothway retrieval method based on multi-view unite embedded space
CN107330100A
Image-text cross-modal retrieval method based on joint features
CN114722224A
Bidirectional image-text retrieval method based on multi-view joint embedding space
WO2019007041A1


AI Technical Summary

⚠Technical Problem

Existing image understanding technologies often rely on single-round matching and sorting or one-way retrieval mechanisms in the process of aligning image region features with text phrase entries. This results in insufficient utilization of semantic conflict samples and a lack of consistency constraints, which affects the stability of cross-modal semantic representation.

⚗Method used

By employing a bidirectional retrieval verification and alternating training method, conflict attribution and adversarial pairing processing are performed on inconsistent phrases to generate a semantic contradiction set, which is then jointly trained to form a more stable image-text consistency model.

🎯Benefits of technology

It improves the semantic discrimination and scene understanding capabilities of image semantic results in computer vision tasks, and achieves a more stable cross-modal representation structure through bidirectional retrieval verification and alternating training.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122241230A_ABST

Patent Text Reader

Abstract

This invention discloses an image understanding method, device, and medium for training with artificial intelligence data, relating to the field of multimodal alignment technology. The method includes: acquiring image samples and text samples, pairing them to generate an image-text set; extracting region features and phrase entries from the image-text set, and performing similarity ranking to generate a fine-grained alignment set; performing bidirectional retrieval verification on the fine-grained alignment set, and performing conflict attribution and contradiction intensity ranking on inconsistent region phrases in the bidirectional retrieval verification; based on the contradiction intensity ranking, performing adversarial pairing processing on the inconsistent region phrases to generate a semantic contradiction set; and performing alternating joint training on the semantic contradiction set and consistent region phrases in the bidirectional retrieval verification to generate an image-text consistency model. This invention improves the semantic discrimination and scene understanding capabilities of image semantic results in computer vision tasks through bidirectional retrieval verification and alternating training.


Description

Technical Field

[0001] This invention relates to the field of multimodal alignment technology, and in particular to an image understanding method, device and medium for training with artificial intelligence data.

Background Technology

[0002] With the development of computer vision and artificial intelligence data training technologies, image understanding methods have gradually evolved from target recognition based on single-modal feature representations to cross-modal alignment mechanisms that integrate image region features and text semantic information. Deep neural network structures have been widely applied in image feature encoding and text semantic encoding, achieving semantic association analysis between image content and text description through region-level feature extraction, phrase-level semantic modeling, and cross-modal vector space mapping. Within the framework of artificial intelligence data training, the structured organization of image and text samples, the construction of fine-grained alignment relationships, and the optimization of consistency constraints have become important research directions for improving the accuracy of image semantic parsing.

[0003] Existing image understanding technologies often rely on single-round matching and ranking or one-way retrieval mechanisms when aligning image region features with text phrase entries. They lack conflict attribution and differentiation of contradiction intensity for inconsistent region phrases during bidirectional retrieval verification, so semantically contradictory samples are underused during the training phase of the image-text consistency model, which affects the stability of cross-modal semantic representation. Current cross-modal alignment techniques typically improve model discrimination by increasing the training data scale or introducing negative sampling strategies, but they still suffer from insufficient utilization of semantically contradictory samples and overly simplistic consistency constraints. Improvements are usually attempted by increasing the weight of the contrastive loss or introducing hard example mining mechanisms.

Summary of the Invention

[0004] In view of the aforementioned existing problems, the present invention is proposed.

[0005] Therefore, this invention provides an image understanding method for training with artificial intelligence data to solve the problems of insufficient utilization of semantically conflicting samples and overly simplistic consistency constraints.

[0006] To solve the above technical problems, the present invention provides the following technical solution. In a first aspect, the present invention provides an image understanding method for training with artificial intelligence data, comprising: obtaining image samples and text samples and pairing them to generate an image-text set; extracting region features and phrase entries from the image-text set and performing similarity ranking to generate a fine-grained alignment set; performing bidirectional retrieval verification on the fine-grained alignment set, and performing conflict attribution and contradiction intensity ranking on the inconsistent region phrases found in the bidirectional retrieval verification; based on the contradiction intensity ranking, performing adversarial pairing processing on the inconsistent region phrases to generate a semantic contradiction set; performing alternating joint training on the semantic contradiction set and the consistent region phrases from the bidirectional retrieval verification to generate an image-text consistency model; and using the image-text consistency model to perform phrase response parsing and relation inference on the image-text set to generate image semantic results.

[0007] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, generating the image-text set specifically comprises: based on the sample identifiers in the image samples and text samples, associating the image samples and text samples to generate a set of paired entries; expanding each paired entry in the set of paired entries and performing an identifier consistency check on the pairing identifiers in the paired entries to form a set of verified paired entries; and merging and organizing the set of verified paired entries to generate the image-text set.

[0008] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, generating the fine-grained alignment set specifically comprises: expanding each image-text pair in the image-text set one by one, and extracting region candidates from the image samples in the image-text pairs to form a region candidate set; extracting region features from the region candidate set and aggregating them into a region feature set; segmenting the text samples in the image-text pairs by phrase boundaries and attaching phrase indices to form a set of phrase entries; based on the set of phrase entries, performing phrase-guided matching on the region feature set to form associated index pairs between region indices and phrase indices; and performing similarity ranking on the region features and phrase entries in the associated index pairs, and aggregating the similarity ranking results to generate the fine-grained alignment set.

[0009] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, the similarity ranking refers to the ordered arrangement of the region features and phrase entries in the associated index pairs according to their degree of matching.

[0010] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, generating the semantic contradiction set specifically comprises: extracting alignment entries from the fine-grained alignment set and organizing them for retrieval to form a bidirectional retrieval verification sequence; sorting the region features in the bidirectional retrieval verification sequence by image-to-text retrieval and the phrase entries by text-to-image retrieval to form a bidirectional retrieval sorting sequence; based on the bidirectional retrieval sorting sequence, back-checking the region indices and phrase indices to generate bidirectional retrieval verification results; selecting, from the bidirectional retrieval verification results, the region-phrase indices that failed the back-check to form an inconsistent region phrase set; based on the bidirectional retrieval sorting sequence and the bidirectional retrieval verification results, performing conflict attribution on the inconsistent region phrase set to form an inconsistent region phrase set with conflict type labels; sorting the inconsistent region phrases in that labeled set by contradiction intensity to form a contradictory phrase set; performing adversarial recombination pairing on the region indices and phrase indices corresponding to the contradictory phrase set to form an adversarial pair set; and merging and organizing the adversarial pair set and the contradictory phrase set to generate the semantic contradiction set.
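The back-check step can be illustrated with a small sketch. It assumes that a region-phrase pair passes the back-check only when the phrase is the top-ranked result for the region in image-to-text retrieval and the region is the top-ranked result for the phrase in text-to-image retrieval. That top-1 rule is an interpretation, not something the text specifies, and the function name `back_check` and the `match` dictionary layout are hypothetical.

```python
def back_check(match, aligned_pairs):
    """Split aligned pairs into consistent and inconsistent sets.

    match: dict mapping (region_index, phrase_index) -> matching degree
           over all candidate pairs (hypothetical layout).
    aligned_pairs: iterable of (region_index, phrase_index) to verify.
    """
    best_phrase = {}  # image-to-text: top phrase for each region
    best_region = {}  # text-to-image: top region for each phrase
    for (region, phrase), degree in match.items():
        if region not in best_phrase or degree > match[(region, best_phrase[region])]:
            best_phrase[region] = phrase
        if phrase not in best_region or degree > match[(best_region[phrase], phrase)]:
            best_region[phrase] = region

    consistent, inconsistent = [], []
    for region, phrase in aligned_pairs:
        # Assumed rule: both retrieval directions must point back at the pair.
        passed = best_phrase.get(region) == phrase and best_region.get(phrase) == region
        (consistent if passed else inconsistent).append((region, phrase))
    return consistent, inconsistent
```

Pairs in the second list would feed conflict attribution; pairs in the first would form the consistent region phrase set.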

[0011] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, forming the inconsistent region phrase set with conflict type labels specifically comprises: reading the region index and phrase index from the inconsistent region phrase set, and retrieving the corresponding image-to-text and text-to-image retrieval sorting positions in the bidirectional retrieval sorting sequence to form sorting mapping entries; based on the sorting mapping entries and the bidirectional retrieval verification results, jointly comparing the region index and the phrase index to form mismatch judgment entries; distinguishing one-way mismatch from two-way mismatch in the mismatch judgment entries to form conflict source entries; and, based on the conflict source entries, writing conflict type labels to the inconsistent region phrase set and outputting the inconsistent region phrase set with conflict type labels.
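A minimal sketch of the one-way versus two-way distinction follows. It assumes a direction counts as mismatched when the pair is not at position 0 of that direction's sorting. The label strings, the function name `label_conflicts`, and the rank dictionaries are illustrative assumptions rather than anything defined in the text.

```python
def label_conflicts(inconsistent, i2t_position, t2i_position):
    """Attach a conflict type label to each inconsistent region-phrase pair.

    i2t_position[(region, phrase)]: 0-based position of the phrase in the
        image-to-text sorting for that region (hypothetical layout).
    t2i_position[(region, phrase)]: 0-based position of the region in the
        text-to-image sorting for that phrase.
    """
    labeled = []
    for pair in inconsistent:
        i2t_miss = i2t_position[pair] != 0
        t2i_miss = t2i_position[pair] != 0
        if not (i2t_miss or t2i_miss):
            continue  # pair matches in both directions, so it is not a conflict
        label = "two_way" if (i2t_miss and t2i_miss) else "one_way"
        labeled.append((pair, label))
    return labeled
```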

[0012] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, generating the image-text consistency model specifically comprises: filtering the bidirectional retrieval verification results and extracting the region-phrase indices that pass the back-check to form a consistent region phrase set; merging and organizing the contradictory phrase set and the adversarial pair set into a contradiction training set; arranging the consistent region phrase set and the contradiction training set in rounds to form a round scheduling sequence, and alternately scheduling the two sets according to that sequence; in the consistency phase of the round scheduling sequence, performing consistency-constraint joint training on the consistent region phrase set; in the contradiction phase of the round scheduling sequence, performing contradiction-constraint joint training on the contradiction training set; and passing the parameter state of each phase of joint training on to the next phase to generate the image-text consistency model.
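The round scheduling and parameter inheritance can be sketched as below. The sketch assumes strict one-to-one alternation that starts with a consistency phase, and it takes the actual update as a caller-supplied `train_step` function. Both choices, and all names, are assumptions made for illustration.

```python
def round_schedule(num_rounds):
    """Yield (round_number, phase), alternating the two training phases."""
    for round_number in range(num_rounds):
        phase = "consistency" if round_number % 2 == 0 else "contradiction"
        yield round_number, phase


def alternate_training(params, consistent_set, contradiction_set, num_rounds, train_step):
    """Run alternating joint training.

    train_step(params, data, phase) -> new params is a hypothetical
    caller-supplied update. Each phase starts from the parameters that
    the previous phase produced, so the parameter state is inherited.
    """
    for _, phase in round_schedule(num_rounds):
        data = consistent_set if phase == "consistency" else contradiction_set
        params = train_step(params, data, phase)
    return params
```

Passing `params` through the loop is what carries the parameter state from one phase to the next; a real training loop would put a model's weights and optimizer state there.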

[0013] As a preferred embodiment of the image understanding method for training with artificial intelligence data according to the present invention, generating the image semantic results specifically comprises: inputting the image-text set into the image-text consistency model entry by entry for consistency verification to generate a consistency response sequence; locating phrase entries in the image-text set using the consistency response sequence and parsing the phrase responses to form a set of phrase response entries; based on the set of phrase response entries, extracting region features from the image-text set and organizing them by region orientation to form a region phrase set carrying region indices; selecting region phrase entries that share a region index from the region phrase set, then connecting and organizing them and inferring their relations to form a relation inference set; and aggregating the phrase response entries, the region phrase set, and the relation inference set to generate the image semantic results.

[0014] In a second aspect, the present invention provides a computer device including a memory and a processor, wherein the memory stores a computer program, wherein: when the computer program is executed by the processor, it implements any step of the image understanding method for training artificial intelligence data as described in the first aspect of the present invention.

[0015] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements any step of the image understanding method for training with artificial intelligence data as described in the first aspect of the present invention.

[0016] The beneficial effects of this invention are as follows: consistency enhancement is achieved through bidirectional retrieval verification and alternating training. Based on bidirectional retrieval verification, conflict attribution and adversarial pairing processing are performed on inconsistent region phrases, allowing semantic contradiction sets to participate in the joint training process. This promotes a more stable cross-modal representation structure in the image-text consistency model under the alternating effects of consistency and contradiction constraints. The feature vector representations corresponding to region features and the semantic vector representations corresponding to phrase entries maintain a clear directional distinction within a unified dimensional space, explicitly characterizing semantic parallel and semantic conflict relationships. This enhances the semantic discrimination and scene understanding capabilities of image semantic results in computer vision tasks.

Description of the Drawings

[0017] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of the image understanding method for training with artificial intelligence data.

[0019] Figure 2 is a flowchart of the image-text consistency model structure.

[0020] Figure 3 is a flowchart of bidirectional retrieval verification and semantic contradiction set generation.

[0021] Figure 4 is a flowchart of generating image semantic results through phrase response parsing and relation inference.

Detailed Implementation

[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0023] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0024] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0025] Referring to Figures 1-4, one embodiment of the present invention provides an image understanding method for training with artificial intelligence data, comprising the following steps:

S1. Obtain image samples and text samples, and pair them to generate an image-text set.

[0026] S1.1. Read image and text data from the data storage location, perform format consistency and content integrity checks, and generate image and text samples, specifically: After reading image and text data from the data storage location, perform format consistency verification and content integrity verification on the image and text data; format consistency verification includes file type verification, encoding format verification, and field structure verification.

[0027] File type verification compares the image data file type with the set of image file types, and the text data file type with the set of text file types; encoding format verification compares the image data pixel encoding format with the image pixel encoding identifier, and the text data character encoding format with the text character encoding identifier; field structure verification checks whether the image data contains an image identifier field, a resolution field, and a pixel matrix field, and whether the text data contains a text identifier field and a text content field.

[0028] If any of the following three situations occur: inconsistent file type, inconsistent encoding format, or inconsistent field structure, the format consistency verification is deemed to have failed, and no image or text samples are generated.

[0029] If the file type, encoding format, and field structure are consistent, the format consistency verification is considered passed. The content integrity verification includes pixel matrix validity verification and text content validity verification. The pixel matrix validity verification checks that the pixel matrix fields have valid pixel values and meet the preset image size condition. The text content validity verification checks that the text content fields have valid characters and meet the preset text length condition. If the pixel matrix validity verification or the text content validity verification fails, the content integrity verification is considered failed and no image sample or text sample is generated. If both the pixel matrix validity verification and the text content validity verification pass, the content integrity verification is considered passed and image samples and text samples are generated.

[0030] It should be noted that the image size condition is set based on the overall distribution of the resolution field in the image samples, and is set to 20% of the mean of the resolution field. The basis for this value is that there is a clear statistical boundary between the low value range and the main concentration range in the overall distribution of the resolution field. A stable inflection point is formed at 20% of the mean in the distribution curve, which can serve as a statistical boundary value to distinguish between low-resolution abnormal samples and main samples.

[0031] The text length condition is set based on the overall distribution of the text length field in the text samples. The value is 10% of the mean of the text length field. The basis for this value is that the overall distribution of the text length field shows a sparse tail feature in the low value range, and a statistical turning point in the length distribution is formed near 10% of the mean, which can reflect the distribution boundary between low-length samples and the main samples.
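The two content-integrity thresholds above can be sketched as follows. This is a minimal illustration that assumes the resolution field is a single number per image (for example, a pixel count) and that "meeting the condition" means being at or above the threshold. The dictionary keys and function names are hypothetical.

```python
from statistics import mean


def content_thresholds(resolutions, text_lengths):
    """Return (minimum resolution, minimum text length).

    Uses 20% of the mean resolution and 10% of the mean text length,
    as stated in the description.
    """
    return 0.20 * mean(resolutions), 0.10 * mean(text_lengths)


def filter_samples(images, texts):
    """Drop image and text records that fail the content-integrity check.

    images: list of dicts with hypothetical keys "id" and "resolution".
    texts:  list of dicts with hypothetical keys "id" and "content".
    """
    min_resolution, min_length = content_thresholds(
        [image["resolution"] for image in images],
        [len(text["content"]) for text in texts],
    )
    kept_images = [image for image in images if image["resolution"] >= min_resolution]
    kept_texts = [text for text in texts if len(text["content"]) >= min_length]
    return kept_images, kept_texts
```

Because both thresholds are relative to the mean, they adapt to each dataset, but a few extreme outliers can shift them noticeably.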

[0032] S1.2. Based on the sample identifiers in the image and text samples, perform association processing on the image and text samples to generate a set of paired entries, specifically: The process involves reading sample identifiers from image samples one by one and searching for identical sample identifiers in text samples. Image samples and text samples with matching sample identifiers are then matched one-to-one to form an association record. When a sample identifier corresponds to a unique image sample and a unique text sample, a valid association record is generated. When a sample identifier corresponds to multiple image samples or multiple text samples, it is determined to be a duplicate sample identifier and no association record is generated. When a sample identifier in an image sample is not matched with a corresponding sample identifier in a text sample, or vice versa, it is determined to be a missing sample identifier and no association record is generated. All valid association records are then collected and organized to generate a set of paired entries.
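The association rules in S1.2 (unique match, duplicate identifier, missing identifier) can be sketched as follows. The record layout with an "id" key and the function name `pair_samples` are assumptions made for illustration.

```python
from collections import Counter


def pair_samples(images, texts):
    """Return valid association records as (image, text) tuples.

    A record is produced only when the sample identifier occurs exactly
    once among the images and exactly once among the texts. Duplicate
    or missing identifiers produce no record.
    """
    image_counts = Counter(image["id"] for image in images)
    text_counts = Counter(text["id"] for text in texts)
    texts_by_id = {text["id"]: text for text in texts}
    return [
        (image, texts_by_id[image["id"]])
        for image in images
        # Counter returns 0 for a missing identifier, so this one test
        # covers both the duplicate and the missing cases.
        if image_counts[image["id"]] == 1 and text_counts[image["id"]] == 1
    ]
```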

[0033] S1.3. Expand each pairing entry in the set of paired entries and perform an identifier consistency check on the pairing identifiers in the paired entries to form a set of verified paired entries, specifically as follows: Read the image sample identifier and text sample identifier from each pairing entry, and compare the image sample identifier and text sample identifier as pairing identifiers. When the image sample identifier and text sample identifier are completely consistent in character content, character order and encoding representation, the pairing identifiers are determined to be consistent and the corresponding pairing entry is registered as a valid pairing entry. When the image sample identifier and text sample identifier have differences in character content, character order or encoding representation, the pairing identifiers are determined to be inconsistent and the corresponding pairing entry is marked as an invalid pairing entry. Invalid pairing entries are not included in the subsequent collection. All valid pairing entries determined to be consistent in pairing identifiers are collected to form a set of verification pairing entries.

[0034] S1.4. Merge and organize the set of verified paired entries to generate an image-text set; here, merging and organizing means deduplicating the valid paired entries in the set of verified paired entries by sample identifier, arranging them in ascending order of the character sequence of the image sample identifier, and then encapsulating each image sample with its text sample in one-to-one correspondence to output the image-text set.
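Steps S1.3 and S1.4 together amount to an exact identifier comparison followed by deduplication and sorting. The sketch below assumes identifiers are Python strings, so that `!=` compares code points and treats differently encoded forms of the same visible text as inconsistent. The tuple layout and function name are hypothetical.

```python
def build_image_text_set(paired_entries):
    """Verify, deduplicate, and order paired entries.

    paired_entries: iterable of (image_id, text_id, image, text) tuples.
    Returns a list of (sample_id, image, text), sorted by sample_id.
    """
    seen = set()
    valid = []
    for image_id, text_id, image, text in paired_entries:
        if image_id != text_id:
            continue  # inconsistent pairing identifiers: invalid entry
        if image_id in seen:
            continue  # keep only the first entry for each identifier
        seen.add(image_id)
        valid.append((image_id, image, text))
    valid.sort(key=lambda entry: entry[0])  # ascending character order
    return valid
```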

[0035] S2. Extract region features and phrase entries from the image text set, and perform similarity ranking to generate a fine-grained alignment set.

[0036] S2.1. Expand each image-text pair in the image-text set, and extract region candidates from the image samples in the image-text pairs to form a region candidate set; extract region features from the region candidate set and aggregate them into a region feature set, specifically: The image samples are resized to a uniform size to form a standardized image matrix. The feature map of the candidate region generation network is computed on the standardized image matrix, and a set of anchor boxes is generated on that feature map. Each anchor box in the set is labeled with a target existence probability, and its boundary position is corrected. Anchor boxes that have the highest target existence probability and whose intersection over union (IoU) is lower than the preset IoU threshold are retained as valid candidate boxes. A region index and boundary coordinates are registered for each valid candidate box, and all valid candidate boxes are collected to form the region candidate set.
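The retention rule for anchor boxes reads like greedy non-maximum suppression over the overlap ratio between boxes (intersection over union, IoU). The sketch below shows that reading; it is an interpretation, and the box format `(x1, y1, x2, y2)` and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_candidates(boxes, scores, iou_threshold):
    """Greedy suppression over anchor boxes.

    Visits boxes from highest to lowest target existence probability and
    keeps a box only if its IoU with every box kept so far is below the
    threshold. Returns (region_index, box) pairs.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < iou_threshold for k in kept):
            kept.append(i)
    return [(region_index, boxes[i]) for region_index, i in enumerate(kept)]
```

Production detectors usually use a vectorized routine for this step; the loop here only makes the rule explicit.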

[0037] Extracting region features from the region candidate set involves: cropping the corresponding region sub-matrix from the standardized image matrix based on the boundary coordinates in the region candidate set; performing size unification processing on the region sub-matrix and inputting it into the feature mapping result layer of the candidate region generation network; reading the feature mapping vector at the corresponding position and compressing its spatial dimension to form the feature vector representation corresponding to the region feature; and binding and registering the region index with the feature vector representation corresponding to the region feature. All bound and registered region indices and feature vector representations corresponding to the region features are collected to form the region feature set.

[0038] The training process of the candidate region generation network is as follows: using image samples containing target boundary annotation information as training input, the anchor boxes output by the candidate region generation network are matched with the labeled boundaries. The successfully matched anchor boxes are registered as positive sample anchor boxes, and the anchor boxes that fail to match and whose intersection over union (IoU) is lower than the preset IoU threshold are registered as negative sample anchor boxes. The weights of the convolutional layers in the candidate region generation network are adjusted through parameter update operators so that the target existence probability of the positive sample anchor boxes tends toward a higher value and the target existence probability of the negative sample anchor boxes tends toward a lower value. At the same time, the boundary positions are corrected until the anchor box positions output by the candidate region generation network and the labeled boundary positions settle into a stable distribution, which completes the training of the candidate region generation network.

[0039] The preset intersection over union (IoU) threshold is set based on the central tendency of the IoU distribution between anchor boxes and labeled boundaries in the region candidate set. Its value is one IoU standard deviation above the mean of the IoU distribution. The basis for this value is that the IoU distribution is concentrated near the mean, and the point one standard deviation above the mean usually corresponds to the statistical turning point where the distribution curve leaves the main concentrated area for the high-value tail. Choosing that point therefore reflects the natural boundary within the IoU distribution.

[0040] S2.2. Perform phrase boundary segmentation on the text samples in the image-text pair and attach phrase indices to form a set of phrase entries, specifically: The text content field in the text sample is read and divided into continuous character sequences at punctuation marks and spaces. Semantically continuous segments composed of nouns, adjectives and verbs are identified in each continuous character sequence, and each such segment is taken as a phrase. When a continuous character sequence contains no combination of nouns, adjectives and verbs, no phrase boundary is formed. After phrase boundary segmentation is complete, a unique phrase index is assigned to each phrase according to its order of appearance in the text content field. Each phrase is bound and registered with its phrase index to form a set of phrase entries composed of phrases and phrase indices.
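A toy sketch of the segmentation step follows. Identifying nouns, adjectives and verbs needs a part-of-speech tagger, which is out of scope here, so the sketch substitutes a small hand-written list of function words and treats every other word as a content word. That substitution, the word list, and the function name are all assumptions for illustration.

```python
import re

# Toy stand-in for a part-of-speech tagger: any word not listed here is
# treated as a noun, adjective, or verb. The list is hypothetical.
FUNCTION_WORDS = {"a", "an", "the", "on", "in", "at", "of", "and", "with", "to", "is"}


def segment_phrases(text):
    """Return phrase entries as (phrase_index, phrase), in order of appearance."""
    phrases = []
    # Punctuation closes a segment; spaces separate the words inside it.
    for chunk in re.split(r"[.,;:!?]+", text):
        run = []
        for word in chunk.split():
            if word.lower() in FUNCTION_WORDS:
                if run:
                    phrases.append(" ".join(run))
                    run = []
            else:
                run.append(word)
        if run:
            phrases.append(" ".join(run))
    return list(enumerate(phrases))
```

For example, `segment_phrases("A brown dog runs on the green grass.")` yields two entries, "brown dog runs" and "green grass", indexed 0 and 1.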

[0041] S2.3. Based on the set of phrase entries, perform phrase-guided matching on the region feature set to form associated index pairs between region indices and phrase indices, specifically as follows: Phrases and phrase indices are extracted from the phrase entry set, and region features and region indices are extracted from the region feature set. The semantic representation of each phrase is compared with the feature representation of each region feature. When the matching degree reaches the matching threshold, the corresponding region index and phrase index are bound and registered as an associated index pair. When the matching degree does not reach the matching threshold, no binding or registration is performed. All bound and registered region indices and phrase indices are gathered to form the associated index pairs.

[0042] It should be noted that the matching threshold is set based on the matching degree distribution in the fine-grained alignment set, and is set to one standard deviation of matching degree above the mean matching degree, which is used to distinguish between the main matching interval and the low matching interval.
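A minimal sketch of threshold-based association is given below. It assumes the matching degrees for all candidate region-phrase pairs have already been computed, that the threshold statistics are taken over those candidates, and that the population standard deviation is the intended one. None of these details are fixed by the text, and the names are hypothetical.

```python
from statistics import mean, pstdev


def associate(match):
    """Bind region and phrase indices whose matching degree reaches the threshold.

    match: dict mapping (region_index, phrase_index) -> matching degree.
    Returns (threshold, sorted list of associated index pairs).
    """
    values = list(match.values())
    # One standard deviation above the mean matching degree.
    threshold = mean(values) + pstdev(values)
    pairs = sorted(pair for pair, degree in match.items() if degree >= threshold)
    return threshold, pairs
```

With a mean-plus-one-standard-deviation cut, only a small share of candidates is typically retained, which suits a fine-grained alignment step but can discard borderline matches.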

[0043] S2.4. Perform similarity ranking on the region features and phrase entries in the associated index pairs, and aggregate the results of the similarity ranking to generate a fine-grained alignment set; whereby, similarity ranking refers to the ordered arrangement of region features and phrase entries in the associated index pairs according to their degree of matching, specifically as follows: The system reads region indexes, region features, phrase indexes, and phrase entries from the associated index pairs. A matching degree sequence is formed based on the matching degree between region features and phrase entries. Multiple region indexes corresponding to the same phrase index are arranged in descending order of matching degree, and similarly, multiple phrase indexes corresponding to the same region index are arranged in descending order of matching degree. After this bidirectional arrangement, region and phrase indexes at the top of the ranking are retained as valid alignment entries, while those at the bottom are marked as low-matching entries and not included in the valid alignment entries. All valid alignment entries are then collected and organized to form a fine-grained alignment set. Specifically, "at the top of the ranking" refers to a combination of region and phrase indexes whose matching degree values are within one standard deviation above the mean of the matching degree sequence; "at the bottom of the ranking" refers to a combination of region and phrase indexes whose matching degree values are within one standard deviation below the mean of the matching degree sequence.
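The bidirectional arrangement can be sketched as two groupings of the same matching-degree table, each sorted in descending order. The sketch covers only the ordering and leaves out the retention of top-ranked entries. The `match` layout and function name are assumptions.

```python
from collections import defaultdict


def bidirectional_rank(match):
    """Order regions per phrase and phrases per region by matching degree.

    match: dict mapping (region_index, phrase_index) -> matching degree.
    Returns (regions_by_phrase, phrases_by_region); each maps an index to
    the list of counterpart indices in descending order of matching degree.
    """
    regions_by_phrase = defaultdict(list)
    phrases_by_region = defaultdict(list)
    for (region, phrase), degree in match.items():
        regions_by_phrase[phrase].append((degree, region))
        phrases_by_region[region].append((degree, phrase))

    def order(groups):
        return {
            key: [index for _, index in sorted(items, key=lambda item: -item[0])]
            for key, items in groups.items()
        }

    return order(regions_by_phrase), order(phrases_by_region)
```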

[0044] The expression for the matching degree is:

$$M = \frac{\mathbf{v} \cdot \mathbf{t}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{t} \rVert}$$

In the formula, $M$ is the matching degree, a dimensionless real number ranging from -1 to +1, used to represent the degree of directional consistency between the feature vector representation corresponding to the region features and the semantic vector representation corresponding to the phrase entries; $\mathbf{v}$ is the feature vector representation corresponding to the region features, a multidimensional vector composed of multiple dimensionless real feature values; $\mathbf{t}$ is the semantic vector representation corresponding to the phrase entries, a multidimensional vector composed of multiple dimensionless real feature values; $\mathbf{v} \cdot \mathbf{t}$ is the vector inner product and $\lVert \cdot \rVert$ is the Euclidean norm.

[0045] It should be noted that the expression for the matching degree is based on cosine similarity, which measures the degree of directional consistency between two vectors. Cosine similarity is a mature and widely used conventional technique in the task of aligning image feature representation with semantic representation. The influence of vector length differences on the matching result is eliminated by the proportional relationship between the vector inner product and the vector Euclidean norm, so that the matching degree only reflects the directional consistency between the feature vector representation corresponding to the region feature and the semantic vector representation corresponding to the phrase entry. Therefore, the matching degree is a dimensionless real number.
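
As a minimal sketch of the cosine-similarity matching degree described above, the following function divides the inner product of the two vectors by the product of their Euclidean norms. The function name `matching_degree` and the handling of zero vectors (raising an error) are assumptions made for illustration only.

```python
import math


def matching_degree(region_vec, phrase_vec):
    """Cosine similarity between a region feature vector and a phrase vector."""
    dot = sum(a * b for a, b in zip(region_vec, phrase_vec))
    norm_r = math.sqrt(sum(a * a for a in region_vec))
    norm_p = math.sqrt(sum(b * b for b in phrase_vec))
    if norm_r == 0.0 or norm_p == 0.0:
        # Direction is undefined for a zero vector (assumed handling).
        raise ValueError("matching degree is undefined for a zero vector")
    return dot / (norm_r * norm_p)
```

Scaling either vector by a positive factor leaves the result unchanged, which is the sense in which the influence of vector length is eliminated.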

[0046] S3. Perform bidirectional retrieval verification on the fine-grained alignment set, and perform conflict attribution and contradiction intensity ranking on the inconsistent region phrases in the bidirectional retrieval verification. Based on the contradiction intensity ranking, perform adversarial pairing processing on the inconsistent region phrases to generate a semantic contradiction set.

[0047] S3.1 Extract alignment entries from the fine-grained alignment set, and organize the alignment entries to form a bidirectional retrieval verification sequence, specifically as follows: Each alignment entry in the fine-grained alignment set is organized into a unified search entry format. The search entry format includes four fields: region index, region feature, phrase index, and phrase entry. The region index and phrase index are registered as search locator keys, the region feature is registered as the search input for image-to-text search sorting, and the phrase entry is registered as the search input for text-to-image search sorting. After the search entry format is organized, the search entry formats are arranged in ascending order according to the character sequence of the region index, and within the same region index, they are arranged in ascending order according to the character sequence of the phrase index. The arranged search entry sequence is the bidirectional search verification sequence.

[0048] S3.2. The region features in the bidirectional retrieval verification sequence are sorted by image-to-text retrieval, and the phrase entries in the bidirectional retrieval verification sequence are sorted by text-to-image retrieval, forming a bidirectional retrieval sorting sequence, specifically as follows: In the bidirectional retrieval verification sequence, using region features as retrieval input, the matching degree between the region features and all phrase entries in the bidirectional retrieval verification sequence is calculated, and they are arranged in descending order of matching degree to obtain the image-to-text retrieval ranking result corresponding to each region index. Simultaneously, using phrase entries as retrieval input, the matching degree between the phrase entries and all region features in the bidirectional retrieval verification sequence is calculated, and they are arranged in descending order of matching degree to obtain the text-to-image retrieval ranking result corresponding to each phrase index. The image-to-text retrieval ranking result and the text-to-image retrieval ranking result are then registered according to the region index and phrase index to form a bidirectional retrieval ranking sequence.

[0049] S3.3. Based on the bidirectional retrieval sorting sequence, perform a back-check verification of the region index and phrase index to generate bidirectional retrieval verification results, specifically as follows: In the image-to-text retrieval ranking results, determine the first phrase index corresponding to each region index, and find the ranking position corresponding to the first phrase index in the text-to-image retrieval ranking results. When the first region index of the phrase index in the text-to-image retrieval ranking results is consistent with the original region index, it is determined that the region index and the phrase index have passed the back-check verification; when the first region index of the phrase index in the text-to-image retrieval ranking results is inconsistent with the original region index, it is determined that the region index and the phrase index have failed the back-check verification.

[0050] All region indexes and phrase indexes that pass the back-check verification are registered as consistent region index pairs, and all region indexes and phrase indexes that fail the back-check verification are registered as inconsistent region index pairs. Consistent and inconsistent region index pairs are then combined to form a bidirectional retrieval verification result.
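
The back-check verification of S3.2 and S3.3 can be illustrated by the following sketch, which ranks in both directions and tests whether the first-ranked phrase of a region points back to the same region. The dictionary layout of `match`, the function names, and the tie-breaking by insertion order are assumptions of the sketch rather than features of the embodiment.

```python
def rank_descending(scores):
    """Return the keys of a {key: matching degree} dict, best match first."""
    return sorted(scores, key=lambda k: scores[k], reverse=True)


def back_check(match, regions, phrases):
    """Split (region, phrase) pairs into consistent and inconsistent lists.

    match maps (region_index, phrase_index) -> matching degree.
    """
    consistent, inconsistent = [], []
    for r in regions:
        # Image-to-text direction: first-ranked phrase for this region.
        top_phrase = rank_descending({p: match[(r, p)] for p in phrases})[0]
        # Text-to-image direction: first-ranked region for that phrase.
        top_region = rank_descending({q: match[(q, top_phrase)] for q in regions})[0]
        target = consistent if top_region == r else inconsistent
        target.append((r, top_phrase))
    return consistent, inconsistent


# Hypothetical matching degrees.
match = {
    ("r1", "p1"): 0.9, ("r1", "p2"): 0.2,
    ("r2", "p1"): 0.3, ("r2", "p2"): 0.8,
    ("r3", "p1"): 0.6, ("r3", "p2"): 0.4,
}
consistent, inconsistent = back_check(match, ["r1", "r2", "r3"], ["p1", "p2"])
```

In this example region r3 prefers phrase p1, but p1 prefers region r1, so the pair (r3, p1) fails the back-check.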

[0051] S3.4. From the bidirectional retrieval verification results, select the region-phrase indexes that failed the back-check verification to form an inconsistent region phrase set, specifically: The region indexes and phrase indexes that failed the back-check verification are used as the filtering objects and are collected according to the correspondence between the region indexes and phrase indexes. Each group of region index and phrase index that failed the back-check verification is associated with the corresponding region feature and phrase entry and registered together to form a set structure consisting of region indexes, phrase indexes, region features and phrase entries. After all the region indexes and phrase indexes that failed the back-check verification have been associated and registered, they are collected into an inconsistent region phrase set.

[0052] S3.5. Based on the bidirectional retrieval sorting sequence and bidirectional retrieval verification results, conflict attribution is performed on the inconsistent region phrase set to form an inconsistent region phrase set with conflict type labels, specifically: The region index and phrase index in the inconsistent region phrase set are mapped to the image-to-text retrieval sorting position and the text-to-image retrieval sorting position in the bidirectional retrieval sorting sequence. When the region index is first in the image-to-text retrieval sorting position but the corresponding phrase index is not first in the text-to-image retrieval sorting position, it is determined to be a text-to-image mismatch type in a one-way mismatch state; when the phrase index is first in the text-to-image retrieval sorting position but the corresponding region index is not first in the image-to-text retrieval sorting position, it is determined to be an image-to-text mismatch type in a one-way mismatch state; when the region index is not first in the image-to-text retrieval sorting position and the corresponding phrase index is not first in the text-to-image retrieval sorting position, it is determined to be a two-way mismatch state.

[0053] Based on the one-way mismatch state and the two-way mismatch state, write the corresponding conflict type flags into the inconsistent region phrase set to form an inconsistent region phrase set with conflict type flags.

[0054] S3.6. Read the region index and phrase index from the inconsistent region phrase set, and simultaneously retrieve the corresponding image-to-text retrieval sorting position and text-to-image retrieval sorting position in the bidirectional retrieval sorting sequence to form sorting mapping entries, specifically: In the inconsistent region phrase set, each set of region index and phrase index is determined, and the image-to-text retrieval ranking position corresponding to the region index is located in the bidirectional retrieval ranking sequence. At the same time, the text-to-image retrieval ranking position corresponding to the phrase index is located. The region index, phrase index, image-to-text retrieval ranking position and text-to-image retrieval ranking position are structured and combined and registered to form a ranking mapping entry containing four fields: region index, phrase index, image-to-text retrieval ranking position and text-to-image retrieval ranking position. All ranking mapping entries are collected to form a ranking mapping entry set.

[0055] S3.7. Based on the sorting mapping entries and the bidirectional retrieval verification results, perform a joint comparison of the region index and the phrase index to form mismatch judgment entries, specifically: The region index and phrase index in the sorting mapping entry are matched and verified against consistent and inconsistent region index pairs in the bidirectional retrieval verification results. When the region index and phrase index in the sorting mapping entry are registered as inconsistent region index pairs in the bidirectional retrieval verification results and the image-to-text retrieval sorting position or text-to-image retrieval sorting position is not at the first position, it is determined to be a mismatch state and a corresponding mismatch judgment entry is generated. When the region index and phrase index in the sorting mapping entry are registered as consistent region index pairs in the bidirectional retrieval verification results and both the image-to-text retrieval sorting position and text-to-image retrieval sorting position are at the first position, it is determined to be a consistent state and no mismatch judgment entry is generated. All sorting mapping entries determined to be mismatched are structurally registered to form a mismatch judgment entry set.

[0056] S3.8. Distinguish between one-way and two-way mismatch states using mismatch determination entries, forming conflict source entries, specifically: Based on the image-to-text retrieval ranking position and text-to-image retrieval ranking position in the mismatch judgment entries, the following states are distinguished: when the image-to-text retrieval ranking position is first and the text-to-image retrieval ranking position is not first, it is judged as a one-way mismatch state of text-to-image mismatch; when the text-to-image retrieval ranking position is first and the image-to-text retrieval ranking position is not first, it is judged as a one-way mismatch state of image-to-text mismatch; when neither the image-to-text retrieval ranking position nor the text-to-image retrieval ranking position is first, it is judged as a two-way mismatch state. Based on the judgment results, a corresponding conflict source marker is written for each mismatch judgment entry, forming a conflict source entry set.
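
The state distinction of S3.8 reduces to a small decision rule over the two ranking positions, sketched below. The sketch assumes 1-based ranking positions in which 1 denotes first place; the function name and the label strings are hypothetical.

```python
def conflict_source(i2t_position, t2i_position):
    """Classify a ranking mapping entry from its two 1-based ranking positions."""
    i2t_first = i2t_position == 1
    t2i_first = t2i_position == 1
    if i2t_first and t2i_first:
        return "consistent"  # no mismatch judgment entry is generated
    if i2t_first:
        return "one-way: text-to-image mismatch"
    if t2i_first:
        return "one-way: image-to-text mismatch"
    return "two-way mismatch"
```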

[0057] S3.9. Based on the conflict source entries, write conflict type tags to the inconsistent region phrase set, and output the inconsistent region phrase set with conflict type tags; sort the inconsistent region phrases in the inconsistent region phrase set with conflict type tags according to their contradiction intensity to form a contradictory phrase set, specifically: Based on the image-to-text retrieval sorting position and the text-to-image retrieval sorting position in the sorting mapping entries, inconsistent region phrases that are first in neither the image-to-text nor the text-to-image retrieval ranking are marked as high contradiction intensity, while inconsistent region phrases that are first in one of the two rankings are marked as medium contradiction intensity. They are then arranged with high contradiction intensity first, followed by medium contradiction intensity. Within the same contradiction intensity level, they are arranged in ascending order of matching degree. The sorted set of inconsistent region phrases is the contradictory phrase set.
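
The contradiction intensity ranking can be expressed as a two-level sort key, as in the following sketch. It assumes each entry carries 1-based ranking positions and a matching degree under the hypothetical field names `i2t_pos`, `t2i_pos` and `degree`; the `name` field exists only to make the example readable.

```python
def sort_by_contradiction_intensity(entries):
    """Order entries: high intensity first, then ascending matching degree."""
    def key(entry):
        # High intensity: first in neither retrieval ranking.
        high = entry["i2t_pos"] != 1 and entry["t2i_pos"] != 1
        return (0 if high else 1, entry["degree"])
    return sorted(entries, key=key)


# Hypothetical inconsistent region phrases.
entries = [
    {"name": "a", "i2t_pos": 1, "t2i_pos": 3, "degree": 0.5},
    {"name": "b", "i2t_pos": 2, "t2i_pos": 2, "degree": 0.4},
    {"name": "c", "i2t_pos": 3, "t2i_pos": 4, "degree": 0.1},
    {"name": "d", "i2t_pos": 2, "t2i_pos": 1, "degree": 0.3},
]
ordered = [e["name"] for e in sort_by_contradiction_intensity(entries)]
```

Here the two high-intensity entries (c, b) come first in ascending order of matching degree, followed by the medium-intensity entries (d, a).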

[0058] S3.10. Perform adversarial recombination pairing on the region index and phrase index corresponding to the contradictory phrase set to form an adversarial pair set, specifically as follows: Each pair of region indices and phrase indices in the contradictory phrase set is retained as the original pairing relationship. At the same time, different region indices are selected from the contradictory phrase set and their corresponding phrase indices are cross-combined to form new pairing relationships, such that the same region index corresponds to different phrase indices and the same phrase index corresponds to different region indices. During the cross-combination process, the region indexes and phrase indices are kept from being registered repeatedly. All original pairing relationships and the new pairing relationships after cross-combination are registered in a structured manner to form a set of adversarial pairs containing multiple combinations of region indexes and phrase indices.
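
One possible reading of the adversarial recombination pairing is sketched below: the original pairs are retained, and every region index is crossed with every phrase index that it was not originally paired with, while a set guards against repeated registration. Taking the full cross product is an assumption of this sketch, since the embodiment does not state how many cross-combinations are formed; the function name is hypothetical.

```python
from itertools import product


def adversarial_pairs(contradictory_pairs):
    """Return (original pairs, cross-combined pairs) without duplicates."""
    originals = list(dict.fromkeys(contradictory_pairs))  # dedupe, keep order
    seen = set(originals)
    regions = list(dict.fromkeys(r for r, _ in originals))
    phrases = list(dict.fromkeys(p for _, p in originals))
    crossed = []
    for pair in product(regions, phrases):
        if pair not in seen:  # never register the same combination twice
            seen.add(pair)
            crossed.append(pair)
    return originals, crossed


originals, crossed = adversarial_pairs([("r3", "p1"), ("r4", "p2")])
```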

[0059] S3.11. Combine and organize the adversarial pair set and the contradictory phrase set to generate a semantic contradiction set.

[0060] S4. Jointly train the semantic contradiction set and the consistent regional phrases in the bidirectional retrieval verification by alternating scheduling to generate the image-text consistency model. Then, use the image-text consistency model to perform phrase response parsing and relation inference on the image-text set to generate image semantic results.

[0061] S4.1. Filter the bidirectional retrieval verification results, extract the region-phrase indexes that have passed the back-check verification, and form a consistent region phrase set, specifically: Using consistent region index pairs as the filtering object, the region indexes and phrase indexes that have passed the back-check verification are aggregated according to their correspondence. Each pair of region indexes and phrase indexes, along with their corresponding region features and phrase entries, are structurally registered to form a stable mapping relationship between region indexes, phrase indexes, region features, and phrase entries. After all the region indexes and phrase indexes that have passed the back-check verification have completed the mapping registration, they are aggregated into a consistent region phrase set. The contradictory phrase set and the adversarial pair set are merged and organized into a contradictory training set.

[0062] S4.2. Arrange the consistent region phrase set and the contradictory training set in rounds to form a round scheduling sequence, and alternately schedule them between the consistent region phrase set and the contradictory training set according to the round scheduling sequence, specifically as follows: The consistent region phrase set and the contradictory training set are registered as two independent training subsequences, and arranged alternately according to a preset number of rounds. This ensures that the consistent region phrase set and the contradictory training set appear alternately in the round scheduling sequence. The arrangement rule of the round scheduling sequence is that the consistent region phrase set is in the first round and the contradictory training set is in the second round. The consistent region phrase set and the contradictory training set are alternately arranged until all rounds are registered. The list of the registered sequences is the round scheduling sequence. During the execution phase, the consistent region phrase set and the contradictory training set are called in the training process in the order of the round scheduling sequence, thereby realizing the alternating scheduling between the consistent region phrase set and the contradictory training set.
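
The round scheduling sequence can be sketched as a simple alternating list in which the first round is a consistency round and the second a contradiction round. The function name and the string labels are hypothetical; the preset number of rounds is passed in as a parameter.

```python
def round_schedule(num_rounds):
    """Alternate the two training subsequences, starting with the consistent set."""
    return ["consistent" if i % 2 == 0 else "contradictory"
            for i in range(num_rounds)]


schedule = round_schedule(5)
```

During execution, a training loop would iterate over `schedule` and draw mini-batches from whichever set the current round names.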

[0063] It should be noted that the preset number of rounds is set based on the trend of the convergence curve of the loss value of the consistent region phrase set and the contradictory training set during the training of the image-text consistency model. The value is the round corresponding to when the decrease in loss value is less than 10% of the initial decrease.
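
One way to read the 10% criterion is sketched below. The sketch assumes that `losses[0]` is the loss before training and `losses[k]` is the loss after round k, that the "initial decrease" is the drop achieved in the first round, and that the preset number of rounds is the first round whose drop falls below 10% of that initial drop; these interpretations and the function name are assumptions, not statements of the embodiment.

```python
def preset_round_count(losses, ratio=0.10):
    """Return the first round whose loss decrease is below ratio * initial decrease.

    Returns None when the criterion is never met in the recorded losses.
    """
    if len(losses) < 2:
        return None
    initial_decrease = losses[0] - losses[1]
    for k in range(2, len(losses)):
        if losses[k - 1] - losses[k] < ratio * initial_decrease:
            return k
    return None


# Hypothetical loss curve.
rounds = preset_round_count([1.0, 0.6, 0.4, 0.3, 0.27, 0.26])
```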

[0064] S4.3. In the consistency phase of the round scheduling sequence, perform consistency constraint joint training on the consistent region phrase set; in the contradiction phase of the round scheduling sequence, perform contradiction constraint joint training on the contradictory training set. Then, sequentially connect the parameter states of the consistency constraint joint training and the contradiction constraint joint training to generate the image-text consistency model. Specifically: During the consistency phase of the round scheduling sequence, the region features and phrase entries in the consistent region phrase set are input into the image-text consistency model, so that the feature vector representations corresponding to the region features and the semantic vector representations corresponding to the phrase entries tend toward a high matching degree. The image encoding network weights, text encoding network weights, feature projection layer weights, and matching degree output layer parameters are adjusted through the parameter update operator so that the region indexes and phrase indexes in the consistent region phrase set form a stable high matching degree relationship.

[0065] "Tending toward a high matching degree" means that the matching degree lies in the interval more than one standard deviation of matching degree above the center position of the matching degree distribution, that is, above the main concentration interval. It is used to indicate that there is a stable directional consistency between the feature vector representation corresponding to the region feature and the semantic vector representation corresponding to the phrase entry.

[0066] In the contradiction phase of the round scheduling sequence, the adversarial pair set in the contradictory training set is input into the image-text consistency model, causing the region indexes and phrase indexes in the adversarial pair set to tend toward a low matching degree. The image encoding network weights, text encoding network weights, feature projection layer weights, and matching degree output layer parameters are further adjusted through the parameter update operator so that the region indexes and phrase indexes in the contradictory training set form a stable low matching degree relationship. The image encoding network weights, text encoding network weights, feature projection layer weights, and matching degree output layer parameters formed at the end of the consistency constraint joint training serve as the initial parameter state for the contradiction constraint joint training. The same parameters formed at the end of the contradiction constraint joint training serve as the initial parameter state for the next round of consistency constraint joint training. After all rounds of the round scheduling sequence have been completed, the image-text consistency model is obtained.

[0067] "Tending toward a low matching degree" means that the matching degree lies on the lower side of the matching degree distribution, more than one standard deviation of matching degree below the center position, that is, below the main concentration interval. This is used to indicate that there is a significant directional deviation between the feature vector representation corresponding to the region feature and the semantic vector representation corresponding to the phrase entry.

[0068] It should be noted that sequential succession (the sequential connection of parameter states) means that the model parameters obtained after the previous stage of training are used as the initial parameters for the next stage of training, where they continue to be optimized and updated. After all stages of the scheduling have been completed, the final converged model parameters are formed, thus generating the image-text consistency model.

[0069] S4.3.1. The training process of the image-text consistency model is as follows: The image-text consistency model adopts a dual-encoder structure, which includes an image encoding network, a text encoding network, a feature projection layer, and a matching degree output layer. The image encoding network receives region features and outputs the feature vector representations corresponding to the region features. The text encoding network receives phrase entries and outputs the semantic vector representations corresponding to the phrase entries. The feature projection layer maps the feature vector representations corresponding to the region features and the semantic vector representations corresponding to the phrase entries into a unified dimensional space. The matching degree output layer outputs the matching degree based on cosine similarity.

[0070] The number of layers in the image encoding network is set based on the spatial structural complexity of the region features. An example value of twelve blocks is used, chosen to ensure a stable distribution of the feature vector representations corresponding to the region features while maintaining feature expressiveness. The number of layers in the text encoding network is set based on the semantic structure depth of the phrase entries. An example value of twelve encoding layers is used, chosen to form stable semantic vector representations after multi-layer semantic mapping. The number of feature projection layers is set based on the need for cross-modal feature alignment. An example value of two fully connected layers is used, chosen to achieve linear alignment between the feature vector representations corresponding to the region features and the semantic vector representations corresponding to the phrase entries through two-layer mapping. The projection dimension is set by balancing the expressiveness of the region feature dimension against that of the semantic dimension of the phrase entries. An example value of 512 dimensions is used, chosen to ensure that the feature vector representations corresponding to the region features and the semantic vector representations corresponding to the phrase entries maintain sufficient expressiveness and computational stability within the same dimensional space.

[0071] The parameters of the image-text consistency model include the weights of the image encoding network, the weights of the text encoding network, the weights of the feature projection layer, and the parameters of the matching degree output layer. During the training phase, a mini-batch training method is used to organize the sample input of the consistent region phrase set and the contradictory training set. The parameter update operator is used to maintain the sequential succession of the image encoding network weights, text encoding network weights, feature projection layer weights, and matching degree output layer parameters. The momentum parameter of the parameter update operator is set to 90% in an example, based on balancing the historical gradient direction and the current gradient direction during the update process. The weight decay is set to 1% in an example, based on limiting the excessive fluctuation of the image encoding network weights and text encoding network weights during training.
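
The embodiment names only a "parameter update operator" with a momentum parameter and a weight decay. As an illustration, the following sketch assumes a conventional stochastic gradient descent step with momentum, with the weight decay folded into the gradient; the choice of this particular update rule, the learning rate `lr`, and the function name are all assumptions of the sketch.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.01):
    """One assumed update step; returns (new_weights, new_velocity)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay limits weight fluctuation
        v = momentum * v + g       # blend historical and current gradient direction
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


weights, velocity = sgd_momentum_step([1.0], [0.5], [0.0])
```

Carrying both `weights` and `velocity` from one round into the next corresponds to retaining the parameter states and the parameter update operator states between rounds.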

[0072] In the consistent constraint joint training phase, positive correspondences are constructed using the consistent region phrase set, and negative correspondences are constructed using the intra-batch non-correspondences. In the contradictory constraint joint training phase, strong negative correspondences are constructed using the contradictory training set, while positive correspondences are constructed using the consistent region phrase set. After each round of the round scheduling sequence, the image encoding network weights, text encoding network weights, feature projection layer weights, matching degree output layer parameters, and parameter update operator states are retained and used as the initial states for the next round of consistent constraint joint training or contradictory constraint joint training. The training process ends when the loss value decreases by less than 10% of the initial decrease, and the image-text consistency model is output.

[0073] S4.4. Input the image-text set into the image-text consistency model entry by entry for consistency verification, and generate a consistency response sequence, specifically as follows: Image samples and text samples from the image-text set are input into the image-text consistency model. The image-text consistency model outputs the feature vector representation corresponding to the region features and the semantic vector representation corresponding to the phrase entries. Based on the matching degree calculation result, the consistency of the correspondence between the image samples and text samples is determined. When the matching degree is higher than the matching threshold, it is registered as a consistent response result. When the matching degree is lower than the matching threshold, it is registered as an inconsistent response result. The consistent response results and inconsistent response results are registered sequentially according to the original arrangement order in the image-text set to form a consistency response sequence that corresponds one-to-one with the image-text set.

[0074] S4.5. Locate phrase entries in the image-text set using the consistency response sequence, and perform phrase response parsing on the phrase entries to form a phrase response entry set, specifically: Based on the consistent response results in the consistency response sequence, the phrase entries at the corresponding positions in the image-text set are located and registered. A mapping relationship is established between the phrase index corresponding to the consistent response result and the phrase entries in the image-text set. For each phrase entry, the matching degree and consistency judgment result given by the image-text consistency model are output. The phrase entries, phrase indexes, matching degrees and consistency judgment results are structurally combined and registered to form a phrase response entry set containing four fields: phrase entry, phrase index, matching degree and consistency judgment result.

[0075] S4.6. Based on the phrase response entry set, extract region features from the image-text set, and organize the region features by region to form a region phrase set carrying region indexes, specifically: Based on the phrase index and matching degree in the phrase response entry set, the corresponding region features in the image-text set are located and associated. The phrase index in the phrase response entry set is mapped to the region index in the image-text set, and the region index is bound and registered with the corresponding region feature. When the same phrase index corresponds to multiple region indexes, they are arranged in descending order of matching degree and the region index with the highest matching degree is retained. The phrase index, the corresponding region index, and the region feature are structurally combined and registered so that each phrase index carries a clear region index identifier. All the phrase index and region index combinations that have been bound and registered are collected to form a region phrase set carrying region indexes.
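
The rule of retaining the region index with the highest matching degree for each phrase index can be sketched as a single pass over candidate triples. The tuple layout `(phrase_index, region_index, degree)` and the function name are hypothetical; on ties the sketch keeps the first candidate encountered, which the embodiment does not specify.

```python
def best_region_per_phrase(candidates):
    """Map each phrase index to the region index with the highest matching degree."""
    best = {}
    for phrase, region, degree in candidates:
        if phrase not in best or degree > best[phrase][1]:
            best[phrase] = (region, degree)
    return {phrase: region for phrase, (region, _) in best.items()}


bound = best_region_per_phrase([
    ("p1", "r1", 0.4),
    ("p1", "r2", 0.9),
    ("p2", "r1", 0.7),
])
```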

[0076] S4.7. Select region phrase entries with common region indexes from the region phrase set, and perform connection organization and relation inference to form a relation inference set, specifically as follows: Multiple phrase entries with the same region index are grouped into the same region index group. Within the same region index group, the phrase entries are connected and registered according to the order of their phrase indexes, forming a semantic association chain among multiple phrase entries sharing the same region index. After the connection registration is completed, the matching degrees and consistency judgment results of the phrase entries within the same region index group are analyzed. When multiple phrase entries are all registered as consistent response results in the consistency response sequence, it is inferred that there is a semantic parallel relationship between the phrase entries. When both consistent response results and inconsistent response results appear among the multiple phrase entries in the consistency response sequence, it is inferred that there is a semantic conflict relationship between the phrase entries. The semantic parallel relationships and the semantic conflict relationships are structured and registered to form a relation inference set.
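
The grouping and inference rule of S4.7 can be sketched as follows. The sketch assumes that consistency judgment results are available as booleans keyed by phrase index, that a group needs at least two phrase entries before a relationship is inferred, and that a group whose entries are all inconsistent yields no relationship, since the embodiment does not address that case; the function and label names are hypothetical.

```python
from collections import defaultdict


def infer_relations(region_phrases, responses):
    """Infer a relationship per region index group.

    region_phrases: iterable of (region_index, phrase_index) pairs.
    responses: phrase_index -> True (consistent) or False (inconsistent).
    """
    groups = defaultdict(list)
    for region, phrase in region_phrases:
        groups[region].append(phrase)
    relations = {}
    for region, phrases in groups.items():
        if len(phrases) < 2:
            continue  # a relationship needs at least two phrase entries
        # Connect entries in phrase-index order before analysing them.
        flags = [responses[p] for p in sorted(phrases)]
        if all(flags):
            relations[region] = "semantic parallel"
        elif any(flags):
            relations[region] = "semantic conflict"
    return relations


relations = infer_relations(
    [("r1", "p1"), ("r1", "p2"), ("r2", "p3"), ("r2", "p4"), ("r3", "p5")],
    {"p1": True, "p2": True, "p3": True, "p4": False, "p5": True},
)
```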

[0077] S4.8. Gather the phrase response entry set, the region phrase set, and the relation inference set to generate image semantic results.

[0078] This embodiment also provides a computer device suitable for an image understanding method for training artificial intelligence data, comprising: a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the image understanding method for training artificial intelligence data as proposed in the above embodiment.

[0079] The computer device can be a terminal, comprising a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.

[0080] This embodiment also provides a storage medium storing a computer program, which, when executed by a processor, implements the image understanding method for training artificial intelligence data as proposed in the above embodiments. The storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0081] In summary, this invention achieves enhanced consistency through bidirectional retrieval verification and alternating training. Based on bidirectional retrieval verification, conflict attribution and adversarial pairing processing are applied to phrases in inconsistent regions, allowing semantic contradiction sets to participate in the joint training process. This promotes a more stable cross-modal representation structure in the image-text consistency model under the alternating effects of consistency and contradiction constraints. The feature vector representations corresponding to region features and the semantic vector representations corresponding to phrase entries maintain a clear directional distinction within a unified dimensional space, explicitly characterizing semantic parallel and semantic conflict relationships. This enhances the semantic discrimination and scene understanding capabilities of image semantic results in computer vision tasks.

[0082] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. An image understanding method for training artificial intelligence data, characterized in that it comprises: obtaining image samples and text samples and pairing them to generate an image-text set; extracting region features and phrase entries from the image-text set and performing similarity ranking to generate a fine-grained alignment set; performing bidirectional retrieval verification on the fine-grained alignment set, and performing conflict attribution and contradiction intensity ranking on the inconsistent region phrases identified in the bidirectional retrieval verification; based on the contradiction intensity ranking, performing adversarial pairing processing on the inconsistent region phrases to generate a semantic contradiction set; alternately training on the semantic contradiction set and on the consistent region phrases identified in the bidirectional retrieval verification to generate an image-text consistency model; and using the image-text consistency model to perform phrase response parsing and relation inference on the image-text set to generate image semantic results.

2. The image understanding method for training artificial intelligence data as described in claim 1, characterized in that generating the image-text set specifically comprises: associating and organizing the image samples and text samples based on the sample identifiers in the image samples and text samples to generate a set of paired entries; expanding each paired entry in the set of paired entries and performing an identifier consistency check on the pairing identifiers in the paired entries to form a set of verified paired entries; and merging and organizing the set of verified paired entries to generate the image-text set.
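
The pairing steps of claim 2 can be illustrated with a minimal sketch. It is not the applicant's implementation: the dictionary layout, the field names `sample_id`, `pair_id`, `path`, and `caption`, and the function name `build_image_text_set` are all assumptions made for illustration.

```python
def build_image_text_set(image_samples, text_samples):
    """Pair samples by identifier, verify the pairing, then merge.

    Assumed (hypothetical) layout: every sample is a dict with a
    "sample_id" used for association and a "pair_id" used for the
    identifier consistency check.
    """
    # Step 1: associate image and text samples that share a sample identifier.
    texts_by_id = {}
    for text in text_samples:
        texts_by_id.setdefault(text["sample_id"], []).append(text)
    paired = [
        {"image": image, "text": text}
        for image in image_samples
        for text in texts_by_id.get(image["sample_id"], [])
    ]

    # Step 2: keep only entries whose pairing identifiers agree.
    verified = [
        entry for entry in paired
        if entry["image"]["pair_id"] == entry["text"]["pair_id"]
    ]

    # Step 3: merge duplicates and return the entries in a stable order.
    merged = {}
    for entry in verified:
        key = (
            entry["image"]["sample_id"],
            entry["image"]["path"],
            entry["text"]["caption"],
        )
        merged[key] = entry
    return [merged[key] for key in sorted(merged)]
```

In this sketch an image without a matching text sample, or a pair whose pairing identifiers disagree, is dropped rather than repaired; the claim does not say how such entries are handled.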

3. The image understanding method for training artificial intelligence data as described in claim 1, characterized in that generating the fine-grained alignment set specifically comprises: expanding each image-text pair in the image-text set one by one, and extracting region candidates from the image sample in each image-text pair to form a region candidate set; extracting region features from the region candidate set and aggregating them into a region feature set; segmenting the text sample in each image-text pair at phrase boundaries and adding phrase indexes to form a set of phrase entries; based on the set of phrase entries, performing phrase-guided matching on the region feature set to form associated index pairs between region indexes and phrase indexes; and performing similarity ranking on the region features and phrase entries in the associated index pairs, and aggregating the similarity ranking results to generate the fine-grained alignment set.

4. The image understanding method for training artificial intelligence data as described in claim 3, characterized in that the similarity ranking refers to the ordered arrangement of the region features and phrase entries in the associated index pairs according to their degree of matching.
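
The phrase segmentation and similarity ranking of claims 3 and 4 can be sketched as follows. Several simplifications are assumptions rather than part of the claims: region features and phrase vectors are taken as already computed (no detector or text encoder is shown), phrases are split at punctuation, every region-phrase combination is treated as an associated index pair, and cosine similarity stands in for the "degree of matching". The function names are hypothetical.

```python
import math
import re


def split_phrases(text):
    """Segment a caption at punctuation boundaries and attach phrase indexes."""
    parts = [part.strip() for part in re.split(r"[,;.]", text)]
    return list(enumerate(part for part in parts if part))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fine_grained_alignment(region_features, phrase_vectors):
    """Score every (region index, phrase index) pair and rank by similarity."""
    entries = [
        {"region_index": r, "phrase_index": p, "similarity": cosine(rv, pv)}
        for r, rv in enumerate(region_features)
        for p, pv in enumerate(phrase_vectors)
    ]
    # Highest degree of matching first; ties keep their original order.
    return sorted(entries, key=lambda e: e["similarity"], reverse=True)
```

A call such as `fine_grained_alignment(region_features, phrase_vectors)` returns alignment entries ordered from best to worst match, which is the form the later verification steps consume in this sketch.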

5. The image understanding method for training artificial intelligence data as described in claim 1, characterized in that generating the semantic contradiction set specifically comprises: extracting alignment entries from the fine-grained alignment set, and retrieving and organizing the alignment entries to form a bidirectional retrieval verification sequence; sorting the region features in the bidirectional retrieval verification sequence by image-to-text retrieval, and sorting the phrase entries in the bidirectional retrieval verification sequence by text-to-image retrieval, to form a bidirectional retrieval sorting sequence; based on the bidirectional retrieval sorting sequence, performing back-check verification on the region indexes and phrase indexes to generate bidirectional retrieval verification results; selecting, from the bidirectional retrieval verification results, the region-phrase indexes that fail the back-check verification to form an inconsistent region phrase set; based on the bidirectional retrieval sorting sequence and the bidirectional retrieval verification results, performing conflict attribution on the inconsistent region phrase set to form an inconsistent region phrase set with conflict type labels; sorting the inconsistent region phrases in the inconsistent region phrase set with conflict type labels by contradiction intensity to form a contradictory phrase set; performing adversarial recombination pairing on the region indexes and phrase indexes corresponding to the contradictory phrase set to form an adversarial pair set; and combining and organizing the adversarial pair set and the contradictory phrase set to generate the semantic contradiction set.
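
One possible reading of claim 5 is sketched below. The claim does not define the back-check, the contradiction intensity, or the recombination rule, so the sketch assumes: a pair passes the back-check only when region and phrase are each other's top retrieval result; intensity is the summed gap between the pair's score and the best score in each direction; and recombination pairs each contradictory region with the phrase of the next-ranked entry. The similarity matrix `sim[r][p]` and all function names are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def bidirectional_verify(sim, candidates):
    """Split candidate (region, phrase) pairs by a mutual top-1 back-check."""
    image_to_text = [argmax(row) for row in sim]
    text_to_image = [
        argmax([row[p] for row in sim]) for p in range(len(sim[0]))
    ]
    consistent, inconsistent = [], []
    for r, p in candidates:
        passed = image_to_text[r] == p and text_to_image[p] == r
        (consistent if passed else inconsistent).append((r, p))
    return consistent, inconsistent


def contradiction_intensity(sim, r, p):
    """Assumed measure: gap to the best score in both retrieval directions."""
    best_for_region = max(sim[r])
    best_for_phrase = max(row[p] for row in sim)
    return (best_for_region - sim[r][p]) + (best_for_phrase - sim[r][p])


def build_semantic_contradiction_set(sim, consistent, inconsistent):
    """Rank inconsistent pairs by intensity and add recombined hard negatives."""
    ranked = sorted(
        inconsistent,
        key=lambda pair: contradiction_intensity(sim, *pair),
        reverse=True,
    )
    adversarial = []
    for i, (region, _) in enumerate(ranked):
        _, other_phrase = ranked[(i + 1) % len(ranked)]
        candidate = (region, other_phrase)
        # Skip recombinations that are known matches or already listed.
        if (candidate not in consistent and candidate not in ranked
                and candidate not in adversarial):
            adversarial.append(candidate)
    return {"contradictory": ranked, "adversarial": adversarial}
```

Excluding recombined pairs that already passed verification is a design choice of this sketch: without it, swapping indexes between two mismatched pairs could reintroduce a correct match as a negative example.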

6. The image understanding method for training artificial intelligence data as described in claim 5, characterized in that forming the inconsistent region phrase set with conflict type labels specifically comprises: reading the region indexes and phrase indexes from the inconsistent region phrase set, and retrieving the corresponding image-to-text retrieval sorting positions and text-to-image retrieval sorting positions in the bidirectional retrieval sorting sequence to form sorting mapping entries; based on the sorting mapping entries and the bidirectional retrieval verification results, performing a joint comparison on the region indexes and phrase indexes to form mismatch judgment entries; distinguishing one-way mismatch and two-way mismatch states among the mismatch judgment entries to form conflict source entries; and, based on the conflict source entries, writing conflict type labels to the inconsistent region phrase set and outputting the inconsistent region phrase set with conflict type labels.
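
The conflict attribution of claim 6 can be sketched as a lookup of each pair's sorting position in both retrieval directions. The label strings, the rule that any position other than first counts as a mismatch, and the names `rank_position` and `attribute_conflicts` are assumptions made for illustration; `sim[r][p]` is again a hypothetical region-by-phrase similarity matrix.

```python
def rank_position(scores, index):
    """0-based position of `index` when scores are sorted in descending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(index)


def attribute_conflicts(sim, inconsistent):
    """Attach a conflict type label to every inconsistent (region, phrase) pair."""
    labeled = []
    for r, p in inconsistent:
        # Sorting mapping entry: where the pair sits in each retrieval direction.
        i2t_position = rank_position(sim[r], p)
        t2i_position = rank_position([row[p] for row in sim], r)

        # Mismatch judgment: the pair is not the top result in that direction.
        i2t_mismatch = i2t_position != 0
        t2i_mismatch = t2i_position != 0

        if i2t_mismatch and t2i_mismatch:
            conflict_type = "two_way"
        elif i2t_mismatch:
            conflict_type = "one_way_image_to_text"
        elif t2i_mismatch:
            conflict_type = "one_way_text_to_image"
        else:
            conflict_type = "none"

        labeled.append({
            "region_index": r,
            "phrase_index": p,
            "image_to_text_position": i2t_position,
            "text_to_image_position": t2i_position,
            "conflict_type": conflict_type,
        })
    return labeled
```

Keeping both sorting positions in the output lets a later step, such as contradiction intensity ranking, reuse them without recomputing the retrieval order.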

7. The image understanding method for training artificial intelligence data as described in claim 1, characterized in that generating the image-text consistency model specifically comprises: filtering the bidirectional retrieval verification results and extracting the region-phrase indexes that pass the back-check verification to form a consistent region phrase set; merging and organizing the contradictory phrase set and the adversarial pair set into a contradiction training set; arranging the consistent region phrase set and the contradiction training set in rounds to form a round scheduling sequence, and alternately scheduling the consistent region phrase set and the contradiction training set according to the round scheduling sequence; in the consistency phase of the round scheduling sequence, performing consistency-constraint joint training on the consistent region phrase set; in the contradiction phase of the round scheduling sequence, performing contradiction-constraint joint training on the contradiction training set; and sequentially inheriting the parameter states of the consistency-constraint joint training and the contradiction-constraint joint training to generate the image-text consistency model.
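
The round scheduling and parameter inheritance of claim 7 can be shown with a toy example. This is only a sketch under strong assumptions: the "model" is a table of unit vectors per region and per phrase, the consistency constraint is a gradient-ascent step on the dot product of a pair, and the contradiction constraint is the matching descent step. The claim does not specify the loss, the optimizer, or the model, and the names `train_phase` and `alternating_training` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_phase(params, pairs, sign, lr=0.1):
    """One phase: sign=+1 pulls paired vectors together, sign=-1 pushes apart.

    Returns a new parameter state so that the next phase inherits it.
    """
    regions = {k: list(v) for k, v in params["regions"].items()}
    phrases = {k: list(v) for k, v in params["phrases"].items()}
    for r, p in pairs:
        rv, pv = regions[r], phrases[p]
        regions[r] = normalize([a + sign * lr * b for a, b in zip(rv, pv)])
        phrases[p] = normalize([b + sign * lr * a for a, b in zip(rv, pv)])
    return {"regions": regions, "phrases": phrases}


def alternating_training(params, consistent, contradictory, rounds=20):
    """Alternate consistency and contradiction phases, passing parameters on."""
    schedule = [
        phase
        for _ in range(rounds)
        for phase in ("consistency", "contradiction")
    ]
    for phase in schedule:
        if phase == "consistency":
            params = train_phase(params, consistent, sign=+1)
        else:
            params = train_phase(params, contradictory, sign=-1)
    return params
```

With two regions and two phrases, calling `alternating_training` on matching pairs `[(0, 0), (1, 1)]` and contradictory pairs `[(0, 1), (1, 0)]` drives matching vectors toward the same direction and contradictory ones toward opposite directions, a small-scale picture of the directional distinction described in paragraph [0081].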

8. The image understanding method for training artificial intelligence data as described in claim 1, characterized in that generating the image semantic results specifically comprises: inputting the image-text set into the image-text consistency model entry by entry for consistency verification to generate a consistency response sequence; locating phrase entries in the image-text set using the consistency response sequence, and parsing the phrase responses to form a phrase response entry set; based on the phrase response entry set, extracting region features from the image-text set and organizing the region features by region orientation to form a region phrase set carrying region indexes; selecting region phrase entries with a common region index from the region phrase set, and connecting, organizing, and inferring relations among them to form a relation inference set; and aggregating the phrase response entry set, the region phrase set, and the relation inference set to generate the image semantic results.
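
The aggregation in claim 8 can be sketched once the model's responses are available. The sketch assumes the consistency response sequence is a list of `(region_index, phrase, score)` tuples, that a fixed score threshold decides which phrases respond, and that "relation inference" is reduced to linking phrases that share a region index. None of these choices comes from the claim, and `image_semantic_result` is a hypothetical name.

```python
from collections import defaultdict


def image_semantic_result(responses, threshold=0.5):
    """Aggregate phrase responses, region phrases, and inferred relations.

    `responses` is an assumed format: (region_index, phrase, score) tuples
    produced by some consistency model that is not shown here.
    """
    # Phrase response entries: phrases the model accepts for their region.
    phrase_responses = [
        (region, phrase, score)
        for region, phrase, score in responses
        if score >= threshold
    ]

    # Region phrase set: accepted phrases grouped by region index.
    region_phrases = defaultdict(list)
    for region, phrase, _ in phrase_responses:
        region_phrases[region].append(phrase)

    # Relation inference set: link phrases that share a region index.
    relations = [
        (region, phrases[i], phrases[j])
        for region, phrases in sorted(region_phrases.items())
        for i in range(len(phrases))
        for j in range(i + 1, len(phrases))
    ]

    return {
        "phrase_responses": phrase_responses,
        "region_phrases": dict(region_phrases),
        "relations": relations,
    }
```

A real relation inference step would presumably assign a relation type to each link; the sketch only records which phrases co-occur on the same region.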

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the image understanding method for training artificial intelligence data as described in any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the image understanding method for training artificial intelligence data as described in any one of claims 1 to 8.