A multi-modal data label generation method and system

By performing modality-specific deconstruction and constructing cross-modal projection channels for multimodal data, the problems of semantic ambiguity and conceptual conflict in multimodal data label generation are solved, achieving accurate and consistent label generation.

CN122045899B · Active · Publication Date: 2026-06-26 · ITKC TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ITKC TECH CO LTD
Filing Date: 2026-04-02
Publication Date: 2026-06-26


Abstract

The application discloses a multimodal data label generation method and system, relating to the technical field of multimodal data processing. The method includes: acquiring original multimodal data samples containing image, text, and audio data; performing modality-specific deconstruction on each data stream, extracting the corresponding unit sets, and constructing single-modality concept graphs; establishing cross-modal projection channels in a modal fusion layer to realize interactive mapping of the nodes and edges of the concept graphs and form a cross-modal associated concept graph; extracting consensus concept nodes and arbitrating conflicting concept nodes to generate a unified concept candidate node pool; performing label distillation on the candidate node pool to obtain a preliminary multimodal label set; and processing that set with a label normalization template to generate standardized multimodal data labels. The method enables accurate labeling of multimodal data and addresses problems in the prior art such as semantic ambiguity and conflict among labels.

Description

Technical Field

[0001] This invention belongs to the field of multimodal data processing technology, and specifically relates to a method and system for generating multimodal data labels.

Background Technology

[0002] Multimodal data labeling is a crucial component of multimodal data processing, and its quality directly impacts the effectiveness of subsequent applications. Current technologies, for raw multimodal data samples containing image data streams, text description streams, and audio signal streams, often involve directly concatenating or shallowly fusing the features of each modality, and then generating corresponding labels based on the fused features. Some technologies perform simple feature extraction for a single modality, but they lack dedicated deep structural deconstruction and semantic / acoustic refinement for each modality's characteristics, and they fail to construct independent conceptual representations for each modality, thus failing to fully uncover the core features and inherent semantic relationships of each modality.

[0003] Existing technologies lack effective cross-modal interaction mapping mechanisms during feature fusion and therefore fail to achieve deep association and collaboration between concepts across different modalities. This easily leads to conflicts between modal concepts and inconsistent semantic expressions. Furthermore, without dedicated deconstruction and concept graph construction for each modality, the extracted features lack clear semantic direction, so the generated labels suffer from semantic ambiguity, hierarchical confusion, and a poor match with the core information of the original multimodal data. A technical solution is needed that can perform dedicated deep deconstruction for each modality, construct independent concept expression carriers, achieve deep association between modal concepts, resolve concept conflicts, and generate accurate and standardized labels, thereby addressing the shortcomings of existing technologies and meeting the practical needs of multimodal data label generation.

Summary of the Invention

[0004] This invention aims to solve at least one of the technical problems existing in the prior art;

[0005] Therefore, this invention proposes a method for generating multimodal data tags, comprising:

[0006] Collect raw multimodal data samples, which include at least image data streams, text description streams, and audio signal streams;

[0007] The image data stream is structurally deconstructed to extract a set of basic visual units; the text description stream is partitioned by semantic granularity to extract a set of text semantic units; and the audio signal stream undergoes acoustic feature separation to extract a set of audio feature units.

[0008] The set of basic visual units is input into a spatial association network to construct a visual concept graph; the set of text semantic units is input into a semantic dependency parser to construct a text concept graph; and the set of audio feature units is input into an acoustic event builder to construct an audio concept graph.

[0009] In the modal fusion layer, a cross-modal projection channel is established between the visual concept graph, text concept graph, and audio concept graph. The interaction mapping of nodes and edges is performed through the cross-modal projection channel to form a cross-modal associated concept graph.

[0010] Based on the cross-modal related concept graph, consensus concept nodes are extracted and conflicting concept nodes are arbitrated to generate a unified pool of concept candidate nodes;

[0011] Tag distillation is performed on the unified concept candidate node pool to transform each concept candidate node into a tag entry with clear semantics, generating a preliminary multimodal tag set;

[0012] The format of the initial multimodal tag set is reorganized and the hierarchy is verified using a preset tag normalization template to generate normalized multimodal data tags.

[0013] Furthermore, structurally deconstructing the image data stream to extract a set of basic visual units includes:

[0014] A multi-scale region proposal network is applied to the image data stream to generate image region proposal boxes at different spatial scales and locations.

[0015] Deep convolution feature extraction is performed on the image content within each proposed image region to obtain the region visual feature vector;

[0016] Cluster analysis and visual vocabulary selection are performed on the region visual feature vectors to form a representative set of basic visual units. Each basic visual unit is represented by a visual vocabulary entry and its corresponding feature vector.

[0017] Furthermore, establishing a cross-modal projection channel between the visual concept graph, text concept graph, and audio concept graph in the modal fusion layer includes:

[0018] A common latent space is established in the modal fusion layer, and a visual projector from the visual feature space to the common latent space, a text projector from the text feature space to the common latent space, and an audio projector from the audio feature space to the common latent space are constructed respectively.

[0019] The visual projector is used to project the node feature vectors in the visual concept graph to the common latent space to obtain visual latent nodes. The text projector is used to project the node feature vectors in the text concept graph to the common latent space to obtain text latent nodes. The audio projector is used to project the node feature vectors in the audio concept graph to the common latent space to obtain audio latent nodes.

[0020] Calculate the similarity between the visual latent nodes, text latent nodes, and audio latent nodes in the common latent space. If the similarity exceeds a set threshold, establish a cross-modal projection channel between the corresponding original concept graph nodes.

[0021] Furthermore, the step of extracting consensus concept nodes and arbitrating conflicting concept nodes based on the cross-modal related concept graph includes:

[0022] In the cross-modal association concept graph, consensus concept node clusters are identified. These clusters consist of concept nodes from different modalities that are closely connected through cross-modal projection channels.

[0023] Calculate the internal consistency score of consensus concept node clusters, and add the entire consensus concept node clusters whose internal consistency scores exceed the consensus threshold as candidate nodes to the candidate node pool.

[0024] In the cross-modal association concept graph, conflicting concept node clusters are identified. These clusters consist of different modal concept nodes that point to similar semantics but contradict each other in key features.

[0025] The conflicting concept node cluster is submitted to the modal arbitrator, which arbitrates based on the confidence score and context support of each modal node, selects a modal concept node as a representative node, and adds the selected representative node to the candidate node pool.

[0026] Furthermore, the modal arbitrator arbitrates based on the confidence score and contextual support of each modal node, including:

[0027] For each modality of the concept node in the conflicting concept node cluster, a modal confidence score is calculated. The modal confidence score is determined based on the centrality and path clarity of the concept node in its original concept graph.

[0028] Calculate context support for each modality of concept node in the conflicting concept node cluster. The context support is determined based on the number and strength of non-conflicting concept nodes associated with the concept node in the cross-modal association concept graph.

[0029] The modal confidence score and context support of each modality's concept node are weighted and fused to obtain the modal arbitration comprehensive score;

[0030] The concept node of the mode with the highest modal arbitration comprehensive score is selected as the representative node that wins the arbitration.
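The arbitration steps above can be sketched as follows. This is only an illustrative sketch: the text does not give the fusion weights or how centrality and path clarity are combined, so the weights `w_conf` and `w_ctx`, the simple averaging in `modal_confidence`, and all input values are assumptions, and every score is assumed to be already normalised to [0, 1].

```python
def modal_confidence(centrality, path_clarity):
    """Assumed combination: plain average of the two graph-derived measures."""
    return 0.5 * (centrality + path_clarity)


def arbitrate(cluster, w_conf=0.6, w_ctx=0.4):
    """Pick the representative node of a conflicting concept node cluster.

    cluster: list of dicts with keys 'modality', 'confidence' and
    'context_support'. The weights are hypothetical, not from the source.
    """
    def comprehensive_score(node):
        # Weighted fusion of modal confidence and context support.
        return w_conf * node["confidence"] + w_ctx * node["context_support"]

    # The node with the highest fused score wins the arbitration.
    return max(cluster, key=comprehensive_score)


conflict_cluster = [
    {"modality": "visual", "confidence": modal_confidence(1.0, 0.8), "context_support": 0.2},
    {"modality": "text", "confidence": modal_confidence(0.7, 0.5), "context_support": 0.9},
    {"modality": "audio", "confidence": modal_confidence(0.3, 0.3), "context_support": 0.3},
]
winner = arbitrate(conflict_cluster)
```

In this toy cluster the text node wins because its strong context support outweighs the visual node's higher confidence; changing the weights shifts that balance.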

[0031] Furthermore, a label distillation process is performed on the unified pool of concept candidate nodes, transforming each concept candidate node into a label entry with explicit semantics, including:

[0032] For each concept candidate node in the unified concept candidate node pool, a preset multimodal knowledge graph is retrieved;

[0033] In the multimodal knowledge graph, one or more entity concepts that are closest to the feature vector of the concept candidate node are found and used as semantic anchors;

[0034] Using the semantic anchor as the core, and combining the contextual association of the concept candidate nodes in the original multimodal data samples, one or more natural language words that are most appropriate are selected from the preset label vocabulary to form the initial label description.

[0035] Redundant words are removed and synonyms are merged from the initial tag descriptions to form tag entries with clear semantics.

[0036] Furthermore, the step of using a preset tag normalization template to perform format reorganization and hierarchical verification on the preliminary multimodal tag set includes:

[0037] Load a preset tag hierarchy template, which defines the structure of the tag's main category, subcategories, and attribute description fields;

[0038] Each tag entry in the initial multimodal tag set is semantically matched with the tag hierarchy architecture template to determine the main category, subcategory, and required attributes of each tag entry in the tag hierarchy architecture template.

[0039] Based on the attribute filling rules of the tag entries under their respective main and subcategories, the natural language descriptions of the tag entries are restructured and filled into the corresponding attribute description fields.

[0040] For the set of tags that have been restructured, verify the logical mutual exclusion and integrity of tags at the same level, and perform semantic fine-tuning or reclassification of tags with logical conflicts.
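As a rough illustration of the template steps above, the sketch below loads a small hierarchy template, assigns each tag entry to a main category and subcategory, fills the attribute description fields, and reports required attributes that are still missing. The template contents, the `CATEGORY_OF` lookup (a stand-in for the semantic matching, whose actual method the text does not specify), and the field names are all hypothetical.

```python
# Hypothetical hierarchy: main category -> subcategory -> required attributes.
TEMPLATE = {
    "object": {
        "animal": ["name", "action"],
        "vehicle": ["name", "state"],
    },
}

# Stand-in for semantic matching: a direct lookup from tag name to category.
CATEGORY_OF = {
    "dog": ("object", "animal"),
    "car": ("object", "vehicle"),
}


def normalize(entry):
    """Reorganise one tag entry into the template's structure."""
    main, sub = CATEGORY_OF[entry["name"]]
    required = TEMPLATE[main][sub]
    # Fill the attribute description fields defined for this subcategory.
    fields = {attr: entry.get(attr) for attr in required}
    # Integrity check: list required attributes that have no value.
    missing = [attr for attr, value in fields.items() if value is None]
    return {"main": main, "sub": sub, "attributes": fields, "missing": missing}


dog_tag = normalize({"name": "dog", "action": "barking"})
car_tag = normalize({"name": "car"})
```

An entry with a non-empty `missing` list would be a candidate for the fine-tuning or reclassification mentioned in the last step.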

[0041] Furthermore, the step of verifying the logical mutual exclusion and integrity of tags at the same level within the set of tags that have undergone structured reorganization includes:

[0042] Under the same subcategory of the tag hierarchy architecture template, extract the core semantic feature vectors of all tag entries;

[0043] Calculate the pairwise semantic distance between the core semantic feature vectors;

[0044] Tag pairs with a semantic distance less than the mutual exclusion threshold are identified as tag pairs with potential logical conflicts.

[0045] For tag pairs with potential logical conflicts, the description of the attribute description field of one of the tag entries is refined or modified according to its original context support or manually predefined conflict resolution rules in order to increase the semantic distance and eliminate the logical conflict.
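The pairwise check described above can be sketched as below. The text does not name the distance metric, so cosine distance is assumed here; the tag names, vectors, and threshold value are made up for illustration.

```python
from itertools import combinations

import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def find_potential_conflicts(tag_vectors, threshold):
    """Return tag pairs in one subcategory whose distance is below threshold."""
    return [
        (first, second)
        for first, second in combinations(sorted(tag_vectors), 2)
        if cosine_distance(tag_vectors[first], tag_vectors[second]) < threshold
    ]


# Hypothetical core semantic feature vectors of tags in one subcategory.
tags = {
    "barking dog": [1.0, 0.10],
    "dog barking loudly": [1.0, 0.12],
    "red car": [0.0, 1.0],
}
conflicts = find_potential_conflicts(tags, threshold=0.05)
```

Each returned pair would then go through the refinement step, which rewrites one entry's attribute description until the distance rises above the threshold.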

[0046] Furthermore, the acquisition of raw multimodal data samples includes synchronous acquisition and asynchronous alignment processing:

[0047] The image data stream, text description stream, and audio signal stream of the target scene are captured synchronously by a sensor array, and a unified timestamp sequence is recorded.

[0048] For data segment misalignment between streams caused by transmission delay or processing rate differences, the unified timestamp sequence is used as a reference, and a time window sliding alignment algorithm is adopted to divide the data streams of different modalities into multiple time-aligned multimodal data frames. Each multimodal data frame contains image data segments, text description segments, and audio signal segments within the same time window, which together constitute the original multimodal data sample.

[0049] Furthermore, the present invention also includes a multimodal data tag generation system, the system including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor, when executing the computer program, implements the steps of the multimodal data tag generation method described above.

[0050] Compared with the prior art, the beneficial effects of the present invention are:

[0051] Modality-specific deconstruction is performed on image data streams, text description streams, and audio signal streams respectively. For image data streams, structural deconstruction extracts a set of basic visual units, which is then input into a spatial association network to construct a visual concept graph. For text description streams, semantic granularity partitioning extracts a set of text semantic units, which is then input into a semantic dependency parser to construct a text concept graph. For audio signal streams, acoustic feature separation extracts a set of audio feature units, which is then input into an acoustic event builder to construct an audio concept graph. This approach captures the core features and semantic relationships of each modality and avoids feature confusion between different modalities, so that the conceptual expression of each modality is more targeted and precise. It addresses the feature redundancy and semantic ambiguity caused by the lack of modality-specific deconstruction and concept graph construction in conventional techniques, and thus provides a more solid foundation for label generation.

[0052] In the modal fusion layer, a cross-modal projection channel is established between the visual concept graph, text concept graph, and audio concept graph. This channel enables interactive mapping of nodes and edges among the three concept graphs, forming a cross-modal associated concept graph. Based on this graph, consensus concept nodes are extracted and conflicting concept nodes are arbitrated, generating a unified pool of candidate concept nodes. This approach enables deep interaction and collaboration between concepts from different modalities, resolves concept conflicts, and keeps the candidate concept nodes uniform and consistent. It addresses the label conflicts and semantic inconsistencies caused by simple fusion or filtering in conventional techniques, making the generated labels more closely aligned with the overall semantics of the multimodal data.

Attached Figure Description

[0053] Figure 1 is a flowchart illustrating the steps of a multimodal data tag generation method according to the present invention;

[0054] Figure 2 is a flowchart of the structural deconstruction of the image data stream;

[0055] Figure 3 is a flowchart for establishing a cross-modal projection channel;

[0056] Figure 4 is a heatmap of the similarity of multimodal feature vectors;

[0057] Figure 5 is a comparison chart showing the number of conflicts resolved and the success rate of conflict labeling using manual rules versus contextual support.

Detailed Implementation

[0058] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0059] See Figure 1. Raw multimodal data samples are collected, which include at least image data streams, text description streams, and audio signal streams. The image data streams are structurally deconstructed to extract a set of basic visual units; semantic granularity partitioning is performed on the text description streams to extract a set of text semantic units; and acoustic feature separation is performed on the audio signal streams to extract a set of audio feature units. The set of basic visual units is input into a spatial association network to construct a visual concept graph, the set of text semantic units is input into a semantic dependency parser to construct a text concept graph, and the set of audio feature units is input into an acoustic event builder to construct an audio concept graph. In the modal fusion layer, a cross-modal projection channel is established between the visual concept graph, text concept graph, and audio concept graph, and interactive mapping of nodes and edges is performed through this channel to form a cross-modal associated concept graph. Based on this graph, consensus concept nodes are extracted and conflicting concept nodes are arbitrated, generating a unified pool of candidate concept nodes. Label distillation is then performed on this pool to transform each candidate node into a label entry with clear semantics, generating a preliminary multimodal label set. Finally, a preset label normalization template is used to reorganize the format and perform hierarchical verification on the preliminary multimodal label set, generating standardized multimodal data labels.

[0060] See Figure 2. In one embodiment of the present invention, the image data stream is structurally deconstructed to extract a set of basic visual units. Specifically, a multi-scale region proposal network is applied to the image data stream to generate image region proposal boxes at different spatial scales and locations. These proposal boxes cover all regions in the image data stream containing valid visual information. The spatial scale and location of the proposal boxes are determined based on the resolution of the image data stream and the target distribution. In another embodiment, deep convolutional feature extraction is performed on the image content within each proposal box to obtain a region visual feature vector. This vector is obtained by layer-by-layer abstraction and feature mapping of pixel information within the proposal box using a multi-layer convolutional neural network, and it can characterize the texture, shape, color, and spatial distribution information of the image content within the proposal box. Furthermore, cluster analysis and visual vocabulary selection are performed on the region visual feature vectors to form a representative set of basic visual units. Each basic visual unit is represented by a visual vocabulary entry and its corresponding feature vector. The visual vocabulary entry is obtained by semantically mapping the region visual feature vector corresponding to the cluster center, maintaining a one-to-one correspondence between the visual vocabulary entry and the region visual feature vector.
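The clustering step can be illustrated with a minimal sketch, assuming the region visual feature vectors have already been extracted. The text does not name the clustering algorithm, so plain k-means is used here as a stand-in; the `vw_0`-style identifiers are hypothetical placeholders for visual vocabulary entries, which the text says come from a semantic mapping that is not shown.

```python
import numpy as np


def build_visual_units(region_features, k, n_iter=20, seed=0):
    """Cluster region feature vectors and return one visual unit per cluster."""
    x = np.asarray(region_features, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct region vectors chosen at random.
    centres = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every region vector to every centre: shape (n, k).
        dists = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # an empty cluster keeps its previous centre
                centres[j] = members.mean(axis=0)
    # Represent each cluster by the real region vector nearest its centre.
    dists = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
    units = []
    for j in range(k):
        nearest = int(dists[:, j].argmin())
        units.append({"word": f"vw_{j}", "feature": x[nearest]})
    return units


# Toy 2-d "region features" forming two well-separated groups.
regions = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
units = build_visual_units(regions, k=2)
```

Picking the input vector nearest each centre, rather than the centre itself, keeps the one-to-one link between a vocabulary entry and an actual region feature vector.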

[0061] In some embodiments, acquiring raw multimodal data samples includes synchronous acquisition and asynchronous alignment processing. Image data streams, text description streams, and audio signal streams of the target scene are simultaneously captured by a sensor array, and a unified timestamp sequence is recorded. This unified timestamp sequence is generated by a unified clock source and synchronously marks the start and end times of acquisition for the image data stream, text description stream, and audio signal stream. It is understood that the unified timestamp sequence can eliminate clock skew between different acquisition devices and provides a unified time reference for the image data stream, text description stream, and audio signal stream.

[0062] Optionally, for data segment misalignment between streams caused by transmission delay or processing rate differences, the unified timestamp sequence is used as a reference and a time window sliding alignment algorithm divides the data streams of different modalities into multiple time-aligned multimodal data frames. The algorithm slides the window along the unified timestamp sequence with a fixed step size, and the length of the time window is set according to the sampling frequency of the multimodal data and the length of the data frame.

[0063] [Formula not reproduced in the source text]

[0064] Where the terms of the formula denote, in order: the effective duration of a multimodal data frame; the total duration of the time window; and the sliding step size of the time window.

[0065] It is understandable that the time window sliding alignment algorithm can match data segments within the same time interval in different modal data streams, and can correct the misalignment deviation of data segments in the time dimension between streams. In some embodiments, each multimodal data frame contains image data segments, text description segments, and audio signal segments within the same time window, which together constitute an original multimodal data sample. The image data segments, text description segments, and audio signal segments are in overlapping time intervals on a unified timestamp sequence, and the image data segments, text description segments, and audio signal segments together reflect all the information of the target scene within the same time segment.

[0066] Optionally, after time alignment is completed, multimodal data frames are stored and processed independently. As the smallest processing unit for subsequent structure deconstruction, semantic granularity partitioning and acoustic feature separation operations, the partitioning process of multimodal data frames does not change the original data content of image data stream, text description stream and audio signal stream.
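A minimal sketch of the time-window sliding alignment follows, assuming each stream is a list of `(timestamp, payload)` pairs stamped against the unified clock. The `MultimodalFrame` class, the function name, and the sample data are hypothetical; the formula relating window length, step size, and effective duration is not reproduced in the text and is not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class MultimodalFrame:
    start: float
    end: float
    image: list = field(default_factory=list)
    text: list = field(default_factory=list)
    audio: list = field(default_factory=list)


def align_streams(streams, window, step, t_start, t_end):
    """Slide a fixed-length window over the unified timeline.

    streams: {"image": [(ts, payload), ...], "text": [...], "audio": [...]}
    Returns one MultimodalFrame per window position.
    """
    frames = []
    index = 0
    while t_start + index * step < t_end:
        start = t_start + index * step
        frame = MultimodalFrame(start=start, end=start + window)
        for modality, samples in streams.items():
            # Keep the samples whose timestamp falls inside [start, end).
            segment = [p for ts, p in samples if start <= ts < start + window]
            getattr(frame, modality).extend(segment)
        frames.append(frame)
        index += 1
    return frames


streams = {
    "image": [(0.1, "img0"), (1.2, "img1")],
    "text": [(0.5, "caption0")],
    "audio": [(0.0, "a0"), (1.0, "a1"), (2.0, "a2")],
}
frames = align_streams(streams, window=1.0, step=1.0, t_start=0.0, t_end=3.0)
```

The payloads are only grouped, never modified, which matches the statement that partitioning does not change the original data content. Choosing `window` larger than `step` would produce overlapping frames.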

[0067] See Figure 3. In one embodiment of the present invention, a cross-modal projection channel is established between the visual concept graph, text concept graph, and audio concept graph in the modal fusion layer. In a specific implementation, a common latent space is established in the modal fusion layer. The common latent space has its own feature dimensions and vector distribution rules, and it carries the unified feature expression obtained after projecting the visual concept graph, text concept graph, and audio concept graph. A visual projector from the visual feature space to the common latent space, a text projector from the text feature space to the common latent space, and an audio projector from the audio feature space to the common latent space are constructed respectively. The three projectors are all built using a fully connected neural network structure, and each is adapted to the dimensions and distribution characteristics of its own feature space.

[0068] [Formula not reproduced in the source text]

[0069] Where the terms of the formula denote, in order: the comprehensive projection feature in the common latent space; the original node feature vector output by the visual concept graph; the weight matrix of the visual projector; the original node feature vector output by the text concept graph; the weight matrix of the text projector; the original node feature vector output by the audio concept graph; and the weight matrix of the audio projector.

[0070] In specific implementations, a visual projector is used to project the node feature vectors in the visual concept graph onto a common latent space to obtain visual latent nodes. The visual latent nodes retain the semantic and structural information of the node feature vectors in the visual concept graph. In specific implementations, a text projector is used to project the node feature vectors in the text concept graph onto a common latent space to obtain text latent nodes. The text latent nodes retain the semantic and dependency information of the node feature vectors in the text concept graph. In specific implementations, an audio projector is used to project the node feature vectors in the audio concept graph onto a common latent space to obtain audio latent nodes. The audio latent nodes retain the acoustic event information and temporal correlation information of the node feature vectors in the audio concept graph.

[0071] It will be understood that visual latent nodes, text latent nodes, and audio latent nodes follow a unified vector metric in the common latent space, so numerical distance and similarity calculations can be performed on them directly. It will also be understood that the common latent space eliminates the dimensional and distributional differences between the visual feature space, text feature space, and audio feature space, and enables direct interaction of different modal features within the same vector space.

[0072] In some embodiments, the similarity between visual latent nodes, text latent nodes, and audio latent nodes in the common latent space is calculated. The similarity is measured by the cosine similarity between vectors in the common latent space, and its value directly reflects how tightly latent nodes of different modalities are semantically associated. In some embodiments, a cross-modal projection channel is established between the corresponding original concept graph nodes when the similarity exceeds a set threshold. The cross-modal projection channel directly connects the corresponding entity nodes in the visual concept graph, text concept graph, and audio concept graph, and it records the correspondence and mapping path between concept nodes of different modalities.
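The projection and thresholding steps can be sketched as below. For brevity each projector is reduced to a single linear layer with a hand-written weight matrix, which is an assumption: the text describes fully connected networks whose architecture and training are not specified. The feature values and the threshold are made up, and node feature vectors are assumed to be non-zero after projection.

```python
import numpy as np


def project(features, weight):
    """Map (n, d_in) node features into the common latent space (n, d_latent)."""
    return np.asarray(features, dtype=float) @ np.asarray(weight, dtype=float)


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def build_channels(latent_by_modality, threshold):
    """Link node pairs from different modalities whose similarity exceeds threshold.

    Returns tuples (modality_a, node_index_a, modality_b, node_index_b, similarity).
    """
    names = sorted(latent_by_modality)
    channels = []
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            first, second = names[x], names[y]
            sims = cosine_similarity(latent_by_modality[first], latent_by_modality[second])
            for i, j in zip(*np.where(sims > threshold)):
                channels.append((first, int(i), second, int(j), float(sims[i, j])))
    return channels


# Hypothetical projectors: 3-d visual, 2-d text and 1-d audio features -> 2-d latent.
w_visual = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
w_text = np.eye(2)
w_audio = np.array([[0.0, 1.0]])

latent = {
    "visual": project([[1, 0, 5], [0, 1, 5]], w_visual),
    "text": project([[1.0, 0.1]], w_text),
    "audio": project([[2.0]], w_audio),
}
channels = build_channels(latent, threshold=0.8)
```

Each returned tuple refers back to node indices in the original concept graphs, which is what lets a channel record the correspondence between nodes rather than between latent vectors.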

[0073] Optionally, the visual projector, text projector, and audio projector maintain independent parameter updates during the projection process. The update process of the visual projector, text projector, and audio projector is adjusted according to the similarity distribution in the common latent space. Optionally, the cross-modal projection channel maintains a fixed node correspondence after its establishment. The cross-modal projection channel continues to carry the information transmission between nodes and edges in the subsequent cross-modal interaction mapping process.

[0074] In one embodiment of the present invention, consensus concept node extraction and conflict concept node arbitration are performed based on a cross-modal associated concept graph. In a specific implementation, consensus concept node clusters are identified in the cross-modal associated concept graph. The consensus concept node cluster is composed of concept nodes from different modalities that are closely connected through cross-modal projection channels. The consensus concept node cluster presents a continuous and dense node connection state in the cross-modal associated concept graph. The visual concept graph nodes, text concept graph nodes, and audio concept graph nodes within the consensus concept node cluster maintain a consistent semantic orientation.

[0075] In practice, the internal consistency score of the consensus concept node cluster is calculated. The internal consistency score is determined by the connection density of the nodes and the semantic overlap of the nodes within the consensus concept node cluster. In practice, the consensus concept node clusters whose internal consistency scores exceed the consensus threshold are regarded as candidate nodes and added to the candidate node pool. The candidate node pool is used to carry the unified concept nodes that have passed the consistency verification. The nodes in the candidate node pool are directly used for subsequent label distillation operations.
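One way to picture the internal consistency score is sketched below. The text says only that the score depends on connection density and semantic overlap; the equal-weight blend `alpha=0.5`, the use of channel similarity as the overlap measure, and the node names are assumptions made for illustration.

```python
from itertools import combinations


def consistency_score(nodes, channels, alpha=0.5):
    """Blend connection density and mean channel similarity for one cluster.

    nodes: list of node ids; channels: {frozenset({a, b}): similarity}.
    """
    pairs = [frozenset(p) for p in combinations(nodes, 2)]
    if not pairs:
        return 0.0
    sims = [channels[p] for p in pairs if p in channels]
    # Density: fraction of node pairs actually linked by a projection channel.
    density = len(sims) / len(pairs)
    # Overlap: average similarity along the existing channels.
    overlap = sum(sims) / len(sims) if sims else 0.0
    return alpha * density + (1 - alpha) * overlap


def select_consensus(clusters, channels, threshold):
    """Keep whole clusters whose score exceeds the consensus threshold."""
    return [c for c in clusters if consistency_score(c, channels) > threshold]


channels = {
    frozenset({"v:dog", "t:dog"}): 0.9,
    frozenset({"v:dog", "a:bark"}): 0.8,
    frozenset({"t:dog", "a:bark"}): 0.7,
    frozenset({"v:car", "t:truck"}): 0.4,
}
clusters = [["v:dog", "t:dog", "a:bark"], ["v:car", "t:truck", "a:horn"]]
candidate_pool = select_consensus(clusters, channels, threshold=0.6)
```

The first cluster is fully connected with high similarity and is admitted as a whole; the second has a single weak channel and falls below the threshold.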

[0076] In practice, conflicting concept node clusters are identified in the cross-modal association concept graph. These clusters consist of different modal concept nodes that point to similar semantics but contradict each other in key features. Within each cluster, the different modal concept nodes maintain a correspondence in semantic core, but they also have contradictory expressions in detailed attributes. In practice, the conflicting concept node clusters are submitted to a modal arbitrator. The arbitrator arbitrates based on the confidence score and contextual support of each modal node. The arbitrator independently analyzes and determines the conflicting concept node clusters and selects a modal concept node as a representative node, adding the selected representative node to the candidate node pool.

[0077] It is understandable that the modal arbitrator maintains independent judgment logic when performing arbitration operations. The modal arbitrator completes the fusion calculation of modal confidence score and context support according to the preset calculation rules. It is also understandable that the representative node is directly included in the candidate node pool after being judged by the modal arbitrator. The representative node replaces all contradictory nodes in the conflict concept node cluster to participate in the subsequent label generation process.

[0078] In some embodiments, a modal confidence score is calculated for the concept node of each modality in the conflicting concept node cluster. The modal confidence score is determined based on the centrality and path clarity of the concept node in its original concept graph (the visual concept graph, text concept graph, or audio concept graph). The centrality reflects the structural importance of the concept node in the corresponding concept graph, and the path clarity reflects the completeness of the connection paths between the concept node and its associated nodes. In some embodiments, a context support score is calculated for the concept node of each modality in the conflicting concept node cluster. The context support score is determined based on the number and strength of the non-conflicting concept nodes associated with the concept node in the cross-modal associated concept graph. The number of associated non-conflicting concept nodes reflects the context coverage of the conflicting concept node, and the strength of those associations reflects how strongly the context supports it.

[0079] In one embodiment of the present invention, a label distillation is performed on a unified pool of concept candidate nodes, transforming each concept candidate node into a label entry with a clear semantic meaning. In a specific implementation, for each concept candidate node in the unified pool of concept candidate nodes, a preset multimodal knowledge graph is retrieved. The multimodal knowledge graph stores entity concepts, feature vectors, and semantic relationships. The multimodal knowledge graph covers all entity concept content corresponding to visual concepts, text concepts, and audio concepts. In a specific implementation, one or more entity concepts that are closest to the feature vector of the concept candidate node are found in the multimodal knowledge graph and used as semantic anchors. The semantic anchors directly correspond to entity concepts with definite semantic meaning in the multimodal knowledge graph, and the semantic anchors and concept candidate nodes maintain a minimum distance distribution in the feature space.

[0080] In practice, with the semantic anchors as the core and in combination with the contextual associations of the concept candidate node in the original multimodal data sample, the one or more most appropriate natural language words are selected from the preset label vocabulary to form the initial label expression. Each natural language word in the label vocabulary corresponds to a fixed semantic reference, and the words in the label vocabulary maintain a one-to-one mapping with the entity concepts in the multimodal knowledge graph. Redundant words are then removed and synonyms are merged in the initial label expression to form label entries with clear semantics. Redundant word removal deletes natural language words that appear repeatedly in the initial label expression; synonym merging integrates natural language words that are expressed differently but have the same meaning into a single unified label expression.
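Redundant word removal and synonym merging can be illustrated in a few lines. The synonym table and the function name `clean_label_expression` are assumptions for this sketch; the text does not specify how synonyms are detected.

```python
from typing import Dict, List


def clean_label_expression(words: List[str], synonym_map: Dict[str, str]) -> List[str]:
    seen = set()
    cleaned = []
    for word in words:
        # Synonym merging: map each word to its canonical form, if it has one.
        canonical = synonym_map.get(word, word)
        # Redundant word removal: keep only the first occurrence, preserving order.
        if canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)
    return cleaned


# Hypothetical synonym table mapping variants to a unified expression.
synonyms = {"puppy": "dog", "hound": "dog"}
initial_expression = ["puppy", "dog", "barking", "dog", "park"]
label_entry = clean_label_expression(initial_expression, synonyms)
print(label_entry)
```

Handling both operations in one pass means a synonym that collapses onto an already-seen word is dropped as redundant, which matches the embodiment in which the two operations are performed simultaneously.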

[0081] It will be understood that the multimodal knowledge graph provides a standardized semantic mapping path for concept candidate nodes, and that its entity concepts can eliminate the differences in semantic expression between concept nodes of different modalities. Likewise, the label vocabulary provides a unified natural language standard for label entries, and its natural language words ensure the semantic consistency and standardized expression of the final label entries (see Table 1).

[0082] Table 1: Correspondence between concept candidate nodes, semantic anchors, and tag entries

[0083]

[0084] In some embodiments, the selection process of semantic anchors maintains a unique correspondence with the feature vectors of concept candidate nodes. Semantic anchors possess independent and complete semantic attribute information in the multimodal knowledge graph, directly determining the core semantic content of the initial label representation. In some embodiments, redundant word removal and synonym merging are performed simultaneously. The merged synonyms directly replace semantically overlapping natural language words in the initial label representation, and the merging operation does not change the core semantic direction of the initial label representation. Optionally, the initial label representation fully retains the contextual association information of concept candidate nodes in the original multimodal data samples during its formation process. This contextual association information is used to improve the fit of natural language word selection. Optionally, label entries with clear semantics are directly included in the preliminary multimodal label set after generation, and the label entries in the preliminary multimodal label set maintain independent semantic expression and structural form.

[0085] See Figure 4. In the modality fusion layer of the multimodal data label generation method, constructing the common latent space and calculating cross-modal similarity are the core steps for achieving concept alignment. Specifically, the feature vectors of concept graph nodes from the visual, text, and audio modalities are mapped into a unified common latent space by their respective projectors, forming visual, text, and audio latent nodes. Cross-modal similarity is quantified by calculating the vector distance between latent nodes of different modalities in the common latent space; the similarity value directly reflects the strength of the semantic association between concepts of different modalities. As seen in the heatmap, the similarity between the visual and text modalities is 0.85, between the visual and audio modalities is 0.78, and between the text and audio modalities is 0.82. The similarity between each modality and itself is 1.00. This distribution indicates that the visual and text modalities have the highest concept alignment in the common latent space, while the audio modality is less strongly associated with the other modalities. This can be used to guide the setting of the threshold for establishing cross-modal projection channels and the priority for extracting consensus concept nodes. In practice, the similarity threshold can be set to 0.80, so that the cross-modal projection channels between visual and text (0.85) and between text and audio (0.82) are established first, improving the construction efficiency and stability of the cross-modal concept graph.
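The threshold rule can be checked against the similarity values reported for Figure 4. In the sketch below, cosine similarity is an assumed choice of measure (the text only says the similarity is derived from vector distance), and the function names are hypothetical. The method applies the threshold to individual latent node pairs; the modality-level values are used here only because they are the figures given in the text.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Assumed similarity measure between two latent nodes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Pairwise similarities as stated for the Figure 4 heatmap.
FIGURE_4_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("visual", "text"): 0.85,
    ("visual", "audio"): 0.78,
    ("text", "audio"): 0.82,
}


def channels_to_establish(similarity: Dict[Tuple[str, str], float],
                          threshold: float = 0.80) -> List[Tuple[str, str]]:
    # Keep pairs whose similarity exceeds the threshold, strongest first.
    passing = [(pair, score) for pair, score in similarity.items() if score > threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in passing]


priority = channels_to_establish(FIGURE_4_SIMILARITY)
print(priority)
```

With a threshold of 0.80 the visual-text and text-audio pairs pass and the visual-audio pair (0.78) does not, which agrees with the channels the paragraph says are established first.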

[0086] In one embodiment of the present invention, a preset tag normalization template is used to reorganize the format of the preliminary multimodal tag set and verify its hierarchy. In a specific implementation, a preset tag hierarchy architecture template is loaded. The tag hierarchy architecture template defines the structure of the main category, subcategory, and attribute description fields of a tag, and these fields maintain a fixed hierarchical relationship. Each tag entry in the preliminary multimodal tag set is then semantically matched against the tag hierarchy architecture template to determine its main category, subcategory, and required attributes. The semantic matching is performed by vector comparison between the core semantics of the tag entry and the category semantics in the tag hierarchy architecture template.
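The semantic matching step can be sketched as picking the template category whose semantic vector is most similar to the tag entry's core semantic vector. Everything concrete below is an assumption for illustration: the main category names, the three-dimensional vectors, the required attribute lists, the use of cosine similarity, and the function `match_category`. Only the subcategory names are taken from the description.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical tag hierarchy template: main category, subcategory,
# required attribute fields, and a category semantic vector.
TEMPLATE = [
    {"main": "visual", "sub": "object recognition",
     "attributes": ["object", "color"], "vec": [1.0, 0.0, 0.0]},
    {"main": "visual", "sub": "scene classification",
     "attributes": ["scene"], "vec": [0.7, 0.7, 0.0]},
    {"main": "audio", "sub": "acoustic events",
     "attributes": ["event"], "vec": [0.0, 0.0, 1.0]},
]


def match_category(tag_vec: List[float], template: List[Dict]) -> Dict:
    # Vector comparison: choose the category with the highest similarity.
    return max(template, key=lambda cat: cosine_similarity(tag_vec, cat["vec"]))


best = match_category([0.9, 0.1, 0.05], TEMPLATE)
print(best["main"], best["sub"], best["attributes"])
```

The matched entry supplies the main category, subcategory, and required attributes that the subsequent structural reorganization fills in.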

[0087] In practice, the natural language descriptions of tag entries are structurally reorganized according to the attribute filling rules of their respective main and subcategories, and the corresponding attribute description fields are filled in. The attribute filling rules limit the expression format and content composition of tag entries at the corresponding level. The structural reorganization process preserves the original semantics of the tag entries and adapts to the field requirements of the tag hierarchy architecture template. In practice, for the set of tag entries that have been structurally reorganized, the logical mutual exclusion and integrity of tags at the same level are verified. Tags with logical conflicts are semantically fine-tuned or reclassified. The logical mutual exclusion verification identifies tag entries at the same level that have semantic overlap or semantic contradictions. The integrity verification confirms that the tag entries at the same level cover the entire semantic range set by the tag hierarchy architecture template.

[0088] It will be understood that the tag hierarchy architecture template provides a unified, structured organization for the preliminary multimodal tag set: its main category, subcategory, and attribute description fields allow tag entries to be classified and managed hierarchically. Likewise, format reorganization transforms scattered tag entries into structured data that conforms to the hierarchical specification, which improves the retrieval efficiency and standardized use of tag entries in subsequent calls.

[0089] In practice, the logical mutual exclusion and integrity of tags at the same level are verified for the set of restructured tag entries as follows. Under the same subcategory of the tag hierarchy architecture template, the core semantic feature vectors of all tag entries are extracted. The core semantic feature vectors fully represent the core semantic content of the tag entries, and their dimension is consistent with that of the category semantic feature vectors in the tag hierarchy architecture template. The pairwise semantic distance between the core semantic feature vectors is then calculated; it directly reflects the degree of semantic difference between tag entries at the same level. Tag entry pairs whose semantic distance is less than the mutual exclusion threshold are identified as tag pairs with potential logical conflicts, that is, pairs that overlap or contradict each other semantically within the same subcategory.
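The mutual exclusion check reduces to computing all pairwise distances within one subcategory and flagging the pairs that fall below the threshold. The sketch assumes cosine distance and a threshold of 0.1; the tag names, vectors, and the function `find_conflicting_pairs` are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    # Assumed semantic distance: 1 minus cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def find_conflicting_pairs(tags: Dict[str, List[float]],
                           threshold: float = 0.1) -> List[Tuple[str, str]]:
    # Flag every pair of tag entries whose distance is below the threshold.
    return [
        (name_a, name_b)
        for (name_a, vec_a), (name_b, vec_b) in combinations(tags.items(), 2)
        if cosine_distance(vec_a, vec_b) < threshold
    ]


# Toy core semantic feature vectors for tag entries under one subcategory.
subcategory_tags = {
    "red car": [0.9, 0.1],
    "crimson car": [0.88, 0.12],
    "blue sky": [0.1, 0.9],
}
conflicts = find_conflicting_pairs(subcategory_tags)
print(conflicts)
```

Flagged pairs would then be passed to the refinement step, which adjusts one entry's attribute description until the distance rises above the threshold.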

[0090] In some embodiments, for tag pairs with potential logical conflicts, the description of the attribute description field of one of the tag entries is refined or modified based on the original context support to increase the semantic distance and eliminate logical conflicts. The original context support directly reflects the support strength of the tag entry in the original multimodal data sample. The description refinement or modification operation preserves the core semantics of the tag entry and adjusts the detailed description content. In some embodiments, for tag pairs with potential logical conflicts, the description of the attribute description field of one of the tag entries is refined or modified based on manually predefined conflict resolution rules to increase the semantic distance and eliminate logical conflicts. The manually predefined conflict resolution rules pre-set the semantic distinction criteria of tags at the same level.

[0091] See Figure 5. For the arbitration stage of conflicting concept nodes in multimodal data label generation, the manual rule resolutions, the context-supported resolutions, and the overall resolution success rate were statistically analyzed for six subcategories: object recognition, scene classification, sentiment tendency, entity extraction, acoustic events, and timbre features. Specifically, the volume of conflicts processed was quantified as the "number of conflicting tag pairs," and the resolution success rate was measured as a percentage. A dual-axis chart presents the two types of indicator together. Regarding the distribution of conflict resolution frequency, the number of manual rule resolutions peaked in the sentiment tendency subcategory, followed by object recognition and entity extraction, while the acoustic events and timbre features subcategories had relatively lower numbers. Similarly, the peak of context-supported resolutions also appeared in the sentiment tendency subcategory, decreasing in turn in the entity extraction and object recognition subcategories, with the timbre features subcategory having the lowest value. The frequency distributions of the two resolution methods directly reflect the modal complexity and arbitration dependency of conflicting concept nodes under different subcategories. Regarding the variation in success rates, the sentiment tendency subcategory performed best, with a success rate exceeding 95%, followed closely by the object recognition subcategory. The acoustic events subcategory had a lower success rate, dropping into the 90% range. The timbre features subcategory, however, showed a significant upward trend. This fluctuation in success rate quantitatively supports the subcategory adaptability of the "manual rules + contextual support" weighted fusion strategy in the modal arbitrator. The sentiment tendency and object recognition subcategories, because their modal features are highly distinguishable, showed strong discriminative power in their comprehensive arbitration scores and maintained high success rates. The acoustic events subcategory, affected by the ambiguity of audio features, was harder to balance between contextual support and modal confidence, which lowered its success rate. The timbre features subcategory improved the accuracy of arbitration decisions by refining the semantic distinction criteria of the attribute description fields.

[0092] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims

1. A method for generating multimodal data tags, characterized in that the method comprises: collecting raw multimodal data samples, which include at least an image data stream, a text description stream, and an audio signal stream; structurally deconstructing the image data stream to extract a set of basic visual units, dividing the text description stream into semantic units to extract a set of text semantic units, and separating the audio signal stream into acoustic features to extract a set of audio feature units; inputting the set of basic visual units into a spatial association network to construct a visual concept graph, inputting the set of text semantic units into a semantic dependency parser to construct a text concept graph, and inputting the set of audio feature units into an acoustic event builder to construct an audio concept graph; establishing, in a modal fusion layer, a cross-modal projection channel between the visual concept graph, the text concept graph, and the audio concept graph, and performing interaction mapping of nodes and edges through the cross-modal projection channel to form a cross-modal associated concept graph; extracting consensus concept nodes and arbitrating conflicting concept nodes based on the cross-modal associated concept graph to generate a unified concept candidate node pool; performing tag distillation on the unified concept candidate node pool to transform each concept candidate node into a tag entry with clear semantics, generating a preliminary multimodal tag set; and reorganizing the format of the preliminary multimodal tag set and verifying its hierarchy using a preset tag normalization template to generate normalized multimodal data tags; wherein extracting consensus concept nodes and arbitrating conflicting concept nodes based on the cross-modal associated concept graph comprises: identifying, in the cross-modal associated concept graph, consensus concept node clusters, each consisting of concept nodes from different modalities that are closely connected through cross-modal projection channels; calculating the internal consistency score of each consensus concept node cluster, and adding each consensus concept node cluster whose internal consistency score exceeds the consensus threshold, as a whole, to the candidate node pool as a candidate node; identifying, in the cross-modal associated concept graph, conflicting concept node clusters, each consisting of concept nodes from different modalities that point to similar semantics but contradict each other in key features; and submitting each conflicting concept node cluster to a modal arbitrator, which arbitrates based on the confidence score and context support of each modal node, selects one modal concept node as a representative node, and adds the selected representative node to the candidate node pool.

2. The multimodal data tag generation method according to claim 1, characterized in that structurally deconstructing the image data stream to extract a set of basic visual units comprises: applying a multi-scale region proposal network to the image data stream to generate image region proposal boxes at different spatial scales and locations; performing deep convolutional feature extraction on the image content within each image region proposal box to obtain a region visual feature vector; and performing cluster analysis and visual vocabulary selection on the region visual feature vectors to form a representative set of basic visual units, each basic visual unit being represented by a visual vocabulary entry and its corresponding feature vector.

3. The multimodal data tag generation method according to claim 2, characterized in that establishing a cross-modal projection channel between the visual concept graph, the text concept graph, and the audio concept graph in the modal fusion layer comprises: establishing a common latent space in the modal fusion layer, and constructing a visual projector from the visual feature space to the common latent space, a text projector from the text feature space to the common latent space, and an audio projector from the audio feature space to the common latent space; using the visual projector to project the node feature vectors in the visual concept graph into the common latent space to obtain visual latent nodes; using the text projector to project the node feature vectors in the text concept graph into the common latent space to obtain text latent nodes; using the audio projector to project the node feature vectors in the audio concept graph into the common latent space to obtain audio latent nodes; and calculating the similarity between the visual latent nodes, text latent nodes, and audio latent nodes in the common latent space and, if the similarity exceeds a set threshold, establishing a cross-modal projection channel between the corresponding original concept graph nodes.

4. The multimodal data tag generation method according to claim 3, characterized in that the arbitration by the modal arbitrator based on the confidence score and context support of each modal node comprises: calculating a modal confidence score for the concept node of each modality in the conflicting concept node cluster, the modal confidence score being determined based on the centrality and path clarity of the concept node in its original concept graph; calculating a context support for the concept node of each modality in the conflicting concept node cluster, the context support being determined based on the number and strength of the non-conflicting concept nodes associated with the concept node in the cross-modal associated concept graph; performing weighted fusion of the modal confidence score and the context support of each modal concept node to obtain a modal arbitration comprehensive score; and selecting the concept node of the modality with the highest modal arbitration comprehensive score as the representative node that wins the arbitration.

5. The multimodal data tag generation method according to claim 4, characterized in that performing tag distillation on the unified concept candidate node pool to transform each concept candidate node into a tag entry with clear semantics comprises: querying a preset multimodal knowledge graph for each concept candidate node in the unified concept candidate node pool; finding, in the multimodal knowledge graph, one or more entity concepts that are closest to the feature vector of the concept candidate node and using them as semantic anchors; with the semantic anchors as the core, and in combination with the contextual association of the concept candidate node in the original multimodal data samples, selecting the one or more most appropriate natural language words from a preset tag vocabulary to form an initial tag description; and removing redundant words from and merging synonyms in the initial tag description to form a tag entry with clear semantics.

6. The multimodal data tag generation method according to claim 5, characterized in that reorganizing the format of the preliminary multimodal tag set and verifying its hierarchy using a preset tag normalization template comprises: loading a preset tag hierarchy architecture template, which defines the structure of the main category, subcategory, and attribute description fields of a tag; semantically matching each tag entry in the preliminary multimodal tag set with the tag hierarchy architecture template to determine the main category, subcategory, and required attributes of each tag entry in the tag hierarchy architecture template; structurally reorganizing the natural language description of each tag entry according to the attribute filling rules of its main category and subcategory, and filling it into the corresponding attribute description fields; and, for the set of tag entries that have been structurally reorganized, verifying the logical mutual exclusion and integrity of tags at the same level, and performing semantic fine-tuning or reclassification of tags with logical conflicts.

7. The multimodal data tag generation method according to claim 6, characterized in that verifying the logical mutual exclusion and integrity of tags at the same level for the set of tag entries that have been structurally reorganized comprises: extracting, under the same subcategory of the tag hierarchy architecture template, the core semantic feature vectors of all tag entries; calculating the pairwise semantic distance between the core semantic feature vectors; identifying tag entry pairs with a semantic distance less than the mutual exclusion threshold as tag pairs with potential logical conflicts; and, for tag pairs with potential logical conflicts, refining or modifying the description in the attribute description field of one of the tag entries according to its original context support or manually predefined conflict resolution rules, in order to increase the semantic distance and eliminate the logical conflict.

8. The multimodal data tag generation method according to claim 7, characterized in that collecting the raw multimodal data samples includes synchronous acquisition and asynchronous alignment processing: the image data stream, text description stream, and audio signal stream of the target scene are captured synchronously by a sensor array, and a unified timestamp sequence is recorded; for data segment misalignment between streams caused by transmission delay or differences in processing rate, the unified timestamp sequence is used as a reference, and a time window sliding alignment algorithm is adopted to divide the data streams of the different modalities into multiple time-aligned multimodal data frames; and each multimodal data frame contains the image data segments, text description segments, and audio signal segments within the same time window, which together constitute the raw multimodal data sample.

9. A multimodal data tag generation system, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the multimodal data tag generation method according to any one of claims 1 to 8.