Cognitive graph-based multimodal information association generation method and system
By monitoring the granularity matching of multimodal data, the semantic alignment accuracy, and the association enhancement, the method addresses entity errors and excessive resource consumption during multimodal information association generation, thereby improving its accuracy and efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING YIJIAO LANTIAN TECH DEV CO LTD
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-12
Figure CN122196197A_ABST
Description
Technical Field
[0001] This invention relates to the field of multimodal information processing technology, and in particular to a method and system for generating multimodal information associations based on cognitive graphs.
Background Technology
[0002] With the development of multimodal large models and neural symbolic systems, cognitive graphs are gradually becoming a key architecture for achieving higher levels of cognitive intelligence, driving generative systems from data-driven to knowledge-driven. The existing method of generating multimodal information by association refers to using structured knowledge representation technology to integrate different modal information such as images, text, speech, and video into a unified knowledge graph framework. Through the semantic association of graph nodes and relationships, it drives the deep understanding and content generation of cross-modal information. Its core lies in building a graph network with cognitive entities, attributes, relationships, and events as elements, and using graph neural networks, cross-modal alignment, knowledge reasoning, and other technologies to achieve semantic association and generation enhancement of multi-source heterogeneous information.
[0003] To achieve the aforementioned semantic association and generation enhancement of multi-source heterogeneous information, existing technical approaches typically include the following key steps: First, extracting structured semantic units from raw multimodal data (such as natural language text, images, and videos) through multimodal information extraction (such as object detection, entity recognition, and relation extraction). Second, using embedding representation learning (such as cross-modal pre-trained models) to map features from different modalities to a unified vector space to achieve semantic alignment. Third, using a cognitive graph as an intermediary layer, mining cross-modal implicit associations through graph neural networks, graph attention mechanisms, and knowledge reasoning rules, and enhancing the logical consistency of the generation process based on graph structure constraints (such as path reasoning and subgraph retrieval). Finally, outputting text, images, or cross-modal content that is highly correlated with the multimodal context through graph-conditional sequence generation models (such as graph-enhanced Transformer and graph-to-text generation models).
[0004] For example, Chinese invention patent CN114925176B discloses a method, system, and medium for constructing a multimodal cognitive graph of an intelligent agent. The method includes: constructing an initial cognitive system for the multimodal cognitive graph of the intelligent agent; performing multimodal recognition and cognitive extraction on the acquired multimodal data based on the multimodal cognitive graph of the intelligent agent; fusing the extracted multimodal cognition, including multimodal entity linking and cognitive merging; and performing cognitive processing according to the cognitive graph and logical reasoning rules. Thus, by constructing the initial cognitive system for the multimodal cognitive graph of the intelligent agent, initial logical reasoning rules are constructed; multimodal recognition and cognitive extraction are performed on the multimodal data acquired by the sensor based on the existing cognitive system; and the extracted multimodal cognition is fused, including multimodal entity linking and cognitive merging; and cognitive processing is performed according to the cognitive graph and logical reasoning rules.
[0005] As can be seen from the above methods, they only achieve shallow extraction, feature mapping, and rule-based reasoning of multimodal data based on existing cognitive graphs. They are insufficient to solve the problems of entity errors and relationship ambiguity in the information extraction stage, and cannot address the limitations of existing technologies in the feature mapping and association reasoning stages.
[0006] In the process of generating multimodal information association, the original multimodal data may have an uneven distribution of features within each modality. For example, the feature ratio of high-frequency words to low-frequency technical terms in text may be unbalanced, the pixel feature density of the target area and the background area in an image may differ significantly, and the temporal feature distribution of key frames and redundant frames in a video may be scattered. This may lead to a granularity mismatch between modalities, which usually introduces entity errors and ambiguous relationships during the information extraction stage.
[0007] Existing technologies typically employ a single cross-modal embedding model for feature mapping and rely on fixed rules or shallow graph structures for association reasoning. This approach may exacerbate the one-sidedness of graph node associations and the redundancy of generated content, so that the generated results do not adequately match the multimodal context. It cannot dynamically and reasonably allocate node weights (such as attention weights for graph nodes) or mine deep association links. Insufficient semantic alignment accuracy and low reasoning efficiency in turn lead to error propagation and excessive resource consumption. These limitations further cause an imbalance in the allocation of node weights in the cognitive graph and insufficient mining of association links, which causes errors to propagate through multimodal information association and resources to be over-consumed during generation, ultimately resulting in poor adaptability of the multimodal information association generation process.
Summary of the Invention
[0008] To address the poor adaptability of existing technologies in the multimodal information association generation process, this invention provides a method and system for multimodal information association generation based on cognitive graphs. The technical solution is as follows: In one aspect, a method for generating multimodal information association based on cognitive graphs is provided, including: analyzing the granularity matching of the corresponding features of multimodal data, and determining, based on the feedback result of the analysis, whether to implement the analysis strategy of multimodal features and granularity matching conditions, where the multimodal data is a set of any two or three of text features, image features, and video features; if the feedback result is to implement, evaluating the semantic alignment accuracy after implementing the analysis strategy, and determining, based on the evaluation result, whether to conduct a qualification assessment of the associated content of the graph nodes, so as to further determine whether to send a qualification prompt for multimodal information association generation; and if the feedback result is not to implement, taking measures to improve the effectiveness and relevance of the multimodal data and then continuing to implement the analysis strategy of multimodal features and granularity matching conditions.
[0009] In another aspect, a multimodal information association generation system based on cognitive graphs is provided, including: a granularity matching monitoring module, a semantic alignment accuracy monitoring module, and an association improvement monitoring module. The granularity matching monitoring module analyzes the granularity matching of the corresponding features of multimodal data, and determines, based on the feedback result, whether to implement the analysis strategy of multimodal features and granularity matching conditions. If the feedback result is to implement, the semantic alignment accuracy monitoring module evaluates the semantic alignment accuracy after the analysis strategy is implemented, and determines, based on the evaluation result, whether to conduct a qualification assessment of the associated content of the graph nodes, so as to further decide whether to send a qualification prompt for multimodal information association generation. If the feedback result is not to implement, the association improvement monitoring module takes measures to improve the effectiveness and relevance of the multimodal data and then continues to implement the analysis strategy of multimodal features and granularity matching conditions.
[0010] The beneficial effects of the technical solutions provided in the embodiments of the present invention include at least the following: 1. The granularity matching of the corresponding features of the multimodal data is analyzed, and the feedback result of this analysis determines whether to implement the analysis strategy of multimodal features and granularity matching conditions. This helps to start and stop the analysis strategy accurately and to optimize resource allocation, avoiding unnecessary analysis processes that consume computing power. If the feedback result is to implement, the semantic alignment accuracy is evaluated after the analysis strategy is implemented, and the evaluation result determines whether to conduct a qualification assessment of the associated content of the graph nodes, so as to further decide whether to send a qualification prompt for multimodal information association generation. This helps to filter out results that meet semantic consistency and association reliability requirements, preventing unqualified association information from flowing into subsequent stages. If the feedback result is not to implement, measures are taken to improve the effectiveness and relevance of the multimodal data, and the analysis strategy of multimodal features and granularity matching conditions is then continued. This helps to proactively correct data defects, improve the accuracy of subsequent analysis results at the source, and reduce analysis bias and result distortion caused by quality problems in the original data.
[0011] 2. The granularity matching is analyzed through a combination of any two or three of three sub-tasks: text feature proportion imbalance quantification, image feature proportion imbalance quantification, and video feature proportion imbalance quantification. Compared with existing technologies, which usually quantify imbalance in a single modality and ignore the synergistic effect of cross-modal features, this method helps to comprehensively evaluate the granularity matching status of multimodal features and accurately capture the synergistic effect of cross-modal feature proportion imbalance, thereby improving the comprehensiveness and accuracy of granularity matching analysis and providing a more reliable quantitative basis for subsequent information association generation.
[0012] 3. By evaluating the semantic alignment accuracy, the weight values corresponding to the graph nodes are obtained. If the weight value corresponding to the graph node is less than the weight threshold, the corresponding redundant nodes are removed based on the link pruning algorithm. Otherwise, the co-occurrence frequency of graph nodes is monitored. This addresses the shortcomings of existing methods, such as redundant node accumulation and low node association efficiency. It helps to simplify the cognitive graph structure, improve the efficiency of graph node association retrieval, thereby reducing the time for multimodal information association generation and strengthening the graph's ability to focus on core features.
[0013] 4. By conducting the qualification assessment of the associated content, accurate transmission and consumption values are obtained. When the qualification assessment conditions are met, a qualified prompt for multimodal information association generation is sent; otherwise, an unqualified prompt for multimodal information association generation is sent. This addresses the shortcomings of existing methods, such as vague qualification criteria, and helps to achieve quantitative feedback and rapid calibration of association results. This ensures the consistency and qualification of multimodal information association generation results and provides a clear direction for subsequent process optimization.
[0014] 5. When keyframes are locally concentrated in a long video, the overall distribution of keyframes in the time dimension needs to be analyzed. The normalized global distribution entropy is obtained through video feature proportion imbalance quantification. When the normalized global distribution entropy is greater than the corresponding distribution entropy threshold, the analysis strategy of multimodal features and granularity matching conditions is implemented; otherwise, the temporal feature distribution is determined to be unbalanced. Compared with existing technologies, which usually use fixed keyframe truncation rules and ignore the balance of the temporal distribution, leading to deviations in video feature extraction, this method helps to accurately identify the temporal distribution state of keyframes in long videos and dynamically adapt the start and stop logic of the analysis strategy, thereby improving the effectiveness of video modal feature analysis and avoiding multimodal association deviation caused by temporal distribution imbalance.
Attached Figure Description
[0015] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0016] Figure 1 is a flowchart of the overall process of generating multimodal information association based on cognitive graphs provided in an embodiment of this application;
Figure 2 is a schematic diagram of the text feature proportion imbalance quantification provided in an embodiment of this application;
Figure 3 is a schematic diagram of the qualification assessment of associated content provided in an embodiment of this application;
Figure 4 is a schematic diagram of the structure of a multimodal information association generation system based on cognitive graphs provided in an embodiment of this application;
Figure 5 is a schematic diagram of the generation of associated content provided in an embodiment of this application;
Figure 6 is a scatter plot of predicted values versus actual values provided in an embodiment of this application;
Figure 7 is the model training loss convergence curve provided in an embodiment of this application.
Detailed Implementation
[0017] Embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of the present disclosure are shown in the drawings, it should be understood that embodiments of the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of the present disclosure.
[0018] It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure. In the description of the embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "this embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects.
[0019] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.
[0020] Embodiment 1: Figure 1 shows the overall flowchart of the multimodal information association generation method based on cognitive graphs provided in this application embodiment. The method includes the following steps: Granularity matching monitoring: In the defined multimodal information association generation scenario, the granularity matching of the corresponding features of the multimodal data is analyzed, and the feedback result determines whether to implement the analysis strategy of multimodal features and granularity matching conditions. The multimodal data is a set of any two or three of text features, image features, and video features. Granularity matching monitoring helps to accurately identify the granularity imbalance and matching status of multimodal features, prevents ineffective analysis processes from consuming computing resources, and provides a basis for starting and stopping the analysis strategy, thus avoiding at the source the distortion of association results caused by granularity mismatch.
[0021] Semantic alignment accuracy monitoring: If the feedback result is "implementation", then after implementing the analysis strategy, the semantic alignment accuracy is evaluated. Based on the evaluation results, it is determined whether to conduct a qualification assessment on the associated content of the graph nodes, so as to further decide whether to send a qualified prompt for multimodal information association generation. By conducting semantic alignment accuracy monitoring, it is helpful to achieve quantitative control of cross-modal feature semantic consistency, accurately screen out graph node content that meets the association requirements, and provide a clear quality guide for the multimodal information association generation results.
[0022] Correlation Enhancement Monitoring: If the feedback result is not to implement, after taking measures to improve the effectiveness and correlation of multimodal data, continue to implement the analysis strategy of multimodal features and granularity matching conditions; by conducting correlation enhancement monitoring, it is helpful to proactively correct defects such as insufficient effectiveness and weak correlation of multimodal data, provide high-quality data support for restarting granularity matching analysis, and ensure the consistency and reliability of the overall process.
[0023] In this embodiment, correlation analysis is performed through granularity matching monitoring, semantic alignment accuracy monitoring, and correlation enhancement monitoring. The interaction of these three elements provides full-process, multi-level quality control support for the generation of multimodal information correlation, improving the accuracy, stability, and efficiency of multimodal feature correlation. This helps to solve core technical problems such as uneven distribution of cross-modal features, semantic bias, and insufficient data quality, thereby providing a reliable guarantee for the generation of multimodal correlation information.
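The interaction of the three monitoring steps can be sketched as a small control loop. This is only an illustrative sketch: the names `Monitors` and `generate_association`, the retry limit `max_rounds`, and the numeric checks in the example are hypothetical and do not come from the embodiment; returning "unqualified" when the semantic alignment check fails is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Data = Dict[str, float]  # hypothetical container for monitored indicators


@dataclass
class Monitors:
    granularity_matched: Callable[[Data], bool]  # granularity matching monitoring
    alignment_accurate: Callable[[Data], bool]   # semantic alignment accuracy monitoring
    content_qualified: Callable[[Data], bool]    # qualification assessment of node content
    improve_data: Callable[[Data], Data]         # association enhancement measures


def generate_association(data: Data, monitors: Monitors, max_rounds: int = 3) -> str:
    """Return the prompt that would be sent ("qualified" or "unqualified")."""
    for _ in range(max_rounds):  # max_rounds is an assumed safeguard against endless retries
        if monitors.granularity_matched(data):
            # Feedback result "implement": run the analysis strategy, then evaluate alignment.
            if monitors.alignment_accurate(data) and monitors.content_qualified(data):
                return "qualified"
            return "unqualified"
        # Feedback result "not implement": improve the data, then analyze again.
        data = monitors.improve_data(data)
    return "unqualified"


monitors = Monitors(
    granularity_matched=lambda d: d["imbalance"] <= 0.5,
    alignment_accurate=lambda d: d["alignment"] >= 0.8,
    content_qualified=lambda d: d["content_score"] >= 0.7,
    improve_data=lambda d: {**d, "imbalance": d["imbalance"] - 0.2},
)
result = generate_association(
    {"imbalance": 0.8, "alignment": 0.9, "content_score": 0.75}, monitors
)
print(result)
```

In this example the data fails the granularity check twice, is improved each time, and passes on the third round.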
[0024] Furthermore, the granularity matching of the corresponding features of the multimodal data is analyzed as follows: Multimodal data is acquired, and the granularity matching is analyzed for any two or three of the text, image, and video features in the multimodal data. The analysis consists of the corresponding combination of any two or three of the following three sub-tasks: text feature proportion imbalance quantification, image feature proportion imbalance quantification, and video feature proportion imbalance quantification. The combined result of the data representing the internal feature distribution of the multimodal data and the data reflecting the granularity matching level is used as the quantitative evaluation basis for the granularity matching level. The data representing the internal feature distribution of the multimodal data includes: data representing the distribution of text features (e.g., the proportion of non-zero elements in the text feature vector), data representing the distribution of video features (e.g., the number of keyframes per unit time), and data representing the distribution of image features (e.g., the proportion of image edge region features to the total number of features).
[0025] It should be added that the text feature vector is a numerical vector that represents the core semantic information of the text, obtained from different semantic granularity levels of the text (such as a preset word level, a preset phrase level, and a preset sentence level) by a preset feature extraction algorithm (such as Word2Vec), for example a 100-dimensional Word2Vec vector.
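The three example distribution indicators named above are simple ratios. The following sketch shows one way they could be computed; the function names and the input representations (a plain list for the feature vector, keyframe timestamps in seconds, feature counts for the image) are assumptions made for illustration.

```python
from typing import Sequence


def nonzero_ratio(feature_vector: Sequence[float]) -> float:
    """Proportion of non-zero elements in a text feature vector."""
    return sum(1 for x in feature_vector if x != 0) / len(feature_vector)


def keyframes_per_unit_time(keyframe_times: Sequence[float], duration_s: float) -> float:
    """Number of keyframes per second of video."""
    return len(keyframe_times) / duration_s


def edge_feature_ratio(edge_feature_count: int, total_feature_count: int) -> float:
    """Proportion of image edge-region features among all image features."""
    return edge_feature_count / total_feature_count


text_indicator = nonzero_ratio([0.0, 1.5, 0.0, 2.0])
video_indicator = keyframes_per_unit_time([0.5, 1.2, 3.0], duration_s=6.0)
image_indicator = edge_feature_ratio(30, 120)
print(text_indicator, video_indicator, image_indicator)
```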
[0026] In this embodiment, by analyzing the granularity matching of any two or three of the text, image, and video features in the multimodal data, it is helpful to accurately capture the pain points of granularity imbalance and collaborative matching status of cross-modal features, avoid the one-sidedness caused by single-modal analysis, and thus help improve the accuracy and reliability of multimodal feature association. Furthermore, using the combined effect of data representing the distribution of features within the multimodal data and data reflecting the granularity matching level as a quantitative evaluation basis for the granularity matching level helps to achieve quantifiable and traceable evaluation of the granularity matching status, thereby providing precise quantitative support for the optimization and adjustment of subsequent analysis strategies.
[0027] Specifically, when the multimodal data contains only two types of modal features, the two corresponding sub-tasks are performed out of the three sub-tasks: text feature proportion imbalance quantification, image feature proportion imbalance quantification, and video feature proportion imbalance quantification. For example, when the multimodal data is a combination of text and image, performing text feature proportion imbalance quantification and image feature proportion imbalance quantification together helps to clarify which modality contributes to the imbalance, providing a basis for adjusting the text feature selection threshold and supplementing the core image features. When the multimodal data is a combination of video and image, performing image feature proportion imbalance quantification and video feature proportion imbalance quantification together helps to jointly optimize the feature distribution within the visual modality and accurately capture the granularity matching deviation between images and video frames. When the multimodal data is a combination of video and text, performing text feature proportion imbalance quantification and video feature proportion imbalance quantification together helps to achieve granularity alignment between text semantic features and video spatiotemporal features.
[0028] When the multimodal data contains text, image, and video features, all three sub-tasks are performed together: text feature proportion imbalance quantification, image feature proportion imbalance quantification, and video feature proportion imbalance quantification. This helps to monitor multimodal feature proportion imbalance in all dimensions and accurately capture the collaborative imbalance patterns among the three modalities, thereby ensuring the comprehensiveness and stability of multimodal information association generation and providing reliable quantitative support for the accurate association of cognitive graph nodes.
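The rule for choosing sub-tasks from the modalities present can be written as a short lookup. This is a minimal sketch; the identifiers `SUBTASKS` and `select_subtasks` and the short sub-task labels are hypothetical.

```python
from typing import Iterable, List

# Hypothetical short labels for the three sub-tasks described above.
SUBTASKS = {
    "text": "text_imbalance_quantification",
    "image": "image_imbalance_quantification",
    "video": "video_imbalance_quantification",
}


def select_subtasks(modalities: Iterable[str]) -> List[str]:
    """Return the sub-tasks to run for the modalities present in the data."""
    present = set(modalities)
    unknown = present - SUBTASKS.keys()
    if unknown:
        raise ValueError(f"unsupported modalities: {sorted(unknown)}")
    if len(present) < 2:
        # The embodiment defines multimodal data as any two or three modalities.
        raise ValueError("at least two modalities are required")
    # Keep a fixed text, image, video order so results are reproducible.
    return [SUBTASKS[m] for m in ("text", "image", "video") if m in present]


pair = select_subtasks(["video", "text"])
triple = select_subtasks(["text", "image", "video"])
print(pair, triple)
```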
[0029] Prior to the design of the multimodal information association generation method based on cognitive graphs in this application, a database covering text feature imbalance thresholds and other parameters has been constructed to provide quantitative judgment basis for subsequent multimodal feature analysis. The database adopts a hierarchical distributed storage architecture, which can realize efficient storage, fast retrieval and real-time updates of various parameters.
[0030] Figure 2 is a schematic diagram of the text feature proportion imbalance quantification provided in an embodiment of this invention: A text feature imbalance value is obtained. When the text feature imbalance value is not greater than the text feature imbalance threshold, the analysis strategy of multimodal features and granularity matching conditions is implemented. Otherwise, text feature enhancement processing is performed: the high-frequency words obtained after high-frequency word truncation are input into a preset text language model, and whether to filter them out is determined based on the output cosine similarity of the high-frequency words.
[0031] Furthermore, when multimodal data includes text features, the analysis of granular matching includes text feature imbalance quantification. The specific process is as follows: the quantified value of the imbalance of feature proportion in the text, i.e. the text feature imbalance value, is used as data to characterize the distribution of text features. The text feature imbalance value represents the proportion of high-frequency word frequency to the total word frequency of the text (the sum of the word frequencies of all words in the text). The high-frequency word frequency represents the number of times the target high-frequency words (i.e., words whose word frequency ranking is within the preset word frequency range) appear in the text as monitored by the counter.
[0032] To ensure the effectiveness and balance of text features in multimodal granularity matching and semantic alignment, to avoid the semantic representation bias caused by high-frequency words monopolizing the feature space, and to ensure that text features can form accurate cross-modal association matches with other modal features such as images and videos, the text feature imbalance value is used as the criterion for deciding whether to perform text feature enhancement processing. If the text feature imbalance value is not greater than the text feature imbalance threshold, the corresponding text features are recorded as qualified text features to be analyzed, and the analysis strategy of multimodal features and granularity matching conditions is implemented. Conversely, if the value is greater than the threshold, high-frequency words occupy too much of the feature space, and text feature enhancement processing is performed; after processing, the analysis strategy of multimodal features and granularity matching conditions is implemented. The text feature imbalance threshold is the average of the text feature imbalance values over a historical time period.
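As a worked illustration of the text feature imbalance value and its threshold, the sketch below counts word frequencies and compares the result with a historical mean. Treating "words whose frequency ranking is within the preset range" as the `top_k` most frequent words is an assumption, as are the function names and the sample history.

```python
from collections import Counter
from typing import Sequence


def text_imbalance_value(tokens: Sequence[str], top_k: int = 2) -> float:
    """Share of the total word frequency taken by the top_k most frequent words."""
    counts = Counter(tokens)
    high_frequency = sum(count for _, count in counts.most_common(top_k))
    return high_frequency / sum(counts.values())


def imbalance_threshold(historical_values: Sequence[float]) -> float:
    """Threshold taken as the mean of imbalance values over a historical period."""
    return sum(historical_values) / len(historical_values)


def needs_enhancement(value: float, threshold: float) -> bool:
    """True when high-frequency words occupy too much of the feature space."""
    return value > threshold


tokens = "the the the graph graph node edge".split()
value = text_imbalance_value(tokens)            # (3 + 2) / 7
threshold = imbalance_threshold([0.4, 0.5, 0.6])  # assumed historical values
print(value, threshold, needs_enhancement(value, threshold))
```

Here the two most frequent words account for 5 of 7 occurrences, which exceeds the assumed historical mean, so text feature enhancement would be triggered.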
[0033] Specifically, text feature enhancement processing is performed to reduce the interference of high-frequency redundant features with relation recognition. The process is as follows: High-frequency word truncation: The text feature imbalance value and the amount of multimodal data monitored by a network traffic monitor (such as Wireshark) are input into a defined window truncation length table, and the window length corresponding to the multimodal data is obtained. This window length is used as the truncation length to truncate the high-frequency words whose text feature imbalance value is greater than the text feature imbalance threshold. The high-frequency words obtained after truncation are input into a preset text language model, such as BERT (Bidirectional Encoder Representations from Transformers), which outputs the cosine similarity of the corresponding high-frequency words. If the output cosine similarity is greater than the preset word similarity, the high-frequency words overlap heavily at the semantic level and are filtered out; high-frequency words whose output cosine similarity is not greater than the preset word similarity are retained. The preset word similarity is the average cosine similarity of high-frequency words over a historical time period. Truncating the high-frequency words with the captured window length and then filtering them by cosine similarity helps to improve the semantic purity and core information density of text features and to eliminate repetitive and redundant high-frequency word noise, thereby making the association matching between text features and cross-modal nodes more efficient and improving the accuracy of subsequent granularity matching analysis and semantic alignment evaluation.
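The similarity-based filtering step can be sketched as follows. The embodiment does not state which word pairs are compared, so this sketch assumes each high-frequency word is compared with the words already retained and dropped if any similarity exceeds the preset word similarity. The small two-dimensional vectors stand in for language model embeddings and are invented for illustration; `filter_redundant_words` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_redundant_words(
    words: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    preset_similarity: float,
) -> List[str]:
    """Keep a word only if it is not too similar to any word already kept."""
    kept: List[str] = []
    for word in words:
        if all(
            cosine_similarity(embeddings[word], embeddings[other]) <= preset_similarity
            for other in kept
        ):
            kept.append(word)
    return kept


# Toy embeddings (assumed values); a real system would take them from the language model.
embeddings = {"car": [1.0, 0.0], "automobile": [0.99, 0.1], "graph": [0.0, 1.0]}
kept_words = filter_redundant_words(["car", "automobile", "graph"], embeddings, 0.9)
print(kept_words)
```

With these toy vectors, "automobile" is nearly parallel to "car" and is filtered out, while "graph" is orthogonal and retained.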
[0034] It should be added that, in order to achieve accurate matching and efficient retrieval of cross-modal features in the set multimodal information association generation scenario, it is necessary to construct a multimodal information association query set that supports one-to-one accurate mapping relationship and many-to-one composite mapping relationship. This set includes a window truncation length table, an image node weight allocation set, and a granular matching weight allocation set.
[0035] When a newly acquired multimodal information association generation data set is input into the corresponding multimodal information association query set, the corresponding multimodal information association query result is output. The multimodal information association generation data set includes: the combination of the text feature imbalance value and the amount of multimodal data; the image feature imbalance value; and the cosine similarity of the cognitive graph nodes corresponding to any two or three of the text feature vectors, image feature vectors, and video feature vectors. The multimodal information association query result includes: the window length; the node weights in the cognitive graph corresponding to the image feature imbalance value; and the node weights in the cognitive graph corresponding to the cosine similarity.
[0036] Specifically, the construction process of the multimodal information association query set is as follows: A large number of multimodal data samples of different types and sizes are collected, covering multiple modalities such as text, images, and videos. The samples are preprocessed, including data cleaning and feature extraction, to ensure data quality and consistency. Features are then selected from the preprocessed data using a machine learning model or algorithm, such as a random forest model, which has strong feature selection capabilities and can effectively pick out, from a high-dimensional multimodal feature set, the key features that have a significant impact on information association. The preprocessed multimodal feature vectors are used as the model input, and cross-modal feature matching accuracy is the optimization target for model training. The number of decision trees is set to 100, the maximum depth to 15 layers, and the minimum number of samples for node splitting to 20. The model is trained and iterated using 5-fold cross-validation. The Gini importance coefficient of each feature is calculated, and the top 20% of features by importance coefficient are selected as the core feature set for constructing the query set. Based on the selected key feature set, one-to-one exact mapping relationships and many-to-one composite mapping relationships are defined to obtain the preliminary query set. Test samples are selected and input into the preliminary query set to verify the matching accuracy of the query results. The model parameters and weight coefficients are iteratively adjusted until the matching accuracy meets the preset threshold, thus completing the construction of the multimodal information association query set.
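The random forest settings above map directly onto scikit-learn parameters. The following is a sketch under stated assumptions: it uses scikit-learn and NumPy (the embodiment names no library), a synthetic data set from `make_classification` in place of real preprocessed multimodal feature vectors, and a binary label as a stand-in for the cross-modal matching target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for preprocessed multimodal feature vectors (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees
    max_depth=15,          # maximum depth of each tree
    min_samples_split=20,  # minimum samples required to split a node
    criterion="gini",
    random_state=0,
)

# 5-fold cross-validation; the default score for a classifier is accuracy.
scores = cross_val_score(forest, X, y, cv=5)

# Fit on all samples, then rank features by Gini (impurity-based) importance.
forest.fit(X, y)
importances = forest.feature_importances_
n_core = max(1, int(round(0.2 * X.shape[1])))  # top 20% of features
core_features = np.argsort(importances)[::-1][:n_core]

print("mean CV accuracy:", scores.mean())
print("core feature indices:", core_features.tolist())
```

The indices in `core_features` would then serve as the core feature set from which the one-to-one and many-to-one mappings of the query set are defined.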
[0037] It should also be noted that the embodiments of this invention provide a variety of training sets, including, among others: a set consisting of text feature vectors after high-frequency word extraction and the cosine similarity of preset high-frequency words; a set consisting of video feature imbalance values and the cosine similarity of redundant frames; a set consisting of the multimodal feature sets that meet the matching conditions, the sets of granularity matching difference level values that characterize the granularity matching difference level, and the preset granularity matching results; and a set consisting of the multimodal data corresponding to unqualified granularity matching and the cosine similarity of the cognitive graph nodes corresponding to any two or three of the text feature vectors, image feature vectors, and video feature vectors.
[0038] Each training set is input into the corresponding preset model for training, such as BERT (Bidirectional Encoder Representations from Transformers), ResNet (Residual Neural Network), a random forest, or a convolutional neural network, to obtain the trained preset text language model, preset video language model, preset matching correlation model, and preset lightweight model. The corresponding training processes are as follows. For the BERT model, the semantic representation ability for text features is optimized by adjusting parameters such as the number of self-attention heads and the hidden layer dimension, yielding the trained preset text language model. For the ResNet model, the number of residual blocks, the convolution kernel size, and the stride are adjusted, and batch normalization and the ReLU (Rectified Linear Unit) activation function are introduced to improve gradient propagation efficiency; combined with hybrid loss functions (such as mean squared error loss and cross-entropy loss), the model parameters are optimized to enhance the ability to identify video feature imbalance and redundant frame similarity, yielding the trained preset video language model. For the random forest model, parameters such as the number of decision trees and the node splitting threshold are optimized to improve the accuracy of the model in determining the correlation of multimodal feature matching, yielding the trained preset matching correlation model.
For the convolutional neural network model, depthwise separable convolution or grouped convolution is used to reduce computational overhead, channel attention weights are dynamically adjusted to adapt to the feature distribution characteristics of different modalities, and the network parameters are iteratively optimized based on multimodal data samples with unqualified granularity matching; this improves the cross-modal feature mapping and granularity calibration capabilities and yields the trained preset lightweight model.
[0039] By inputting the re-acquired text feature vector after high-frequency word extraction, video feature imbalance values, multimodal feature sets that meet the matching conditions, and sets of granular matching difference level values representing the granularity matching difference level, as well as multimodal data corresponding to unqualified granularity matching, into the trained model, the model can output the cosine similarity of preset high-frequency words, the cosine similarity corresponding to redundant frames, the preset granularity matching results, and the cosine similarity of any two or three corresponding cognitive graph nodes from the text feature vector, image feature vector, and video feature vector.
[0040] In this embodiment, the text feature imbalance value is obtained by performing text feature proportion imbalance quantification. When the text feature imbalance value is not greater than the text feature imbalance threshold, an analysis strategy for multimodal feature and granularity matching conditions is implemented. Conversely, text feature enhancement processing is performed. This helps to achieve precise control and quality optimization of text feature distribution, and avoids cross-modal granularity matching failure caused by excessive redundancy or severe lack of text features. Through the interaction and correlation of text feature proportion imbalance quantification and text feature enhancement processing, a solid text feature foundation is built for the generation of multimodal information association, and the risk of association deviation caused by text feature imbalance is reduced.
[0041] Furthermore, when the multimodal data includes image features, the analysis of granularity matching includes image feature proportion imbalance quantification. The specific process is as follows: the quantified value of the difference in pixel feature distribution in the image, i.e., the image feature imbalance value, is selected as the data representing the distribution of image features; the image feature imbalance value is represented by the ratio of the number of pixels in the background region to the number of pixels in the target region, as monitored by a counter; the background region is the image region not associated with the target entity in the cognitive graph, the target entity is the predefined object to be identified in the cognitive graph, and the target region is the image region associated with the target entity in the cognitive graph.
[0042] To accurately determine the distribution balance of image features and ensure stable and accurate associations between image nodes and text / video nodes in the cognitive graph, a difference operation (i.e., difference calculation) is performed based on the image feature imbalance value and the image feature imbalance threshold. The results are compared: when the difference operation result is greater than 0, it indicates that the target image features are easily masked by background noise, which will interfere with subsequent cross-modal feature alignment and graph node construction. Image region feature enhancement processing is then performed, and an analysis strategy based on multimodal features and granularity matching conditions is implemented after the processing. When the difference operation result is not greater than 0, the corresponding image feature imbalance value is recorded as a qualified image feature to be analyzed, and an analysis strategy based on multimodal features and granularity matching conditions is implemented. The image feature imbalance threshold is represented by the average value of the image feature imbalance values over a historical time period.
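The imbalance quantification and threshold comparison of paragraphs [0041] and [0042] can be illustrated with a short sketch. The function names and the sample numbers are hypothetical; the only logic taken from the text is the ratio of background pixels to target pixels, the threshold as the historical mean, and the "difference greater than 0" test.

```python
def image_imbalance_value(background_pixels, target_pixels):
    """Ratio of background-region pixels to target-region pixels."""
    if target_pixels <= 0:
        raise ValueError("target region must contain at least one pixel")
    return background_pixels / target_pixels


def needs_region_enhancement(value, historical_values):
    """True when (value - threshold) > 0, i.e. enhancement is required.

    The threshold is the mean of the imbalance values over a historical period.
    """
    threshold = sum(historical_values) / len(historical_values)
    return (value - threshold) > 0


# Hypothetical counts: 9000 background pixels, 3000 target pixels.
history = [1.5, 2.0, 2.5]                    # historical mean is 2.0
value = image_imbalance_value(9000, 3000)    # 3.0
enhance = needs_region_enhancement(value, history)
```

When `enhance` is false, the value would be recorded as a qualified image feature to be analyzed and the analysis strategy would proceed directly.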
[0043] Image region feature enhancement processing is used to improve the saliency and recognizability of target region features and to suppress the interference of redundant background features. Specifically, it includes: marking the pixels in the target region for which the difference operation result is greater than 0 as pixels to be enhanced; inputting the image feature imbalance value into the image node weight allocation set and reading the node weights in the cognitive graph corresponding to the image feature imbalance value; and reallocating the corresponding node weights (such as the attention weights of the graph nodes) to the pixels to be enhanced. The attention weights are obtained through the attention model in the cognitive graph, such as GAT (Graph Attention Network) or a multimodal attention fusion model: when the model calculates the correlation strength between nodes, it outputs an attention weight matrix that directly records the attention coefficient of each node with respect to the other nodes, and these coefficients serve as the attention weights. This helps to improve the feature recognizability and weight proportion of the image region and to weaken the interference of redundant background regions, thereby achieving accurate mapping between image features and cognitive graph nodes and strengthening the core supporting role of the image modality in multimodal association generation.
[0044] In this embodiment, image feature imbalance values are obtained by performing image feature proportion imbalance quantification. When the difference between the image feature imbalance value and the image feature imbalance threshold is greater than 0, image region feature enhancement processing is performed. This helps to accurately locate the imbalanced regions of image features (such as blurred foreground and redundant background) and specifically enhance the representation ability of core visual features. When the difference processing result is not greater than 0, an analysis strategy of multimodal feature and granularity matching conditions is implemented. This helps to reduce the computational power consumed by ineffective enhancement operations, thereby ensuring the pertinence and efficiency of multimodal feature matching analysis and improving the accuracy of cross-modal association.
[0045] Furthermore, when the multimodal data includes video features, the analysis of granularity matching includes video feature proportion imbalance quantification. The specific process is as follows: the quantified value of the imbalance in the temporal feature distribution of the video, i.e., the video feature imbalance value, is used as the data characterizing the distribution of video features; the video feature imbalance value is represented by the ratio of the number of redundant frames to the number of key frames within a unit time window (e.g., 1 second per window), as monitored by a counter; redundant frames are image frames, in a temporal sequence of consecutive image frames, for which the cosine similarity between the feature vectors of two adjacent frames is higher than a preset cosine similarity threshold; the higher the cosine similarity with adjacent frames, the more repeated frames the video contains and the lower the proportion of key frames; key frames are image frames, in a temporal sequence of consecutive image frames, for which the cosine similarity between the feature vectors of two adjacent frames is not higher than the preset cosine similarity threshold.
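A minimal sketch of this quantification is given below. It assumes that each frame is compared with its immediate predecessor and that the first frame of a window, which has no predecessor, counts as a key frame; neither detail is specified in the text. The function names, the toy two-dimensional feature vectors, and the threshold of 0.95 are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def video_imbalance_value(frames, sim_threshold=0.95):
    """Ratio of redundant frames to key frames within one unit time window."""
    if not frames:
        raise ValueError("window contains no frames")
    redundant, key = 0, 1  # assumption: the first frame is a key frame
    for prev, cur in zip(frames, frames[1:]):
        if cosine_similarity(prev, cur) > sim_threshold:
            redundant += 1   # near-duplicate of the previous frame
        else:
            key += 1
    return redundant / key


# Hypothetical frame feature vectors for one window.
window = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
imbalance = video_imbalance_value(window)  # 2 redundant frames, 3 key frames
```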
[0046] To accurately measure the rationality of the distribution of video temporal features, ensure the integrity and coherence of keyframe temporal logic, avoid cross-modal spatiotemporal matching failures caused by imbalanced temporal feature distribution (such as excessive concentration or sparseness of keyframes), and ensure that video nodes and text / image nodes form accurate spatiotemporal correspondences in the cognitive graph, the video feature imbalance value is compared with the video feature imbalance threshold to obtain a judgment result. If the judgment result is that the temporal feature distribution is unbalanced, that is, the video feature imbalance value is greater than the video feature imbalance threshold, then video keyframe screening is performed, and after processing, an analysis strategy of multimodal features and granularity matching conditions is implemented. The video feature imbalance threshold is represented by the average value of video feature imbalance values over historical time periods. If the judgment result is that the temporal feature distribution is balanced, that is, the video feature imbalance value is not greater than the video feature imbalance threshold, then the corresponding video feature imbalance value is marked as a qualified video feature to be analyzed, and an analysis strategy of multimodal features and granularity matching conditions is implemented.
[0047] Video keyframe screening is used to improve the effectiveness and correlation of video temporal features and to suppress the interference of redundant frames on feature extraction. Specifically, the video feature imbalance value is input into the preset video language model (such as ResNet), which outputs the cosine similarity of the corresponding redundant frames; redundant frames whose cosine similarity is greater than a predefined threshold are filtered out, and redundant frames whose cosine similarity is not greater than the predefined threshold are retained. This helps to improve the information density and core feature representation capability of the video keyframe sequence and eliminates duplicate and invalid redundant frame data, thereby reducing the computational cost of subsequent analysis and strengthening the correlation of video features in the multimodal cognitive graph.
[0048] In this embodiment, the video feature imbalance value is obtained by performing video feature proportion imbalance quantification and is compared with the video feature imbalance threshold. When the judgment result is that the temporal feature distribution is unbalanced, video keyframe screening is performed. When the judgment result is that the temporal feature distribution is balanced, the analysis strategy of multimodal feature and granularity matching conditions is implemented. Because video feature proportion imbalance quantification, video keyframe screening, and the implementation of the analysis strategy of multimodal feature and granularity matching conditions are interconnected and interact with each other, this helps to achieve dynamic calibration and precise control of the video temporal feature distribution, reduces the cross-modal granularity matching deviation caused by local concentration or sparse distribution of keyframes, and thus ensures the temporal consistency and reliability of multimodal information association generation.
[0049] Furthermore, the specific process of implementing the analysis strategy of multimodal features and granularity matching conditions is as follows: Obtain a set of multimodal features that meet the matching conditions. The set of multimodal features that meet the matching conditions represents a data set corresponding to multimodal data that includes any two or three of the following: qualified text feature values to be analyzed, qualified image feature values to be analyzed, and qualified video feature values to be analyzed. Use the set of multimodal features that meet the matching conditions and the corresponding set of granularity matching difference level values that characterize the granularity matching difference level as input parameters of a preset matching correlation model (such as a random forest model), and output the granularity matching result through the preset matching correlation model.
[0050] The set of granularity matching difference level values includes the text granularity matching difference level, the image granularity matching difference level, and the video granularity matching difference level. The text granularity matching difference level represents the ratio of the maximum to the minimum standard deviation of the text feature vectors within a preset semantic granularity level. Taking the standard deviation of the text feature vectors as an example, let the text feature set be $\{X_1, X_2, \dots, X_n\}$, where a single text feature vector is $X_i = (x_{i1}, x_{i2}, \dots, x_{id})$, $d$ is the feature dimension, $n$ is the number of samples, and $i = 1, \dots, n$ is the index of the text feature vector. First, the mean of each feature dimension $j$ is calculated, where $j = 1, 2, \dots, d$ is the feature dimension number; then the standard deviation $\sigma_j$ of each feature dimension is calculated; finally, the $d$-dimensional standard deviation vector $\sigma = (\sigma_1, \dots, \sigma_d)$ is obtained. The image granularity matching difference level represents the ratio of the maximum to the minimum standard deviation of the image feature vectors within a preset visual granularity level; the video granularity matching difference level represents the ratio of the maximum to the minimum standard deviation of the video feature vectors within a preset video granularity level.
A larger text granularity matching difference level indicates a more significant difference in the dispersion of the feature distribution across semantic granularity levels (vocabulary level, phrase level, sentence level), and worse consistency of granularity matching within the text; a larger image granularity matching difference level indicates a more significant difference in the dispersion of the feature distribution across visual granularity levels (pixel level, target region level, scene level), and worse consistency of granularity matching within the image; a larger video granularity matching difference level indicates a more significant difference in the dispersion of the feature distribution across video granularity levels (frame level, segment level, video global level), and worse consistency of granularity matching within the video.
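The computation of a granularity matching difference level can be sketched as below. Two assumptions are made that the text does not settle: the standard deviation is taken as the population form (dividing by $n$), and the ratio is formed between the largest and the smallest per-dimension standard deviation. The function name and sample vectors are hypothetical.

```python
from statistics import pstdev


def granularity_difference_level(vectors):
    """Ratio max(sigma_j) / min(sigma_j) over the feature dimensions j.

    `vectors` is a list of n feature vectors, each of dimension d.
    """
    # Transpose so that each tuple holds all n values of one dimension j.
    columns = list(zip(*vectors))
    sigmas = [pstdev(col) for col in columns]  # population standard deviation
    smallest = min(sigmas)
    if smallest == 0:
        raise ValueError("a constant feature dimension makes the ratio undefined")
    return max(sigmas) / smallest


# Hypothetical text feature vectors (n = 2 samples, d = 2 dimensions).
text_features = [[0.0, 0.0], [2.0, 8.0]]
level = granularity_difference_level(text_features)  # sigma = (1.0, 4.0)
```

The same function would apply unchanged to image or video feature vectors at their respective granularity levels.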
[0051] Image feature vectors are numerical vectors that characterize the core visual information of an image, obtained at different visual granularity levels (such as the preset pixel level, preset target region level, and preset scene level) using a preset feature extraction algorithm (such as Vision Transformer, ViT); an example is the 768-dimensional global feature vector output by ViT. Video feature vectors are numerical vectors that characterize the core visual information of a video, obtained at different video granularity levels (such as the preset frame level, preset segment level, and preset video global level) using a preset feature extraction algorithm (such as keyframe mean pooling); an example is the 768-dimensional keyframe mean aggregation vector.
[0052] If the output granularity matching result is qualified, the semantic alignment accuracy evaluation is started directly; if the output granularity matching result is unqualified, the corresponding multimodal data is input into the preset lightweight model for feature mapping, and the mapping weights are dynamically adjusted according to the feature distribution characteristics of different modalities.
[0053] Specifically, the dynamic adjustment of mapping weights involves: based on a preset lightweight model, outputting the cosine similarity of the corresponding cognitive graph nodes, and using it as input to a preset granularity matching weight allocation set; reading the corresponding node weights in the cognitive graph to redistribute the node weights (such as the attention weights of the graph nodes); and evaluating the semantic alignment accuracy after the allocation is completed. The lower the corresponding node weight in the cognitive graph, the worse the semantic alignment accuracy after the feature mapping of this modality is, and higher adjustment weights need to be allocated to enhance the feature representation ability.
[0054] The redistribution of node weights corresponding to cognitive graph nodes helps to improve the ability of cognitive graph nodes to focus on multimodal core features, weaken the interference of redundant feature nodes, and thus facilitate the accurate mapping of multimodal features to graph nodes, thereby strengthening the reliability and stability of cross-modal information association.
[0055] In this embodiment, by implementing an analysis strategy of multimodal features and granularity matching conditions, a set of multimodal features that meet the matching conditions and a set of corresponding granularity matching difference level values are obtained. Based on a preset matching correlation model, the granularity matching result of the two working together is obtained, which helps to improve the accuracy and objectivity of multimodal feature matching judgment, provides high-quality feature input for subsequent semantic alignment accuracy assessment and qualification judgment, and consolidates the core foundation for multimodal information association generation.
[0056] If the output granularity matching result is qualified, the semantic alignment accuracy evaluation is directly initiated; if the output granularity matching result is unqualified, the mapping weight is dynamically adjusted. This helps to realize the interrelation between the analysis strategy of implementing multimodal features and granularity matching conditions and the dynamic adjustment of mapping weights, strengthens the closed-loop optimization capability of multimodal feature granularity calibration, thereby improving the adaptability and matching efficiency of unqualified features, reducing the risk of association failure caused by granularity imbalance, and ensuring the consistency and reliability of the overall process.
[0057] Furthermore, the specific process for evaluating semantic alignment accuracy is as follows. The aligned multimodal features are used as graph nodes. The weight value corresponding to each graph node is monitored and compared with the corresponding weight threshold. If the weight value of a graph node is less than the weight threshold, the graph node is marked as a redundant node and removed based on a link pruning algorithm. The weight threshold is represented by the average of the weight values of graph nodes over a historical time period. If the weight value of a graph node is not less than the weight threshold, the co-occurrence frequency of graph nodes is monitored and compared with the corresponding co-occurrence frequency threshold. If the co-occurrence frequency is greater than the co-occurrence frequency threshold, the graph node is marked as an explicit association node, the associated content of the graph node is output based on a GNN (Graph Neural Network) model, and a qualification assessment is performed on the associated content. If the co-occurrence frequency is not greater than the co-occurrence frequency threshold, knowledge graph embedding is performed based on the TransE algorithm, and the cognitive graph is updated after embedding, thereby mining implicit semantic associations (such as causal, temporal, and subordinate relationships) between nodes. The co-occurrence frequency of graph nodes represents the proportion, as monitored by the counter, of the number of samples in the multimodal data sample set in which the semantic concepts corresponding to two graph nodes appear simultaneously to the total number of samples. The co-occurrence frequency threshold is represented by the average of the co-occurrence frequencies of graph nodes over a historical time period.
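The branching logic of the semantic alignment accuracy evaluation can be summarized in a short sketch. The function names, the returned labels, and the sample data are hypothetical; the pruning, GNN, and TransE steps themselves are not implemented here, only the decision about which of them applies to a node.

```python
def cooccurrence_frequency(samples, concept_a, concept_b):
    """Share of samples in which both semantic concepts appear together."""
    if not samples:
        raise ValueError("sample set is empty")
    both = sum(1 for s in samples if concept_a in s and concept_b in s)
    return both / len(samples)


def classify_node(weight, weight_threshold, cooc, cooc_threshold):
    """Decide how a graph node is handled during alignment evaluation."""
    if weight < weight_threshold:
        return "redundant"   # removed by link pruning
    if cooc > cooc_threshold:
        return "explicit"    # associated content output by the GNN, then assessed
    return "implicit"        # TransE embedding, then cognitive graph update


# Hypothetical sample set: each sample is the set of concepts it contains.
samples = [{"dog", "ball"}, {"dog"}, {"dog", "ball"}, {"cat"}]
freq = cooccurrence_frequency(samples, "dog", "ball")  # 2 of 4 samples
label = classify_node(weight=0.7, weight_threshold=0.5,
                      cooc=freq, cooc_threshold=0.3)
```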
[0058] A set consisting of the explicit association nodes acquired over a historical time period and the associated content of preset graph nodes, and a set consisting of the aligned multimodal features and the weight values corresponding to preset graph nodes, are input into graph neural network (GNN) models for training. Through the heterogeneous message passing mechanism of the GNN, each node aggregates the features of its adjacent nodes to update its own representation, uncovering implicit association information between nodes, and the corresponding trained GNN models are obtained. By inputting newly acquired explicit association nodes and aligned multimodal features (such as character count and word frequency) into the corresponding trained GNN model, the model can output the associated content of the corresponding graph nodes, such as the association type (e.g., text node and image node) and the association direction (e.g., text node → image node is unidirectional semantic guidance, and text node ↔ image node is bidirectional feature complementarity), or the weight values corresponding to the graph nodes (such as confidence scores obtained from the GNN model).
[0059] Figure 3 is a schematic diagram of the qualification assessment of associated content provided by an embodiment of the present invention: the transmission accuracy value and the consumption value are obtained; when the qualification assessment conditions are met, a multimodal information association generation qualified prompt is sent; otherwise, a multimodal information association generation unqualified prompt is sent.
[0060] It should be added that the specific process for assessing the qualification of associated content is as follows. Based on the graph neural network model, the softmax normalization result of the weights of the association edges between graph nodes is obtained and used as the transmission accuracy value for judging the accuracy of information transmission; the larger the transmission accuracy value, the more reliable the association and the more accurate the information transmission. The monitored peak utilization of the CPU (Central Processing Unit), such as the peak CPU utilization of a preset process monitored through Process Explorer, is used as the consumption value for judging resource consumption. The transmission accuracy value and the consumption value together serve as the conditions for assessing the qualification of associated content. If the qualification assessment conditions are met, that is, the transmission accuracy value is greater than the predefined transmission accuracy value and the consumption value is not greater than the predefined consumption value, a multimodal information association generation qualified prompt is sent; otherwise, a multimodal information association generation unqualified prompt is sent. The predefined transmission accuracy value is represented by the average of the transmission accuracy values over a historical time period, and the predefined consumption value is represented by the average of the consumption values over a historical time period.
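The qualification assessment can be sketched as follows. It is assumed here that the transmission accuracy value of an edge is that edge's softmax-normalized weight among the edges being compared; the text does not say how a single value is taken from the normalization result. The function names, raw edge scores, and historical values are hypothetical, and CPU monitoring is replaced by a plain number.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw edge weights."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_qualified(accuracy, consumption, accuracy_history, consumption_history):
    """Qualified when accuracy > historical mean and consumption <= historical mean."""
    accuracy_ref = sum(accuracy_history) / len(accuracy_history)
    consumption_ref = sum(consumption_history) / len(consumption_history)
    return accuracy > accuracy_ref and consumption <= consumption_ref


# Hypothetical raw association edge weights produced by the GNN.
normalized = softmax([2.0, 1.0, 0.0])
accuracy = normalized[0]          # about 0.665 for the edge under assessment
cpu_peak = 40.0                   # hypothetical peak CPU utilization, in percent
qualified = is_qualified(accuracy, cpu_peak,
                         accuracy_history=[0.5, 0.6],
                         consumption_history=[50.0, 60.0])
```

A true result corresponds to sending the qualified prompt; any other outcome corresponds to the unqualified prompt.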
[0061] It should be added that a data set consisting of associated content and the preset edge weights between graph nodes is used as training data and input into the graph neural network model for training. The graph neural network model learns the mapping rules between the two, explores the intrinsic relationship between the core features in the associated content and the association strength between nodes, fits the nonlinear mapping from heterogeneous associated content to quantified edge weights, and optimizes the learnable parameters inside the model, such as the attention scoring vector and the linear transformation matrix, so as to minimize the difference loss between the edge weights predicted by the model and the preset edge weights, thereby obtaining the corresponding trained model. By inputting newly acquired associated content into the trained model, the edge weights between graph nodes can be output, i.e., the raw score of the degree of association between a node pair that the graph neural network model produces after learning node features and topology.
[0062] Figure 4 is a schematic diagram of the structure of a multimodal information association generation system based on a cognitive graph, provided in an embodiment of this invention. The system includes a granularity matching monitoring module, a semantic alignment accuracy monitoring module, and an association enhancement monitoring module. The granularity matching monitoring module analyzes the granularity matching of the corresponding features of the multimodal data within a defined multimodal information association generation scenario, and determines from the feedback result of the analysis whether to implement the analysis strategy. The semantic alignment accuracy monitoring module, if the feedback result indicates implementation, evaluates the semantic alignment accuracy after the analysis strategy has been implemented, determines from the evaluation result whether to conduct a qualification assessment of the associated content of the graph nodes, and further decides whether to send a multimodal information association generation qualified prompt. The association enhancement monitoring module, if the feedback result indicates non-implementation, first takes measures to improve the effectiveness and relevance of the multimodal data and then continues to implement the analysis strategy based on multimodal features and granularity matching conditions.
[0063] In this embodiment, the weight values corresponding to the graph nodes are obtained by evaluating the semantic alignment accuracy. When the weight value is less than the weight threshold, the corresponding graph node is marked as a redundant node, and the redundant node is removed based on the link pruning algorithm. Otherwise, the co-occurrence frequency of graph nodes is monitored. This helps to achieve lightweight and accurate cognitive graph structure, remove nodes and links with no association value, thereby improving the efficiency of cross-modal information association retrieval and strengthening the graph's ability to represent core features. When the monitored co-occurrence frequency of graph nodes is greater than the co-occurrence frequency threshold, the corresponding graph node is marked as an explicit association node, and the qualification of the association content is judged. Otherwise, the knowledge graph is embedded based on the TransE algorithm, and the cognitive graph is updated after embedding. This helps to discover the hidden cross-modal associations in the graph, supplement the semantic integrity of the graph, and thus ensure the dynamic iteration and optimization of the cognitive graph, improving the coverage and depth of multimodal information association.
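For readers unfamiliar with TransE, the embedding step mentioned above rests on a simple scoring idea: a relation is modeled as a translation in the embedding space, so a triple (head, relation, tail) is plausible when head + relation lies close to tail. The sketch below shows only that scoring function with the L2 norm; the vectors are hypothetical, and training of the embeddings and the subsequent cognitive graph update are not shown.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||head + relation - tail||; lower means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, relation, tail)))


# Hypothetical 2-dimensional embeddings of two nodes and one relation.
text_node = [1.0, 0.0]
causes = [0.0, 1.0]
image_node_near = [1.0, 1.0]   # exactly head + relation
image_node_far = [4.0, 5.0]

score_near = transe_score(text_node, causes, image_node_near)
score_far = transe_score(text_node, causes, image_node_far)
```

Under this scoring, the low-distance pair would be the better candidate for an implicit association to add to the graph.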
[0064] A qualification assessment is conducted to obtain the transmission accuracy value and the consumption value. When the transmission accuracy value is greater than the predefined transmission accuracy value and the consumption value is not greater than the predefined consumption value, a multimodal information association generation qualified prompt is sent; otherwise, a multimodal information association generation unqualified prompt is sent. This helps to improve the quality controllability and standardization of the multimodal information association generation results, thereby helping to ensure the reliability and stability of multimodal cognitive graph applications.
[0065] Example 2, based on Example 1, can be used as an additional technical solution. When keyframes are locally concentrated in a long video, the overall distribution of keyframes along the time dimension needs to be analyzed. By introducing a keyframe distribution entropy model based on equal-length windows (global distribution entropy), a global quantitative judgment of the balance of the video temporal feature distribution can be achieved, thereby avoiding deviations in video semantic understanding caused by excessive concentration of keyframes in local time periods. When the multimodal data includes video features, the specific process of video feature proportion imbalance quantification is as follows. In the case of local concentration of keyframes (for example, when the number of keyframes in a preset window is greater than the defined maximum number of keyframes), the video temporal sequence is divided into equal-length windows, and the distribution entropy of the number of keyframes within each window is used as the data quantifying the distribution of video features under local concentration of keyframes. Based on the obtained distribution entropy of the number of keyframes, the global distribution entropy is obtained, and the corresponding video imbalance judgment procedure is started. The higher the entropy value, the more uniform the temporal distribution of keyframes. If the normalized global distribution entropy is greater than the corresponding distribution entropy threshold, the temporal feature distribution is judged to be balanced, the normalized global distribution entropy is marked as a qualified video feature to be analyzed, and the analysis strategy of multimodal feature and granularity matching conditions is implemented. Otherwise, the temporal feature distribution is judged to be unbalanced, and a video feature distribution warning is pushed. The distribution entropy threshold is represented by the average of the normalized global distribution entropy over a historical time period.
[0066] The global distribution entropy is obtained as follows: $H = -\sum_{W=1}^{M} P_W \log P_W$. In the formula, $H$ represents the global distribution entropy, $M$ represents the total number of windows, $W$ represents the window number, and $P_W$ represents the proportion of keyframes within the W-th window (the ratio of the number of keyframes within the W-th window, as monitored by the counter, to the total number of keyframes in the video).
[0067] In this embodiment, when keyframes in a long video are locally concentrated, a normalized global distribution entropy is obtained through video feature proportion imbalance quantification. When the normalized global distribution entropy is greater than the corresponding distribution entropy threshold, the analysis strategy of multimodal feature and granularity matching conditions is implemented; otherwise, a video feature distribution warning is pushed. This allows the distribution of keyframes in long videos to be judged quantitatively and controlled dynamically, accurately identifies the temporal feature imbalance caused by local concentration of keyframes, and further improves the granularity matching between the video modality and the text and image modalities. It thereby reduces cross-modal association deviations caused by uneven keyframe distribution and helps ensure the temporal consistency of multimodal information association generation.
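The balance judgment of Example 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the embodiment's implementation: it assumes the Shannon form of the entropy suggested by the symbol definitions in [0066], uses the natural logarithm, and normalizes by the maximum value ln(M), none of which the text fixes. The function names and the list of historical values are hypothetical.

```python
import math
from typing import Sequence


def global_distribution_entropy(keyframe_counts: Sequence[int]) -> float:
    """Entropy of the keyframe proportions P_W over M equal-length windows.

    Assumption: Shannon form with natural logarithm, H = -sum(P_W * ln P_W).
    """
    total = sum(keyframe_counts)
    if total == 0:
        raise ValueError("the video contains no keyframes")
    entropy = 0.0
    for count in keyframe_counts:
        if count > 0:  # an empty window contributes nothing (0 * ln 0 := 0)
            p_w = count / total
            entropy -= p_w * math.log(p_w)
    return entropy


def normalized_entropy(keyframe_counts: Sequence[int]) -> float:
    """Assumed normalization: divide H by its maximum ln(M), giving [0, 1]."""
    m = len(keyframe_counts)
    if m < 2:
        return 0.0
    return global_distribution_entropy(keyframe_counts) / math.log(m)


def is_balanced(keyframe_counts: Sequence[int],
                historical_values: Sequence[float]) -> bool:
    """Compare against the mean of historical normalized entropies."""
    threshold = sum(historical_values) / len(historical_values)
    return normalized_entropy(keyframe_counts) > threshold
```

With evenly spread keyframes such as `[5, 5, 5, 5]` the normalized entropy reaches its maximum of 1, while `[20, 0, 0, 0]` (all keyframes in one window) gives 0 and would trigger the distribution warning for any positive threshold.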
[0068] Figure 5 is a schematic diagram of associated content generation according to an embodiment of this invention. The "dynamic heterogeneous graph" on the left is the training data input layer, in which nodes of different shapes and the connections between them represent the associated content (including association type and association direction). The structures at different time steps cover the diversity of the heterogeneous content, and the initial edge weights between nodes are preset values. The associated content and the preset edge weights together constitute the training data. The "dynamic heterogeneous graph attention search" in the middle is the core computation layer. Using the attention parameterization space, with function F_N and parameters A_N and A_R (corresponding to learnable parameters such as the attention score vector and the linear transformation matrix), the model learns the mapping from associated content to edge weights; at the same time, it uses the attention localization space to explore the intrinsic relationship between association features and association strength. This layer relies on multi-stage differentiable architecture search to advance training; the transparent parts are the candidate architecture components and parameter space to be searched, which visually conveys the candidate nature and uncertainty of the search. During training, backpropagation minimizes the difference loss between the predicted and preset edge weights, selecting the optimal architecture and completing parameter optimization. The "unified dynamic heterogeneous graph attention" on the right and the module below it form the output layer, where the dashed lines between nodes are the associated edge weights generated by the model. When new associated content is input, the model outputs the corresponding associated edge weights between graph nodes.
[0069] Figure 6 is a scatter plot of predicted versus actual values provided in an embodiment of this invention. The plot visualizes the graph neural network model's predictions of the associated edge weights of graph nodes. The light blue points are sample pairs of the predicted associated edge weight output by the model and the preset actual edge weight; the red solid line is the linear regression line between predicted and actual values; the black dashed line is the ideal fitting baseline y = x (y is the model's predicted edge weight and x is the preset actual edge weight). The coefficient of determination R² = 0.900 and the Pearson correlation coefficient r = 0.949 are marked in the figure and quantify the strength of the linear correlation between predicted and actual values. The scatter points cluster along the fitted regression line, and their deviation from the ideal baseline is relatively narrow. Together with the high R² and correlation coefficient, this indicates that the model effectively captures the mapping between heterogeneous associated content (association type, association direction) and the associated edge weights of graph nodes. The predictions are highly consistent with the actual values, which verifies the model's fitting accuracy for the associated edge weights.
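The two statistics quoted for Figure 6 can be computed as in the following sketch. It assumes that R² is the coefficient of determination of the least-squares regression line of predicted on actual edge weights, in which case R² equals r²; the reported values are consistent with this reading (0.949² ≈ 0.9006), but the text does not state how R² was computed. The function names are illustrative only.

```python
import math
from typing import Sequence, Tuple


def linear_fit(actual: Sequence[float],
               predicted: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line predicted = slope * actual + intercept."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    if sxx == 0:
        raise ValueError("actual values have zero variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    sxx = sum((x - mean_x) ** 2 for x in actual)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """R^2 of the fitted regression line: 1 - SS_res / SS_tot."""
    slope, intercept = linear_fit(actual, predicted)
    mean_y = sum(predicted) / len(predicted)
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in predicted)
    if ss_tot == 0:
        raise ValueError("predicted values have zero variance")
    return 1.0 - ss_res / ss_tot
```

If R² were instead measured against the ideal baseline y = x, the residuals would be `y - x` rather than the deviation from the fitted line, and the value would generally be lower than r².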
[0070] Figure 7 shows the convergence curves of the model's training loss provided in an embodiment of the present invention. The figure shows how the loss between the predicted edge weights and the preset actual edge weights changes with the training epochs during training of the graph neural network model. Both the training loss and the validation loss decrease continuously and gradually converge. The two curves follow the same trend and the gap between them remains stable, indicating that the model effectively learns the mapping between heterogeneous associated content and node edge weights and completes the optimization of learnable parameters such as the attention score vector and the linear transformation matrix. No overfitting or underfitting is observed, which verifies the effectiveness and stability of model training.
[0071] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A multimodal information association generation method based on cognitive graphs, characterized in that it includes the following steps: For multimodal data, the granularity matching of its corresponding features is analyzed, and whether to implement the analysis strategy of multimodal feature and granularity matching conditions is determined based on the feedback result of the analysis. The multimodal data refers to a set of multimodal features containing any two or three of the corresponding text features, image features, and video features. If the feedback result is to implement, then after the analysis strategy is implemented, the semantic alignment accuracy is evaluated, and based on the evaluation result it is determined whether to conduct a qualification assessment of the associated content of the graph nodes, so as to further decide whether to send a prompt that the multimodal information association generation is qualified. If the feedback result is not to implement, then after measures are taken to improve the effectiveness and correlation of the multimodal data, the analysis strategy of multimodal feature and granularity matching conditions continues to be implemented.
2. The multimodal information association generation method based on cognitive graphs as described in claim 1, characterized in that the granularity matching of the features corresponding to the multimodal data is analyzed as follows: Acquire the multimodal data and analyze the granularity matching of any two or three of the text, image, and video features in the multimodal data. This involves any two or three of the following sub-tasks: text feature proportion imbalance quantification, image feature proportion imbalance quantification, and video feature proportion imbalance quantification. The combined effect of the data characterizing the internal feature distribution of the multimodal data and the data reflecting the granularity matching level is used as the quantitative evaluation basis for the granularity matching level. The data characterizing the internal feature distribution of the multimodal data includes: data characterizing the distribution of text features, data characterizing the distribution of image features, and data characterizing the distribution of video features.
3. The multimodal information association generation method based on cognitive graphs as described in claim 2, characterized in that, when the multimodal data includes text features, the specific process of text feature proportion imbalance quantification is as follows: The quantified value of the imbalance in the proportion of features in the text, i.e. the text feature imbalance value, is used as the data characterizing the distribution of text features. The text feature imbalance value represents the proportion of the high-frequency word frequency to the total word frequency of the text, where the high-frequency word frequency is the number of times the target high-frequency word appears in the text; The text feature imbalance value is used as the criterion for whether to perform text feature enhancement processing. If the text feature imbalance value is not greater than the text feature imbalance threshold, the corresponding text feature imbalance value is recorded as a qualified text feature value to be analyzed, and the analysis strategy of multimodal feature and granularity matching conditions is implemented. Otherwise, text feature enhancement processing is performed, and the analysis strategy is implemented after the processing is completed. The text feature enhancement processing is as follows: High-frequency word extraction: input the text feature imbalance value and the amount of multimodal data into the defined window extraction length table, obtain the window length corresponding to the multimodal data, use the window length as the extraction length, and extract the high-frequency words corresponding to text feature imbalance values greater than the text feature imbalance threshold. The high-frequency words obtained from the extraction are input into the preset text language model, which outputs the cosine similarity of the corresponding high-frequency words.
High-frequency words whose output cosine similarity is greater than the preset word similarity are filtered out.
4. The multimodal information association generation method based on cognitive graphs as described in claim 2, characterized in that, when the multimodal data includes image features, the specific process of image feature proportion imbalance quantification is as follows: The quantified value of the difference in pixel feature distribution in the image, i.e. the image feature imbalance value, is used as the data characterizing the distribution of image features. The image feature imbalance value is represented by the ratio of the number of pixels in the background region to the number of pixels in the target region; The result of the difference processing between the image feature imbalance value and the image feature imbalance threshold is evaluated: when the result of the difference processing is greater than 0, image region feature enhancement processing is performed, and after the processing is completed, the analysis strategy of multimodal feature and granularity matching conditions is implemented; when the result of the difference processing is not greater than 0, the corresponding image feature imbalance value is recorded as a qualified image feature value to be analyzed, and the analysis strategy is implemented. The image region feature enhancement processing specifically includes: Pixels in the target region whose difference processing result is greater than 0 are marked as pixels to be improved. The image feature imbalance value is input into the image node weight allocation set, the node weights in the cognitive graph corresponding to the image feature imbalance value are read, and these node weights are reallocated to the pixels to be improved.
5. The multimodal information association generation method based on cognitive graphs as described in claim 2, characterized in that, when the multimodal data includes video features, the specific process of video feature proportion imbalance quantification is as follows: The quantified value of the temporal feature distribution imbalance in the video, i.e. the video feature imbalance value, is used as the data characterizing the video feature distribution. The video feature imbalance value is represented by the ratio of the number of redundant frames to the number of keyframes within a unit time window; A redundant frame is an image frame, in a temporal sequence of consecutive image frames, for which the cosine similarity between the feature vectors of two adjacent frames is higher than a preset cosine similarity threshold. A keyframe is an image frame, in a temporal sequence of consecutive image frames, for which the cosine similarity between the feature vectors of two adjacent frames is not higher than the preset cosine similarity threshold. The video feature imbalance value is compared with the video feature imbalance threshold to obtain the judgment result; If the judgment result is that the temporal feature distribution is unbalanced, that is, the video feature imbalance value is greater than the video feature imbalance threshold, video keyframe screening processing is performed, and after the processing is completed, the analysis strategy of multimodal feature and granularity matching conditions is implemented. If the judgment result is that the temporal feature distribution is balanced, that is, the video feature imbalance value is not greater than the video feature imbalance threshold, the corresponding video feature imbalance value is marked as a qualified video feature value to be analyzed, and the analysis strategy is implemented.
The video keyframe screening processing is as follows: the video feature imbalance value is input into a preset video language model, which outputs the cosine similarity of the corresponding redundant frames; redundant frames whose cosine similarity is greater than a preset redundant frame similarity threshold are filtered out, and redundant frames whose cosine similarity is not greater than the preset redundant frame similarity threshold are retained.
6. The multimodal information association generation method based on cognitive graphs as described in claim 1, characterized in that the specific process of implementing the analysis strategy of multimodal feature and granularity matching conditions is as follows: Obtain a set of multimodal features that meet the matching conditions. This set is the data set, corresponding to the multimodal data, that includes any two or three of the following: qualified text feature values to be analyzed, qualified image feature values to be analyzed, and qualified video feature values to be analyzed. The set of multimodal features that meet the matching conditions and the corresponding set of granularity matching difference level values, which represent the granularity matching difference level, are used as input parameters of the preset matching correlation model, and the granularity matching result is output by the preset matching correlation model. If the output granularity matching result is qualified, the semantic alignment accuracy evaluation is started directly; If the output granularity matching result is unqualified, the corresponding multimodal data is input into a preset lightweight model for feature mapping, and the mapping weights are dynamically adjusted according to the feature distribution characteristics of the different modalities. The dynamic adjustment of mapping weights specifically involves: outputting the cosine similarity of the corresponding cognitive graph nodes based on the preset lightweight model and using it as input to a preset granularity matching weight allocation set; reading the corresponding node weights in the cognitive graph and reallocating them to the cognitive graph nodes; and evaluating the semantic alignment accuracy after the allocation is completed.
7. The multimodal information association generation method based on cognitive graphs as described in claim 6, characterized in that the specific process for evaluating the semantic alignment accuracy is as follows: The aligned multimodal features are used as graph nodes. The weight values corresponding to the graph nodes are monitored and compared with their corresponding weight thresholds. If the weight value corresponding to a graph node is less than the weight threshold, the graph node is marked as a redundant node, and the redundant node is removed based on the link pruning algorithm. If the weight value corresponding to a graph node is not less than the weight threshold, the co-occurrence frequency of graph nodes is monitored and compared with its corresponding co-occurrence frequency threshold. If the co-occurrence frequency of graph nodes is greater than the co-occurrence frequency threshold, the corresponding graph nodes are marked as explicit associated nodes, the associated content of the corresponding graph nodes is output based on the graph neural network model, and a qualification assessment is performed on the associated content. If the co-occurrence frequency of graph nodes is not greater than the co-occurrence frequency threshold, knowledge graph embedding is performed based on the TransE algorithm, and the cognitive graph is updated after embedding, thereby mining the implicit semantic associations between nodes; The co-occurrence frequency of graph nodes represents the proportion of samples in the multimodal data sample set in which the semantic concepts corresponding to two graph nodes appear together, relative to the total number of samples.
8. The multimodal information association generation method based on cognitive graphs as described in claim 7, characterized in that the specific process for assessing the qualification of the associated content is as follows: Based on the graph neural network model, the softmax normalization result of the associated edge weights between graph nodes is used as the transmission accuracy value, which evaluates the accuracy of information association transmission. The peak CPU utilization is used as the consumption value, which measures resource consumption. The transmission accuracy value and the consumption value are the criteria for judging whether the associated content is qualified. If the criteria are met, that is, the transmission accuracy value is greater than the predefined transmission accuracy value and the consumption value is not greater than the predefined consumption value, a prompt that the multimodal information association generation is qualified is sent; otherwise, a prompt that the multimodal information association generation is unqualified is sent.
9. The multimodal information association generation method based on cognitive graphs as described in claim 2, characterized in that, when the multimodal data includes video features, the specific process of video feature proportion imbalance quantification is as follows: In the case of local concentration of keyframes, the video time sequence is divided into equal-length windows. The distribution entropy of the number of keyframes within each window is obtained and used as the data quantifying the video feature distribution under local concentration of keyframes. Based on the obtained distribution entropy of the number of keyframes, the global distribution entropy is obtained, and the corresponding video imbalance judgment procedure is started. If the normalized global distribution entropy is greater than the corresponding distribution entropy threshold, the temporal feature distribution is judged to be balanced, the normalized global distribution entropy is marked as a qualified video feature value to be analyzed, and the analysis strategy of multimodal feature and granularity matching conditions is implemented. Otherwise, the temporal feature distribution is judged to be unbalanced, and a video feature distribution warning is pushed.
10. A system applying the multimodal information association generation method based on cognitive graphs as described in any one of claims 1-9, characterized in that it includes: a granularity matching monitoring module, a semantic alignment accuracy monitoring module, and an association enhancement monitoring module; The granularity matching monitoring module is used to analyze the granularity matching of the features corresponding to the multimodal data, and to determine whether to implement the analysis strategy of multimodal feature and granularity matching conditions based on the feedback result of the analysis. The semantic alignment accuracy monitoring module is used, if the feedback result is to implement, to evaluate the semantic alignment accuracy after the analysis strategy is implemented and, based on the evaluation result, to determine whether to conduct a qualification assessment of the associated content of the graph nodes, so as to further decide whether to send a prompt that the multimodal information association generation is qualified. The association enhancement monitoring module is used, if the feedback result is not to implement, to continue implementing the analysis strategy of multimodal feature and granularity matching conditions after measures are taken to improve the effectiveness and correlation of the multimodal data.
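For illustration only, the three imbalance values defined in claims 3 to 5 can be sketched in Python as follows. This is not the claimed implementation: the function names, the token list, the boolean target mask, and the choice to count the first frame of a window as a keyframe (it has no predecessor to compare with) are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Sequence


def text_imbalance(tokens: Sequence[str], target_word: str) -> float:
    """Claim 3: occurrences of the target high-frequency word / total words."""
    if not tokens:
        raise ValueError("text contains no words")
    return Counter(tokens)[target_word] / len(tokens)


def image_imbalance(target_mask: Sequence[Sequence[bool]]) -> float:
    """Claim 4: background pixel count / target pixel count.

    Assumption: target_mask[i][j] is True for pixels in the target region.
    """
    target = sum(1 for row in target_mask for is_target in row if is_target)
    background = sum(len(row) for row in target_mask) - target
    if target == 0:
        raise ValueError("image contains no target pixels")
    return background / target


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def video_imbalance(frame_features: Sequence[Sequence[float]],
                    similarity_threshold: float) -> float:
    """Claim 5: redundant frames / keyframes within one time window."""
    if not frame_features:
        raise ValueError("window contains no frames")
    keyframes, redundant = 1, 0  # assumption: the first frame is a keyframe
    for previous, current in zip(frame_features, frame_features[1:]):
        if cosine_similarity(previous, current) > similarity_threshold:
            redundant += 1  # too similar to its predecessor
        else:
            keyframes += 1
    return redundant / keyframes
```

Each returned value would then be compared with its corresponding imbalance threshold to decide between enhancement or screening processing and direct implementation of the analysis strategy.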