Multimodal data fusion model and modality enhancement processing method
By combining an encoding module, an evaluation module, a dynamic fusion module, and a task adjustment module, the multimodal data fusion model addresses the limited robustness and accuracy of existing multimodal data processing schemes. It enables objective evaluation of modality quality and dynamic weight adjustment, thereby enhancing the model's decision-making capabilities in complex environments.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: CHONGQING JIAOTONG UNIV
- Filing Date: 2026-03-09
- Publication Date: 2026-06-19
AI Technical Summary
Existing multimodal data processing schemes lack explicit evaluation mechanisms for the statistical properties of each modal input data and its actual task response capabilities. This results in poor robustness of data fusion when the model faces complex environments with missing or disturbed modalities, and it is prone to path dependence on a single strong modality, affecting the accuracy of the processing results.
A multimodal data fusion model is adopted, including an encoding module, an evaluation module, a dynamic fusion module, and a task adjustment module. The model generates modality strength scores by weighted combination of feature encoding, statistical attribute features, and task response features, dynamically adjusts modality weights, and introduces a global adaptation coefficient to adjust the overall weight of the final result, thereby ensuring the robustness and accuracy of multimodal data fusion.
It effectively suppresses the noise interference introduced by low-quality modalities, accurately enhances the representation ability of high-value modalities, improves the robustness and accuracy of multimodal task processing, reduces the risk of overfitting to a single modality, and enhances the model's decision-making ability in complex environments.
Figure CN122241066A_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a multimodal data fusion model and modality enhancement processing method.
Background Technology
[0002] In today's complex IoT information processing scenarios, such as multimedia content analysis, intelligent human-computer interaction, and autonomous driving decision-making, computer systems typically need to simultaneously receive and process data from multiple heterogeneous modalities, including text, images, and audio, in an attempt to achieve comprehensive perception and accurate judgment of task objectives by integrating information from different sources.
[0003] A common approach to multimodal data processing employs a Transformer-based attention mechanism for feature fusion. The workflow typically involves: first, using a specific feature extraction network to obtain feature vectors for each modality; then, directly inputting these feature vectors into a cross-attention layer or concatenation layer; and finally, during end-to-end training, the model implicitly learns the interaction relationships and weight allocations between modal features by minimizing the final task loss. The final result is then directly output as a classification or prediction based on the fused features.
[0004] However, the aforementioned existing technologies, when processing multimodal data, rely solely on backpropagation of the final task objective to update weights. They lack explicit evaluation mechanisms for the statistical properties of each modality's input data and its actual task response capabilities. Furthermore, they fail to consider the impact of the balance of contributions among modalities on the reliability of the results during the output stage. This approach makes the model highly susceptible to path dependence on a single strong modality (such as relying solely on images while ignoring text), and it cannot effectively identify and suppress noise interference introduced by low-quality modalities. Consequently, the model exhibits poor robustness in data fusion when facing complex environments with missing or interfering modalities, resulting in low accuracy of the final processing results.
Summary of the Invention
[0005] This invention provides a multimodal data fusion model and modality enhancement processing method to address the shortcomings of existing technologies and improve the robustness and accuracy of multimodal task processing.
[0006] This invention provides a multimodal data fusion processing model, including: an encoding module, an evaluation module, a dynamic fusion module, and a task adjustment module; The encoding module performs feature encoding on the input data of each modality and outputs the initial feature vector of each modality. The evaluation module calculates statistical attribute features and task response features for each modality's initial feature vector, and generates a modality strength score for each modality based on a weighted combination of the statistical attribute features and the task response features. The dynamic fusion module maps the modality strength score of each modality to the corresponding dynamic modality weight, and performs weighted fusion of the initial feature vectors of all modalities based on the dynamic modality weights of all modalities to obtain a fused feature vector. The dynamic fusion module maps the fused feature vector to an initial prediction result; The task adjustment module calculates the global adaptation coefficient based on the overall distribution of the dynamic modal weights of all modalities. The task adjustment module performs an overall weighted adjustment on the initial prediction result based on the global adaptation coefficient to obtain the final processing result.
[0007] According to the multimodal data fusion processing model provided by the present invention, the task adjustment module is specifically used to perform the following steps: calculate the proportion of each modality's dynamic modality weight in the sum of the dynamic modality weights of all modalities to obtain the actual contribution of each modality; calculate the sum of squares of the actual contributions of all modalities to obtain the concentration index; and calculate the difference between a preset constant and the concentration index and take that difference as the global adaptation coefficient, so that the global adaptation coefficient is negatively correlated with the concentration index.
[0008] According to a multimodal data fusion processing model provided by the present invention, the evaluation module is specifically used to perform the following steps: Map the initial feature vectors of all the aforementioned modalities to a unified token representation; Based on the unified token representation, calculate the mutual information between the current modality and other modalities, and use the mean of the mutual information as the first component of the statistical attribute feature; Calculate the variance of the initial feature vector of the modality in the feature dimension, and use the normalized value of the variance as the second component of the statistical attribute feature; The initial feature vector of the modality is input into a preset lightweight evaluation network, and a single-value scalar is output through linear mapping. The single-value scalar is then used as the task response feature. Based on the preset first fusion coefficient, second fusion coefficient, and third fusion coefficient, the first component of the statistical attribute feature, the second component of the statistical attribute feature, and the task response feature are weighted and summed to obtain the modality strength score.
[0009] According to a multimodal data fusion processing model provided by the present invention, the dynamic fusion module is specifically used to perform the following steps: For any modality, calculate the global mean of the initial feature vector; The global mean is multiplied by the modality strength score to obtain the evaluation input vector; The evaluation input vector is fed into a gating network to generate gating values; The global mean is processed using a basic weight network to generate basic weights, and the basic weights are multiplied by the gate value to obtain the dynamic modal weights; Multiply the dynamic modal weights by the initial feature vector to obtain a weighted initial feature vector; Based on the preset head number configuration parameters for each modality, the weight ratio of each modality during fusion is determined, and the weighted initial feature vectors of all modalities are weighted and fused according to the weight ratio to obtain the fused feature vector.
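The gating steps in paragraph [0009] can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the single-layer sigmoid gating and basic weight networks, the choice to average over the latent-sequence axis for the "global mean", the parameter tuple layout, and the rule that fusion ratios are proportional to the configured head counts are all assumptions introduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_modal_weight(features, score, gate_w, gate_b, base_w, base_b):
    """Dynamic weight of one modality.

    features: (n_latents, dim) initial feature vector of the modality.
    score:    scalar modality strength score from the evaluation module.
    gate_*/base_*: hypothetical parameters of one-layer networks (assumption).
    """
    global_mean = features.mean(axis=0)            # global mean over the sequence axis
    eval_input = global_mean * score               # evaluation input vector
    gate = sigmoid(eval_input @ gate_w + gate_b)   # gating value in (0, 1)
    base = sigmoid(global_mean @ base_w + base_b)  # basic weight in (0, 1)
    return base * gate                             # dynamic modal weight

def fuse(features_by_modality, scores, params, head_counts):
    """Weighted fusion of all modalities.

    Assumption: the fusion ratio of each modality is its head count
    divided by the total head count.
    """
    ratios = np.asarray(head_counts, dtype=float)
    ratios = ratios / ratios.sum()
    fused = np.zeros(features_by_modality[0].shape)
    weights = []
    for feats, score, p, ratio in zip(features_by_modality, scores, params, ratios):
        w = dynamic_modal_weight(feats, score, *p)
        weights.append(w)
        fused += ratio * (w * feats)               # weighted initial feature vector
    return fused, np.array(weights)
```

Because both networks end in a sigmoid here, each dynamic weight stays between 0 and 1; a different output activation would change that range.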
[0010] According to the multimodal data fusion processing model provided by the present invention, the encoding module is specifically used to perform the following steps: standardize the original input data of each modality, and generate a standardized input sequence containing modality identifiers and position information through modality-specific embedding and Fourier feature position encoding; process the standardized input sequence using a single-modal shared encoder based on the Transformer architecture, in which a fixed-dimensional latent array serves as the query vectors and the standardized input sequence serves as the key vectors and value vectors; calculate the cross-modal interaction between the latent array and the standardized input sequence using a cross-attention mechanism; input the interaction result into the self-attention layer to iteratively update the latent array; and determine the final updated latent array as the initial feature vector of that modality.
[0011] This invention provides a modal enhancement processing method, comprising the following steps: In the current task processing round, the input data of various modalities corresponding to each task are input into the task processing model, and the final processing results corresponding to each task are output by the task processing model.
[0012] The modal enhancement processing method provided by the present invention further includes: for each task, obtaining the modality strength score and the actual contribution of each modality generated by the task processing model when processing the input data of the multiple modalities; calculating the ideal contribution of each modality based on its modality strength score; calculating the contribution deviation loss based on the deviation between the actual contribution and the ideal contribution across all modalities; calculating the overall task loss based on the original loss of each task; obtaining the total loss as a weighted sum of the contribution deviation loss and the overall task loss; and training the task processing model on the total loss to obtain an optimized task processing model, which is then used for task processing in the next task processing round.
[0013] According to the modal enhancement processing method provided by the present invention, the step of calculating the ideal contribution of each modality based on the modality strength score of each modality includes: calculating the ideal contribution of each modality using a formula (not reproduced in this text) whose quantities are the ideal contribution of the m-th modality, the modality strength score of the m-th modality, and M, the total number of modalities.
[0014] According to the modal enhancement processing method provided by the present invention, the step of calculating the contribution deviation loss based on the deviation between the actual contribution and the ideal contribution of all modalities includes: calculating the contribution deviation loss using a formula (not reproduced in this text) whose quantities are the contribution deviation loss, the ideal contribution of the m-th modality, the actual contribution of the m-th modality, and M, the total number of modalities.
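The exact formulas for the ideal contribution and the contribution deviation loss are not available in this text, so the sketch below only illustrates how the training objective of paragraphs [0012] to [0014] fits together. Every functional form in it is an assumption: the ideal contribution is taken to be the strength score divided by the sum of all scores (which requires positive scores), the deviation loss is taken to be the mean squared deviation over the M modalities, the overall task loss is taken to be the mean of the per-task losses, and the weighting factor `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def ideal_contribution(scores):
    # ASSUMPTION: normalise positive strength scores so they sum to 1.
    s = np.asarray(scores, dtype=float)
    return s / s.sum()

def contribution_deviation_loss(actual, ideal):
    # ASSUMPTION: mean squared deviation over the M modalities.
    a = np.asarray(actual, dtype=float)
    i = np.asarray(ideal, dtype=float)
    return float(np.mean((a - i) ** 2))

def total_loss(task_losses, actual, scores, lam=0.1):
    # ASSUMPTION: the overall task loss is the mean of the per-task losses,
    # and lam weights the deviation term in the weighted sum.
    overall_task = float(np.mean(task_losses))
    deviation = contribution_deviation_loss(actual, ideal_contribution(scores))
    return overall_task + lam * deviation
```

Under these assumptions the deviation term vanishes when each modality's actual contribution already matches its share of the total strength score, so only the task loss drives training in that case.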
[0015] According to a modal enhancement processing method provided by the present invention, the input data of the multiple modalities includes text input data, image input data, audio input data, and video input data.
[0016] In summary, one or more technical solutions provided in the embodiments of this application have at least the following technical effects or advantages: The encoding module encodes the input data of each modality separately and outputs initial feature vectors, thereby eliminating dimensional differences between heterogeneous data and establishing a unified latent representation space for subsequent feature interactions. The evaluation module calculates statistical attribute features and task response features separately and generates modality strength scores from their weighted combination, combining the statistical distribution of the data itself with the specific requirements of the task objective to quantify the information value of each modality objectively and along multiple dimensions. The dynamic fusion module maps the modality strength scores to dynamic modality weights and completes the weighted fusion of the initial feature vectors, adaptively adjusting the feature fusion ratio according to the actual contribution of each modality, which effectively suppresses the noise interference introduced by low-quality modalities and precisely enhances the representation ability of high-value modalities. The dynamic fusion module also maps the fused feature vector to the initial prediction result, yielding a preliminary task decision based on the weighted feature representation. The task adjustment module calculates the global adaptation coefficient from the overall distribution of the dynamic modality weights and applies an overall weighted adjustment to the initial prediction result, constructing an output confidence calibration mechanism based on the degree of modal collaboration: when a single modality dominates, the output weight is automatically suppressed to reduce the risk of the model overfitting to a specific modality; when the modalities are balanced and collaborating, the output weight is enhanced to maximize the fusion gain. This ultimately ensures the decision robustness and accuracy of the multimodal data fusion model in complex task environments.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0018] Figure 1 is a schematic diagram of the overall architecture of the multimodal data fusion model provided by the present invention.
[0019] Figure 2 is an overall framework diagram of the multimodal data fusion model and modal enhancement processing method provided by the present invention.
[0020] Figure 3 is the first flowchart of the modal enhancement processing method provided by the present invention.
[0021] Figure 4 is the second schematic diagram of the modal enhancement processing method provided by the present invention.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0023] It should be noted that in the description of this invention, the terms "comprising," "including," or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element. The terms "upper," "lower," etc., indicating orientation or positional relationships according to the accompanying drawings, are only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the system or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the invention. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0024] The terms "first," "second," etc., used in this invention are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects; for example, a first object can be one or more. Furthermore, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0025] The multimodal data fusion model and modal enhancement processing method provided by the present invention are described below with reference to Figures 1 to 4.
[0026] Referring to Figure 1, which is a schematic diagram of the overall architecture of the multimodal data fusion model provided by the present invention, this embodiment provides a multimodal data fusion model that can be applied to artificial intelligence processing scenarios such as multimodal task analysis, multimodal content understanding, and audiovisual speech recognition. In this embodiment, a multimodal analysis task is used as an example for illustration. The multimodal data fusion model includes an encoding module, an evaluation module, a dynamic fusion module, and a task adjustment module.
[0027] The encoding module performs feature encoding on the input data of each modality and outputs the initial feature vector of each modality. The evaluation module calculates statistical attribute features and task response features for each modality's initial feature vector, and generates a modality strength score for each modality based on a weighted combination of the statistical attribute features and the task response features. The dynamic fusion module maps the modality strength score of each modality to the corresponding dynamic modality weight, and performs weighted fusion of the initial feature vectors of all modalities based on the dynamic modality weights of all modalities to obtain a fused feature vector. The dynamic fusion module maps the fused feature vector to an initial prediction result; The task adjustment module calculates the global adaptation coefficient based on the overall distribution of the dynamic modal weights of all modalities. The task adjustment module performs an overall weighted adjustment on the initial prediction result based on the global adaptation coefficient to obtain the final processing result.
[0028] The encoding module serves as the input processing interface for the multimodal data fusion model. It receives and processes input data from each modality. In this embodiment, the input data includes, but is not limited to, text input data, image input data, and audio input data. The encoding module performs independent encoding operations on the text input data, image input data, and audio input data respectively. It extracts features from the input data of each modality and maps the extracted features to a unified dimensional space, outputting initial feature vectors for each modality. The encoding module ensures that the output initial feature vectors for each modality have a consistent dimension to facilitate processing by subsequent modules. For example, the encoding module converts text input data into text initial feature vectors and image input data into image initial feature vectors.
[0029] The evaluation module is connected to the encoding module. The evaluation module receives the initial feature vector for each modality output by the encoding module. The main function of the evaluation module is to quantify the effectiveness of each modality for the current task.
[0030] The evaluation module performs two types of feature calculations in parallel for each modality's initial feature vector: statistical attribute feature calculation and task response feature calculation.
[0031] Specifically, the evaluation module calculates statistical attribute features to reflect the intrinsic quality of the modal data. These statistical attribute features include mutual information metrics, which measure the redundancy of information between modalities, and variance metrics, which measure the richness of information within modalities. Simultaneously, the evaluation module calculates task response features to reflect the degree of matching between the modal data and the current task. The evaluation module uses a pre-built lightweight neural network to process the initial feature vectors to obtain numerical values representing the task response features.
[0032] After calculating the statistical attribute features and task response features, the evaluation module performs a weighted combination operation. Based on preset weight parameters, the evaluation module linearly weights and sums the statistical attribute features and task response features to generate a modality strength score for each modality. The modality strength score is a scalar value; a higher score indicates higher quality information contained in the current data sample and greater importance to the task.
[0033] The dynamic fusion module is connected to the evaluation and encoding modules. The dynamic fusion module uses the modality strength scores generated by the evaluation module to guide the feature fusion process.
[0034] The dynamic fusion module first inputs the modality strength score of each modality into an internal gating network or mapping function, mapping the modality strength score to the corresponding dynamic modality weight. The dynamic modality weight reflects the degree of attention that should be given to the corresponding modality during the feature fusion stage. Subsequently, the dynamic fusion module performs weighted fusion of the initial feature vectors of all modalities based on the dynamic modality weights of all modalities. The module multiplies the dynamic modality weight of each modality with the corresponding initial feature vector and accumulates or concatenates the products to obtain the fused feature vector. The fused feature vector integrates the key information of all modalities and highlights the features of important modalities based on the dynamic modality weights. The dynamic fusion module further inputs the fused feature vector into a task classifier or regressor. The task classifier maps the fused feature vector to the initial prediction result. In multimodal analysis scenarios, the initial prediction result is a probability distribution vector representing the category.
[0035] The task adjustment module is connected to the dynamic fusion module. The task adjustment module introduces a weight-based global adjustment mechanism to further improve the reliability of the model output.
[0036] The task adjustment module obtains the overall distribution of dynamic modal weights for all modalities. By analyzing these dynamic modal weights, the module determines whether the contributions of each modality to the task are balanced. Based on the overall distribution of the dynamic modal weights, the module calculates the global adaptation coefficient. The global adaptation coefficient is a scalar used to measure the reliability of the current fusion result.
[0037] When the distribution of dynamic modal weights shows that some modalities contribute very heavily while others contribute very little, the global adaptation coefficient calculated by the task adjustment module tends toward a smaller value, suppressing the potential risk of overfitting to a single modality.
[0038] When the distribution of dynamic modal weights is relatively balanced, the global adaptation coefficient calculated by the task adjustment module tends to be larger to encourage multimodal collaboration. Finally, the task adjustment module performs an overall weighted adjustment of the initial prediction results based on the global adaptation coefficient. The task adjustment module multiplies the initial prediction results with the global adaptation coefficient to obtain the final processed result. The final processed result is the final analysis result output by the multimodal data fusion model for the current input data.
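The adjustment described in paragraphs [0036] to [0038], together with the contribution share, sum-of-squares concentration index, and preset-constant difference defined in paragraph [0007], can be sketched as below. The value 1.0 for the preset constant and the function names are assumptions made for illustration; the source only states that a preset constant is used.

```python
import numpy as np

def global_adaptation_coefficient(dynamic_weights, preset_constant=1.0):
    """Preset constant minus the concentration index of the modality contributions.

    preset_constant=1.0 is an assumed value, not taken from the source.
    """
    w = np.asarray(dynamic_weights, dtype=float)
    contributions = w / w.sum()                        # actual contribution of each modality
    concentration = float(np.sum(contributions ** 2))  # sum of squared contributions
    return preset_constant - concentration

def adjust_prediction(initial_prediction, dynamic_weights, preset_constant=1.0):
    """Scale the initial prediction by the global adaptation coefficient."""
    coeff = global_adaptation_coefficient(dynamic_weights, preset_constant)
    return coeff * np.asarray(initial_prediction, dtype=float)
```

With a preset constant of 1, four equally weighted modalities give a concentration index of 4 × 0.25² = 0.25 and a coefficient of 0.75, whereas a single modality holding all of the weight gives a concentration index of 1 and a coefficient of 0, which matches the negative correlation the text describes.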
[0039] The multimodal data fusion model provided in this embodiment achieves adaptive fusion and dynamic adjustment of multimodal data through the collaborative work of an encoding module, an evaluation module, a dynamic fusion module, and a task adjustment module. The evaluation module combines statistical attribute features and task response features to generate modality strength scores, avoiding the modality inertia problem caused by relying solely on the network's self-attention mechanism and enabling a more objective evaluation of modality quality. The task adjustment module introduces a global adaptation coefficient, dynamically adjusting the weight of the final output based on the distribution of modality weights. When modality contributions are unbalanced, it automatically reduces the output confidence, forcing the model to mine complementary information across modalities. This addresses the problem in existing technologies where models tend to over-rely on a single strong modality while ignoring information from weak modalities, improving the robustness and accuracy of multimodal models in complex scenarios.
[0040] In this embodiment, the encoding module employs a unified, standardized processing flow and a shared encoding mechanism based on the Transformer architecture to achieve efficient feature extraction from heterogeneous modal data. Specifically, the encoding module performs the following steps to generate initial feature vectors for each modality.
[0041] The encoding module is specifically used to perform the following steps: standardize the original input data of each modality, and generate a standardized input sequence containing modality identifiers and position information through modality-specific embedding and Fourier feature position encoding; process the standardized input sequence using a single-modal shared encoder based on the Transformer architecture, in which a fixed-dimensional latent array serves as the query vectors and the standardized input sequence serves as the key vectors and value vectors; calculate the cross-modal interaction between the latent array and the standardized input sequence using a cross-attention mechanism; input the interaction result into the self-attention layer to iteratively update the latent array; and determine the final updated latent array as the initial feature vector of that modality.
[0042] Specifically, the encoding module first standardizes the raw input data for each modality. The encoding module treats different forms of data, such as text input data, image input data, and audio input data, as sequential data. Similarly, for tabular data, set data, or graph data, the encoding module treats each element as an element in a sequence. The encoding module then converts the raw input data into standardized input data $X_m$ of dimension $t_m \times c_m$, where $t_m$ is the input sequence length specific to the modality and task, and $c_m$ is the input dimension specific to the modality and task.
[0043] Building upon the standardization process, the encoding module further generates a standardized input sequence containing modality identifiers and position information through modality-specific embeddings and Fourier feature position encoding. For each distinct modality $m$, the encoding module defines a one-hot modality embedding vector $e_m \in \mathbb{R}^{M}$, where $M$ is the total number of different modalities processed by the multimodal data fusion model. Simultaneously, the encoding module incorporates a Fourier feature position encoding $p_m \in \mathbb{R}^{t_m \times d_{p_m}}$, where $t_m$ is the input sequence length of modality $m$ and $d_{p_m}$ is the position encoding dimension. Fourier feature position encoding is used to capture positional information within each modality. Furthermore, the encoding module uses a common position encoding to capture temporal dimension information shared between different modalities. The encoding module fuses the standardized input data, the modality embedding vector, the Fourier feature position encoding, and the padding encoding to obtain the processed standardized input sequence. The fusion is calculated as $\hat{X}_m = X_m \oplus e_m \oplus p_m \oplus z_m$, where $X_m$ is the standardized input data, $\oplus$ denotes a concatenation (splicing) or addition operation, and $z_m$ is the padding encoding. The padding encoding is an all-zero vector used to mark invalid positions in the sequence. After this processing, the standardized input sequence has the standard dimension $d_{all}$, calculated as $d_{all} = \max_{m}\,(c_m + d_{p_m} + M)$, where $c_m$ is the channel size of modality $m$. This approach ensures that the input data of all modalities are mapped to a unified feature space, laying the foundation for subsequent shared encoding.
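The standardization step of paragraph [0043] can be sketched as follows, assuming the fusion operation is concatenation along the channel axis followed by zero padding up to a common width. The sine/cosine form of the Fourier features, the frequency schedule, the concatenation order, and all function names are illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def fourier_position_encoding(seq_len, num_bands):
    """Sine/cosine features of positions in [-1, 1]; returns (seq_len, 2 * num_bands).

    The frequency schedule is an assumption made for illustration.
    """
    pos = np.linspace(-1.0, 1.0, seq_len)[:, None]
    freqs = np.linspace(1.0, max(seq_len / 2.0, 1.0), num_bands)[None, :]
    angles = np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def standard_dimension(channel_sizes, num_bands, num_modalities):
    """Largest concatenated width over all modalities (assumed definition)."""
    return max(c + 2 * num_bands + num_modalities for c in channel_sizes)

def standardize_input(x, modality_index, num_modalities, num_bands, d_all):
    """Concatenate data, one-hot modality id, and position encoding, then zero-pad to d_all."""
    seq_len, _ = x.shape
    one_hot = np.zeros((seq_len, num_modalities))
    one_hot[:, modality_index] = 1.0                    # modality identifier
    pe = fourier_position_encoding(seq_len, num_bands)  # position information
    seq = np.concatenate([x, one_hot, pe], axis=1)
    if seq.shape[1] > d_all:
        raise ValueError("d_all is smaller than the concatenated width")
    padding = np.zeros((seq_len, d_all - seq.shape[1]))  # all-zero padding encoding
    return np.concatenate([seq, padding], axis=1)
```

For example, a text sequence with 3 channels and an image sequence with 6 channels, with 2 modalities and 2 frequency bands, both come out with 12 columns, so a single shared encoder can consume either one.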
[0044] Subsequently, the encoding module processes the standardized input sequence using a unimodal shared encoder based on the Transformer architecture. The unimodal shared encoder follows the Perceiver design, whose core is a hierarchical cross-attention and self-attention mechanism. The unimodal shared encoder presets a fixed-dimensional latent array as the query vectors and uses the standardized input sequence as the key and value vectors. The unimodal shared encoder recursively trains a latent array $Z \in \mathbb{R}^{d_{LN} \times d_{LS}}$, where $d_{LN}$ is the sequence length of the latent vectors and $d_{LS}$ is the latent dimension. Before training begins, the latent array is randomly initialized to $Z_0$.
[0045] The encoding module computes the cross-modal interaction between the latent array and the standardized input sequence using a cross-attention mechanism. At each layer $L$, the unimodal shared encoder uses the latent array from the previous layer, $Z_{L-1}$, as the query and the processed standardized input sequence $\hat{X}_m$ as the keys and values to calculate the intermediate feature representation $\tilde{Z}_L$. Cross-attention is calculated as $\tilde{Z}_L = \mathrm{softmax}\left(\frac{Q_c K_c^{\top}}{\sqrt{d_{LS}}}\right) V_c$, with $Q_c = Z_{L-1} W_{Q_c}$, $K_c = \hat{X}_m W_{K_c}$, and $V_c = \hat{X}_m W_{V_c}$, where $W_{Q_c}$, $W_{K_c}$, and $W_{V_c}$ are trainable cross-attention parameters. This step compresses and maps high-dimensional input information into a fixed-dimensional latent array.
[0046] Next, the unimodal shared encoder inputs the interaction results into the self-attention layer to iteratively update the latent array. The unimodal shared encoder performs self-attention on the intermediate feature representation $\tilde{Z}_L$ to obtain the representation $Z_L$ used as input to the next layer. Self-attention is calculated as $Z_L = \mathrm{softmax}\left(\frac{Q_s K_s^{\top}}{\sqrt{d_{LS}}}\right) V_s$, with $Q_s = \tilde{Z}_L W_{Q_s}$, $K_s = \tilde{Z}_L W_{K_s}$, and $V_s = \tilde{Z}_L W_{V_s}$, where $W_{Q_s}$, $W_{K_s}$, and $W_{V_s}$ are the self-attention parameters.
[0047] The encoding module repeatedly performs the cross-attention and self-attention steps described above. After multiple iterations, the updated latent array is finally obtained. The encoding module determines the final updated latent array as the initial feature vector for this modality.
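The alternating cross-attention and self-attention loop can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: single-head scaled dot-product attention, parameters shared across layers, and no residual connections, layer normalization, or feed-forward sublayers, none of which the text specifies. The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_src, kv_src, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (assumed form)."""
    q = query_src @ w_q
    k = kv_src @ w_k
    v = kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def encode(x, z0, cross_params, self_params, num_layers):
    """Iteratively update the latent array z from the standardized input sequence x.

    x:  (seq_len, d_all) standardized input sequence (keys and values).
    z0: (n_latents, d_latent) initial latent array (queries).
    """
    z = z0
    for _ in range(num_layers):
        z_tilde = attention(z, x, *cross_params)       # latents attend to the input
        z = attention(z_tilde, z_tilde, *self_params)  # latents attend to each other
    return z                                           # initial feature vector of the modality
```

The output shape depends only on the latent array, so a 10-step and a 50-step input both yield a feature array of the same size; the cross-attention cost grows with the input length, while the self-attention cost depends only on the number of latent vectors.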
[0048] The technical advantage of this embodiment lies in the fact that the encoding module, by introducing modality-specific embeddings and Fourier positional encoding, achieves a unified and standardized representation of heterogeneous modal data, eliminating the differences in data structure between different modalities. Specifically, a single-modal shared encoder based on the Perceiver architecture is employed, utilizing a fixed-dimensional latent array as the query vector to decouple computational complexity from the dimensionality of the input data. Regardless of whether the input data is a high-resolution image or a long text sequence, the computational cost of the encoding process primarily depends on the size of the latent array, thereby significantly reducing the computational cost of the multimodal data fusion model and improving the model's efficiency and scalability in processing large-scale multimodal data.
[0049] In this embodiment, the evaluation module generates modality strength scores through a series of calculation steps that quantify the effectiveness of each modality. The evaluation module specifically performs the following steps: mapping the initial feature vectors of all modalities to a unified token representation; calculating, based on the unified token representation, the mutual information between the current modality and the other modalities, and using the mean of the mutual information as the first component of the statistical attribute feature; calculating the variance of the initial feature vector of the modality along the feature dimension, and using the normalized value of the variance as the second component of the statistical attribute feature; inputting the initial feature vector of the modality into a preset lightweight evaluation network, outputting a single-value scalar through linear mapping, and using the single-value scalar as the task response feature; and performing a weighted summation of the first component of the statistical attribute feature, the second component of the statistical attribute feature, and the task response feature, based on the preset first, second, and third fusion coefficients, to obtain the modality strength score.
[0050] Specifically, the evaluation module first feeds the initial feature vectors of all modalities into a multi-head attention network and maps them to a unified token representation. For each modality $m$, the evaluation module uses a linear transformation layer to compress the initial feature vector $Z_m$ of that modality into a single token representation with a unified dimension. The mapping is as follows: $t_m = \mathrm{Linear}(Z_m)$; where $t_m$ is the unified token of modality $m$. This step compresses complex multidimensional features into a compact representation, facilitating subsequent cross-modal similarity calculations.
[0051] Subsequently, the evaluation module calculates the mutual information between the current modality and the other modalities based on the unified token representation, and uses the mean of the mutual information as the first component of the statistical attribute feature. The evaluation module uses mutual information (MI) as an indicator of the cross-modal correlation between two modalities. For any two modalities $i$ and $j$, the evaluation module calculates the mutual information $I(t_i; t_j)$ between their unified tokens. The calculation draws on the joint probability distribution of the two tokens, the product of their marginal probability distributions, and a function used to evaluate similarity. Based on the calculated pairwise mutual information, the evaluation module defines the cross-modal correlation $\rho_{ij}$ of modalities $i$ and $j$. Next, the evaluation module calculates the mean of the cross-modal correlation between the current modality $m$ and all other modalities, and standardizes this mean to obtain the first component of the statistical attribute feature: $s_m^{(1)} = \mathrm{Norm}\big(\mathrm{Mean}_{j \neq m}(\rho_{mj})\big)$; where $\mathrm{Mean}(\cdot)$ denotes the operation of calculating the mean and $\mathrm{Norm}(\cdot)$ denotes the standardization operation. This component reflects the degree of information overlap between the current modality and the other modalities.
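The estimator used by the evaluation module is not reproduced here, so the following only illustrates the quantity being estimated. It is a sketch of the textbook definition of mutual information for a small discrete joint distribution, which compares the joint distribution with the product of its marginals. The function name and the example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over x,y of p(x,y) * log(p(x,y) / (p(x) * p(y))), in nats."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0.0:  # zero-probability cells contribute nothing
                total += p_xy * math.log(p_xy / (row_marginal[i] * col_marginal[j]))
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]  # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]        # one variable determines the other

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

Independent variables give zero mutual information, while the fully dependent table gives log 2 nats, which matches the reading of the first component as a measure of information overlap between modalities.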
[0052] Simultaneously, the evaluation module calculates the variance of the initial feature vector of the modality along the feature dimension, and uses the normalized value of the variance as the second component of the statistical attribute feature. The evaluation module calculates the variance $\sigma_m^2$ of the initial feature vector $Z_m$ of modality $m$ and uses it as a feature discrimination index. A larger variance generally means that the feature carries more information. The evaluation module standardizes the calculated variance to obtain the second component of the statistical attribute feature: $s_m^{(2)} = \mathrm{Norm}(\sigma_m^2)$; where $s_m^{(2)}$ is the variance characteristic of modality $m$, a statistical measure formed by standardization relative to the other modalities.
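A minimal sketch of the variance component follows. The text does not specify the standardization operation, so min-max scaling across modalities is assumed here purely for illustration; the function name and the example features are hypothetical.

```python
import statistics

def variance_component(features_by_modality):
    """Per-modality feature variance, min-max scaled across modalities (assumed)."""
    variances = {name: statistics.pvariance(values)
                 for name, values in features_by_modality.items()}
    low, high = min(variances.values()), max(variances.values())
    if high == low:
        # All modalities are equally spread; no modality stands out.
        return {name: 0.0 for name in variances}
    return {name: (var - low) / (high - low) for name, var in variances.items()}

example = {
    "audio": [1.0, 1.0, 1.0, 1.0],  # constant feature, variance 0
    "video": [0.0, 2.0, 0.0, 2.0],  # variance 1
    "text": [0.0, 4.0, 0.0, 4.0],   # variance 4
}
second_component = variance_component(example)
```

Under this assumption the modality with the most spread-out features receives 1 and the least informative one receives 0.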
[0053] Furthermore, the evaluation module inputs the initial feature vector of the modality into a preset lightweight evaluation network, outputs a single-value scalar through linear mapping, and uses this scalar as the task response feature. The evaluation module uses a lightweight neural network, ContribNet, to evaluate the modality's direct contribution to the task. The evaluation module first obtains the total attention output $A$ through a multi-head attention mechanism and decomposes it into the attention head output $A_m$ of each modality $m$.
[0054] Then, the evaluation module calculates the contribution $q_m$ of modality $m$ in the network: $q_m = \mathrm{ContribNet}(A_m)$; where $A_m$ is the attention head output of modality $m$ and $q_m$ is a single-value contribution score, with higher scores indicating greater importance to the task. The evaluation module averages and standardizes these contribution scores to obtain the task response feature $s_m^{(3)}$: $s_m^{(3)} = \mathrm{Norm}\big(\mathrm{Mean}(q_m)\big)$. Finally, the evaluation module performs a weighted summation of the first component of the statistical attribute feature, the second component of the statistical attribute feature, and the task response feature, based on the preset first, second, and third fusion coefficients, to obtain the modality strength score. The modality strength score $S_m$ of modality $m$ is calculated as follows: $S_m = \alpha\, s_m^{(1)} + \beta\, s_m^{(2)} + \gamma\, s_m^{(3)}$; where $s_m^{(1)}$ and $s_m^{(2)}$ are the first and second components of the statistical attribute feature, $\alpha$ is the first fusion coefficient, $\beta$ is the second fusion coefficient, and $\gamma$ is the third fusion coefficient. These coefficients balance the importance of the different evaluation dimensions.
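The final weighted summation can be sketched as below. The fusion coefficient values and the per-modality component values are hypothetical placeholders chosen only to show the arithmetic; the embodiment does not state them.

```python
def modality_strength(mi_component, variance_component, task_response,
                      alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted sum of the three evaluation components for one modality.

    alpha, beta, gamma stand for the first, second, and third fusion
    coefficients; the defaults are illustrative, not values from the text.
    """
    return alpha * mi_component + beta * variance_component + gamma * task_response

# (first component, second component, task response feature) per modality
components = {
    "video": (0.9, 0.5, 1.0),
    "audio": (0.4, 0.2, 0.3),
    "text": (0.1, 0.0, 0.0),
}
strength = {name: modality_strength(*parts) for name, parts in components.items()}
```

With these placeholder inputs the video modality scores highest because it leads on all three components, while the text modality is scored almost entirely by its small mutual-information term.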
[0055] The technical advantage of this embodiment lies in the fact that the evaluation module constructs a three-dimensional modality evaluation system. By integrating mutual information (reflecting cross-modal relevance), variance (reflecting feature information content), and lightweight network output (reflecting task relevance), the evaluation module can comprehensively and objectively quantify the strength of each modality from two dimensions: statistical attributes and neural network learning attributes. This comprehensive evaluation mechanism effectively avoids the bias caused by relying on a single indicator, ensuring that the modality strength score can truly reflect the actual value of the modality in the current task, and providing an accurate basis for subsequent dynamic weight allocation.
[0056] In this embodiment, the dynamic fusion module dynamically weights and fuses features through a gating mechanism, based on the modality strength scores output by the evaluation module. The dynamic fusion module specifically performs the following steps: for any modality, calculating the global mean of the initial feature vector; multiplying the global mean by the modality strength score to obtain the evaluation input vector; inputting the evaluation input vector into a gating network to generate gating values; processing the global mean with a basic weight network to generate basic weights, and multiplying the basic weights by the gating values to obtain the dynamic modal weights; multiplying the dynamic modal weights by the initial feature vector to obtain a weighted initial feature vector; and determining the weight ratio of each modality during fusion based on the preset head count configuration parameters for each modality, and performing weighted fusion of the weighted initial feature vectors of all modalities according to the weight ratio to obtain the fused feature vector.
[0057] First, for any modality, the dynamic fusion module calculates the global mean of the initial feature vector. The dynamic fusion module takes the initial feature vector $Z_m$ of modality $m$ and averages it across the feature dimensions to obtain the modal feature mean $\bar{z}_m$. This mean vector summarizes the overall feature distribution of the modality.
[0058] Subsequently, the dynamic fusion module multiplies the global mean by the modality strength score to obtain the evaluation input vector. That is, the feature mean $\bar{z}_m$ is combined with the modality strength score $S_m$ generated by the evaluation module to form a new vector $e_m$ that serves as the input to the subsequent networks.
[0059] Next, the dynamic fusion module inputs the evaluation input vector into a gating network to generate gating values. The gating network GateNet processes the evaluation input vector $e_m$, and the Sigmoid activation function limits the output to between 0 and 1 to generate the gating value $g_m$: $g_m = \sigma\big(\mathrm{GateNet}(e_m)\big)$; where $\sigma$ is the Sigmoid activation function. This gating value acts as a modulating factor that combines feature content and strength scores.
[0060] Simultaneously, the dynamic fusion module uses a basic weight network to process the global mean to generate basic weights, and multiplies these basic weights by the gating value to obtain the dynamic modal weights. A separate basic weight network, BaseWeightNet, processes the feature mean $\bar{z}_m$ and generates the basic weights through the Sigmoid activation function. The dynamic fusion module then multiplies the basic weights element-wise by the gating value $g_m$ generated in the previous step to obtain the final dynamic modal weights: $w_m = \sigma\big(\mathrm{BaseWeightNet}(\bar{z}_m)\big) \odot g_m$; where $w_m$ is the dynamic modal weight of modality $m$, $\sigma$ is the Sigmoid activation function, and $\odot$ denotes element-wise multiplication.
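The gating and basic-weight steps can be sketched as follows. This is a simplified illustration: each network is reduced to a single linear layer with a scalar output, whereas the embodiment describes element-wise weights, and the evaluation input vector is formed by scaling the mean vector with the strength score. All parameter values are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(vector, weights, bias):
    """Single linear unit standing in for a small network (assumption)."""
    return sum(v * w for v, w in zip(vector, weights)) + bias

def dynamic_weight(mean_vector, strength, gate_params, base_params):
    """Dynamic modal weight = basic weight * gating value."""
    eval_input = [strength * v for v in mean_vector]   # evaluation input vector
    gate = sigmoid(linear(eval_input, *gate_params))   # gating value in (0, 1)
    base = sigmoid(linear(mean_vector, *base_params))  # basic weight in (0, 1)
    return base * gate

mean_vector = [0.5, 1.0]
gate_params = ([1.0, 1.0], 0.0)   # hypothetical GateNet weights and bias
base_params = ([0.2, -0.1], 0.0)  # hypothetical BaseWeightNet weights and bias

weak = dynamic_weight(mean_vector, 0.1, gate_params, base_params)
strong = dynamic_weight(mean_vector, 2.0, gate_params, base_params)
```

With the same feature mean, the modality with the higher strength score receives the larger dynamic weight, because the strength score enters only through the gate.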
[0061] Subsequently, the dynamic fusion module multiplies the dynamic modal weights by the initial feature vector to obtain a weighted initial feature vector. The dynamic fusion module uses the calculated dynamic modal weight $w_m$ to reweight the initial feature vector $Z_m$, yielding the weighted modal feature $\tilde{Z}_m$: $\tilde{Z}_m = w_m \odot Z_m$. This step enhances the features of the modality.
[0062] Finally, the dynamic fusion module determines the weight ratio of each modality during fusion based on the preset head count configuration parameters for each modality, and performs weighted fusion of the weighted initial feature vectors of all modalities according to the weight ratio to obtain the fused feature vector. The dynamic fusion module performs feature fusion based on a multi-head attention mechanism. From the preset head count configuration parameters, it calculates the weight ratio $r_m$ of each modality in the final fusion. It then performs a weighted sum of the weighted initial feature vectors $\tilde{Z}_m$ of all modalities according to this ratio to obtain the final fused feature vector $F$: $F = \sum_{m=1}^{M} r_m \tilde{Z}_m$; where $M$ is the total number of modalities. The fused feature vector $F$ integrates key information from all modalities after dynamic filtering.
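The fusion step can be sketched as a ratio-weighted sum. How the weight ratio is derived from the head count configuration is not spelled out, so this sketch assumes the simplest reading, namely each modality's share of the total number of attention heads. That rule, the function name, and the example values are all assumptions.

```python
def fuse(weighted_features, head_counts):
    """Sum the weighted feature vectors, each scaled by its share of attention heads."""
    total_heads = sum(head_counts.values())
    dim = len(next(iter(weighted_features.values())))
    fused = [0.0] * dim
    for name, vector in weighted_features.items():
        ratio = head_counts[name] / total_heads  # assumed weight ratio
        for i, value in enumerate(vector):
            fused[i] += ratio * value
    return fused

weighted_features = {"video": [1.0, 0.0], "audio": [0.0, 1.0]}
head_counts = {"video": 3, "audio": 1}  # hypothetical head allocation
fused = fuse(weighted_features, head_counts)
```

Here the video modality holds three of four heads, so its features account for three quarters of the fused vector.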
[0063] The technical advantage of this embodiment lies in the fact that the dynamic fusion module perceives modal features through a multi-head attention mechanism and introduces a gated network and a dual weight generation mechanism (GateNet function and BaseWeightNet function), achieving non-linear dynamic weight adjustment. This module not only considers the content of the features themselves but also forcibly incorporates the prior strength scores given by the evaluation module, making the weight generation both data-adaptive and controlled by statistical evaluation results. This mechanism ensures that the model can flexibly adjust the contribution of different modalities according to task requirements, improving the effectiveness and relevance of feature fusion.
[0064] In this embodiment, the task adjustment module employs a global adjustment strategy based on contribution distribution to dynamically balance the credibility of the multimodal fusion results. The task adjustment module specifically performs the following steps: calculating the proportion of the dynamic modal weight of each modality in the sum of the dynamic modal weights of all modalities to obtain the actual contribution of each modality; calculating the sum of squares of the actual contributions of all modalities to obtain the concentration index; and calculating the difference between the preset constant and the concentration index and determining the difference as the global adaptation coefficient, wherein the global adaptation coefficient is negatively correlated with the concentration index.
[0065] First, the task adjustment module calculates the proportion of the dynamic modal weight of each modality in the sum of the dynamic modal weights of all modalities, thus obtaining the actual contribution of each modality. The task adjustment module obtains the dynamic modal weights of all modalities generated by the dynamic fusion module. It divides the dynamic modal weight $w_m$ of each modality by the sum of the dynamic modal weights of all modalities to obtain the normalized value, which is the actual contribution $c_m$ of that modality to the current task: $c_m = \frac{w_m}{\sum_{j=1}^{M} w_j}$; where $M$ is the total number of modalities. The actual contribution $c_m$ ranges from 0 to 1, and the actual contributions of all modalities sum to 1. This indicator quantifies the supporting role of each modality in the current fusion task.
[0066] Subsequently, the task adjustment module calculates the sum of squares of the actual contributions of all modalities to obtain the concentration index. The task adjustment module squares the actual contribution $c_m$ of each modality and sums the squares over all $M$ modalities to obtain a scalar value, which it defines as the concentration index: $C = \sum_{m=1}^{M} c_m^2$.
[0067] This concentration index reflects the distribution of modal contributions. When the contribution of a certain mode is extremely high (close to 1) while the contributions of other modes are extremely low, the index approaches 1; when the contributions of all modes are relatively balanced, the index value is small.
[0068] Finally, the task adjustment module calculates the difference between the preset constant and the concentration index, and determines the difference as the global adaptation coefficient, wherein the global adaptation coefficient is negatively correlated with the concentration index. In this embodiment, the preset constant is set to 1. The task adjustment module subtracts the concentration index $C$ calculated above from 1 to obtain the global adaptation coefficient $\eta$, i.e.: $\eta = 1 - C$.
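[0069] Next, the initial prediction result $Y$ is determined: $Y = f_{\mathrm{cls}}\big(\mathrm{Norm}(\bar{F})\big)$; where $\bar{F}$ is the global mean of the fused features, $f_{\mathrm{cls}}$ is the task classifier, and $\mathrm{Norm}(\bar{F})$ denotes standardizing $\bar{F}$.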
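[0070] Finally, the task adjustment module performs an overall weighted adjustment on the initial prediction result based on the global adaptation coefficient to obtain the final processing result $\hat{Y}$, that is: $\hat{Y} = \eta \cdot Y$; where $\eta$ is the global adaptation coefficient and $Y$ is the initial prediction result.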
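The task adjustment steps (actual contribution, concentration index, global adaptation coefficient, and final scaling) can be sketched together as follows. Scalar dynamic weights per modality are assumed for simplicity, and the function names and example values are hypothetical.

```python
def actual_contributions(dynamic_weights):
    """Each modality's share of the total dynamic weight."""
    total = sum(dynamic_weights)
    return [w / total for w in dynamic_weights]

def global_adaptation(dynamic_weights, preset_constant=1.0):
    """Preset constant minus the sum of squared contributions."""
    contributions = actual_contributions(dynamic_weights)
    concentration = sum(c * c for c in contributions)
    return preset_constant - concentration

def adjust(initial_prediction, dynamic_weights):
    """Scale the whole initial prediction by the global adaptation coefficient."""
    coefficient = global_adaptation(dynamic_weights)
    return [coefficient * p for p in initial_prediction]

balanced = global_adaptation([1.0, 1.0, 1.0, 1.0])       # four equal modalities
dominated = global_adaptation([0.97, 0.01, 0.01, 0.01])  # one modality dominates
final = adjust([0.2, 0.8], [1.0, 1.0, 1.0, 1.0])
```

With four equally weighted modalities the coefficient is 0.75, while a single dominant modality drives it down to about 0.06, so the same initial prediction is passed through at a much lower overall weight.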
[0071] The technical advantage of this embodiment lies in the fact that the task adjustment module designs a counterintuitive adjustment mechanism that "encourages collaboration and penalizes single-modality dominance." When a high-contribution modality exists (i.e., single-modality dominance), the concentration index increases and the global adaptation coefficient decreases, reducing the weight of the final output; conversely, when the contributions of the modalities are balanced (no obvious strong modality), the concentration index is relatively small and the global adaptation coefficient approaches its maximum of $1 - 1/M$ for $M$ modalities, a value that itself approaches 1 as $M$ grows, maximizing the weight of the fusion result. This mechanism mathematically pushes the model to explore the complementarity between multiple modalities, rather than lazily relying on only one strong modality. It dynamically adjusts the credibility of the fusion result based on the distribution of modal contributions, improving the model's utilization of multimodal synergistic effects.
[0072] In another embodiment, refer to Figure 2, which is an overall framework diagram of the multimodal data fusion model and modality enhancement processing method provided by this invention. This embodiment demonstrates an end-to-end deep learning processing flow, covering the entire process from multi-task input to final task output.
[0073] As shown in Figure 2, the multimodal data fusion model first receives multimodal input data from different tasks (Task 1, Task 2). In the scenario shown in Figure 2, the input data includes time-series data, video data, audio data, and image data.
[0074] These raw input data are first preprocessed and transformed into feature sequences in a uniform format. For example, as shown on the left of Figure 2, the data for each modality (such as Time-series or Video) is mapped to a set of input vectors containing feature values (gray blocks) and modality embedding identifiers (purple / pink blocks). These vectors not only carry information about the original data but also specify, through the modality embedding identifiers, the modality category to which the data belongs.
[0075] The processed vectors then enter a multimodal fusion encoder. This encoder is one of the core components of the architecture shown in Figure 2 and adopts a fully connected or densely connected structure (shown by the densely connected lines in the middle of Figure 2) for deep information interaction and integration between features of different modalities. Through this process, the encoder outputs a set of preliminarily fused latent feature representations (the green, pink, and purple node layers in the middle).
[0076] To further enhance the responsiveness of features to specific tasks, the model introduces a multi-head attention mechanism (the enlarged box at the top center of Figure 2). For each set of latent features, the model performs the calculation using a query matrix ($Q$), a key matrix ($K$), and a value matrix ($V$). Specifically, the input features are linearly transformed by the weight matrices $W_Q$, $W_K$, and $W_V$ to generate the corresponding $Q$, $K$, and $V$. Next, the model calculates the dot product of $Q$ and $K$ to generate the attention score matrix, which is then normalized by Softmax. Finally, the normalized attention scores are combined with the value matrix $V$ to generate a weighted feature representation. This process is performed in parallel across multiple attention heads (Head 0~h) to capture feature correlations in different subspaces. Furthermore, in the multi-head attention mechanism, this application sets parameters to automatically allocate the number of attention heads to each modality.
[0077] After the features are processed by the multi-head attention mechanism, the model uses the attention scores to refine the features. As shown on the right of Figure 2, within the attention mechanism, modal similarity is calculated based on mutual information, and a modality strength score is calculated based on the modal similarity and the output of each attention head. The score map displays the correlation strength between different feature elements in the form of a heatmap. Based on these scores, the model introduces a gating mechanism to generate dynamic weights and perform weighted processing on the modalities, highlighting feature regions that are more discriminative for the current task (shown by the squares of varying shades on the right side of the middle section of Figure 2).
[0078] Finally, the enhanced and weighted features are fed into their respective task classifiers. The model shown in Figure 2 supports multi-task parallel processing, with independent classification paths for Task 1 (e.g., behavior recognition based on video and time series) and Task 2 (e.g., sentiment analysis based on audio and images). Each task classifier outputs a prediction result for its specific task (Task 1 Output, Task 2 Output) from the fused input features through a fully connected layer and a Softmax layer.
[0079] The technical advantage of this embodiment lies in the fact that, through the cascaded design of a visualized multi-head attention mechanism and a modality enhancement module, the model not only achieves deep fusion of multimodal features but also intuitively displays feature focus points through attention score maps, improving the model's interpretability. Simultaneously, the multi-task parallel processing architecture enables the model to share underlying feature extraction capabilities, significantly improving resource utilization and cross-task generalization performance.
[0080] This embodiment also provides a modality enhancement processing method. Based on the aforementioned multimodal data fusion model, this method continuously optimizes the model's ability to process multimodal data through a closed-loop training and inference mechanism. In this embodiment, this method is applied to a continuous learning multimodal analysis system.
[0081] Refer to Figure 3, which is the first flowchart of the modality enhancement processing method provided by the present invention. The modality enhancement processing method first performs the model inference step: Step 101: in the current task processing round, input data of multiple modalities corresponding to each task are input into the task processing model, and the final processing results corresponding to each task are output by the task processing model.
[0082] Specifically, in the current task processing round, the system inputs multimodal input data corresponding to each task into the task processing model. Input data may include the text of a user comment, the user's facial expression image during the comment, and audio of their tone. The system obtains the final processing results corresponding to each task output by the task processing model. In this process, the task processing model adopts the multimodal data fusion model architecture described in the preceding embodiments. The task processing model outputs classification results through internal steps such as encoding, evaluation, dynamic fusion, and task adjustment.
[0083] In the aforementioned inference step, the multimodal input data encompasses a rich variety of media formats. Specifically, the multimodal input data includes text input data (such as social media posts and comment text), image input data (such as emojis and scene photos), audio input data (such as voice clips and ambient sound effects), and video input data (such as short videos and live streams). This multi-dimensional input setup ensures that the method can handle complex real-world scenario data.
[0084] To improve the robustness of the model, the method also includes model training and optimization steps.
[0085] Refer to Figure 4, which is the second flowchart of the modality enhancement processing method provided by the present invention. Specifically, the method includes the following steps: Step 201: for each task, obtain the modality strength score and actual contribution of each modality generated by the task processing model when processing input data of multiple modalities; Step 202: calculate the ideal contribution of each modality based on the modality strength score of each modality; Step 203: calculate the contribution deviation loss based on the deviation between the actual contribution and the ideal contribution of all modalities; Step 204: calculate the overall task loss based on the original loss of each task; Step 205: calculate the weighted sum of the contribution deviation loss and the overall task loss to obtain the total loss; Step 206: train the task processing model based on the total loss to obtain an optimized task processing model, which is then used for task processing in the next task processing round.
[0086] Specifically, for each task, the system obtains the modality strength score $S_m$ of each modality and the actual contribution $c_m$ of each modality generated by the task processing model when processing input data from multiple modalities. These two parameters are calculated during inference by the model's evaluation module and task adjustment module, respectively.
[0087] Subsequently, the system calculates the ideal contribution of each modality based on its modality strength score. This step aims to establish a "benchmark target" based on statistical characteristics to guide the network in allocating learning weights.
[0088] The system performs a strict normalization operation when calculating the ideal contribution. The ideal contribution of each modality is calculated using the following formula: $\hat{c}_m = \frac{S_m}{\sum_{j=1}^{M} S_j}$; where $\hat{c}_m$ is the ideal contribution of the $m$-th modality, $S_m$ is the modality strength score of the $m$-th modality, and $M$ is the total number of modalities. This formula ensures that the ideal contributions sum to 1, giving them the properties of a probability distribution and facilitating comparison with the actual contributions.
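The normalization of strength scores into ideal contributions can be sketched as below. It assumes plain division by the sum, which requires non-negative scores with a positive total; if the strength scores can be negative, a different normalization (for example a softmax) would be needed. The function name and values are hypothetical.

```python
def ideal_contributions(strength_scores):
    """Normalize non-negative strength scores so that they sum to 1."""
    total = sum(strength_scores)
    if total <= 0.0:
        raise ValueError("strength scores must have a positive sum")
    return [score / total for score in strength_scores]

ideal = ideal_contributions([2.0, 1.0, 1.0])
```

A modality whose strength score is twice that of the others is assigned half of the total ideal contribution in this three-modality example.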
[0089] Next, the system calculates the contribution deviation loss based on the deviation between the actual and ideal contributions of all modalities. This loss function quantifies the difference between the weights the model has actually learned and the weights it should theoretically have.
[0090] The system uses an absolute error summation method to calculate the contribution deviation loss. The contribution deviation loss is calculated using the following formula: $L_{\mathrm{dev}} = \sum_{m=1}^{M} \left| \hat{c}_m - c_m \right|$; where $L_{\mathrm{dev}}$ is the contribution deviation loss, $\hat{c}_m$ is the ideal contribution of the $m$-th modality, $c_m$ is the actual contribution of the $m$-th modality, and $M$ is the total number of modalities. By calculating the L1 norm, this formula forces the model's dynamic weight allocation network (i.e., the part that generates $c_m$) to approach the results of the evaluation based on statistical characteristics ($\hat{c}_m$).
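The absolute error summation can be sketched in a few lines. The function name and example values are hypothetical; in training, the same expression would be computed on differentiable tensors rather than Python floats.

```python
def contribution_deviation_loss(ideal, actual):
    """L1 distance between the ideal and actual contribution distributions."""
    if len(ideal) != len(actual):
        raise ValueError("both distributions must cover the same modalities")
    return sum(abs(i - a) for i, a in zip(ideal, actual))

ideal = [0.5, 0.25, 0.25]
actual = [0.8, 0.1, 0.1]  # the model over-relies on the first modality
deviation = contribution_deviation_loss(ideal, actual)
```

The loss is zero only when the actual contributions match the ideal ones exactly, and here it evaluates to 0.6 because the first modality takes 0.3 more than its ideal share at the expense of the other two.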
[0091] Furthermore, the system calculates the overall task loss based on the raw loss of each task. The raw loss is typically cross-entropy loss, which measures the error between the predicted result and the true label.
[0092] Specifically, the total loss $L_{\mathrm{total}}$ is obtained as a weighted sum of the contribution deviation loss and the overall task loss: $L_{\mathrm{total}} = \lambda_1 L_{\mathrm{task}} + \lambda_2 L_{\mathrm{dev}}$; where $\lambda_1$ and $\lambda_2$ are preset weighting coefficients, $L_{\mathrm{dev}}$ is the contribution deviation loss, and $L_{\mathrm{task}}$ is the overall task loss. The overall task loss adjusts the enhancement according to task performance, so that the modality enhancement more closely matches the specific task requirements. This allows the information provided by each modality to better adapt to the task objective, improving the model's performance in specific downstream tasks. The overall task loss is the sum of the original losses of all $T$ tasks: $L_{\mathrm{task}} = \sum_{t=1}^{T} L_t$; where $L_t$ is the original loss of the $t$-th task, such as cross-entropy loss.
[0093] Finally, the system performs a weighted summation of the contribution deviation loss and the overall task loss to obtain the total loss. Based on this total loss, the system uses the backpropagation algorithm to train the task processing model, update the model parameters, and obtain an optimized task processing model. This optimized task processing model is deployed for task processing in the next round, thereby achieving iterative improvement in model performance.
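How the two loss terms combine can be sketched as follows. The weighting coefficient values and the per-task loss values are hypothetical placeholders; a real training loop would compute these quantities with an automatic differentiation framework and backpropagate through the result.

```python
def total_loss(task_losses, deviation_loss, task_weight=1.0, deviation_weight=0.1):
    """Weighted sum of the overall task loss and the contribution deviation loss."""
    overall_task_loss = sum(task_losses)  # sum of the original losses of all tasks
    return task_weight * overall_task_loss + deviation_weight * deviation_loss

# Two tasks with cross-entropy losses 0.7 and 0.3, plus a deviation loss of 0.6.
loss = total_loss([0.7, 0.3], 0.6)
```

With these placeholder values the deviation term adds 0.06 to a task loss of 1.0. Raising the deviation weight would push the dynamic weights harder toward the ideal contributions at the possible cost of raw task accuracy.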
[0094] The modality enhancement method provided in this embodiment constructs a closed-loop mechanism of "modality evaluation - task adjustment - loss optimization" by introducing the contribution deviation loss. This method not only focuses on the final prediction accuracy of the task through the overall task loss, but also explicitly constrains the attention allocation mechanism within the model through the contribution deviation loss. This dual optimization strategy prevents the neural network from "taking shortcuts" during training (i.e., ignoring weak modal information), forcing the model to learn a modality weight distribution that conforms to statistical laws, thereby improving the model's generalization ability and stability when dealing with missing modalities or noise interference.
[0095] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A multimodal data fusion model, characterized in that it comprises: an encoding module, an evaluation module, a dynamic fusion module, and a task adjustment module; the encoding module performs feature encoding on the input data of each modality and outputs the initial feature vector of each modality; the evaluation module calculates statistical attribute features and task response features for the initial feature vector of each modality, and generates a modality strength score for each modality based on a weighted combination of the statistical attribute features and the task response features; the dynamic fusion module maps the modality strength score of each modality to the corresponding dynamic modal weight, and performs weighted fusion of the initial feature vectors of all modalities based on the dynamic modal weights of all modalities to obtain a fused feature vector; the dynamic fusion module maps the fused feature vector to an initial prediction result; the task adjustment module calculates the global adaptation coefficient based on the overall distribution of the dynamic modal weights of all modalities; and the task adjustment module performs an overall weighted adjustment on the initial prediction result based on the global adaptation coefficient to obtain the final processing result.
2. The multimodal data fusion model according to claim 1, characterized in that the task adjustment module is specifically used to perform the following steps: calculating the proportion of the dynamic modal weight of each modality in the sum of the dynamic modal weights of all modalities to obtain the actual contribution of each modality; calculating the sum of squares of the actual contributions of all modalities to obtain the concentration index; and calculating the difference between the preset constant and the concentration index and determining the difference as the global adaptation coefficient, wherein the global adaptation coefficient is negatively correlated with the concentration index.
3. The multimodal data fusion model according to claim 1, characterized in that the evaluation module is specifically used to perform the following steps: mapping the initial feature vectors of all modalities to a unified token representation; calculating, based on the unified token representation, the mutual information between the current modality and the other modalities, and using the mean of the mutual information as the first component of the statistical attribute feature; calculating the variance of the initial feature vector of the modality along the feature dimension, and using the normalized value of the variance as the second component of the statistical attribute feature; inputting the initial feature vector of the modality into a preset lightweight evaluation network, outputting a single-value scalar through linear mapping, and using the single-value scalar as the task response feature; and performing a weighted summation of the first component of the statistical attribute feature, the second component of the statistical attribute feature, and the task response feature, based on the preset first, second, and third fusion coefficients, to obtain the modality strength score.
4. The multimodal data fusion model according to claim 1, characterized in that the dynamic fusion module is specifically configured to perform the following steps: for any modality, calculate the global mean of the initial feature vector; multiply the global mean by the modality strength score to obtain an evaluation input vector; input the evaluation input vector into a gating network to generate a gating value; process the global mean using a basic weight network to generate a basic weight, and multiply the basic weight by the gating value to obtain the dynamic modality weight; multiply the dynamic modality weight by the initial feature vector to obtain a weighted initial feature vector; and, based on preset head number configuration parameters for each modality, determine the weight ratio of each modality during fusion, and perform weighted fusion of the weighted initial feature vectors of all modalities according to the weight ratios to obtain the fused feature vector.
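The flow of claim 4 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the claimed implementation: the gating network and the basic weight network are each reduced to one linear layer followed by a sigmoid, each initial feature vector is taken to be an (N, D) array whose global mean is computed over the N rows, and the weight ratio of each modality is assumed to be its head count divided by the total head count. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dynamic_fusion(features, scores, gate_params, base_params, head_counts):
    """features: list of (N, D) arrays, one per modality.

    Returns (fused feature array of shape (N, D), dynamic modality weights).
    """
    weighted, dynamic_weights = [], []
    for x, s, (gw, gb), (bw, bb) in zip(features, scores, gate_params, base_params):
        global_mean = x.mean(axis=0)             # global mean, shape (D,)
        eval_input = s * global_mean             # evaluation input vector
        gate = sigmoid(gw @ eval_input + gb)     # assumed gating network: linear + sigmoid
        base = sigmoid(bw @ global_mean + bb)    # assumed basic weight network: linear + sigmoid
        w = base * gate                          # dynamic modality weight
        dynamic_weights.append(w)
        weighted.append(w * x)                   # weighted initial feature vector
    # Assumed mapping from head counts to fusion ratios: proportional shares.
    ratios = np.asarray(head_counts, dtype=float)
    ratios = ratios / ratios.sum()
    fused = sum(r * f for r, f in zip(ratios, weighted))
    return fused, np.array(dynamic_weights)

# Toy usage with two modalities and zero-initialized networks.
D = 3
features = [np.ones((2, D)), 3.0 * np.ones((2, D))]
zero_net = (np.zeros(D), 0.0)
fused, weights = dynamic_fusion(features, [1.0, 2.0],
                                [zero_net, zero_net], [zero_net, zero_net],
                                head_counts=[1, 1])
```

With zero-initialized networks both sigmoids return 0.5, so each dynamic modality weight is 0.25 and the fused array is the head-ratio average of the two scaled inputs.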
5. The multimodal data fusion model according to claim 1, characterized in that the encoding module is specifically configured to perform the following steps: standardize the original input data of each modality, and generate a standardized input sequence containing a modality identifier and position information through modality-specific embedding and Fourier feature position encoding; and process the standardized input sequence using a single-modality shared encoder based on the Transformer architecture, wherein the single-modality shared encoder uses a fixed-dimensional latent array as the query vectors and the standardized input sequence as the key vectors and value vectors, calculates the interaction between the latent array and the standardized input sequence using a cross-attention mechanism, inputs the interaction result into an attention layer to iteratively update the latent array, and determines the final updated latent array as the initial feature vector of that modality.
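The encoder of claim 5 reads like a latent-array design in which a small set of latent vectors queries the input sequence. The NumPy sketch below illustrates that reading under several assumptions: single-head attention, residual updates, a self-attention step standing in for the "attention layer", random untrained projection matrices, and a modality identifier concatenated to every token. None of these choices are confirmed by the claims, and all names are hypothetical.

```python
import numpy as np

def fourier_position_features(num_positions, num_bands):
    """Sine and cosine features of positions in [-1, 1]; shape (T, 2 * num_bands)."""
    pos = np.linspace(-1.0, 1.0, num_positions)[:, None]
    freqs = 2.0 ** np.arange(num_bands)[None, :]
    angles = np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries_src, kv_src, wq, wk, wv):
    """Single-head scaled dot-product attention: queries from one array, keys/values from another."""
    q, k, v = queries_src @ wq, kv_src @ wk, kv_src @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

def encode_modality(tokens, modality_embedding, latents, params, num_bands=4, num_iters=2):
    T = tokens.shape[0]
    tokens = (tokens - tokens.mean()) / (tokens.std() + 1e-8)   # standardize the raw input
    seq = np.concatenate([tokens,
                          np.tile(modality_embedding, (T, 1)),  # modality identifier per token
                          fourier_position_features(T, num_bands)], axis=1)
    for _ in range(num_iters):
        latents = latents + attention(latents, seq, *params["cross"])     # cross-attention
        latents = latents + attention(latents, latents, *params["self"])  # assumed self-attention
    return latents  # final latent array, used as the initial feature vector

rng = np.random.default_rng(0)
T, d_tok, d_mod, B, N, d_lat = 6, 5, 2, 4, 3, 8
d_in = d_tok + d_mod + 2 * B
params = {
    "cross": (0.1 * rng.normal(size=(d_lat, d_lat)),
              0.1 * rng.normal(size=(d_in, d_lat)),
              0.1 * rng.normal(size=(d_in, d_lat))),
    "self": tuple(0.1 * rng.normal(size=(d_lat, d_lat)) for _ in range(3)),
}
out = encode_modality(rng.normal(size=(T, d_tok)), rng.normal(size=d_mod),
                      rng.normal(size=(N, d_lat)), params, num_bands=B)
```

Because the latent array has a fixed shape, the output shape does not depend on the input sequence length, which is what lets modalities of different lengths share one encoder in this reading.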
6. A modality enhancement processing method, characterized in that it comprises: in the current task processing round, inputting the input data of multiple modalities corresponding to each task into a task processing model, and outputting, by the task processing model, the final processing result corresponding to each task, wherein the task processing model is the multimodal data fusion model according to any one of claims 1 to 5.
7. The modality enhancement processing method according to claim 6, characterized in that it further comprises: for each task, obtaining the modality strength score and the actual contribution of each modality generated by the task processing model when processing the input data of multiple modalities; calculating the ideal contribution of each modality based on the modality strength score of each modality; calculating a contribution deviation loss based on the deviation between the actual contributions and the ideal contributions of all modalities; calculating an overall task loss based on the original loss of each task; obtaining a total loss as a weighted sum of the contribution deviation loss and the overall task loss; and training the task processing model based on the total loss to obtain an optimized task processing model, which is then used for task processing in the next task processing round.
8. The modality enhancement processing method according to claim 7, characterized in that calculating the ideal contribution of each modality based on the modality strength score of each modality comprises: calculating the ideal contribution of each modality using a formula (the formula and its symbols are not reproduced in this text) whose quantities are the ideal contribution of the m-th modality, the modality strength score of the m-th modality, and M, the total number of modalities.
9. The modality enhancement processing method according to claim 7, characterized in that calculating the contribution deviation loss based on the deviation between the actual contributions and the ideal contributions of all modalities comprises: calculating the contribution deviation loss using a formula (the formula and its symbols are not reproduced in this text) whose quantities are the contribution deviation loss, the ideal contribution of the m-th modality, the actual contribution of the m-th modality, and M, the total number of modalities.
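The training objective of claims 7 to 9 can be sketched in Python. The formulas of claims 8 and 9 are not available in this text, so the sketch substitutes stand-ins that are assumptions only: a softmax over the modality strength scores for the ideal contributions, a mean squared deviation over the M modalities for the contribution deviation loss, an average of per-task losses for the overall task loss, and a placeholder weight `lam` for the weighted sum. All names are hypothetical.

```python
import numpy as np

def ideal_contributions(strength_scores):
    """Assumed stand-in for the claim 8 formula: softmax over strength scores."""
    s = np.asarray(strength_scores, dtype=float)
    e = np.exp(s - s.max())
    return e / e.sum()

def contribution_deviation_loss(actual, ideal):
    """Assumed stand-in for the claim 9 formula: mean squared deviation over M modalities."""
    actual = np.asarray(actual, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    return float(np.mean((actual - ideal) ** 2))

def total_loss(task_losses, actual, strength_scores, lam=0.1):
    """Weighted sum of the overall task loss and the contribution deviation loss.

    lam and the averaging of per-task losses are illustrative assumptions.
    """
    overall_task_loss = float(np.mean(task_losses))
    deviation = contribution_deviation_loss(actual, ideal_contributions(strength_scores))
    return overall_task_loss + lam * deviation

# Two modalities with equal strength scores: the ideal split is even,
# so an actual split of (1, 0) is penalized.
loss_balanced = total_loss([1.0, 3.0], actual=[0.5, 0.5], strength_scores=[0.0, 0.0])
loss_skewed = total_loss([1.0, 3.0], actual=[1.0, 0.0], strength_scores=[0.0, 0.0])
```

Whatever the exact formulas are, the role of the deviation term is the same: it pushes the actual contributions toward the split implied by the strength scores, which discourages reliance on a single dominant modality.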
10. The modality enhancement processing method according to claim 6, characterized in that the input data of the multiple modalities include text input data, image input data, audio input data, and video input data.