A multi-modal representation learning inference method and system based on spectrum-driven adaptive selection and gated reconstruction

By combining spectrum-driven adaptive selection with gated reconstruction, this multimodal representation learning method addresses the problems of modality loss and environmental interference, achieving robust inference and efficient feature reconstruction in complex environments.

CN121902995B · Active · Publication Date: 2026-06-19 · YANTAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: YANTAI UNIV
Filing Date: 2026-03-23
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal representation learning methods struggle to maintain robust inference performance when faced with missing modal information and environmental interference. Furthermore, they lack a dynamic perception mechanism for significant differences within multimodal information, leading to unbalanced feature fusion and wasted computational resources.

Method used

A multimodal representation learning method based on spectrum-driven adaptive selection and gated reconstruction is adopted. Through spectral domain state diagnosis, adaptive dynamic path selection and gated reconstruction mechanism, accurate modeling and robust inference of multimodal information are achieved.

Benefits of technology

Even when 70% of the modal information is missing, the method is reported to maintain a weighted F1 score above 75% and a representation reconstruction fidelity of 91.4%, improving recognition accuracy, structural robustness, and real-time performance while reducing computational complexity.

✦ Generated by Eureka AI based on patent content.

Figure CN121902995B_ABST

Abstract

This invention relates to the field of multimodal representation learning and inference technology, and in particular to a multimodal representation learning and inference method and system based on spectrum-driven adaptive selection and gated reconstruction. The method includes multimodal adaptive feature encoding based on acquired data, including heterogeneous multimodal feature fusion and global modeling, entity and context graph construction, feature alignment, and spatial reshaping; spectrum-driven adaptive state perception based on the encoded data, including sentiment saliency diagnosis based on spectral domain energy distribution, higher-order dependency modeling based on weighted hypergraphs, and semantic stability maintenance based on linear mapping; and gated guided feature reconstruction based on perceived features, including high-energy node feature recovery based on hybrid expert routing, low-energy expert semantic preservation, and global sequence reconstruction. This invention achieves automatic identification of structurally salient nodes in multimodal input sequences through a spectrum-driven adaptive selection module (SAS).

Description

Technical Field

[0001] This invention relates to the field of multimodal representation learning and reasoning technology, and in particular to a multimodal representation learning and reasoning method and system based on spectrum-driven adaptive selection and gated reconstruction.

Background Technology

[0002] With the rapid development of artificial intelligence (AI) technology, multimodal information fusion has become a core capability for intelligent systems to achieve high-level semantic understanding and complex decision-making reasoning. In applications such as intelligent interactive systems, content understanding platforms, and intelligent decision support, systems typically need to perform unified modeling and comprehensive analysis of heterogeneous data from text, speech, images, and videos. This type of multimodal data is highly coupled in both temporal and spatial dimensions, and its internal semantic cues are often distributed across modalities. Only through collaborative fusion and deep association modeling can accurate reasoning about complex states or decision objectives be achieved. Therefore, how to achieve robust representation learning and intelligent reasoning of multimodal information in complex environments has become one of the important research directions in the field of AI.

[0003] While existing multimodal modeling techniques have made some progress and can achieve feature fusion and task prediction under ideal data conditions, they still face significant challenges in real-world deployment environments. On one hand, multimodal data in real-world scenarios is highly susceptible to external environmental interference, such as audio distortion caused by environmental noise and image information loss due to visual occlusion or changes in lighting. This can leave multimodal input data partially missing or severely imbalanced, disrupting the originally complete semantic structure and resulting in semantic fragmentation in the model's representation space. Traditional multimodal methods often employ strategies such as mean imputation or simply ignoring missing modalities. They lack the ability to actively model the inherent correlation structure of the missing information, which makes it difficult to maintain stable inference performance in complex environments.

[0004] On the other hand, existing methods generally lack a dynamic perception mechanism for the significant differences within multimodal information. In practical applications, multimodal data often exhibits non-uniform distribution characteristics, meaning that not all time points or information fragments contribute equally to the final decision result. Some nodes may carry high-density key semantic cues, while many more nodes only serve as background transitions or auxiliary support. However, traditional multimodal models typically employ a uniform feature aggregation strategy that treats all input nodes with equal weight and cannot prioritize or differentiate key information. This indiscriminate processing not only wastes computational resources but also easily leads to the "over-smoothing" problem, in which highly salient features are diluted during global fusion, thereby weakening the model's discriminative ability in complex decision-making tasks.

[0005] Therefore, there is an urgent need to construct a robust multimodal representation learning and reasoning method for incomplete input environments, so that the system can maintain robust reasoning performance under complex interference conditions, and provide key technical support for the stable deployment of multimodal intelligent systems in complex real-world environments.

Summary of the Invention

[0006] To address key issues in current multimodal representation learning and inference processes, such as semantic structure fragmentation due to modality loss, difficulty in effectively perceiving key information nodes, and over-smoothing of features due to the lack of guiding mechanisms in feature reconstruction, this invention provides a multimodal representation learning and inference method and system based on spectrum-driven adaptive selection and gated reconstruction. By constructing a technical framework that integrates spectral domain state diagnosis, adaptive dynamic path selection, and gated guided reconstruction, it achieves accurate modeling of information structure evolution and robust inference output in incomplete multimodal environments.

[0007] Firstly, the present invention provides a multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction, which adopts the following technical solution:

[0008] A multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction includes:

[0009] Acquire heterogeneous multimodal data;

[0010] Multimodal adaptive feature encoding is performed based on the acquired data, including heterogeneous multimodal feature fusion and global modeling, entity and context relationship graph construction, feature alignment and spatial reshaping;

[0011] Spectrum-driven adaptive state perception is performed based on the encoded data, including sentiment saliency diagnosis based on spectral domain energy distribution, higher-order dependency modeling based on weighted hypergraphs, and semantic stability maintenance based on linear mapping;

[0012] Gated guided feature reconstruction is performed based on the perceived features, including high-energy node feature recovery based on hybrid expert routing, low-energy expert semantic preservation, and global sequence reassembly;

[0013] Multi-task discrimination and reasoning based on reconstructed features;

[0014] Output the reasoning results.

[0015] Secondly, the present invention provides a multimodal representation learning and inference system based on spectrum-driven adaptive selection and gated reconstruction, which includes:

[0016] The data acquisition module is configured to acquire heterogeneous multimodal data;

[0017] The feature encoding module is configured to perform multimodal adaptive feature encoding based on the acquired data, including heterogeneous multimodal feature fusion and global modeling, entity and context relationship graph construction, feature alignment and spatial reshaping;

[0018] The state-aware module is configured to perform spectrum-driven adaptive state awareness based on the encoded data, including sentiment saliency diagnosis based on spectral domain energy distribution, higher-order dependency modeling based on weighted hypergraphs, and semantic stability maintenance based on linear mapping;

[0019] The feature reconstruction module is configured to perform gated guided feature reconstruction based on the perceived features, including high-energy node feature recovery based on hybrid expert routing, low-energy expert semantic preservation, and global sequence recombination;

[0020] The reasoning module is configured to perform multi-task discrimination and reasoning based on the reconstructed features; and output the reasoning results.

[0021] Thirdly, the present invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the aforementioned multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction.

[0022] Fourthly, the present invention provides a terminal device including a processor and a computer-readable storage medium, wherein the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded and executed by the processor to perform the aforementioned multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction.

[0023] In summary, the present invention has the following beneficial technical effects:

[0024] Compared to existing multimodal representation methods, which suffer from significant performance degradation under conditions of missing modal information, limited feature aggregation strategies, and a lack of differentiated reconstruction mechanisms, this invention addresses the need for robust representation and stable inference under incomplete multimodal input environments by constructing an end-to-end technical framework based on spectrum-driven adaptive selection and gated feature reconstruction. The beneficial effects of this invention are mainly reflected in the following aspects:

[0025] First, this invention achieves automatic identification of structurally significant nodes in multimodal input sequences through a spectrum-driven adaptive selection module (SAS). By employing spectral domain variance analysis, the energy distribution of nodes in the local relational graph is measured, so that state partitioning moves from a traditional unified aggregation model to an adaptive splitting mechanism based on structural differences. This mechanism prioritizes the modeling of high-variability regions while maintaining stable representations of low-variability regions, effectively avoiding over-smoothing of features and improving overall representation quality.

[0026] Secondly, the Gated Guided Feature Reconstruction (GSR) module constructed in this invention provides differentiated reconstruction paths to address the modality-missing problem. Through a mixture-of-experts (MoE) routing mechanism, high-energy nodes are assigned to enhanced modeling paths, and directional compensation is performed using global semantic anchors. At the same time, lightweight and stable paths are designed for low-energy nodes to reduce computational overhead and maintain structural consistency. This dual-path collaborative mechanism avoids the noise amplification and structural shift problems common in traditional completion methods, maintaining the continuity and consistency of feature representation even under incomplete information conditions.

[0027] Furthermore, this invention achieves a paradigm shift in structural optimization from "unified processing" to "state decentralization and on-demand enhancement." By introducing a state gating mechanism, the system can dynamically allocate modeling intensity and computational resources based on the salience of nodes, enabling critical regions to receive higher-order dependency modeling support while stable regions maintain lightweight representation. This strategy effectively controls computational complexity while improving robustness, balancing performance and efficiency.

[0028] Furthermore, this invention integrates task prediction objectives, state selection supervision, distribution alignment constraints, and feature reconstruction constraints into an end-to-end training framework through a multi-task joint optimization mechanism. This collaborative optimization mode fosters a mutually reinforcing relationship between the feature reconstruction process and the task discrimination objective, enhancing the model's generalization ability and prediction stability under modality missing conditions.
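One common way to combine several training objectives of this kind is a weighted sum. The expression below is only an illustrative sketch under that assumption: the loss symbols and the balancing coefficients $\lambda_1$, $\lambda_2$, $\lambda_3$ are hypothetical names introduced here, and the text above does not state the actual form of the joint objective.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{1}\,\mathcal{L}_{\text{sel}}
  + \lambda_{2}\,\mathcal{L}_{\text{align}}
  + \lambda_{3}\,\mathcal{L}_{\text{rec}}
```

Here $\mathcal{L}_{\text{task}}$ stands for the task prediction objective, $\mathcal{L}_{\text{sel}}$ for state selection supervision, $\mathcal{L}_{\text{align}}$ for the distribution alignment constraint, and $\mathcal{L}_{\text{rec}}$ for the feature reconstruction constraint.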

[0029] In summary, this invention demonstrates significant performance advantages in scenarios with incomplete multimodal inputs. Experimental results show that even when 70% of the modal information is missing, the invention maintains a weighted F1 score above 75%; under complete data conditions, the prediction accuracy (ACC) is improved by approximately 3.2%–5.8% compared to mainstream benchmark models; the system's reconstruction fidelity under missing modalities reaches 91.4%; and thanks to the differentiated expert routing design, the system's inference latency is kept within 45 milliseconds. This invention achieves stable improvements across multiple dimensions, including recognition accuracy, structural robustness, reconstruction quality, and real-time performance, and has strong engineering application and promotion value.

Attached Figure Description

[0030] Figure 1 is a schematic diagram of the multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to Embodiment 1 of the present invention;

[0031] Figure 2 is a diagram comparing the core recognition performance of each model in extreme modality-missing scenarios;

[0032] Figure 3 is a diagram comparing the feature reconstruction quality of each model in extreme modality-missing scenarios;

[0033] Figure 4 is a schematic diagram comparing the consistency between the state diagnosis mechanism of each model and the labeled ground truth;

[0034] Figure 5 is a diagram comparing the inference time of each model;

[0035] Figure 6 is a diagram comparing the multi-dimensional comprehensive performance evaluation of each model.

Detailed Implementation

[0036] The present invention will be further described in detail below with reference to the accompanying drawings.

[0037] Embodiment 1

[0038] Referring to Figure 1, the "robust multimodal representation learning and reasoning method based on spectrum-driven adaptive selection and gated reconstruction" (hereinafter referred to as SAGE-Net) proposed in this embodiment has an overall technical framework consisting of four core modules: (1) Multimodal adaptive feature encoding module (MAE); (2) Spectrum-driven adaptive state diagnosis module (SAS); (3) Gated guided reconstruction module (GSR); and (4) Multi-task reasoning and decision output module (MPC). The specific scheme is as follows:

[0039] (1) Multimodal Adaptive Feature Encoding Module (MAE)

[0040] This module is the foundation of the entire system. Its core task is to transform heterogeneous multimodal data into structured feature representations with uniform dimensions and contextual relevance. Input data can include modalities such as text, audio, images, or video. These modalities differ in acquisition conditions, representation forms, and semantic density. For example, audio signals may be affected by environmental noise, visual signals may be affected by occlusion or changes in lighting, and there is also an asymmetry in the distribution of semantic information among different modalities. Therefore, this module needs to have robust feature integration capabilities.

[0041] This module performs unified modeling of multimodal inputs by constructing an encoding network that incorporates identity embedding and bidirectional temporal modeling structures. The encoding process outputs node-level feature representations with context-aware capabilities. Simultaneously, stable semantic features are extracted from the text modality, and global semantic anchors are generated for cross-modal guidance and information compensation in subsequent modules.

[0042] 1) Heterogeneous multimodal feature fusion and global modeling

[0043] In multimodal representation learning and reasoning tasks, text, audio, and visual data are crucial information sources for constructing structured semantic representations. Traditional methods typically integrate features from different modalities through simple vector concatenation. However, when dealing with long-range relational structures that require incorporating intonation variations, visual dynamics, and contextual dependencies, the lack of deep modeling of temporal dependencies often hinders the acquisition of unified and concise global semantic representations. To overcome this problem, this invention employs bidirectional gated recurrent units (Bi-GRUs) to encode multimodal heterogeneous data. By leveraging their gating mechanism, which retains both long-term and short-term memory, Bi-GRUs model the bidirectional dependencies between the node at any time step and the rest of the sequence, thereby obtaining global node features with context-aware capabilities.

[0044] First, original features are extracted from the sequence node data acquired at each time step $i$. To formally describe the modality-missing problem common in practical applications, this invention defines a binary presence indicator $\delta_i^{m}$ for each modality $m$ (with $m \in \{a, v, t\}$ denoting audio, visual, and text respectively). A preliminary multimodal input vector is constructed from the products of these indicators with the original modal features $x_i^{m}$:

[0045] $$x_i = \delta_i^{a} x_i^{a} \oplus \delta_i^{v} x_i^{v} \oplus \delta_i^{t} x_i^{t},$$

[0046] where $x_i$ denotes the multimodal fusion input vector of the $i$-th node; $x_i^{a}$, $x_i^{v}$, and $x_i^{t}$ denote the extracted original feature vectors of the audio, visual, and text modalities, respectively; $\oplus$ denotes concatenation; and $\delta_i^{m}$ is the binary presence indicator, equal to 1 when the data of modality $m$ exists for the $i$-th node and 0 when it is missing.
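The indicator masking described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the patented implementation: the function name `build_input`, the feature sizes in `DIMS`, the use of `None` to mark a missing modality, and the choice of concatenation to combine the masked features are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical per-modality feature sizes, chosen only for illustration.
DIMS: Dict[str, int] = {"a": 3, "v": 2, "t": 4}


def build_input(features: Dict[str, Optional[Sequence[float]]]) -> List[float]:
    """Concatenate indicator-masked features in the fixed order a, v, t.

    A modality whose value is None (or absent) is treated as missing: its
    presence indicator is 0, so its slot is filled with zeros.
    """
    x: List[float] = []
    for m in ("a", "v", "t"):
        feat = features.get(m)
        delta = 0.0 if feat is None else 1.0  # binary presence indicator
        base = [0.0] * DIMS[m] if feat is None else list(feat)
        if len(base) != DIMS[m]:
            raise ValueError(f"modality {m!r} expects {DIMS[m]} values")
        x.extend(delta * value for value in base)  # indicator times feature
    return x


# A node whose visual modality is missing (assumed example data).
node = {"a": [0.1, 0.2, 0.3], "v": None, "t": [1.0, 0.0, 0.5, 0.2]}
x_i = build_input(node)
print(x_i)
```

Keeping a fixed slot for every modality means the input length stays constant whether or not a modality is present, which is what allows a downstream encoder to consume incomplete nodes without special cases.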

[0047] To enable the model to capture the identity features and time-series information of different entities, this invention maps identity identifiers to embedding vectors in a continuous space and concatenates them with the multimodal input vectors. The fused information is then fed into a bidirectional gated recurrent network. This network models forward and backward contextual information simultaneously, providing a more comprehensive picture of how information evolves within the sequence than a unidirectional structure.

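[0048] $$h_i = \mathrm{BiGRU}\left(x_i,\ \overrightarrow{h}_{i-1},\ \overleftarrow{h}_{i+1}\right),$$

[0049] where $h_i$ denotes the context-aware node embedding generated by encoding the $i$-th node; $x_i$ is the fused input of the $i$-th node; $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$ denote the forward and backward states at the adjacent time points, respectively; and $\mathrm{BiGRU}(\cdot)$ denotes the bidirectional gated recurrent operation.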

[0050] While acquiring sequence features, this invention simultaneously extracts deep semantic information of the text modality and maps it into global semantic anchor signals through a linear transformation layer. Due to its strong semantic stability, this signal serves as a reference benchmark for cross-modal guidance and information compensation in subsequent modules.

[0051] $$g = W_g\, x^{t} + b_g,$$

[0052] where $g$ denotes the generated global text semantic anchor vector; $x^{t}$ is the original text feature vector; $W_g$ is a learnable linear projection weight matrix; and $b_g$ is the bias vector.

[0053] After the above encoding process, the original heterogeneous multimodal data is transformed into a node representation matrix with contextual relevance. This provides a stable structured input for subsequent spectrum-driven state diagnosis.

[0054] 2) Construction of entity and context relationship graph

[0055] In multimodal sequence modeling, the evolution of information depends not only on the semantic representation of the nodes themselves, but also on the interaction relationships between different entities and the temporal structure of the context. Traditional methods typically employ fully connected networks or simple attention mechanisms to characterize dependencies. However, when faced with multi-entity or long-sequence interaction scenarios, the lack of explicit modeling of the topology makes it difficult for the model to distinguish between self-dependencies and cross-entity dependencies, leading to local key information being easily diluted by global background features. To overcome this problem, this invention, based on the node representations obtained in the aforementioned steps, introduces a dynamic heterogeneous graph construction mechanism based on a sliding window. This mechanism jointly models the identity and temporal dimensions in a non-Euclidean space using a relational graph convolutional network.

[0056] First, this invention defines a dynamic sliding window of size $w$. For any central node $i$, its neighborhood node set is defined as:

[0057] $$\mathcal{N}_i = \{\, j \mid |i - j| \le w \,\},$$

[0058] Based on this, the system identifies the identity of each node in the neighborhood and constructs connection edges that reflect the interaction relationships among multiple entities. Each edge is assigned a specific relation type according to the correspondence between the entities of its two endpoint nodes. This design distinguishes the dependency structures between different entities through explicit connection constraints, thereby characterizing the association patterns between entities in complex interaction scenarios.

[0059] Meanwhile, this invention establishes connection edges carrying time-series tags between nodes, constructing a topology that reflects multi-directional contextual logic. The time-series tag set includes the forward, current, and backward directions; by assigning differentiated weights to edges in different directions, the model can capture the propagation and feedback of information along the time axis. This dual-path graph structure enables the model to jointly model node features along the entity-relationship and temporal dimensions.
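As a rough illustration of the graph construction described above, the Python sketch below enumerates typed edges inside a sliding window. The function name `build_edges`, the relation labels `self` and `cross`, the symmetric window, and the convention that `forward` means a later node are assumptions made for this example rather than details taken from the source.

```python
from typing import List, Tuple

# (central node, neighbor node, identity relation, temporal tag)
Edge = Tuple[int, int, str, str]


def build_edges(speakers: List[str], window: int) -> List[Edge]:
    """Connect every node i to each node j with |i - j| <= window."""
    edges: List[Edge] = []
    n = len(speakers)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            # Identity relation: same entity versus a different entity.
            identity = "self" if speakers[i] == speakers[j] else "cross"
            # Temporal tag of neighbor j relative to the central node i.
            if j < i:
                tag = "backward"
            elif j > i:
                tag = "forward"
            else:
                tag = "current"
            edges.append((i, j, identity, tag))
    return edges


# Four nodes alternating between two hypothetical entities, A and B.
edges = build_edges(["A", "B", "A", "B"], window=1)
for edge in edges:
    print(edge)
```

Storing the identity relation and the temporal tag on every edge lets the two dimensions be aggregated separately later, each with its own relation-specific weights.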

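[0060] To achieve efficient feature extraction and aggregation over the aforementioned heterogeneous relationships, this invention employs a two-layer, parameter-shared relational graph convolutional network to compute the identity relationship features $h_i^{\mathrm{id}}$ and the context features $h_i^{\mathrm{ctx}}$ separately. The feature update process is as follows:

[0061] $$h_i^{\mathrm{id}} = \phi\Big(\sum_{r} \sum_{j \in \mathcal{N}_i^{r}} W_r^{\mathrm{id}}\, h_j\Big),$$

[0062] $$h_i^{\mathrm{ctx}} = \phi\Big(\sum_{r} \sum_{j \in \mathcal{N}_i^{r}} W_r^{\mathrm{ctx}}\, h_j\Big),$$

[0063] where $h_i^{\mathrm{id}}$ denotes the spatial feature vector obtained by aggregating the $i$-th node along the identity relationship dimension; $h_i^{\mathrm{ctx}}$ denotes the feature vector obtained by aggregating the $i$-th node along the temporal dimension; $\mathcal{N}_i^{r}$ denotes the set of neighboring nodes of node $i$ under relation type $r$, with $r$ ranging over the relation types of the respective dimension; $h_j$ is the $j$-th context-aware node embedding from the initial encoding output; $W_r^{\mathrm{id}}$ and $W_r^{\mathrm{ctx}}$ denote the learnable transformation weight matrices for the different relation types; and $\phi(\cdot)$ denotes a nonlinear activation function.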
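The relation-specific aggregation described above can be sketched for a single node in plain Python. This is a simplified, single-layer illustration under assumptions: the helper names `matvec` and `rgcn_node`, the `tanh` activation, the absence of any neighbor-count normalization, and the toy weights are all hypothetical choices rather than details confirmed by the source.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(w_rc * v_c for w_rc, v_c in zip(row, v)) for row in w]


def rgcn_node(i: int, h: List[Vector],
              neighbors: Dict[str, Dict[int, List[int]]],
              weights: Dict[str, Matrix]) -> Vector:
    """Sum W_r h_j over relation types r and neighbors j, then apply tanh."""
    out_dim = len(next(iter(weights.values())))
    total = [0.0] * out_dim
    for r, w_r in weights.items():
        for j in neighbors.get(r, {}).get(i, []):
            message = matvec(w_r, h[j])  # relation-specific transform
            total = [t + m for t, m in zip(total, message)]
    return [math.tanh(t) for t in total]  # assumed activation


# Toy data: three nodes with 2-dimensional embeddings.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = {
    "self": [[1.0, 0.0], [0.0, 1.0]],   # identity matrix
    "cross": [[0.5, 0.0], [0.0, 0.5]],  # half-scaled identity
}
neighbors = {"self": {0: [0, 2]}, "cross": {0: [1]}}
out = rgcn_node(0, h, neighbors, weights)
print(out)
```

For node 0 the pre-activation sum is `[2.0, 1.5]`: the `self` relation contributes `h[0] + h[2]` and the `cross` relation contributes half of `h[1]`.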

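[0064] After the features of the two dimensions are obtained, the entity relationship features and the contextual temporal features are fused through a nonlinear fusion layer to generate a comprehensive node feature vector $z_i$ for subsequent spectral domain state diagnosis:

[0065] $$z_i = \phi\big(W_f\,(h_i^{\mathrm{id}} \oplus h_i^{\mathrm{ctx}}) + b_f\big),$$

[0066] where $z_i$ denotes the comprehensive discourse feature vector that incorporates multidimensional topological associations; $h_i^{\mathrm{id}}$ and $h_i^{\mathrm{ctx}}$ are the identity relationship features and the context features of the $i$-th node; $\oplus$ denotes the feature concatenation operation; $W_f$ is the global feature transformation matrix; $b_f$ is a learnable bias term used to adjust the distribution; and $\phi(\cdot)$ is a nonlinear activation function.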

[0067] Through the aforementioned graph structure construction and feature iteration, the system maps the original linear sequence into a high-dimensional relational graph structure. This process enhances the model's ability to express key local interaction structures and provides a structured representation basis for the next stage of spectral-driven information saliency diagnosis.

[0068] 3) Feature Alignment and Spatial Reshaping

[0069] After aggregating multidimensional interactive dependencies, the core challenge for the system lies in eliminating spatial distribution biases generated during the fusion process of heterogeneous modal features. Since text, audio, and visual features have different dimensions and semantic densities in the initial stages, directly coupled feature vectors may exhibit numerical instability, easily masking fine-grained structural changes. To address this issue, this invention introduces a spatial reshaping strategy based on an adaptive residual mechanism, building upon dual-path relation feature fusion. This strategy maps the fused features to a unified alignment space through nonlinear projection and complements the original node representations, thereby improving the stability and robustness of feature representation.

[0070] First, this invention constructs a feature reshaping network consisting of a linear transformation layer and a normalization layer, applying regularization constraints to the geometric distribution of nodes in the cross-modal alignment space. To avoid gradient decay or loss of core semantic information after convolutional operations on complex graphs, this invention employs a residual connection structure during the reshaping process, which preserves the original node features $h_i$ and performs a weighted consolidation with the fused features $z_i$:

[0071] $$\hat{h}_i = \mathrm{LN}\big(h_i + \mathrm{MLP}(z_i)\big),$$

[0072] where $\hat{h}_i$ denotes the enhanced node feature vector after spatial reshaping and semantic alignment; $h_i$ is the initial node representation obtained by bidirectional temporal encoding; $z_i$ is the comprehensive feature derived from the fusion of relational topology and contextual logic; $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron mapping used to project the heterogeneous relation features onto a space with the same dimension as the original nodes; and $\mathrm{LN}(\cdot)$ denotes the layer normalization operator, used to smooth the feature distribution and improve training stability.
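The residual reshaping step described above (project the fused feature to the node dimension, add the original node feature, then apply layer normalization) can be sketched as follows. The two-layer ReLU perceptron, the toy weights, and the omission of layer normalization's learnable scale and shift are assumptions made to keep the example short.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Affine map w @ x + b."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) + b_r
            for row, b_r in zip(w, b)]


def mlp(z: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two-layer perceptron with a ReLU in between (assumed architecture)."""
    hidden = [max(0.0, u) for u in linear(w1, b1, z)]
    return linear(w2, b2, hidden)


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance, without scale or shift."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def reshape_node(h: Vector, z: Vector, w1: Matrix, b1: Vector,
                 w2: Matrix, b2: Vector) -> Vector:
    """Residual connection followed by layer normalization."""
    projected = mlp(z, w1, b1, w2, b2)
    if len(projected) != len(h):
        raise ValueError("MLP output must match the node feature dimension")
    return layer_norm([a + p for a, p in zip(h, projected)])


h_i = [1.0, 2.0, 3.0]                       # original node feature
z_i = [0.5, -0.5]                           # fused relation feature
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
out = reshape_node(h_i, z_i, w1, b1, w2, b2)
print(out)
```

Because the original feature is added before normalization, the node keeps its own signal even when the projected relation feature is small or noisy.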

[0073] Furthermore, this module introduces a contrastive learning mechanism to constrain the node feature manifold, ensuring that even with missing modal information, the feature representations generated by the remaining valid modalities stay near the preset semantic manifold, thereby enhancing the model's adaptability to data incompleteness. This design provides stable and information-dense structured input for state diagnosis in subsequent modules.

[0074] Through the processing of the above three sub-modules, this invention realizes a complete conversion process from raw heterogeneous signals to structured, robust node representations, forming a technical closed loop of the MAE module, and providing a data foundation for subsequent information saliency diagnosis based on spectral domain energy distribution.

[0075] (2) Spectrum-driven adaptive state awareness module (SAS)

[0076] This module is the core of the system. Its task is to adaptively partition node states based on the structured features output by the aforementioned modules, by measuring the spectral energy distribution of nodes in the relationship graph. Dialogue data exhibits significant non-uniformity. In long sequences, some nodes only serve a contextual function, while a small number of nodes show large structural fluctuations in the feature space. If a uniform aggregation strategy is adopted, it is easy to smooth out highly variable regions as a whole, especially when modalities are missing, making it impossible to determine the key processing objects. Therefore, this module introduces a spectral variance analysis mechanism to divide nodes into high-energy states and low-energy states, and implements differentiated modeling path selection based on state masks, providing a structured basis for subsequent feature reconstruction.

[0077] 1) Sentiment saliency diagnosis based on spectral domain energy distribution

[0078] After acquiring structured multimodal association features, the system needs to perform real-time diagnosis of the non-uniformity of information distribution in the input sequence. Since different data nodes contribute differently to the overall structure, using indiscriminate feature aggregation can easily lead to local high-variability regions being smoothed out by a stable background. To address this issue, this invention introduces an adaptive selection mechanism based on spectral domain energy analysis. By quantitatively evaluating the energy fluctuations of data nodes in the local relationship graph, it identifies high-energy and low-energy state nodes.

[0079] First, this module uses the comprehensive feature vectors $z_i$ output by the MAE module to measure the intensity of feature evolution within the preset neighborhood window $\mathcal{N}_i$. This is achieved by calculating the spectral variance $\sigma_i^2$ between each node and its neighboring context nodes, which quantifies the degree of variation of the node on the spatial manifold. A higher spectral variance indicates a greater difference between the node and its neighborhood feature distribution, reflecting stronger structural saliency. The calculation process is as follows:

[0080] $$\sigma_i^2 = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \lVert z_j - \mu_i \rVert^2,$$

[0081] where $\sigma_i^2$ denotes the spectral variance of the $i$-th node; $\mathcal{N}_i$ is the preset sliding-window neighborhood set; $z_j$ denotes the comprehensive correlation feature of the $j$-th node within the neighborhood; $\mu_i$ denotes the mean vector of the features of all nodes within this neighborhood; and $|\mathcal{N}_i|$ denotes the total number of nodes in the neighborhood.
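A neighborhood variance of the kind described above can be sketched in a few lines of Python. The exact formula is an assumption here: the sketch takes the mean squared Euclidean distance between each neighborhood feature and the neighborhood mean, over a symmetric window that includes the central node. The function name `spectral_variance` is hypothetical.

```python
from typing import List

Vector = List[float]


def spectral_variance(z: List[Vector], i: int, window: int) -> float:
    """Mean squared distance of neighborhood features from their mean."""
    lo, hi = max(0, i - window), min(len(z), i + window + 1)
    neigh = z[lo:hi]                       # window clipped at sequence ends
    dim = len(z[i])
    mu = [sum(v[d] for v in neigh) / len(neigh) for d in range(dim)]
    total = sum(sum((v[d] - mu[d]) ** 2 for d in range(dim)) for v in neigh)
    return total / len(neigh)


# Five nodes; only the middle one deviates from a flat background.
z = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
variances = [spectral_variance(z, i, window=1) for i in range(len(z))]
print(variances)
```

With this definition, nodes 1 to 3 all receive the same non-zero value (32/9), because each of their windows contains the deviating node, while the two end nodes score zero. A larger window would spread the high-energy region further.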

[0082] After the spectral variance has been quantified, this invention constructs an emotion-triggered gating mechanism based on a learnable threshold $\tau$, which automatically classifies nodes into different perceptual states. The mechanism generates a state mask $s_i$ through binarization, providing routing instructions for the subsequent differentiated feature processing paths:

[0083] $$s_i = \begin{cases} 1, & \sigma_i^2 > \tau, \\ 0, & \text{otherwise}, \end{cases}$$

[0084] where $s_i$ denotes the state mask of the $i$-th node; $\sigma_i^2$ is the spectral variance of the $i$-th node; and $\tau$ is the learnable emotion trigger threshold set internally by the system. When $s_i = 1$, the node is in an emotionally active state and requires in-depth structural modeling and repair; when $s_i = 0$, the node is in an emotionally passive state, where the focus is on maintaining semantic stability.
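The binarization and routing step described above can be illustrated as follows. This is a sketch under assumptions: the strict `>` comparison, the fixed threshold value, and the function names `state_mask` and `route` are hypothetical, and the sketch does not show how a learnable threshold would be trained.

```python
from typing import List, Tuple


def state_mask(variances: List[float], tau: float) -> List[int]:
    """Return 1 for high-energy nodes (variance above tau), else 0."""
    return [1 if v > tau else 0 for v in variances]


def route(mask: List[int]) -> Tuple[List[int], List[int]]:
    """Split node indices into a high-energy path and a low-energy path."""
    high = [i for i, s in enumerate(mask) if s == 1]
    low = [i for i, s in enumerate(mask) if s == 0]
    return high, low


# Assumed per-node variances and threshold, for illustration only.
variances = [0.0, 3.6, 3.6, 3.6, 0.0]
mask = state_mask(variances, tau=1.0)
high_nodes, low_nodes = route(mask)
print(mask, high_nodes, low_nodes)
```

Since the two index lists partition the sequence, the outputs of the two processing paths can later be written back to their original positions to reassemble the full sequence.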

[0085] Furthermore, this invention designs a dedicated supervised loss function for this diagnostic process, which continuously optimizes the trigger threshold by comparing the predicted states against the actual activity distribution of the nodes under complete-modality conditions. With this "diagnosis first, classification later" mechanism, key areas can still be modeled with priority under modality-deficient conditions, providing structured input for the subsequent higher-order dependency modeling and gated reconstruction.

[0086] 2) Higher-order dependency modeling based on weighted hypergraph

[0087] In the aforementioned state diagnosis process, nodes classified as high-energy states typically exhibit strong characteristic fluctuations and structural differences in their local relational structures. These nodes often have complex multi-dimensional associations with multiple neighboring nodes. Traditional binary graph structures only describe pairwise connections between nodes, making it difficult to characterize the collaborative relationships among multiple nodes, thus resulting in insufficient expressive power when modeling higher-order dependencies. To address this issue, this invention introduces a higher-order dependency modeling mechanism based on spectral domain weighted hypergraphs for high-energy state nodes. In this structure, hyperedges can connect multiple nodes simultaneously, thereby characterizing many-to-many associations at the structural level. Compared to ordinary graphs, hypergraphs can express higher-order topological constraints, making node feature updates not only dependent on local adjacency relationships but also influenced by the collective structural patterns.

[0088] Specifically, this invention constructs a weighted hypergraph structure centered on the high-energy nodes. Let $H$ be the hypergraph incidence matrix, which represents the membership relationship between nodes and hyperedges, and let $W$ be the hyperedge weight matrix, which measures the contribution of each hyperedge to the overall structure. Structural information is normalized by constructing the corresponding node degree matrix $D_v$ and hyperedge degree matrix $D_e$. On this basis, a spectral-domain convolution based on the hypergraph Laplacian is introduced to update the features of the high-energy nodes. The calculation is as follows:

[0089] $F^{\mathrm{act}} = \sigma\left(D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2} F\right)$,

[0090] where $F^{\mathrm{act}}$ denotes the node-enhanced features obtained after higher-order modeling; $F$ is the input comprehensive association feature matrix; $H$ is the hypergraph incidence matrix; $W$ is a learnable weight matrix representing the importance of the hyperedges; $D_v$ and $D_e$ are the node degree matrix and hyperedge degree matrix of the hypergraph, respectively, used for structural normalization of the convolution; and $\sigma(\cdot)$ is a nonlinear activation function.
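
A normalized hypergraph convolution of the kind described above can be sketched in NumPy. The sketch assumes the widely used form with symmetric node-degree normalization, a diagonal hyperedge weight vector `w` held fixed rather than learned, ReLU as the activation, and every node belonging to at least one hyperedge; the function name `hypergraph_conv` is hypothetical.

```python
import numpy as np

def hypergraph_conv(X, H, w):
    """X: (N, d) node features; H: (N, E) incidence matrix; w: (E,) hyperedge weights."""
    dv = H @ w                                  # weighted node degrees
    de = H.sum(axis=0)                          # hyperedge degrees (nodes per hyperedge)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    A = Dv_inv_sqrt @ H @ np.diag(w) @ np.diag(1.0 / de) @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X, 0.0)               # ReLU is an assumed activation

# Three nodes, two hyperedges; node 1 belongs to both hyperedges.
H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
w = np.array([1.0, 1.0])
X = np.sqrt(H @ w)[:, None]                     # sqrt of node degrees
out = hypergraph_conv(X, H, w)
```

The input chosen here is the square root of the node degrees, which the normalized propagation operator leaves unchanged. That makes it a convenient sanity check that the normalization is wired correctly.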

[0091] 3) Semantic stability maintenance based on linear mapping

[0092] Unlike high-energy nodes, nodes classified as low-energy nodes exhibit smaller characteristic fluctuations in the local relational structure, and their main role is to maintain the continuity of the overall structure and the consistency of the background. For these nodes, using complex high-order hypergraph operations will not only increase computational overhead, but may also introduce structural perturbations due to over-modeling of low-change regions.

[0093] To address this, this invention designs a stable path modeling strategy based on linear mapping for low-energy state nodes. On this path, the node feature $f_i$ is updated through a lightweight spatial projection, which preserves the original structural features and provides a stable reference baseline for the overall sequence. The processing is defined as follows:

[0094] $f_i^{\mathrm{pas}} = W_s f_i + b_s$,

[0095] where $f_i^{\mathrm{pas}}$ denotes the updated feature vector of the low-energy state node; $f_i$ is the input comprehensive association feature; and $W_s$ and $b_s$ are the learnable weight matrix and bias term used for semantic preservation, respectively. This linear mapping path reduces computational complexity while keeping the feature structure consistent, allowing the system to balance efficiency and stability during differentiated modeling.

[0096] Finally, this invention integrates the active-path output $f_i^{\mathrm{act}}$ and the passive-path output $f_i^{\mathrm{pas}}$ through a structured realignment layer. The integration uses the state mask $m_i$ as the control signal: according to the state label each node received in the diagnostic phase, the features output by the two paths are selected and restored to their original positions node by node, generating the final state-enhanced feature vector $z_i$:

[0097] $z_i = m_i \, f_i^{\mathrm{act}} + (1 - m_i) \, f_i^{\mathrm{pas}}$,

[0098] where $z_i$ denotes the final output feature of the $i$-th node; $m_i$ is the state mask of that node; $f_i^{\mathrm{act}}$ is the high-energy node feature obtained from higher-order hypergraph modeling; and $f_i^{\mathrm{pas}}$ is the low-energy node feature obtained from the linear mapping path.
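
The linear passive path and the mask-controlled selection between the two paths can be illustrated as follows. The names `passive_path` and `fuse_by_mask` are hypothetical, and the active-path output is replaced by a simple stand-in (the input scaled by 10) so that the selection is easy to see.

```python
import numpy as np

def passive_path(F, W, b):
    """Lightweight linear projection applied to every row of F."""
    return F @ W.T + b

def fuse_by_mask(mask, active, passive):
    """Row i comes from `active` where mask[i] == 1 and from `passive` otherwise."""
    m = mask.astype(float)[:, None]
    return m * active + (1.0 - m) * passive

F = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mask = np.array([1, 0, 1])
active = F * 10.0                                   # stand-in for the hypergraph-path output
passive = passive_path(F, np.eye(2), np.zeros(2))   # identity projection for illustration
Z = fuse_by_mask(mask, active, passive)
```

Because the mask is binary, the weighted sum is an exact per-node selection rather than a blend, so each node keeps its original sequence position.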

[0099] To obtain a robust representation of the entire dialogue, the feature vectors $z_i$ of all nodes in the sequence are concatenated and spatially aligned, integrating the scattered node features into a unified state-enhanced representation matrix $Z$:

[0100] $Z = \mathrm{Concat}(z_1, z_2, \dots, z_N) \in \mathbb{R}^{N \times d}$,

[0101] where $Z$ is the final output state-enhanced representation matrix; $N$ is the total number of nodes; $d$ is the dimension of the feature vectors; and $\mathrm{Concat}(\cdot)$ is the feature concatenation and matrix recombination operator.

[0102] Through the aforementioned state splitting and mask recombination mechanisms, the system achieves enhanced modeling of high-variability regions and stable preservation of low-variability regions while ensuring overall structural coherence. The resulting representation matrix provides a unified, structured input foundation with differential representation capabilities for subsequent feature reconstruction modules.

[0103] (3) Gated Guidance Feature Reconstruction Module (GSR)

[0104] This module is the core of the system's feature reconstruction. Its task is to perform targeted feature compensation and structural reconstruction for data nodes with missing modalities through a gated routing mechanism. In real-world multimodal input environments, some modal information may be missing because of acquisition conditions or transmission factors, which leaves node representations incomplete and affects the consistency of the overall structure. When the missing information falls on high-energy state nodes in particular, key discriminative features may not be fully expressed. Therefore, this module constructs a reconstruction architecture based on a selective mixture of experts (MoE) and introduces global semantic anchors as guiding signals to achieve targeted recovery of missing information.

[0105] 1) High-energy node feature recovery based on hybrid expert routing

[0106] This module takes the global state-enhanced representation matrix $Z$ output by the preceding stage as its initial input. To implement the differentiated reconstruction strategy, the system constructs an MoE routing gate and uses the previously generated state mask sequence $\{m_i\}$ as a hard routing control signal to filter and dispatch the features of each node in the matrix $Z$.

[0107] For data node features $z_i$ identified as being in a high-energy state ($m_i = 1$), the system assigns them to the high-energy reconstruction expert path. To keep the reconstruction consistent with the overall structure, this path introduces a cross-modal attention mechanism that uses the current node's feature as the query vector, retrieves stable structural information from the global semantic anchor $G$, and generates a compensated representation for the missing modalities:

[0108] $r_i = \mathrm{Attn}\left(z_i W_Q,\; G W_K,\; G W_V\right)$,

[0109] where $r_i$ denotes the generated preliminary reconstructed feature vector; $z_i$ is the feature of the $i$-th node in the matrix $Z$ that has been determined to be in a high-energy state; $G$ is the globally stable semantic anchor signal; $W_Q$, $W_K$, and $W_V$ are learnable projection matrices; and $\mathrm{Attn}(\cdot)$ is the attention operation function. Subsequently, an adaptive gated residual structure injects the compensation signal into the original node representation, generating the final high-energy reconstructed feature $\tilde{z}_i^{\mathrm{high}}$:

[0110] $\tilde{z}_i^{\mathrm{high}} = z_i + g_i \odot r_i$,

[0111] where $\tilde{z}_i^{\mathrm{high}}$ denotes the final high-energy path reconstructed feature vector, and $g_i$ is the adaptive gating weight, generated by the sigmoid activation function, that controls the strength of semantic compensation. Through this gated injection mechanism, the system can enhance missing modalities in a targeted way while preserving the original structural information, avoiding the structural shifts introduced by direct replacement or simple filling.
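
The high-energy expert path can be sketched as single-head cross-attention followed by a gated residual. Several details are assumptions made only for this example: the anchor is treated as a small set of `M` anchor vectors (with a single anchor vector the attention weight is trivially 1), the gate is computed from the concatenation of the node feature and the compensation signal through a hypothetical matrix `Wg`, and all weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reconstruct_high_energy(z, G, Wq, Wk, Wv, Wg):
    """z: (d,) node feature; G: (M, d) anchor vectors; Wg: (2d, d) gate weights (assumed form)."""
    q = z @ Wq                                    # query from the current node
    K, V = G @ Wk, G @ Wv                         # keys and values from the anchor
    attn = softmax(K @ q / np.sqrt(q.shape[0]))   # attention weights over anchor vectors
    r = attn @ V                                  # preliminary compensation signal
    gate = sigmoid(np.concatenate([z, r]) @ Wg)   # per-dimension gate in (0, 1)
    return z + gate * r, r, gate                  # gated residual injection

rng = np.random.default_rng(0)
d, M = 4, 3
z = rng.normal(size=d)
G = rng.normal(size=(M, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wg = rng.normal(size=(2 * d, d))
out, r, gate = reconstruct_high_energy(z, G, Wq, Wk, Wv, Wg)

# With a zero value projection there is nothing to inject, so the node is unchanged.
unchanged, _, _ = reconstruct_high_energy(z, G, Wq, Wk, np.zeros((d, d)), Wg)
```

The residual form means the original node representation is never overwritten: the gate only scales how much anchor-derived compensation is added to it.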

[0112] 2) Low-energy expert semantic preservation and global sequence reconstruction

[0113] For data node features $z_i$ identified by the routing gate as being in a low-energy state ($m_i = 0$), the system dispatches them to the low-energy preservation expert path. This path maintains the semantic stability of the original features through a linear projection, avoiding complex computation or structural perturbation at low-intensity nodes. The computation is as follows:

[0114] $\tilde{z}_i^{\mathrm{low}} = W_l z_i + b_l$,

[0115] where $\tilde{z}_i^{\mathrm{low}}$ denotes the generated low-energy expert reconstructed feature vector, and $W_l$ and $b_l$ are the learnable weight matrix and bias term, respectively. This approach reduces the overall computational complexity while ensuring the continuity of node features, allowing the system to balance efficiency and stability during differentiated reconstruction.

[0116] After the feature updates on both the high-energy and low-energy paths are complete, this invention uses a mask fusion operator to globally integrate the high-energy path output $\tilde{z}_i^{\mathrm{high}}$ with the low-energy path output $\tilde{z}_i^{\mathrm{low}}$. This process uses the state mask $m_i$ to map the node features processed by the different paths back to their original sequence positions, generating the final reconstructed feature matrix $\tilde{Z}$:

[0117] $\tilde{Z} = \left[\, m_i \, \tilde{z}_i^{\mathrm{high}} + (1 - m_i) \, \tilde{z}_i^{\mathrm{low}} \,\right]_{i=1}^{N}$,

[0118] where $\tilde{Z}$ is the output reconstructed feature matrix and $N$ is the total number of nodes in the sequence. This integration mechanism allows the system to compensate for key information accurately while maintaining the coherence and stability of global semantics under severe modality loss.
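
Hard routing differs from soft blending in that each expert only ever sees the nodes routed to it. The sketch below shows this dispatch-and-scatter pattern with index masks; `route_and_merge` and the two toy experts are hypothetical stand-ins, not the experts of the invention.

```python
import numpy as np

def route_and_merge(Z, mask, high_expert, low_expert):
    """Send each row of Z to exactly one expert, then scatter results back in place."""
    out = np.empty_like(Z)
    hi = mask == 1
    if hi.any():
        out[hi] = high_expert(Z[hi])       # expensive path runs on high-energy rows only
    if (~hi).any():
        out[~hi] = low_expert(Z[~hi])      # cheap path runs on the remaining rows
    return out

Z = np.arange(8, dtype=float).reshape(4, 2)
mask = np.array([1, 0, 0, 1])
merged = route_and_merge(Z, mask, lambda x: x * 10.0, lambda x: x + 1.0)
```

Since the costly expert is evaluated on the selected subset only, its cost scales with the number of high-energy nodes rather than with the sequence length.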

[0119] Through the aforementioned dual-path reconstruction and global reorganization strategies, the system can achieve differentiated feature recovery and overall structural consistency maintenance under the condition of incomplete modality, providing a stable, complete and discriminative multimodal representation foundation for subsequent prediction and inference modules.

[0120] (4) Multi-task discrimination and reasoning module (MPC)

[0121] This module is the core of the system's decision-making and optimization. Its main task is to refine the global dependencies of the reconstructed feature sequence and map it to a predefined task label space. Simultaneously, it enhances the model's robust training capability under modality-deficient conditions through a multi-task joint loss function. The multimodal inference result depends not only on the reconstruction quality of individual nodes but also on the consistency of the entire input sequence in its long-range structure. Therefore, this module constructs a global dependency modeling mechanism to perform secondary optimization of the reconstructed features, reducing local biases that may be introduced by the preceding reconstruction process and improving the final discrimination accuracy.

[0122] 1) Global Dependency Refinement and Feature Evolution Modeling

[0123] The system first feeds the reconstructed feature matrix $\tilde{Z}$ into the global temporal refinement layer. To ensure that long-range structural relationships between nodes are preserved after processing by the differentiated expert paths, this invention introduces a multi-head self-attention mechanism (MHSA) to perform secondary modeling of the full-sequence features. This mechanism captures cross-node dependencies in different subspaces through parallel attention heads, achieving global feature reshaping and information redistribution, and thereby generates a robust feature matrix $O$ containing global structural dependency information:

[0124] $O = \mathrm{MHSA}(\tilde{Z})$,

[0125] where $O$ is the final output globally enhanced feature matrix; $\tilde{Z}$ is the reconstructed feature matrix; and $\mathrm{MHSA}(\cdot)$ is the multi-head self-attention operation function. Through this global modeling process, the system can smooth out local feature fluctuations caused by the repair of missing modalities and strengthen long-distance structural correlations, making the final representation more stable and consistent.
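
For reference, standard scaled dot-product multi-head self-attention over a node sequence can be written in a few lines of NumPy. The weight matrices here are random placeholders and the function name is hypothetical; the source does not specify head count, masking, or normalization layers, so none are assumed beyond the basic operation.

```python
import numpy as np

def split_heads(M, num_heads):
    """(n, d) -> (num_heads, n, d // num_heads)."""
    n, d = M.shape
    return M.reshape(n, num_heads, d // num_heads).transpose(1, 0, 2)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    n, d = X.shape
    dh = d // num_heads
    Qh, Kh, Vh = (split_heads(X @ W, num_heads) for W in (Wq, Wk, Wv))
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)        # (heads, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    ctx = (weights @ Vh).transpose(1, 0, 2).reshape(n, d)    # concatenate heads
    return ctx @ Wo

rng = np.random.default_rng(1)
n, d, heads = 5, 8, 2
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
X_same = np.tile(rng.normal(size=(1, d)), (n, 1))            # identical rows
out = multi_head_self_attention(X_same, Wq, Wk, Wv, Wo, heads)
```

When all input rows are identical, every attention average returns the same value vector, so the output must equal `X_same @ Wv @ Wo`. This gives a simple correctness check for the head splitting and merging.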

[0126] 2) Task-oriented prediction and discrimination output

[0127] After obtaining the globally enhanced feature sequence, this invention constructs, for each data node feature $o_i$, a task-oriented multilayer perceptron (MLP) as the discriminative output layer. Through multi-layer nonlinear transformations, the high-dimensional latent features are mapped to the predefined task label space, generating the predicted probability distribution over the categories:

[0128] $\ell_i = W_2 \, \phi\left(W_1 o_i + b_1\right) + b_2$,

[0129] $\hat{y}_i = \mathrm{Softmax}(\ell_i)$,

[0130] where $\hat{y}_i$ denotes the predicted probability of the $i$-th data node belonging to each task category; $\ell_i$ denotes the class logits; $W_1$, $W_2$ and $b_1$, $b_2$ are the weight matrices and bias vectors of the classifier, respectively; $\phi(\cdot)$ is the activation function; and $\mathrm{Softmax}(\cdot)$ is the normalized probability mapping function. The system selects the category with the highest predicted probability as the final task prediction for that node.
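
A two-layer classifier head with a softmax output can be illustrated as below. The single hidden layer, the ReLU activation, and the hand-picked weights are assumptions chosen so the result can be verified by hand; the function name `classify` is hypothetical.

```python
import numpy as np

def classify(o, W1, b1, W2, b2):
    hidden = np.maximum(W1 @ o + b1, 0.0)     # hidden layer, ReLU assumed
    logits = W2 @ hidden + b2                 # class logits
    e = np.exp(logits - logits.max())         # numerically stable softmax
    probs = e / e.sum()
    return probs, int(np.argmax(probs))       # predicted label = most probable class

W1, b1 = np.eye(2), np.zeros(2)
W2 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b2 = np.zeros(3)
probs, label = classify(np.array([1.0, 2.0]), W1, b1, W2, b2)
```

With these weights the logits are `[1, 2, 3]`, so the third class (index 2) is selected.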

[0131] 3) Multi-task joint loss function optimization strategy

[0132] To ensure that the model can still achieve high-fidelity feature reconstruction and stable task prediction in modality-deficient environments, this invention designs a multi-task joint loss function $\mathcal{L}_{\mathrm{total}}$ and performs end-to-end joint optimization of the entire network. The loss function consists of several functionally distinct sub-terms:

[0133] $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda_1 \mathcal{L}_{\mathrm{SAS}} + \lambda_2 \mathcal{L}_{\mathrm{KL}} + \lambda_3 \mathcal{L}_{\mathrm{rec}}$,

[0134] where the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ minimizes the deviation between the predicted results and the true labels, ensuring accurate discrimination. The spectral selection loss $\mathcal{L}_{\mathrm{SAS}}$ supervises the state gating mechanism in the SAS module, improving the stability of the high-energy / low-energy node partition and avoiding path deviations caused by state misclassification. The distribution alignment loss $\mathcal{L}_{\mathrm{KL}}$ uses the Kullback-Leibler divergence to measure the difference between the output distribution of the high-energy reconstruction expert and the global semantic anchor $G$, keeping the reconstruction process consistent in probability space and preventing feature drift. The feature reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ uses the mean squared error (MSE) to measure the distance between the reconstructed features $\tilde{Z}$ and the feature representation under complete-modality conditions, directly improving the fidelity with which the model restores missing information. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are preset hyperparameters that balance the weights of the different optimization objectives. Through this joint optimization strategy, the system forms a collaborative mechanism of "task-guided reconstruction and reconstruction-enhanced discrimination" during training, which enables the model to maintain structural stability and prediction reliability even under extreme modality incompleteness, thereby improving overall robustness.
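
The way the four sub-terms combine into one scalar can be sketched as follows. The source does not give the exact form of the spectral selection loss, so binary cross-entropy between a soft gate probability and reference state labels is used here purely as an assumption; the KL direction, the default `lambdas`, and all function names are likewise illustrative.

```python
import numpy as np

EPS = 1e-12  # guards against log(0)

def cross_entropy(probs, labels):
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + EPS)))

def binary_cross_entropy(p, target):
    # Assumed form of the spectral selection loss (not specified in the source).
    return float(-np.mean(target * np.log(p + EPS) + (1 - target) * np.log(1 - p + EPS)))

def kl_divergence(p, q):
    return float(np.sum(p * (np.log(p + EPS) - np.log(q + EPS))))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def joint_loss(probs, labels, gate_prob, gate_ref, p_high, p_anchor,
               recon, full, lambdas=(0.5, 0.1, 1.0)):
    l1, l2, l3 = lambdas
    return (cross_entropy(probs, labels)
            + l1 * binary_cross_entropy(gate_prob, gate_ref)
            + l2 * kl_divergence(p_high, p_anchor)
            + l3 * mse(recon, full))

# Toy case: aligned distributions and perfect reconstruction leave only CE and gate terms.
dist = np.array([0.25, 0.75])
feat = np.ones((2, 3))
total = joint_loss(np.array([[0.5, 0.5]]), np.array([0]),
                   np.array([0.5]), np.array([1.0]),
                   dist, dist, feat, feat)
```

In the toy case the KL and MSE terms vanish, so the total is `ln 2` from the classifier plus `0.5 * ln 2` from the gate term.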

[0135] Experimental verification

[0136] To systematically verify the performance advantages of the method of this invention in representation learning and modality missing reconstruction in complex multimodal environments, a multimodal benchmark dataset containing multi-class missing scenarios was constructed and adopted. This dataset contains three types of modal information: ① Text modality: deep semantic features are extracted from the input text, and global semantic anchors are constructed simultaneously as stable reference signals for cross-modal reconstruction; ② Audio modality: spectrograms and prosodic features (such as pitch, intensity, and speech rate) are extracted using speech processing tools to characterize the structural changes in the auditory dimension; ③ Visual modality: facial action units (AUs), pose, and eye-tracking features are extracted using visual analysis techniques to describe the dynamic patterns in the visual dimension. The dataset contains thousands of multimodal sample segments and is divided into training and test sets. To verify the robustness of the model, a random missing modality ratio ranging from 10% to 70% was artificially set in the experiments to simulate various incomplete information scenarios in complex environments. Task labels cover multiple categories to verify the model's generalization ability in multi-class discrimination tasks.

[0137] To comprehensively evaluate the performance of the method of this invention, three representative multimodal modeling methods were selected as baselines and compared with the proposed method: ① Simple-RNN: This method uses a unidirectional long short-term memory network to perform temporal modeling of concatenated multimodal features. It does not include a compensation mechanism for missing modalities and represents a basic temporal modeling method. ② Tensor-Fusion: This method achieves tensor-level fusion between modalities based on outer product operations and models the correlation between different modalities through static high-order interactions, representing a feature-level deep fusion method. ③ RGAT-Expert: This method models the dependencies between multiple entities based on relational graph convolution and self-attention mechanisms, and uses conventional interpolation strategies to handle missing modalities, representing an advanced method combining graph structures and attention mechanisms. ④ The method of this invention (SAGE-Net): This is the complete technical framework proposed in this invention. It identifies high-energy state nodes through a spectrum-driven state diagnosis mechanism and uses a hybrid expert-based gated reconstruction architecture (GSR) combined with global semantic anchors for directional feature recovery.

[0138] All methods were evaluated under the same hardware environment and data partitioning conditions. Evaluation metrics included: ① Accuracy (ACC, %): the proportion of correctly predicted samples out of the total test samples, used to measure overall discriminative ability; ② Weighted F1 score (W-F1, %): the harmonic mean of precision and recall, weighted by the proportion of samples from each class, used to measure the model's stability under class imbalance and modality missing conditions; ③ Reconstruction fidelity (Recon-Fid, %): the cosine similarity between the output features of the reconstructed path and the features under complete modality conditions, used to evaluate the feature recovery quality of the GSR module; ④ Diagnostic consistency (Diag-Acc, %): the degree of consistency between the high-energy / low-energy state partitioning results generated by the model and the manually labeled structural saliency labels, used to measure the state recognition ability of the SAS module; ⑤ Inference time (Time, ms): the average time required to process a single data node feature, used to evaluate the real-time deployment efficiency of the system.
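
Two of the metrics above have simple closed forms that can be checked by hand. The sketch below computes a support-weighted F1 score and the cosine similarity used for reconstruction fidelity in plain Python; the function names and the toy labels are illustrative and are not taken from the experiments.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n * f1
    return total / len(y_true)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

wf1 = weighted_f1([0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 0, 2])
```

In the toy example the per-class F1 scores are 1/2, 2/3, and 1 with supports 2, 3, and 1, giving a weighted F1 of 4/6.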

[0139] Table 1. Comparison of data from different methods under five major indicators.

[0140]

[0141] Since Simple-RNN and Tensor-Fusion models are designed as end-to-end discriminative classifiers, their goal is to directly output the task category result of the current data node. They do not have the ability to actively repair missing modalities or diagnose states. Therefore, the metrics "Reconstruction Fidelity" and "Diag-Acc" used to measure the quality of feature reconstruction and structure awareness are not applicable to them. At the same time, as a typical black-box time series model, the internal hidden states of Simple-RNN are difficult to directly map into interpretable structural diagnostic results. Although Tensor-Fusion achieves multimodal fusion, its fusion method is based on the static mapping of tensor outer products, lacking an endogenous state discrimination mechanism and unable to generate structured state masks. Therefore, it lacks an effective evaluation basis for the state-diagnostic consistency rate metric.

[0142] The five indicators (accuracy, weighted F1 score, reconstruction fidelity, state diagnosis consistency rate, and inference time) differ in dimension and in evaluation direction, which makes direct unified visualization impossible, so this invention standardizes them. For positive indicators (higher values are better, including accuracy, weighted F1 score, reconstruction fidelity, and state diagnosis consistency rate), the normalized results remain unchanged. For negative indicators (lower values are better, i.e., inference time), min-max normalization is performed first, and the direction is then unified through "1 - normalized value," so that all indicators follow a "higher values are better" format. To avoid distortion of the radar chart caused by the weakest model normalizing to 0 on some indicators, this invention further introduces a linear interval compression mechanism that maps the normalized results to the interval [0.2, 1.0], thereby ensuring the integrity and comparability of the outlines of each method in the visualization.
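
The standardization procedure can be expressed compactly. The sketch below assumes min-max normalization per indicator across methods and, as a choice made only for this example, maps a constant indicator to the upper bound; the function name and the toy values are hypothetical and are not the experimental data.

```python
def normalize_for_radar(values, higher_is_better=True, low=0.2, high=1.0):
    """Min-max normalise, flip lower-is-better indicators, compress into [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [high for _ in values]          # assumption: constant indicator maps to the top
    scaled = [(v - vmin) / (vmax - vmin) for v in values]
    if not higher_is_better:
        scaled = [1.0 - s for s in scaled]     # unify direction: larger is better
    return [low + (high - low) * s for s in scaled]

# Hypothetical indicator values for three methods.
time_scores = normalize_for_radar([10.0, 20.0, 30.0], higher_is_better=False)
acc_scores = normalize_for_radar([50.0, 75.0, 100.0])
```

The fastest method receives 1.0 and the slowest 0.2 instead of 0, so no method collapses to the center of the radar chart.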

[0143] The experimental results are shown in Table 1 and Figures 2 to 6. The method of this invention demonstrates clear advantages in overall performance. The basic temporal model, Simple-RNN, has limited overall discriminative performance because it fails to exploit the complementary information between modalities. Tensor-Fusion achieves static fusion but cannot model dynamic cross-node associations, which limits its performance in complex long-sequence scenarios, and it has no repair mechanism. RGAT-Expert improves on the previous two by modeling dependencies between entities through graph structures, which confirms the importance of relational modeling. However, it only employs a general interpolation strategy under modality-missing conditions and lacks adaptive state splitting and targeted repair capabilities, so it remains insufficient in accuracy, reconstruction fidelity, and state diagnosis consistency.

[0144] It is worth noting that, although the method of this invention includes multi-level computational modules such as graph attention modeling, hybrid expert (MoE) routing, and global semantic anchor guidance, its inference time (172 ms) is only slightly higher than that of the much simpler Tensor-Fusion and is significantly lower than that of RGAT-Expert (245 ms), which also uses complex graph structure modeling. This advantage stems from the spectrum-driven adaptive path selection mechanism: the system performs structural energy diagnosis on nodes through the SAS module, applies only a lightweight linear mapping to the majority of low-energy nodes, and concentrates higher-order modeling and semantic repair computation on a small number of high-energy nodes, realizing an "on-demand computation" resource allocation strategy. This mechanism avoids imposing the higher-order computational burden on all nodes and improves overall efficiency while preserving modeling capability.

[0145] In summary, the proposed method (SAGE-Net) achieves systematic improvements in several core dimensions: both accuracy (91.2%) and weighted F1 score (90.5%) are optimal, verifying its multimodal collaborative discrimination capability; the reconstruction fidelity reaches 91.4%, indicating that the gated semantic reconstruction mechanism can achieve high-quality feature recovery; the state diagnosis consistency rate is 93.6%, indicating that the spectrum-driven diagnostic mechanism has stable and reliable structure perception capability; and the inference time remains at the millisecond level, meeting the deployment requirements of real-time interactive systems.

[0146] Overall, the experimental results verify the advantages of this invention in key capabilities such as multimodal feature modeling, missing information reconstruction, and structural state perception, and show that it has technical value and engineering application potential for achieving robust discrimination and high-fidelity reconstruction in complex multimodal environments.

[0147] Example 2

[0148] This embodiment provides a multimodal representation learning inference system based on spectrum-driven adaptive selection and gated reconstruction.

[0149] A computer-readable storage medium stores a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the aforementioned multimodal representation learning inference method based on spectrum-driven adaptive selection and gated reconstruction.

[0150] A terminal device includes a processor and a computer-readable storage medium, the processor being configured to execute instructions, and the computer-readable storage medium being configured to store a plurality of instructions adapted to be loaded by the processor to execute the aforementioned multimodal representation learning inference method based on spectrum-driven adaptive selection and gated reconstruction.

[0151] The above are all preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Therefore, all equivalent changes made in accordance with the structure, shape and principle of the present invention should be covered within the scope of protection of the present invention.

Claims

1. A multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction, characterized in that it comprises: acquiring heterogeneous multimodal data, including text, audio, image, and video modal data; performing multimodal adaptive feature encoding on the acquired data, including heterogeneous multimodal feature fusion and global modeling, entity and context relationship graph construction, and feature alignment and spatial reshaping; performing spectrum-driven adaptive state perception on the encoded data, including sentiment saliency diagnosis based on spectral domain energy distribution, higher-order dependency modeling based on weighted hypergraphs, and semantic stability maintenance based on linear mapping; performing gated guided feature reconstruction on the perceived features, including high-energy node feature recovery based on hybrid expert routing, low-energy expert semantic preservation, and global sequence reassembly; performing multi-task discrimination and reasoning on the reconstructed features; and outputting the reasoning result; wherein the high-energy node feature recovery based on hybrid expert routing comprises: taking the output global state-enhanced representation matrix $Z$ as the initial input; constructing an MoE routing gate to implement the differentiated reconstruction strategy, and using the generated state mask sequence $\{m_i\}$ as a hard routing control signal to filter and dispatch the node features in the matrix $Z$; assigning the data node features $z_i$ identified as being in a high-energy state ($m_i = 1$) to a high-energy reconstruction expert path; and, to keep the reconstruction consistent with the overall structure, introducing a cross-modal attention mechanism that uses the current node's feature as the query vector, retrieves stable structural information from the global semantic anchor $G$, and generates a compensated representation for the missing modalities: $r_i = \mathrm{Attn}\left(z_i W_Q,\; G W_K,\; G W_V\right)$, where $r_i$ denotes the generated preliminary reconstructed feature vector; $z_i$ is the feature of the $i$-th node in the matrix $Z$ that has been determined to be in a high-energy state; $G$ is the globally stable semantic anchor signal; $W_Q$, $W_K$, and $W_V$ are learnable projection matrices; and $\mathrm{Attn}(\cdot)$ is the attention operation function; the compensation signal is then injected into the original node representation through an adaptive gated residual structure to generate the final high-energy reconstructed feature $\tilde{z}_i^{\mathrm{high}}$: $\tilde{z}_i^{\mathrm{high}} = z_i + g_i \odot r_i$, where $\tilde{z}_i^{\mathrm{high}}$ denotes the final high-energy path reconstructed feature vector, and $g_i$ is the adaptive gating weight, generated by the sigmoid activation function, that controls the strength of semantic compensation; the low-energy expert semantic preservation and global sequence reassembly comprise: dispatching the data node features $z_i$ identified by the routing gate as being in a low-energy state ($m_i = 0$) to a low-energy preservation expert path, and maintaining the semantic stability of the original features through a linear projection, avoiding structural perturbation at low-intensity nodes, as follows: $\tilde{z}_i^{\mathrm{low}} = W_l z_i + b_l$, where $\tilde{z}_i^{\mathrm{low}}$ denotes the generated low-energy expert reconstructed feature vector, and $W_l$ and $b_l$ are the learnable weight matrix and bias term, respectively; and, after the feature updates on the high-energy path and the low-energy path are complete, globally integrating the high-energy path output $\tilde{z}_i^{\mathrm{high}}$ with the low-energy path output $\tilde{z}_i^{\mathrm{low}}$ using a mask fusion operator, whereby the node features processed by the different paths are mapped back to their original sequence positions according to the state mask, generating the final reconstructed feature matrix $\tilde{Z}$: $\tilde{Z} = \left[\, m_i \, \tilde{z}_i^{\mathrm{high}} + (1 - m_i) \, \tilde{z}_i^{\mathrm{low}} \,\right]_{i=1}^{N}$, where $\tilde{Z}$ is the output reconstructed feature matrix and $N$ is the total number of nodes in the sequence.

2. The multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to claim 1, characterized in that the heterogeneous multimodal feature fusion and global modeling comprises: encoding the multimodal heterogeneous data with a bidirectional gated recurrent unit (Bi-GRU), whose gated long- and short-term memory mechanism models the bidirectional dependencies between nodes at any time step in the sequence, thereby obtaining global node features with context-aware capability; first, for the sequence node data acquired at time step $i$, performing raw feature extraction for each modality, defining a binary existence indicator factor $\delta_i^{k}$ for each modality $k$, and constructing a preliminary multimodal input vector from the products of the indicator factors and the original modal features: $x_i = \left[\, \delta_i^{a} a_i \,\Vert\, \delta_i^{v} v_i \,\Vert\, \delta_i^{t} t_i \,\right]$, where $x_i$ denotes the multimodal fusion input vector of the $i$-th node; $a_i$, $v_i$, and $t_i$ denote the extracted original feature vectors of the audio, visual, and text modalities, respectively; and $\delta_i^{a}$, $\delta_i^{v}$, $\delta_i^{t}$ are the binary existence indicator factors; to enable the model to capture the identity features and time-series information of different entities, mapping the identity identifier to an embedding vector in a continuous space, fusing it with the multimodal input vector $x_i$, and feeding the result into the bidirectional gated recurrent network to model forward and backward contextual information simultaneously, expressed as: $h_i = \mathrm{BiGRU}\left(x_i,\; \overrightarrow{h}_{i-1},\; \overleftarrow{h}_{i+1}\right)$, where $h_i$ denotes the context-aware node embedding generated by encoding the $i$-th node; $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$ denote the forward and backward states at the adjacent time steps, respectively; and $\mathrm{BiGRU}(\cdot)$ denotes the bidirectional gated recurrent operation function; while acquiring the sequence features, simultaneously extracting the deep semantic information of the text modality and mapping it through a linear transformation layer to a global semantic anchor signal $G$, which serves as the reference for cross-modal guidance and information compensation: $G = W_g t + b_g$, where $G$ denotes the generated global text semantic anchor vector; $t$ is the input original text feature vector; $W_g$ is a learnable linear projection weight matrix; and $b_g$ is a bias vector; whereby the original heterogeneous multimodal data is transformed through the encoding process into a node representation matrix with contextual relevance.

3. The multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to claim 2, characterized in that the entity and context relationship graph construction includes, based on the obtained node representations, introducing a sliding-window-based dynamic heterogeneous graph construction mechanism and jointly modeling the identity dimension and the temporal dimension in a non-Euclidean space through a relational graph convolutional network. First, a dynamic sliding window of size $w$ is defined; for any central node $i$, its neighborhood node set is defined as:

$$\mathcal{N}_i = \{\, j \mid i - w \le j \le i + w \,\}$$

By identifying the identity identifier of each node in the neighborhood, connection edges reflecting the interaction relationships among multiple entities are constructed, and each edge is assigned a specific relation type according to the correspondence between the entities of its two end nodes. At the same time, connection edges carrying time-series labels are established between nodes to build a topology that reflects multi-directional contextual logic, where the set of time-series labels covers the forward, current, and backward directions. To extract and aggregate the features of the heterogeneous relations, a two-layer parameter-shared relational graph convolutional network computes the identity relationship features $h_i^{s}$ and the context features $h_i^{c}$ separately. The feature update process is:

$$h_i^{s} = \sigma\!\Big( \sum_{r \in \mathcal{R}_s} \sum_{j \in \mathcal{N}_i^{r}} W_r^{s}\, h_j \Big); \qquad h_i^{c} = \sigma\!\Big( \sum_{r \in \mathcal{R}_c} \sum_{j \in \mathcal{N}_i^{r}} W_r^{c}\, h_j \Big)$$

where $h_i^{s}$ denotes the spatial feature vector obtained by aggregating the $i$-th node along the identity relationship dimension; $h_i^{c}$ denotes the feature vector obtained by aggregating the $i$-th node along the temporal dimension; $\mathcal{N}_i^{r}$ denotes the set of neighboring nodes of node $i$ under relation type $r$; $h_j$ is the initially output $j$-th context-aware node embedding; $W_r^{s}$ and $W_r^{c}$ are the learnable transformation weight matrices for the different relation types; and $\sigma(\cdot)$ denotes a nonlinear activation function. After the features of the two dimensions are obtained, a nonlinear fusion layer fuses the identity relationship features with the contextual temporal features to generate a comprehensive node feature vector $z_i$ for the subsequent spectral domain state diagnosis:

$$z_i = \sigma\!\left( W_f \left[\, h_i^{s} \,\Vert\, h_i^{c} \,\right] + b_f \right)$$

where $z_i$ denotes the comprehensive discourse feature vector after integrating the multidimensional topological associations; $\Vert$ denotes the feature concatenation operation; $W_f$ is the global feature transformation matrix; and $b_f$ is a learnable bias term used to adjust the distribution.
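
The sliding-window graph construction of claim 3 can be illustrated with the following minimal Python sketch, given under stated assumptions rather than as the patented implementation. The function name `build_edges`, the string encoding of relation types, and the convention that an earlier neighbor is tagged `backward` and a later one `forward` are all assumptions; the relational graph convolution itself is omitted.

```python
from typing import List, Tuple

# (center node, neighbor node, identity relation type, temporal tag)
Edge = Tuple[int, int, str, str]


def build_edges(speakers: List[str], window: int) -> List[Edge]:
    """Connect every node to the nodes inside its sliding window.

    Each edge receives two labels: an identity relation type derived from
    the entity pair, and a temporal tag (backward, current, or forward)
    derived from the relative position of the neighbor.
    """
    edges: List[Edge] = []
    n = len(speakers)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            identity = f"{speakers[i]}->{speakers[j]}"
            if j < i:
                tag = "backward"   # assumed: neighbor precedes the center
            elif j > i:
                tag = "forward"    # assumed: neighbor follows the center
            else:
                tag = "current"    # self-connection
            edges.append((i, j, identity, tag))
    return edges


# Three utterances by entities A, B, A with a window of one step.
edges = build_edges(["A", "B", "A"], window=1)
```

Grouping the resulting edges by their identity label or by their temporal tag yields the per-relation neighbor sets over which the two aggregation paths operate.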

4. The multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to claim 3, characterized in that the feature alignment and spatial reshaping include, after the aggregation of multidimensional interaction dependencies is completed, introducing a spatial reshaping strategy based on an adaptive residual mechanism on top of the dual-path relation feature fusion. This strategy maps the fused features to a unified alignment space through nonlinear projection and complements them with the original node representations. First, a feature reshaping network consisting of a linear transformation layer and a normalization layer is constructed to regularize the geometric distribution of the nodes in the cross-modal alignment space. During reshaping, a residual connection structure consolidates the original node features $h_i$ with the fused features $z_i$:

$$\tilde{z}_i = \mathrm{LN}\!\left( h_i + \mathrm{MLP}(z_i) \right)$$

where $\tilde{z}_i$ denotes the enhanced node feature vector after spatial reshaping and semantic alignment; $h_i$ is the initial node representation obtained by bidirectional temporal encoding; $z_i$ is the comprehensive feature obtained by fusing relational topology and contextual logic; $\mathrm{MLP}(\cdot)$ denotes the multilayer perceptron mapping operation; and $\mathrm{LN}(\cdot)$ denotes the layer normalization operator. By introducing a contrastive learning mechanism to constrain the node feature manifold, the feature representations generated by the remaining effective modalities stay near the preset semantic manifold even when some modal information is missing, which provides a stable and information-dense structured input $\tilde{Z}$ for the subsequent state diagnosis.
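
The residual reshaping step of claim 4 can be sketched in Python as follows. This is a hedged illustration: `layer_norm` omits the learnable scale and shift that a full layer normalization would normally carry, `reshape_node` is a hypothetical name, and the `mlp` argument is a stand-in callable rather than the trained multilayer perceptron.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(vec: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def reshape_node(h: Vector, z: Vector, mlp: Callable[[Vector], Vector]) -> Vector:
    """Residual consolidation: LN(h + MLP(z))."""
    projected = mlp(z)
    return layer_norm([a + b for a, b in zip(h, projected)])


# Stand-in for the MLP: a fixed scaling, used only to make the sketch runnable.
toy_mlp = lambda v: [0.5 * x for x in v]

enhanced = reshape_node(h=[1.0, 2.0, 3.0], z=[2.0, 0.0, 4.0], mlp=toy_mlp)
```

The residual term keeps the original Bi-GRU embedding in the output even when the fused feature is degraded by a missing modality.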

5. The multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to claim 4, characterized in that the sentiment saliency diagnosis based on spectral domain energy distribution includes introducing an adaptive selection mechanism based on spectral domain energy analysis. This mechanism quantitatively assesses the energy fluctuation of the data nodes in the local relationship graph and thereby identifies high-energy and low-energy nodes. First, the feature evolution intensity of the comprehensive feature vectors within the preset neighborhood window $\mathcal{N}_i$ is measured by computing the spectral variance between each node and its neighboring context nodes, which quantifies the degree of change of the node on the spatial manifold:

$$S_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \left\lVert z_j - \mu_i \right\rVert_2^2$$

where $S_i$ denotes the spectral variance of the $i$-th node; $\mathcal{N}_i$ is the preset neighborhood set of the sliding window; $z_j$ denotes the comprehensive correlation feature of the $j$-th node within the neighborhood; $\mu_i$ denotes the mean vector of the features of all nodes within this neighborhood; and $|\mathcal{N}_i|$ is the total number of nodes in the neighborhood. After the spectral variance has been quantified, an emotion-triggered gating mechanism with a learnable threshold $\tau$ is constructed. It automatically classifies the nodes into different perceptual states and generates a state mask through binarization, which provides the routing instruction for the differentiated feature processing paths:

$$m_i = \begin{cases} 1, & S_i > \tau \\ 0, & \text{otherwise} \end{cases}$$

where $m_i$ denotes the state mask of the $i$-th node and $\tau$ is the learnable emotion trigger threshold; $m_i = 1$ indicates that the node is in an emotionally active state, and $m_i = 0$ indicates that the node is in an emotionally passive state.
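
A small worked example may make the diagnosis step of claim 5 concrete. The Python sketch below computes the neighborhood variance and the binary state mask for a toy sequence; the function names, the symmetric window that is clipped at the sequence ends, and the strict `>` comparison against the threshold are assumptions of this example. The threshold is fixed here, whereas the claim describes it as learnable.

```python
from typing import List

Vector = List[float]


def spectral_variance(features: List[Vector], i: int, window: int) -> float:
    """Mean squared distance of the window's nodes from the window mean."""
    lo, hi = max(0, i - window), min(len(features), i + window + 1)
    neigh = features[lo:hi]
    dim = len(features[0])
    mean = [sum(v[k] for v in neigh) / len(neigh) for k in range(dim)]
    total = sum(sum((v[k] - mean[k]) ** 2 for k in range(dim)) for v in neigh)
    return total / len(neigh)


def state_masks(features: List[Vector], window: int, tau: float) -> List[int]:
    """1 marks an active (high-energy) node, 0 a passive (low-energy) node."""
    return [
        1 if spectral_variance(features, i, window) > tau else 0
        for i in range(len(features))
    ]


# A flat sequence with a single abrupt change at position 3.
toy = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
masks = state_masks(toy, window=1, tau=1.0)
```

Only the nodes whose window touches the abrupt change at position 3 are flagged as active, while the flat start of the sequence stays passive.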

6. The multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to claim 5, characterized in that the higher-order dependency modeling based on weighted hypergraphs and the semantic stability maintenance based on linear mapping include first introducing a spectral domain convolution operation based on the Laplacian matrix to update the features of the high-energy nodes:

$$Z^{high} = \sigma\!\left( D_v^{-1/2}\, B\, W\, D_e^{-1}\, B^{\top} D_v^{-1/2}\, Z \right)$$

where $Z^{high}$ denotes the node enhancement features obtained after higher-order modeling; $Z$ is the input comprehensive correlation feature matrix; $B$ denotes the hypergraph incidence matrix; $W$ is a learnable weight matrix representing the importance of the hyperedges; $D_v$ and $D_e$ are the node degree matrix and the edge degree matrix of the hypergraph, respectively; and $\sigma(\cdot)$ is a nonlinear activation function. Then, a stable-path modeling strategy based on linear mapping is constructed for the low-energy state nodes, whose features are updated through a lightweight spatial projection:

$$z_i^{low} = W_l\, z_i + b_l$$

where $z_i^{low}$ denotes the updated feature vector of the low-energy state node; $z_i$ is the input comprehensive correlation feature; and $W_l$ and $b_l$ are the learnable weight matrix and bias term used for semantic preservation. Finally, a structured realignment layer integrates the active-path output $z_i^{high}$ with the passive-path output $z_i^{low}$. The integration uses the state mask $m_i$ as the control signal: based on the state label that each node obtained in the diagnostic phase, the features output by the two paths are selected position by position and restored to their original positions to generate the final state-enhanced feature vector:

$$o_i = m_i\, z_i^{high} + (1 - m_i)\, z_i^{low}$$

where $o_i$ denotes the final output feature of the $i$-th node; $m_i$ is the state mask corresponding to the node; $z_i^{high}$ is the high-energy node feature obtained from the higher-order hypergraph modeling; and $z_i^{low}$ is the low-energy node feature obtained from the linear mapping path. To obtain a robust representation of the entire dialogue, the feature vectors of all nodes in the whole sequence are concatenated and spatially aligned, which integrates the scattered node features into a unified state-enhanced representation matrix:

$$O = \mathrm{Concat}(o_1, o_2, \dots, o_N) \in \mathbb{R}^{N \times d}$$

where $O$ is the final output state-enhanced representation matrix; $N$ is the total number of nodes; $d$ is the dimension of the feature vector; and $\mathrm{Concat}(\cdot)$ is the feature concatenation and matrix recombination operator.
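
The two processing paths and the mask-controlled merge of claim 6 can be illustrated with the pure-Python sketch below. It assumes the commonly used normalized hypergraph propagation operator built from the incidence matrix, a diagonal matrix of hyperedge weights, and the two degree matrices, with ReLU as the activation; the helper names and the toy weights are hypothetical, and no learnable feature projection is applied on the active path.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def diag(values: List[float]) -> Matrix:
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def hypergraph_conv(z: Matrix, inc: Matrix, edge_w: List[float]) -> Matrix:
    """Active path: ReLU(Dv^-1/2 B W De^-1 B^T Dv^-1/2 Z)."""
    n, m = len(inc), len(inc[0])
    dv = [sum(edge_w[e] * inc[v][e] for e in range(m)) for v in range(n)]
    de = [sum(inc[v][e] for v in range(n)) for e in range(m)]
    dv_is = diag([1.0 / math.sqrt(d) if d > 0 else 0.0 for d in dv])
    de_inv = diag([1.0 / d if d > 0 else 0.0 for d in de])
    inc_t = [list(col) for col in zip(*inc)]
    op = matmul(dv_is, matmul(inc, matmul(diag(edge_w),
                matmul(de_inv, matmul(inc_t, dv_is)))))
    return [[max(0.0, v) for v in row] for row in matmul(op, z)]


def linear(z: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """Passive path: a single affine map applied to every node."""
    return [[sum(wr[k] * row[k] for k in range(len(row))) + bi
             for wr, bi in zip(w, b)] for row in z]


def gated_merge(high: Matrix, low: Matrix, mask: List[int]) -> Matrix:
    """Pick the active-path row where the mask is 1, else the passive row."""
    return [h if m == 1 else l for h, l, m in zip(high, low, mask)]


Z = [[2.0, 0.0], [0.0, 4.0]]            # two nodes, two features
B = [[1.0], [1.0]]                      # one hyperedge containing both nodes
high = hypergraph_conv(Z, B, edge_w=[1.0])
low = linear(Z, w=[[1.0, 0.0], [0.0, 1.0]], b=[0.5, 0.5])
out = gated_merge(high, low, mask=[1, 0])
```

Because the merge keeps node order, the output matrix can be passed on directly as the state-enhanced sequence representation.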

7. The multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction according to claim 6, characterized in that the multi-task discrimination and reasoning based on reconstructed features include first taking the reconstructed feature matrix $R$ as input and feeding it into a global temporal refinement layer, in which a multi-head self-attention mechanism (MHSA) performs secondary modeling of the full-sequence features. Parallel attention heads capture cross-node dependencies in different subspaces, so that feature reshaping and information redistribution are achieved globally, generating a robust feature matrix that contains global structural dependency information:

$$F = \mathrm{MHSA}(R)$$

where $F$ is the final output global enhanced feature matrix; $R$ is the reconstructed feature matrix; and $\mathrm{MHSA}(\cdot)$ is the multi-head self-attention operation. Then, after the global enhanced feature sequence is obtained, a task-oriented multilayer perceptron (MLP) is constructed as the discriminative output layer for each data node $i$. Through multi-layer nonlinear transformations, the high-dimensional latent features are mapped to a predefined task label space to generate the predicted probability distribution over the corresponding categories:

$$l_i = W_2\, \sigma\!\left( W_1 f_i + b_1 \right) + b_2; \qquad \hat{y}_i = \mathrm{softmax}(l_i)$$

where $\hat{y}_i$ denotes the predicted probability of the $i$-th data node belonging to each task category; $l_i$ is the vector of class logits; $W_1$, $W_2$ and $b_1$, $b_2$ are the weight matrices and bias vectors of the classifier, respectively; $\sigma(\cdot)$ is the activation function; and $\mathrm{softmax}(\cdot)$ is the normalized probability mapping function. Finally, to ensure high-fidelity feature reconstruction and stable task prediction even in modality-deficient environments, a multi-task joint loss function $\mathcal{L}$ is constructed to optimize the entire network jointly and end to end. The loss function is expressed as:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{sas} + \lambda_2 \mathcal{L}_{kl} + \lambda_3 \mathcal{L}_{rec}$$

where the cross-entropy loss $\mathcal{L}_{ce}$ minimizes the deviation between the predicted result and the true label; the spectral selection loss $\mathcal{L}_{sas}$ supervises the state gating mechanism in the SAS module; the distribution alignment loss $\mathcal{L}_{kl}$ uses the Kullback-Leibler divergence to measure the difference between the output distribution of the high-energy reconstruction expert and the global semantic anchor $g$; the feature reconstruction loss $\mathcal{L}_{rec}$ uses the mean squared error to measure the distance between the reconstructed features $R$ and the modality-complete feature representation; and $\lambda_1$, $\lambda_2$, $\lambda_3$ are preset hyperparameters.
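
To show how the loss terms of claim 7 combine, the following Python sketch evaluates a weighted joint loss for a single node. It is an assumption-laden illustration: the spectral selection term is passed in as a precomputed number because its exact form is not given in the claim, both arguments of the Kullback-Leibler divergence are assumed to be already normalized distributions, the cross-entropy term is assumed to carry unit weight, and all function names and weights are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Vector, label: int) -> float:
    return -math.log(probs[label])


def kl_divergence(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(l_ce: float, l_sas: float, l_kl: float, l_rec: float,
               lam1: float, lam2: float, lam3: float) -> float:
    """Cross-entropy plus three weighted auxiliary terms."""
    return l_ce + lam1 * l_sas + lam2 * l_kl + lam3 * l_rec


probs = softmax([0.0, 0.0])                   # uniform over two classes
l_ce = cross_entropy(probs, label=0)          # ln 2
l_kl = kl_divergence([0.5, 0.5], [0.5, 0.5])  # identical distributions
l_rec = mse([1.0, 2.0], [1.0, 4.0])           # reconstructed vs. complete
total = joint_loss(l_ce, 0.0, l_kl, l_rec, lam1=0.1, lam2=0.5, lam3=0.25)
```

In training, the three weights would be tuned so that the auxiliary terms regularize the gating and reconstruction without dominating the classification objective.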

8. A multimodal representation learning and inference system based on spectrum-driven adaptive selection and gated reconstruction, which executes the multimodal representation learning and inference method based on spectrum-driven adaptive selection and gated reconstruction as described in claim 1, characterized in that it comprises: a data acquisition module configured to acquire heterogeneous multimodal data; a feature encoding module configured to perform multimodal adaptive feature encoding based on the acquired data, including heterogeneous multimodal feature fusion and global modeling, entity and context relationship graph construction, and feature alignment and spatial reshaping; a state-aware module configured to perform spectrum-driven adaptive state awareness based on the encoded data, including sentiment saliency diagnosis based on spectral domain energy distribution, higher-order dependency modeling based on weighted hypergraphs, and semantic stability maintenance based on linear mapping; a feature reconstruction module configured to perform gated guided feature reconstruction based on the perceived features, including high-energy node feature recovery based on hybrid expert routing, low-energy expert semantic preservation, and global sequence recombination; and a reasoning module configured to perform multi-task discrimination and reasoning based on the reconstructed features and to output the reasoning results.