A structure-aware multi-modal open intent recognition method and related device
By employing a structure-aware multimodal open intent recognition method, which combines cross-modal fusion and structural alignment loss, the problem of compression of modality-specific information is solved, achieving high-precision multimodal intent recognition and unknown intent detection, and improving the model's generalization ability and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTHWESTERN UNIV OF FINANCE & ECONOMICS
- Filing Date
- 2026-05-25
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention relates to the field of multimodal intent recognition technology, and in particular to a structure-aware multimodal open intent recognition method and related apparatus.
Background Technology
[0002] The multimodal open intent recognition task aims to utilize multiple modalities such as text, video, and audio to accurately classify known intents within a given distribution while effectively identifying unknown intents that emerge during the testing phase. This capability is crucial for building secure, reliable, and highly generalizable human-computer interaction systems. With the continuous development of applications such as financial services, intelligent customer service, and online consultation, the interaction between users and systems has gradually expanded from traditional single text input to multimodal inputs including text, audio, and video. Compared to relying solely on text information, multimodal data can characterize user intent from multiple perspectives. The text modality reflects semantic content, the audio modality provides tone and emotional information, and the video modality reflects behavioral state and environmental information. Therefore, multimodal information fusion helps improve the accuracy and robustness of intent recognition in complex scenarios.
[0003] In existing technologies, cross-modal alignment is a crucial step in multimodal representation learning. Current methods typically employ sample-level alignment strategies, achieving modality alignment by reducing the distance between samples of the same type or with the same label across different modalities. However, these methods often assume that different modalities consistently express the same intent, ignoring differences in expression granularity and semantic emphasis between modalities. This can easily lead to the compression of modality-specific information, thereby reducing the model's representational power and generalization performance.
Summary of the Invention
[0004] The purpose of this invention is to overcome the problems of the prior art and provide a structure-aware multimodal open intent recognition method and related apparatus.
[0005] The objective of this invention is achieved through the following technical solution: a structure-aware multimodal open intent recognition method, which includes the following steps: performing feature extraction on multimodal input data to obtain feature representations for the text modality, video modality, and audio modality; performing cross-modal fusion of the feature representations of each modality to obtain a fused representation; calculating a classification loss based on the fused representation; calculating, based on the feature representation of each modality, the class centroid of each known intent category in the corresponding modality, calculating the inter-class similarity of each modality based on the class centroids, and structurally aligning the inter-class similarity of each modality to construct a cross-modal structural alignment loss; jointly training the intent recognition model using the classification loss and the cross-modal structural alignment loss; and, during the inference phase, performing feature extraction and cross-modal fusion on the samples to be identified to obtain a fused representation, and using the trained intent recognition model to classify known intents or detect unknown intents on the fused representation.
[0006] In one example, the structural alignment of inter-class similarity for each modality includes: Using the inter-class similarity of the text modality as a reference, the inter-class similarity of the video modality and the audio modality are constrained to be consistent with the inter-class similarity of the text modality.
[0007] In one example, the cross-modal fusion of feature representations for each modality includes: Using the feature representation of the text modality as the query vector, cross-attention interactions are performed with the feature representations of the video modality and the audio modality, respectively, to obtain text-enhanced video features and text-enhanced audio features; The text modality feature representation is residually fused with text-enhanced video features and text-enhanced audio features to obtain a text-dominated fused intermediate representation; The fused representation is obtained by concatenating and mapping the text-dominated intermediate representation, the feature representation of the video modality, and the feature representation of the audio modality.
[0008] In one example, when jointly training the intent recognition model using classification loss and cross-modal structure alignment loss, the following is also included: The original feature representations of each modality are compared with the projected representations obtained after mapping. The projected representations are constrained to retain the local neighborhood structure in the original feature representations, and a single-modal structure preservation loss is constructed. The intent recognition model is jointly trained using classification loss, cross-modal structure alignment loss, and single-modal structure preservation loss.
[0009] In one example, constructing the single-modal structure preservation loss includes: The original feature representations of each modality are mapped to the shared space to obtain the projected representations of each modality; In the original feature representation space of each modality, a first local neighborhood distribution of each sample is constructed, and in the projected representation space of each modality, a second local neighborhood distribution of each sample is constructed. The consistency between the first local neighborhood distribution and the second local neighborhood distribution of each sample is measured, and the consistency measures of each sample are weighted and combined at multiple scales to obtain the structure preservation sub-loss of each modality. The structure preservation sub-losses of all modalities are averaged to obtain the single-modal structure preservation loss.
[0010] In one example, before averaging the structure preservation sub-losses of the modalities, the following steps are also included: Calculate the first matching probability between the feature representation of the text modality and the feature representation of the video modality for each sample; calculate the second matching probability between the feature representation of the text modality and the feature representation of the audio modality for each sample; determine the cross-modal semantic consistency weight of the sample based on the first matching probability and the second matching probability of each sample. The consistency between the first local neighborhood distribution and the second local neighborhood distribution of each sample is measured. The cross-modal semantic consistency weight is used as the weighting coefficient for each sample. The consistency measures of each sample are weighted and combined at multiple scales to obtain the structure preservation sub-loss of each modality.
[0011] In one example, during the inference phase, when using the trained intent recognition model to classify known intents or detect unknown intents on the fused representation, the process includes: Based on the fusion representation of known class samples in the training set, calculate the center vector and global covariance matrix for each known class; Calculate the distance from the fused representation of the test sample to the centroid of each known class, and take the minimum distance as the discrimination score; When the discrimination score is greater than a preset threshold, the test sample is determined to be of unknown intent; otherwise, it is determined to be of known intent. The fused representation of the test sample is then input into the classifier to obtain the specific category of the test sample.
[0012] It should be further noted that the technical features corresponding to the above examples can be combined or replaced to form new technical solutions.
[0013] The present invention also includes a computer program product comprising a computer program that, when executed by a processor, implements the steps of the structure-aware multimodal open intent recognition method formed by any one or a combination of the above examples.
[0014] The present invention also includes a storage medium storing computer instructions thereon, which, when executed, perform the steps of the structure-aware multimodal open intent recognition method formed by any one or more of the above examples.
[0015] The present invention also includes a terminal comprising a memory and a processor, wherein the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, performs the steps of the structure-aware multimodal open intent recognition method formed by any one or more of the above examples.
[0016] Compared with the prior art, the beneficial effects of the present invention are: 1. By performing structural alignment on the inter-class similarity of different modalities instead of traditional sample-level alignment, each modality can achieve semantic consistency at the category relationship level while retaining its own unique semantic information, avoiding forced compression of modality-specific features; at the same time, by combining classification loss and cross-modal structural alignment loss for joint training, it is possible to achieve high-precision classification of known categories and effective detection of unknown categories in multimodal open intent recognition, thereby improving the model's generalization ability and robustness in open environments.
[0017] 2. By using the inter-class similarity of the text modality as a reference, the inter-class similarity of the video and audio modalities is constrained to maintain consistency with it. By leveraging the dominance of the text modality in the semantic expression of intent, non-text modalities are aligned with the text semantic space while maintaining their own structural characteristics. This preserves the core guiding role of the text in cross-modal alignment, avoids semantic conflicts between multimodalities, and enhances the consistency of category relationships across different modalities.
[0018] 3. Based on the classification loss and cross-modal structure alignment loss, a single-modal structure preservation loss is further introduced. By constraining the projection representation, the local neighborhood structure in the original features is preserved, so that each modality maintains its original discriminative distribution when mapped to the shared space. This avoids the destruction of the internal geometric relationship of the single modality by cross-modal alignment, and thus effectively preserves the original discriminative structure of each modality, thereby improving the stability of the model in complex multimodal scenarios.
[0019] 4. By constructing local neighborhood distributions of samples in the original and projected spaces of each modality, and weighting and synthesizing the distribution consistency at multiple scales, the mapped projected representation not only meets the requirements of cross-modal alignment, but also retains the discriminative information in the original feature space to the maximum extent, thereby enhancing the model's ability to characterize known class distributions and its accuracy in detecting unknown intentions.
[0020] 5. By calculating the cross-modal semantic consistency weight for each sample and incorporating this weight as a sample-level weighting coefficient into the multi-scale structure preservation loss, the weight of samples with high semantic consistency and low noise interference in the single-modal structure preservation constraint is increased to enhance their positive effect on model optimization, while the weight of samples with semantic misalignment or modality loss in the single-modal structure preservation constraint is reduced. This weakens the interference of noisy and misaligned samples on model training and improves the stability and robustness of the multimodal representation learning process.
Attached Figure Description
[0021] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings. The accompanying drawings are provided to provide a further understanding of the present application and constitute a part of the present application. The same reference numerals are used in these drawings to denote the same or similar parts. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application.
[0022] Figure 1 is a flowchart of the method provided in an embodiment of the present invention; Figure 2 is a flowchart of the method provided in a preferred embodiment of the present invention.
Detailed Implementation
[0023] The technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] In the description of this invention, ordinal numbers (e.g., "first and second," etc.) are used to distinguish objects and are not limited to this order, nor should they be construed as indicating or implying relative importance. Furthermore, the technical features involved in the different embodiments of the invention described below can be combined with each other as long as they do not conflict with each other.
[0025] This invention discloses a structure-aware multimodal open intent recognition method, applicable to human-computer interaction scenarios such as intelligent customer service, intelligent dialogue systems, remote business processing, and autonomous driving. In the remote banking scenario, text modal input consists of command text generated by the user clicking the screen, or text content transcribed in real-time from the user's speech; video modal input consists of the user's facial expressions and lip movements captured by the camera; and audio modal input consists of the user's voice signal recorded by the microphone. In the autonomous driving scenario, text modal input consists of command text generated by the driver clicking the screen, or text content transcribed in real-time from the driver's speech; video modal input consists of the driver's gestures, head orientation, and facial expressions captured by the in-vehicle camera; and audio modal input consists of the driver's voice signal recorded by the microphone.
[0026] In the embodiments of this invention, a multimodal intent dataset $\mathcal{D} = \{(x_i^T, x_i^V, x_i^A, y_i)\}_{i=1}^{N}$ is given. In each input sample, $x_i^T$, $x_i^V$, and $x_i^A$ represent the input features of the text modality, video modality, and audio modality, respectively; $y_i$ is the corresponding intent category label; $i$ is the sample index; and $N$ is the total number of samples. The label set can be represented as $\mathcal{Y} = \mathcal{Y}_{\mathrm{known}} \cup \{y_{\mathrm{unk}}\}$, where $\mathcal{Y}_{\mathrm{known}} = \{1, \dots, K\}$ represents the set of known intent categories, $K$ is the total number of known intent categories, and $y_{\mathrm{unk}}$ indicates the unknown intent category.
[0027] During the training and validation phases, only samples from the known category set $\mathcal{Y}_{\mathrm{known}}$ participate in model learning; during the testing phase, the test set includes both known intent samples and unknown intent samples. The goal of this invention is to learn a multimodal intent recognition model that, during the inference phase, accurately classifies samples from $\mathcal{Y}_{\mathrm{known}}$ into their corresponding categories while identifying samples from unknown categories as $y_{\mathrm{unk}}$. This enables multimodal intent classification and unknown intent detection in open scenarios.
[0028] In one embodiment, as shown in Figure 1, a structure-aware multimodal open intent recognition method includes the following steps: S1: Perform feature extraction on the multimodal input data to obtain feature representations of the text modality, video modality, and audio modality.
[0029] In step S1, the pre-trained language model BERT can be used to extract feature representations for the text modality, the visual model Swin Transformer can be used to extract feature representations for the video modality, and the pre-trained speech model WavLM can be used to extract feature representations for the audio modality. It should be noted that the video modality input data in this invention can be equivalently replaced with image input data.
[0030] S2: Perform cross-modal fusion of the feature representations of each modality to obtain a fused representation.
[0031] In step S2, attention mechanisms or weighted fusion can be used to perform cross-modal fusion of the feature representations of each modality, so that the semantic information of different modalities complements each other, thereby more accurately judging the user's true intention.
[0032] S3: During the training phase, the classification loss is calculated based on the fused representation.
[0033] In step S3, using labeled samples of known intent categories, the fused representation is input into the classifier to obtain the predicted category probability, which is then compared with the true label. A classification loss, such as cross-entropy loss, is calculated. This loss drives the intent recognition model to learn the decision boundary that distinguishes different known intent categories, thereby achieving accurate classification of open intents. The intent recognition model in this invention includes a feature extraction module, a cross-modal fusion module, and a classifier, used to map multimodal inputs to the intent category space and to achieve known intent classification and unknown intent detection.
[0034] S4: During the training phase, the class centroid of each known intent category in the corresponding modality is calculated based on the feature representation of each modality. The inter-class similarity of each modality is calculated based on the class centroids, and the inter-class similarity of each modality is structurally aligned to construct the cross-modal structural alignment loss.
[0035] In step S4, the category centroid is the average vector of feature representations of all samples belonging to the same known intent category in the corresponding modality. This invention proposes a structure-aware cross-modal alignment mechanism. This mechanism no longer directly constrains the overlap of individual sample representations in different modalities, but instead models at the category structure level. By constructing the relationship structure between categories in each modality and aligning the similarity distribution between categories in different modalities, it ensures consistency in category relationships across different modalities. Compared to traditional instance-level alignment methods, the structure-level alignment method proposed in this invention effectively alleviates the semantic compression problem caused by overly strong alignment, achieving cross-modal semantic consistency modeling while maintaining modal differences, thereby improving the flexibility and generalization ability of multimodal representation learning.
[0036] S5: During the training phase, the intent recognition model is jointly trained using classification loss and cross-modal structure alignment loss.
[0037] In step S5, the classification loss and cross-modal structure alignment loss are weighted and combined to form the total loss function. The model is then trained using the backpropagation algorithm to update its parameters. This step, through joint training, enables the model to accurately distinguish known intent categories while maintaining cross-modal structural consistency at the category relationship level, thereby improving the model's generalization ability and robustness in open scenarios.
[0038] S6: In the inference phase, feature extraction and cross-modal fusion processing are performed on the sample to be identified to obtain a fused representation; the trained intent recognition model is used to classify known intents or detect unknown intents on the fused representation to complete the intent recognition task in an open environment.
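The inference step can be illustrated with a minimal sketch that follows the description in [0011]: per-class center vectors and one global covariance matrix are estimated from the fused representations of known-class training samples, and the minimum distance to any class center serves as the discrimination score. The Mahalanobis form of the distance, the small ridge term, the toy data, and the threshold value are assumptions made for illustration; the names `fit_class_statistics` and `detect` are hypothetical.

```python
import numpy as np

def fit_class_statistics(Z, labels):
    """Estimate per-class centers and the inverse of one shared covariance matrix."""
    classes = np.unique(labels)
    centers = np.stack([Z[labels == k].mean(axis=0) for k in classes])
    # Center every sample on its own class mean before pooling the covariance.
    centered = Z - centers[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(Z)
    cov += 1e-6 * np.eye(Z.shape[1])  # assumed ridge term to keep the matrix invertible
    return classes, centers, np.linalg.inv(cov)

def detect(z, centers, cov_inv, threshold):
    """Return (is_unknown, score); score is the minimum distance to a class center."""
    diff = centers - z
    dists = np.sqrt(np.einsum("kd,de,ke->k", diff, cov_inv, diff))
    score = float(dists.min())
    return bool(score > threshold), score

# Toy data: two known classes in a 2-dimensional fused space (illustrative only).
rng = np.random.default_rng(0)
Z_train = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(10.0, 1.0, size=(100, 2)),
])
y_train = np.array([0] * 100 + [1] * 100)
classes, centers, cov_inv = fit_class_statistics(Z_train, y_train)
threshold = 3.0  # hypothetical preset threshold

near_known = detect(np.array([0.1, 0.1]), centers, cov_inv, threshold)
far_away = detect(np.array([5.0, -5.0]), centers, cov_inv, threshold)
```

A sample flagged as known would then be passed to the classifier to obtain its specific category, as the method describes.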
[0039] In one embodiment, feature extraction of multimodal input data includes: The embedded representation of the text modality, $E^T \in \mathbb{R}^{L_T \times d_T}$, is extracted using the pre-trained language model BERT, where $\mathbb{R}$ is the real number field, $L_T$ is the length of the text sequence, and $d_T$ is the dimension of the text feature vector. The Swin Transformer, a large-scale visual model, is used to extract the embedded representation of the video modality, $E^V \in \mathbb{R}^{L_V \times d_V}$, which captures the high-level semantic information of the video content in the spatial and temporal dimensions, where $L_V$ is the length of the video sequence and $d_V$ is the dimension of the video embedding vector. For the audio modality, the raw speech signal is first preprocessed using a Python audio processing library, and then the embedded representation of the audio modality, $E^A \in \mathbb{R}^{L_A \times d_A}$, is extracted using the pre-trained speech model WavLM, where $L_A$ is the length of the audio sequence and $d_A$ is the dimension of the audio embedding vector.
[0040] Then, the embedded representations $E^T$, $E^V$, and $E^A$ of the text, video, and audio modalities are input into Transformer-based encoders, which model dependencies in the sequence and learn higher-level semantic representations through a multi-head attention mechanism, thereby obtaining more discriminative modal representations. For the text modality, a pre-trained BERT model is used as the encoder, and the output vector corresponding to the first token, the classification token [CLS], in the last hidden state of the BERT model is used as the sentence-level feature representation of the entire input text: $h^T = \mathrm{BERT}(E^T)_{[\mathrm{CLS}]}$, where $h^T$ represents the sentence-level feature vector of the text modality, and the subscript $[\mathrm{CLS}]$ denotes the output state of the classification token in the last Transformer layer, serving as the aggregate representation of the entire input sequence.
[0041] For the video and audio modalities, a standard Transformer is used to obtain sequence-level representations, followed by aggregation of temporal information through mean pooling, and then a linear projection maps the result to a shared feature space: $h^V = W_V \, \mathrm{MeanPool}(\mathrm{Transformer}(E^V))$ and $h^A = W_A \, \mathrm{MeanPool}(\mathrm{Transformer}(E^A))$, where $h^V, h^A \in \mathbb{R}^{d}$ are the global feature vectors for the video modality and the audio modality, respectively, $W_V$ is the linear projection matrix of the video modality, and $W_A$ is the linear projection matrix of the audio modality. The two linear projection matrices project the video and audio features onto a fusion dimension consistent with the text features, and $d$ represents the feature dimension of the multimodal shared semantic space, providing a unified representation space for subsequent cross-modal fusion.
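As a minimal sketch of the pooling and projection step for the video and audio modalities, the following example mean-pools a sequence of encoder outputs over time and applies a linear projection into the shared dimension. The Transformer encoder itself is omitted; the random arrays stand in for its outputs, and all sizes, as well as the function name `pool_and_project`, are hypothetical.

```python
import numpy as np

def pool_and_project(seq_features, W):
    """Mean-pool a (length, dim) sequence over time, then project to the shared dimension."""
    pooled = seq_features.mean(axis=0)  # (dim,)
    return W @ pooled                   # (d,)

rng = np.random.default_rng(0)
d = 8                                   # hypothetical shared (fusion) dimension
video_seq = rng.normal(size=(16, 32))   # stand-in for encoder output: 16 steps, 32 dims
audio_seq = rng.normal(size=(40, 24))   # stand-in for encoder output: 40 steps, 24 dims
W_V = rng.normal(size=(d, 32))          # video projection matrix
W_A = rng.normal(size=(d, 24))          # audio projection matrix

h_V = pool_and_project(video_seq, W_V)
h_A = pool_and_project(audio_seq, W_A)
```

Both outputs have the same length `d` regardless of the original sequence lengths and feature sizes, which is what allows the later fusion steps to operate in one space.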
[0042] In one embodiment, cross-modal fusion of feature representations of each modality includes: S21: Using the feature representation of the text modality as the query vector, cross-attention interaction is performed with the feature representations of the video modality and the feature representations of the audio modality respectively to obtain text-enhanced video features and text-enhanced audio features.
[0043] After the global feature vectors $h^T$, $h^V$, and $h^A$ of the three modalities are obtained, a cross-attention mechanism is employed for multimodal fusion to explicitly model the semantic relationships between textual and non-textual modalities. Unlike simple weighted summation, cross-attention can adaptively select information from the video and audio that is relevant to the textual intent based on the semantics of the current sample, thereby mitigating the effects of multimodal noise, semantic misalignment, and redundant information.
[0044] Specifically, using the text feature vector $h^T$ as the query vector, cross-modal attention interactions are performed with the video feature representation and the audio feature representation, respectively. The cross-attention operator is denoted as $\mathrm{CA}(\cdot)$: $\mathrm{CA}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$, where $Q$ represents the query vector, $K$ and $V$ represent the key and value, respectively, $\top$ represents transpose, and $\mathrm{softmax}(\cdot)$ represents the normalized exponential function. Because $h^V$ and $h^A$ have been projected into the fusion dimension $d$, attention matching can be performed between the three modalities in a unified space.
[0045] The cross-attention outputs for text-to-video and text-to-audio are constructed respectively: $\tilde{h}^V = \mathrm{CA}(h^T, h^V, h^V)$ and $\tilde{h}^A = \mathrm{CA}(h^T, h^A, h^A)$, where $\tilde{h}^V$ represents the supplementary information provided to the text by the video modality, i.e., the text-enhanced video features, and $\tilde{h}^A$ represents the supplementary information provided to the text by the audio modality, i.e., the text-enhanced audio features. Through the above calculations, the model uses the similarity between $h^T$ and $h^V$ (or $h^A$) to automatically assign cross-modal aggregation weights, thereby highlighting modal cues that are consistent with the current intent semantics.
[0046] S22: Perform residual fusion of the text modality feature representation with the text-enhanced video features and the text-enhanced audio features to obtain a text-dominated fusion intermediate representation.
[0047] In step S22, to improve fusion stability and avoid excessive perturbation of the original text semantics by cross-modal interactions, a residual fusion method is used to combine the original text representation with the two enhanced representations to obtain a text-dominated fusion intermediate representation $\hat{h}^T$: $\hat{h}^T = h^T + \tilde{h}^V + \tilde{h}^A$, where $h^T$ is the text feature representation and $\tilde{h}^V$ and $\tilde{h}^A$ are the text-enhanced video features and text-enhanced audio features, respectively. This design can preserve the main semantics of the text while introducing complementary information from video and audio, making the fused representation more robust in complex scenarios.
[0048] S23: After concatenating and mapping the text-dominated fusion intermediate representation, the feature representation of the video modality, and the feature representation of the audio modality, a fusion representation is obtained.
[0049] Finally, the text-dominated intermediate representation $\hat{h}^T$, the video feature representation $h^V$, and the audio feature representation $h^A$ are concatenated, and the final fused representation $z$ is obtained through linear mapping: $z = W_f \, [\hat{h}^T; h^V; h^A]$, where $[\cdot\,;\cdot]$ represents vector concatenation and $W_f \in \mathbb{R}^{d \times 3d}$ is the fusion mapping matrix, used to compress the concatenated representation back to the fusion dimension $d$. The resulting $z$ serves as the unified input for subsequent classification and open-class detection, supporting both known intent recognition and unknown intent detection tasks simultaneously.
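Steps S21 to S23 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: scaled dot-product attention is assumed; the keys and values are taken to be short sequences of projected video and audio vectors so that the softmax weights are non-trivial (with a single key vector the weight is always 1); the global video and audio features are taken as the mean of those sequences; and the function names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q, K, V):
    """q: (d,) query; K, V: (n, d) keys and values. Returns a (d,) weighted sum of V rows."""
    d = q.shape[-1]
    weights = softmax(K @ q / np.sqrt(d))  # (n,) attention weights from the text query
    return weights @ V

def fuse(h_T, H_V, H_A, W_f):
    """Text-guided fusion: cross-attention, residual sum, then concatenation and mapping."""
    v_enh = cross_attention(h_T, H_V, H_V)   # text-enhanced video features
    a_enh = cross_attention(h_T, H_A, H_A)   # text-enhanced audio features
    h_T_hat = h_T + v_enh + a_enh            # residual fusion, text-dominated
    h_V = H_V.mean(axis=0)                   # assumed global video feature
    h_A = H_A.mean(axis=0)                   # assumed global audio feature
    return W_f @ np.concatenate([h_T_hat, h_V, h_A])

rng = np.random.default_rng(0)
d = 8                                  # hypothetical fusion dimension
h_T = rng.normal(size=d)
H_V = rng.normal(size=(5, d))          # 5 projected video vectors (illustrative)
H_A = rng.normal(size=(7, d))          # 7 projected audio vectors (illustrative)
W_f = rng.normal(size=(d, 3 * d))      # fusion mapping matrix
z = fuse(h_T, H_V, H_A, W_f)
```

Because the text vector is added back unchanged in the residual step, the video and audio contributions act as corrections to the text semantics rather than replacements for them.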
[0050] The fusion strategy of this invention adopts a text-centric cross-modal interaction approach, in which text representation is responsible for guiding the extraction of supplementary cues related to intent from video and audio. This strategy fits the common characteristics of multimodal intent understanding scenarios: text often provides the most direct semantic intent, while video and audio play more of a role in contextual supplementation, emotional reinforcement, or ambiguity resolution. Therefore, cross-attention can more effectively utilize multimodal complementarity and suppress the negative contribution of noisy modalities.
[0051] In one embodiment, calculating the classification loss based on the fused representation includes: The fused representation $z$ is used for supervised classification as the main task. For the $i$-th sample, whose fused representation is $z_i$, a fusion classification head $f_\theta(\cdot)$ is introduced, where $\theta$ represents the model parameters. The fusion classification head maps the fused representation to the known class space and outputs the corresponding raw class scores $o_i$, represented as: $o_i = f_\theta(z_i) \in \mathbb{R}^{K}$, where $K$ indicates the number of known intent categories.
[0052] Based on the above raw class scores, this invention constructs a label-smoothed cross-entropy loss function $\mathcal{L}_{\mathrm{cls}}$ to improve the stability and generalization ability of the classification boundary, defined as: $\mathcal{L}_{\mathrm{cls}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{K} \tilde{y}_{i,k} \log \frac{\exp(o_{i,k})}{\sum_{j=1}^{K} \exp(o_{i,j})}$, where $B$ indicates the number of samples in the batch, $k$ indicates a category index, and $\tilde{y}_{i,k}$ indicates the target distribution after label smoothing. By introducing a label smoothing mechanism, the problem of overconfidence in known categories during model training can be effectively alleviated, thereby reducing the risk of misclassifying unknown samples as known categories in open environments. In some implementations, the label smoothing coefficient is set to 0.1.
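The label-smoothed cross-entropy can be sketched as below. The exact form of the smoothed target is not stated in the text, so this example assumes the common convention in which the true class receives $1 - \epsilon + \epsilon / K$ and every other class receives $\epsilon / K$; the function name is hypothetical.

```python
import numpy as np

def label_smoothed_ce(logits, labels, eps=0.1):
    """Mean cross-entropy between softmax(logits) and label-smoothed targets.

    logits: (B, K) raw class scores; labels: (B,) integer class indices.
    """
    B, K = logits.shape
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    target = np.full((B, K), eps / K)                      # assumed smoothing convention
    target[np.arange(B), labels] += 1.0 - eps
    return float(-(target * log_probs).sum(axis=1).mean())

# A very confident, correct prediction is still penalized once smoothing is on.
confident = np.array([[10.0, 0.0]])
loss_plain = label_smoothed_ce(confident, np.array([0]), eps=0.0)
loss_smooth = label_smoothed_ce(confident, np.array([0]), eps=0.1)
```

The gap between `loss_smooth` and `loss_plain` is the pressure that discourages the classifier from pushing known-class probabilities all the way to 1.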
[0053] In addition to sample-level discriminative ability, the relative geometric relationships between different categories should also remain consistent across different modalities. Therefore, in one embodiment, this invention introduces a cross-modal structural alignment loss, which enhances cross-modal semantic consistency by aligning the inter-class similarity structures in different modalities. The construction of the cross-modal structural alignment loss specifically includes: For each valid category $k$, the class centroid vector is calculated for each modality. This centroid is obtained by taking the mean of the features of the samples belonging to that class, expressed as: $c_k^m = \frac{1}{|\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} h_i^m$, where $c_k^m$ represents the centroid vector of the $k$-th known intent category under modality $m$; $h_i^m$ indicates the feature representation of the $i$-th sample in modality $m$; $m$ takes the value of the text modality $T$, video modality $V$, or audio modality $A$; and $\mathcal{I}_k$ indicates the sample index set of category $k$ in the current batch. Based on the above category centroids, the inter-class cosine similarity matrix is constructed as follows: $S_{k,l}^m = \frac{(c_k^m)^{\top} c_l^m}{\lVert c_k^m \rVert \, \lVert c_l^m \rVert}$, where $k$ and $l$ represent category indices and satisfy $k, l \in \mathcal{C}$; $\mathcal{C}$ represents the set of all valid categories in the current batch; $S_{k,l}^m$ indicates the cosine similarity between categories $k$ and $l$ in modality $m$, used to measure the structural relationship between different categories; and $c_l^m$ represents the centroid vector of the $l$-th known intent category under modality $m$.
[0054] In the cross-modal alignment process, the category structure of the text modality is used as the reference structure, and this reference structure is kept fixed during the structural alignment process. That is, the inter-class similarity of the text modality is used as a reference, and the inter-class similarity of the video modality and the audio modality is constrained to be consistent with the inter-class similarity of the text modality. Based on this, a mean squared error constraint is applied only to the off-diagonal elements of the inter-class similarity matrix, so that the off-diagonal similarity values in the video and audio modalities approximate the corresponding off-diagonal similarity values in the text modality. This is used to construct the cross-modal structural alignment loss $\mathcal{L}_{\mathrm{align}}$, represented as: $\mathcal{L}_{\mathrm{align}} = \sum_{m \in \{V, A\}} \frac{1}{|\mathcal{C}|(|\mathcal{C}| - 1)} \sum_{k \neq l} \left(S_{k,l}^m - S_{k,l}^T\right)^2$, where $S_{k,l}^T$ represents the category similarity within the text modality and is treated as a fixed target. This cross-modal structural alignment loss emphasizes the consistency of relationships between classes and is more robust than per-sample strong alignment. It can maintain the stability of the overall category geometry even in the presence of noise or misalignment, thus providing a clearer prior knowledge of known class structure for open set detection.
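The centroid, similarity-matrix, and alignment computations can be sketched in a few lines. The example below is a forward-pass illustration only: summing the video and audio terms without further weighting is an assumption, the small constant guarding against zero-norm centroids is an addition, and in a training framework the text similarity matrix would additionally be detached from the gradient to keep it fixed. Function names are hypothetical.

```python
import numpy as np

def class_centroids(feats, labels, classes):
    """Mean feature vector per class, stacked in the order given by `classes`."""
    return np.stack([feats[labels == k].mean(axis=0) for k in classes])

def cosine_similarity_matrix(C):
    """Pairwise cosine similarity between the rows of C."""
    normed = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
    return normed @ normed.T

def structure_alignment_loss(h_T, h_V, h_A, labels):
    """MSE between off-diagonal class similarities of video/audio and those of text."""
    classes = np.unique(labels)              # valid categories in the batch
    if len(classes) < 2:
        return 0.0                           # no inter-class structure to align
    S_T = cosine_similarity_matrix(class_centroids(h_T, labels, classes))
    off_diag = ~np.eye(len(classes), dtype=bool)
    loss = 0.0
    for h in (h_V, h_A):
        S_m = cosine_similarity_matrix(class_centroids(h, labels, classes))
        loss += np.mean((S_m[off_diag] - S_T[off_diag]) ** 2)
    return float(loss)

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2])
h_T = rng.normal(size=(6, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal transform
loss_rotated = structure_alignment_loss(h_T, h_T @ Q, h_T @ Q, labels)
loss_random = structure_alignment_loss(h_T, rng.normal(size=(6, 4)), h_T, labels)
```

Here `loss_rotated` is numerically zero even though no video or audio sample coincides with its text counterpart: a rotation preserves every inter-class cosine similarity. This illustrates how the structural constraint leaves room for modality-specific representations while still tying the category geometry together.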
[0055] It should be further explained that the existing technology, which uses spheres as the basic unit and employs a contrastive loss function to pull similar spheres together and push dissimilar spheres apart, is essentially a similarity alignment method based on local representation units. In contrast, this invention does not use spheres or individual samples as the direct alignment object, but starts from the category structure level. By constructing a similarity relationship matrix between categories in each modality, it models the relative geometric relationships between categories and constrains this structural relationship to remain consistent across modalities, thereby shifting from similarity alignment to structural consistency alignment.
[0056] When performing cross-modal alignment in a shared space, if the projection mapping excessively alters the local neighborhood structure of the original representation, it may disrupt the original discriminative feature distribution of each modality, weakening the model's classification ability and destabilizing the fused representation. Therefore, in one embodiment, this invention introduces a single-modal structure preservation constraint mechanism. By applying structure preservation regularization to the output of the projection branch, each modality retains, as far as possible, the local structural relationships of the original feature space after mapping to the shared space, thus balancing alignment and discriminability. It should be noted that the projection branch is used only for structural constraints during the training phase; during the inference phase, the model still relies solely on the fused representation to perform intent classification and unknown intent detection.
[0057] At this point, during the training phase, a single-modal structure-preserving loss is introduced, including: The original feature representations of each modality are compared with the projected representations obtained after mapping. The projected representations are constrained to retain the local neighborhood structure in the original feature representations, and a single-modal structure preservation loss is constructed. The intent recognition model is jointly trained using classification loss, cross-modal structure alignment loss, and single-modal structure preservation loss.
[0058] In one embodiment, constructing a single-modal structure retention loss includes: (1) Map the original feature representations of each modality to the shared space to obtain the projected representations of each modality.
[0059] For the text modality $t$, the video modality $v$, and the audio modality $a$, the present invention sets lightweight projection functions $f_t$, $f_v$, and $f_a$ respectively, which map the single-modal representations to a unified shared space. The projected representation of a sample in each modality is defined as:
$$z_i^{t} = f_t(h_i^{t}), \quad z_i^{v} = f_v(h_i^{v}), \quad z_i^{a} = f_a(h_i^{a})$$
where $h_i^{t}$, $h_i^{v}$, and $h_i^{a}$ are the original representations of the $i$-th sample in the text, video, and audio modalities, and $z_i^{t}$, $z_i^{v}$, and $z_i^{a}$ are the projected representations obtained from the lightweight projection functions. Each lightweight projection function is preferably a linear mapping layer or a multilayer perceptron, which maps the different modal representations to the shared space in a unified manner.
[0060] Preferably, to improve the stability of the similarity measurement, the present invention normalizes the projected representation $z_i^{m}$ to obtain the normalized representation:
$$\tilde{z}_i^{m} = \frac{z_i^{m}}{\lVert z_i^{m} \rVert_2}$$
where $\tilde{z}_i^{m}$ is the normalized projected representation of sample $i$ in modality $m$; $z_i^{m}$ is the projected representation of the $i$-th sample in modality $m$; and $\lVert z_i^{m} \rVert_2$ is the $\ell_2$ norm of that projected representation. Normalization reduces the impact of differences in feature scale across modalities on subsequent similarity calculations.
[0061] (2) Construct the first local neighborhood distribution of each sample in the original feature representation space of each modality, and construct the second local neighborhood distribution of each sample in the projection representation space of each modality.
[0062] Specifically, in order to maintain the internal structure of a single mode at different local scales, the present invention further constructs a multi-scale structure preservation regularization term.
[0063] Consider any representation matrix $X \in \mathbb{R}^{B \times d}$, where $B$ is the number of samples in the current batch and $d$ is the feature dimension. First, the representation of each sample is normalized:
$$\hat{x}_i = \frac{x_i}{\lVert x_i \rVert_2}$$
where $x_i$ is the feature representation vector of the $i$-th sample; $\hat{x}_i$ is the normalized feature representation vector of the $i$-th sample; and $\lVert x_i \rVert_2$ is the $\ell_2$ norm of that feature vector.
[0064] Then, at temperature $\tau$, the temperature-scaled cosine similarity between sample $i$ and sample $j$ is defined as:
$$s_{ij}^{\tau} = \frac{\hat{x}_i^{\top} \hat{x}_j}{\tau}$$
where $\hat{x}_i$ and $\hat{x}_j$ are the normalized feature representation vectors of the $i$-th and $j$-th samples.
[0065] To highlight the local neighborhood structure and reduce the impact of distant noise samples, this invention introduces a Top-$k$ truncation mechanism. Let $\mathcal{N}_k(i)$ denote the set of indices of the $k$ samples most similar to sample $i$ in the current representation space, where $k$ is the neighborhood size parameter. When $k$ takes its full-batch setting, all samples in the current batch are used for modeling. Based on this, the truncated neighborhood distribution is constructed as:
$$p_{ij}^{(\tau,k)} = \frac{\mathbb{1}[j \in \mathcal{N}_k(i)] \, \exp(s_{ij}^{\tau})}{\sum_{l} \mathbb{1}[l \in \mathcal{N}_k(i)] \, \exp(s_{il}^{\tau})}$$
where $p_{ij}^{(\tau,k)}$ is the probability weight of sample $j$ in the neighborhood distribution of sample $i$ at temperature $\tau$ and neighborhood scale $k$; $s_{ij}^{\tau}$ is the temperature-scaled cosine similarity between sample $i$ and sample $j$; and $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 when the condition in brackets is true and 0 otherwise. This definition is equivalent to performing softmax normalization only within the local neighborhood, so the resulting distribution focuses on the local structural relationships of the samples. Let $P_i^{(\tau,k)}$ denote the first local neighborhood distribution of sample $i$, computed in the original representation space before projection, and let $Q_i^{(\tau,k)}$ denote the second local neighborhood distribution, computed in the shared space after projection.
[0066] (3) Measure the consistency between the first local neighborhood distribution and the second local neighborhood distribution of each sample, and weight and synthesize the consistency metric values of each sample at multiple scales to obtain the structure preservation sub-loss of each modality.
[0067] The Jensen-Shannon (JS) divergence is used to measure the consistency between the first and second local neighborhood distributions of each sample. The row-wise Jensen-Shannon divergence between the two distributions is defined as:
$$M_i = \tfrac{1}{2}\left(P_i + Q_i\right), \qquad \mathrm{JS}(P_i \,\|\, Q_i) = \tfrac{1}{2}\,\mathrm{KL}(P_i \,\|\, M_i) + \tfrac{1}{2}\,\mathrm{KL}(Q_i \,\|\, M_i)$$
where $M_i$ is the intermediate distribution, and $\mathrm{JS}(P_i \,\|\, Q_i)$ is the Jensen-Shannon divergence between the first local neighborhood distribution $P_i$ and the second local neighborhood distribution $Q_i$ of sample $i$. It measures the consistency of the neighborhood distribution before and after projection; the smaller the value, the better the structure is preserved. To account for local structural features at different scales, this invention uses a multi-scale parameter set $\Omega$ to constrain the structural offset comprehensively, and the structure preservation sub-loss $\mathcal{L}_{\text{sp}}^{m}$ is defined as:
$$\mathcal{L}_{\text{sp}}^{m} = \frac{1}{|\Omega|} \sum_{(\tau,k) \in \Omega} \frac{1}{B} \sum_{i=1}^{B} \mathrm{JS}\!\left(P_i^{(\tau,k)}(H^{m}) \,\Big\|\, Q_i^{(\tau,k)}(\tilde{Z}^{m})\right)$$
where $H^{m}$ is the original feature representation of all samples in modality $m$; $\tilde{Z}^{m}$ is the normalized projected feature representation of all samples in modality $m$; $|\Omega|$ is the number of configurations in the multi-scale parameter set; $B$ is the number of samples in the current batch; $P_i^{(\tau,k)}$ is the first local neighborhood distribution of sample $i$ at temperature $\tau$ and neighborhood scale $k$; and $Q_i^{(\tau,k)}$ is the second local neighborhood distribution of sample $i$ under the same conditions.
[0068] In some implementations, the multi-scale parameter set $\Omega$ consists of three parameter pairs $(\tau, k)$: one pair uses all batch samples at a temperature of 0.05, and the other two pairs model local structure at different temperatures and different neighborhood sizes.
[0069] (4) The structure preservation loss of each mode is averaged to obtain the single-mode structure preservation loss.
[0070] Furthermore, at the single-modal level, structure preservation constraints are applied to the text modality, the video modality, and the audio modality respectively, and the structure preservation sub-losses of the three modalities are averaged to obtain the final single-modal structure preservation loss $\mathcal{L}_{\text{sp}}$:
$$\mathcal{L}_{\text{sp}} = \frac{1}{3}\left(\mathcal{L}_{\text{sp}}^{t} + \mathcal{L}_{\text{sp}}^{v} + \mathcal{L}_{\text{sp}}^{a}\right)$$
where $\mathcal{L}_{\text{sp}}^{t}$, $\mathcal{L}_{\text{sp}}^{v}$, and $\mathcal{L}_{\text{sp}}^{a}$ are the structure preservation sub-losses of the text, video, and audio modalities. Through this single-modal structure preservation constraint mechanism, this invention maintains the local structural stability within each modality during cross-modal structure alignment and reduces the damage that shared space mapping does to the original discriminative structure, thereby improving the stability and robustness of the model in complex multimodal open environments.
[0071] In this embodiment, to address the issue of easily disrupted discriminative structures within single modalities during cross-modal alignment, a single-modal structure preservation mechanism is proposed. This mechanism maintains the stability of the local geometric structure within the single-modal feature space by constraining the local neighborhood relationships between samples in the single-modal representations before and after mapping. Specifically, this invention constructs similarity neighborhood distributions between samples at different scales and uses a distribution consistency metric to constrain structural changes before and after mapping, thereby avoiding excessive distortion of the single-modal representation space while performing cross-modal structure alignment. This mechanism effectively preserves the original discriminative structure within each modality, improving the stability and robustness of the model in complex multimodal scenarios.
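The structure preservation steps described above (truncated neighborhood distributions, row-wise Jensen-Shannon divergence, and multi-scale averaging) can be sketched for a single modality as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the invention: all function names are hypothetical, `k=None` is this sketch's own convention for the full-batch setting, a sample is excluded from its own neighborhood (the text does not say either way), and only the temperature 0.05 with the full batch comes from the description; the other two scale pairs are made up.

```python
import math

def l2_normalize(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def neighborhood_distribution(rows, tau, k=None):
    """Row-wise softmax restricted to each sample's k most similar samples.

    k=None means "use the whole batch" (sketch convention). The sample
    itself is excluded from its own neighbourhood (ASSUMPTION).
    """
    unit = [l2_normalize(r) for r in rows]
    n = len(unit)
    dist = []
    for i in range(n):
        sims = {j: sum(a * b for a, b in zip(unit[i], unit[j])) / tau
                for j in range(n) if j != i}
        keep = sorted(sims, key=sims.get, reverse=True)
        keep = keep if k is None else keep[:k]
        peak = max(sims[j] for j in keep)  # subtract the max for numerical stability
        weights = {j: math.exp(sims[j] - peak) for j in keep}
        total = sum(weights.values())
        dist.append([weights.get(j, 0.0) / total for j in range(n)])
    return dist

def js_divergence(p, q):
    """Jensen-Shannon divergence; zero-probability terms contribute nothing."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x):
        return sum(a * math.log(a / m) for a, m in zip(x, mid) if a > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def structure_preservation(original, projected, scales):
    """Mean row-wise JS divergence, averaged over (tau, k) configurations."""
    per_scale = []
    for tau, k in scales:
        p = neighborhood_distribution(original, tau, k)
        q = neighborhood_distribution(projected, tau, k)
        per_scale.append(sum(js_divergence(a, b) for a, b in zip(p, q)) / len(p))
    return sum(per_scale) / len(per_scale)

# Only (0.05, full batch) is taken from the text; the other pairs are hypothetical.
scales = [(0.05, None), (0.1, 2), (0.2, 1)]
original = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.1, 1.0, 0.2], [0.0, 0.9, 0.3]]
preserved = [[3.0 * a for a in row] for row in original]        # same directions
scrambled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]    # neighbours swapped

loss_preserved = structure_preservation(original, preserved, scales)
loss_scrambled = structure_preservation(original, scrambled, scales)
```

A projection that only rescales the features leaves every cosine neighborhood intact, so its loss is numerically zero, while a projection that swaps each sample's nearest neighbor drives the loss toward the Jensen-Shannon upper bound of $\ln 2$.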
[0072] Considering the potential issues of semantic inconsistency, modal noise, or missing information in multimodal data, this invention proposes a consistency-driven dynamic weight adjustment mechanism. This mechanism evaluates the cross-modal consistency of samples by measuring the degree of semantic matching between different modalities and adaptively adjusts the contribution weights of different samples in the structural alignment and preservation process accordingly. For samples with high consistency, this invention assigns a larger weight to enhance their positive impact on model optimization; for samples with low consistency, their weight is reduced to weaken the interference of noisy and misaligned samples on model training. This mechanism can significantly improve the stability and robustness of the multimodal representation learning process. Specifically, this invention first calculates the cross-modal consistency weight and uses the cross-modal semantic consistency weight as the weighting coefficient for each sample, weighting and synthesizing the consistency metrics of each sample across multiple scales, including: (1) Calculate the first matching probability between the feature representation of the text modality and the feature representation of the video modality for each sample; calculate the second matching probability between the feature representation of the text modality and the feature representation of the audio modality for each sample; determine the cross-modal semantic consistency weight of the sample based on the first matching probability and the second matching probability of each sample.
[0073] Specifically, cross-modal similarity matrices are first constructed between the text modality and the video modality, and between the text modality and the audio modality:
$$A_{ij}^{tv} = \frac{\tilde{z}_i^{t} \cdot \tilde{z}_j^{v}}{\tau_c}, \qquad A_{ij}^{ta} = \frac{\tilde{z}_i^{t} \cdot \tilde{z}_j^{a}}{\tau_c}$$
where $A_{ij}^{tv}$ is the cross-modal similarity between the $i$-th text sample and the $j$-th video sample; $A_{ij}^{ta}$ is the cross-modal similarity between the $i$-th text sample and the $j$-th audio sample; $\cdot$ denotes the dot product of vectors; $\tilde{z}_i^{t}$, $\tilde{z}_i^{v}$, and $\tilde{z}_i^{a}$ are the normalized projected representations of the $i$-th sample in the text, video, and audio modalities; and $\tau_c$ is the temperature parameter used to adjust the smoothness of the similarity distribution.
[0074] Based on this, a softmax operation is applied to each row of the similarity matrices $A^{tv}$ and $A^{ta}$ to obtain the first matching probability $\rho_{ij}^{tv}$ and the second matching probability $\rho_{ij}^{ta}$:
$$\rho_{ij}^{tv} = \frac{\exp(A_{ij}^{tv})}{\sum_{l} \exp(A_{il}^{tv})}, \qquad \rho_{ij}^{ta} = \frac{\exp(A_{ij}^{ta})}{\sum_{l} \exp(A_{il}^{ta})}$$
Furthermore, this invention uses the matching probability at the diagonal position as the credibility of the cross-modal pairing within the sample itself, and defines the sample-level cross-modal consistency weight $w_i$ from the diagonal entries, where $\rho_{ii}^{tv}$ is the matching probability between the $i$-th text sample and the $i$-th video sample, $\rho_{ii}^{ta}$ is the matching probability between the $i$-th text sample and the $i$-th audio sample, and $w_i$ is the consistency weight of the $i$-th sample.
[0075] Based on the above definition, samples with higher consistency will receive larger weights, thus playing a stronger role in the structure preservation loss; while samples with lower consistency will receive smaller weights to reduce the negative impact of semantic misalignment or modal noise on model training.
[0076] (2) Measure the consistency between the first local neighborhood distribution and the second local neighborhood distribution of each sample. Use the cross-modal semantic consistency weight as the weighting coefficient of each sample, so that each sample's multi-scale consistency measurement is multiplied by its weight before aggregation, to obtain the structure preservation sub-loss for each modality. By introducing sample-level weights, this embodiment further reduces the impact of semantically inconsistent samples on the structure preservation constraint.
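The consistency weights described above can be illustrated with a small sketch. The row-wise softmax and the use of the diagonal matching probabilities follow the description; how the two diagonal probabilities are combined into one weight is not recoverable from the text, so the sketch averages them, which is an assumption. Function names, the temperature value 0.1, and the toy vectors are hypothetical.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_probabilities(text, other, tau):
    """Row-wise softmax of temperature-scaled text-to-other similarities."""
    return [softmax([dot(t, o) / tau for o in other]) for t in text]

def consistency_weights(text, video, audio, tau=0.1):
    p_tv = matching_probabilities(text, video, tau)
    p_ta = matching_probabilities(text, audio, tau)
    # ASSUMPTION: the two diagonal probabilities are averaged; the source
    # does not preserve the formula that combines them.
    return [(p_tv[i][i] + p_ta[i][i]) / 2.0 for i in range(len(text))]

# Unit-norm projected representations (hypothetical toy values).
text = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
video = [[1.0, 0.0], [0.0, 1.0], [0.8, -0.6]]  # sample 2 is misaligned
audio = [[1.0, 0.0], [0.0, 1.0], [0.8, -0.6]]

w = consistency_weights(text, video, audio)
```

In this toy batch the third sample's video and audio vectors are orthogonal to its text vector, so its weight is close to zero and it would contribute very little to the weighted structure preservation sub-loss.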
[0077] By integrating the aforementioned classification loss, cross-modal structure alignment loss, and single-modal structure preservation loss, a joint training strategy is employed to collaboratively optimize the primary task's discrimination objective and the structural constraint objective. The overall training objective function $\mathcal{L}$ is defined as:
$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\text{align}} \, \mathcal{L}_{\text{align}} + \lambda_{\text{sp}}(e) \, \mathcal{L}_{\text{sp}}$$
where $\mathcal{L}_{\text{cls}}$ is the classification loss, $\mathcal{L}_{\text{align}}$ is the cross-modal structure alignment loss, and $\mathcal{L}_{\text{sp}}$ is the single-modal structure preservation loss; $\lambda_{\text{align}}$ is the weight of the cross-modal structure alignment loss and is set to a constant during training; and $\lambda_{\text{sp}}(e)$ is the dynamic weight of the single-modal structure preservation loss, which adjusts with the number of training epochs. To prevent excessive structural constraints in the later stages of training from reducing discriminative ability, $\lambda_{\text{sp}}(e)$ follows a scheduling strategy that decays with the number of training rounds. The schedule is determined by the initial weight $\lambda_{\text{sp}}^{0}$ of the structure preservation loss, the current training round $e$, and the annealing length $E$, that is, the number of rounds over which the weight decays.
[0078] In the early stages of training, the structure preservation weight is relatively large, which helps stabilize the shared feature space and enhance cross-modal structural consistency. As training progresses, the weight gradually decreases, so that model optimization increasingly focuses on the classification objective, refining the decision boundary and preventing the structural constraints from excessively restricting discriminative ability.
[0079] Furthermore, backpropagation is performed on the above loss function to calculate gradients and update the model's trainable parameters, gradually improving the model's performance in open intent classification.
[0080] The model update process also includes: (1) Evaluation on the validation set: after each round of training, the model's evaluation metric score on the validation set is calculated to assess the model's current performance.
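The joint objective and the decaying structure preservation weight can be sketched as below. The description states only that the weight decays with the training round and depends on an initial weight, the current round, and an annealing length; the linear form used here is an assumption chosen for illustration, and the function names and numeric values are hypothetical.

```python
def preservation_weight(epoch, initial_weight, anneal_epochs):
    """ASSUMED linear decay from initial_weight to zero over anneal_epochs."""
    remaining = 1.0 - epoch / anneal_epochs
    return initial_weight * max(0.0, remaining)

def total_loss(cls_loss, align_loss, preserve_loss, align_weight, preserve_weight):
    """Classification loss plus weighted alignment and preservation terms."""
    return cls_loss + align_weight * align_loss + preserve_weight * preserve_loss

# Hypothetical settings: initial weight 0.5, annealed over 10 epochs.
schedule = [preservation_weight(e, initial_weight=0.5, anneal_epochs=10)
            for e in range(12)]
loss_at_start = total_loss(1.2, 0.3, 0.4, align_weight=0.1,
                           preserve_weight=schedule[0])
```

With these settings the weight starts at 0.5, reaches zero at epoch 10, and stays there, so the later epochs optimize only the classification and alignment terms. A cosine or exponential decay would fit the same description equally well.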
[0081] (2) Update the optimal validation score and the optimal model. If the current evaluation score is higher than the historical best score, then the score and the corresponding model parameters are set as the new optimal result.
[0082] (3) Continue training: subsequent training iterations are based on the parameters of the current best model, which continuously improves the model's adaptability in complex noisy environments.
[0083] If the validation score does not improve after 10 consecutive validations, the early stopping strategy is triggered, the representation learning is stopped, and the optimal model parameters are saved.
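The validation, best-model tracking, and early stopping procedure described above can be sketched as a generic loop. The patience of 10 comes from the description; everything else is hypothetical, including the function names and the stand-in training and evaluation callables. As a simplification, this sketch continues training from the latest parameters rather than reloading the best ones at each round.

```python
import copy

def train_with_early_stopping(train_one_epoch, evaluate, params,
                              max_epochs, patience=10):
    """Keep the best-validated parameters; stop after `patience` stale validations."""
    best_score = float("-inf")
    best_params = copy.deepcopy(params)
    stale = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        params = train_one_epoch(params, epoch)
        score = evaluate(params)
        epochs_run = epoch + 1
        if score > best_score:
            # New best: remember the score and a snapshot of the parameters.
            best_score, best_params, stale = score, copy.deepcopy(params), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_score, epochs_run

# Hypothetical stand-ins for a real training step and validation metric.
scores = [0.50, 0.62, 0.71, 0.70] + [0.69] * 20

def fake_train(params, epoch):
    return {"epoch": epoch}

def fake_eval(params):
    return scores[params["epoch"]]

best_params, best_score, epochs_run = train_with_early_stopping(
    fake_train, fake_eval, {"epoch": -1}, max_epochs=24)
```

In this toy run the score peaks at the third epoch, the next ten validations fail to improve on it, and the loop stops after 13 epochs while returning the snapshot taken at the peak.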
[0084] This invention designs three complementary optimization terms. First, a supervised classification loss based on fusion representation is introduced to improve the model's ability to discriminate known categories. Second, a cross-modal structure alignment loss is designed to constrain the geometric relationships of different modalities in the feature space to remain consistent, thereby enhancing cross-modal semantic consistency. Finally, a single-modal structure preservation constraint is introduced to limit the destruction of the original modality's local structure during the projection mapping process, and a misalignment gating mechanism is combined to reduce the negative impact of semantically inconsistent samples. By jointly optimizing the classification loss, cross-modal structure alignment loss, and single-modal structure preservation loss, the model can learn a more stable and discriminative fusion representation under complex conditions such as modal noise and semantic misalignment, and provide a clearer structural prior for subsequent open set detection. Through the above joint optimization mechanism, this invention can achieve an effective balance between cross-modal semantic consistency and single-modal discriminative ability, thereby obtaining a more stable multimodal representation with good generalization ability.
[0085] In one embodiment, during the inference phase, when using the trained intent recognition model to classify known intents or detect unknown intents on the fused representation, the process includes: (1) Based on the fusion representation of known class samples in the training set, calculate the center vector and global covariance matrix of each known class.
[0086] For a given test sample, its fused representation is first obtained through the feature extraction module and the cross-modal cross-attention fusion module. The fused representation is computed in the same way as in the training phase and serves as the unified basis for decisions in the inference phase. In addition, the center vector $\mu_k$ of each known category and the covariance matrix $\Sigma$ are estimated from the training samples.
[0087] (2) Calculate the distance from the fusion representation of the test sample to the centroid of each known class, and take the minimum distance as the discrimination score.
[0088] Specifically, to determine whether a sample belongs to a known category, this invention constructs a discriminant function based on the Mahalanobis distance. For each known category $k$, the distance between the sample and the class center is calculated:
$$\mu_k = \frac{1}{N_k} \sum_{i:\, y_i = k} u_i^{\text{train}}, \qquad d_k(u) = \sqrt{(u - \mu_k)^{\top} \Sigma^{-1} (u - \mu_k)}$$
where $d_k(u)$ is the Mahalanobis distance from the current test sample to category $k$; $u_i^{\text{train}}$ and $u$ are the fused features extracted from the training set and from the test sample, respectively; $y_i$ is the label of the $i$-th training sample; $N_k$ is the number of training samples of category $k$; $\Sigma^{-1}$ is the inverse of the covariance matrix; and $u_i^{\text{train}}$ is the fused representation vector of the $i$-th training sample.
[0089] (3) When the discrimination score is greater than the preset threshold, the test sample is determined to be of unknown intent; otherwise, proceed to the next step (4).
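[0090] Specifically, the distance score $s(u)$ of the test sample is defined as the minimum Mahalanobis distance to the known class centers:
$$s(u) = \min_{k} d_k(u)$$
Whether the sample belongs to an unknown category is then determined by comparing the Mahalanobis distance score $s(u)$ with the preset threshold $\delta$:
$$\hat{y} = \begin{cases} \text{unknown}, & s(u) > \delta \\ \text{the known category predicted in step (4)}, & \text{otherwise} \end{cases}$$
where $\hat{y}$ is the intent category label that the model predicts for the input sample.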
[0091] (4) When the discrimination score is less than the preset threshold, it is determined to be a known intent, and the fusion representation of the test sample is input into the classifier to obtain the specific category of the test sample.
[0092] Specifically, when a test sample is determined to belong to a known category, the fused representation of that test sample is input into the fusion classification head, which yields the raw scores $o = (o_1, \dots, o_K)$ for the categories. The final predicted category is calculated using the softmax function:
$$\hat{y} = \arg\max_{k \in \{1, \dots, K\}} \frac{\exp(o_k)}{\sum_{j=1}^{K} \exp(o_j)}$$
where $K$ is the number of known intent categories.
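Inference steps (1) to (4) above can be sketched as a single decision function. This is a minimal illustration with hypothetical names and toy values: the class centers and the inverse covariance matrix are passed in precomputed, the classifier is a stand-in callable, class labels are assumed to coincide with classifier output indices, and a score exactly equal to the threshold is treated as known (the text leaves that case open).

```python
import math

def mahalanobis(u, center, cov_inv):
    """Mahalanobis distance of u from center under the inverse covariance cov_inv."""
    diff = [a - b for a, b in zip(u, center)]
    proj = [sum(row[j] * diff[j] for j in range(len(diff))) for row in cov_inv]
    return math.sqrt(sum(d * p for d, p in zip(diff, proj)))

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(u, centers, cov_inv, threshold, classifier):
    """Return ('unknown' or a class index, distance score)."""
    score = min(mahalanobis(u, c, cov_inv) for c in centers.values())
    if score > threshold:
        return "unknown", score
    probs = softmax(classifier(u))
    return max(range(len(probs)), key=probs.__getitem__), score

# Hypothetical toy setup: two known classes in a 2-D fused space.
centers = {0: [0.0, 0.0], 1: [4.0, 0.0]}
cov_inv = [[1.0, 0.0], [0.0, 4.0]]  # inverse of diag(1.0, 0.25)

def toy_classifier(u):
    # Stand-in for the fusion classification head: negative squared distances.
    return [-((u[0] - c[0]) ** 2 + (u[1] - c[1]) ** 2) for c in centers.values()]

known_a = predict([0.5, 0.1], centers, cov_inv, 2.0, toy_classifier)
known_b = predict([3.8, -0.2], centers, cov_inv, 2.0, toy_classifier)
unknown = predict([2.0, 3.0], centers, cov_inv, 2.0, toy_classifier)
```

The first two samples lie close to a class center and are routed to the classifier, while the third is far from both centers under the shared covariance and is rejected as an unknown intent before the classifier is consulted.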
[0093] Combining the above embodiments, a preferred embodiment of the present invention is obtained, as shown in Figure 2. The method includes the following steps:
S10: Perform feature extraction on the multimodal input data to obtain feature representations of the text modality, video modality, and audio modality;
S20: Perform cross-modal fusion of the feature representations of each modality to obtain a fused representation;
S30: Calculate the classification loss based on the fused representation;
S40: Calculate the class centroid of each known intent category in the corresponding modality based on the feature representation of each modality, calculate the inter-class similarity of each modality based on the class centroids, and perform structural alignment on the inter-class similarity of each modality to construct the cross-modal structural alignment loss;
S50: Map the original feature representations of each modality to the shared space to obtain the projected representations of each modality; calculate the first matching probability between the feature representation of the text modality and the feature representation of the video modality for each sample; calculate the second matching probability between the feature representation of the text modality and the feature representation of the audio modality for each sample; determine the cross-modal semantic consistency weight of each sample based on its first and second matching probabilities; construct the first local neighborhood distribution of each sample in the original feature representation space of each modality, and the second local neighborhood distribution of each sample in the projected representation space of each modality; measure the consistency between the first and second local neighborhood distributions of each sample; using the cross-modal semantic consistency weight as the weighting coefficient of each sample, weight and combine the consistency measurement values of each sample at multiple scales to obtain the structure preservation sub-loss of each modality; and average the structure preservation sub-losses of the modalities to obtain the single-modal structure preservation loss;
S60: Jointly train the intent recognition model using the classification loss, the cross-modal structure alignment loss, and the single-modal structure preservation loss;
S70: In the inference phase, perform feature extraction and cross-modal fusion on the sample to be identified to obtain a fused representation, and use the trained intent recognition model to classify known intents or detect unknown intents on the fused representation.
[0094] This invention first utilizes a pre-trained large model to extract features from multimodal input data, obtaining high-level semantic representations of text, video, and audio modalities. Then, a text-centric cross-modal attention mechanism is used to fuse multimodal information, resulting in a unified fused representation. Subsequently, in the representation learning stage, this invention introduces a structure-aware cross-modal alignment mechanism. By constraining the consistency of category structural relationships across different modalities, cross-modal semantic structure modeling is achieved. Simultaneously, a single-modal structure preservation mechanism is introduced to maintain the consistency of local neighborhood relationships before and after single-modal representations, avoiding the disruption of the original single-modal discriminative structure by cross-modal alignment. Furthermore, a consistency-driven dynamic weight adjustment mechanism is combined to adaptively adjust the contribution of different samples during training based on their cross-modal semantic consistency, reducing the negative impact of semantic misalignment and noise interference. Finally, by jointly optimizing the classification objective and the structural constraint objective, a unified improvement in the discriminative power and structural robustness of multimodal representations is achieved. In the inference stage, known intent classification and unknown intent recognition are completed based on the fused representation.
[0095] The present invention also provides a computer program product, comprising a computer program that, when executed by a processor, implements the steps of the structure-aware multimodal open intent recognition method formed by any or a combination of the above examples. The processor may be a single-core or multi-core central processing unit or a specific integrated circuit, or one or more integrated circuits configured to implement the present invention.
[0096] The present invention also provides a storage medium having the same inventive concept as the structure-aware multimodal open intent recognition method formed by any or more of the above examples, wherein computer instructions are stored thereon, and the computer instructions, when executed, perform the steps of the structure-aware multimodal open intent recognition method formed by any or more of the above examples.
[0097] Based on this understanding, the technical solution of this embodiment, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0098] This invention also provides a terminal having the same inventive concept as any or a combination of examples corresponding to the above-described structure-aware multimodal open intent recognition method, including a memory and a processor. The memory stores computer instructions executable on the processor, and when the processor executes the computer instructions, it performs the steps of the above-described structure-aware multimodal open intent recognition method. The processor may be a single-core or multi-core central processing unit or a specific integrated circuit, or one or more integrated circuits configured to implement this invention.
[0099] In one example, the terminal, i.e., the electronic device, is represented in the form of a general-purpose computing device. The components of the electronic device may include, but are not limited to: at least one processing unit (processor) mentioned above, at least one storage unit mentioned above, and a bus connecting different system components (including storage units and processing units).
[0100] The storage unit stores program code that can be executed by the processing unit, causing the processing unit to perform the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present invention. For example, the processing unit can execute the aforementioned structure-aware multimodal open intent recognition method.
[0101] The storage unit may include readable media in the form of volatile storage units, such as random access memory (RAM) and / or cache storage units, and may further include read-only memory (ROM).
[0102] The storage unit may also include a program / utility having a set (at least one) of program modules, including but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of these examples may include an implementation of a network environment.
[0103] A bus can represent one or more of several types of bus structures, including a memory cell bus or memory cell controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus that uses any of the various bus structures.
[0104] The electronic device can also communicate with one or more external devices (e.g., keyboards, pointing devices, Bluetooth devices, etc.), one or more devices that enable a user to interact with the electronic device, and / or any device that enables the electronic device to communicate with one or more other computing devices (e.g., routers, modems, etc.). This communication can be performed via input / output (I / O) interfaces. Furthermore, the electronic device can communicate with one or more networks (e.g., local area networks (LANs), wide area networks (WANs), and / or public networks, such as the Internet) via a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be understood that other hardware and / or software modules can be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
[0105] Through the above description, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solution according to this exemplary embodiment can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, terminal device, or network device, etc.) to execute the method of the exemplary embodiment of this application.
[0106] The above detailed embodiments are a description of the present invention. It should not be considered that the specific embodiments of the present invention are limited to these descriptions. For those skilled in the art, several simple deductions and substitutions can be made without departing from the concept of the present invention, and all of these should be considered to fall within the protection scope of the present invention.
Claims
1. A structure-aware multimodal open intent recognition method, characterized in that, Includes the following steps: Feature extraction is performed on multimodal input data to obtain feature representations for text modality, video modality, and audio modality; Cross-modal fusion of the feature representations of each modality is performed to obtain a fused representation; Calculate classification loss based on fusion representation; Based on the feature representation of each modality, the class centroid of each known intent category in the corresponding modality is calculated. Based on the class centroids, the inter-class similarity of each modality is calculated, and the inter-class similarity of each modality is structurally aligned to construct the cross-modal structural alignment loss. The intent recognition model is jointly trained using classification loss and cross-modal structure alignment loss; During the inference phase, feature extraction and cross-modal fusion processing are performed on the samples to be identified to obtain a fused representation; the trained intent recognition model is then used to classify known intents or detect unknown intents on the fused representation.
2. The structure-aware multimodal open intent recognition method according to claim 1, characterized in that structurally aligning the inter-class similarities of the modalities comprises: using the inter-class similarity of the text modality as a reference, constraining the inter-class similarities of the video modality and the audio modality to be consistent with the inter-class similarity of the text modality.
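As an illustration only, and not as part of the claims, the text-referenced structure alignment of claims 1 and 2 can be sketched as follows. The claims do not fix a similarity measure or a penalty, so this sketch assumes cosine similarity between class centroids and a mean squared difference against the text similarity matrix; the function names are hypothetical, and every known class is assumed to appear in the batch.

```python
import numpy as np

def class_centroids(features, labels, num_classes):
    """Mean feature vector of each known class, shape (num_classes, dim)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def interclass_similarity(centroids, eps=1e-8):
    """Pairwise similarity between class centroids (cosine is an assumption)."""
    normed = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    return normed @ normed.T

def structure_alignment_loss(text, video, audio, labels, num_classes):
    """Penalize video and audio inter-class similarities that deviate from
    the text inter-class similarity, which serves as the reference.
    The squared-error penalty is an assumption of this sketch."""
    s_text = interclass_similarity(class_centroids(text, labels, num_classes))
    s_video = interclass_similarity(class_centroids(video, labels, num_classes))
    s_audio = interclass_similarity(class_centroids(audio, labels, num_classes))
    return float(np.mean((s_video - s_text) ** 2) + np.mean((s_audio - s_text) ** 2))
```

In a training framework with automatic differentiation, the text similarity matrix would typically be treated as a fixed target so that only the video and audio branches are pulled toward it; that choice is likewise an assumption here.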
3. The structure-aware multimodal open intent recognition method according to claim 1, characterized in that performing cross-modal fusion of the feature representations of each modality comprises: using the feature representation of the text modality as the query vector, performing cross-attention interactions with the feature representations of the video modality and the audio modality, respectively, to obtain text-enhanced video features and text-enhanced audio features; residually fusing the feature representation of the text modality with the text-enhanced video features and the text-enhanced audio features to obtain a text-dominated fused intermediate representation; and concatenating and mapping the text-dominated fused intermediate representation, the feature representation of the video modality, and the feature representation of the audio modality to obtain the fused representation.
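As an illustration only, and not as part of the claims, the text-dominated fusion of claim 3 can be sketched as follows. The sketch assumes single-head scaled dot-product attention without learned query, key, or value projections, mean pooling over the sequence dimension before concatenation, and a single linear map `w_out`; none of these choices is specified by the claim, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product attention: each query row attends over the context rows."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ context

def fuse(text, video, audio, w_out):
    """text: (Lt, d), video: (Lv, d), audio: (La, d), w_out: (3 * d, d_out)."""
    text_video = cross_attention(text, video)        # text-enhanced video features
    text_audio = cross_attention(text, audio)        # text-enhanced audio features
    text_dominated = text + text_video + text_audio  # residual fusion around the text features
    pooled = np.concatenate([
        text_dominated.mean(axis=0),                 # mean pooling is an assumption
        video.mean(axis=0),
        audio.mean(axis=0),
    ])
    return pooled @ w_out                            # linear mapping to the fused representation
```

Keeping the raw video and audio features in the concatenation, alongside the text-dominated intermediate representation, is what lets modality-specific information reach the fused representation without passing through the text query.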
4. The structure-aware multimodal open intent recognition method according to claim 1, characterized in that jointly training the intent recognition model using the classification loss and the cross-modal structure alignment loss further comprises: comparing the original feature representation of each modality with the projected representation obtained after mapping, constraining the projected representation to retain the local neighborhood structure of the original feature representation, and constructing a single-modal structure preservation loss; and jointly training the intent recognition model using the classification loss, the cross-modal structure alignment loss, and the single-modal structure preservation loss.
5. The structure-aware multimodal open intent recognition method according to claim 4, characterized in that constructing the single-modal structure preservation loss comprises: mapping the original feature representation of each modality to a shared space to obtain the projected representation of each modality; constructing a first local neighborhood distribution of each sample in the original feature representation space of each modality, and a second local neighborhood distribution of each sample in the projected representation space of each modality; measuring the consistency between the first local neighborhood distribution and the second local neighborhood distribution of each sample, and weighting and combining the consistency measures of the samples at multiple scales to obtain the structure preservation sub-loss of each modality; and averaging the structure preservation sub-losses of the modalities to obtain the single-modal structure preservation loss.
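As an illustration only, and not as part of the claims, the structure preservation loss of claim 5 can be sketched as follows. The sketch assumes that a local neighborhood distribution is a softmax over negative squared Euclidean distances to the other samples in a batch, that "multiple scales" means several softmax temperatures with equal scale weights, and that consistency is measured by KL divergence; these are assumptions, and all names are hypothetical.

```python
import numpy as np

def neighborhood_distribution(x, temperature):
    """Row i is a distribution over the other samples, peaked on the nearest ones."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    logits = -d2 / temperature
    np.fill_diagonal(logits, -np.inf)            # a sample is not its own neighbor
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def structure_preservation_subloss(original, projected,
                                   temperatures=(0.5, 1.0, 2.0), eps=1e-12):
    """Sub-loss of one modality: KL(first || second) averaged over samples and scales."""
    total = 0.0
    for t in temperatures:
        p = neighborhood_distribution(original, t)   # first local neighborhood distribution
        q = neighborhood_distribution(projected, t)  # second local neighborhood distribution
        kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
        total += kl.mean() / len(temperatures)       # equal scale weights (assumption)
    return float(total)

def single_modal_structure_preservation_loss(originals, projecteds):
    """Average the per-modality sub-losses (for example text, video, audio)."""
    return float(np.mean([structure_preservation_subloss(o, p)
                          for o, p in zip(originals, projecteds)]))
```

The original and projected spaces may have different dimensions, since only pairwise distances within each space enter the loss. The sketch uses the same temperatures in both spaces, which implicitly assumes comparable feature scales.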
6. The structure-aware multimodal open intent recognition method according to claim 5, characterized in that, before averaging the structure preservation sub-losses of the modalities, the method further comprises: calculating, for each sample, a first matching probability between the feature representation of the text modality and the feature representation of the video modality; calculating, for each sample, a second matching probability between the feature representation of the text modality and the feature representation of the audio modality; determining the cross-modal semantic consistency weight of each sample based on its first matching probability and second matching probability; and, when measuring the consistency between the first local neighborhood distribution and the second local neighborhood distribution of each sample, using the cross-modal semantic consistency weight as the weighting coefficient of that sample, and weighting and combining the consistency measures of the samples at multiple scales to obtain the structure preservation sub-loss of each modality.
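As an illustration only, and not as part of the claims, the sample weighting of claim 6 can be sketched as follows. The claim does not state how the matching probabilities are produced or combined, so this sketch assumes features already projected into a shared space of equal dimension, a sigmoid of temperature-scaled cosine similarity as the matching probability, and the mean of the two probabilities as the weight; a learned matching head would be an equally plausible reading. All names are hypothetical.

```python
import numpy as np

def matching_probability(a, b, temperature=0.1, eps=1e-8):
    """Per-sample probability that paired rows of a and b match
    (sigmoid of scaled cosine similarity, an assumption of this sketch)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    cos = np.sum(a_n * b_n, axis=1)
    return 1.0 / (1.0 + np.exp(-cos / temperature))

def consistency_weights(text, video, audio):
    """Cross-modal semantic consistency weight of each sample."""
    p_text_video = matching_probability(text, video)   # first matching probability
    p_text_audio = matching_probability(text, audio)   # second matching probability
    return 0.5 * (p_text_video + p_text_audio)         # mean is an assumption

def weighted_mean(per_sample_values, weights):
    """Combine per-sample consistency measures using the sample weights."""
    return float(np.sum(weights * per_sample_values) / np.sum(weights))
```

Under this reading, samples whose video or audio disagrees with the text contribute less to the structure preservation sub-loss, so noisy or mismatched modalities exert less influence on the preserved neighborhood structure.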
7. The structure-aware multimodal open intent recognition method according to claim 1, characterized in that using the trained intent recognition model to classify known intents or detect unknown intents based on the fused representation comprises: calculating the center vector of each known class and a global covariance matrix based on the fused representations of the known-class samples in the training set; calculating the distance from the fused representation of the test sample to the center vector of each known class, and taking the minimum distance as the discrimination score; and, when the discrimination score is greater than a preset threshold, determining the test sample to be of unknown intent; otherwise, determining the test sample to be of known intent and inputting its fused representation into the classifier to obtain the specific category of the test sample.
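As an illustration only, and not as part of the claims, the open intent detection of claim 7 can be sketched as follows. The claim computes a global covariance matrix but does not name the distance, so this sketch assumes a Mahalanobis distance under a covariance shared by all known classes, with a small ridge term for invertibility; the `classifier` argument is a hypothetical callable standing in for the trained classification head, and `-1` is an arbitrary marker for unknown intent.

```python
import numpy as np

def fit_class_statistics(fused, labels, num_classes, ridge=1e-6):
    """Center vector of each known class and the inverse of the global covariance."""
    centers = np.stack([fused[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = fused - centers[labels]               # remove each sample's class center
    cov = centered.T @ centered / fused.shape[0]     # covariance shared by all classes
    cov += ridge * np.eye(fused.shape[1])            # ridge term is an assumption
    return centers, np.linalg.inv(cov)

def discrimination_score(x, centers, precision):
    """Minimum Mahalanobis distance from x to any known class center."""
    diff = centers - x
    d2 = np.einsum("cd,de,ce->c", diff, precision, diff)
    return float(np.sqrt(d2.min()))

def predict(x, centers, precision, threshold, classifier):
    """Return -1 for unknown intent, otherwise the class given by the classifier."""
    if discrimination_score(x, centers, precision) > threshold:
        return -1
    return int(classifier(x))
```

The threshold would normally be chosen on held-out data, for example from the distribution of discrimination scores on known-class validation samples; the claim only requires that it be preset.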
8. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the structure-aware multimodal open intent recognition method according to any one of claims 1-7.
9. A storage medium storing computer instructions thereon, characterized in that, when the computer instructions are executed, they perform the steps of the structure-aware multimodal open intent recognition method according to any one of claims 1-7.
10. A terminal comprising a memory and a processor, wherein the memory stores computer instructions executable on the processor, characterized in that, when the processor executes the computer instructions, it performs the steps of the structure-aware multimodal open intent recognition method according to any one of claims 1-7.