Cross-modal attention fusion method and device based on voiceprint features and language semantics
By employing a cross-modal attention fusion method, utilizing an emotion-aware masking mechanism and contextual emotion memory units, a cross-modal relationship graph of voiceprint features and semantic features is constructed. This solves the problem of the separation between voiceprint recognition and semantic understanding, achieving a high-accuracy and personalized voice interaction experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG GUANGXIN COMM SERVICES COMPANY
- Filing Date
- 2026-05-28
- Publication Date
- 2026-06-23
AI Technical Summary
Existing voiceprint recognition and semantic understanding suffer from low accuracy, insufficient personalization, and weak adaptability in voice interaction, mainly due to the fragmented processing of voiceprint features and language semantics, and the underutilization of different time sensitivities.
By employing an emotion-aware masking mechanism and contextual emotion memory units, voiceprint features and semantic features are fused across modalities. By constructing a cross-modal relationship graph and reconstructing the loss function, the temporal dependency and fusion of voiceprint features and semantic features are realized, and the dialogue strategy is dynamically adjusted to generate interactive responses.
It improves the accuracy and personalization of voice interaction, enabling the generation of intelligent responses that better match the user's true intentions and emotional needs in complex scenarios, and enhancing the depth and accuracy of semantic understanding.
Figure CN122266366A_ABST
Description
Technical Field
[0001] This invention relates to the field of speech recognition technology, and in particular to a cross-modal attention fusion method and apparatus based on voiceprint features and language semantics.
Background Technology
[0002] Currently, in the field of voice interaction, voiceprint recognition and semantic understanding usually adopt a separate processing architecture. For example, they can be fused by simply splicing acoustic features and text features, or by using a conventional attention mechanism to perform weighted fusion of speech and text features.
[0003] However, practice has revealed the following technical shortcomings in existing voiceprint and text fusion methods. First, current technologies typically treat speech signals as a static set of features, ignoring the dynamic evolution of voiceprint features (such as emotion and identity) and language semantics over time. This separation of voiceprint features and language semantics can easily lead to a disconnect between emotion recognition and intent understanding. Second, mainstream multimodal fusion methods usually only align features at the same time step, ignoring the different time sensitivities of voiceprint and semantic features. Third, existing voiceprint recognition and semantic understanding models are typically trained and deployed independently, lacking a collaborative optimization mechanism based on interactive data. These shortcomings often result in low accuracy, insufficient personalization, and weak adaptability in voice interaction scenarios with high quality requirements, such as telecommunications customer service, remote banking, and government hotlines, leading to suboptimal voice interaction experiences. Therefore, a new dynamic fusion scheme for voiceprint features and language semantics that achieves a high-accuracy and personalized voice interaction experience is particularly important.
Summary of the Invention
This invention provides a cross-modal attention fusion method and device based on voiceprint features and language semantics, which can achieve a high-accuracy and personalized voice interaction experience.
[0004] To address the aforementioned technical problems, the first aspect of this invention discloses a cross-modal attention fusion method based on voiceprint features and language semantics, the method comprising:
acquiring multimodal input data, which includes the user's voice data, text data, and historical interaction data for the current round;
performing feature extraction on the voice data to obtain the voiceprint features of the current round, wherein the voiceprint features include at least one of identity features, emotion features, and noise features;
based on the emotion-aware masking mechanism and the contextual emotion memory unit, fusing the emotion features contained in the voiceprint features into the text data to obtain the semantic features corresponding to the text data;
constructing a cross-modal relationship graph of the voiceprint features and the semantic features, and determining the temporal dependency relationship between the voiceprint features and the semantic features based on the edge weights generated by the cross-modal relationship graph;
based on the temporal dependency relationship, projecting the voiceprint features and the semantic features into the same semantic space, and performing a feature reconstruction fusion operation based on the reconstruction loss function to obtain the voiceprint and semantic fusion features;
adjusting the dialogue strategy corresponding to the user's current round, determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy, wherein the adjusted dialogue strategy is used together with the historical interaction data to generate interactive response data.
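The graph and reconstruction steps above can be illustrated with a minimal sketch. The text does not state how edge weights or the reconstruction loss are computed, so everything here is an assumption made for illustration: edge weights are taken as a softmax over scaled dot products, the two feature sets are assumed to already share one space, and the loss is a plain mean squared error. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def edge_weights(voice: List[Vector], sem: List[Vector]) -> List[List[float]]:
    """Row i holds the edge weights from voiceprint node i to every semantic node.

    ASSUMPTION: scaled dot-product softmax; the source names no formula.
    """
    dim = len(voice[0])
    return [softmax([dot(v, s) / math.sqrt(dim) for s in sem]) for v in voice]

def reconstruct(weights: List[List[float]], sem: List[Vector]) -> List[Vector]:
    """Rebuild each voiceprint node as the edge-weighted mix of semantic nodes."""
    dim = len(sem[0])
    return [[sum(w * s[k] for w, s in zip(row, sem)) for k in range(dim)]
            for row in weights]

def reconstruction_loss(voice: List[Vector], rebuilt: List[Vector]) -> float:
    """Mean squared error between original and rebuilt voiceprint nodes."""
    count = len(voice) * len(voice[0])
    return sum((a - b) ** 2
               for v, r in zip(voice, rebuilt)
               for a, b in zip(v, r)) / count

# Toy features: two voiceprint time steps, three semantic tokens, dimension 2.
voice = [[1.0, 0.0], [0.0, 1.0]]
sem = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = edge_weights(voice, sem)
loss = reconstruction_loss(voice, reconstruct(W, sem))
```

In this toy case the first voiceprint node attaches most strongly to the first semantic node, which is the kind of cross-modal association the edge weights are meant to expose.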
[0005] As an optional implementation, in the first aspect of the present invention, performing feature extraction on the speech data to obtain the voiceprint features of the current round includes: The original acoustic features are extracted from the speech data; and a preliminary modeling operation is performed on the original acoustic features to obtain basic acoustic features. The preliminary modeling operation includes at least one of temporal modeling operation, feature compression operation, and coding enhancement operation. The original acoustic features are input into a temporal decoupling network for decoupling to obtain a latent feature space, which includes at least two of the following: an identity feature subspace, an emotion feature subspace, and a noise feature subspace. The basic acoustic features are used as input to a multi-level temporal granularity parser for parsing, resulting in multi-granularity intermediate acoustic features. These multi-granularity intermediate acoustic features include at least two of the following: short-term emotional fluctuation features, mid-term intonation change features, and long-term identity features. The temporal granularity corresponding to the short-term emotional fluctuation features is smaller than that corresponding to the mid-term intonation change features, and the temporal granularity corresponding to the mid-term intonation change features is smaller than that corresponding to the long-term identity features. 
A time alignment operation is performed on the short-term emotional fluctuation features and the mid-term intonation change features; after the time alignment operation is completed, the short-term emotional fluctuation features and the mid-term intonation change features are fused to obtain joint features; based on the joint features, a feature mapping operation is performed on the emotion feature subspace to obtain the emotion features; based on the long-term identity features, the feature mapping operation is performed on the identity feature subspace to obtain the identity features; and the feature mapping operation is performed on the noise feature subspace to obtain the noise features.
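The multi-granularity parsing and time alignment described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser the text describes: each granularity is approximated by mean pooling over non-overlapping windows, alignment repeats coarse values to match the finer sequence, and fusion simply pairs the aligned values. The window sizes and the names `pool` and `align` are hypothetical.

```python
from typing import List, Tuple

def pool(frames: List[float], window: int) -> List[float]:
    """Mean-pool non-overlapping windows; a trailing partial window is kept."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled

def align(coarse: List[float], target_len: int) -> List[float]:
    """Stretch a coarse sequence to target_len by repeating values."""
    n = len(coarse)
    return [coarse[min(i * n // target_len, n - 1)] for i in range(target_len)]

# Toy per-frame acoustic feature (one scalar per frame, for readability).
frames = [0.1, 0.3, 0.2, 0.4, 0.9, 0.7, 0.8, 1.0]

short_term = pool(frames, 2)            # finest granularity: 2-frame windows
mid_term = pool(frames, 4)              # coarser granularity: 4-frame windows
long_term = pool(frames, len(frames))   # coarsest: the whole utterance

# Align the mid-term sequence to the short-term time axis, then pair them up.
mid_aligned = align(mid_term, len(short_term))
joint: List[Tuple[float, float]] = list(zip(short_term, mid_aligned))
```

The ordering of window sizes mirrors the requirement that the short-term granularity is smaller than the mid-term one, which in turn is smaller than the long-term one.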
[0006] As an optional implementation, in the first aspect of the present invention, adjusting the dialogue strategy corresponding to the user's current turn determined based on the voiceprint and semantic fusion features to obtain an adjusted dialogue strategy includes: Based on the voiceprint and semantic fusion features, the dialogue strategy corresponding to the user's current round is determined; Based on the contextual emotional memory unit, a set of continuous emotional features is determined according to the emotional features contained in the voiceprint features of the current round, wherein the set of continuous emotional features includes emotional features of multiple consecutive rounds, and all emotional features of the consecutive rounds include at least the emotional features of the current round. Based on a preset voiceprint change rate construction algorithm, the continuous emotional feature set is calculated to obtain the user's voiceprint change rate; and based on the user's voiceprint change rate, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined. Based on the continuous set of emotional features, determine the user's emotional profile and the user's emotional state corresponding to the current round; and based on the user's emotional profile and the user's emotional state, determine the service level adjustment parameters corresponding to the dialogue strategy. The dialogue strategy is adjusted based on the exploration ratio adjustment parameter and the service level adjustment parameter to obtain the adjusted dialogue strategy.
[0007] As an optional implementation, in the first aspect of the present invention, the voiceprint change rate construction algorithm includes a formula (the formula itself is not reproduced in this text) whose symbols denote, in order: the user's voiceprint change rate; the current round; a first round index and the emotional features corresponding to that round; a second round index and the emotional features corresponding to that round; and the number of rounds.
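Because the formula itself does not survive in this text, the following is only a hypothetical illustration of what a change rate over consecutive rounds could look like. It assumes the rate is the mean Euclidean distance between the emotional feature vectors of adjacent rounds in the window ending at the current round. That averaging form and the function name are assumptions, not the patent's formula.

```python
import math
from typing import Sequence

def voiceprint_change_rate(emotions: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance between emotional features of adjacent rounds.

    `emotions` is ordered oldest to newest; the last entry is the current round.
    ASSUMPTION: this averaging form is illustrative, not the patent's formula.
    """
    if len(emotions) < 2:
        return 0.0  # a single round has no change to measure
    steps = [math.dist(prev, cur) for prev, cur in zip(emotions, emotions[1:])]
    return sum(steps) / len(steps)

# Three consecutive rounds of toy 2-dimensional emotional features.
history = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
rate = voiceprint_change_rate(history)  # (5.0 + 0.0) / 2 = 2.5
```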
[0008] As an optional implementation, in the first aspect of the present invention, determining the exploration ratio adjustment parameter corresponding to the dialogue strategy based on the user's voiceprint change rate includes: Based on the user voiceprint change rate and at least one preset voiceprint change stage, a target change stage corresponding to the user voiceprint change rate is determined. All the voiceprint change stages include at least one of a first change stage, a second change stage, and a third change stage. The user voiceprint change rate in the first change stage is less than a first preset change rate. The user voiceprint change rate in the second change stage is greater than or equal to the first preset change rate and less than the second preset change rate. The user voiceprint change rate in the third change stage is greater than or equal to the second preset change rate. Based on the target change stage corresponding to the user's voiceprint change rate, determine the degree of the user's emotional change and the degree of voice change; Based on the degree of emotional change and the degree of voice change, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined.
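The stage selection above is a three-way threshold test and can be sketched directly. The two preset change rates and the adjustment value attached to each stage are placeholder numbers chosen for illustration; the text does not give concrete values or say how the degrees of emotional and voice change map to the exploration ratio.

```python
# ASSUMPTION: placeholder thresholds; the source gives no concrete values.
FIRST_PRESET_RATE = 0.2
SECOND_PRESET_RATE = 0.6

# ASSUMPTION: hypothetical exploration ratio adjustment per stage.
EXPLORATION_ADJUSTMENT = {1: 0.0, 2: 0.1, 3: 0.3}

def change_stage(rate: float,
                 first: float = FIRST_PRESET_RATE,
                 second: float = SECOND_PRESET_RATE) -> int:
    """Return 1, 2, or 3 for the first, second, or third change stage."""
    if rate < first:
        return 1          # rate < first preset change rate
    if rate < second:
        return 2          # first preset <= rate < second preset
    return 3              # rate >= second preset change rate

def exploration_adjustment(rate: float) -> float:
    """Look up the (hypothetical) adjustment parameter for the target stage."""
    return EXPLORATION_ADJUSTMENT[change_stage(rate)]
```

Note that the boundaries follow the text: a rate exactly equal to the first preset change rate falls into the second stage, and a rate equal to the second preset change rate falls into the third.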
[0009] As an optional implementation, in the first aspect of the present invention, the method further includes: When fusing the voiceprint features and the semantic features, the voiceprint recognition confidence level corresponding to the voiceprint features and the semantic understanding accuracy corresponding to the semantic features are detected. When the confidence level of the voiceprint recognition is detected to be lower than the preset confidence level, the semantic context analysis operation corresponding to the semantic feature is performed to generate a correction signal, and the voiceprint feature is corrected according to the correction signal. When the semantic understanding accuracy is detected to be lower than the preset accuracy, the corresponding semantic disambiguation operation is performed based on the identity features contained in the voiceprint features.
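The two checks above amount to a pair of independent threshold gates, sketched below. Only the gating logic is shown; the correction and disambiguation operations themselves are not specified in enough detail to implement. The preset values and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FusionCheck:
    correct_voiceprint: bool       # run semantic context analysis, then correct
    disambiguate_semantics: bool   # use identity features to disambiguate

def check_fusion(voiceprint_confidence: float,
                 semantic_accuracy: float,
                 preset_confidence: float = 0.8,   # ASSUMPTION: placeholder
                 preset_accuracy: float = 0.8      # ASSUMPTION: placeholder
                 ) -> FusionCheck:
    """Decide which mutual-correction paths to trigger during fusion."""
    return FusionCheck(
        correct_voiceprint=voiceprint_confidence < preset_confidence,
        disambiguate_semantics=semantic_accuracy < preset_accuracy,
    )

# Low voiceprint confidence, adequate semantic accuracy.
decision = check_fusion(voiceprint_confidence=0.55, semantic_accuracy=0.92)
```

Keeping the two gates independent lets each modality compensate for the other: semantics correct a weak voiceprint, and identity features resolve ambiguous semantics.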
[0010] As an optional implementation, in the first aspect of the present invention, the interactive response data is generated in the following manner: Based on the historical interaction data, the language style parameters are adjusted to obtain the adjusted language style parameters; The output content of the pre-built expression generator is controlled by the adjusted dialogue strategy, and the output style of the expression generator is controlled by the adjusted language style parameters to generate interactive response data.
[0011] A second aspect of this invention discloses a cross-modal attention fusion device based on voiceprint features and language semantics, the device comprising:
an acquisition module, used to acquire multimodal input data, which includes the user's voice data, text data, and historical interaction data for the current round;
an extraction module, used to perform feature extraction on the voice data to obtain the voiceprint features of the current round, wherein the voiceprint features include at least one of identity features, emotion features, and noise features;
a fusion module, used to fuse the emotion features contained in the voiceprint features into the text data based on the emotion-aware masking mechanism and the contextual emotion memory unit, to obtain the semantic features corresponding to the text data;
an association module, used to construct a cross-modal relationship graph of the voiceprint features and the semantic features, and to determine the temporal dependency relationship between the voiceprint features and the semantic features based on the edge weights generated by the cross-modal relationship graph;
the fusion module being further used to project the voiceprint features and the semantic features into the same semantic space based on the temporal dependency relationship, and to perform a feature reconstruction fusion operation based on the reconstruction loss function to obtain the voiceprint and semantic fusion features;
an adjustment module, used to adjust the dialogue strategy corresponding to the user's current round, determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy, wherein the adjusted dialogue strategy is used together with the historical interaction data to generate interactive response data.
[0012] As an optional implementation, in a second aspect of the present invention, the extraction module performs feature extraction on the speech data to obtain the voiceprint features of the current round in the following specific ways: The original acoustic features are extracted from the speech data; and a preliminary modeling operation is performed on the original acoustic features to obtain basic acoustic features. The preliminary modeling operation includes at least one of temporal modeling operation, feature compression operation, and coding enhancement operation. The original acoustic features are input into a temporal decoupling network for decoupling to obtain a latent feature space, which includes at least two of the following: an identity feature subspace, an emotion feature subspace, and a noise feature subspace. The basic acoustic features are used as input to a multi-level temporal granularity parser for parsing, resulting in multi-granularity intermediate acoustic features. These multi-granularity intermediate acoustic features include at least two of the following: short-term emotional fluctuation features, mid-term intonation change features, and long-term identity features. The temporal granularity corresponding to the short-term emotional fluctuation features is smaller than that corresponding to the mid-term intonation change features, and the temporal granularity corresponding to the mid-term intonation change features is smaller than that corresponding to the long-term identity features. 
A time alignment operation is performed on the short-term emotional fluctuation features and the mid-term intonation change features; after the time alignment operation is completed, the short-term emotional fluctuation features and the mid-term intonation change features are fused to obtain joint features; based on the joint features, a feature mapping operation is performed on the emotion feature subspace to obtain the emotion features; based on the long-term identity features, the feature mapping operation is performed on the identity feature subspace to obtain the identity features; and the feature mapping operation is performed on the noise feature subspace to obtain the noise features.
[0013] As an optional implementation, in a second aspect of the present invention, the adjustment module adjusts the dialogue strategy corresponding to the user's current turn, determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy in the following specific ways: Based on the voiceprint and semantic fusion features, the dialogue strategy corresponding to the user's current round is determined; Based on the contextual emotional memory unit, a set of continuous emotional features is determined according to the emotional features contained in the voiceprint features of the current round, wherein the set of continuous emotional features includes emotional features of multiple consecutive rounds, and all emotional features of the consecutive rounds include at least the emotional features of the current round. Based on a preset voiceprint change rate construction algorithm, the continuous emotional feature set is calculated to obtain the user's voiceprint change rate; and based on the user's voiceprint change rate, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined. Based on the continuous set of emotional features, determine the user's emotional profile and the user's emotional state corresponding to the current round; and based on the user's emotional profile and the user's emotional state, determine the service level adjustment parameters corresponding to the dialogue strategy. The dialogue strategy is adjusted based on the exploration ratio adjustment parameter and the service level adjustment parameter to obtain the adjusted dialogue strategy.
[0014] As an optional implementation, in a second aspect of the present invention, the voiceprint change rate construction algorithm includes a formula (the formula itself is not reproduced in this text) whose symbols denote, in order: the user's voiceprint change rate; the current round; a first round index and the emotional features corresponding to that round; a second round index and the emotional features corresponding to that round; and the number of rounds.
[0015] As an optional implementation, in a second aspect of the present invention, the specific method by which the adjustment module determines the exploration ratio adjustment parameter corresponding to the dialogue strategy based on the user's voiceprint change rate includes: Based on the user voiceprint change rate and at least one preset voiceprint change stage, a target change stage corresponding to the user voiceprint change rate is determined. All the voiceprint change stages include at least one of a first change stage, a second change stage, and a third change stage. The user voiceprint change rate in the first change stage is less than a first preset change rate. The user voiceprint change rate in the second change stage is greater than or equal to the first preset change rate and less than the second preset change rate. The user voiceprint change rate in the third change stage is greater than or equal to the second preset change rate. Based on the target change stage corresponding to the user's voiceprint change rate, determine the degree of the user's emotional change and the degree of voice change; Based on the degree of emotional change and the degree of voice change, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined.
[0016] As an optional implementation, in a second aspect of the invention, the apparatus further includes: The detection module is used to detect the voiceprint recognition confidence level corresponding to the voiceprint feature and the semantic understanding accuracy corresponding to the semantic feature when fusing the voiceprint feature and the semantic feature. The correction module is used to perform semantic context analysis operation corresponding to the semantic feature when the confidence level of the voiceprint recognition is detected to be lower than the preset confidence level, so as to generate a correction signal and correct the voiceprint feature according to the correction signal; The disambiguation module is used to perform corresponding semantic disambiguation operations based on the identity features contained in the voiceprint features when the semantic understanding accuracy is detected to be lower than a preset accuracy.
[0017] As an optional implementation, in a second aspect of the invention, the interactive response data is generated in the following manner: Based on the historical interaction data, the language style parameters are adjusted to obtain the adjusted language style parameters; The output content of the pre-built expression generator is controlled by the adjusted dialogue strategy, and the output style of the expression generator is controlled by the adjusted language style parameters to generate interactive response data.
[0018] A third aspect of the present invention discloses another cross-modal attention fusion device based on voiceprint features and language semantics, the device comprising: a memory storing executable program code; and a processor coupled to the memory; wherein the processor calls the executable program code stored in the memory to execute the cross-modal attention fusion method based on voiceprint features and language semantics disclosed in the first aspect of the present invention.
[0019] The fourth aspect of the present invention discloses a computer storage medium storing computer instructions, which, when invoked, are used to execute the cross-modal attention fusion method based on voiceprint features and language semantics disclosed in the first aspect of the present invention.
[0020] Compared with the prior art, the embodiments of the present invention have the following beneficial effects: In this embodiment of the invention, multimodal input data is acquired, including the user's current round of voice data, text data, and historical interaction data; feature extraction is performed on the voice data to obtain the voiceprint features of the current round, which include at least one of identity features, emotion features, and noise features; based on an emotion-aware masking mechanism and a contextual emotion memory unit, the emotion features contained in the voiceprint features are fused into the text data to obtain the semantic features corresponding to the text data; a cross-modal relationship graph of voiceprint features and semantic features is constructed; and based on the edge weights generated by the cross-modal relationship graph, the temporal dependency relationship between the voiceprint features and the semantic features is determined; based on the temporal dependency relationship, the voiceprint features and semantic features are projected into the same semantic space, and a feature reconstruction fusion operation is performed based on a reconstruction loss function to obtain the voiceprint and semantic fusion features; the dialogue strategy corresponding to the user's current round determined based on the voiceprint and semantic fusion features is adjusted to obtain the adjusted dialogue strategy; the adjusted dialogue strategy is used to generate interactive response data with historical interaction data. 
As can be seen, implementing this invention first acquires multimodal data including voice, text, and historical interactions, comprehensively capturing the user's current intent and contextual background and laying a data foundation for subsequent accurate understanding. It then performs feature extraction on the voice data to obtain voiceprint features containing identity, emotion, and noise characteristics, achieving fine-grained deconstruction of the multidimensional information in the voice signal. Next, based on the emotion-aware masking mechanism and the contextual emotion memory unit, it fuses the emotional features into the text data to obtain semantic features, so that the text semantics are no longer limited to the literal meaning but fully incorporate the user's true emotional tendencies, which helps significantly improve the depth and accuracy of semantic understanding. It then constructs a cross-modal relationship graph of the voiceprint features and the semantic features, explicitly modeling the complex associations between the two modalities through a graph structure, and determines the temporal dependency relationship based on the edge weights generated by the cross-modal relationship graph, which effectively captures the dynamic evolution of voice and text in the time dimension and helps avoid the information loss caused by static fusion. On this basis, projecting the voiceprint and semantic features into the same semantic space and performing feature reconstruction fusion based on a reconstruction loss function improves the alignment consistency and fusion accuracy of the voiceprint and semantic features. Finally, based on the fused features, the dialogue strategy is adjusted and combined with historical interaction data to generate interactive response data. 
This enables the system to dynamically optimize the dialogue direction according to the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Overall, this solution achieves deep collaboration between speech and text modalities at the semantic and emotional levels through feature extraction, emotion fusion, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction. It not only breaks through the bottleneck of traditional single-modal or shallow fusion methods in emotion perception, but also ensures the accuracy of fused features through joint optimization of cross-modal relationship graphs and reconstruction loss. This allows the dialogue system to make more intelligent responses that fit the user's true intent and emotional needs when facing complex real-world scenarios.
Description of the Drawings
[0021] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a flowchart of a cross-modal attention fusion method based on voiceprint features and language semantics disclosed in an embodiment of the present invention;
Figure 2 is a flowchart of another cross-modal attention fusion method based on voiceprint features and language semantics disclosed in an embodiment of the present invention;
Figure 3 is a schematic structural diagram of a cross-modal attention fusion device based on voiceprint features and language semantics disclosed in an embodiment of the present invention;
Figure 4 is a schematic structural diagram of another cross-modal attention fusion device based on voiceprint features and language semantics disclosed in an embodiment of the present invention;
Figure 5 is a schematic structural diagram of yet another cross-modal attention fusion device based on voiceprint features and language semantics disclosed in an embodiment of the present invention.
Detailed Implementation
[0023] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to such a process, method, apparatus, product, or device.
[0025] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
[0026] This invention discloses a cross-modal attention fusion method and apparatus based on voiceprint features and language semantics. First, by acquiring multimodal data including speech, text, and historical interactions, it comprehensively captures the user's current intent and contextual background, laying a data foundation for accurate subsequent understanding. Then, it performs feature extraction on the speech data to obtain voiceprint features containing identity, emotion, and noise characteristics, achieving fine-grained deconstruction of multidimensional information in the speech signal. Next, based on an emotion-aware masking mechanism and a contextual emotion memory unit, it fuses emotional features into the text data to obtain semantic features, making the text semantics no longer limited to literal meaning but fully incorporating the user's true emotional inclination, significantly improving the depth and accuracy of semantic understanding. Subsequently, it constructs a cross-modal relationship graph of voiceprint and semantic features, explicitly modeling the complex relationships between the two modalities through a graph structure. Based on the edge weights generated by the cross-modal relationship graph, it then determines temporal dependencies, effectively capturing the dynamic evolution of speech and text in the time dimension and thus avoiding the information loss caused by static fusion. By projecting voiceprint and semantic features into the same semantic space and performing feature reconstruction fusion based on a reconstruction loss function, the alignment consistency and fusion accuracy of voiceprint and semantic features can be improved. Finally, based on the fused features, the dialogue strategy is adjusted and combined with historical interaction data to generate interactive response data. 
This allows the system to dynamically optimize the dialogue direction based on the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Overall, this solution, through feature extraction, emotion fusion, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction, achieves deep synergy between speech and text modalities at the semantic and emotional levels. It not only overcomes the bottleneck of traditional single-modal or shallow fusion methods in emotion perception but also ensures the accuracy of fused features through joint optimization of cross-modal relationship graphs and reconstruction loss. This enables the dialogue system to make more intelligent responses that better match the user's true intent and emotional needs when facing complex real-world scenarios. These will be explained in detail below.
[0027] Example 1
Please refer to Figure 1, which is a flowchart illustrating a cross-modal attention fusion method based on voiceprint features and language semantics disclosed in an embodiment of the present invention. The method described in Figure 1 can be applied to a cross-modal attention fusion device based on voiceprint features and language semantics. This device may include fusion equipment or a fusion server, where the fusion server may be a cloud server or a local server. Optionally, the method can also be applied to a cross-modal attention fusion system based on voiceprint features and language semantics, which includes a multi-granularity voiceprint feature extraction module, a semantic-emotion joint coding module, a cross-modal temporal reconstruction fusion module, a dynamic strategy optimization module, and a closed-loop collaborative optimization module. The embodiments of the invention are not limited in this respect. As shown in Figure 1, the method can include the following operations:
101. Obtain multimodal input data.
[0028] In this embodiment of the invention, the multimodal input data includes the user's voice data and text data for the current round, together with historical interaction data. The "current round" refers to the current round of the dialogue between the user and the system.
[0029] 102. Perform feature extraction on the speech data to obtain the voiceprint features of the current round.
[0030] In this embodiment of the invention, optionally, the voiceprint features include at least one of identity features, emotion features, and noise features. Specifically, a multi-granularity voiceprint feature extraction module performs feature extraction operations on the speech data to obtain the voiceprint features for the current round. The multi-granularity voiceprint feature extraction module can extract the basic acoustic features of the speech data by employing an improved X-vector architecture combined with a WaveNet encoder; and set up multi-level temporal granularity parsers to perform feature extraction operations to obtain voiceprint features at different temporal granularities; and decompose the original acoustic features into identity subspace, emotion subspace, and noise subspace through a temporal decoupling network (TDN), wherein the TDN employs an adversarial training mechanism, guiding feature decoupling through an identity discriminator and an emotion discriminator.
[0031] 103. Based on the emotion-aware masking mechanism and the contextual emotion memory unit, fuse the emotion features contained in the voiceprint features into the text data to obtain the semantic features corresponding to the text data.
[0032] In this embodiment of the invention, specifically, the semantic-emotion joint coding module fuses the emotion features contained in the voiceprint features into the text data to obtain the semantic features corresponding to the text data. The semantic-emotion joint coding module uses a pre-trained language model (e.g., BERT, RoBERTa) as the basic encoder; sets up an emotion-aware masking mechanism that dynamically adjusts the masking ratio based on the emotional intensity output by the voiceprint module; and introduces a contextual emotion memory unit that stores the emotional evolution trajectory of historical dialogues, thus addressing the problem of emotion drift in long dialogues.
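The interplay between emotional intensity, masking ratio, and emotion memory can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the linear rule in `masking_ratio`, the direction of the adjustment (stronger emotion gives more masking), the constants `base_ratio` and `max_extra`, and the `EmotionMemory` class are all hypothetical choices made only for illustration.

```python
import random
from collections import deque


def masking_ratio(emotion_intensity, base_ratio=0.15, max_extra=0.25):
    """Map an emotion intensity in [0, 1] to a masking ratio.

    Assumption: a linear rule where stronger emotion means more masking.
    """
    intensity = min(max(emotion_intensity, 0.0), 1.0)
    return base_ratio + max_extra * intensity


def mask_tokens(tokens, ratio, rng, mask_token="[MASK]"):
    """Replace roughly `ratio` of the tokens with the mask token."""
    n_mask = round(len(tokens) * ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


class EmotionMemory:
    """Hypothetical memory unit: keeps the emotion vectors of the latest rounds."""

    def __init__(self, max_rounds=8):
        self.rounds = deque(maxlen=max_rounds)

    def update(self, emotion_vector):
        self.rounds.append(list(emotion_vector))

    def trajectory(self):
        return list(self.rounds)


rng = random.Random(0)
tokens = "my bill is wrong again and nobody helps".split()
memory = EmotionMemory(max_rounds=3)
for intensity, emotion in [(0.1, [0.1, 0.0]), (0.9, [0.8, 0.6])]:
    memory.update(emotion)
    ratio = masking_ratio(intensity)
    print(round(ratio, 3), mask_tokens(tokens, ratio, rng))
```

In a full system the masked tokens would be passed to the pre-trained encoder, and the stored trajectory would condition the encoding of later rounds; both steps are omitted here.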
[0033] 104. Construct a cross-modal relationship graph of voiceprint features and semantic features; and determine the temporal dependency relationship between voiceprint features and semantic features based on the edge weights generated by the cross-modal relationship graph.
[0034] In this embodiment of the invention, specifically, a cross-modal relationship graph of voiceprint features and semantic features is constructed through a cross-modal temporal reconstruction and fusion module; and the temporal dependency between voiceprint features and semantic features is determined based on the edge weights generated by the cross-modal relationship graph. The cross-modal temporal reconstruction and fusion module can be configured with feature temporal decoupling, dynamic relationship modeling, and reconstruction and fusion mechanisms. Specifically, the feature temporal decoupling mechanism projects voiceprint features and semantic features into latent spaces with different temporal resolutions using a multi-resolution temporal encoder (MRTE), avoiding the limitations of fixed time step alignment in traditional methods; dynamic relationship modeling designs a cross-modal relationship graph (CRG), where nodes represent feature segments at different temporal granularities, and edge weights are dynamically generated through a learnable function to capture the non-linear temporal dependency between voiceprint and semantic features.
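As a rough illustration of the cross-modal relationship graph, the following Python sketch treats each feature segment as a node and computes edge weights with a sigmoid over a bilinear score. The bilinear form, the fixed matrix `W` (standing in for learned parameters), and the helper names are assumptions; the source only states that edge weights come from a learnable function.

```python
import math


def edge_weight(voice_seg, text_seg, W, bias=0.0):
    """Sigmoid of a bilinear score v^T W s (W and bias stand in for learned parameters)."""
    score = sum(
        voice_seg[i] * W[i][j] * text_seg[j]
        for i in range(len(voice_seg))
        for j in range(len(text_seg))
    ) + bias
    return 1.0 / (1.0 + math.exp(-score))


def build_graph(voice_segs, text_segs, W):
    """Return {(voice_index, text_index): weight} for every cross-modal node pair."""
    return {
        (a, b): edge_weight(v, s, W)
        for a, v in enumerate(voice_segs)
        for b, s in enumerate(text_segs)
    }


def strongest_links(graph, n_voice):
    """For each voiceprint segment, pick the semantic segment with the largest edge weight."""
    links = {}
    for a in range(n_voice):
        candidates = {b: w for (va, b), w in graph.items() if va == a}
        links[a] = max(candidates, key=candidates.get)
    return links


voice_segs = [[1.0, 0.0], [0.0, 1.0]]              # toy voiceprint segments
text_segs = [[0.0, 2.0], [2.0, 0.0], [0.5, 0.5]]   # toy semantic segments
W = [[1.0, 0.0], [0.0, 1.0]]                       # identity matrix as a placeholder
graph = build_graph(voice_segs, text_segs, W)
print(strongest_links(graph, len(voice_segs)))
```

Because the weights are computed per segment pair rather than per fixed time step, a voiceprint segment can link most strongly to a semantic segment at a different position, which is the kind of non-aligned temporal dependency the graph is meant to capture.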
[0035] 105. Based on temporal dependency, project voiceprint features and semantic features into the same semantic space, and perform feature reconstruction fusion operation based on reconstruction loss function to obtain voiceprint and semantic fusion features.
[0036] In this embodiment of the invention, the reconstruction fusion mechanism sets a Feature Reconstruction Fusion (FRF) strategy. Unlike conventional weighted fusion, FRF first projects the decoupled features into a unified semantic space, and then guides the model to recover the original input through a reconstruction loss function, forcing the model to learn the essential association between voiceprint and semantics.
[0037] In this embodiment of the invention, the reconstruction loss function is expressed in terms of the following quantities: the voice data; the text data; the acoustic reconstruction function; the text reconstruction function; the identity features; the emotion features; the semantic features; and two loss balance coefficients.
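One plausible reading of such a loss is a weighted sum of two squared reconstruction errors, one for the voice data and one for the text data. The Python sketch below follows that reading only; the choice of squared error, the linear decoders, and the assignment of identity and emotion features to the acoustic decoder and semantic features to the text decoder are assumptions, as are all names (`lam_voice`, `lam_text`, `acoustic_decoder`, `text_decoder`).

```python
def squared_error(a, b):
    """Sum of squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def reconstruction_loss(x_voice, x_text, z_id, z_emo, z_sem,
                        acoustic_decoder, text_decoder,
                        lam_voice=1.0, lam_text=1.0):
    """Weighted sum of the voice and text reconstruction errors (assumed form)."""
    # Assumption: the voice data is rebuilt from identity + emotion features.
    voice_hat = linear(acoustic_decoder, z_id + z_emo)
    # Assumption: the text data is rebuilt from the semantic features alone.
    text_hat = linear(text_decoder, z_sem)
    return (lam_voice * squared_error(x_voice, voice_hat)
            + lam_text * squared_error(x_text, text_hat))


loss = reconstruction_loss(
    x_voice=[1.0, 2.0], x_text=[2.0],
    z_id=[1.0], z_emo=[2.0], z_sem=[0.5, 0.5],
    acoustic_decoder=[[1.0, 0.0], [0.0, 1.0]],
    text_decoder=[[1.0, 1.0]],
    lam_voice=1.0, lam_text=0.5,
)
print(loss)
```

In this toy call the voice data is reconstructed exactly, so only the text term contributes; the two coefficients control how strongly each modality's reconstruction error drives training.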
[0038] 106. Adjust the dialogue strategy for the user's current round, which is determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy.
[0039] In this embodiment of the invention, the adjusted dialogue strategy is used together with the historical interaction data to generate interactive response data. Specifically, the dynamic strategy optimization module adjusts the dialogue strategy for the user's current round, which is determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy. The dynamic strategy optimization module can construct a dialogue state representation that integrates the real-time voiceprint-semantic fusion features; set a policy gradient adapter that dynamically adjusts the exploration-exploitation balance of the dialogue strategy based on the user's voiceprint change rate (VCR); and use a hybrid training mechanism of imitation learning and reinforcement learning to keep the strategy adjustment synchronized with changes in the user's state.
[0040] It can be seen that implementing the cross-modal attention fusion method based on voiceprint features and language semantics described in Figure 1 first captures the user's current intent and contextual background by acquiring multimodal data including speech, text, and historical interactions, laying a data foundation for accurate subsequent understanding. It then extracts voiceprint features, including identity, emotion, and noise characteristics, from the speech data, achieving a fine-grained decomposition of the multidimensional information in the speech signal. Next, based on an emotion-aware masking mechanism and a contextual emotion memory unit, it fuses emotional features into the text data to obtain semantic features, so that the text semantics are no longer limited to literal meaning but incorporate the user's true emotional inclinations, improving the depth and accuracy of semantic understanding. Subsequently, it constructs a cross-modal relationship graph of voiceprint and semantic features, explicitly modeling the complex relationships between the two modalities through a graph structure. It then determines temporal dependencies based on the edge weights generated by the cross-modal relationship graph, capturing the dynamic evolution of speech and text over time and thus avoiding the drawbacks of static fusion. Information loss is addressed by projecting the voiceprint and semantic features into the same semantic space and performing feature reconstruction fusion based on a reconstruction loss function, which improves the alignment consistency and fusion accuracy of the voiceprint and semantic features. Finally, based on the fused features, the dialogue strategy is adjusted and combined with historical interaction data to generate interactive response data.
This enables the system to dynamically optimize the dialogue direction based on the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Overall, this solution achieves deep collaboration between speech and text modalities at the semantic and emotional levels through feature extraction, emotion fusion, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction. It not only overcomes the bottleneck of traditional single-modal or shallow fusion methods in emotion perception but also ensures the accuracy of fused features through joint optimization of cross-modal relationship graphs and reconstruction loss. This allows the dialogue system to make more intelligent responses that better match the user's true intent and emotional needs when facing complex real-world scenarios.
[0041] In an optional embodiment, performing feature extraction on the speech data in step 102 above to obtain the voiceprint features of the current round includes:
Extracting original acoustic features from the speech data, and performing a preliminary modeling operation on the original acoustic features to obtain basic acoustic features, where the preliminary modeling operation includes at least one of a temporal modeling operation, a feature compression operation, and a coding enhancement operation;
Inputting the original acoustic features into a temporal decoupling network for decoupling to obtain a latent feature space, which includes at least two of the following: an identity feature subspace, an emotion feature subspace, and a noise feature subspace;
Inputting the basic acoustic features into a multi-level temporal granularity parser for parsing to obtain multi-granularity intermediate acoustic features, which include at least two of the following: short-term emotional fluctuation features, mid-term intonation change features, and long-term identity features, where the temporal granularity of the short-term emotional fluctuation features is smaller than that of the mid-term intonation change features, which in turn is smaller than that of the long-term identity features;
Performing a time alignment operation on the short-term emotional fluctuation features and the mid-term intonation change features; after the time alignment operation is completed, fusing the short-term emotional fluctuation features and the mid-term intonation change features to obtain joint features; and, based on the joint features, performing a feature mapping operation on the emotion feature subspace to obtain the emotion features;
Based on the long-term identity features, performing a feature mapping operation on the identity feature subspace to obtain the identity features;
Performing a feature mapping operation on the noise feature subspace to obtain the noise features.
[0042] In this embodiment of the invention, the original acoustic features refer to the primary acoustic representations (such as frame-level vectors, spectral features, etc.) obtained from the original speech data through basic acoustic coding (such as WaveNet encoders or frame-level feature extraction), without undergoing multi-granularity modeling or semantic enhancement processing. The basic acoustic features refer to the input features obtained after preliminary modeling (such as temporal modeling, feature compression, or coding enhancement) based on the original acoustic features, used for subsequent multi-granularity parsing. That is, the basic acoustic features are obtained by encoding and modeling the original acoustic features. Furthermore, the original acoustic features serve as the input to the Temporal Decoupling Network (TDN), and the basic acoustic features serve as the input to a multi-level temporal granularity parser, used to extract features at different temporal granularities.
[0043] Specifically, the multi-level temporal granularity parser captures emotional fluctuations at a short-term granularity (e.g., 100-300 ms), intonation changes at a mid-term granularity (e.g., 1-3 s), and identity features at a long-term granularity (e.g., an entire sentence). The short-term emotional fluctuation features and mid-term intonation change features are mapped to a unified timeline, and the two types of features are fused using concatenation or an attention mechanism to obtain a joint feature representation. The joint features are then input into a neural network (e.g., an MLP or Transformer) for transformation to generate a high-level semantic representation, which is output as the emotion features for use by subsequent modules.
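The alignment and fusion step can be sketched in Python as follows. This is a simplified illustration using the concatenation option: short-term frames are averaged within each mid-term window so that both feature streams share one timeline. The averaging rule, the fixed number of frames per window, and the function names are assumptions, and the subsequent neural-network transformation is omitted.

```python
def align_to_midterm(short_feats, frames_per_window):
    """Average consecutive short-term frames so one vector remains per mid-term window."""
    aligned = []
    for start in range(0, len(short_feats), frames_per_window):
        window = short_feats[start:start + frames_per_window]
        dim = len(window[0])
        aligned.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return aligned


def fuse_by_concatenation(short_aligned, mid_feats):
    """Concatenate the aligned short-term vector with the mid-term vector of each window."""
    return [s + m for s, m in zip(short_aligned, mid_feats)]


# Toy data: eight short-term frames (e.g. 250 ms each) and two mid-term windows (e.g. 1 s each).
short_feats = [[0.0], [2.0], [4.0], [6.0], [1.0], [1.0], [1.0], [1.0]]
mid_feats = [[10.0], [20.0]]

aligned = align_to_midterm(short_feats, frames_per_window=4)
joint = fuse_by_concatenation(aligned, mid_feats)
print(joint)  # [[3.0, 10.0], [1.0, 20.0]]
```

An attention-based fusion would replace the plain averaging with learned weights over the frames in each window; concatenation is shown here only because it is the simpler of the two options named in the text.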
[0044] Specifically, the TDN decouples the original acoustic features into the corresponding subspace representations, namely the identity subspace, the emotion subspace, and the noise subspace. Subsequently, feature extraction or mapping (e.g., pooling, weighted aggregation, or nonlinear transformation) is performed on each subspace to obtain the identity features, the emotion features, and the noise features, respectively.
[0045] As can be seen, this optional embodiment extracts original acoustic features from the speech data and performs preliminary modeling operations such as temporal modeling, feature compression, and coding enhancement. This compresses redundant dimensions and enhances the anti-interference ability of the features while preserving rich acoustic information, providing high-quality basic acoustic features for subsequent processing. The original acoustic features are input into a temporal decoupling network for decoupling, resulting in a latent feature space covering subspaces such as identity, emotion, and noise. This separates the different speech factors and avoids feature confusion that could interfere with speaker recognition accuracy. Furthermore, the basic acoustic features are input into a multi-level temporal granularity parser to obtain multi-granularity intermediate acoustic features covering short-term emotional fluctuations, mid-term intonation changes, and long-term identity features, enabling the model to capture speaker information at multiple time scales simultaneously. Subsequently, the short-term emotional fluctuation features and mid-term intonation change features are time-aligned and then fused to obtain joint features, which ensures the temporal consistency of information at different time granularities and avoids information misalignment due to granularity differences. Based on the joint features, feature mapping is then performed on the emotion feature subspace, so that aligned and complete emotional information is used to extract the emotion features, improving the accuracy of emotion modeling. Feature mapping is also performed on the identity feature subspace based on the long-term identity features, making full use of identity attributes that remain stable over long periods to obtain highly discriminative identity features.
Additionally, feature mapping is performed on the noise subspace to obtain noise features, enabling explicit modeling of environmental noise and channel interference, facilitating subsequent noise suppression or precise noise decision-making. Finally, high-quality voiceprint features are output through decoupling, multi-granularity parsing, temporal alignment fusion, and subspace mapping. This effectively eliminates the interference of emotional fluctuations and environmental noise on identity recognition while retaining the discriminative power of identity features, providing accurate and highly distinguishable feature representations for subsequent voiceprint comparison or verification.
[0046] Example 2
Please refer to Figure 2, which is a flowchart illustrating a cross-modal attention fusion method based on voiceprint features and language semantics disclosed in an embodiment of the present invention. The method described in Figure 2 can be applied to a cross-modal attention fusion device based on voiceprint features and language semantics. This device may include fusion equipment or a fusion server, where the fusion server may be a cloud server or a local server. Optionally, the method can also be applied to a cross-modal attention fusion system based on voiceprint features and language semantics, which includes a multi-granularity voiceprint feature extraction module, a semantic-emotion joint coding module, a cross-modal temporal reconstruction fusion module, a dynamic strategy optimization module, and a closed-loop collaborative optimization module. The embodiments of the invention are not limited in this respect. As shown in Figure 2, the method can include the following operations:
201. Obtain multimodal input data.
[0047] 202. Perform feature extraction on the speech data to obtain the voiceprint features of the current round.
[0048] 203. Based on the emotion-aware masking mechanism and the contextual emotion memory unit, fuse the emotion features contained in the voiceprint features into the text data to obtain the semantic features corresponding to the text data.
[0049] 204. Construct a cross-modal relationship graph of voiceprint features and semantic features; and determine the temporal dependency relationship between voiceprint features and semantic features based on the edge weights generated by the cross-modal relationship graph.
[0050] 205. Based on temporal dependency, the voiceprint features and semantic features are projected into the same semantic space, and a feature reconstruction fusion operation is performed based on the reconstruction loss function to obtain the voiceprint and semantic fusion features.
[0051] In this embodiment of the invention, for other descriptions of steps 201-205, please refer to the detailed description of steps 101-105 in Embodiment 1. These descriptions will not be repeated in this embodiment of the invention.
[0052] 206. Based on the fusion features of voiceprint and semantics, determine the dialogue strategy corresponding to the user's current round.
[0053] In this embodiment of the invention, specifically, a reference dialogue strategy for the user is obtained from the dialogue system. The reference dialogue strategy can be the dialogue strategy corresponding to the user's previous round or the dialogue strategy in the user's historical dialogue. Then, by using voiceprint and semantic fusion features, a control coefficient corresponding to the reference dialogue strategy is generated, thereby determining the dialogue strategy corresponding to the user's current round based on the control coefficient corresponding to the reference dialogue strategy.
[0054] 207. Based on the contextual emotional memory unit, determine the set of continuous emotional features according to the emotional features contained in the voiceprint features of the current round.
[0055] In this embodiment of the invention, the continuous emotional feature set includes emotional features from multiple consecutive rounds, and the emotional features from all consecutive rounds include at least the emotional features from the current round.
[0056] 208. Apply the preset voiceprint change rate construction algorithm to the continuous emotional feature set to obtain the user's voiceprint change rate; and based on the user's voiceprint change rate, determine the exploration ratio adjustment parameter corresponding to the dialogue strategy.
[0057] In this embodiment of the invention, the voiceprint change rate construction algorithm includes: $$\mathrm{VCR}_t = \frac{1}{N}\sum_{i=t-N+1}^{t}\lVert e_i - e_{i-1}\rVert$$ where $\mathrm{VCR}_t$ denotes the user's voiceprint change rate; $t$ denotes the current round; $i$ denotes the $i$-th round and $e_i$ denotes the emotional feature corresponding to the $i$-th round; $i-1$ denotes the $(i-1)$-th round and $e_{i-1}$ denotes the emotional feature corresponding to the $(i-1)$-th round; and $N$ denotes the number of rounds. Simply put, $t$ is the current time step, corresponding to the index of the emotional feature at the current moment, not the total number of time granularities; $i$ is the index of a time step within the sliding window, i.e., the $i$-th time step (or the $i$-th emotional feature frame); $e_i$ is the emotional feature vector corresponding to the $i$-th time step, which originates from the emotional features output by the voiceprint module (and can be obtained by mapping the emotion subspace); and $N$ is the sliding window size, i.e., the number of time steps used to calculate the average rate of change (e.g., the most recent $N$ time steps or time segments). The VCR is obtained by averaging the change in emotional features over the most recent $N$ time steps, which characterizes the rate of emotional change.
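A sliding-window average of this kind can be sketched in a few lines of Python. The sketch assumes that the change between consecutive emotional features is measured with the Euclidean norm and that the window always ends at the latest feature; the function name and the toy values are hypothetical.

```python
import math


def voiceprint_change_rate(emotion_features, window):
    """Mean change between consecutive emotion vectors over the last `window` steps.

    Assumption: the size of each change is measured with the Euclidean norm.
    """
    if window < 1 or len(emotion_features) < window + 1:
        raise ValueError("need at least window + 1 emotion feature vectors")
    recent = emotion_features[-(window + 1):]
    total = 0.0
    for prev, cur in zip(recent, recent[1:]):
        total += math.sqrt(sum((c - p) ** 2 for c, p in zip(cur, prev)))
    return total / window


# Toy emotion vectors for four consecutive rounds, oldest first.
features = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [6.0, 8.0]]
print(voiceprint_change_rate(features, window=2))  # changes of 0 and 5 give 2.5
```

A window of size N needs N + 1 stored emotion vectors, since each term compares a round with the one before it; the continuous emotional feature set maintained by the contextual emotion memory unit would supply these vectors.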
[0058] 209. Based on the continuous set of emotional features, determine the user's emotional profile and the user's emotional state corresponding to the current round; and based on the user's emotional profile and the user's emotional state, determine the service level adjustment parameters corresponding to the dialogue strategy.
[0059] For example, in the telecommunications customer service scenario, identity verification can be completed within 3 seconds of the start of a conversation using voiceprint features, eliminating the need for a dedicated verification process; by integrating voiceprint emotion change rate and semantic context, potential user needs can be predicted 1-2 rounds into the conversation; personalized suggestion cards can be generated, including script recommendations, historical preferences, and emotional response strategies; and service levels and recommended content can be dynamically adjusted based on the user profile identified by voiceprint recognition and real-time emotional state.
[0060] 210. Adjust the dialogue strategy according to the exploration ratio adjustment parameters and the service level adjustment parameters to obtain the adjusted dialogue strategy.
[0061] It can be seen that implementing the cross-modal attention fusion method based on voiceprint features and language semantics described in Figure 2 first captures the user's current intent and contextual background by acquiring multimodal data including speech, text, and historical interactions, laying a data foundation for accurate subsequent understanding. It then extracts voiceprint features, including identity, emotion, and noise characteristics, from the speech data, achieving a fine-grained decomposition of the multidimensional information in the speech signal. Next, based on an emotion-aware masking mechanism and a contextual emotion memory unit, it fuses emotional features into the text data to obtain semantic features, so that the text semantics are no longer limited to literal meaning but incorporate the user's true emotional inclinations, improving the depth and accuracy of semantic understanding. Subsequently, it constructs a cross-modal relationship graph of voiceprint and semantic features, explicitly modeling the complex relationships between the two modalities through a graph structure. It then determines temporal dependencies based on the edge weights generated by the cross-modal relationship graph, capturing the dynamic evolution of speech and text over time and thus avoiding the drawbacks of static fusion. Information loss is addressed by projecting the voiceprint and semantic features into the same semantic space and performing feature reconstruction fusion based on a reconstruction loss function, which improves the alignment consistency and fusion accuracy of the voiceprint and semantic features. Finally, based on the fused features, the dialogue strategy is adjusted and combined with historical interaction data to generate interactive response data.
This enables the system to dynamically optimize the dialogue direction based on the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Overall, this solution achieves deep collaboration between speech and text modalities at the semantic and emotional levels through feature extraction, emotion fusion, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction. It not only overcomes the bottleneck of traditional single-modal or shallow fusion methods in emotion perception but also ensures the accuracy of fused features through joint optimization of cross-modal relationship graphs and reconstruction loss. This allows the dialogue system to make more intelligent responses that better match the user's true intent and emotional needs when facing complex real-world scenarios. Furthermore, by acquiring multimodal data including voice, text, and historical interactions, the method can comprehensively capture the user's current intent and contextual background, laying a solid data foundation for subsequent accurate understanding. Then, feature extraction is performed on the voice data to obtain voiceprint features encompassing identity, emotion, and noise characteristics, enabling a fine-grained decomposition of the multidimensional information in the voice signal and separating identity attributes, emotional states, and environmental interference. Subsequently, based on an emotion-aware masking mechanism and a contextual emotion memory unit, emotional features are fused into the text data to obtain semantic features, so that the text semantics are no longer limited to literal meaning but incorporate the user's true emotional inclinations, improving the depth and accuracy of semantic understanding.
A cross-modal relationship graph of voiceprint and semantic features is then constructed, explicitly modeling the complex relationships between the two modalities through a graph structure and compensating for the shortcomings of traditional vector concatenation methods in depicting nonlinear interaction relationships. Temporal dependencies are determined based on the edge weights generated by the cross-modal relationship graph, capturing the dynamic evolution of voice and text in the time dimension and avoiding the loss of temporal information caused by static fusion. The voiceprint and semantic features are then projected into the same semantic space, and a feature reconstruction fusion operation is performed based on a reconstruction loss function, which improves the alignment consistency and fusion accuracy of the features from both modalities. This ensures that the fused features retain their individual discriminative power while also possessing cross-modal complementarity. Finally, based on the fused features, the dialogue strategy is adjusted and interactive response data is generated in combination with historical interaction data. This enables the system to dynamically optimize the dialogue flow based on the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Through this series of processing operations, including multimodal data acquisition, fine-grained voiceprint decomposition, emotion-aware semantic enhancement, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction, deep synergy between speech and text modalities at the semantic and emotional levels can be achieved.
This not only breaks through the bottleneck of traditional single-modal or shallow fusion methods in emotion perception but also ensures the accuracy and robustness of the fused features through joint optimization of cross-modal relationship graphs and reconstruction loss. As a result, the dialogue system can make more intelligent responses that better match the user's true intent and emotional needs when facing complex real-world scenarios, providing a high-quality decision-making basis for subsequent dialogue management and personalized service execution.
[0062] In an optional embodiment, determining the exploration ratio adjustment parameter corresponding to the dialogue strategy based on the user's voiceprint change rate in step 208 above includes:
Based on the user's voiceprint change rate and at least one preset voiceprint change stage, determining the target change stage corresponding to the user's voiceprint change rate, where all voiceprint change stages include at least one of a first change stage, a second change stage, and a third change stage; the user's voiceprint change rate in the first change stage is less than a first preset change rate; the user's voiceprint change rate in the second change stage is greater than or equal to the first preset change rate and less than a second preset change rate; and the user's voiceprint change rate in the third change stage is greater than or equal to the second preset change rate;
Based on the target change stage corresponding to the user's voiceprint change rate, determining the degree of the user's emotional change and the degree of voice change;
Based on the degree of emotional change and the degree of voice change, determining the exploration ratio adjustment parameter corresponding to the dialogue strategy.
[0063] For example, taking a user voiceprint change rate (VCR) range of 0 to 1, the user's voiceprint change rate in the first change stage is less than 0.2 (denoted as VCR < 0.2). If the user's voiceprint change rate is in the first change stage, it can be determined that the user's emotional change is relatively stable and the voice change is small. The exploration ratio adjustment parameter can then be set to reduce the exploration ratio and enhance exploitation, prioritizing existing optimal dialogue strategies (such as standard scripts and historical preference strategies) and maintaining dialogue coherence and efficiency. The user's voiceprint change rate in the second change stage is greater than or equal to 0.2 and less than 0.5 (denoted as 0.2 ≤ VCR < 0.5). If the user's voiceprint change rate is in the second change stage, it can be determined that both emotional and vocal changes exhibit some fluctuation but are not drastic. The exploration ratio adjustment parameter is then set to a moderate increase, introducing some flexible adjustments (e.g., slight changes in speech style, more confirmatory expressions) to adapt to changes in the user's state. The user's voiceprint change rate in the third change stage is greater than or equal to 0.5 (denoted as VCR ≥ 0.5). If the user's voiceprint change rate is in the third change stage, it can be determined that the user's emotional or vocal state has changed significantly (e.g., from calm to agitation or dissatisfaction), and the exploration ratio adjustment parameter is set to a significant increase. New dialogue strategies (e.g., emotional soothing, switching to human intervention, adjusting tone or rhythm) are prioritized to respond quickly to changes in the user's state. By applying a different exploration-exploitation trade-off to each VCR interval, dynamic adaptive adjustment of the dialogue strategy is achieved.
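The stage rule in this example can be written as a small Python sketch. The thresholds 0.2 and 0.5 come from the example above; the size of each adjustment in `EXPLORATION_DELTA`, the clipping to the range 0 to 1, and the function names are hypothetical, since the text only gives the direction of each adjustment (reduce, moderate increase, significant increase).

```python
def change_stage(vcr, first_threshold=0.2, second_threshold=0.5):
    """Return 1, 2 or 3 for the first, second or third voiceprint change stage."""
    if vcr < first_threshold:
        return 1
    if vcr < second_threshold:
        return 2
    return 3


# Hypothetical adjustment sizes: only the signs and their ordering follow the text.
EXPLORATION_DELTA = {1: -0.10, 2: 0.05, 3: 0.30}


def adjusted_exploration_ratio(current_ratio, vcr):
    """Shift the exploration ratio according to the stage and keep it within [0, 1]."""
    delta = EXPLORATION_DELTA[change_stage(vcr)]
    return min(max(current_ratio + delta, 0.0), 1.0)


for vcr in (0.1, 0.3, 0.8):
    print(vcr, change_stage(vcr), round(adjusted_exploration_ratio(0.2, vcr), 2))
```

Keeping the thresholds as parameters reflects the fact that the first and second preset change rates are configurable values in the embodiment, with 0.2 and 0.5 being only one example.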
[0064] As can be seen, this optional embodiment first determines the target change stage based on the user's voiceprint change rate and the preset first, second, and third change stages: the change rate in the first stage is lower than the first preset change rate, in the second stage lies between the two preset change rates, and in the third stage is greater than or equal to the second preset change rate. Dividing voiceprint changes into multiple graded stages gives a fine-grained, hierarchical discrimination of how the user's voiceprint evolves, avoiding the coarse-grained misjudgments caused by a single threshold and depicting the user's current voiceprint state more accurately. Furthermore, based on the target change stage, the embodiment determines the user's degree of emotional change and degree of voice change, mapping the voiceprint change rate to quantitative indicators in both the emotion and speech dimensions. This interprets changes in the acoustic signal as changes in the user's psychological state and manner of expression, and provides a more interpretable basis for strategy adjustments. Subsequently, based on the degree of emotional and vocal changes, the exploration ratio adjustment parameters corresponding to the dialogue strategy are determined. Taking into account both emotional stability and vocal activity, the ratio of exploration to exploitation is dynamically adjusted. This allows the system to appropriately explore new topics when the user's emotions are stable, while prioritizing stabilization and soothing when the user's emotions fluctuate drastically, thus achieving more personalized and adaptive dialogue strategy control. Overall, through graded division of voiceprint change stages, quantitative mapping of multi-dimensional change degrees, and dynamic adjustment of the exploration ratio, this solution enables the subsequent dialogue strategy generation stage to match the user's current emotional and vocal state, so that the generated interactive response maintains reasonable dialogue diversity and flexibility while meeting the user's real needs.
[0065] In another alternative embodiment, the method may further include: When fusing voiceprint features and semantic features, the confidence level of voiceprint recognition corresponding to the voiceprint features and the accuracy of semantic understanding corresponding to the semantic features are detected. When the confidence level of voiceprint recognition is detected to be lower than the preset confidence level, the semantic context analysis operation corresponding to the semantic feature is performed to generate a correction signal, and the voiceprint feature is corrected according to the correction signal. When the semantic understanding accuracy is detected to be lower than the preset accuracy, the corresponding semantic disambiguation operation is performed based on the identity features contained in the voiceprint features.
[0066] In this embodiment of the invention, a closed-loop collaborative optimization module is used to detect the voiceprint recognition confidence level corresponding to voiceprint features and the semantic understanding accuracy corresponding to semantic features. Specifically, the closed-loop collaborative optimization module sets up a voiceprint-semantic bidirectional correction mechanism. When the voiceprint recognition confidence level is lower than a preset confidence level, for example, lower than 60%, a correction signal is generated using semantic context. When there is ambiguity in semantic understanding, disambiguation is performed using voiceprint identity features. A personalized expression generator is also constructed to dynamically adjust language style parameters based on historical interaction data, achieving automated annotation and filtering of interaction data, and continuously optimizing the model through online distillation technology.
[0067] As can be seen, this optional embodiment detects both the voiceprint recognition confidence and the semantic understanding accuracy when fusing voiceprint features and semantic features, thereby constructing a real-time quality monitoring mechanism for both modalities. This mechanism can promptly detect when either modality is unreliable in the current scenario, providing a trigger for subsequent adaptive correction. Furthermore, when the voiceprint recognition confidence is lower than the preset confidence level, semantic context analysis is performed to generate a correction signal, and the voiceprint features are corrected accordingly. This uses semantic information to compensate for and repair low-quality voiceprints, avoiding identity misjudgment or feature deviation caused by inaccurate voiceprint recognition and improving the fault tolerance of the fusion process. Additionally, when the semantic understanding accuracy is lower than the preset accuracy, semantic disambiguation is performed based on the identity features contained in the voiceprint features. By leveraging prior knowledge such as user profiles and behavioral habits carried by identity information, ambiguous semantics are constrained and disambiguated. This compensates for the shortcomings of text-only semantic understanding when context is missing or expression is ambiguous, thereby improving the accuracy of semantic understanding. Overall, through this bidirectional, complementary mechanism of confidence detection, correction, and disambiguation, the voiceprint and semantic modalities supervise and correct each other during fusion. This helps ensure that reliable fusion features can still be obtained even when the quality of a single modality deteriorates, supporting the accurate generation of subsequent dialogue strategies and the high-quality output of interactive responses.
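The bidirectional correction logic can be sketched as two independent checks. This is only an illustration under stated assumptions: the 0.6 confidence threshold is the example value from [0066], whereas `PRESET_ACCURACY`, the additive form of the correction, and the two callback parameters (`context_correction`, `disambiguate`) are hypothetical placeholders for components the text does not specify.

```python
from typing import Callable, List, Tuple

Vector = List[float]

PRESET_CONFIDENCE = 0.6  # example value from [0066]
PRESET_ACCURACY = 0.7    # assumed; the source gives no number


def bidirectional_correct(
    voiceprint: Vector,
    semantic: Vector,
    vp_confidence: float,
    sem_accuracy: float,
    context_correction: Callable[[Vector], Vector],
    disambiguate: Callable[[Vector, Vector], Vector],
) -> Tuple[Vector, Vector]:
    """Apply each correction only when its own quality check fails."""
    if vp_confidence < PRESET_CONFIDENCE:
        # Semantic context analysis yields a correction signal (assumed additive).
        signal = context_correction(semantic)
        voiceprint = [v + s for v, s in zip(voiceprint, signal)]
    if sem_accuracy < PRESET_ACCURACY:
        # Identity information carried by the voiceprint constrains the semantics.
        semantic = disambiguate(semantic, voiceprint)
    return voiceprint, semantic
```

In this sketch the voiceprint is corrected first, so a low-accuracy semantic feature is disambiguated against the already corrected voiceprint; the source does not state an ordering.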
[0068] In yet another optional embodiment, the interaction response data is generated in the following manner: Based on historical interaction data, adjust the language style parameters to obtain the adjusted language style parameters; The output content of the pre-built expression generator is controlled by the adjusted dialogue strategy, and the output style of the expression generator is controlled by the adjusted language style parameters to generate interactive response data.
[0069] In this embodiment of the invention, historical interaction data is used as input to drive the dynamic adjustment of language style parameters; the language style parameters further control the output style of the personalized expression generator.
[0070] Specifically, the first step is to acquire the user's historical interaction data. This historical interaction data includes basic user characteristics, voiceprint and emotional characteristics, semantic and intent characteristics, interaction behavior characteristics, and feedback and result characteristics. This historical interaction data provides a foundation for subsequent language style adjustments. For example, the user's historical call count, user type, average emotional intensity, emotional fluctuation, high-frequency intent categories, historical problem-solving rate, interruption frequency, speech rate, satisfaction rating, and complaint records can all serve as a basis for subsequent style adjustments.
[0071] After acquiring historical interaction data, the data undergoes structured processing. Since different types of historical interaction data have different data formats and dimensions—for example, call counts are count-based data, emotional intensity is continuous data, user type is categorical data, and complaint records are binary data—the system needs to normalize, encode, or statistically process this data to convert it into a unified feature representation. Through this processing, the raw historical interaction data can be transformed into feature vectors recognizable by the expression generator or style adjustment module.
[0072] For example, historical call counts can be normalized into call frequency features, user types can be encoded into user value level features, the proportions of emotions such as dissatisfaction and anxiety can be merged into negative emotion proportions, complaint records can be converted into complaint identifiers, and high-frequency intentions such as bill inquiries and package changes can be converted into business sensitivity features. After the above processing, the originally scattered historical interaction data is transformed into user feature data with a unified value range and clear meaning.
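The structured processing described above might look like the following sketch. The record keys, the cap `MAX_CALLS`, and the numeric encoding in `USER_VALUE_LEVEL` are assumptions made for illustration; the source only states that counts are normalized, categories encoded, emotion proportions merged, and complaints converted into an identifier.

```python
from typing import Any, Dict

MAX_CALLS = 50  # assumed cap for scaling call counts into [0, 1]
USER_VALUE_LEVEL = {"ordinary": 0.0, "high_value": 1.0}  # assumed encoding
NEGATIVE_EMOTIONS = ("dissatisfaction", "anxiety")
SENSITIVE_INTENTS = {"bill inquiry", "package change"}


def build_user_features(record: Dict[str, Any]) -> Dict[str, float]:
    """Convert one raw history record into features with a unified [0, 1] range."""
    emotions = record["emotion_distribution"]
    intents = set(record["top_intents"])
    return {
        # Count data: clip, then scale.
        "call_frequency": min(record["call_count"], MAX_CALLS) / MAX_CALLS,
        # Categorical data: table lookup.
        "user_value_level": USER_VALUE_LEVEL[record["user_type"]],
        # Merge the proportions of the negative emotions.
        "negative_emotion_ratio": sum(emotions.get(e, 0.0) for e in NEGATIVE_EMOTIONS),
        # Binary data: complaint identifier.
        "complaint_flag": 1.0 if record["has_complaint"] else 0.0,
        # Share of sensitive intents among the user's frequent intents.
        "business_sensitivity": len(intents & SENSITIVE_INTENTS) / len(SENSITIVE_INTENTS),
    }
```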
[0073] After obtaining the processed user feature data, intermediate profile metrics are further calculated based on these user features. These intermediate profile metrics are used to make a general judgment on the interaction methods a user might currently need. In other words, this solution does not directly adjust language style parameters based on a single feature, but rather integrates multiple related features to obtain a more stable user profile result. For example, the user's reassurance need index can be calculated based on average emotional intensity, emotional fluctuation, proportion of negative emotions, complaint records, and satisfaction rating; the user's efficiency preference index can be calculated based on historical call count, speech rate, interruption frequency, and historical problem resolution rate; the user's guidance and explanation need index can be calculated based on average number of conversation turns, pause duration, historical problem resolution rate, and intent diversity; and the formal compliance need index can be calculated based on user type, complaint records, and business sensitivity.
[0074] The aforementioned intermediate profile indicators have different functional divisions. The reassurance need index is mainly used to determine whether the response needs to be more reassuring and caring; the efficiency preference index is mainly used to determine whether the response needs to be more concise and direct; the guidance and explanation need index is mainly used to determine whether the response needs more steps of explanation and confirmation; and the formality and compliance need index is mainly used to determine whether the response needs to be more formal, cautious, and standardized. Therefore, the intermediate profile indicators are equivalent to a bridge connecting "historical interaction data" and "language style parameters".
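One way to compute the four intermediate profile indices is an equal-weight average of the features listed for each index. The sketch below assumes every input feature is already normalized to [0, 1]; the equal weights, the feature key names, and the inversion of satisfaction and resolution rate are assumptions, since the source names the inputs but not how they are combined.

```python
from typing import Dict, Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def profile_indices(f: Dict[str, float]) -> Dict[str, float]:
    """Combine several [0, 1] features into four intermediate profile indices."""
    return {
        # Low satisfaction is assumed to raise the need for reassurance.
        "reassurance_need": _mean([
            f["avg_emotion"], f["emotion_variance"], f["negative_emotion_ratio"],
            f["complaint_flag"], 1.0 - f["satisfaction"],
        ]),
        "efficiency_preference": _mean([
            f["call_frequency"], f["speech_rate"], f["interrupt_rate"],
            f["resolution_rate"],
        ]),
        # A low resolution rate is assumed to raise the need for guidance.
        "guidance_need": _mean([
            f["avg_turns"], f["pause_duration"], 1.0 - f["resolution_rate"],
            f["intent_diversity"],
        ]),
        "formal_compliance_need": _mean([
            f["user_value_level"], f["complaint_flag"], f["business_sensitivity"],
        ]),
    }
```

Because each index averages several features, a single outlying feature shifts it only partially, which matches the stated aim of a more stable profile result than any single feature would give.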
[0075] After obtaining intermediate profile metrics, the language style parameters are dynamically adjusted based on preset mapping rules or trained parameters. In other words, profile results such as the reassurance need index, efficiency preference index, guidance and explanation need index, and formal compliance need index can be mapped to specific language style control parameters. For example, when the reassurance need index is high, the emotional intensity can be increased, and the tone can be adjusted to a reassurance style; when the efficiency preference index is high, response redundancy can be reduced, making the response more concise; when the formal compliance need index is high, formality can be increased, making the response more standardized; and when the guidance and explanation need index is high, the proportion of confirmatory expressions can be increased, and step-by-step explanations can be appropriately added.
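A rule-based version of this mapping can be sketched as below. The cut-off `HIGH`, the step size `STEP`, and the dictionary keys are hypothetical; the source allows either preset mapping rules or trained parameters, and this sketch shows only the former.

```python
from typing import Any, Dict

HIGH = 0.6  # assumed cut-off for treating an index as "high"
STEP = 0.2  # assumed adjustment size


def adjust_style(style: Dict[str, Any], idx: Dict[str, float]) -> Dict[str, Any]:
    """Return a copy of `style` adjusted by simple threshold rules."""
    out = dict(style)
    if idx["reassurance_need"] >= HIGH:
        out["emotion_intensity"] = min(1.0, out["emotion_intensity"] + STEP)
        out["tone"] = "soothing"
    if idx["efficiency_preference"] >= HIGH:
        out["verbosity"] = max(0.0, out["verbosity"] - STEP)
    if idx["formal_compliance_need"] >= HIGH:
        out["formality"] = min(1.0, out["formality"] + STEP)
    if idx["guidance_need"] >= HIGH:
        out["confirmation_ratio"] = min(1.0, out["confirmation_ratio"] + STEP)
    return out
```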
[0076] Furthermore, the adjusted language style parameters serve as input control conditions for the expression generator. When generating interactive response data, the expression generator is controlled by the dialogue strategy to determine the business content, processing steps, and guiding logic that the response should include; and by the language style parameters to determine the tone, formality, emotional intensity, speech rate, sentence complexity, response length, and proportion of confirmatory expressions that the response should employ.
[0077] In other words, the dialogue strategy mainly controls "what to say," while the language style parameter mainly controls "how to say it." For example, in a bill inquiry scenario, the dialogue strategy determines whether the dialogue system needs to first confirm the user's intent, then query the bill details, and finally explain the reason for the change in cost; while the language style parameter determines whether the system expresses itself in a concise and direct way, in a reassuring and caring way, or in a more formal and cautious way.
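The separation between "what to say" and "how to say it" can be illustrated with a toy template generator. Everything here is hypothetical (the step names, the wording in `STEP_TEXT`, and the tone prefixes); a real expression generator would presumably be a learned model, but the control flow is the point: the strategy selects the content steps, and the style selects the wording applied to each step.

```python
from typing import Dict, List

# Hypothetical content steps for the bill inquiry scenario.
STEP_TEXT = {
    "confirm_intent": "you want to review this month's bill",
    "query_bill": "I am checking the bill details",
    "explain_change": "I will explain why the cost changed",
}

# Hypothetical wording per tone.
TONE_PREFIX = {
    "concise": "",
    "soothing": "Please rest assured, ",
    "formal": "Kindly note that ",
}


def generate_response(strategy: List[str], style: Dict[str, str]) -> str:
    """Strategy decides which steps appear; style decides how they are worded."""
    prefix = TONE_PREFIX[style["tone"]]
    sentences = [prefix + STEP_TEXT[step] + "." for step in strategy]
    # Capitalize the first letter of each sentence without altering the rest.
    return " ".join(s[0].upper() + s[1:] for s in sentences)
```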
[0078] For example, for users with a history of complaints, a high proportion of negative emotions, and who are considered high-value users, a higher reassurance need index and formal compliance need index can be calculated based on historical interaction data. Subsequently, the system will increase the formality and emotional intensity in the language style parameters and adjust the tone to reassurance or care. In this case, the response generated by the expression generator will be more likely to use expressions such as "Let me confirm for you," "I will check it for you," and "Please rest assured, I will explain the handling method directly."
[0079] Conversely, if a user has a history of speaking quickly, interrupting frequently, having a high problem-solving rate, and being relatively satisfied, a higher efficiency preference index and a lower reassurance need index will be calculated. In this case, by reducing response redundancy and minimizing repetitive explanations and excessive reassurance, the expression generator can produce more concise and direct responses. For example, the system could generate expressions like, "Let me first check your bill for this month, and then I'll explain the reason for the cost change later."
[0080] In actual interaction, the aforementioned language style parameters can be adjusted in real time based on the current conversation state. That is, language style parameters formed from historical interaction data can be used as initial values, and then dynamically updated by combining information such as real-time emotion, interruptions, current speech rate, and changes in intent within the current conversation. If the user's emotion intensifies in the current conversation, the emotional intensity and reassuring tone can be temporarily increased; if the user repeatedly interrupts system instructions, response redundancy can be reduced, sentence length shortened, and key processing information prioritized for output.
[0081] In this embodiment of the invention, historical interaction data can be structured into user-level and session-level features, as shown in the following example:
Basic user characteristics include: user ID (User_ID); historical call count (Call_Count, e.g., 12 times); and user type (User_Type, e.g., high-value user / ordinary user).
Voiceprint and emotional features include: average emotional intensity (Avg_Emotion_Score, e.g., 0.65); mean emotional fluctuation (Emotion_Variance, e.g., 0.18); and distribution of common emotional types (Emotion_Distribution, e.g., {calm: 40%, dissatisfaction: 35%, anxiety: 25%}).
Semantic and intent features include: high-frequency intent categories (Top_Intents, e.g., {bill inquiry, package change}); historical problem resolution rate (Resolution_Rate, e.g., 0.82); and average number of dialogue rounds (Avg_Turns, e.g., 6 rounds).
Interaction behavior characteristics include: interruption frequency (Interrupt_Rate, e.g., 0.3 times / minute); speech rate (Speech_Rate, e.g., 4.2 words / second); and pause duration (Pause_Duration, e.g., average 0.8 seconds).
Feedback and outcome characteristics include: user satisfaction score (e.g., 4.3 / 5) and complaint record (e.g., 0 / 1).
[0082] In this embodiment of the invention, language style parameters can be modeled as continuous or discrete control variables, as shown in the following examples:
Formality level (Formality_Level): value range [0,1], e.g., 0.8 (indicating slightly formal).
Emotion intensity: expresses the degree of emotion; value range [0,1], e.g., 0.6 (indicating moderate comfort).
Tone type: a discrete category, e.g., {neutral, caring, soothing, guiding}.
Speech rate control (Speech_Rate_Control): adjusts the pace of the response, e.g., slow (0.8x), normal (1.0x), fast (1.2x).
Sentence complexity: e.g., simple sentence / compound sentence (which can be quantified as average sentence length, e.g., 12 words).
Response redundancy (Verbosity_Level): controls the response length; value range [0,1], e.g., 0.4 (indicating a more concise response).
Confirmation ratio: e.g., 0.3 (meaning 30% of sentences contain confirmation or restatement).
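These control variables map naturally onto a small typed record. The sketch below uses the example values from the text as defaults; the class name, the field names that the text does not give as identifiers, and the range validation are assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tone(Enum):
    NEUTRAL = "neutral"
    CARING = "caring"
    SOOTHING = "soothing"
    GUIDING = "guiding"


@dataclass
class LanguageStyle:
    formality_level: float = 0.8      # [0, 1], 0.8 is slightly formal
    emotion_intensity: float = 0.6    # [0, 1]
    tone: Tone = Tone.NEUTRAL         # discrete category
    speech_rate_control: float = 1.0  # 0.8 slow, 1.0 normal, 1.2 fast
    avg_sentence_words: int = 12      # proxy for sentence complexity
    verbosity_level: float = 0.4      # [0, 1], lower is more concise
    confirmation_ratio: float = 0.3   # share of confirming sentences

    def __post_init__(self) -> None:
        # Reject values outside the documented [0, 1] ranges.
        for name in ("formality_level", "emotion_intensity",
                     "verbosity_level", "confirmation_ratio"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
```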
[0083] As can be seen, this optional embodiment can first adjust language style parameters based on historical interaction data, fully exploring the user's past language habits, expression preferences, and communication patterns in dialogues. This ensures that the adjusted language style parameters highly match the user's personalized communication style, reducing the occurrence of monotonous and mechanical responses, thereby significantly improving the naturalness and friendliness of the interaction. Furthermore, by controlling the output content of the expression generator through the adjusted dialogue strategy and controlling the output style through the adjusted language style parameters, it can achieve decoupled and precise control of dialogue content and expression form. This allows the system to generate semantically correct response content based on the current strategy decision and express it in a language style that matches the user, effectively balancing the accuracy and appropriateness of the response. This collaborative generation mechanism of history-driven style adaptation and strategy-style dual-channel control helps ensure that the final output interaction response data accurately responds to the user's intent at the semantic level and highly matches the user's habits at the style level. This helps ensure the real-time and stable output of subsequent interaction responses and also helps improve user satisfaction.
[0084] Example 3. Please refer to Figure 3, which is a schematic diagram of the structure of a cross-modal attention fusion device based on voiceprint features and language semantics, as disclosed in an embodiment of the present invention. The cross-modal attention fusion device based on voiceprint features and language semantics described in Figure 3 may include a fusion device or a fusion server, wherein the fusion server may include a cloud server or a local server. Optionally, the method can also be applied to a cross-modal attention fusion system based on voiceprint features and language semantics, and the system includes a multi-granularity voiceprint feature extraction module, a semantic-emotion joint coding module, a cross-modal temporal reconstruction fusion module, a dynamic strategy optimization module, and a closed-loop collaborative optimization module, which are not limited in the embodiments of the present invention. As shown in Figure 3, the cross-modal attention fusion device based on voiceprint features and language semantics may include: an acquisition module 301, used to acquire multimodal input data, which includes the user's voice data, text data, and historical interaction data for the current round.
[0085] The extraction module 302 is used to perform feature extraction operations on the speech data to obtain the voiceprint features of the current round. The voiceprint features include at least one of identity features, emotion features, and noise features.
[0086] The fusion module 303 is used to fuse the emotional features contained in the voiceprint features into the text data based on the emotion perception masking mechanism and the context emotion memory unit to obtain the semantic features corresponding to the text data.
[0087] The association module 304 is used to construct a cross-modal relationship graph of voiceprint features and semantic features; and based on the edge weights generated by the cross-modal relationship graph, to determine the temporal dependency relationship between voiceprint features and semantic features.
[0088] The fusion module 303 is also used to project voiceprint features and semantic features into the same semantic space based on temporal dependencies, and to perform feature reconstruction fusion operation based on reconstruction loss function to obtain voiceprint and semantic fusion features.
[0089] The adjustment module 305 is used to adjust the dialogue strategy corresponding to the current round of the user, which is determined based on the fusion features of voiceprint and semantics, to obtain the adjusted dialogue strategy; the adjusted dialogue strategy is used to generate interactive response data with historical interaction data.
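The data flow between modules 301 to 305 can be summarized in a skeleton like the one below. It only fixes the order of calls and what each module receives; the class name, the callable signatures, and the dictionary keys are assumptions, and each module's internals are left as injected callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class FusionDevice:
    acquire: Callable[[], Dict[str, Any]]        # acquisition module 301
    extract: Callable[[Any], Any]                # extraction module 302
    fuse_emotion: Callable[[Any, Any], Any]      # fusion module 303 (emotion into text)
    associate: Callable[[Any, Any], Any]         # association module 304
    reconstruct: Callable[[Any, Any, Any], Any]  # fusion module 303 (reconstruction fusion)
    adjust: Callable[[Any], Any]                 # adjustment module 305

    def run_turn(self) -> Tuple[Any, Any]:
        """Run one dialogue turn and return (adjusted strategy, history)."""
        data = self.acquire()
        voiceprint = self.extract(data["voice"])
        semantic = self.fuse_emotion(voiceprint, data["text"])
        dependency = self.associate(voiceprint, semantic)
        fused = self.reconstruct(voiceprint, semantic, dependency)
        strategy = self.adjust(fused)
        # The adjusted strategy is later combined with the history to build the response.
        return strategy, data["history"]
```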
[0090] It can be seen that the cross-modal attention fusion device based on voiceprint features and language semantics described in Figure 3 first acquires multimodal data including speech, text, and historical interactions, comprehensively capturing the user's current intent and contextual background and laying a data foundation for subsequent accurate understanding. It then performs feature extraction on the speech data to obtain voiceprint features containing identity, emotion, and noise characteristics, achieving fine-grained deconstruction of the multidimensional information in the speech signal. Next, based on an emotion-aware masking mechanism and contextual emotion memory units, it fuses emotional features into the text data to obtain semantic features, so that the text semantics are no longer limited to literal meaning but incorporate the user's true emotional inclinations, improving the depth and accuracy of semantic understanding. Subsequently, it constructs a cross-modal relationship graph of voiceprint and semantic features, explicitly modeling the complex relationships between the two modalities through a graph structure. Based on the edge weights generated by the cross-modal relationship graph, it determines temporal dependencies, capturing the dynamic evolution of speech and text in the time dimension and avoiding the information loss caused by static fusion. It then projects voiceprint and semantic features into the same semantic space and performs feature reconstruction fusion based on a reconstruction loss function, which improves the alignment consistency and fusion accuracy of voiceprint and semantic features. Finally, based on the fused features, the dialogue strategy is adjusted and combined with historical interaction data to generate interactive response data. This enables the system to dynamically optimize the dialogue direction based on the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Overall, this solution achieves deep collaboration between speech and text modalities at the semantic and emotional levels through feature extraction, emotion fusion, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction. It not only overcomes the bottleneck of traditional single-modal or shallow fusion methods in emotion perception but also supports the accuracy of fused features through joint optimization of the cross-modal relationship graph and the reconstruction loss. This allows the dialogue system to make more intelligent responses that better match the user's true intent and emotional needs when facing complex real-world scenarios.
[0091] In an optional embodiment, the extraction module 302 performs feature extraction on the speech data to obtain the voiceprint features of the current round in the following specific way: Extract raw acoustic features from the speech data, and perform preliminary modeling operations on the raw acoustic features to obtain basic acoustic features; the preliminary modeling operations include at least one of temporal modeling operations, feature compression operations, and coding enhancement operations. Input the raw acoustic features into a temporal decoupling network for decoupling to obtain a latent feature space, which includes at least two of the following: an identity feature subspace, an emotion feature subspace, and a noise feature subspace. Input the basic acoustic features into a multi-level temporal granularity parser for parsing to obtain multi-granularity intermediate acoustic features, which include at least two of the following: short-term emotional fluctuation features, mid-term intonation change features, and long-term identity features. The temporal granularity of the short-term emotional fluctuation features is smaller than that of the mid-term intonation change features, which in turn is smaller than that of the long-term identity features. Perform a time alignment operation on the short-term emotional fluctuation features and the mid-term intonation change features; after the time alignment operation is completed, fuse the short-term emotional fluctuation features and the mid-term intonation change features to obtain joint features; based on the joint features, perform a feature mapping operation on the emotion feature subspace to obtain the emotional features. Based on the long-term identity features, perform feature mapping on the identity feature subspace to obtain the identity features. Perform feature mapping on the noise feature subspace to obtain the noise features.
[0092] As can be seen, this optional embodiment extracts raw acoustic features from speech data and performs preliminary modeling operations including temporal modeling, feature compression, and encoding enhancement. This compresses redundant dimensions and enhances the anti-interference ability of the features while preserving rich acoustic information, providing high-quality basic acoustic features for subsequent processing. The raw acoustic features are then input into a temporal decoupling network for decoupling, resulting in a latent feature space covering subspaces such as identity, emotion, and noise. This separates the different speech factors and avoids feature confusion that could interfere with speaker recognition accuracy. Furthermore, the basic acoustic features are input into a multi-level temporal granularity parser to obtain multi-granularity intermediate acoustic features covering short-term emotional fluctuations, mid-term intonation changes, and long-term identity features, enabling the model to capture speaker information at multiple time scales simultaneously. Subsequently, the short-term emotional fluctuation features and mid-term intonation change features are time-aligned and then fused to obtain joint features. The alignment keeps information from different time granularities temporally consistent and avoids misalignment caused by granularity differences. Then, based on the joint features, feature mapping is performed on the emotion feature subspace, so that the aligned, complete emotional information is used to extract the emotional features, improving the accuracy of emotional modeling. Furthermore, feature mapping is performed on the identity feature subspace based on the long-term identity features, making full use of identity attributes that remain stable over long periods to obtain highly discriminative identity features. Additionally, feature mapping is performed on the noise feature subspace to obtain noise features, enabling explicit modeling of environmental noise and channel interference and facilitating subsequent noise suppression or noise-related decisions. Finally, high-quality voiceprint features are output through decoupling, multi-granularity parsing, temporal alignment fusion, and subspace mapping. This reduces the interference of emotional fluctuations and environmental noise on identity recognition while retaining the discriminative power of the identity features, providing accurate and highly distinguishable feature representations for subsequent voiceprint comparison or verification.
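The multi-granularity parsing and time alignment steps can be illustrated on a toy one-dimensional signal. This is a heavily simplified sketch, not the parser or decoupling network described above: it assumes scalar frame values, uses the per-window range as a stand-in for short-term fluctuation and the per-window mean as a stand-in for mid-term intonation, and assumes the mid-term window is a multiple of the short-term window.

```python
from typing import List, Sequence, Tuple


def window_range(frames: Sequence[float], window: int) -> List[float]:
    """Short-term view: spread (max minus min) inside each small window."""
    chunks = [frames[i:i + window] for i in range(0, len(frames), window)]
    return [max(c) - min(c) for c in chunks]


def window_mean(frames: Sequence[float], window: int) -> List[float]:
    """Coarser view: average inside each window."""
    chunks = [frames[i:i + window] for i in range(0, len(frames), window)]
    return [sum(c) / len(c) for c in chunks]


def parse_granularities(
    frames: Sequence[float], short: int = 2, mid: int = 4
) -> Tuple[List[Tuple[float, float]], float]:
    """Return (joint short/mid features per mid-term window, long-term feature)."""
    short_term = window_range(frames, short)
    mid_term = window_mean(frames, mid)
    long_term = sum(frames) / len(frames)
    # Time alignment: average the short-term values inside each mid-term window.
    aligned_short = window_mean(short_term, mid // short)
    # Fusion: pair the aligned features so each time step carries both scales.
    joint = list(zip(aligned_short, mid_term))
    return joint, long_term
```

For the frames `[1, 3, 2, 2, 5, 1, 4, 4]`, the joint features are `[(1.0, 2.0), (2.0, 3.5)]` and the long-term feature is `2.75`.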
[0093] In another optional embodiment, the adjustment module 305 adjusts the dialogue strategy corresponding to the user's current turn, determined based on the voiceprint and semantic fusion features. The specific methods for obtaining the adjusted dialogue strategy include: Based on the fusion features of voiceprint and semantics, the dialogue strategy corresponding to the user's current round is determined; Based on the contextual emotional memory unit, a continuous emotional feature set is determined according to the emotional features contained in the voiceprint features of the current round. The continuous emotional feature set includes emotional features from multiple consecutive rounds, and the emotional features of all consecutive rounds include at least the emotional features of the current round. Based on the preset voiceprint change rate construction algorithm, the continuous emotional feature set is calculated to obtain the user's voiceprint change rate; and based on the user's voiceprint change rate, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined. Based on the continuous set of emotional features, determine the user's emotional profile and the user's emotional state corresponding to the current round; and based on the user's emotional profile and user's emotional state, determine the service level adjustment parameters corresponding to the dialogue strategy. The dialogue strategy is adjusted based on the exploration ratio adjustment parameters and the service level adjustment parameters to obtain the adjusted dialogue strategy.
[0094] In this embodiment of the invention, the voiceprint change rate construction algorithm is given by a formula. [Formula and symbols not recoverable from the source text.] The quantities defined for the formula are: the user's voiceprint change rate; the current round; two indexed rounds and the emotional features corresponding to each of them; and the number of rounds.
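Because the formula itself is not legible in this text, the following sketch shows only one plausible form that is consistent with the listed quantities (emotional features of indexed rounds and a number of rounds): the mean Euclidean distance between the emotional feature vectors of consecutive rounds. The choice of distance, the averaging, and the function name are assumptions, not the patent's definition.

```python
import math
from typing import Sequence


def voiceprint_change_rate(emotions: Sequence[Sequence[float]]) -> float:
    """Mean distance between emotional features of consecutive rounds.

    `emotions` is the continuous emotional feature set, ordered by round,
    with the current round last. This is an assumed form of the algorithm.
    """
    if len(emotions) < 2:
        return 0.0  # no earlier round to compare against
    steps = [math.dist(a, b) for a, b in zip(emotions, emotions[1:])]
    return sum(steps) / len(steps)
```

Under this assumed form, a user whose emotional features are identical across rounds has a change rate of 0, and larger round-to-round shifts raise the rate; whether the result falls in the 0 to 1 range used by the stage thresholds depends on how the emotional features are scaled.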
[0095] As can be seen, this optional embodiment can comprehensively capture the user's current intent and context by acquiring multimodal data including voice, text, and historical interactions, laying a solid data foundation for subsequent accurate understanding. Furthermore, feature extraction is performed on the voice data to obtain voiceprint features covering identity, emotion, and noise characteristics, enabling fine-grained deconstruction of multidimensional information in the voice signal and effectively separating identity attributes, emotional states, and environmental interference. Subsequently, based on an emotion-aware masking mechanism and contextual emotion memory units, emotional features are fused into text data to obtain semantic features, making text semantics no longer limited to literal meaning but fully incorporating the user's true emotional inclination, which significantly improves the depth and accuracy of semantic understanding. A cross-modal relationship graph of voiceprint and semantic features is then constructed, explicitly modeling the complex relationships between the two modalities through graph structure, compensating for the shortcomings of traditional vector concatenation methods in depicting nonlinear interaction relationships. Temporal dependencies are determined based on the edge weights generated by the cross-modal relationship graph, effectively capturing the dynamic evolution of voice and text in the time dimension and avoiding the loss of temporal information caused by static fusion. Finally, the voiceprint and semantic features are combined... Projecting features onto the same semantic space and performing feature reconstruction and fusion operations based on the reconstruction loss function effectively improves the alignment consistency and fusion accuracy of features from both modalities, ensuring that the fused features retain their respective discriminative power while possessing cross-modal complementarity. 
Finally, based on the fused features, the dialogue strategy is adjusted and interactive response data is generated by combining historical interaction data. This enables the system to dynamically optimize the dialogue direction based on the user's real-time emotional state and semantic intent, generating more empathetic and contextually coherent interactive responses. Through a series of processing operations, including multimodal data acquisition, fine-grained voiceprint deconstruction, emotion-aware semantic enhancement, cross-modal relationship graph modeling, temporal dependency analysis, and joint space reconstruction, deep synergy between speech and text modalities at the semantic and emotional levels can be achieved. This not only breaks through the bottleneck of traditional single-modal or shallow fusion methods in emotion perception but also ensures the accuracy and robustness of the fused features through joint optimization of cross-modal relationship graphs and reconstruction loss. As a result, the dialogue system can make more intelligent responses that fit the user's true intent and emotional needs when facing complex real-world scenarios, providing a high-quality decision-making basis for subsequent dialogue management and personalized service execution.
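The projection and reconstruction step described above can be illustrated with a minimal Python sketch. The embodiment does not specify the projection form, the fusion rule, or the exact loss, so everything here is an assumption made for illustration: linear projections (`w_voice`, `w_text`), fusion by averaging, tied-weight decoders, and a mean-squared-error reconstruction loss. The sketch also omits the temporal dependency derived from the relationship graph.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fuse_and_score(
    voice: Vector, text: Vector, w_voice: Matrix, w_text: Matrix
) -> Tuple[Vector, float]:
    """Project both modalities into one space, fuse, and score the reconstruction."""
    z_voice = matvec(w_voice, voice)  # shared-space embedding of the voiceprint
    z_text = matvec(w_text, text)     # shared-space embedding of the semantics
    # Assumed fusion rule: element-wise average of the two embeddings.
    fused = [(a + b) / 2.0 for a, b in zip(z_voice, z_text)]
    # Assumed tied-weight decoders: rebuild each input from the fused vector.
    voice_hat = matvec(transpose(w_voice), fused)
    text_hat = matvec(transpose(w_text), fused)
    loss = mse(voice, voice_hat) + mse(text, text_hat)
    return fused, loss
```

A low loss indicates that the fused vector still carries enough information to recover both inputs, which is one way to read the requirement that the fused features retain the discriminative power of each modality.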
[0096] In this optional embodiment, as an optional implementation method, the adjustment module 305 determines the exploration ratio adjustment parameter corresponding to the dialogue strategy based on the user's voiceprint change rate in the following specific ways: Based on the user's voiceprint change rate and at least one preset voiceprint change stage, the target change stage corresponding to the user's voiceprint change rate is determined. All voiceprint change stages include at least one of the first change stage, the second change stage, and the third change stage. The user's voiceprint change rate in the first change stage is less than the first preset change rate. The user's voiceprint change rate in the second change stage is greater than or equal to the first preset change rate and less than the second preset change rate. The user's voiceprint change rate in the third change stage is greater than or equal to the second preset change rate. Based on the target change stage corresponding to the user's voiceprint change rate, determine the degree of the user's emotional change and the degree of voice change; Based on the degree of emotional and vocal changes, the exploration ratio adjustment parameters corresponding to the dialogue strategy are determined.
[0097] As can be seen, this optional implementation can first determine the target change stage based on the user's voiceprint change rate and the preset first, second, and third change stages. The change rate in the first stage is below the first preset change rate, the change rate in the second stage lies between the first and second preset change rates, and the change rate in the third stage is at or above the second preset change rate. By dividing voiceprint changes into multiple graded stages, it achieves a refined hierarchical discrimination of the dynamic evolution of the user's voiceprint, avoiding the coarse-grained misjudgments caused by a single threshold, thus more accurately depicting the user's current voiceprint state. Furthermore, based on the target change stage, it determines the user's degree of emotional change and degree of voice change, mapping the voiceprint change rate to quantitative indicators in both the emotion and voice dimensions. This translates changes in the acoustic signal into changes in the user's psychological state and manner of expression, and provides a more interpretable basis for strategy adjustments. Subsequently, based on the degree of emotional change and the degree of voice change, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined. Taking into account both emotional stability and vocal activity, the ratio of exploration to exploitation is dynamically adjusted. This allows the system to appropriately explore new topics when the user's emotions are stable, while prioritizing stabilization and soothing when the user's emotions fluctuate drastically, thus achieving more personalized and adaptive dialogue strategy control. 
Overall, this solution, through gradient division of voiceprint change stages, quantitative mapping of multi-dimensional change degrees, and dynamic adjustment of the exploration ratio, enables the subsequent dialogue strategy generation stage to accurately match the user's current emotional and vocal state, ensuring that the generated interactive response maintains reasonable dialogue diversity and flexibility while meeting the user's real needs.
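The stage determination in paragraphs [0096] and [0097] can be illustrated with a minimal Python sketch. Only the three-stage comparison against the first and second preset change rates comes from the embodiment. The numeric thresholds, the per-stage degree values in `STAGE_DEGREES`, and the rule that combines the two degrees into an exploration ratio adjustment are hypothetical and chosen only to make the example runnable.

```python
# Hypothetical preset change rates; the embodiment gives no numeric values.
FIRST_PRESET_RATE = 0.2
SECOND_PRESET_RATE = 0.6

# Hypothetical mapping: stage -> (emotional change degree, voice change degree).
STAGE_DEGREES = {
    1: (0.1, 0.1),
    2: (0.5, 0.4),
    3: (0.9, 0.8),
}


def target_change_stage(rate: float) -> int:
    """Return 1, 2 or 3 according to the stage definitions in [0096]."""
    if rate < FIRST_PRESET_RATE:
        return 1  # below the first preset change rate
    if rate < SECOND_PRESET_RATE:
        return 2  # at or above the first, below the second
    return 3      # at or above the second preset change rate


def exploration_adjustment(rate: float) -> float:
    """Assumed rule: the larger the change, the smaller the exploration ratio."""
    emotion_degree, voice_degree = STAGE_DEGREES[target_change_stage(rate)]
    # Assumption: scale exploration by the mean stability (1 - degree).
    return 1.0 - (emotion_degree + voice_degree) / 2.0
```

Under these assumed values, a stable user (first stage) keeps most of the exploration budget, while a user in the third stage is steered toward conservative, soothing responses.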
[0098] In yet another optional embodiment, as shown in Figure 4, which is a schematic diagram of a cross-modal attention fusion device based on voiceprint features and language semantics disclosed in an embodiment of the present invention, the device may further include: The detection module 306 is used to detect the voiceprint recognition confidence corresponding to the voiceprint features and the semantic understanding accuracy corresponding to the semantic features when fusing voiceprint features and semantic features.
[0099] The correction module 307 is used to perform semantic context analysis operation corresponding to the semantic features when the confidence level of voiceprint recognition is detected to be lower than the preset confidence level, so as to generate a correction signal and correct the voiceprint features according to the correction signal.
[0100] The disambiguation module 308 is used to perform corresponding semantic disambiguation operations based on the identity features contained in the voiceprint features when the semantic understanding accuracy is detected to be lower than the preset accuracy.
[0101] As can be seen, this optional embodiment can simultaneously detect the confidence level of voiceprint recognition and the accuracy of semantic understanding when fusing voiceprint features and semantic features, thereby constructing a real-time monitoring mechanism for dual-modal quality. This mechanism can promptly detect unreliable states of either modality in the current scenario, providing a trigger basis for subsequent adaptive correction. Furthermore, when the voiceprint recognition confidence level is lower than a preset confidence level, semantic context analysis is performed to generate a correction signal and correct the voiceprint features accordingly. This effectively utilizes semantic information to compensate for and repair low-quality voiceprints, avoiding misjudgment of identity or feature deviation caused by inaccurate voiceprint recognition, thus improving the fault tolerance of the fusion process. Additionally, when the semantic understanding accuracy is lower than a preset accuracy level, semantic disambiguation is performed based on the identity features contained in the voiceprint features. By leveraging prior knowledge such as user profiles and behavioral habits carried by identity information, ambiguous semantics are constrained and disambiguated. This effectively compensates for the shortcomings of pure text semantic understanding when context is missing or expression is ambiguous, thereby significantly improving the accuracy of semantic understanding. Overall, this scheme, through a bidirectional, complementary mechanism of confidence detection, correction, and disambiguation, enables mutual supervision and error correction between the voiceprint and semantic modalities during the fusion process. 
This helps ensure that reliable fusion features can still be obtained even when the quality of a single modality deteriorates, providing a solid guarantee for the accurate generation of subsequent dialogue strategies and the high-quality output of interactive responses.
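The bidirectional check performed by modules 306 to 308 can be sketched as a small control flow. This is an illustrative sketch only: the function name `cross_check`, the default thresholds, and the two callables `correct_voiceprint` and `disambiguate` are hypothetical placeholders, since the embodiment does not specify how the correction signal or the disambiguation is computed.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def cross_check(
    voiceprint: Vector,
    semantic: Vector,
    voiceprint_confidence: float,
    semantic_accuracy: float,
    correct_voiceprint: Callable[[Vector, Vector], Vector],
    disambiguate: Callable[[Vector, Vector], Vector],
    preset_confidence: float = 0.7,  # hypothetical value
    preset_accuracy: float = 0.7,    # hypothetical value
) -> Tuple[Vector, Vector]:
    """Apply the two conditional repairs described in [0099] and [0100]."""
    if voiceprint_confidence < preset_confidence:
        # Semantic context drives the correction of the voiceprint features.
        voiceprint = correct_voiceprint(voiceprint, semantic)
    if semantic_accuracy < preset_accuracy:
        # Identity information in the voiceprint drives semantic disambiguation.
        semantic = disambiguate(semantic, voiceprint)
    return voiceprint, semantic
```

In this sketch, when both modalities fall below their thresholds the disambiguation step sees the already corrected voiceprint. That ordering is a design choice of the example, not something the embodiment prescribes.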
[0102] In yet another optional embodiment, the interaction response data is generated in the following manner: Based on historical interaction data, adjust the language style parameters to obtain the adjusted language style parameters; The output content of the pre-built expression generator is controlled by the adjusted dialogue strategy, and the output style of the expression generator is controlled by the adjusted language style parameters to generate interactive response data.
[0103] As can be seen, this optional embodiment can first adjust language style parameters based on historical interaction data, fully exploring the user's past language habits, expression preferences, and communication patterns in dialogues. This ensures that the adjusted language style parameters highly match the user's personalized communication style, reducing the occurrence of monotonous and mechanical responses, thereby significantly improving the naturalness and friendliness of the interaction. Furthermore, by controlling the output content of the expression generator through the adjusted dialogue strategy and controlling the output style through the adjusted language style parameters, it can achieve decoupled and precise control of dialogue content and expression form. This allows the system to generate semantically correct response content based on the current strategy decision and express it in a language style that matches the user, effectively balancing the accuracy and appropriateness of the response. This collaborative generation mechanism of history-driven style adaptation and strategy-style dual-channel control helps ensure that the final output interaction response data accurately responds to the user's intent at the semantic level and highly matches the user's habits at the style level. This helps ensure the real-time and stable output of subsequent interaction responses and also helps improve user satisfaction.
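The decoupled control of content and style described in paragraphs [0102] and [0103] can be sketched as follows. The style dimensions (`formality`, `verbosity`), the length-based adjustment heuristic, the strategy labels, and the template sentences are all hypothetical; the embodiment states only that the dialogue strategy controls the output content and the adjusted language style parameters control the output style.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StyleParams:
    formality: float = 0.5  # hypothetical style dimension in [0, 1]
    verbosity: float = 0.5  # hypothetical style dimension in [0, 1]


def adjust_style(style: StyleParams, history: List[str]) -> StyleParams:
    """Assumed heuristic: move verbosity toward the user's average message length."""
    if not history:
        return style
    avg_words = sum(len(m.split()) for m in history) / len(history)
    target = min(avg_words / 20.0, 1.0)  # 20 words is an arbitrary cap
    return StyleParams(style.formality, (style.verbosity + target) / 2.0)


def generate_response(strategy: str, style: StyleParams) -> str:
    """Content comes from the strategy; style only changes the surface form."""
    content = {
        "soothe": "I understand this is frustrating",
        "explore": "There is a related option you might like",
    }.get(strategy, "How can I help")
    if style.verbosity > 0.5:
        content += ", and I am happy to go through the details with you"
    return content + ("." if style.formality >= 0.5 else "!")
```

Keeping the two inputs separate means the same strategy decision can be rendered tersely for a user who writes short messages and at greater length for a user who writes long ones, without changing what is being said.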
[0104] Example 4 Please see Figure 5, which is a schematic diagram of another cross-modal attention fusion device based on voiceprint features and language semantics disclosed in an embodiment of the present invention. As shown in Figure 5, the cross-modal attention fusion device based on voiceprint features and language semantics may include: Memory 401 storing executable program code; Processor 402 coupled to memory 401; The processor 402 calls the executable program code stored in the memory 401 to execute the steps in the cross-modal attention fusion method based on voiceprint features and language semantics described in Embodiment 1 or Embodiment 2 of the present invention.
[0105] Example 5 This invention discloses a computer storage medium storing computer instructions. When these computer instructions are invoked, they are used to execute the steps in the cross-modal attention fusion method based on voiceprint features and language semantics described in Embodiment 1 or Embodiment 2 of this invention.
[0106] Example 6 This invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the steps in the cross-modal attention fusion method based on voiceprint features and language semantics described in Embodiment 1 or Embodiment 2.
[0107] The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0108] Through the detailed description of the above embodiments, those skilled in the art can clearly understand that each implementation can be realized by means of software plus the necessary general-purpose hardware platform, and of course it can also be realized in hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
[0109] Finally, it should be noted that the cross-modal attention fusion method and apparatus based on voiceprint features and language semantics disclosed in the embodiments of the present invention are merely preferred embodiments of the present invention and are only used to illustrate the technical solutions of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A cross-modal attention fusion method based on voiceprint features and language semantics, characterized in that, The method includes: Acquire multimodal input data, which includes the user's voice data, text data, and historical interaction data for the current round; Perform feature extraction on the speech data to obtain the voiceprint features of the current round, wherein the voiceprint features include at least one of identity features, emotion features, and noise features; Based on the emotion-aware masking mechanism and the contextual emotion memory unit, the emotion features contained in the voiceprint features are fused into the text data to obtain the semantic features corresponding to the text data; Construct a cross-modal relationship graph of the voiceprint features and the semantic features; and determine the temporal dependency relationship between the voiceprint features and the semantic features based on the edge weights generated by the cross-modal relationship graph. Based on the temporal dependency, the voiceprint features and semantic features are projected into the same semantic space, and a feature reconstruction and fusion operation is performed based on the reconstruction loss function to obtain the voiceprint and semantic fusion features. The dialogue strategy corresponding to the current round of the user, determined based on the voiceprint and semantic fusion features, is adjusted to obtain the adjusted dialogue strategy; the adjusted dialogue strategy is used to generate interactive response data with the historical interaction data.
2. The cross-modal attention fusion method based on voiceprint features and language semantics according to claim 1, characterized in that, The step of performing feature extraction on the speech data to obtain the voiceprint features of the current round includes: The original acoustic features are extracted from the speech data; and a preliminary modeling operation is performed on the original acoustic features to obtain basic acoustic features. The preliminary modeling operation includes at least one of temporal modeling operation, feature compression operation, and coding enhancement operation. The original acoustic features are input into a temporal decoupling network for decoupling to obtain a latent feature space, which includes at least two of the following: an identity feature subspace, an emotion feature subspace, and a noise feature subspace. The basic acoustic features are used as input to a multi-level temporal granularity parser for parsing, resulting in multi-granularity intermediate acoustic features. These multi-granularity intermediate acoustic features include at least two of the following: short-term emotional fluctuation features, mid-term intonation change features, and long-term identity features. The temporal granularity corresponding to the short-term emotional fluctuation features is smaller than that corresponding to the mid-term intonation change features, and the temporal granularity corresponding to the mid-term intonation change features is smaller than that corresponding to the long-term identity features. 
A time alignment operation is performed on the short-term emotional fluctuation features and the mid-term intonation change features; after the time alignment operation is completed, the short-term emotional fluctuation features and the mid-term intonation change features are fused to obtain joint features; based on the joint features, a feature mapping operation is performed on the emotion feature subspace to obtain the emotion features; based on the long-term identity features, the feature mapping operation is performed on the identity feature subspace to obtain the identity features; the feature mapping operation is performed on the noise feature subspace to obtain the noise features.
3. The cross-modal attention fusion method based on voiceprint features and language semantics according to claim 2, characterized in that, The step of adjusting the dialogue strategy corresponding to the user's current round, determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy includes: Based on the voiceprint and semantic fusion features, the dialogue strategy corresponding to the user's current round is determined; Based on the contextual emotional memory unit, a continuous emotional feature set is determined according to the emotional features contained in the voiceprint features of the current round, wherein the continuous emotional feature set includes emotional features of multiple consecutive rounds, and all the emotional features of the consecutive rounds include at least the emotional features of the current round. Based on a preset voiceprint change rate construction algorithm, the continuous emotional feature set is calculated to obtain the user's voiceprint change rate; and based on the user's voiceprint change rate, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined. Based on the continuous set of emotional features, determine the user's emotional profile and the user's emotional state corresponding to the current round; and based on the user's emotional profile and the user's emotional state, determine the service level adjustment parameters corresponding to the dialogue strategy. The dialogue strategy is adjusted based on the exploration ratio adjustment parameter and the service level adjustment parameter to obtain the adjusted dialogue strategy.
4. The cross-modal attention fusion method based on voiceprint features and language semantics according to claim 3, characterized in that, The voiceprint change rate construction algorithm is defined by a formula whose expression and symbols are not reproduced in this text. In order, the accompanying definitions denote: the user's voiceprint change rate; the current round; one indexed round; the emotional feature corresponding to that round; another indexed round; the emotional feature corresponding to that round; and the number of rounds.
5. The cross-modal attention fusion method based on voiceprint features and language semantics according to claim 3, characterized in that, The step of determining the exploration ratio adjustment parameter corresponding to the dialogue strategy based on the user's voiceprint change rate includes: Based on the user voiceprint change rate and at least one preset voiceprint change stage, a target change stage corresponding to the user voiceprint change rate is determined. All the voiceprint change stages include at least one of a first change stage, a second change stage, and a third change stage. The user voiceprint change rate in the first change stage is less than a first preset change rate. The user voiceprint change rate in the second change stage is greater than or equal to the first preset change rate and less than the second preset change rate. The user voiceprint change rate in the third change stage is greater than or equal to the second preset change rate. Based on the target change stage corresponding to the user's voiceprint change rate, determine the degree of the user's emotional change and the degree of voice change; Based on the degree of emotional change and the degree of voice change, the exploration ratio adjustment parameter corresponding to the dialogue strategy is determined.
6. The cross-modal attention fusion method based on voiceprint features and language semantics according to any one of claims 1-5, characterized in that, The method further includes: When fusing the voiceprint features and the semantic features, the voiceprint recognition confidence level corresponding to the voiceprint features and the semantic understanding accuracy corresponding to the semantic features are detected. When the confidence level of the voiceprint recognition is detected to be lower than the preset confidence level, the semantic context analysis operation corresponding to the semantic feature is performed to generate a correction signal, and the voiceprint feature is corrected according to the correction signal. When the semantic understanding accuracy is detected to be lower than the preset accuracy, the corresponding semantic disambiguation operation is performed based on the identity features contained in the voiceprint features.
7. The cross-modal attention fusion method based on voiceprint features and language semantics according to any one of claims 1-5, characterized in that, The interactive response data is generated in the following way: Based on the historical interaction data, the language style parameters are adjusted to obtain the adjusted language style parameters; The output content of the pre-built expression generator is controlled by the adjusted dialogue strategy, and the output style of the expression generator is controlled by the adjusted language style parameters to generate interactive response data.
8. A cross-modal attention fusion device based on voiceprint features and language semantics, characterized in that, The device includes: The acquisition module is used to acquire multimodal input data, which includes the user's voice data, text data, and historical interaction data for the current round; The extraction module is used to perform feature extraction on the speech data to obtain the voiceprint features of the current round, wherein the voiceprint features include at least one of identity features, emotion features, and noise features; The fusion module is used to fuse the emotion features contained in the voiceprint features into the text data based on the emotion-aware masking mechanism and the contextual emotion memory unit to obtain the semantic features corresponding to the text data; The association module is used to construct a cross-modal relationship graph of the voiceprint features and the semantic features, and to determine the temporal dependency between the voiceprint features and the semantic features based on the edge weights generated by the cross-modal relationship graph; The fusion module is further configured to project the voiceprint features and the semantic features into the same semantic space based on the temporal dependency, and to perform a feature reconstruction and fusion operation based on the reconstruction loss function to obtain the voiceprint and semantic fusion features; The adjustment module is used to adjust the dialogue strategy corresponding to the current round of the user, which is determined based on the voiceprint and semantic fusion features, to obtain the adjusted dialogue strategy; the adjusted dialogue strategy is used to generate interactive response data with the historical interaction data.
9. A cross-modal attention fusion device based on voiceprint features and language semantics, characterized in that, The device includes: Memory containing executable program code; A processor coupled to the memory; The processor calls the executable program code stored in the memory to execute the cross-modal attention fusion method based on voiceprint features and language semantics as described in any one of claims 1-7.
10. A computer storage medium, characterized in that, The computer storage medium stores computer instructions, which, when invoked, are used to execute the cross-modal attention fusion method based on voiceprint features and language semantics as described in any one of claims 1-7.