Multi-intent natural language understanding method and device, electronic equipment and storage medium

CN122197890A · Pending · Publication Date: 2026-06-12 · 广州广哈通信股份有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: 广州广哈通信股份有限公司
Filing Date: 2026-03-11
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, multi-intent natural language understanding methods have difficulty effectively coordinating intent recognition and slot filling within a unified framework, resulting in inconsistencies between intent recognition and slot extraction results when processing multi-intent scenarios, leading to inaccurate natural language processing results.

Method used

A graph attention mechanism propagates information and updates features across the word-level embedding representations and the intent label embedding representations. A multi-objective loss function is constructed and minimized to optimize the model parameters. Intent recognition and slot recognition are then decoded cooperatively, using a globally aggregated representation of the utterance.

Benefits of technology

It improves the accuracy and consistency of natural language processing results, and enhances the expressive power of local semantic structures and the integration of global semantic information.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of natural language processing, and provides a multi-intent natural language understanding method and device, electronic equipment and a storage medium. An implementation scheme is as follows: each word-level embedding representation and each intent label embedding representation are updated based on a graph attention mechanism to obtain each first slot node representation and each first intent node representation; each word-level embedding representation and each intent label embedding representation are globally aggregated to obtain a global embedding representation; a multi-objective loss function is constructed based on each word-level embedding representation, each intent label embedding representation and the global embedding representation, and is minimized to obtain each second slot node representation and each second intent node representation; and each first slot node representation, each second slot node representation, each first intent node representation and each second intent node representation are cooperatively decoded to obtain the natural language processing result for the user. The application can improve the accuracy of the natural language processing result.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to a multi-intent natural language understanding method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Natural Language Understanding (NLU) modules typically transform user-input natural language utterances into structured semantic information, primarily involving two core tasks: intent recognition and slot filling. Intent recognition is a coarse-grained sentence-level classification task used to determine the overall intent expressed by the user's utterance; slot filling is a fine-grained word-level sequence labeling task used to extract key entity information related to the intent from the utterance. The two tasks are naturally related at the semantic level: intent information provides global semantic constraints for slot recognition, while slot information provides local semantic evidence for intent determination.

[0003] In real-world applications, user expressions often possess complex semantic structures, with a single natural language utterance potentially containing multiple intents simultaneously. This multi-intent expression places higher demands on natural language understanding systems, requiring models to simultaneously identify multiple intents and accurately extract relevant entity information. However, current technologies often treat intent recognition and slot filling as independent tasks, preventing models from fully leveraging the semantic relationships between the two. This hinders the collaborative modeling of sentence-level and word-level semantic information within a unified framework. Furthermore, while some studies attempt to improve performance through joint modeling and parameter sharing, the information interaction is typically unidirectional or weakly coupled, making parallel collaborative reasoning between the two tasks difficult. Consequently, inconsistencies between intent recognition and slot extraction results can easily arise when handling multi-intent scenarios, leading to inaccurate natural language processing outcomes.

Summary of the Invention

[0004] This invention provides a multi-intent natural language understanding method, apparatus, electronic device, and storage medium, which can solve at least one of the above-mentioned technical problems.

[0005] In a first aspect, embodiments of the present invention provide a multi-intent natural language understanding method, including: encoding and discriminatively embedding the natural language utterance input by the user to obtain word-level embedding representations and intent tag embedding representations; updating each of the word-level embedding representations and each of the intent tag embedding representations based on a graph attention mechanism to obtain each first slot node representation and each first intent node representation; globally aggregating the word-level embedding representations and the intent tag embedding representations to obtain a global embedding representation; constructing a multi-objective loss function based on each of the word-level embedding representations, each of the intent tag embedding representations, and the global embedding representation, and solving for the minimum value of the multi-objective loss function to obtain each second slot node representation and each second intent node representation; and collaboratively decoding each first slot node representation, each second slot node representation, each first intent node representation, and each second intent node representation to obtain the user's natural language processing result.

[0006] In a second aspect, embodiments of the present invention provide a multi-intent natural language understanding device, comprising: an encoding and embedding module, used to encode and discriminatively embed the natural language utterance input by the user to obtain word-level embedding representations and intent tag embedding representations; an embedding representation update module, used to update each of the word-level embedding representations and each of the intent tag embedding representations based on the graph attention mechanism to obtain each first slot node representation and each first intent node representation; a global aggregation module, used to globally aggregate the word-level embedding representations and the intent tag embedding representations to obtain a global embedding representation; a multi-objective loss function construction module, used to construct a multi-objective loss function based on each of the word-level embedding representations, each of the intent tag embedding representations, and the global embedding representation, and to solve for the minimum value of the multi-objective loss function to obtain each second slot node representation and each second intent node representation; and a collaborative decoding module, used to collaboratively decode each first slot node representation, each second slot node representation, each first intent node representation, and each second intent node representation to obtain the user's natural language processing result.

[0007] In a third aspect, embodiments of the present invention also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described in any one of the embodiments of the present invention.

[0008] In a fourth aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method described in any one of the embodiments of the present invention.

[0009] The technical solution of this invention first encodes and discriminatively embeds the natural language utterance input by the user, obtaining word-level embedding representations and intent tag embedding representations. This characterizes semantic features at both the lexical and intent levels, providing a multi-granularity semantic foundation for subsequent modeling. A graph attention mechanism then propagates information and updates features across the word-level and intent tag embedding representations, yielding the first slot node representations and first intent node representations. This explicitly models the relationships between lexical nodes and intent nodes within the graph structure and adaptively highlights key semantic nodes through attention weights, enhancing the expressive power of local semantic structures. In parallel, the word-level and intent tag embedding representations are globally aggregated into a global embedding representation, which integrates the semantic and structural information of the whole utterance into a unified representation. Furthermore, a multi-objective loss function is constructed based on each word-level embedding representation, each intent tag embedding representation, and the global embedding representation, and the model parameters are jointly optimized by minimizing it. During training, this constrains the relationships among local lexical semantics, intent semantics, and global semantics, improving the consistency and discriminative ability of the semantic representations. Finally, by collaboratively decoding the first slot node representations, the second slot node representations, the first intent node representations, and the second intent node representations, slot recognition and intent recognition are completed jointly on the basis of multi-layer semantic representations, improving the prediction accuracy of the natural language processing result.

[0010] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Description of the Drawings

[0011] The accompanying drawings are provided for a better understanding of this solution and do not constitute a limitation of the invention. In the drawings: Figure 1 is a flowchart of a multi-intent natural language understanding method according to an embodiment of the present invention; Figure 2 is a structural block diagram of a multi-intent natural language understanding device according to an embodiment of the present invention; Figure 3 is a schematic block diagram of an electronic device used to implement the methods of embodiments of the present invention.

Detailed Implementation

[0012] The following description, in conjunction with the accompanying drawings, illustrates exemplary embodiments of the present invention, including various details to aid understanding. These details should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of the invention. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0013] This application provides a multi-intent natural language understanding method, apparatus, electronic device, and storage medium. The execution entity of the multi-intent natural language understanding method can be the multi-intent natural language understanding apparatus provided in this application, or a computer device integrating the multi-intent natural language understanding apparatus. The multi-intent natural language understanding apparatus can be implemented in hardware or software, and the computer device can be a terminal or a server.

[0014] Figure 1 is a flowchart of a multi-intent natural language understanding method according to an embodiment of the present invention.

[0015] As shown in Figure 1, this multi-intent natural language understanding method may include: S110, encoding and discriminatively embedding the natural language utterance input by the user to obtain word-level embedding representations and intent tag embedding representations; S120, updating each word-level embedding representation and each intent tag embedding representation based on the graph attention mechanism to obtain each first slot node representation and each first intent node representation; S130, globally aggregating each word-level embedding representation and each intent tag embedding representation to obtain a global embedding representation; S140, constructing a multi-objective loss function based on each word-level embedding representation, each intent tag embedding representation, and the global embedding representation, and solving for the minimum value of the multi-objective loss function to obtain each second slot node representation and each second intent node representation; S150, collaboratively decoding each first slot node representation, each second slot node representation, each first intent node representation, and each second intent node representation to obtain the user's natural language processing result.

[0016] For example, a natural language utterance refers to a sequence of natural language sentences or phrases used to express user needs or semantic intentions. For instance, in an intelligent customer service or voice assistant scenario, a user input such as "Help me book a flight from Beijing to Shanghai tomorrow morning" or "Check the weather in Guangzhou today" is a natural language utterance.

[0017] For example, natural language utterances input by the user in text or voice form can be received through input boxes, voice input interfaces, or dialogue interaction components provided by a graphical user interface. If the input is in voice form, it is first converted into text form by an automatic speech recognition module, and the obtained text form utterance is used as input for subsequent encoding and discriminative embedding steps.

[0018] For example, in a specific application scenario such as a dialogue system or intelligent customer service platform, text messages input by users are collected in real time through the front-end interactive interface, or historical user utterance records are read from the log system. The collected utterances undergo preliminary cleaning, including preprocessing operations such as removing invalid characters and standardizing punctuation marks, to obtain the natural language utterance input by the user.
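
The preliminary cleaning in [0018] can be sketched as follows. This is only an illustration under assumptions: the patent text does not say which characters count as invalid or how punctuation is standardized, so the function name `clean_utterance`, the use of Unicode NFKC normalization, and the removal of control and format characters are all choices made for this example.

```python
import unicodedata


def clean_utterance(text):
    """Hypothetical cleaner: standardize punctuation, drop invalid characters."""
    # NFKC folds full-width forms (for example the full-width question mark)
    # into their ASCII equivalents. This is an assumed notion of
    # "standardizing punctuation".
    normalized = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in normalized:
        if ch.isspace():
            kept.append(" ")  # treat tabs and newlines as plain spaces
        elif unicodedata.category(ch).startswith("C"):
            continue  # assumed "invalid": control and format characters
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())


raw = "  Check\tthe weather\u0000 in Guangzhou today\uff1f "
cleaned = clean_utterance(raw)
```

A production system would likely tune both rules to its own channel, for instance keeping emoji or stripping markup left over from the front end.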

[0019] For example, a word-level embedding representation refers to the vector representation obtained by mapping each word in the natural language utterance to a continuous vector space through a predefined semantic encoding model; it is used to characterize the semantic features of each word.

[0020] For example, the natural language utterance "query Shanghai weather" can be encoded to obtain vector representations corresponding to the three word units "query", "Shanghai", and "weather". Each vector reflects the semantic information of the corresponding word unit.

[0021] For example, intent tag embedding representation refers to the semantic vector representation obtained by vectorizing predefined intent category tags, used to characterize the semantic features of different intent categories. For instance, in a dialogue system, intent tags such as "check the weather," "book a flight," and "play music" can be preset, and corresponding intent tag vectors can be obtained after embedding and mapping.

[0022] For example, the first slot node representation refers to the node representation obtained after semantic interaction updates of each word-level embedding representation during the information propagation process of the graph attention mechanism. It is used to characterize the slot semantic features of each word element in the context semantic relationship. For example, in the utterance "book a flight from Beijing to Shanghai tomorrow", the word elements "Beijing", "Shanghai", and "tomorrow" can obtain corresponding first slot node representations after graph attention updates, which are used to identify slot information such as "departure city", "arrival city", and "time".

[0023] For example, the first intent node representation refers to the node representation obtained after semantic interaction update of the intent label embedding representation in the graph attention mechanism, used to characterize the semantic features of each intent node in the semantic context of the current discourse. For example, for intent labels such as "check the weather" and "book a flight", after information interaction with word nodes in the graph structure, an updated intent node representation can be obtained to reflect the intent category that the current discourse is more likely to correspond to.

[0024] For example, the global embedding representation refers to the semantic vector representation obtained by performing a global aggregation operation on each word-level embedding representation and each intent tag embedding representation; it is used to characterize the overall semantic features of the entire natural language utterance.

[0025] For example, for the utterance "Help me check the weather in Shanghai today", a global semantic vector can be obtained by weighted aggregation of the features of word nodes and intent nodes, which is used to express the overall semantic information of the utterance.
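
The weighted aggregation described in [0024] and [0025] can be sketched as attention pooling over all word nodes and intent nodes. The text only says "weighted aggregation", so the softmax weighting and the scoring vector `query` below are assumptions, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_aggregate(word_vecs, intent_vecs, query):
    """Pool word and intent node vectors into one global vector.

    `query` is an assumed scoring vector (trainable in a real model);
    each node is weighted by softmax(query . node).
    """
    nodes = word_vecs + intent_vecs
    scores = [sum(q * x for q, x in zip(query, node)) for node in nodes]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * node[d] for w, node in zip(weights, nodes)) for d in range(dim)]


words = [[1.0, 0.0], [0.0, 1.0]]
intents = [[1.0, 1.0]]
# With an all-zero query every node gets the same weight, so the
# result is the plain mean of the three node vectors.
global_vec = global_aggregate(words, intents, query=[0.0, 0.0])
```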

[0026] For example, the second slot node representation refers to the slot node representation obtained by further optimizing the representation of each word node after constructing a multi-objective loss function and training and optimizing the model, which is used to improve the semantic discrimination ability of word slots.

[0027] For example, under contrastive learning loss constraints, the representations of words such as "Shanghai" and "Beijing" move closer to the "city" slot category in the semantic space, yielding more accurate slot node representations.

[0028] For example, the second intent node representation refers to the semantic representation of the intent node obtained after optimization by a multi-objective loss function, which is used to improve the ability to distinguish between different intent categories.

[0029] For example, after training, the two intent nodes "check the weather" and "book a flight" are further separated in the semantic space, resulting in a more discriminative intent node representation.

[0030] For example, the natural language processing result refers to the semantic parsing result obtained by co-decoding the representations of each slot node and the representation of the intent node. The semantic parsing result includes the intent recognition result and the slot recognition result.

[0031] For example, for the utterance "Help me check the weather in Shanghai tomorrow", the final natural language processing result can be: Intent: check the weather; Slot: City = Shanghai, Time = tomorrow.
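
The parsing result in [0030] and [0031] can be held in a small record type. The sketch below is illustrative only: the class name `NluResult` and its field names are hypothetical and do not come from the patent text. The intents are stored as a list because one utterance may carry several.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NluResult:
    """Hypothetical container for one utterance's parsing result."""
    intents: List[str]                                    # one or more recognized intents
    slots: Dict[str, str] = field(default_factory=dict)  # slot name -> extracted value


# The example from [0031]: "Help me check the weather in Shanghai tomorrow".
result = NluResult(
    intents=["check the weather"],
    slots={"City": "Shanghai", "Time": "tomorrow"},
)
```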

[0032] According to the above implementation method, the natural language utterance input by the user is first encoded and discriminatively embedded to obtain semantic representations at the lexical and intent levels. Then, a graph attention mechanism is used to semantically update the word-level embeddings and intent tag embeddings, enabling explicit semantic associations between lexical nodes and intent nodes, thus obtaining the first slot node representation and the first intent node representation. Further, by globally aggregating the various word-level embeddings and intent tag embeddings, a global embedding representation representing the overall semantic information of the utterance is obtained. Based on this, a multi-objective loss function is constructed and the model is optimized to jointly constrain the representational relationships between lexical semantics, intent semantics, and global semantics, resulting in a more discriminative second slot node representation and second intent node representation. Finally, through collaborative decoding of the multi-layer semantic node representations, intent recognition and slot recognition of the user's utterance are achieved. In this way, an effective collaborative relationship can be established between local and global semantic representations, thereby improving the accuracy of the natural language processing results.

[0033] In one implementation, encoding and discriminatively embedding the natural language utterance input by the user to obtain word-level embedding representations and intent label embedding representations includes: encoding the natural language utterance to obtain word-level embedding representations; enhancing the features of each word-level embedding representation through a self-attention mechanism to obtain enhanced embedding representations; performing linear transformation and nonlinear activation on each enhanced embedding representation to obtain the intent prediction distribution corresponding to each enhanced embedding representation; extracting, from each intent prediction distribution, the intent labels whose predicted probability is greater than a preset prediction threshold to obtain the intent labels; performing nonlinear activation and linear mapping on each intent prediction distribution to obtain an intent label embedding matrix; and selecting elements of the intent label embedding matrix based on each intent label to obtain each intent label embedding representation.

[0034] For example, the acquired natural language utterance is segmented to obtain a word sequence. The word sequence is then input into a predefined multi-scale discourse encoder, which includes a self-attention component and a bidirectional long short-term memory (BiLSTM) network component. The self-attention component models the global dependencies between words in the word sequence to obtain a first semantic feature representation. In parallel, the BiLSTM component performs bidirectional temporal encoding on the word sequence to obtain a second semantic feature representation. Finally, the first and second semantic feature representations are concatenated to obtain word-level embedding representations that integrate global semantic information and local temporal information.

[0035] For example, each word-level embedding representation is taken as input, and a task-specific self-attention computation unit is constructed. Then, each word-level embedding representation is linearly mapped using a query matrix, a key matrix, and a value matrix to obtain the corresponding query vector, key vector, and value vector. Attention weights are calculated based on the similarity between the query vector and the key vector, and the value vectors are weighted and summed to obtain an attention output representation that incorporates contextual information. Finally, the attention output representation is fused with the word-level embedding representation to obtain an enhanced embedding representation with stronger semantic discriminative capabilities.
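
The enhancement step in [0035] can be sketched in plain Python as follows. The text does not specify the similarity measure or the fusion operator, so the scaled dot product and the additive (residual) fusion below are assumptions; the function names are hypothetical, and in a trained model the three matrices would be learned parameters.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance(embeddings, w_q, w_k, w_v):
    """Self-attention over word embeddings, fused back by addition.

    Assumes w_v maps to the embedding dimension so that the attention
    output can be added to the original embedding.
    """
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))  # assumed scaling by sqrt(key dimension)
    enhanced = []
    for query, emb in zip(queries, embeddings):
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(emb))]
        enhanced.append([x + c for x, c in zip(emb, context)])  # assumed fusion: addition
    return enhanced


identity = [[1.0, 0.0], [0.0, 1.0]]
out = enhance([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

With identity matrices each word attends most strongly to itself, so the fused vector keeps its original direction while picking up a smaller contribution from the other word.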

[0036] For example, each enhanced embedding representation is input into a predefined fully connected transformation layer. A trainable weight matrix is used to linearly map the enhanced embedding representation, transforming its vector dimension to a dimension consistent with the size of the intent label set. Subsequently, a non-linear activation function is applied to the linear mapping result to obtain the response value of each word to different intent labels. Finally, the response values are normalized to obtain the intent prediction distribution corresponding to each word.

[0037] For example, the intent prediction distribution is computed by a function expression of the following quantities: the enhanced embedding representation corresponding to word i; a first trainable weight matrix; an activation function; a second trainable weight matrix; and the resulting intent prediction distribution corresponding to word i.
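
One computation consistent with the quantities listed in [0037] is sketched below: a first weight matrix, an activation, a second weight matrix, and a final normalization. The exact expression is not recoverable from this text, so the ordering, the choice of LeakyReLU, and the element-wise sigmoid used as the normalization are all assumptions. The sigmoid is chosen here because the example in [0040] lists per-intent probabilities that need not sum to one.

```python
import math


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def intent_distribution(enhanced_vec, w_hidden, w_intent):
    """Map one enhanced word embedding to per-intent probabilities.

    w_hidden and w_intent stand in for the two trainable weight
    matrices; w_intent must have one row per intent label.
    """
    hidden = [leaky_relu(z) for z in matvec(w_hidden, enhanced_vec)]
    return [sigmoid(z) for z in matvec(w_intent, hidden)]


w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_intent = [[1.0, 0.0], [0.0, -1.0]]  # two intent labels
probs = intent_distribution([2.0, 2.0], w_hidden, w_intent)
```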

[0038] It can be understood that one lexical unit corresponds to one enhanced embedding representation, and one enhanced embedding representation corresponds to one intent tag.

[0039] For example, the intent prediction distribution corresponding to each word is traversed; then, the probability value of each intent in the prediction distribution is compared with a preset prediction threshold; when an intent's probability value is greater than the preset prediction threshold, the corresponding intent tag is determined to be a candidate intent tag; finally, the candidate intent tags obtained from all words are merged and deduplicated to obtain the final intent tag set (i.e., the intent tags).

[0040] For example, when the preset prediction threshold is 0.6, if the probability of "flight booking" in the prediction distribution of the word "flight ticket" is 0.85, then "flight booking" will be identified as the intent tag, while "hotel booking" with a probability of 0.3 will not be selected.
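
The thresholding and deduplication in [0039] and [0040] can be sketched as follows. The function name `select_intents` is hypothetical, and the second word's distribution is made-up illustrative data; only the 0.85 and 0.3 probabilities and the 0.6 threshold come from the text.

```python
def select_intents(word_distributions, labels, threshold=0.6):
    """Collect every label whose probability exceeds the threshold for
    at least one word, keeping first-seen order and dropping duplicates."""
    selected = []
    for distribution in word_distributions:
        for label, prob in zip(labels, distribution):
            if prob > threshold and label not in selected:
                selected.append(label)
    return selected


labels = ["flight booking", "hotel booking"]
distributions = [
    [0.85, 0.30],  # the word "flight ticket", as in [0040]
    [0.70, 0.20],  # a second word (illustrative values)
]
intent_set = select_intents(distributions, labels, threshold=0.6)
```

Because the comparison is strict, a probability exactly equal to the threshold is not selected, matching the "greater than" wording of the text.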

[0041] For example, the intent prediction distribution corresponding to each lexical unit is input into a nonlinear activation function to smooth the response levels of different intents. Then, a preset linear mapping matrix is used to perform vector mapping on the activated prediction distribution to transform the intent probability space into a semantic embedding space. Finally, the embedding vectors corresponding to all intents are arranged according to the intent label order to construct an intent label embedding matrix.

[0042] For example, the intent tag embedding matrix is computed by a function expression of the following quantities: the intent prediction distribution; an activation function; a trainable weight matrix; and the resulting intent tag embedding matrix.

[0043] For example, the target intent tag index is determined based on the intent tag set; then, the embedding vector corresponding to the target intent tag index is located in the intent tag embedding matrix; finally, the corresponding embedding vector is extracted as the intent tag embedding representation for subsequent semantic modeling or graph structure construction.
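
The lookup in [0043] amounts to selecting rows of the intent tag embedding matrix by label index. The sketch below takes the matrix as given and uses made-up numbers; the function name `select_intent_embeddings` is hypothetical, and the rows are assumed to follow the intent label order as described in [0041].

```python
def select_intent_embeddings(embedding_matrix, label_order, selected_labels):
    """Return the embedding row for each selected intent label.

    embedding_matrix[i] is assumed to be the vector for label_order[i].
    """
    index_of = {label: i for i, label in enumerate(label_order)}
    return {label: embedding_matrix[index_of[label]] for label in selected_labels}


label_order = ["check the weather", "book a flight", "play music"]
matrix = [
    [0.1, 0.2],  # illustrative values only
    [0.3, 0.4],
    [0.5, 0.6],
]
chosen = select_intent_embeddings(matrix, label_order, ["book a flight"])
```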

[0044] According to the above implementation method, the natural language utterance is first encoded to obtain word-level embedding representations of each lexical unit. Then, a self-attention mechanism is used to enhance the features of each word-level embedding representation. Based on this, linear transformations and nonlinear activations are performed on each enhanced embedding representation to obtain the corresponding intent prediction distribution. Intent tags with probabilities greater than a preset prediction threshold are then selected to obtain a set of candidate intents related to the semantics of the current utterance. Further, an intent tag embedding matrix is constructed by performing nonlinear activations and linear mappings on each intent prediction distribution. The corresponding vector representations are then extracted from the intent tag embedding matrix based on the selected intent tags, thus obtaining the embedding representation of each intent tag. In this way, while preserving the semantic information of lexical units, semantic representations of intent tags highly related to the utterance semantics are introduced, providing richer and more discriminative semantic features for subsequent semantic interaction and joint modeling, thereby improving the accuracy and robustness of multi-intent natural language understanding tasks.

[0045] In one implementation, updating each word-level embedding representation and each intent tag embedding representation based on a graph attention mechanism to obtain each first slot node representation and each first intent node representation includes: determining each slot node based on each word-level embedding representation, and determining each intent node based on each intent tag embedding representation; modeling the connection relationships between the slot nodes using a preset local dependency awareness unit to obtain the slot edges; constructing the intent edges based on the connection relationships between the intent nodes; constructing the intent-slot edges based on the connection relationships between the slot nodes and the intent nodes; constructing a multi-granularity heterogeneous semantic fusion graph based on the slot nodes, the intent nodes, the slot edges, the intent edges, and the intent-slot edges; and performing multi-round bidirectional information propagation on the multi-granularity heterogeneous semantic fusion graph using a graph attention mechanism to update each intent node and each slot node, obtaining each first slot node representation and each first intent node representation.

[0046] For example, the word-level embedding representation sequence output by the preceding multi-scale discourse encoder is used as the initial representation of the slot nodes, forming a set of word-level slot nodes VS={VS_1,…,VS_n}, where each VS_i corresponds to the semantic vector of a specific word. For example, in the input utterance "Help me book a plane ticket to Beijing tomorrow", the word-level embedding representation corresponding to the word "plane ticket" is used to initialize VS_6.

[0047] For example, the discriminative intent label embeddings output by the multi-intent perception and embedding generation step are used as the initial representations of the sentence-level intent nodes, forming an intent node set VI={VI_1,…,VI_pre}, where each VI_j corresponds to one type of intent. For example, the embedding vector of the intent "flight booking" is used to initialize VI_1.

[0048] For example, a pre-defined local dependency awareness unit is used to model the connection relationship between each slot node. This unit uses a local mask function to limit the attention calculation range to capture the local dependency between neighboring words, and at the same time, it uses a self-attention mechanism to perform weighted information aggregation to obtain each slot edge.

[0049] In this example, the local dependency-aware unit is a self-attention mechanism that introduces a local masking function. In its expression, the local masking function is itself a matrix; a concatenation function is applied; i and j are the position indices of elements in the sequence, each ranging from 1 to n; and the distance between the two position indices, |i - j|, determines the elements of the mask matrix. By setting the size of the local dependency-aware unit, the attention computation range is limited to the neighborhood of each slot node. This captures the local continuity constraints of the slot label sequence while reducing computational complexity, balancing efficiency and accuracy.
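
A local mask of the kind described in [0048] and [0049] can be sketched as a band matrix over position indices. The exact element definition is not recoverable from this text, so the rule below (1 when the index distance is at most `window`, otherwise 0) is an assumption, as are the function names. Indices here are 0-based, whereas the text counts from 1.

```python
import math


def local_mask(n, window):
    """n x n mask: 1 where |i - j| <= window, else 0 (assumed rule)."""
    return [[1 if abs(i - j) <= window else 0 for j in range(n)] for i in range(n)]


def masked_softmax(row_scores, row_mask):
    """Softmax over one row of attention scores, ignoring masked positions."""
    allowed = [s for s, m in zip(row_scores, row_mask) if m]
    peak = max(allowed)  # the diagonal is always allowed, so this is non-empty
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row_scores, row_mask)]
    total = sum(exps)
    return [e / total for e in exps]


mask = local_mask(4, window=1)
# Uniform raw scores: each slot node spreads its attention evenly
# over itself and its immediate neighbors only.
weights = [masked_softmax([0.0] * 4, row) for row in mask]
```

Restricting each row to a band of width `2 * window + 1` is also what reduces the attention cost from quadratic in the sequence length to roughly linear.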

[0050] For example, based on the connection relationships between various intent nodes, intent edges are constructed to model the co-occurrence relationships and semantic associations between intents. Furthermore, based on the cross-level semantic relationships between various slot nodes and various intent nodes, intent-slot edges are constructed to guide and feed back coarse-grained intent information to fine-grained slots. After completing the construction of nodes and edges, slot nodes, intent nodes, slot edges, intent edges, and intent-slot edges are comprehensively constructed to form a multi-granularity heterogeneous semantic fusion graph, which can simultaneously represent local lexical relationships and global intent semantics.

[0051] For example, a graph attention mechanism is used to perform multi-round bidirectional information propagation on the multi-granularity heterogeneous semantic fusion graph. First, the intent node information is propagated to the slot nodes through the intent-slot edges to provide global semantic guidance. At the same time, the updated slot node information is fed back to the intent nodes to correct the initial multi-intent prediction. The collaborative update of information is achieved through closed-loop bidirectional message passing, and finally each first slot node representation and each first intent node representation are obtained.

[0052] In this example, the process of multi-round bidirectional information propagation on the multi-granularity heterogeneous semantic fusion graph can be represented by a set of function expressions whose terms are: the first slot node representation of slot node i; the local dependency-aware unit; the vector representation of slot node i; the trainable weight matrices; the vector representation of the slot nodes; the dimension of the key in the self-attention mechanism; a non-linear activation function, such as a Rectified Linear Unit (ReLU) or a Sigmoid function; the vector representation of intent node j; a scoring function used to calculate the correlation, or raw attention score, between two node representation vectors; the intent nodes in the adjacency domain; the first intent node representation of intent node i; the vector representation of slot node j; the vector representation of intent node i; the slot nodes in the adjacency domain; the adjacency domains of the nodes, namely the set of intent nodes connected through intent edges or intent-slot edges and the set of slot nodes connected through slot edges or intent-slot edges; the exponential function; and the attention weights.

[0053] According to the above implementation method, firstly, corresponding slot nodes are determined based on each word-level embedding representation, and corresponding intent nodes are determined based on each intent tag embedding representation. Then, the connection relationships between each slot node are modeled using a preset local dependency awareness unit to characterize the local semantic dependencies between lexical units, thereby constructing each slot edge. Simultaneously, each intent edge is constructed based on the semantic association relationships between each intent node, and each intent-slot edge is constructed based on the semantic association relationships between each slot node and each intent node. Based on this, by integrating each slot node, each intent node, each slot edge, each intent edge, and each intent-slot edge, a multi-granularity heterogeneous semantic fusion graph is constructed, enabling lexical semantic information and intent semantic information to be collaboratively modeled within a unified graph structure. Furthermore, a graph attention mechanism is used to perform multi-round bidirectional information propagation on the multi-granularity heterogeneous semantic fusion graph to update the feature representations of each intent node and slot node, thereby obtaining the representations of each first slot node and each first intent node. This effectively enhances the interaction capabilities between semantic features of different granularities and improves the completeness and discriminative ability of semantic representation.
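The closed-loop propagation summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed formulas: it uses scaled dot-product scoring, a ReLU with a residual connection, and only the intent-slot edges (slot edges and intent edges are omitted). The names `attend`, `propagate`, and `params` are hypothetical, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(targets, sources, adj, w_q, w_k, w_v):
    """Update each target node from the source nodes it is connected to.

    adj[t, s] is True when target t and source s share an edge.
    Assumes every target has at least one connected source.
    """
    scores = (targets @ w_q) @ (sources @ w_k).T / np.sqrt(w_k.shape[1])
    scores = np.where(adj, scores, -np.inf)      # keep only the adjacency domain
    alpha = softmax(scores, axis=-1)             # attention weights
    messages = alpha @ (sources @ w_v)
    return targets + np.maximum(0.0, messages)   # residual + ReLU (assumed)

def propagate(slots, intents, adj_si, params, rounds=2):
    """Multi-round bidirectional message passing over intent-slot edges."""
    for _ in range(rounds):
        # Intent -> slot: global semantic guidance.
        slots = attend(slots, intents, adj_si, *params["i2s"])
        # Slot -> intent: feedback from the updated slot nodes.
        intents = attend(intents, slots, adj_si.T, *params["s2i"])
    return slots, intents

# Toy graph: 5 slot nodes, 2 intent nodes, fully connected intent-slot edges.
rng = np.random.default_rng(0)
n, m, d = 5, 2, 4
slots, intents = rng.normal(size=(n, d)), rng.normal(size=(m, d))
adj_si = np.ones((n, m), dtype=bool)
params = {k: [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
          for k in ("i2s", "s2i")}
first_slots, first_intents = propagate(slots, intents, adj_si, params)
```

Here `first_slots` and `first_intents` stand in for the first slot node representations and first intent node representations. Updating the slots before the intents within each round mirrors the order described in the text, where updated slot information is fed back to the intent nodes.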

[0054] In one implementation, globally aggregating each word-level embedding representation and each intent tag embedding representation to obtain a global embedding representation includes: treating each word-level embedding representation and each intent tag embedding representation as nodes to obtain a node set; calculating the semantic differences between nodes in the node set to obtain each relation embedding vector; performing a linear transformation on the embedding representations of each node in the node set using a trainable global weight matrix to obtain global mapping features of each node; processing each relation embedding vector, the global mapping features of each node, and the global embedding representation of the current iteration step based on the attention layer in the graph neural network to obtain attention weights for each node; and aggregating the global mapping features of each node based on the attention weights of each node to obtain the global embedding representation.

[0055] For example, the calculation process of the relation embedding vector can be represented by a functional expression whose terms are: the relation embedding vector; a trainable weight matrix; the embedding representation of the source node; the embedding representation of neighbor node j in the adjacent node domain; and the adjacent node domain.

[0056] It should be noted that for all nodes in the node set, each node is sequentially taken as the source node, and the remaining nodes are taken as neighbor nodes. All neighbor nodes together constitute the adjacent node domain. Substituting the embedding representations of the source node and each neighbor node into the calculation expression in the previous example yields the relation embedding vector of each node.

[0057] For example, for each node in the node set, the global mapping feature of that node is determined based on the product of the trainable weight matrix and the node's embedded representation. In this way, the global mapping features of each node can be obtained.

[0058] For example, the calculation process of the attention weight of a node can be represented by a functional expression whose terms are: the attention weight of node j; the global embedding representation of the current iteration step; a trainable weight matrix; the embedding representation of node j; the global mapping feature of node j; the relation embedding vector of node g; a fusion function (e.g., a linear layer or a multilayer perceptron (MLP)); an activation function (e.g., Leaky ReLU); and a normalization function.

[0059] In this example, if it is the first iteration, the global embedding representation of the current iteration step can be obtained by initializing and aggregating the embedding representations of the nodes in the node set. This aggregation method can be implemented by average pooling or weighted pooling.

[0060] If it is not the first iteration, then the global embedding representation of the previous iteration step can be used as the global embedding representation of the current iteration step. For example, assuming the current iteration is the third iteration, then the global embedding representation obtained in the second iteration can be used as the global embedding representation of the third iteration step.

[0061] For example, the calculation process of aggregating the global mapping features of each node based on the attention weights of each node can be represented by a function expression whose terms are: the global embedding representation; the node set; node j in the node set; the attention weight of node j; a trainable weight matrix; and the embedding representation of node j.

[0062] According to the above implementation method, firstly, each word-level embedding representation and each intent tag embedding representation are respectively used as nodes in the graph to construct a unified node set; then, by calculating the semantic differences between each node in the node set, each relation embedding vector used to represent the semantic relationship between nodes is obtained; on this basis, a linear transformation is performed on the embedding representation of each node in the node set through a trainable global weight matrix to obtain the global mapping features of each node, thereby realizing a unified mapping of the feature space of different types of nodes; further, based on the attention layer in the graph neural network, each relation embedding vector, the global mapping features of each node, and the global embedding representation of the current iteration step are jointly modeled to calculate the attention weight of each node to the global semantic representation; finally, the corresponding global mapping features are weighted and aggregated according to the attention weight of each node to obtain a global embedding representation that can represent the overall semantic structure.
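The aggregation steps above can be sketched in NumPy as follows. The original formulas are not reproduced in this text, so several forms are assumptions: the relation embedding vector is taken as a projected mean difference between a node and all other nodes, the fusion function is a single linear layer over a concatenation, and the normalization is a softmax. The names `global_aggregate`, `w_rel`, `w_glob`, and `a` are hypothetical.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def global_aggregate(h, w_rel, w_glob, a, iters=3):
    """Attention-weighted aggregation of all nodes into one global vector."""
    n = h.shape[0]
    # Relation vectors: projected mean semantic difference to the other nodes
    # (assumed form; the self-difference is zero, hence the n - 1 divisor).
    diff = h[:, None, :] - h[None, :, :]              # (n, n, d)
    rel = (diff.sum(axis=1) / (n - 1)) @ w_rel        # (n, d)
    mapped = h @ w_glob                               # global mapping features
    g = mapped.mean(axis=0)                           # first iteration: average pooling
    for _ in range(iters):
        # Fuse [global vector; mapped feature; relation vector] per node.
        fused = np.concatenate([np.tile(g, (n, 1)), mapped, rel], axis=1)
        scores = leaky_relu(fused @ a)                # linear fusion + Leaky ReLU
        scores = scores - scores.max()
        alpha = np.exp(scores) / np.exp(scores).sum() # softmax normalization
        g = alpha @ mapped                            # weighted sum of mapped features
    return g, alpha

# Toy node set: 4 word-level nodes plus 2 intent nodes (random placeholders).
rng = np.random.default_rng(0)
n, d = 6, 4
h = rng.normal(size=(n, d))
w_rel, w_glob = rng.normal(size=(d, d)), rng.normal(size=(d, d))
a = rng.normal(size=3 * d)
global_emb, weights = global_aggregate(h, w_rel, w_glob, a)
```

Each pass reuses the global vector from the previous iteration as the current one, matching the iteration rule described in the text. The final line of the loop corresponds to the weighted sum of global mapping features by attention weight.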

[0063] In one implementation, a multi-objective loss function is constructed based on each word-level embedding representation, each intent label embedding representation, and the global embedding representation, including: determining each positive sample based on each word-level embedding representation; performing non-iterative approximation calculations on the gradients of each word-level embedding representation and the gradients of the natural language discourse based on preset hyperparameters to obtain each perturbation vector; superimposing each perturbation vector with each word-level embedding vector to obtain each adversarial negative sample; and constructing a multi-objective loss function based on the global embedding representation, each intent label embedding representation, each positive sample, and each adversarial negative sample.

[0064] For example, word-level embeddings are used as positive samples. Since there are multiple word-level embeddings, multiple positive samples can be obtained.

[0065] For example, the calculation process of the perturbation vector can be represented by a function expression whose terms are: the perturbation vector corresponding to word-level embedding vector i; the preset hyperparameter; the gradient corresponding to word-level embedding vector i; and the gradient of the natural language discourse.

[0066] In this example, user-inputted natural language utterances are fed into a pre-defined natural language understanding model (e.g., a deep neural network model consisting of a multi-scale discourse encoder, a graph attention semantic fusion module, and a collaborative decoding module). The embedding layer maps each lexical unit in the natural language utterance to its corresponding word-level embedding vector, resulting in a sequence of word-level embedding representations. Subsequently, during the model's forward propagation, an automatic differentiation mechanism tracks intermediate variables in the computation graph, ensuring that each word-level embedding representation retains its corresponding gradient cache. After the model completes the forward propagation computation for the current round, the gradient vectors corresponding to each word-level embedding representation are read from the computation graph, thus obtaining the gradient information of each word-level embedding representation.

[0067] At the same time, the semantic representation of the entire natural language discourse (e.g., the sentence-level representation obtained through average pooling or attention aggregation) is stored as a discourse-level semantic variable in the computation graph, and the gradient information corresponding to the variable is recorded through an automatic differentiation mechanism. After the current computation step is completed, the gradient vector of the semantic variable is directly read to obtain the gradient information of the natural language discourse.

[0068] For example, the adversarial negative samples obtained by superimposing each perturbation vector on its corresponding word-level embedding vector can be expressed by a functional expression whose terms are: the set of adversarial negative samples; the word-level embedding vectors; the perturbation vectors; adversarial negative sample i through adversarial negative sample n; word-level embedding vector i; and perturbation vector i.

[0069] According to the above implementation method, firstly, positive samples consistent with the original semantics are determined based on each word-level embedding representation, serving as anchor samples in contrastive learning. Then, non-iterative approximation calculations are performed on the gradients of each word-level embedding representation and the gradient of the natural language discourse based on preset hyperparameters to obtain perturbation vectors. On this basis, each perturbation vector is superimposed with its corresponding word-level embedding vector to generate adversarial negative samples with subtle semantic perturbations. Furthermore, a multi-objective loss function is constructed based on the global embedding representation, each intent label embedding representation, each positive sample, and each adversarial negative sample to jointly constrain and optimize the relationship between semantic representations of different granularities. In this way, the correlation between lexical semantics and intent semantics can be strengthened simultaneously during training, and the introduction of adversarial negative samples enhances the model's ability to distinguish subtle semantic differences, thereby effectively improving the discriminativeness and robustness of the semantic representations learned by the model, and ultimately improving the accuracy and generalization ability of the natural language understanding results.
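A minimal sketch of the adversarial negative sample construction follows. The exact perturbation formula is not reproduced in this text, so the sketch assumes a single-step (non-iterative) rule in the style of fast gradient methods: the token gradient and the utterance gradient are summed, normalized, and scaled by a hyperparameter `eps`. The function names and the way the two gradients are combined are assumptions. In a real model the gradients would be read from an automatic differentiation framework; random placeholders are used here.

```python
import numpy as np

def perturbations(grad_tokens, grad_utt, eps=0.1):
    """Single-step perturbation per token (assumed form).

    grad_tokens: (n, d) gradients of the word-level embeddings.
    grad_utt:    (d,)   gradient of the utterance-level representation.
    """
    direction = grad_tokens + grad_utt[None, :]
    norm = np.linalg.norm(direction, axis=1, keepdims=True)
    return eps * direction / (norm + 1e-12)   # each perturbation has length ~eps

def adversarial_negatives(emb, grad_tokens, grad_utt, eps=0.1):
    """Superimpose each perturbation on its word-level embedding vector."""
    return emb + perturbations(grad_tokens, grad_utt, eps)

# Placeholder embeddings and gradients for a 5-token utterance.
rng = np.random.default_rng(0)
n, d = 5, 4
emb = rng.normal(size=(n, d))           # positive samples
grad_tokens = rng.normal(size=(n, d))   # stand-in for autodiff output
grad_utt = rng.normal(size=d)           # stand-in for autodiff output
negs = adversarial_negatives(emb, grad_tokens, grad_utt, eps=0.1)
```

Because the perturbation is computed in one step, no inner optimization loop is needed, which is the sense of "non-iterative approximation" assumed here. The small `eps` keeps each negative close to its positive sample.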

[0070] In one implementation, a multi-objective loss function is constructed based on the global embedding representation, each intent label embedding representation, each positive sample, and each adversarial negative sample. This includes: determining a negative sample weighting term based on the global embedding representation and each adversarial negative sample; constructing an intent lexical contrastive loss based on the cosine similarity between each positive sample and each intent label embedding representation, the cosine similarity between each positive sample and each adversarial negative sample, and the negative sample weighting term; determining a first mutual information based on the dot product similarity between the global embedding representation and each intent label embedding representation; determining a second mutual information based on the dot product similarity between the global embedding representation and each adversarial negative sample; constructing a global graph contrastive loss by maximizing the first mutual information and minimizing the second mutual information; and determining the multi-objective loss function based on the intent lexical contrastive loss and the global graph contrastive loss.

[0071] For example, the intent lexical contrastive loss can be expressed as a functional expression whose terms are: the intent lexical contrastive loss; positive sample i; intent label embedding representation j; adversarial negative sample k; the negative sample weighting term; the temperature hyperparameter; and n, the total number of positive samples.

[0072] For example, the global graph contrastive loss can be expressed as a functional expression whose terms are: the global graph contrastive loss; the global embedding representation; intent label embedding representation j; adversarial negative sample k; the first mutual information; and the second mutual information.

[0073] For example, the multi-objective loss function can be obtained as a weighted sum of the intent lexical contrastive loss and the global graph contrastive loss. Specifically, a first calculation result is determined as the product of a first weight and the intent lexical contrastive loss; a second calculation result is determined as the product of a second weight and the global graph contrastive loss; and the multi-objective loss function is obtained by summing the first and second calculation results.

[0074] According to the above implementation method, a negative sample weighting term is determined based on the relationship between the global embedding representation and each adversarial negative sample to distinguish negative samples of different difficulties. An intent lexical contrastive loss is constructed based on the cosine similarity between each positive sample and each intent label embedding representation, the cosine similarity between each positive sample and each adversarial negative sample, and the negative sample weighting term. This enhances the semantic consistency between the lexical representation and the correct intent label and suppresses the similarity with interfering semantics. Further, a first mutual information is determined based on the dot product similarity between the global embedding representation and each intent label embedding representation, and a second mutual information is determined based on the dot product similarity between the global embedding representation and each adversarial negative sample. A global graph contrastive loss is constructed by maximizing the first mutual information and minimizing the second mutual information, thereby strengthening the correct intent information and weakening erroneous semantic associations at the global semantic level. Finally, a multi-objective loss function is determined based on the intent lexical contrastive loss and the global graph contrastive loss. By simultaneously constraining local lexical semantics and global semantic representations, the model can more effectively distinguish between true semantic relationships and interfering semantic relationships during training, thereby improving the semantic discrimination ability and overall robustness in multi-intent recognition and slot understanding tasks.
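The two losses and their weighted sum can be sketched as follows. The formulas themselves are not reproduced in this text, so every functional form below is an assumption: the intent lexical contrastive loss is written in an InfoNCE-like form with weighted negatives, the negative sample weights are a softmax over dot products with the global embedding, and each mutual information term is replaced by a mean log-sigmoid surrogate. All function names and the weights `lam1` and `lam2` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)

def intent_lexical_loss(pos, intents, negs, g, tau=0.1):
    """Contrast positives against intent labels and weighted negatives."""
    s = negs @ g
    w = np.exp(s - s.max())
    w = w / w.sum() * len(negs)   # negative weights (assumed form), mean 1
    pos_term = np.exp(cosine(pos, intents) / tau).sum(axis=1)
    neg_term = (w[None, :] * np.exp(cosine(pos, negs) / tau)).sum(axis=1)
    return float(np.mean(-np.log(pos_term / (pos_term + neg_term))))

def global_graph_loss(g, intents, negs):
    """Raise the first MI surrogate and lower the second."""
    mi_first = np.mean(log_sigmoid(intents @ g))   # global vs. intent labels
    mi_second = np.mean(log_sigmoid(negs @ g))     # global vs. negatives
    return float(-(mi_first - mi_second))

def multi_objective_loss(l_lex, l_graph, lam1=1.0, lam2=0.5):
    """Weighted sum of the two losses."""
    return lam1 * l_lex + lam2 * l_graph

# Random placeholders: 5 positives, 2 intent labels, 5 adversarial negatives.
rng = np.random.default_rng(0)
d = 4
pos, intents = rng.normal(size=(5, d)), rng.normal(size=(2, d))
negs = pos + 0.1 * rng.normal(size=(5, d))
g = rng.normal(size=d)
l_lex = intent_lexical_loss(pos, intents, negs, g)
l_graph = global_graph_loss(g, intents, negs)
total = multi_objective_loss(l_lex, l_graph)
```

In this form the lexical term is always positive, while the graph term can take either sign because it is a difference of two surrogates. The weights `lam1` and `lam2` correspond to the first and second weights in the weighted sum.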

[0075] In one implementation, collaboratively decoding each first slot node representation, each second slot node representation, each first intent node representation, and each second intent node representation to obtain the user's natural language processing result includes: concatenating each first slot node representation with each second slot node representation to obtain a third slot node representation; concatenating each first intent node representation with each second intent node representation to obtain a third intent node representation; interacting the third slot node representation with the third intent node representation based on a transformation layer with a preset self-attention mechanism to obtain a fourth slot node representation and a fourth intent node representation output by the transformation layer; predicting the slot labels of each word based on the third slot node representation and the fourth slot node representation to obtain the target slot labels of each word; using the fourth intent node representation as input to a preset multi-intent perception and embedding generator to obtain the target intent labels of each word; and determining the natural language processing result based on the target slot labels and target intent labels of each word.

[0076] For example, the calculation process of the third slot node representation can be expressed by a function expression whose terms are: the third slot node representation; the first slot node representation; the second slot node representation; and the individual third slot node representations, from the first to the n-th.

[0077] For example, the calculation process of the third intent node representation can be represented by a function expression whose terms are: the third intent node representation; the first intent node representation; the second intent node representation; and the individual third intent node representations, one per intent node, starting from the first.

[0078] For example, the interaction between the third slot node representation and the third intent node representation through the transformation layer with a preset self-attention mechanism can be represented by a function expression whose terms are: the fourth intent node representation; the fourth slot node representation; and the transformation layer.

[0079] For example, predicting the slot labels of each word to obtain the target slot labels of each word can be represented by a function expression whose terms are: the target slot label of word element s; and a trainable parameter matrix.

[0080] For example, the fourth intent node representation is used as the input to a preset multi-intent perception and embedding generator to obtain the target intent label of each word. This calculation process can be represented by a function expression whose terms are: the target intent label; a trainable parameter matrix; the vector representation of the intent node; and the weight matrix of the feedback connection.

[0081] For example, the target slot labels and target intent labels of each word element are used together as components of the natural language processing result.

[0082] According to the above implementation method, the representations of each first slot node and each second slot node are concatenated to obtain a third slot node representation, and the representations of each first intent node and each second intent node are concatenated to obtain a third intent node representation, thereby achieving the fusion of semantic features from different sources. Subsequently, based on a transformation layer with a preset self-attention mechanism, the third slot node representation and the third intent node representation are interactively modeled to obtain a fourth slot node representation and a fourth intent node representation, enabling sufficient information transfer and semantic alignment between slot information and intent information. Further, based on the third slot node representation and the fourth slot node representation, the slot labels of each word are predicted to obtain the target slot labels of each word, and the fourth intent node representation is used as the input of a preset multi-intent perception and embedding generator to obtain the target intent labels of each word. Finally, based on the target slot labels and target intent labels of each word, the natural language processing result is determined. In this way, the integrity and consistency of semantic expression are improved during the joint modeling process of slot filling and intent recognition, further improving the recognition accuracy and overall processing effect in multi-intent natural language understanding tasks.
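The decoding pipeline above can be sketched as follows. This is an illustrative simplification under stated assumptions: the transformation layer is reduced to one single-head self-attention pass over the stacked slot and intent nodes, slot labels come from an argmax over a linear projection, and intent labels come from a sigmoid with a threshold. The feedback connection of the generator is omitted, and all parameter names in `p` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(w_k.shape[1])
    return softmax(scores) @ (x @ w_v)

def decode(s1, s2, i1, i2, p, threshold=0.5):
    # Third representations: concatenate first and second along features.
    s3 = np.concatenate([s1, s2], axis=1)            # (n, 2d)
    i3 = np.concatenate([i1, i2], axis=1)            # (m, 2d)
    # Fourth representations: slots and intents interact in one attention pass.
    joint = self_attention(np.vstack([s3, i3]), p["w_q"], p["w_k"], p["w_v"])
    s4, i4 = joint[: len(s3)], joint[len(s3):]
    # Slot labels from the third and fourth slot representations.
    slot_logits = np.concatenate([s3, s4], axis=1) @ p["w_slot"]
    slot_labels = slot_logits.argmax(axis=1)         # one label id per word
    # Intent decision per candidate intent node (sigmoid + threshold, assumed).
    intent_probs = 1.0 / (1.0 + np.exp(-(i4 @ p["w_intent"])))
    return slot_labels, intent_probs, intent_probs > threshold

# Random placeholders: 5 words, 2 candidate intents, 3 slot label classes.
rng = np.random.default_rng(0)
n, m, d, num_slot_labels = 5, 2, 4, 3
s1, s2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
i1, i2 = rng.normal(size=(m, d)), rng.normal(size=(m, d))
p = {
    "w_q": rng.normal(size=(2 * d, 2 * d)) * 0.1,
    "w_k": rng.normal(size=(2 * d, 2 * d)) * 0.1,
    "w_v": rng.normal(size=(2 * d, 2 * d)) * 0.1,
    "w_slot": rng.normal(size=(4 * d, num_slot_labels)),
    "w_intent": rng.normal(size=2 * d),
}
slot_labels, intent_probs, intent_active = decode(s1, s2, i1, i2, p)
```

Stacking the slot and intent nodes before attention lets every slot attend to every intent and vice versa in a single pass, which is one simple way to realize the interaction step. The returned slot labels and active intents together form the natural language processing result.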

[0083] Figure 2 is a structural block diagram of a multi-intent natural language understanding device according to an embodiment of the present invention.

[0084] As shown in Figure 2, the multi-intent natural language understanding device may include: an encoding and embedding module 510, used to encode and discriminate the natural language utterance input by the user to obtain word-level embedding representations and intent tag embedding representations; an embedding representation update module 520, used to update each of the word-level embedding representations and each of the intent tag embedding representations based on the graph attention mechanism, so as to obtain each first slot node representation and each first intent node representation; a global aggregation module 530, used to perform global aggregation on each of the word-level embedding representations and each of the intent tag embedding representations to obtain a global embedding representation; a multi-objective loss function construction module 540, used to construct a multi-objective loss function based on each of the word-level embedding representations, each of the intent tag embedding representations and the global embedding representation, and to solve for the minimum value of the multi-objective loss function to obtain each second slot node representation and each second intent node representation; and a collaborative decoding module 550, used to collaboratively decode each first slot node representation, each second slot node representation, each first intent node representation, and each second intent node representation to obtain the user's natural language processing result.

[0085] In one implementation, the global aggregation module includes: A node set unit is used to take each of the word-level embedding representations and each of the intent tag embedding representations as nodes to obtain a node set; A semantic difference calculation unit is used to calculate the semantic differences between nodes in the node set to obtain each relation embedding vector. A linear transformation unit is used to perform a linear transformation on the embedding representation of each node in the node set using a trainable global weight matrix to obtain the global mapping features of each node. The attention weight calculation unit is used to process the embedding vectors of each relation, the global mapping features of each node, and the global embedding representation of the current iteration step based on the attention layer in the graph neural network to obtain the attention weights of each node. The aggregation unit is used to aggregate the global mapping features of each node based on the attention weights of each node to obtain the global embedding representation.

[0086] In one implementation, the multi-objective loss function construction module includes: A positive sample determination unit is used to determine each positive sample based on each of the word-level embedding representations; The non-iterative approximation calculation unit is used to perform non-iterative approximation calculations on the gradients of each word-level embedding representation and the gradients of the natural language discourse based on preset hyperparameters to obtain each perturbation vector. The function construction unit is used to construct the multi-objective loss function based on the global embedding representation, each of the intent label embedding representations, each of the positive samples, and each of the adversarial negative samples.

[0087] In one implementation, the function construction unit includes: A negative sample weighting term determination subunit is used to determine negative sample weighting terms based on the global embedding representation and each of the adversarial negative samples; The intent lexical contrast loss construction subunit is used to construct the intent lexical contrast loss based on the cosine similarity between each positive sample and each intent tag embedding representation, the cosine similarity between each positive sample and each adversarial negative sample, and the negative sample weighting term. The mutual information determination subunit is used to determine a first mutual information based on the dot product similarity between the global embedding representation and each of the intent tag embedding representations; and to determine a second mutual information based on the dot product similarity between the global embedding representation and each of the adversarial negative samples. A global graph contrastive loss construction subunit is used to construct a global graph contrastive loss by maximizing the first mutual information and minimizing the second mutual information; A multi-objective loss function determination subunit is used to determine the multi-objective loss function based on the intent lexical contrast loss and the global graph contrast loss.

[0088] In one embodiment, the collaborative decoding module includes: a first splicing unit, used to concatenate each first slot node representation with each second slot node representation to obtain the third slot node representation; a second splicing unit, used to concatenate each first intent node representation with each second intent node representation to obtain the third intent node representation; an interaction unit, used to interact the third slot node representation with the third intent node representation based on a transformation layer with a preset self-attention mechanism to obtain the fourth slot node representation and the fourth intent node representation output by the transformation layer; a prediction unit, used to predict the slot labels of each word element based on the third slot node representation and the fourth slot node representation, so as to obtain the target slot labels of each word element; a target intent tagging unit, used to take the fourth intent node representation as the input of a preset multi-intent perception and embedding generator to obtain the target intent tags of each word element; and a natural language processing result unit, used to determine the natural language processing result based on the target slot label and target intent label of each word element.

[0089] In one embodiment, the encoding and embedding module includes: The encoding unit is used to encode the natural language discourse to obtain word-level embedding representations; The feature enhancement unit is used to enhance the features of each of the word-level embedding representations through a self-attention mechanism to obtain each enhanced embedding representation; Linear transformation and nonlinear activation units are used to perform linear transformation and nonlinear activation on each of the enhanced embedding representations to obtain the intent prediction distribution corresponding to each of the enhanced embedding representations; The extraction unit is used to extract the intent labels corresponding to the intent prediction distributions that are greater than a preset prediction threshold from each intent prediction distribution, and obtain each intent label; A nonlinear activation and linear mapping unit is used to perform nonlinear activation and linear mapping on each of the intent prediction distributions to obtain an intent label embedding matrix. The selection unit is used to select each element in the intent tag embedding matrix based on each intent tag to obtain each intent tag embedding representation.

[0090] In one implementation, the embedded representation update module includes: The slot node and intent node unit is used to determine each slot node based on each of the word-level embedding representations, and to determine each intent node based on each of the intent tag embedding representations; The modeling unit is used to model the connection relationship between each slot node through a preset local dependency perception unit to obtain each slot edge; An intent edge construction unit is used to construct each intent edge based on the connection relationship between each intent node; The intent and slot edge construction unit is used to construct each intent and slot edge based on the connection relationship between each slot node and each intent node; A multi-granularity heterogeneous semantic fusion graph construction unit is used to construct a multi-granularity heterogeneous semantic fusion graph based on each of the slot nodes, each of the intent nodes, each of the slot edges, each of the intent edges, and each of the intent and slot edges. The update unit is used to perform multi-round bidirectional information propagation on the multi-granularity heterogeneous semantic fusion graph through a graph attention mechanism to update each of the intent nodes and the slot nodes, thereby obtaining the representations of each of the first slot nodes and each of the first intent nodes.

[0091] The specific functions and examples of each module and submodule of the system in this embodiment of the invention can be found in the relevant descriptions of the corresponding steps in the above method embodiments, and will not be repeated here.

[0092] The acquisition, storage, and application of user personal information involved in the technical solution of this invention all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0093] This invention also provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described in any one of the embodiments of the present invention.

[0094] The beneficial effects of the electronic device in the embodiments of the present invention are equivalent to the beneficial effects of the above-described multi-intent natural language understanding method, and will not be repeated here.

[0095] This invention also provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method described in any one of the embodiments of this invention.

[0096] The beneficial effects of the storage medium of the present invention are equivalent to the beneficial effects of the above-described multi-intent natural language understanding method, and will not be repeated here.

[0097] Figure 3 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present invention. Electronic device 800 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic device 800 may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.

[0098] As shown in Figure 3, the electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 may also store various programs and data required for the operation of the electronic device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

[0099] Multiple components in electronic device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, etc.; output unit 807, such as various types of displays, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows electronic device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0100] The computing unit 801 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as the multi-intent natural language understanding method. For example, in some embodiments, the multi-intent natural language understanding method can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and / or installed on the electronic device 800 via ROM 802 and / or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the multi-intent natural language understanding method described above can be performed. Alternatively, in other embodiments, the computing unit 801 can be configured to perform the multi-intent natural language understanding method by any other suitable means (e.g., by means of firmware).

[0101] Various embodiments of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0102] The program code used to implement the methods of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.

[0103] In the context of this invention, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0104] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0105] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0106] Computer systems can include clients and servers. A client and a server are generally remote from each other and typically interact via a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server can be a cloud server, a server of a distributed system, or a server incorporating blockchain technology.

[0107] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this invention may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in this invention can be achieved; no limitation is imposed herein.

[0108] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the principles of this invention should be included within the scope of protection of this invention.

Claims

1. A multi-intent natural language understanding method, characterized by comprising: encoding and discriminatively embedding a natural language utterance input by a user to obtain word-level embedding representations and intent label embedding representations; updating each of the word-level embedding representations and each of the intent label embedding representations based on a graph attention mechanism to obtain first slot node representations and first intent node representations; globally aggregating the word-level embedding representations and the intent label embedding representations to obtain a global embedding representation; constructing a multi-objective loss function based on each of the word-level embedding representations, each of the intent label embedding representations, and the global embedding representation, and solving for the minimum value of the multi-objective loss function to obtain second slot node representations and second intent node representations; and collaboratively decoding the first slot node representations, the second slot node representations, the first intent node representations, and the second intent node representations to obtain a natural language processing result for the user.

2. The method according to claim 1, characterized in that the step of globally aggregating the word-level embedding representations and the intent label embedding representations to obtain a global embedding representation comprises: taking each of the word-level embedding representations and each of the intent label embedding representations as a node to obtain a node set; calculating the semantic differences between the nodes in the node set to obtain relation embedding vectors; performing a linear transformation on the embedding representation of each node in the node set by using a trainable global weight matrix to obtain a global mapping feature of each node; processing the relation embedding vectors, the global mapping feature of each node, and the global embedding representation of the current iteration step based on an attention layer in a graph neural network to obtain an attention weight of each node; and aggregating the global mapping features of the nodes based on the attention weights of the nodes to obtain the global embedding representation.

3. The method according to claim 1, characterized in that the step of constructing a multi-objective loss function based on each of the word-level embedding representations, each of the intent label embedding representations, and the global embedding representation comprises: determining positive samples based on the word-level embedding representations; performing, based on preset hyperparameters, a non-iterative calculation on the gradient of each word-level embedding representation and the gradient of the natural language utterance to obtain perturbation vectors; superimposing each perturbation vector on the corresponding word-level embedding representation to obtain adversarial negative samples; and constructing the multi-objective loss function based on the global embedding representation, the intent label embedding representations, the positive samples, and the adversarial negative samples.

4. The method according to claim 3, characterized in that the step of constructing the multi-objective loss function based on the global embedding representation, the intent label embedding representations, the positive samples, and the adversarial negative samples comprises: determining a negative sample weighting term based on the global embedding representation and each of the adversarial negative samples; constructing an intent lexical contrastive loss based on the cosine similarity between each positive sample and each intent label embedding representation, the cosine similarity between each positive sample and each adversarial negative sample, and the negative sample weighting term; determining a first mutual information based on the dot product similarity between the global embedding representation and each of the intent label embedding representations, and determining a second mutual information based on the dot product similarity between the global embedding representation and each of the adversarial negative samples; constructing a global graph contrastive loss by maximizing the first mutual information and minimizing the second mutual information; and determining the multi-objective loss function based on the intent lexical contrastive loss and the global graph contrastive loss.

5. The method according to claim 1, characterized in that the step of collaboratively decoding the first slot node representations, the second slot node representations, the first intent node representations, and the second intent node representations to obtain the natural language processing result for the user comprises: concatenating each first slot node representation with the corresponding second slot node representation to obtain a third slot node representation; concatenating each first intent node representation with the corresponding second intent node representation to obtain a third intent node representation; performing interaction between the third slot node representation and the third intent node representation through a transformation layer based on a preset self-attention mechanism to obtain a fourth slot node representation and a fourth intent node representation output by the transformation layer; predicting the slot label of each word based on the third slot node representation and the fourth slot node representation to obtain a target slot label of each word; taking the fourth intent node representation as the input to a preset multi-intent perception and embedding generator to obtain a target intent label of each word; and determining the natural language processing result based on the target slot label and the target intent label of each word.

6. The method according to claim 1, characterized in that the step of encoding and discriminatively embedding the natural language utterance input by the user to obtain word-level embedding representations and intent label embedding representations comprises: encoding the natural language utterance to obtain the word-level embedding representations; enhancing the features of each word-level embedding representation through a self-attention mechanism to obtain enhanced embedding representations; performing linear transformation and nonlinear activation on each of the enhanced embedding representations to obtain the intent prediction distribution corresponding to each of the enhanced embedding representations; extracting, from the intent prediction distributions, the intent labels whose predicted values are greater than a preset prediction threshold, to obtain the intent labels; performing nonlinear activation and linear mapping on each of the intent prediction distributions to obtain an intent label embedding matrix; and selecting elements of the intent label embedding matrix based on the intent labels to obtain the intent label embedding representations.

7. The method according to claim 1, characterized in that the step of updating each word-level embedding representation and each intent label embedding representation based on the graph attention mechanism to obtain the first slot node representations and the first intent node representations comprises: determining slot nodes based on the word-level embedding representations, and determining intent nodes based on the intent label embedding representations; modeling the connection relationships between the slot nodes through a preset local dependency perception unit to obtain slot edges; constructing intent edges based on the connection relationships between the intent nodes; constructing intent and slot edges based on the connection relationships between the slot nodes and the intent nodes; constructing a multi-granularity heterogeneous semantic fusion graph based on the slot nodes, the intent nodes, the slot edges, the intent edges, and the intent and slot edges; and performing multiple rounds of bidirectional information propagation on the multi-granularity heterogeneous semantic fusion graph through the graph attention mechanism to update the intent nodes and the slot nodes, thereby obtaining the first slot node representations and the first intent node representations.

8. A multi-intent natural language understanding device, characterized by comprising: an encoding and embedding module, used to encode and discriminatively embed a natural language utterance input by a user to obtain word-level embedding representations and intent label embedding representations; an embedding representation update module, used to update each of the word-level embedding representations and each of the intent label embedding representations based on a graph attention mechanism to obtain first slot node representations and first intent node representations; a global aggregation module, used to globally aggregate the word-level embedding representations and the intent label embedding representations to obtain a global embedding representation; a multi-objective loss function construction module, used to construct a multi-objective loss function based on each of the word-level embedding representations, each of the intent label embedding representations, and the global embedding representation, and to solve for the minimum value of the multi-objective loss function to obtain second slot node representations and second intent node representations; and a collaborative decoding module, used to collaboratively decode the first slot node representations, the second slot node representations, the first intent node representations, and the second intent node representations to obtain a natural language processing result for the user.

9. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1-7.

10. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method according to any one of claims 1-7.
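
As a non-authoritative illustration of the loss construction recited in claims 3 and 4, the sketch below builds an adversarial negative sample from a single gradient step and combines two contrastive terms. The claims do not fix the exact formulas, so everything here is an assumption: the normalised single-step perturbation, the sigmoid-based negative sample weighting term, the InfoNCE-style form of the first term, the binary cross-entropy form of the mutual information term, the temperature `tau`, the balance factor `lam`, and all function names. The gradient is passed in as a plain list rather than computed.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b) + 1e-12)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adversarial_negative(embedding, grad, epsilon=0.1):
    """Non-iterative perturbation: epsilon * grad / ||grad||, added to the embedding."""
    g = norm(grad) + 1e-12
    return [e + epsilon * d / g for e, d in zip(embedding, grad)]

def intent_lexical_loss(positive, label_emb, negative, global_emb, tau=0.5):
    """Cosine-similarity contrastive term with a weighted adversarial negative."""
    weight = sigmoid(dot(global_emb, negative))  # assumed negative sample weighting term
    pos = math.exp(cosine(positive, label_emb) / tau)
    neg = weight * math.exp(cosine(positive, negative) / tau)
    return -math.log(pos / (pos + neg))

def global_graph_loss(global_emb, label_emb, negative):
    """Raise the score of (global, label) pairs, lower that of (global, negative) pairs."""
    first = sigmoid(dot(global_emb, label_emb))
    second = sigmoid(dot(global_emb, negative))
    return -math.log(first + 1e-12) - math.log(1.0 - second + 1e-12)

def multi_objective_loss(positive, label_emb, negative, global_emb, lam=1.0):
    return (intent_lexical_loss(positive, label_emb, negative, global_emb)
            + lam * global_graph_loss(global_emb, label_emb, negative))

# Toy 2-dimensional example with hypothetical values.
word_emb = [1.0, 0.0]          # positive sample (a word-level embedding)
grad = [0.0, 2.0]              # gradient with respect to the embedding, assumed given
label_emb = [0.9, 0.1]         # intent label embedding
global_emb = [0.5, 0.5]        # global embedding representation

negative = adversarial_negative(word_emb, grad, epsilon=0.1)
loss = multi_objective_loss(word_emb, label_emb, negative, global_emb)
```

In a trained model the gradient would come from backpropagation and the loss would be minimised over the model parameters; the sketch only shows how the four inputs named in claim 4 enter the two terms.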