A local feature enhancement method based on an attention mechanism and a storage medium
By introducing a recursive gated convolution mechanism alongside the self-attention mechanism in the encoder of the Transformer model, the shortcomings of the self-attention mechanism in local feature interaction and in the construction of local feature information are addressed, thereby improving the accuracy of speech recognition and model performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGZHOU YIFANG INFORMATION TECH CO LTD
- Filing Date
- 2023-08-31
- Publication Date
- 2026-06-19
Figure CN117172287B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence, and more particularly to a method, apparatus, terminal, and storage medium based on an attention mechanism.
Background Technology
[0002] The Transformer is an encoder-decoder model based on an attention mechanism, proposed in the paper "Attention is All You Need." It is characterized by its ability to better capture long-range dependencies and its superior parallelism. Due to its excellent model performance and high flexibility, the Transformer has achieved significant results in the field of Automatic Speech Recognition (ASR). By introducing the Transformer, researchers in ASR can better handle the conversion between speech and text, improving the accuracy and robustness of speech recognition.
[0003] The Transformer primarily consists of an encoder and a decoder. Its core mechanism uses self-attention to calculate the relationships between each word in a sentence and other words, thus helping the model better understand the semantic context. The introduction of multi-head attention, where each head focuses on a different position within the sentence, enhances the self-attention mechanism's ability to express the interactions between words within the sentence. A feedforward neural network introduces a non-linear transformation to the encoder, improving the model's fitting ability. The decoder receives the encoder's output data while simultaneously receiving the output from the previous decoder layer, helping the current node to focus on the content that needs to be emphasized.
[0004] However, the self-attention mechanism in the Transformer model has shortcomings in terms of local feature interactions and the construction of local feature information. Therefore, it is necessary to enhance the self-attention mechanism to effectively capture both global and local feature interactions simultaneously.
Summary of the Invention
[0005] To address the aforementioned technical problems, the first aspect of this invention discloses a local feature enhancement method based on an attention mechanism, the method comprising:
[0006] S1: construct a Transformer model, in which the encoder includes an enhanced self-attention module and a feedforward neural network module connected to each other; the enhanced self-attention module adopts a recursive gated convolution mechanism and a self-attention mechanism;
[0007] S2: define feature X as the input of the encoder; after local feature interaction information is extracted through the recursive gated convolution mechanism, the self-attention mechanism is used to extract global feature interaction information between features;
[0008] S3: perform fusion processing on the local feature interaction information and the global feature interaction information.
[0009] In a further embodiment, step S2 also includes the following steps:
[0010] The feature X is mapped to the query matrix Q, the key matrix K, and the value matrix V using a linear function;
[0011] The self-attention score matrix A between the query matrix Q and the key matrix K is calculated by matrix multiplication to obtain global feature interaction information between features.
[0012] The value matrix V is used as input to a recursive gated convolution to obtain local feature interaction information.
[0013] In a further implementation, after obtaining the self-attention score matrix A, the self-attention score matrix A is normalized and scored using the Softmax function to obtain a global feature information interaction matrix to distinguish important information from marginal information.
[0014] In a further implementation, when the value matrix V is used as input to a recursive gated convolution, local feature interaction information is obtained through the following method:
[0015] The tensor dimension of the value matrix V is expanded to 2D (twice the feature dimension D) using a linear function, and the result is then divided into tensors of different sizes, namely... ;
[0016] First, the first tensor undergoes depthwise separable convolution to extract local feature interaction information and then exchanges information with another... tensor; the exchanged information is linearly mapped to... and then interacts with the... tensor that has undergone depthwise separable convolution; the result is mapped to D and then interacts with the D tensor that has undergone depthwise separable convolution. The result of the interaction is finally processed by a linear function to extract features.
[0017] In a further implementation, the tensor dimension of the value matrix V is expanded to 2D using the following formula:
[0018] ;
[0019] Where M and N represent adjacent features.
[0020] A further implementation involves extracting local feature interaction information by performing depthwise separable convolution on the tensor using the following formula, and then multiplying it with adjacent features: .
[0021] A further implementation method involves extracting local feature interaction information using the following formula, doubling the dimensionality of the result each time: .
[0022] In a further implementation, when fusing local feature interaction information and global feature interaction information, the normalized self-attention score matrix A is multiplied with the local feature interaction information to map the importance of global information onto the local feature interaction information.
[0023] In a further implementation, after obtaining the global feature information interaction matrix, the elements of the global feature information interaction matrix are pruned, and the three smallest values in each column are set to 0.
[0024] A second aspect of the present invention discloses a computer storage medium storing computer instructions, which, when invoked, are used to execute some or all of the steps in the attention-based local feature enhancement method disclosed in the first aspect of the present invention.
[0025] Compared with the prior art, the embodiments of the present invention have the following beneficial effects:
[0026] In this embodiment of the invention, by improving the self-attention module of the encoder in the Transformer model to an enhanced self-attention module, and incorporating recursive gated convolution and self-attention mechanisms, the Transformer model, when applied to speech recognition, can extract local feature interaction information through recursive gated convolution and extract global feature interaction information between all features through self-attention. By fusing local and global feature interaction information, the importance of global information is mapped to local feature interaction information. Compared with the original self-attention mechanism, the enhanced self-attention mechanism fuses local feature interaction information while preserving the original importance distribution of global information, thus improving the accuracy of speech recognition.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is a flowchart of a local feature enhancement method based on an attention mechanism disclosed in an embodiment of the present invention;
[0029] Figure 2 is a schematic diagram of the structure of a computer storage medium disclosed in an embodiment of the present invention;
[0030] Figure 3 is a structural diagram of the Transformer network disclosed in an embodiment of the present invention;
[0031] Figure 4 is a schematic diagram of the process for obtaining local feature interaction information disclosed in an embodiment of the present invention.
Detailed Implementation
[0032] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0033] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or terminal that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or terminals.
[0034] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
[0035] This invention improves the self-attention module of the encoder in the Transformer model by incorporating a recursive gated convolution mechanism and a self-attention mechanism. This allows the Transformer model, when applied to speech recognition, to extract local feature interaction information through recursive gated convolution and extract global feature interaction information between all features through the self-attention mechanism. By fusing local and global feature interaction information, the importance of global information is mapped to local feature interaction information. Compared to the original self-attention mechanism, the enhanced self-attention mechanism integrates local feature interaction information while preserving the original importance distribution of global information, thus improving the accuracy of speech recognition. These aspects are explained in detail below.
[0036] Embodiment 1. Referring to Figure 1, Figure 1 is a flowchart of a local feature enhancement method based on an attention mechanism disclosed in an embodiment of the present invention. As shown in Figure 1, this attention-based local feature enhancement method can include the following operations:
[0037] S1: construct a Transformer model, in which the encoder includes an enhanced self-attention module and a feedforward neural network module connected to each other; the enhanced self-attention module adopts a recursive gated convolution mechanism and a self-attention mechanism;
[0038] It can be understood that the Transformer model consists of two components: an encoder and a decoder. The encoder employs the enhanced self-attention mechanism, while the decoder uses the original self-attention mechanism. Each encoder contains two modules: an enhanced self-attention (ESA) module and a feedforward neural network (FFN) module. Each module employs residual connections and layer normalization to further improve model performance. As shown in Figure 3, the attention mechanism plays an indispensable role in the multi-head attention (MHA) sublayers of the Transformer model's encoder and decoder. In ASR tasks, it can establish speech-speech, speech-text, and text-text correlations. The Transformer model in this embodiment includes the original Transformer model and improved variants such as Conformer and ST-Transformer.
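As a minimal illustrative sketch of the encoder structure described above, the following Python (NumPy) code wires an ESA sublayer and an FFN sublayer together, each wrapped in a residual connection and layer normalization. The function names, the post-normalization ordering, the ReLU non-linearity, and the identity placeholder standing in for the ESA module are all assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, esa, ffn):
    """One encoder layer: ESA sublayer followed by FFN sublayer.

    Each sublayer is wrapped in a residual connection and then layer
    normalization (this post-norm ordering is an assumption of the sketch).
    """
    x = layer_norm(x + esa(x))
    x = layer_norm(x + ffn(x))
    return x

def make_ffn(d_model, d_hidden, rng):
    """Build a two-layer feedforward network with a ReLU in between."""
    w1 = rng.standard_normal((d_model, d_hidden)) * 0.1
    w2 = rng.standard_normal((d_hidden, d_model)) * 0.1

    def ffn(x):
        return np.maximum(x @ w1, 0.0) @ w2  # ReLU non-linearity (assumed)

    return ffn

rng = np.random.default_rng(0)
T, D = 6, 8                       # frames and feature dimension (illustrative)
x = rng.standard_normal((T, D))

def placeholder_esa(h):
    """Hypothetical stand-in for the enhanced self-attention module."""
    return h

out = encoder_layer(x, placeholder_esa, make_ffn(D, 16, rng))
```

In a fuller model, `placeholder_esa` would be replaced by a function that carries out steps S2 and S3.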
[0039] S2: define feature X as the input of the encoder; after local feature interaction information is extracted through the recursive gated convolution mechanism, the self-attention mechanism is used to extract global feature interaction information between features.
[0040] It can be understood that the feature X, the input to the MHA sublayer, is a speech or text feature, where T represents the number of frames in the speech or text. Processing the input feature X further includes the following steps:
[0041] S21: map feature X to the query matrix Q, the key matrix K, and the value matrix V using three linear functions;
[0042] The formula is expressed as follows:
[0043] $Q = XW_Q,\quad K = XW_K,\quad V = XW_V$;
[0044] where $W_Q$, $W_K$, and $W_V$ are the learnable parameter matrices of the linear functions that generate the query matrix Q, the key matrix K, and the value matrix V, respectively.
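Step S21 can be sketched in Python (NumPy) as three independent linear maps applied to the same input. The variable names `W_q`, `W_k`, `W_v`, the sizes `T` and `D`, and the omission of bias terms are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 5, 8                        # frames and feature dimension (illustrative)
X = rng.standard_normal((T, D))    # encoder input feature X

# Hypothetical learnable parameter matrices, one per linear function.
W_q = rng.standard_normal((D, D)) * 0.1
W_k = rng.standard_normal((D, D)) * 0.1
W_v = rng.standard_normal((D, D)) * 0.1

Q = X @ W_q   # query matrix, shape (T, D)
K = X @ W_k   # key matrix, shape (T, D)
V = X @ W_v   # value matrix, shape (T, D)
```

In training, the three parameter matrices would be updated by gradient descent; here they are fixed random values.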
[0045] S22: the query matrix Q and the key matrix K are multiplied by dot product, and the Softmax function is then used to calculate the feature association matrix, i.e., the self-attention score matrix A, to obtain global feature interaction information between features. Specifically, the self-attention score matrix A is a two-dimensional probability distribution map in which A[i,j] represents the degree of correlation between the feature of the i-th frame and the feature of the j-th frame. Through the matrix dot product, each feature interacts with every other feature, yielding a matrix that establishes global interaction information between features. The self-attention score matrix A is then normalized and scored using the Softmax function to obtain a global feature information interaction matrix, distinguishing important information from marginal information for easier model training and optimization. In the attention mechanism, the dot product makes each element in one sequence interact with each element in another sequence, resulting in an attention correlation matrix between the two sequences. The main idea of a sparse attention matrix is to reduce unnecessary correlation calculations by focusing only on elements with high values (such as elements related along the diagonal) and discarding the other low-valued elements. Softmax is applied along the column axis using the following formula:
[0046] ;
[0047] The output values along the horizontal axis of the attention matrix lie within [0, 1]; they form a probability distribution between features and sum to 1. It follows that when the value of an element is very close to 1, the co-occurrence of the corresponding features is close to a certain event, indicating a strong correlation between the features; when the value of an element is very close to 0, the probability of the features occurring simultaneously is close to 0, indicating no correlation between them, and such values can be ignored. Therefore, after the global feature information interaction matrix is obtained through the Softmax function, it is pruned by setting the three smallest values in each column to 0. Pruning these weakly correlated elements further reduces the computational load and complexity of the entire model.
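Step S22 and the pruning operation can be sketched in Python (NumPy) as follows. Several choices here are assumptions, because the source formula is not available: the scores are left unscaled, Softmax is taken over the last axis so that each row sums to 1, "each column" is read literally as axis 0 of the matrix, and no renormalization is applied after pruning. The function names are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable Softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def prune_smallest_per_column(a, k=3):
    """Return a copy of `a` with the k smallest entries of each column set to 0."""
    pruned = a.copy()
    smallest_rows = np.argsort(a, axis=0)[:k, :]   # row indices, shape (k, n_cols)
    np.put_along_axis(pruned, smallest_rows, 0.0, axis=0)
    return pruned

rng = np.random.default_rng(0)
T, D = 6, 4                          # frames and feature dimension (illustrative)
Q = rng.standard_normal((T, D))
K = rng.standard_normal((T, D))

A = softmax(Q @ K.T, axis=-1)        # self-attention score matrix, shape (T, T)
A_pruned = prune_smallest_per_column(A, k=3)
```

With `T = 6`, each column of `A_pruned` keeps its three largest scores and has the other three set to zero.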
[0048] S23: use the value matrix V as input to a recursive gated convolution to obtain local feature interaction information.
[0049] It can be understood that a recursive gated convolution is constructed from standard convolutions, linear projections, and element-wise multiplication. Its convolution-based implementation avoids the quadratic complexity of self-attention. The design of progressively increasing the channel width during spatial interactions also enables it to achieve high-order interactions with bounded complexity. It extends the second-order interactions in self-attention to arbitrary orders to further enhance modeling capability, and it is compatible with various kernel sizes. It fully inherits the translation equivariance of standard convolutions, introduces inductive biases beneficial to the task, and avoids the asymmetry inherent in local attention.
[0050] As shown in Figure 4, when the value matrix V is used as input to a recursive gated convolution, local feature interaction information is obtained through the following method:
[0051] The tensor dimension of the value matrix V is expanded to 2D (twice the feature dimension D) using a linear function, and the result is then divided into tensors of different sizes, namely... ;
[0052] First, the first tensor undergoes depthwise separable convolution to extract local feature interaction information and then exchanges information with another... tensor; the exchanged information is linearly mapped to... and then interacts with the... tensor that has undergone depthwise separable convolution; the result is mapped to D and then interacts with the D tensor that has undergone depthwise separable convolution. The result of the interaction is finally processed by a linear function to extract features.
[0053] Specifically, the steps for achieving interaction through local features are as follows: the tensor dimension of the value matrix V is expanded to 2D using the following formula:
[0054] ;
[0055] Where M and N represent adjacent features.
[0056] The following formula is used to extract local feature interaction information by performing depthwise separable convolution on the tensor and then multiplying the result with adjacent features: . The multiplication between features is performed element-wise. In this calculation, local feature information interacts with neighboring feature information. Each depthwise separable convolution processes a different data dimension, so the depthwise separable convolutions are all distinct, each with its own trainable weight parameters. Thus, with each recursive gated convolution operation, the recursive gated convolution weight parameters for that dimension are updated, which better extracts the interaction information between local features. Furthermore, these recursive gated convolutions do not introduce excessive parameters or computational overhead.
[0057] In a further implementation, local feature interaction information is extracted using the following formula, with the dimensionality of the result doubled each time: . It can be understood that (for example, ... → ...), after the above steps, the local feature extraction and local feature information interaction of one recursive gated convolution are completed. This local information is then used for the local feature extraction and local feature information interaction in the next dimension. By performing local feature extraction and local feature information interaction on the value matrix V, local feature interaction information of dimension D is obtained.
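The recursive gated convolution of step S23 can be sketched in Python (NumPy) as follows. This is an illustration under stated assumptions, not the disclosed implementation. The split sizes D/4, D/4, D/2, and D are an assumption chosen because they sum to 2D and double at each step; the source elides the actual sizes. The depthwise convolution is applied along the time axis with a kernel size of 3, and the pointwise half of the depthwise separable convolution is assumed to be absorbed by the linear maps. All function and parameter names are hypothetical.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """Convolve each channel of x (T, C) along time with its own kernel (k, C)."""
    columns = [
        np.convolve(x[:, c], kernels[:, c], mode="same")
        for c in range(x.shape[1])
    ]
    return np.stack(columns, axis=1)

def init_params(d, kernel_size, rng):
    """Random weights for the sketch; each convolution has its own kernels."""
    def mat(rows, cols):
        return rng.standard_normal((rows, cols)) * 0.1
    return {
        "w_in": mat(d, 2 * d),             # expands V from D to 2D
        "k0": mat(kernel_size, d // 4),    # depthwise kernels, width D/4
        "k2": mat(kernel_size, d // 2),    # depthwise kernels, width D/2
        "k3": mat(kernel_size, d),         # depthwise kernels, width D
        "w1": mat(d // 4, d // 2),         # maps D/4 to D/2
        "w2": mat(d // 2, d),              # maps D/2 to D
        "w_out": mat(d, d),                # final linear function
    }

def recursive_gated_conv(v, params):
    """Three-step recursive gated convolution over V of shape (T, D)."""
    d = v.shape[1]
    assert d % 4 == 0, "this sketch assumes D is divisible by 4"
    expanded = v @ params["w_in"]                       # (T, 2D)
    sizes = [d // 4, d // 4, d // 2, d]                 # assumed split, sums to 2D
    t0, t1, t2, t3 = np.split(expanded, np.cumsum(sizes)[:-1], axis=1)

    # Each step gates the running result element-wise with a convolved tensor.
    p = depthwise_conv1d(t0, params["k0"]) * t1                  # width D/4
    p = (p @ params["w1"]) * depthwise_conv1d(t2, params["k2"])  # width D/2
    p = (p @ params["w2"]) * depthwise_conv1d(t3, params["k3"])  # width D
    return p @ params["w_out"]                                   # (T, D)

rng = np.random.default_rng(0)
T, D = 10, 8
params = init_params(D, kernel_size=3, rng=rng)
V = rng.standard_normal((T, D))
local_info = recursive_gated_conv(V, params)   # shape (T, D)
```

Because every convolution in this sketch has a kernel size of 3 and all other operations act frame by frame, each output frame depends only on its immediate neighbors in time, which is the sense in which the extracted information is local.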
[0058] S3: perform fusion processing on the local and global feature interaction information. During this fusion, the normalized self-attention score matrix A is multiplied with the local feature interaction information to map the importance of global information onto the local feature interaction information, thus achieving information fusion. Dataset validation shows that, compared with the basic Transformer model, the algorithm exhibits a 0.6% performance improvement on the Aishell dataset and a 1.0% performance improvement on the HKUST dataset. In AI speech recognition tasks, this improves accuracy, makes efficient use of limited physical resources, and enhances the comfort and other commercial value of speech recognition applications.
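The fusion in step S3 can be sketched in Python (NumPy) as follows. The sketch assumes that "multiplied" means a matrix product between the normalized score matrix A of shape (T, T) and the local feature interaction information of shape (T, D), by analogy with the product of A and V in standard self-attention. The function name `fuse` and the example values are hypothetical.

```python
import numpy as np

def fuse(attention, local_info):
    """Weight local feature interaction information by normalized attention scores."""
    return attention @ local_info

rng = np.random.default_rng(0)
T, D = 4, 6
local_info = rng.standard_normal((T, D))   # stand-in for the recursive gated convolution output

# Uniform attention: every frame attends equally to all frames.
uniform_attention = np.full((T, T), 1.0 / T)
fused_uniform = fuse(uniform_attention, local_info)

# Identity attention: every frame attends only to itself.
fused_identity = fuse(np.eye(T), local_info)
```

Under uniform attention each fused frame is the average of the local features across frames, while under identity attention the local features pass through unchanged; a learned score matrix falls between these two extremes.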
[0059] The embodiments of the present invention have at least the following beneficial effects:
[0060] (1) The Enhanced Self-Attention (ESA) module integrates the local feature interaction information extracted by the recursive gated convolution mechanism and the global feature interaction information extracted by the self-attention mechanism within the attention mechanism. This can promote long-distance information interaction between labels and multi-level local information interaction, thereby improving the model's performance. The enhanced self-attention mechanism can achieve information interaction between features without introducing significant additional computational costs, thereby improving the model's performance and extracting features better.
[0061] (2) An element pruning operation is added to the enhanced self-attention mechanism. The obtained global information is first pruned to remove weakly correlated elements, thereby reducing the computational load and computational complexity of the model.
[0062] As can be seen, this invention improves the self-attention module of the encoder in the Transformer model by replacing it with an enhanced self-attention module. It incorporates a recursive gated convolution mechanism and a self-attention mechanism, enabling the Transformer model to extract local feature interaction information through the recursive gated convolution mechanism and extract global feature interaction information between all features through the self-attention mechanism. By fusing local and global feature interaction information, the importance of global information is mapped to local feature interaction information. Compared to the original self-attention mechanism, the enhanced self-attention mechanism fuses local feature interaction information while preserving the original importance distribution of global information, thus improving the accuracy of speech recognition.
[0063] Embodiment 2. Referring to Figure 2, Figure 2 is a schematic diagram of the structure of a computer storage medium disclosed in an embodiment of the present invention. As shown in Figure 2, an embodiment of the present invention discloses a computer storage medium 201, which stores computer instructions. When the computer instructions are invoked, they are used to execute the steps in the local feature enhancement method based on the attention mechanism disclosed in Embodiment 1 of the present invention.
[0064] The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0065] Through the detailed description of the above embodiments, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
[0066] Finally, it should be noted that the local feature enhancement method based on the attention mechanism and the storage medium disclosed in the embodiments of the present invention are merely preferred embodiments of the present invention and are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. An attention mechanism-based local feature enhancement method applied to speech recognition, characterized in that the method includes the following steps: S1: construct a Transformer model, in which the encoder includes an enhanced self-attention module and a feedforward neural network module connected to each other; the enhanced self-attention module adopts a recursive gated convolution mechanism and a self-attention mechanism; S2: define feature X as the input of the encoder; after local feature interaction information is extracted through the recursive gated convolution mechanism, the self-attention mechanism is used to extract global feature interaction information between features; the feature X is a speech feature or a text feature; S3: perform fusion processing on the local feature interaction information and the global feature interaction information; step S2 also includes the following steps: the feature X is mapped to the query matrix Q, the key matrix K, and the value matrix V using a linear function; the self-attention score matrix A between the query matrix Q and the key matrix K is calculated by matrix multiplication to obtain global feature interaction information between features; the value matrix V is used as the input of the recursive gated convolution to obtain local feature interaction information; after the self-attention score matrix A is obtained, it is normalized and scored by the Softmax function to obtain the global feature information interaction matrix, so as to distinguish important information from marginal information.
When the value matrix V is used as input to a recursive gated convolution, local feature interaction information is obtained through the following method: the tensor dimension of the value matrix V is expanded to 2D by a linear function, and the result is divided into tensors of different sizes, namely (...D, ...D, ...D, D); first, the first ...D tensor undergoes depthwise separable convolution to extract local feature interaction information and then exchanges information with another ...D tensor; the exchanged information is linearly mapped to ...D and then exchanges information with the ...D tensor that has undergone depthwise separable convolution; the result is mapped to D and then exchanges information with the D tensor that has undergone depthwise separable convolution; the information exchange result is then processed by a linear function to extract features. When fusing local and global feature interaction information, the normalized self-attention score matrix A is multiplied with the local feature interaction information to map the importance of global information onto the local feature interaction information.
2. The attention mechanism based local feature enhancement method of claim 1, wherein the tensor dimension of the value matrix V is expanded to 2D using the following formula: (x), x ∈ (T, 2D); where M and N represent adjacent features, x represents an element in feature X, and T represents the frame number of element x in feature X.
3. The attention mechanism based local feature enhancement method of claim 2, wherein the local feature interaction information is extracted by depthwise separable convolution on the tensor through the following formula, and multiplied with the adjacent features: .
4. The attention mechanism based local feature enhancement method of claim 3, wherein, The following formula is used to extract features from local feature interaction information, and the dimensionality of the result is doubled each time: .
5. The attention mechanism based local feature enhancement method of claim 1, wherein, After obtaining the global feature information interaction matrix, the elements of the global feature information interaction matrix are pruned, and the three smallest values in each column are set to 0.
6. A computer storage medium, wherein the computer-readable storage medium stores a computer program, characterized in that, When the computer program is executed by the processor, it implements the local feature enhancement method based on the attention mechanism as described in any one of claims 1-5.