Speech processing device, speech processing method, and speech processing program
By replacing or eliminating self-attention mechanisms in Conformer models with FFT or optimized data exchange, the computational inefficiencies in processing long audio sequences are addressed, enabling efficient and context-aware speech processing with improved accuracy and consistency across various tasks.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- NIPPON TELEGRAPH & TELEPHONE CORP
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-24
AI Technical Summary
Conventional speech and language processing is computationally inefficient on long audio such as lectures, which consist of hundreds of thousands to millions of frames. Conformer-based models rely on a self-attention mechanism that performs the inner product calculation in the sequence direction twice, so the computational complexity grows with the square of the sequence length.
The proposed solution replaces or deletes the self-attention mechanism in Conformer models. Replacing it with a Fast Fourier Transform (FFT) approximates self-attention and reduces the computational complexity to O(LlogL+DlogD). Replacing it with Flash Attn keeps the theoretical complexity unchanged but speeds up processing by optimizing data exchange between processor and memory. Removing it entirely eliminates the cost of self-attention.
This approach enables efficient processing of long audio sequences by reducing computational load, allowing for more human-like audio processing that considers the entire context, improving speech recognition, summarization, and intent estimation tasks with higher accuracy and consistency.
Figure 2026103761000001_ABST
Abstract
Description
Technical Field
[0001] The present invention relates to a speech processing device, a speech processing method, and a speech processing program.
Background Art
[0002] In speech language processing, Conformer, which is a derivative of Transformer, is widely used. Also, self-supervised learning (SSL), which has contributed to the development of deep learning, is known. HuBERT, which performs self-supervised learning using a large amount of audio, is used as a pre-trained model in many tasks such as speech recognition and speech translation.
[0003] HuBERT incorporates Conformer. Also, Conformer adopts a self-attention mechanism to learn sequential data.
Prior Art Documents
Non-Patent Documents
[0004]
Non-Patent Document 1
Non-Patent Document 2
Non-Patent Document 3
[0005] However, conventional technologies have a problem in that they cannot perform speech and language processing efficiently.
[0006] Conformer utilizes a self-attention mechanism to learn sequential information. Specifically, for a distributed representation E∈R^(L×D) of input data with sequence length L and dimension D, the self-attention mechanism performs the processing shown in equation (1).
[0007]
[Math. 1]
[0008] W^Q, W^K, W^V∈R^(D×D) are the weights for mapping E to obtain the Query, Key, and Value, respectively. A is the result of self-attention. Because the self-attention mechanism performs the inner product calculation in the sequence direction twice, its computational complexity is on the order of O(L^2 D) for data of sequence length L. In other words, the self-attention mechanism incurs a computational cost on the order of the square of the sequence length. Therefore, in Conformer, it is computationally difficult to handle long audio, such as lectures, which consist of hundreds of thousands to millions of frames.
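As an illustration only, a standard scaled dot-product form that is consistent with the symbols defined above can be written as follows. The scaling factor \sqrt{D}, the softmax normalization, and the row-major convention E W (chosen so that shapes match E∈R^(L×D)) are assumptions of this sketch and may differ from the notation of equation (1) itself.

```latex
\begin{aligned}
Q &= E W^{Q}, \qquad K = E W^{K}, \qquad V = E W^{V},
  \qquad E \in \mathbb{R}^{L \times D},\;
  W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{D \times D} \\
A &= \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{D}} \right) V \\
\text{cost} &= \underbrace{O(L^{2} D)}_{Q K^{\top}}
  + \underbrace{O(L^{2} D)}_{\text{weights} \times V}
  + \underbrace{O(L D^{2})}_{\text{projections}}
\end{aligned}
```

Under these assumptions, the two products taken over the sequence direction each cost on the order of L^2 D operations, which dominates when L is much larger than D, as it is for long audio.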
[0009] Therefore, the objective of the present invention is to efficiently perform speech and language processing.
Means for Solving the Problem
[0010] To solve the aforementioned problem, the speech processing device of the present invention is characterized by having a calculation unit that performs calculations related to speech using a second model obtained by replacing or deleting a self-attention mechanism included in a first model that takes speech data as input.
Effects of the Invention
[0011] According to the present invention, speech and language processing can be performed efficiently.
Brief Description of the Drawings
[0012]
[Figure 1] Figure 1 is a diagram showing a configuration example of the learning device according to the first embodiment.
[Figure 2] Figure 2 is a diagram illustrating the configuration of the model before replacement.
[Figure 3] Figure 3 is a flowchart showing the processing steps of the model before replacement.
[Figure 4] Figure 4 is a diagram showing an example of the Conformer before replacement.
[Figure 5] Figure 5 is a diagram showing an example of the Conformer after replacement.
[Figure 6] Figure 6 is a diagram showing an example of the Conformer after replacement.
[Figure 7] Figure 7 is a diagram showing an example of the Conformer after replacement.
[Figure 8] Figure 8 is a diagram showing the correct text and the text of the speech recognition result.
[Figure 9] Figure 9 is a diagram showing the results of the experiment.
[Figure 10] Figure 10 is a diagram showing the results of the experiment.
[Figure 11] Figure 11 is a diagram showing the results of the experiment.
[Figure 12] Figure 12 is a diagram showing a configuration example of a computer that executes a learning program.
MODE FOR CARRYING OUT THE INVENTION
[0013] Hereinafter, modes for carrying out the present invention (embodiments) will be described with reference to the drawings. The present invention is not limited to the embodiments.
[0014] [First Embodiment] The configuration of the learning device according to this embodiment will be described using FIG. 1. FIG. 1 is a diagram showing a configuration example of the learning device according to the first embodiment. The learning device 10 performs learning of a model used for tasks of speech language processing such as speech recognition and speech translation. Further, the learning device 10 may execute a task (inference) using the learned model. Further, the learning device 10 is an example of a speech processing device.
[0015] As shown in FIG. 1, the learning device 10 includes a communication unit 11, an input unit 12, an output unit 13, a storage unit 14, and a control unit 15.
[0016] The communication unit 11 performs data communication with other devices via a network. For example, the communication unit 11 is a NIC (Network Interface Card). The input unit 12 is an interface for receiving input of data. The output unit 13 is an interface for outputting data.
[0017] The storage unit 14 is a storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or optical disc. Alternatively, the storage unit 14 may be a rewritable semiconductor memory such as RAM (Random Access Memory), flash memory, or NVSRAM (Non-Volatile Static Random Access Memory). The storage unit 14 stores the OS (Operating System) and various programs executed by the learning device 10. The storage unit 14 also stores the model information 141.
[0018] Model information 141 consists of parameters for constructing the model. Examples of model information 141 include the weights and biases of a neural network.
[0019] The control unit 15 controls the entire learning device 10. The control unit 15 is, for example, an electronic circuit such as a CPU (Central Processing Unit), MPU (Micro Processing Unit), or GPU (Graphics Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
[0020] Furthermore, the control unit 15 has internal memory for storing programs and control data that define various processing procedures, and executes each process using the internal memory. In addition, the control unit 15 functions as various processing units when various programs are run. For example, the control unit 15 has a calculation unit 151 and an update unit 152.
[0021] The calculation unit 151 inputs input data into a model constructed based on the model information 141 and calculates output data. For example, the input data is audio (for example, data representing the waveform of the audio). Also, for example, the output data is distributed representation or text based on distributed representation.
[0022] Furthermore, the update unit 152 updates the model parameters, i.e., the model information 141, so that the output data calculated by the calculation unit 151 is optimized. For example, the update unit 152 updates the parameters using the backpropagation method of a neural network.
[0023] In this embodiment, a model capable of reducing computational complexity is realized by replacing the self-attention mechanism included in Conformer. First, the configuration of model 20 before the replacement will be explained using Figure 2. Figure 2 is a diagram illustrating the configuration of the model before the replacement.
[0024] As shown in Figure 2, Model 20 has an input layer 21 and an encoder 22. Speech is input to the input layer 21. The encoder 22 outputs a distributed representation based on the speech. The distributed representation can be converted to text. That is, according to Model 20, text is obtained that is a transcription of the input speech.
[0025] The encoder 22 has N Conformer layers. Each Conformer layer includes a feedforward layer 221, a self-attention mechanism 222, a CNN (Convolutional Neural Network) layer 223, a feedforward layer 224, and a Layer Normalization layer 225.
[0026] Figure 3 is a flowchart showing the processing steps of the model before replacement. Note that each process in Figure 3 may be executed by the calculation unit 151. As shown in Figure 3, the input layer 21 extracts acoustic features from the input audio (step S101). Next, the input layer 21 downsamples the acoustic features (step S102).
[0027] In the Conformer layers, the downsampled acoustic features are processed iteratively N times (steps S103, S109).
[0028] First, the feedforward layer 221 transforms the input value and inputs the resulting value to the self-attention mechanism 222 (step S104). The self-attention mechanism 222 performs sequence modeling on the input value using self-attention and inputs the result to the CNN layer 223 (step S105).
[0029] The CNN layer 223 performs a convolution operation and inputs the result to the feedforward layer 224 (step S106). The feedforward layer 224 transforms the input value and inputs the resulting value to the Layer Normalization layer 225 (step S107). The Layer Normalization layer 225 normalizes the input value (step S108).
[0030] In this embodiment, learning and inference are performed using a model in which the Conformer's self-attention mechanism has been replaced. The replaced model reduces the computational complexity in speech and language processing, enabling long speech to be processed in a single step. The Conformer's self-attention mechanism may be removed instead of being replaced.
[0031] First, Figure 4 shows a more concrete example of Model 20, which has been explained so far. Figure 4 is a diagram showing an example of the Conformer before replacement. Note that the input layer will be omitted in subsequent diagrams as appropriate.
[0032] As shown in Figure 4, Conformer 30a has an Attention layer 31a, a Linear layer 32, a Convolutional layer 33, an MLP (multi-layer perceptron) layer 34, and a LayerNorm layer 35. The Attention layer 31a is a self-attention mechanism. The Linear layer 32 and the MLP layer 34 are feedforward layers. The Convolutional layer 33 is a CNN layer. The LayerNorm layer 35 is a Layer Normalization layer.
[0033] The following describes several models after replacement as examples. In each example, the Attention layer 31a in Figure 4 is replaced with a different mechanism. That is, the calculation unit 151 performs calculations for speech-related tasks using the model after replacement, which is obtained by replacing or deleting the self-attention mechanism included in the original model that takes speech data as input.
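The idea of obtaining a second model by swapping only the self-attention position can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names ConformerLayer, replace_mixer, and run are hypothetical, the projections inside the attention are omitted, and the Convolution and MLP layers are left out so that only the replaced position is visible.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

import numpy as np

Mixer = Callable[[np.ndarray], np.ndarray]  # maps an (L, D) array to an (L, D) array


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each frame (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax_attention(x: np.ndarray) -> np.ndarray:
    """Self-attention with identity projections (weights omitted; an assumption)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)
    return alpha @ x


def identity(x: np.ndarray) -> np.ndarray:
    """Deleted self-attention: the input is passed straight through."""
    return x


@dataclass(frozen=True, eq=False)
class ConformerLayer:
    mixer: Mixer        # occupies the position of Attention layer 31a
    linear: np.ndarray  # (D, D) weights standing in for Linear layer 32

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = self.mixer(x)
        h = h @ self.linear
        # The Convolution and MLP layers are omitted from this sketch.
        return layer_norm(h)


def replace_mixer(layers: List[ConformerLayer], new_mixer: Mixer) -> List[ConformerLayer]:
    """Build the second model: same weights, different sequence mixer."""
    return [replace(layer, mixer=new_mixer) for layer in layers]


def run(layers: List[ConformerLayer], x: np.ndarray) -> np.ndarray:
    """Apply the layers in order to an (L, D) input."""
    for layer in layers:
        x = layer(x)
    return x
```

In this sketch the first model is a list of layers built with softmax_attention, and replace_mixer(first_model, identity) yields a second model that shares the remaining weights while leaving the first model unchanged.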
[0034] [Example 1] Figure 5 shows an example of the Conformer after replacement. As shown in Figure 5, the Conformer 30b of Example 1 is obtained by replacing the Attention layer 31a of Conformer 30a with an FFT layer 311b and a LayerNorm layer 312b. In other words, the model after replacement is a model in which the self-attention mechanism of the original model is replaced with an FFT. In Example 1, the computational complexity is reduced by approximation. The LayerNorm layer 312b normalizes the output of the FFT.
[0035] Here, the processing of the self-attention mechanism can be broadly divided into two parts. The first part is represented by equation (2).
[0036]
[Math. 2]
[0037] Equation (2) is the inner product calculation that yields the self-attention score α∈R^(L×L). α is a matrix whose elements lie between 0 and 1 and sum to 1 in the sequence direction.
[0038] The second part is represented by equation (3).
[0039]
[Math. 3]
[0040] In equation (3), the self-attention result A is calculated using α and the Value V = W^V E.
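The two parts described above can be sketched in NumPy as follows. This is an illustration under assumptions: the softmax normalization, the \sqrt{D} scaling, the row-major projections E @ W, and the function names attention_scores and attention_output are not taken from the equations themselves, whose images are not reproduced in this text.

```python
import numpy as np


def attention_scores(E, W_q, W_k):
    """Sketch of the first part: score matrix alpha of shape (L, L)."""
    Q = E @ W_q
    K = E @ W_k
    logits = Q @ K.T / np.sqrt(E.shape[1])  # first inner product over the sequence
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def attention_output(alpha, E, W_v):
    """Sketch of the second part: A = alpha V, shape (L, D)."""
    V = E @ W_v
    return alpha @ V  # second inner product over the sequence


rng = np.random.default_rng(0)
L, D = 6, 4  # toy sizes chosen for illustration
E = rng.standard_normal((L, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) for _ in range(3))
alpha = attention_scores(E, W_q, W_k)
A = attention_output(alpha, E, W_v)
```

Both steps materialize or consume an L×L matrix, which is where the quadratic dependence on the sequence length comes from.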
[0041] Therefore, the self-attention mechanism can be regarded as a discrete convolution function. In Example 1, based on FNet (see Non-Patent Document 2), the Conformer's self-attention mechanism is replaced with a two-dimensional Fast Fourier Transform (FFT), which is a type of convolution function. That is, the calculation result of the self-attention mechanism is approximated by the calculation result of the FFT.
[0042] This approximation reduces the computational complexity of the Conformer to the order of O(LlogL+DlogD). Furthermore, because the FFT performs convolution on the entire sequence, it always considers the contextual information of the entire audio.
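A minimal sketch of FFT-based mixing in the style of FNet is given below. It assumes that only the real part of the two-dimensional FFT is kept, as FNet does, and that a parameter-free layer normalization follows; the function names fourier_mix and fft_layer are hypothetical, and the actual FFT layer 311b may differ in detail.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each frame (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def fourier_mix(E):
    """FNet-style mixing: real part of the 2D FFT of an (L, D) array."""
    return np.fft.fft2(E).real


def fft_layer(E):
    """Stand-in for FFT layer 311b followed by LayerNorm layer 312b."""
    return layer_norm(fourier_mix(E))


rng = np.random.default_rng(0)
E = rng.standard_normal((16, 8))  # 16 frames, 8 features (toy sizes)
bumped = E.copy()
bumped[0, 0] += 1.0               # perturb a single value in one frame
shift = fourier_mix(bumped) - fourier_mix(E)
```

Because the transform is linear and the transform of a unit impulse at the origin is constant, shift is approximately 1 at every position: a change in one frame reaches every output frame, which illustrates why the FFT layer takes the whole sequence into account.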
[0043] On the other hand, CNNs, which are well known in the field of deep learning, are local convolution functions that depend on the window width K. For example, in Conformer, a window width of around K=30 is commonly used. Therefore, the Conformer in Example 1 learns sequential information using both a global convolution function (FFT layer 311b) and a local convolution function (Convolution layer 33).
[0044] [Example 2] Figure 6 shows an example of a Conformer after replacement. As shown in Figure 6, the Conformer 30c of Example 2 has the Attention layer 31a of Conformer 30a replaced with a Flash Attn (Flash-Attention) layer 31c.
[0045] The Flash Attn layer 31c does not reduce the theoretical computational complexity, but rather improves the computational speed and efficiency of the self-attention mechanism through a low-level implementation optimized for specific hardware.
[0046] The Flash Attn layer focuses on hardware data I/O: it divides the data into sizes that can be exchanged at one time between the processor (e.g., a GPU (Graphics Processing Unit)) and memory (e.g., SRAM (Static Random Access Memory)), and repeatedly performs the inner product calculation on each piece, achieving fast and memory-efficient processing. Therefore, although the computational complexity of the Flash Attn layer remains on the order of O(L^2 D), a fast and memory-efficient self-attention mechanism is realized.
[0047] In Example 2, the modified model replaces the self-attention mechanism of the original model with a mechanism that divides the data to be processed into sizes that can be exchanged between the device's processor and memory at once, and processes each divided data.
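The divide-and-process idea can be illustrated in NumPy with a running ("online") softmax, so that the full L×L score matrix is never held at once. This is a sketch under assumptions, not the Flash-Attention kernel: it tiles only the keys and values, runs on the CPU, and the names tiled_attention and block_size are hypothetical.

```python
import numpy as np


def tiled_attention(q, k, v, block_size=64):
    """Exact softmax attention computed one key/value block at a time."""
    length, dim = q.shape
    scale = 1.0 / np.sqrt(dim)
    out = np.zeros((length, v.shape[1]))
    row_max = np.full(length, -np.inf)  # running maximum of each score row
    row_sum = np.zeros(length)          # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale  # only an (L, block) tile of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale earlier accumulators
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max
    return out / row_sum[:, None]
```

The number of multiply-adds is the same as in ordinary self-attention, which matches the statement that the theoretical complexity is unchanged; what changes is that each step touches only a tile small enough to stay in fast memory.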
[0048] [Example 3] Figure 7 shows an example of a Conformer after replacement. As shown in Figure 7, the Conformer 30d of Example 3 has the Attention layer 31a of Conformer 30a replaced with the Identity layer 31d.
[0049] The Conformer has a CNN in each layer (repetition) and can learn short contextual information. In Example 3, it was assumed that the CNNs can learn the sequence information necessary for speech processing and that long contextual information is not needed, so the self-attention mechanism was removed from the Conformer (Conv. only). The Identity layer 31d is a layer that performs no operation. In other words, the input to the Identity layer 31d is passed directly to the Linear layer 32. That is, the model after replacement is the original model with the self-attention mechanism removed.
[0050] In Example 3, the computational load is reduced because the self-attention mechanism is eliminated.
[0051] [Effects of the first embodiment] As described in Examples 1-3, this embodiment enables efficient speech and language processing by reducing the computational load or accelerating the computation.
[0052] Furthermore, Examples 1-3 are applicable to the self-supervised model HuBERT. In this case, the model before replacement is HuBERT, a self-supervised model that incorporates a Conformer. For example, using Example 1, by replacing the self-attention mechanism of the HuBERT model with an FFT and performing training, it is possible to construct LongHuBERT, which can handle longer speech. Such LongHuBERT can then be used as a pre-trained model for various downstream tasks (e.g., speech recognition, translation, summarization).
[0053] Furthermore, conventional self-attention mechanisms have difficulty handling long speeches, so methods are sometimes employed that detect silent intervals in long speeches and divide them into shorter utterances for processing. However, this method can only be applied to monotonic tasks where speech input and recognition output are nearly synchronized, such as speech recognition. In addition, it has problems such as being affected by errors in estimating silent intervals and failing to consider the context of the entire utterance.
[0054] On the other hand, this embodiment allows long audio clips that previously had to be split to be processed all at once. This makes it possible to achieve more human-like audio processing that considers what a person is saying from beginning to end. Furthermore, this embodiment makes it possible to utilize long contextual information throughout the entire audio clip, which could not be considered in the past.
[0055] Specifically, according to the embodiment, it becomes possible to generate words or IDs with high consistency that take the entire speech into consideration.
[0056] The following describes an experiment related to this embodiment. Figure 8 shows the correct answer and the speech recognition result. Ground Truth is the correct answer. Utt. ASR is the speech recognition result at the utterance level. Doc. ASR is the speech recognition result at the document level. As shown in Figure 8, the speech recognition result at the utterance level varies even for the same word, whereas the speech recognition result at the document level can produce a consistent output.
[0057] For example, in Utt. ASR, "reignbeaux" is recognized as "rainboex" or "rainbowx." On the other hand, in Doc. ASR, "reignbeaux" is consistently recognized as "reign beaux."
[0058] Figure 9 shows the experimental results. Figure 9 shows the speech recognition results when each model was trained using only utterance-based speech data (Utt. Only) and both utterance-based and document-based speech data (Utt. & Doc.). The metric is the word error rate. Conformer, Flash, Fnet, and Conv. only correspond to the prior art, Example 2, Example 1, and Example 3, respectively.
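For reference, the word error rate is commonly computed as the word-level edit distance between the correct text and the recognition result, divided by the number of words in the correct text. The sketch below assumes whitespace tokenization and equal weights for substitutions, deletions, and insertions; the function name word_error_rate is hypothetical, and the scoring setup used in the experiment is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

As a made-up illustration, word_error_rate("reign beaux is here", "rainboex is here") gives 0.5: one substitution and one deletion against four reference words.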
[0059] As shown in Figure 9, document-based speech processing can improve performance even in tasks that do not traditionally require long contextual information, such as speech recognition.
[0060] The ability to process long audio in a single pass is also expected to improve performance in tasks that require long contextual information, such as speech summarization and intent estimation. Figure 10 is a diagram showing the results of the experiment. The table in Figure 10 shows an evaluation of the speech summarization task using METEOR (MTR) and BERTScore, which measure the degree of agreement with the correct text.
[0061] In the example shown in Figure 10, in addition to speech summarization using HuBERT, the evaluation was also compared with a Cascade summarization model that utilizes speech transcription and text summarization models. Oracle is a Cascade model that uses ground truth transcription, and ASR is a Cascade model that uses utterance-level transcription results. Because HuBERT uses a normal self-attention mechanism, it can only handle about 100 seconds of audio at a time, but LongHuBERT, to which this embodiment is applied, can summarize 20 minutes of audio, thus improving performance.
[0062] Next, Figure 11 shows the evaluation in the speech intent classification task. Figure 11 is a diagram showing the results of the experiment.
[0063] The IC in Figure 11 represents the speech intent comprehension rate. Here again, considering the entire document can significantly improve the speech intent comprehension rate (Doc. IC). Thus, this embodiment can improve the processing accuracy in speech language processing tasks that require long contextual information.
[0064] [System configuration, etc.] Furthermore, the components of each part shown in the diagram are functional concepts and do not necessarily need to be physically configured as shown. In other words, the specific forms of distribution and integration of each device are not limited to those shown in the diagram, and all or part of them can be functionally or physically distributed and integrated in any unit according to various loads and usage conditions. Moreover, all or any part of the processing functions performed by each device can be realized by a CPU and the program executed on that CPU, or by hardware using wired logic.
[0065] Furthermore, among the processes described in the embodiments described above, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above document and drawings can be arbitrarily changed unless otherwise specified.
[0066] [Program] The aforementioned learning device 10 can be implemented by installing a program (learning program) as packaged software or online software on a desired computer. For example, by having an information processing device run the above program, the information processing device can be made to function as the learning device 10. The information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as terminals such as PDAs (Personal Digital Assistants).
[0067] Figure 12 shows an example configuration of a computer running a learning program. Computer 1000 has, for example, memory 1010 and a CPU 1020. Computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.
[0068] Memory 1010 includes ROM (Read Only Memory) 1011 and RAM (Random Access Memory) 1012. ROM 1011 stores, for example, a boot program such as BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
[0069] The hard disk drive 1090 stores, for example, the OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, the programs that define each process executed by the learning device 10 are implemented as program modules 1093 in which executable code is written. The program modules 1093 are stored, for example, in the hard disk drive 1090. For example, a program module 1093 for executing a process similar to the functional configuration of the learning device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
[0070] Furthermore, the data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, memory 1010 or hard disk drive 1090. The CPU 1020 then reads the program module 1093 and program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as needed and executes them.
[0071] Furthermore, the program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
Explanation of Symbols
[0072]
10 Learning device
11 Communication unit
12 Input unit
13 Output unit
14 Storage unit
15 Control unit
20 Model
21 Input layer
22 Encoder
30a, 30b, 30c, 30d Conformer
31a Attention layer
31c Flash Attn layer
31d Identity layer
32 Linear layer
33 Convolution layer
34 MLP layer
35, 312b LayerNorm layer
141 Model information
151 Calculation unit
152 Update unit
221, 224 Feedforward layer
222 Self-attention mechanism
223 CNN layer
225 Layer Normalization layer
311b FFT layer
Claims
1. A speech processing device characterized by having a calculation unit that performs calculations related to speech using a second model obtained by replacing or deleting a self-attention mechanism included in a first model that takes speech data as input.
2. The speech processing device according to claim 1, characterized in that the first model has a Conformer including a self-attention mechanism and a convolutional layer.
3. The speech processing device according to claim 2, characterized in that the first model is HuBERT, which is a self-supervised model that incorporates the Conformer.
4. The speech processing device according to any one of claims 1 to 3, characterized in that the second model is a model in which the self-attention mechanism of the first model is replaced with an FFT.
5. The second model is characterized in that the self-attention mechanism of the first model is replaced with a mechanism that divides the data to be processed into sizes that can be exchanged between the processor and memory of the device at one time, and processes each divided data.
6. The speech processing device according to any one of claims 1 to 3, characterized in that the second model is a model obtained by removing the self-attention mechanism from the first model.
7. A speech processing method performed by a speech processing device, characterized by including a calculation step of performing calculations related to speech using a second model obtained by replacing or deleting a self-attention mechanism included in a first model that takes speech data as input.
8. A speech processing program characterized by causing a computer to execute a calculation step of performing calculations for a speech-related task using a second model obtained by replacing or deleting a self-attention mechanism included in a first model that takes speech data as input.