Artificial intelligence system that captures context through dilated self-attention

By combining a dilation mechanism with restricted self-attention, the system reduces the computational complexity of neural networks that process long input sequences while still producing fast and accurate output, which makes it suitable for applications such as machine translation, language modeling, and automatic speech recognition.

CN117043786B · Active · Publication Date: 2026-06-19 · MITSUBISHI ELECTRIC CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: MITSUBISHI ELECTRIC CORP
Filing Date: 2021-11-30
Publication Date: 2026-06-19

IPC: G06N3/045; G06N3/0455; G06N3/0464; G06N3/08; G10L15/16; G06F40/58; G06N3/047; G06N3/048

AI Tagging

Application Domain

Natural language translation; Speech recognition

Abstract

An artificial intelligence (AI) system is disclosed. The AI system includes a processor that processes a sequence of input frames using a neural network. The neural network includes a dilated self-attention module trained to compute an output sequence by: transforming each input frame into a corresponding query frame, key frame, and value frame, which yields a sequence of key frames, a sequence of value frames, and a sequence of query frames in the same order; and performing an attention computation for each query frame with respect to a combination of (i) a portion of the key frame and value frame sequences that is restricted based on the position of the query frame and (ii) a dilation sequence of the key frames and a dilation sequence of the value frames. The dilation sequences are extracted by processing different frames of the key frame and value frame sequences with a predetermined extraction function. The processor then renders the output sequence.

Description

Technical Field

[0001] This disclosure generally relates to artificial intelligence (AI), and more specifically, to AI systems that capture context through dilated self-attention.

Background Technology

[0002] Today, attention mechanisms have become a central component in many neural network (NN) architectures used in various artificial intelligence (AI) applications, including machine translation, speech processing, language modeling, automatic speech recognition (ASR), and computer vision. Self-attention mechanisms are likewise widely used neural network components. A self-attention mechanism allows the inputs to interact with each other ("self") and to determine which inputs they should pay more attention to ("attention") in order to compute the output for a given task. The output of a neural network component that uses self-attention is an aggregation of these interactions.

[0003] Attention-based architectures (such as transformer architectures) have been successfully applied in various domains where all model components utilize attention. Increasing the number of model parameters allows for the use of deeper and wider architectures to further improve results. Attention-based architectures handle inputs of varying lengths (also known as "input sequence lengths"). Typically, the computational complexity of attention-based architectures depends on the input sequence length. Moreover, the computational complexity of self-attention mechanisms increases quadratically with increasing input sequence length. This can be problematic for applications such as, but not limited to, automatic speech recognition (ASR), where the input sequence length of utterances can be relatively long. The increased computational complexity of neural networks leads to lower processing performance, such as increased processing time, slower processing speed, and increased storage space.

[0004] To address the computational complexity issue in neural networks, restricted self-attention mechanisms can be used. However, a restricted self-attention mechanism ignores information that lies far from the current query frame. Therefore, the output of this mechanism may be degraded.

[0005] Therefore, a technical solution is needed to overcome the limitations mentioned above. More specifically, it needs to provide high-quality output while minimizing computational costs (time and space requirements).

Summary of the Invention

[0006] Some implementations are based on the understanding that an attention mechanism is a method of reading information from an input sequence using query frames (i.e., query vectors). In this mechanism, the input sequence acts as a memory. In a self-attention mechanism, the query frames are computed from the input sequence and are used to query information from that same sequence. In example implementations, the input sequence may correspond to a sequence of observation vectors (frames) extracted from a speech utterance containing a sequence of speech sound events. The self-attention mechanism can transform the frames of such an input sequence into a sequence of key frames, a sequence of value frames, and a sequence of query frames. In some implementations, the frames neighboring a query frame, which corresponds to a frame position in the input sequence, may belong to sound events similar to that of the query frame, so detailed information may be needed to identify their logical relationship with one or more of the key frames, value frames, and query frames. Distant information (such as frames in the input sequence far from the query frame) may instead be relevant for recognizing the context of the input sequence. Therefore, neighboring frames may have dependencies, while distant frames are relevant for tracing the context, which may require less detailed information.

[0007] In some example implementations, such as in machine translation or language modeling, individual words are represented by observation vectors in the input sequence. Nearby words in the input sequence are more likely to have dependencies, while only a few distant words or phrases may be relevant for tracing the semantic context and syntax of the sentence, which may require less detailed information.

[0008] In some other example implementations, in an automatic speech recognition (ASR) system, neighboring (or nearby) frames of a query frame may belong to the same phoneme, syllable, or word, so detailed information is required to identify their coherence. Long-range information, on the other hand, relates to the context of sounds and words in the utterance and to adaptation to speaker or recording characteristics, which typically requires less fine-grained information. In some implementations, a transformer-based neural network can be used in an end-to-end ASR system. The transformer-based neural network can be trained jointly with a frame-level classification objective function. In one example implementation, it is trained jointly with a connectionist temporal classification (CTC) objective. The transformer-based neural network can utilize both encoder-decoder attention and self-attention. Encoder-decoder attention uses a query vector based on the state of the decoder of the transformer-based neural network to control attention over a sequence of input values, which is the sequence of encoder neural network states. Both attention types in the transformer-based neural network can be based on a scaled dot-product attention mechanism. The CTC objective implemented in a transformer-based neural network can combine the advantages of label-synchronous and time-synchronous models, while also enabling streaming recognition in encoder-decoder-based ASR systems.

[0009] Some implementations are based on the understanding that a restricted self-attention mechanism excludes information that is far from the current query frame. The restricted self-attention mechanism allows attention to neighboring or nearby frames at high resolution. That is, in a restricted self-attention mechanism, the past and future contexts relative to the current query frame are limited to a predefined number of look-back and look-ahead frames. However, long-range information can still be useful for providing accurate results.
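
The restriction to a fixed number of look-back and look-ahead frames can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `restricted_window` and the parameter names `look_back` and `look_ahead` are hypothetical.

```python
def restricted_window(position, num_frames, look_back, look_ahead):
    """Return the indices of the key/value frames visible to one query.

    Hypothetical helper: the window covers `look_back` frames before and
    `look_ahead` frames after `position`, clipped to the sequence bounds.
    """
    start = max(0, position - look_back)
    stop = min(num_frames, position + look_ahead + 1)
    return list(range(start, stop))


# A query in the middle of a 10-frame sequence sees 4 frames.
print(restricted_window(5, 10, look_back=2, look_ahead=1))  # [3, 4, 5, 6]
# At the sequence boundaries the window is simply truncated.
print(restricted_window(0, 10, look_back=2, look_ahead=1))  # [0, 1]
```

Because the window size does not depend on the sequence length, the cost per query stays constant; everything outside the window is invisible to the query.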

[0010] Some implementations are based on the understanding that long-range information can be processed recursively to compute a summary frame covering all frames up to the current query frame, which determines the past (left) context of the query. In the recursive process, the summary frame is updated with new input frames as the query moves forward, until the last query frame has been processed. This iterative updating makes the long-range context less accurate as the recursion progresses, because the original information from the distant frames is attenuated in the summary frame. Furthermore, the recursive process cannot be parallelized to accelerate the computation of the summary frame.

[0011] To avoid this attenuation of long-range information and to obtain equal access to long-range information about the past (left context) and the future (right context) of the current query frame, some implementations aim to summarize the long-range context accurately without using recursive methods. To this end, some implementations provide a dilation mechanism in addition to restricted self-attention. The combination of the dilation mechanism and restricted self-attention is called dilated self-attention. In the dilation mechanism, both the value frame sequence and the key frame sequence derived from the input sequence are summarized and stored in a value dilation sequence and a key dilation sequence, respectively. The dilation mechanism can use parallel computation to compute all frames of the key dilation sequence and the value dilation sequence simultaneously. Compared to the key frame sequence and the value frame sequence, the key dilation sequence and the value dilation sequence can have a lower frame rate.

[0012] Dilated self-attention, which combines restricted self-attention with a dilation mechanism, therefore performs self-attention at full resolution on nearby frames within the look-ahead and look-back range of the query, and at reduced resolution on distant frames that may lie outside the restricted window. In some example implementations, the dilation mechanism subsamples or summarizes the key frame sequence and the value frame sequence derived from the input sequence. The summarized key frames and value frames are used as dilation sequences, which can have a frame rate lower than that of the input sequence. The key dilation sequence and the value dilation sequence can be appended to the restricted key frame sequence and the restricted value frame sequence generated by the restricted self-attention mechanism. In this way, the complete context of the input sequence is captured, partly at high (full) resolution and partly at lower resolution, which provides accurate self-attention output. Because the distant context is held in compressed form, processing input sequences consumes less memory and less computation time in applications such as machine translation, language modeling, and speech recognition.
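
The combination described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration under stated assumptions, not the patented implementation: mean-pooling is assumed as the extraction function, scaled dot-product scoring is assumed, and all function and parameter names (`dilated_self_attention`, `look_back`, `look_ahead`, `chunk_size`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(frames, chunk_size):
    """Average each chunk of `chunk_size` frames into one summary frame."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    return [[sum(col) / len(c) for col in zip(*c)] for c in chunks]


def dilated_self_attention(queries, keys, values, look_back, look_ahead, chunk_size):
    # Dilation sequences are computed once and shared by every query.
    dil_keys = mean_pool(keys, chunk_size)
    dil_values = mean_pool(values, chunk_size)
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for n, q in enumerate(queries):
        # Full-resolution context: frames inside the restricted window.
        start = max(0, n - look_back)
        stop = min(len(keys), n + look_ahead + 1)
        # Low-resolution context: dilation frames appended to the window.
        ctx_keys = keys[start:stop] + dil_keys
        ctx_values = values[start:stop] + dil_values
        weights = softmax([dot(q, k) / scale for k in ctx_keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, ctx_values)) for d in range(dim)]
        )
    return outputs


# Toy data: 8 frames with 2 features each.
queries = [[float(i), 1.0] for i in range(8)]
keys = [[float(i), 1.0] for i in range(8)]
values = [[1.0, 2.0] for _ in range(8)]
outputs = dilated_self_attention(queries, keys, values, look_back=1, look_ahead=1, chunk_size=4)
```

In this toy setting each query attends to at most three full-resolution frames plus two summary frames, rather than to all eight frames.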

[0013] Some implementations are based on the understanding that relevant information from the input sequence can be extracted or compressed within chunks of frames using different frame rate reduction methods, such as subsampling and pooling. Examples of frame rate reduction methods include, but are not limited to, mean-pooling, max-pooling, and attention-based pooling.
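
The simpler frame rate reduction methods named above can be sketched as follows. The helper names are hypothetical, and subsampling is assumed here to keep the first frame of each chunk, which the text does not specify.

```python
def chunk(frames, chunk_size):
    """Split a frame sequence into consecutive chunks (the last may be shorter)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def subsample(frames, chunk_size):
    """Keep one frame per chunk (assumption: the first frame)."""
    return [c[0] for c in chunk(frames, chunk_size)]


def mean_pool(frames, chunk_size):
    """Replace each chunk by the per-feature average of its frames."""
    return [[sum(col) / len(c) for col in zip(*c)] for c in chunk(frames, chunk_size)]


def max_pool(frames, chunk_size):
    """Replace each chunk by the per-feature maximum of its frames."""
    return [[max(col) for col in zip(*c)] for c in chunk(frames, chunk_size)]


frames = [[1.0, 8.0], [3.0, 6.0], [5.0, 4.0], [7.0, 2.0], [9.0, 0.0]]
print(subsample(frames, 2))  # [[1.0, 8.0], [5.0, 4.0], [9.0, 0.0]]
print(mean_pool(frames, 2))  # [[2.0, 7.0], [6.0, 3.0], [9.0, 0.0]]
print(max_pool(frames, 2))   # [[3.0, 8.0], [7.0, 4.0], [9.0, 0.0]]
```

Each method turns five input frames into three summary frames, so the summarized sequence has a lower frame rate than the input.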

[0014] In some implementations, attention-based pooling is used to extract or compress relevant information within chunks of frames. Attention-based pooling uses trained embedding vectors to obtain one or more query vectors, which are then used to compute a weighted average of the frames in the chunk.
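
One possible reading of attention-based pooling is sketched below: a single learned query vector scores the key frames of a chunk, and the resulting softmax weights average the value frames of that chunk. The scaled dot-product scoring, the use of a single query, and the names `attention_pool` and `learned_query` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(chunk_keys, chunk_values, learned_query):
    """Summarize one chunk as a weighted average of its value frames.

    `learned_query` stands in for a trained embedding vector; in a real
    network it would be a parameter updated during training.
    """
    scale = math.sqrt(len(learned_query))
    scores = [
        sum(q * k for q, k in zip(learned_query, key)) / scale for key in chunk_keys
    ]
    weights = softmax(scores)
    dim = len(chunk_values[0])
    return [sum(w * v[d] for w, v in zip(weights, chunk_values)) for d in range(dim)]


chunk_keys = [[1.0, 0.0], [0.0, 1.0]]
chunk_values = [[2.0, 4.0], [6.0, 8.0]]
# An all-zero query scores every frame equally, so pooling reduces to the mean.
uniform = attention_pool(chunk_keys, chunk_values, [0.0, 0.0])
# A query aligned with the first key shifts weight toward the first value frame.
focused = attention_pool(chunk_keys, chunk_values, [5.0, 0.0])
```

Unlike mean-pooling, the weights depend on the content of the chunk, so training can teach the summary to emphasize the most informative frames.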

[0015] In some implementations, block processing, subsampling, and compression techniques can be used to extract relevant information from the input sequence. In such methods, frames distant from the current query frame are processed at lower resolution, while frames neighboring the current query frame are processed at high (full) resolution. The distant and nearby information can be combined to obtain a compressed form of the input sequence. Different implementations use different predetermined extraction functions to extract information from all available or relevant key and value frames. These extraction functions use one or a combination of the extraction techniques mentioned above, so that restricted self-attention is complemented by the other useful information captured by dilated self-attention.

[0016] Furthermore, some implementations are based on the understanding that the computational complexity of self-attention grows quadratically with the length of the input sequence. Some implementations therefore aim to mitigate this quadratic growth. In these implementations, the computational cost of the restricted self-attention within dilated self-attention grows only linearly with the input sequence length. The computational cost of attending to the dilation sequence is M times smaller than that of self-attention over the entire sequence, where M denotes the chunk size of the subsampling or pooling operation.
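
As a rough back-of-the-envelope illustration of this scaling, the count below tallies query-key comparisons only. The symbols N (number of input frames) and R (number of frames in the restricted window, that is, look-back plus look-ahead plus the current frame) are introduced here for illustration and are assumptions; M is the chunk size from the text. The feature dimension and the cost of computing the dilation sequence itself are ignored.

```latex
\begin{align*}
C_{\text{full}} &\propto N^{2} \\
C_{\text{restricted}} &\propto N R \\
C_{\text{dilation}} &\propto N \left\lceil \frac{N}{M} \right\rceil \approx \frac{N^{2}}{M} \\
C_{\text{dilated}} &= C_{\text{restricted}} + C_{\text{dilation}}
  \propto N \left( R + \left\lceil \frac{N}{M} \right\rceil \right)
\end{align*}
```

For example, with N = 1000, R = 21, and M = 20, this count gives 1000 × (21 + 50) = 71,000 comparisons instead of 1,000,000 for full-sequence self-attention.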

[0017] Therefore, compared with full-sequence self-attention, the overall complexity of dilated self-attention is significantly smaller, while the complete context of the input sequence is still captured at different resolutions.

[0018] Accordingly, one embodiment discloses an artificial intelligence (AI) system for jointly interpreting inputs by exploring the interdependencies of the inputs in an input sequence, the AI system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the AI system to: accept an input frame sequence; process the input frame sequence using a neural network including at least one dilated self-attention module trained to compute a corresponding output sequence from the input frame sequence by: transforming each input frame in the input frame sequence into a corresponding query frame, a corresponding key frame, and a corresponding value frame, resulting in a key frame sequence, a value frame sequence, and a query frame sequence having the same order; and performing attention computation for each query frame in the query frame sequence with respect to a combination of a portion of the key frame sequence and the value frame sequence, restricted based on the position of the query frame in the query frame sequence, with a dilation sequence of key frames and a dilation sequence of value frames, the dilation sequences being extracted by processing different frames in the key frame sequence and the value frame sequence using a predetermined extraction function; and render the output sequence.

[0019] The presently disclosed embodiments will be further explained with reference to the accompanying drawings. The drawings are not necessarily to scale; emphasis is instead generally placed on illustrating the principles of the presently disclosed embodiments.

Brief Description of the Drawings

[0020] [Figure 1]

[0021] Figure 1 is a block diagram illustrating a network environment for implementing an artificial intelligence (AI) system according to some embodiments of the present disclosure.

[0022] [Figure 2]

[0023] Figure 2 is a block diagram of the AI system illustrated in Figure 1, according to some embodiments of the present disclosure.

[0024] [Figure 3A]

[0025] Figure 3A is a graphical representation depicting the attention mechanism of a dilated self-attention module of the AI system according to an example embodiment of the present disclosure.

[0026] [Figure 3B]

[0027] Figure 3B is a graphical representation depicting the attention mechanism of a dilated self-attention module of the AI system according to another example embodiment of the present disclosure.

[0028] [Figure 4]

[0029] Figure 4 is a block diagram of a dilated self-attention module of the AI system according to some embodiments of the present disclosure.

[0030] [Figure 5]

[0031] Figure 5 is a block diagram illustrating a transformer-based neural network of the AI system according to some embodiments of the present disclosure.

[0032] [Figure 6A]

[0033] Figure 6A is a graphical representation of a compressed set of key and value frames of an input sequence according to an example embodiment of the present disclosure.

[0034] [Figure 6B]

[0035] Figure 6B is a graphical representation of a compressed set of key and value frames of an input sequence according to another example embodiment of the present disclosure.

[0036] [Figure 6C]

[0037] Figure 6C illustrates the output sequence of the AI system according to some embodiments of the present disclosure.

[0038] [Figure 7A]

[0039] Figure 7A illustrates attention-based pooling according to some embodiments of the present disclosure.

[0040] [Figure 7B]

[0041] Figure 7B illustrates attention-based dilation with post-processing according to some embodiments of the present disclosure.

[0042] [Figure 7C]

[0043] Figure 7C illustrates multi-resolution dilated self-attention via attention-based pooling according to some embodiments of the present disclosure.

[0044] [Figure 8A]

[0045] Figure 8A is a block diagram of the AI system in an automatic speech recognition (ASR) system according to some embodiments of the present disclosure.

[0046] [Figure 8B]

[0047] Figure 8B is a block diagram of the AI system in an automatic machine translation (AMT) system according to some embodiments of the present disclosure.

[0048] [Figure 9]

[0049] Figure 9 illustrates example scenarios for implementing the AI system according to some other example embodiments of the present disclosure.

[0050] [Figure 10]

[0051] Figure 10 is a general block diagram of the AI system according to some example embodiments of the present disclosure.

Detailed Description

[0052] In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of this disclosure. However, those skilled in the art will understand that this disclosure can be practiced without requiring these specific details. In other instances, apparatus and methods are shown only in block diagram form to avoid obscuring the disclosure.

[0053] As used in this specification and claims, the terms “for example,” “e.g.,” and “such as,” as well as the verbs “comprising,” “having,” “including,” and other verb forms thereof, when used in conjunction with a list of one or more components or other items, are interpreted as open-ended, meaning that the list is not considered to exclude other additional components or items. The term “based on” means at least partially based on. Furthermore, it should be understood that the language and terminology used herein are for descriptive purposes and should not be construed as limiting. Any headings used in this specification are merely for convenience and have no legal or restrictive effect.

[0054] Figure 1 is a block diagram illustrating a network environment 100 for implementing an artificial intelligence (AI) system 102 according to some embodiments of the present disclosure. The network environment 100 is depicted as including a user 106 associated with a user device 108. In an example scenario, the user 106 provides input, such as input 110, to the user device 108. The user device 108 may receive input 110 as an acoustic signal or as spoken speech. The user device 108 may include applications, such as automatic speech recognition (ASR) or automatic machine translation (AMT) applications, hosted by a server 104. Input 110 may be provided to the server 104 via a network 116. The server 104 may be configured to process input 110 to perform various operations (such as operations related to ASR and AMT applications). In an example embodiment, the user 106 may provide input 110 as audio input (also referred to as audio input 110) related to a technical problem that a technology solution provider 112 can solve. The technology solution provider 112 may include, but is not limited to, human representatives, virtual bots, and interactive voice response (IVR) systems. The server 104 receives audio input 110 from the user device 108 via the network 116 and transmits the audio input to the technology solution provider 112.

[0055] Furthermore, network 116 may include appropriate logic, circuitry, and interfaces configured to provide multiple network ports and multiple communication channels for sending and receiving data. Each network port may correspond to a virtual address (or physical machine address) used for sending and receiving communication data. For example, the virtual address may be an Internet Protocol version 4 (IPv4) or IPv6 address, while the physical address may be a Media Access Control (MAC) address. Network 116 may be associated with an application layer used to implement communication protocols based on one or more communication requests from at least one of the one or more communication devices. Communication data may be sent or received via these communication protocols. Examples of such wired and wireless communication protocols may include, but are not limited to, Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, Infrared (IR), IEEE 802.11, IEEE 802.16, cellular communication protocols, and/or Bluetooth (BT) communication protocols.

[0056] Examples of network 116 may include, but are not limited to, wireless channels, wired channels, and combinations of wireless and wired channels. Wireless or wired channels may be associated with a network standard defined by one of the following: Local Area Network (LAN), Personal Area Network (PAN), Wireless Local Area Network (WLAN), Wide Area Network (WAN), Wireless Wide Area Network (WWAN), Long Term Evolution (LTE) network, Plain Old Telephone Service (POTS), and Metropolitan Area Network (MAN). Additionally, wired channels can be selected based on bandwidth criteria. For example, fiber optic channels can be used for high-bandwidth communication, while coaxial cable-based or Ethernet-based communication channels can be used for medium-bandwidth communication.

[0057] In some implementations, the audio input 110 can be very long. In this case, the computational complexity of the server 104 can be very high. Therefore, the server 104 may not be able to process the audio input 110 accurately and / or in a timely manner, which may result in inaccurate output. Moreover, processing very long audio input 110 may take time, leading to delayed responses to user input. Furthermore, the server may also suffer from backlog because it spends more time processing very long audio input 110.

[0058] Therefore, AI system 102 can be used to generate high-quality output for audio input 110 at low computational cost, as described next with reference to Figure 2.

[0059] Figure 2 is a block diagram of the AI system 102 illustrated in Figure 1, according to some embodiments of the present disclosure. The AI system 102 jointly interprets inputs by exploring the interdependencies between the inputs in an input sequence. The AI system 102 includes a processor 202, a memory 204, and an input/output (I/O) interface 210. The memory 204 stores a neural network 206 that includes a dilated self-attention module 208. In some embodiments, the neural network 206 may include multiple layers of dilated self-attention modules.

[0060] In an example implementation, I/O interface 210 is configured to receive an input sequence, which may correspond to an audio input (such as audio input 110) having a time dimension. Furthermore, processor 202 is configured to execute instructions stored in memory 204. Execution of the stored instructions causes AI system 102 to accept an input frame sequence that represents an ordered sequence of features describing information about an input signal (such as audio input 110). The input frame sequence is then processed using neural network 206, which includes the dilated self-attention module 208 trained to compute a corresponding output sequence from the input frame sequence.

[0061] Some implementations are based on the following approach: the input signal may include a sequence of input frames, which is transformed into a key sequence, a value sequence, and a query sequence. Each query frame in the query sequence is searched against the key sequence to compute the relation of each key frame to that query frame. Each key frame is associated with a value frame that encodes features of the corresponding input frame. The estimated relations of the key frames to the query frame are used to assign a weighting factor to each value frame, and the weighted average of the value frame sequence is the output for the query search. For example, if each input frame in the input frame sequence corresponds to a word in a word sequence (i.e., a sentence), the estimated relations of the key frames to the query frame represent the relation of the word associated with that query frame to all other words in the sentence.
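
The query search described above can be sketched as plain full-sequence self-attention. This is a minimal single-head illustration that assumes scaled dot-product scoring with a softmax; the projection matrices `w_q`, `w_k`, and `w_v` stand in for learned parameters, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(matrix, frame):
    """Multiply a (rows x cols) matrix by a frame of length cols."""
    return [dot(row, frame) for row in matrix]


def self_attention(frames, w_q, w_k, w_v):
    """Every query frame attends to every key frame of the same sequence."""
    queries = [project(w_q, x) for x in frames]
    keys = [project(w_k, x) for x in frames]
    values = [project(w_v, x) for x in frames]
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Relation of each key frame to this query, turned into weights.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Output for this query: weighted average of the value frames.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(frames, identity, identity, identity)
```

Each query is compared with every key, so the number of comparisons grows with the square of the sequence length.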

[0062] Furthermore, some implementations are based on the understanding that the relations of queries to keys, and of keys to values, are differentiable. That is, as the network learns, the attention mechanism can learn to reshape the relationship between a search word and the words providing context.

[0063] Therefore, in some embodiments, the processor 202 processes the input sequence via the neural network 206 through the following steps: transforming each input frame in the input frame sequence into a corresponding query frame, a corresponding key frame, and a corresponding value frame, thereby resulting in a key frame sequence, a value frame sequence, and a query frame sequence having the same order. In some embodiments, the position of the query frame in the query frame sequence corresponds to its position in the key frame sequence and the value frame sequence.

[0064] Furthermore, the processor 202 performs, via the neural network 206, attention computation for each query frame in the query frame sequence with respect to a combination of a portion of the key frame sequence and the value frame sequence with the key dilation sequence and the value dilation sequence. The portion of the key frame sequence and the value frame sequence is determined based on the position of the query frame within the query frame sequence.

[0065] To this end, the dilated self-attention module 208 is trained to compute the output sequence from the input frame sequence based on learned transformations and a sequence of attention computations. The attention computations map the value frame sequence to an output using the current query frame and the key frame sequence. In some example implementations, the dilated self-attention module 208 provides an attention mechanism that reads information from the input sequence using the current query frame, which corresponds to a query vector.

[0066] The dilated self-attention module 208 further compares different query frames in the query frame sequence with different representations of the key frame sequence. These comparisons generate different weight distributions over the different representations of the value frame sequence. The weight distributions are used to calculate weighted averages of the different representations of the value frames, and these weighted averages form the output sequence. The different representations of the key frame sequence and the value frame sequence are formed by combining subsequences of the key frames and value frames with compressed or subsampled sequences of the key frames and value frames. In some embodiments, the subsequences of the key frames and value frames are selected based on the position of the current query frame.

[0067] In some implementations, the frames neighboring the current query frame in the input sequence can be used to provide information related to the current query frame (hereinafter referred to as "association information"). Association information from neighboring frames may include elements that are similar or identical to elements of the current query frame. In such cases, detailed information may be required to identify the logical relationship between the elements of the association information and the elements of the current query frame. Therefore, some implementations are based on the understanding that frames adjacent to the current query frame are more likely to have dependencies.

[0068] Furthermore, frames in the input sequence that are far from the current query frame can provide long-range information relevant for recognizing the context of the input sequence. Therefore, some implementations are based on the understanding that distant frames may be relevant for tracing the context used to interpret the input sequence.

[0069] For example, in speech recognition, neighboring frames of the current query frame may correspond to the same phoneme, syllable, or word. Long-range information can be related to the context of recognizing sounds and words in the utterance, as well as adapting to speaker or recording characteristics, which typically requires less fine-grained information. In machine translation, neighboring words of the current query frame are more likely to have dependencies, while only a few long-range words or phrases may be related to the semantic context and grammar of the sentence being tracked, which may require less detailed information.

[0070] To determine the context and syntax of the input sequence, the AI system 102 generates a dilation sequence of the key frames and a dilation sequence of the value frames. For this purpose, the processor 202 performs non-recursive sequence compression of the key frame and value frame sequences. Specifically, the processor 202 uses the dilated self-attention module 208 to process the entire time dimension of the input frame sequence, including the key frame and value frame sequences, at once. In this case, the output for each frame is independent of the outputs for the other frames. The original information carried by each frame is therefore preserved, which helps determine the context and syntax of the input sequence accurately.

[0071] Furthermore, in some implementations, non-recursive sequence compression of the key frame and value frame sequences is achieved by applying an extraction function (e.g., a compression technique) to all frames of the key frame and value frame sequences in parallel. This reduces the computational complexity of self-attention processing and allows self-attention to process a sequence at different attention resolutions. The expansion mechanism can therefore efficiently summarize the features of individual frames in the key frame and value frame sequences and reduce the computational complexity of the neural network 206. A neural network 206 trained with this expansion mechanism generates the output sequence at low computational cost. The processing speed of the processor 202 using this neural network therefore increases, which shortens the response time of the AI system 102. Furthermore, the processor 202 presents the output sequence via the I/O interface 210.

[0072] In an embodiment where the neural network 206 includes multiple neural network layers with expanded self-attention modules, the expansion mechanism is executed independently for each expanded self-attention module at each layer of the neural network.

[0073] In another embodiment, the expanded self-attention module 208 applies multiple expansion mechanisms sequentially to generate multiple expansion sequences for keys and values in a processing pipeline. Specifically, the processor 202 sequentially generates a first expansion sequence for the key frames and a first expansion sequence for the value frames, followed by a second expansion sequence for the key frames and a second expansion sequence for the value frames. In this case, the first expansion sequences of key frames and value frames, produced by a first expansion mechanism with a first block size, form the input of a second expansion mechanism with a second block size. The second expansion mechanism produces the second expansion sequence of key frames from the first expansion sequence of key frames, and the second expansion sequence of value frames from the first expansion sequence of value frames. In this way, expansion sequences with different frame rates (i.e., different resolutions) can be obtained.
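
As an illustration only, the pipeline of two expansion mechanisms described in paragraph [0073] may be sketched in Python as follows. Mean pooling is used here merely as a stand-in for the extraction function, and the function names `mean_pool_blocks` and `cascaded_expansion` as well as the toy key frames are hypothetical.

```python
def mean_pool_blocks(frames, block_size):
    """Compress a frame sequence by averaging non-overlapping blocks.

    frames: list of equal-length lists of floats (one inner list per frame).
    A short final block is zero-padded to block_size before averaging.
    """
    dim = len(frames[0])
    pooled = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        block = block + [[0.0] * dim] * (block_size - len(block))
        pooled.append([sum(f[d] for f in block) / block_size for d in range(dim)])
    return pooled


def cascaded_expansion(frames, first_block_size, second_block_size):
    """Return two expansion sequences; the second is computed from the first."""
    first = mean_pool_blocks(frames, first_block_size)
    second = mean_pool_blocks(first, second_block_size)
    return first, second


# Eight toy key frames of dimension 2 (assumed values).
keys = [[float(n), 1.0] for n in range(8)]
first_keys, second_keys = cascaded_expansion(keys, 2, 2)
print(len(keys), len(first_keys), len(second_keys))
```

In this sketch the first expansion sequence has one frame per two input frames and the second has one frame per four input frames, which corresponds to two different resolutions. The same function would be applied separately to the value frames.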

[0074] Figure 3A is a schematic representation 300 illustrating the principles of the attention mechanism used in some example implementations. For example, the expanded self-attention module 208 can use the attention mechanism to read information from the input sequence based on the current query frame, such as query 302. In the example implementation, the attention mechanism of the expanded self-attention module 208 first transforms the input of the source sequence 308 into key frames and value frames. The key frames and value frames may include key frames 304A, 304B, 304C, and 304D (also referred to as keys 304A to 304D) and corresponding value frames 306A, 306B, 306C, and 306D (also referred to as values 306A to 306D). The source sequence 308 may correspond to a feature sequence derived from the audio input 110.

[0075] In the example implementation, the expanded self-attention module 208 determines the similarity between query 302 and each of the keys 304A to 304D. This similarity is used to calculate the attention score for each of the values 306A to 306D.

[0076] In some example implementations, the attention scores can be normalized by a softmax function to compute an attention weight distribution. For this purpose, the expanded self-attention module 208 of the neural network 206 uses a softmax function that maps its unnormalized scores to a probability distribution over the value frame sequence. The softmax function turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but softmax transforms them into values between 0 and 1 so that they can be interpreted as probabilities. The input to the softmax function can therefore be the dot-product scores between query 302 and keys 304A to 304D, which are used to determine the attention scores. The corresponding values 306A, 306B, 306C, and 306D are weighted by the normalized attention scores; that is, each of the values 306A, 306B, 306C, and 306D is multiplied by its normalized attention score. The weighted values 306A to 306D are then summed. The expanded self-attention module 208 determines the output vector, such as the attention value 310, as the sum of the weighted values 306A to 306D. The attention scores for each of the corresponding values 306A, 306B, 306C, and 306D are further described with reference to Figure 3B.
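
The normalization described in paragraph [0076] may be illustrated by the following minimal Python sketch. The score values are hypothetical and only serve to show that arbitrary real inputs are mapped to values between 0 and 1 that sum to 1.

```python
import math


def softmax(scores):
    """Map K real-valued scores to K probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical dot-product scores between query 302 and keys 304A to 304D.
scores = [2.0, -1.0, 0.0, 3.5]
weights = softmax(scores)
print(weights, sum(weights))
```

The largest score receives the largest weight, and the ordering of the scores is preserved in the weights.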

[0077] Figure 3B is a graphical representation 312 depicting the attention mechanism used by the expanded self-attention module 208 of the AI system 102 according to another exemplary embodiment of the present disclosure. In an exemplary scenario, query 302 is selected by the AI system 102 from input sequence 314. Input sequence 314 includes input frames. In the exemplary embodiment, each input frame may correspond to a word in input sequence 314, such as words 314A (w1), 314B (w2), 314C (w3), and 314D (w4). Input sequence 314 corresponds to source sequence 308. Input word 314C (w3) is selected as query 302. Furthermore, the AI system 102 generates a set of keys 316A, 316B, 316C, and 316D (also referred to as key sequence 316A to 316D) and a corresponding set of values 320A, 320B, 320C, and 320D (also referred to as value sequence 320A to 320D) for input sequence 314. The key sequence 316A to 316D and the value sequence 320A to 320D correspond to the keys 304A to 304D and the values 306A to 306D.

[0078] The expanded self-attention module 208 determines attention scores for the value sequence 320A to 320D. Specifically, attention score 318A is determined for value 320A, attention score 318B for value 320B, attention score 318C for value 320C, and attention score 318D for value 320D. The self-attention mechanism provided by the expanded self-attention module 208 is further described with reference to Figure 4.

[0079] Figure 4 is a block diagram of the expanded self-attention module 208 of the AI system 102 according to an example embodiment of the present disclosure. In one embodiment, the expanded self-attention module 208 includes a self-attention layer configured to perform expanded self-attention. Alternatively, in some embodiments, the neural network 206 may include multiple layers of expanded self-attention modules, wherein each layer may correspond to the expanded self-attention module 208. For example, in one embodiment, the expanded self-attention module 208 includes an expanded self-attention layer 402 and a feedforward neural network (FFN) module 404. In other embodiments, the self-attention module has different combinations of self-attention layers, residual layers, feedforward layers, and other task-specific layers. The expanded self-attention layer 402 learns pairwise information relationships. For example, the expanded self-attention layer 402 learns logical relationships among the input frames of the source sequence 308 for applications such as automatic speech recognition (ASR) and automatic machine translation (AMT). The expanded self-attention layer 402 and the feedforward neural network module 404 are followed by "add & normalize" layers 403 and 405, respectively. The "add & normalize" layer 403 first adds the input of the expanded self-attention layer 402 to its output using a residual connection and then applies layer normalization. Similarly, the "add & normalize" layer 405 first adds the input of the feedforward neural network module 404 to its output and then applies layer normalization.

[0080] In the example scenario, the expanded self-attention layer 402 receives an input sequence $S \in \mathbb{R}^{L \times C}$, where $L$ and $C$ denote the sequence length and the frame (vector) dimension. The expanded self-attention layer 402 transforms $S$ via linear transformations into a key sequence ($K$), a query sequence ($Q$), and a value sequence ($V$). The sequence of self-attention outputs (such as attention value 310) is computed as a weighted sum of the value sequence $V$, where the attention weights are derived by comparing each query frame of the sequence $Q$ with each key frame of the key sequence $K$. For the query sequence $Q$, the key sequence $K$, and the value sequence $V$, the attention output (e.g., attention value 310) can be calculated using scaled dot-product attention as shown in equation (1):

[0081] $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$, (1)

[0082] where $Q \in \mathbb{R}^{n_q \times d_q}$, $K \in \mathbb{R}^{n_k \times d_k}$, and $V \in \mathbb{R}^{n_v \times d_v}$ are the queries, keys, and values, $d_{*}$ denotes a dimension, $n_{*}$ denotes a sequence length, $d_q = d_k$, and $n_k = n_v$.
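
For illustration, scaled dot-product attention for whole sequences may be sketched in Python as follows. The sketch assumes the widely used form $\mathrm{Softmax}(QK^{\top}/\sqrt{d_k})\,V$ for equation (1), represents matrices as lists of lists, and uses the hypothetical function name `scaled_dot_product_attention`.

```python
import math


def scaled_dot_product_attention(Q, K, V):
    """Return one output frame per query frame (assumed form of equation (1)).

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value frames.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


# A zero query scores all keys equally, so the output is the mean of the values.
print(scaled_dot_product_attention([[0.0, 0.0]],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 2.0], [3.0, 4.0]]))
```

Because every query frame is compared with every key frame, the cost of this computation grows with the product of the two sequence lengths, which motivates the restricted and expanded variants described later.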

[0083] In the example application, input sequence 314 corresponds to the word embedding sequence of the sentence "She eats green apples". A lookup table, trained together with the other neural network modules, is used to transform the individual words in input sequence 314 (i.e., "she", "eats", "green", and "apples") into embedding vectors. In the self-attention mechanism used by the exemplary implementation of the expanded self-attention module 208, the embedding vectors are multiplied by matrices to create a query 302, a key (e.g., keys 316A to 316D), and a value (e.g., values 320A to 320D) for each of the word embeddings 314A to 314D. Attention scores (e.g., attention scores 318A to 318D) are calculated by taking the dot product of the query 302 with each key of the key sequence 316A to 316D of the input sequence 314. For example, the attention scores for the first word "she" are calculated by comparing it with all words in the sentence, using the dot product between the query corresponding to the word "she" and the keys 316A to 316D of all words. After the attention scores are normalized with the softmax function so that they sum to 1, the estimated attention weights may be "she": 0.05, "eats": 0.8, "green": 0.05, "apples": 0.1, where the number after each colon is the weight of the respective word. These weights are applied to the value sequence 320A to 320D, and the weighted values are then summed to form the output vector. Similarly, the outputs for the remaining words in the input sequence (i.e., "eats", "green", and "apples") are calculated by comparing the corresponding query with all keys 316A to 316D of the sentence and summing the value sequence 320A to 320D with the corresponding weights. In this way, the expanded self-attention module 208 transforms the input sequence 314 into the output sequence using the query frame sequence, the key frame sequence 316A to 316D, and the value frame sequence 320A to 320D. The expanded self-attention module 208 compares different query frames of the query frame sequence with different representations of the key frames and value frames to generate the output sequence.

[0084] To combine restricted self-attention with expanded self-attention, in some implementations, different representations of key frames and value frames are formed using subsets of key frames 316A to 316D and corresponding value frames 320A to 320D. The subsets of key frames and value frames can be selected based on the position of the current query frame 302. Additionally, an expansion mechanism can be applied to the key frame sequence and value frame sequence to compute expanded key frame sequences and expanded value frame sequences.

[0085] In some implementations, the expanded self-attention layer 402 corresponds to the multi-head attention used in the transformer-based neural network of the AI system 102, which is described below with reference to Figure 5.

[0086] Figure 5 is a block diagram illustrating a transformer-based neural network 500 of the AI system 102 according to some embodiments of the present disclosure. In some example embodiments, the transformer-based neural network 500 may use an attention-based encoder-decoder neural network, such as encoder 502 and decoder 504. In such an attention-based encoder-decoder neural network, the decoder state can be used as a query (e.g., query 302) that controls the attention to the encoder state sequence of encoder 502. The encoder state sequence may correspond to the output sequence of encoder 502. The transformer-based neural network 500 may also use the expanded self-attention module 208. In some embodiments, the transformer-based neural network 500 may include multiple expanded self-attention modules. In some example embodiments, the transformer-based neural network 500 uses both encoder-decoder attention, such as between encoder 502 and decoder 504, and expanded self-attention. The attention computation for both may be based on scaled dot-product attention, computed using equation (1) as described above with reference to Figure 4. Furthermore, the transformer-based neural network 500 may include multiple layers of expanded self-attention neural network modules.

[0087] In some example implementations, the transformer-based neural network 500 uses a multi-head attention mechanism, wherein

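[0088] $\mathrm{MHA}(\hat{Q}, \hat{K}, \hat{V}) = \mathrm{Concat}_f(\mathrm{Head}_1, \dots, \mathrm{Head}_{d_h})\, W^H$ (2)

[0089] $\mathrm{Head}_i = \mathrm{Attention}(\hat{Q} W_i^Q, \hat{K} W_i^K, \hat{V} W_i^V)$ (3)

[0090] where $\hat{Q}$, $\hat{K}$, and $\hat{V}$ are the inputs to a multi-head attention (MHA) layer, such as multi-head attention 512 of encoder 502 and multi-head attention 526 and multi-head attention 530 of decoder 504. $\mathrm{Head}_i$ denotes the output of the $i$-th attention head for a total of $d_h$ heads, and $W_i^Q \in \mathbb{R}^{d_{model} \times d_q}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, as well as $W^H \in \mathbb{R}^{d_h d_v \times d_{model}}$ are trainable weight matrices, typically with $d_k = d_v = d_{model}/d_h$, and $\mathrm{Concat}_f$ denotes concatenation along the feature dimension of size $d_v$.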
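
The multi-head mechanism may be sketched in Python as follows. This is a minimal sketch that assumes the standard multi-head formulation, in which each head applies scaled dot-product attention to linearly projected inputs and the concatenated head outputs are projected once more. The function names and the toy weight matrices are hypothetical, and trained weights would be used in practice.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out


def multi_head_attention(Q_hat, K_hat, V_hat, W_Q, W_K, W_V, W_H):
    """W_Q, W_K, and W_V hold one projection matrix per head; W_H mixes the heads."""
    heads = [attention(matmul(Q_hat, wq), matmul(K_hat, wk), matmul(V_hat, wv))
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the head outputs frame by frame along the feature dimension.
    concat = [sum((head[n] for head in heads), []) for n in range(len(Q_hat))]
    return matmul(concat, W_H)


# Toy self-attention setup: d_model = 2, two heads, d_k = d_v = 1 (assumed values).
X = [[1.0, 2.0], [3.0, 4.0]]
W_Q = [[[0.0], [0.0]], [[0.0], [0.0]]]   # zero queries give uniform attention weights
W_K = [[[1.0], [0.0]], [[0.0], [1.0]]]
W_V = [[[1.0], [0.0]], [[0.0], [1.0]]]   # head 1 reads feature 0, head 2 reads feature 1
W_H = [[1.0, 0.0], [0.0, 1.0]]
print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_H))
```

With these toy weights every output frame equals the mean of the input frames, which makes the behavior easy to check by hand.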

[0091] In some example implementations, encoder 502 may include a two-layer convolutional neural network (CNN) module ENCCNN (included in 508) and a stack of self-attention modules 511 (ENCSA) or of expanded self-attention modules 208:

[0092] $X_0 = \mathrm{ENCCNN}(X) + \mathrm{PE}$ (4)

[0093] $X_E = \mathrm{ENCSA}(X_0)$ (5)

[0094] where PE is a sinusoidal positional encoding, and $X = (x_1, \dots, x_T)$ denotes the input sequence 314, for example acoustic input features such as 80-dimensional log-mel spectral energies plus additional features for pitch information. Both CNN layers of ENCCNN can use a stride of 2, a kernel size of 3×3, and the ReLU activation function. Striding therefore reduces the frame rate of the output sequence $X_0$ by a factor of 4 compared with the frame rate of the feature sequence $X$. The ENCSA module of equation (5) consists of $E$ layers, where the $e$-th layer (for $e = 1, \dots, E$) is a composite of a multi-head expanded self-attention layer (e.g., multi-head attention 512) and a feedforward neural network layer $\mathrm{FF}$ (e.g., feedforward layer 516):

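[0095] $X'_e = \mathrm{Norm}\left(X_{e-1} + \mathrm{MHA}_e(X_{e-1}, X_{e-1}, X_{e-1})\right)$, (6)

[0096] $X_e = \mathrm{Norm}\left(X'_e + \mathrm{FF}_e(X'_e)\right)$, (7)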

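[0097] where Norm (shown as 514) denotes layer normalization. In some example implementations, the feedforward neural network consists of two linear neural network layers with inner dimension $d_{ff}$ and outer dimension $d_{model}$, which can be separated by a rectified linear unit (ReLU) activation function as follows:

[0098] $\mathrm{FF}_e(X) = \mathrm{ReLU}(X W_{e,1} + b_{e,1})\, W_{e,2} + b_{e,2}$ (8)

[0099] where $X$ is the input of the feedforward layer, and $W_{e,1} \in \mathbb{R}^{d_{model} \times d_{ff}}$, $W_{e,2} \in \mathbb{R}^{d_{ff} \times d_{model}}$, $b_{e,1} \in \mathbb{R}^{d_{ff}}$, as well as $b_{e,2} \in \mathbb{R}^{d_{model}}$ are trainable weight matrices and bias vectors. The transformer-based neural network 500 can provide a transformer objective function, which is defined as: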

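[0100] $p_{att}(Y \mid X_E) = \prod_{l=1}^{L} p(y_l \mid y_{1:l-1}, X_E)$ (9)

[0101] where $Y = (y_1, \dots, y_L)$ is the label sequence, $y_{1:l-1} = (y_1, \dots, y_{l-1})$ is the label subsequence, and $X_E$ is the encoder output sequence. The term $p(y_l \mid y_{1:l-1}, X_E)$ represents the transformer decoder model, which can be written as: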

[0102] , (10)

[0103] And for d = 1, ..., D,

[0104] , (11)

[0105] , (12)

[0106] , (13)

[0107] , (14)

[0108] where $D$ denotes the number of decoder layers in decoder 504.

[0109] The embedding function converts the input label sequence into a sequence of trainable embedding vectors, where the input label sequence begins with a start-of-sentence symbol. PE denotes the positional encoding. The decoder function finally predicts the posterior probability of the label by applying a fully connected neural network to the output of the last decoder layer, followed by a softmax distribution over that output.

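[0110] Positional encodings 510 and 524 are sinusoidal positional encodings (PE) of dimension $d_{model}$ that are added to the input sequences of encoder 502 and decoder 504, which have the same dimension, and can be written as:

[0111] $\mathrm{PE}(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right)$, (15)

[0112] $\mathrm{PE}(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right)$, (16)

[0113] where $pos$ and $i$ are the position and dimension indices of these sequences.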
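
A sinusoidal positional encoding of this kind may be computed as in the following Python sketch. The sketch assumes the commonly used sine and cosine form with base 10000 and 0-based positions. The function name `sinusoidal_positional_encoding` is hypothetical.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of positional encodings.

    Even feature indices hold sine terms and odd indices hold cosine terms.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            i = j // 2  # index of the sine/cosine pair
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_positions=3, d_model=4)
for row in pe:
    print(row)
```

The table is added element by element to a sequence of the same dimension, so no trainable parameters are needed for the position information.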

[0114] In some implementations, the transformer-based neural network 500 can be trained jointly with a frame-wise classification objective function, such as the connectionist temporal classification (CTC) loss. The CTC objective function is:

[0115] $p_{ctc}(Y \mid X_E) = \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi \mid X_E)$, (17)

[0116] where $\mathcal{B}^{-1}$ denotes a one-to-many mapping that expands the label sequence $Y$ to the set of all frame-level label sequences using the CTC transition rules (e.g., transitions between labels and insertion of blank labels), and $\pi$ represents a frame-level label sequence. Finally, the multi-objective loss function is given by the following equation:

[0117] $\mathcal{L} = -\gamma \log p_{ctc} - (1 - \gamma) \log p_{att}$ (18)

[0118] This loss function $\mathcal{L}$ is used for training, where the hyperparameter $\gamma$ controls the weighting between the two objective functions, namely the CTC objective $p_{ctc}$ and the transformer objective $p_{att}$.
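
The weighting between the two objectives may be illustrated by the following Python sketch. It assumes that the multi-objective loss interpolates the negative log-likelihoods of the CTC objective and the transformer objective with a single weight between 0 and 1. The function name `multi_objective_loss` and the probability values are hypothetical.

```python
import math


def multi_objective_loss(p_ctc, p_att, gamma):
    """Interpolate the negative log-likelihoods of the two objectives.

    p_ctc, p_att: sequence probabilities in (0, 1]; gamma: weight in [0, 1].
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is expected to lie in [0, 1]")
    return -gamma * math.log(p_ctc) - (1.0 - gamma) * math.log(p_att)


# Hypothetical probabilities assigned to the reference label sequence.
print(multi_objective_loss(p_ctc=0.5, p_att=0.25, gamma=0.3))
```

Setting the weight to 1 trains with the CTC objective only, and setting it to 0 trains with the transformer objective only.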

[0119] Furthermore, each multi-head attention layer and each feedforward layer in encoder 502 and decoder 504 is followed by an "add & normalize" layer, which first adds the input of the corresponding layer to its output using a residual connection and then applies layer normalization. For example, multi-head attention 512 is connected to feedforward layer 516 via "add & normalize" layer 514. Similarly, multi-head attention 530 is connected to feedforward layer 534 via "add & normalize" layer 532. Feedforward layer 516 applies two linear transformations to the output of "add & normalize" layer 514, where the linear transformations are separated by an activation function (e.g., a rectified linear unit (ReLU)). The output of feedforward layer 516 is sent through another "add & normalize" layer, which again applies a residual connection to the output, followed by layer normalization. Encoder layers 512, 514, 516, and 518 are repeated E times (without sharing parameters) before the output of the last encoder layer 518 is passed to the multi-head attention layer 530 of decoder 504. As an additional input, the multi-head attention layer 530 receives the previous decoder output token 520, which is used to compute the decoder state by processing it through layers 522, 524, 526, and 528. Layer 522 converts the previous output token 520 into an embedding vector, which is fed to the multi-head attention layer 526 after positional encoding is added in layer 524. The output of layer 526 is further processed by an "add & normalize" layer, as discussed previously. The output of the multi-head attention layer 530 is provided to the feedforward layer 534 via the "add & normalize" layer 532. The output of the feedforward layer 534 is further processed by another "add & normalize" layer 536. Decoder layers 526, 528, 530, 532, 534, and 536 are applied in this order D times (without sharing parameters), where the output of layer 536 is fed as input to layer 526 after the first application. Finally, after the decoder layers have been applied D times, the output of layer 536 is forwarded to linear layer 538 (a fully connected neural network layer), which projects the output vector of decoder 504 to a score for each of the output tokens. The output of linear layer 538 is then fed to softmax layer 540, which converts the decoder scores into probabilities for each output token of decoder 504.

[0120] In some implementations, the self-attention module 511 of the transformer-based encoder neural network 500 is replaced with the expanded self-attention module 208. In such a configuration, expanded self-attention with multiple heads is performed instead of self-attention at each of the E transformer-based encoder layers, so as to perform self-attention at multiple resolutions and save computational cost.

[0121] Some implementations are based on the understanding that long-range information relative to the current query frame (e.g., query 302) can be useful in providing accurate results. To this end, the expanded self-attention module 208 provides a self-attention mechanism that allows attention to neighboring frames of the current query frame as well as to long-range information relevant to accurately capturing the context of the input sequence 314. In some implementations, the expansion mechanism of the expanded self-attention module 208 can summarize the long-range information, including the relevant information. Some implementations are based on the understanding that relevant information from the input sequence can be extracted or compressed within blocks of frames, as further explained with reference to Figures 6A to 6C.

[0122] Figure 6A is a graphical representation 600 depicting an input frame sequence 602 for self-attention according to an exemplary embodiment of the present disclosure. Input frame sequence 602 may correspond to input sequence 314. In some exemplary embodiments, a query frame, such as the current query frame 604, is obtained from the input sequence. In full-sequence self-attention 606, attention connections are allowed to all neighboring frames of the current query frame 604, as shown in Figure 6A. However, connecting the current query frame 604 to all neighboring frames increases computational complexity. Therefore, restricted self-attention 608 can be used to reduce computational complexity.

[0123] In restricted self-attention 608, neighboring frames surrounding the current query frame 604 are used for self-attention. These neighboring frames may correspond to past and future contextual information relative to the current query frame 604. In some embodiments, the expanded self-attention module 208 may be configured to execute a selection function that selects a subset of input frames (e.g., neighboring frames of the current query frame 604) from the input sequence 602 based on the position of the current query frame 604 to form part of the representation of the input sequence 602. The selection function accepts the position of the input as a parameter and returns neighboring frames 610A and 610B in the input sequence 602. The selection function may also accept values for a look-ahead size and a look-back size, which define the window around the current query frame from which neighboring frames 610A and 610B are selected. In some embodiments, the window may be a time-restricted window. The selection function may restrict the use of neighboring frames of the full-resolution input sequence 602. In some embodiments, the selection function may correspond to a restricted window 610 for selecting a subset of input frames. In an example implementation, the selected subset of input frames may correspond to a fixed number of look-back frames 610A and look-ahead frames 610B. Look-back frames 610A may include the past (left) context of the input sequence 602 relative to the query 604, while look-ahead frames 610B may include the future (right) context.
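
A selection function of the kind described in paragraph [0123] may be sketched in Python as follows. The sketch assumes 0-based positions, a window that includes the query frame itself, and clipping of the window at the ends of the sequence. The function name `select_restricted_window` is hypothetical.

```python
def select_restricted_window(frames, position, look_back, look_ahead):
    """Return the frames inside the restricted window around `position`.

    look_back frames to the left and look_ahead frames to the right of the
    query position are kept; the window is clipped at the sequence boundaries.
    """
    start = max(0, position - look_back)
    stop = min(len(frames), position + look_ahead + 1)
    return frames[start:stop]


frames = list(range(10))  # frame indices stand in for frame vectors
print(select_restricted_window(frames, position=5, look_back=2, look_ahead=1))
```

With a fixed window the number of frames compared with each query no longer depends on the length of the input sequence.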

[0124] However, the restricted self-attention 608 excludes long-range information relative to the current query frame 604, which may degrade the results. Therefore, the restricted self-attention 608 can be combined with an expansion mechanism to provide expanded self-attention, as described below with reference to Figure 6B.

[0125] Figure 6B is a graphical representation 612 depicting compression of the input sequence 602 according to another exemplary embodiment of this disclosure. In some embodiments, the expanded self-attention module 208 provides an expansion mechanism 612 in combination with the restricted self-attention 608.

[0126] In the expansion mechanism 612, the input sequence 602 can be summarized to form compressed sequences of key frames and value frames (e.g., of key sequences 316A to 316D and value sequences 320A to 320D). For example, the key frame sequence and the value frame sequence of the input sequence 602 can be divided, by at least one processor (such as processor 202), into a key block sequence and a value block sequence, such as blocks 616A, 616B, 616C, 616D, 616E, 616F, 616G, and 616H (also referred to as blocks 616A to 616H), wherein each key block includes multiple key frames and each value block includes multiple value frames. In some example embodiments, the expanded self-attention module 208 is configured to divide the key frame and value frame sequences at a predetermined frequency. Furthermore, each of the blocks 616A to 616H is summarized into one of the expansion frames 618A, 618B, 618C, 618D, 618E, 618F, 618G, and 618H (also referred to as summaries 618A to 618H). The expansion frames 618A to 618H provide a compressed form (expansion sequence 620) of the set of keys 316A to 316D and corresponding values 320A to 320D. In an example implementation, the expansion mechanism at the $e$-th encoder layer first splits the keys $K_i = (k^i_1, \dots, k^i_N)$ and the values $V_i = (v^i_1, \dots, v^i_N)$, each of length $N$ (see equation (3)), into $L = \lceil N/M \rceil$ non-overlapping key blocks $C^K_{i,l}$ and value blocks $C^V_{i,l}$, each of length $M$, such that

[0127] for $l = 1, \dots, L$,

[0128] $C^K_{i,l} = (k^i_{M(l-1)+1}, \dots, k^i_{M(l-1)+M})$,

[0129] $C^V_{i,l} = (v^i_{M(l-1)+1}, \dots, v^i_{M(l-1)+M})$,

[0130] where $i = 1, \dots, d_h$ indexes the attention heads, and the last blocks $C^K_{i,L}$ and $C^V_{i,L}$ are padded with zeros if they have fewer than $M$ frames.
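
The division into non-overlapping blocks with zero padding of the last block may be sketched in Python as follows. The function name `split_into_blocks` and the toy frames are hypothetical, and frames are represented as lists of floats.

```python
import math


def split_into_blocks(frames, block_size):
    """Split N frames into ceil(N / M) non-overlapping blocks of M frames.

    The last block is padded with all-zero frames if it is shorter than M.
    """
    dim = len(frames[0])
    num_blocks = math.ceil(len(frames) / block_size)
    blocks = []
    for index in range(num_blocks):
        block = [list(f) for f in frames[index * block_size:(index + 1) * block_size]]
        while len(block) < block_size:
            block.append([0.0] * dim)
        blocks.append(block)
    return blocks


# Five toy key frames of dimension 1, split into blocks of two frames.
key_frames = [[1.0], [2.0], [3.0], [4.0], [5.0]]
print(split_into_blocks(key_frames, block_size=2))
```

The same splitting is applied to the value frames so that key blocks and value blocks stay aligned.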

[0131] Furthermore, the at least one processor applies a predetermined function to each key block of the key block sequence and each value block of the value block sequence, so as to compress the multiple key frames of a key block into a smaller predetermined number of key frames of the same dimension for the key frame expansion sequence, and to compress the multiple value frames of a value block into a smaller predetermined number of value frames of the same dimension for the value frame expansion sequence. In some embodiments, at least some of the key blocks and at least some of the value blocks are compressed simultaneously using parallel computing, thereby achieving a high processing speed of the processor.

[0132] Examples of predefined functions include, but are not limited to, sampling functions, average pooling functions (also known as mean pooling), max pooling functions, attention-based pooling, and pooling based on convolutional neural networks (CNNs), or a combination thereof.

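[0133] More specifically, subsampling or pooling techniques, which serve as the predetermined functions, are applied to each block to generate expansion sequences $\Delta^K_i$ and $\Delta^V_i$, which are appended to the restricted key frame sequence and the restricted value frame sequence, respectively, by modifying equation (3) as follows:

[0134] $\mathrm{Head}_{i,n,e} = \mathrm{Attention}(x^{e-1}_n W_i^Q, \bar{K}_{i,n,e}, \bar{V}_{i,n,e})$

[0135] and, for $n = 1, \dots, N$,

[0136] $\bar{K}_{i,n,e} = \mathrm{Concat}_t(k^i_{n-\nu_{lb}:n+\nu_{la}}, \Delta^K_i)$,

[0137] $\bar{V}_{i,n,e} = \mathrm{Concat}_t(v^i_{n-\nu_{lb}:n+\nu_{la}}, \Delta^V_i)$,

[0138] where $x^{e-1}_n$ is the $n$-th frame of the input to the $e$-th encoder layer, $k^i_{n-\nu_{lb}:n+\nu_{la}}$ and $v^i_{n-\nu_{lb}:n+\nu_{la}}$ are the key frames and value frames of attention head $i$ inside the restricted window, $\nu_{lb}$ and $\nu_{la}$ denote the numbers of look-back and look-ahead frames of the time restriction (corresponding to a window size of $R = \nu_{lb} + \nu_{la} + 1$), and $\mathrm{Concat}_t$ denotes concatenation along the time dimension (frames).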
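
The combination of a restricted window with an expansion sequence may be sketched for a single attention head in Python as follows. The sketch assumes scaled dot-product attention, 0-based positions, and expansion sequences that have already been computed. The function names and toy values are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for a single query frame."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]


def expanded_self_attention(queries, keys, values, delta_k, delta_v,
                            look_back, look_ahead):
    """Attend to the frames in the restricted window plus the expansion frames."""
    outputs = []
    for n, query in enumerate(queries):
        start = max(0, n - look_back)
        stop = min(len(keys), n + look_ahead + 1)
        k_bar = keys[start:stop] + delta_k    # concatenation along the time dimension
        v_bar = values[start:stop] + delta_v
        outputs.append(attend(query, k_bar, v_bar))
    return outputs


# Toy example: zero queries and keys give uniform weights over the attended frames.
queries = [[0.0]] * 4
keys = [[0.0]] * 4
values = [[0.0], [1.0], [2.0], [3.0]]
delta_k, delta_v = [[0.0]], [[10.0]]
print(expanded_self_attention(queries, keys, values, delta_k, delta_v, 1, 0))
```

Each query is compared with at most the window size plus the length of the expansion sequence, rather than with all frames of the input sequence.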

[0139] In some implementations, a subsampling-based expansion mechanism selects the first frame of each block to form the expansion sequences for the key frames and the value frames. In an alternative implementation, a pooling method is applied to summarize the information content of each block, such as a sampling function, mean pooling (MP), max pooling, CNN-based pooling, attention-based pooling (AP), or attention-based pooling with post-processing (AP + PP).

[0140] In some implementations, a CNN-based expansion mechanism is applied, wherein CNN-based pooling applies convolutions with trained weights and a kernel size similar to the block size to the key frame sequence and the value frame sequence.

[0141] In some implementations, a max-pooling-based expansion mechanism is applied, wherein the max-pooling function selects a single key frame with the maximum energy from multiple key frames in a key block and a corresponding frame from multiple value frames in a value block.

[0142] In some implementations, an expansion mechanism based on a sampling function is applied, wherein the sampling function selects a single frame from multiple key frames in a key block and a corresponding frame from multiple value frames in a value block. In the subsampling function and the max-pooling function, a single key frame is selected from the key frame block, and a corresponding value frame is selected from the value frame block, ignoring information contained in other frames.

[0143] In some implementations, an expansion mechanism based on mean pooling is applied, wherein, for $l = 1, \dots, L$, the frames in each block of key frames and value frames are averaged into a mean vector according to the following formula:

[0144] $\mu^{[V,K]}_{i,l} = \frac{1}{M} \sum_{m=1}^{M} [v,k]^i_{M(l-1)+m}$

[0145] where the notation $[V, K]$ denotes the processing of either the value frames or the key frames, and $[v,k]^i_n$ denotes the $n$-th value frame or key frame of attention head $i$. This notation is also used in the formulas that follow. The resulting sequences of mean vectors for the key frame sequence and the value frame sequence are used to form the expansion sequences $\Delta^{[V,K]}_i = (\mu^{[V,K]}_{i,1}, \dots, \mu^{[V,K]}_{i,L})$.
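
The subsampling, mean pooling, and max pooling variants may be compared using the following Python sketch. The blocks are assumed to have been formed already, the energy of a frame is taken to be the sum of its squared elements (an assumption), and all function names and toy values are hypothetical.

```python
def subsample(block):
    """Keep only the first frame of a block."""
    return list(block[0])


def mean_pool(block):
    """Average all frames of a block into one mean vector."""
    return [sum(column) / len(block) for column in zip(*block)]


def max_pool_pair(key_block, value_block):
    """Pick the key frame with the largest energy and the value frame at the same index.

    Energy is taken to be the sum of squared elements (an assumption).
    """
    energies = [sum(x * x for x in frame) for frame in key_block]
    index = energies.index(max(energies))
    return list(key_block[index]), list(value_block[index])


def expansion_sequence(blocks, pool):
    """Apply a single-block pooling function to every block."""
    return [pool(block) for block in blocks]


key_blocks = [[[1.0, 0.0], [3.0, 4.0]], [[2.0, 2.0], [0.0, 0.0]]]
value_blocks = [[[10.0], [20.0]], [[30.0], [40.0]]]
print(expansion_sequence(key_blocks, mean_pool))
print(expansion_sequence(key_blocks, subsample))
print([max_pool_pair(k, v) for k, v in zip(key_blocks, value_blocks)])
```

Subsampling and max pooling keep one frame of each block and discard the others, whereas mean pooling lets every frame of the block contribute to the summary.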

[0146] In a preferred embodiment, attention-based pooling (AP) can be applied to summarize the information content of each block of key frames and value frames, as further described with reference to Figure 7A and Figure 7B.

[0147] Figure 6C illustrates the output sequence output by the AI system 102 according to some embodiments of this disclosure. Figure 6C is described with reference to Figure 6A and Figure 6B. The at least one processor (such as processor 202) computes an output sequence 622 by combining the portion of the key frame and value frame sequences within the restricted window 610 with the expansion sequence 620 (i.e., the key frame expansion sequence and the value frame expansion sequence). The expansion sequence is determined by non-recursive sequence compression of the key frame and value frame sequences, which reduces the computational complexity of the self-attention processing. The expansion sequence 620 corresponds to distant frames that add context for the query frame, enabling any system that includes the AI system 102 to provide accurate output with less processing time. Furthermore, the at least one processor presents the output sequence 622 via an output interface (such as the I/O interface 210).

[0148] Figure 7A illustrates attention-based pooling 700 according to some embodiments of the present disclosure.

[0149] In attention-based pooling (AP) 700, one or more trained query vectors (such as trained query vector 706) are used to determine multiple weight distributions 704A, 704B, and 704C by attending to key frame blocks or value frame blocks from the input sequence 702. Attention-based pooling thus assigns a relevance to the key frames in a key frame block or to the value frames in a value frame block to derive the weight distributions 704A, 704B, and 704C. Based on the multiple weight distributions 704A, 704B, and 704C, a weighted average of the key frame block and a weighted average of the value frame block are calculated.

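[0150] In an example implementation using attention-based pooling, embedding vectors (e.g., trained query vector 706) are learned and used to query summary information with the attention mechanism as follows:

[0151] $g^{[V,K]}_{i,b,l} = \mathrm{Attention}(\bar{q}_b, C^K_{i,l}, C^{[V,K]}_{i,l})$,

[0152] and, for $l = 1, \dots, L$, $\bar{g}^{[V,K]}_{i,l} = \frac{1}{B} \sum_{b=1}^{B} g^{[V,K]}_{i,b,l}$,

[0153] as well as $\bar{q}_b = \mathrm{Embed}(b)$,

[0154] where $\bar{q}_b$ denotes a query, $C^K_{i,l}$ and $C^{[V,K]}_{i,l}$ denote the $l$-th key block and the $l$-th value block or key block of attention head $i$, $\mathrm{Embed}(b)$ maps the attention head numbers $b = 1, \dots, B$ to trainable vectors of dimension $d_k$ (e.g., the trained query vector 706), and $B$ denotes the total number of attention heads. The attention outputs $g^{[V,K]}_{i,b,l}$ are averaged along the dimension $b$ to obtain $\bar{g}^{[V,K]}_{i,l}$, which forms the expansion sequence (e.g., expansion sequence 708) $\Delta^{[V,K]}_i = (\bar{g}^{[V,K]}_{i,1}, \dots, \bar{g}^{[V,K]}_{i,L})$.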

[0155] Specifically, a frame of the expansion sequence (such as expansion sequence 708) for the key frame sequence is calculated as a weighted average of the key frames of a key frame block. Additionally or alternatively, a frame of the expansion sequence (such as expansion sequence 708) for the value frame sequence is calculated as a weighted average of all value frames of a value frame block.
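
Attention-based pooling of one key block and one value block may be sketched in Python as follows. The sketch assumes that the weight distribution is obtained by comparing each trained query vector with the key frames of the block, that the same weights are applied to the key block and the value block, and that the results are averaged over the query vectors. The query vectors shown are fixed toy values that stand in for trained parameters, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores to a weight distribution that sums to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(key_block, value_block, trained_queries):
    """Summarize one key block and one value block into one frame each."""
    d_k = len(key_block[0])
    key_sum = [0.0] * d_k
    value_sum = [0.0] * len(value_block[0])
    for query in trained_queries:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in key_block]
        weights = softmax(scores)  # one weight distribution per query vector
        for w, key, value in zip(weights, key_block, value_block):
            key_sum = [s + w * x for s, x in zip(key_sum, key)]
            value_sum = [s + w * x for s, x in zip(value_sum, value)]
    count = len(trained_queries)
    return [s / count for s in key_sum], [s / count for s in value_sum]


key_block = [[1.0, 0.0], [3.0, 0.0]]
value_block = [[2.0], [6.0]]
print(attention_pool(key_block, value_block, trained_queries=[[0.0, 0.0]]))
print(attention_pool(key_block, value_block, trained_queries=[[50.0, 0.0]]))
```

A zero query vector reproduces mean pooling, while a query vector that strongly matches one key frame makes the summary approach that single frame, so the pooling behavior is determined by the trained query vectors.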

[0156] In some implementations, attention-based expansion is performed in conjunction with post-processing techniques.

[0157] Figure 7B illustrates attention-based expansion with post-processing 710 according to some embodiments of the present disclosure.

[0158] Some implementations are based on the idea that the application of post-processing techniques can improve the output of a system (e.g., AI system 102). To this end, output frames (such as multiple output frames 714) in the key block and the value block are processed according to post-processing rules to generate one or more frames 714 for the key frame expansion sequence and for the value frame expansion sequence.

[0159] To derive output frame 714 for the key frame expansion sequence and value frame expansion sequence, the post-processing rules include one or a combination of the following: saving the output frame 714 determined for the key block and the value block, combining the output frame 714 determined for the key block and the value block, and removing at least one output frame from the output frame 714 determined for the key block and the value block.

[0160] The post-processing rules can be adjusted for at least two different types of key blocks and value blocks, including: a first type of key block and value block, which is less than a threshold distance from the query frame when attention is applied; and a second type of key block and value block, which is equal to or greater than a threshold distance from the query frame when attention is applied.

[0161] In some other embodiments, the neural network (such as transformer-based neural network 500) includes multiple expansion mechanisms as part of an expanded self-attention module to generate at least two expansion sequences for key frames and value frames. In such a setup, at least one processor (such as processor 202) preserves those frames of the first expansion sequence for key frames and value frames that correspond to key blocks and value blocks whose frame distance relative to the query frame during attention computation is less than a predefined threshold. Furthermore, the at least one processor also preserves those frames of the second expansion sequence for key frames and value frames that correspond to key blocks and value blocks whose frame distance relative to the query frame during attention computation is greater than or equal to the predefined threshold.

[0162] In some implementations, at least one processor stores a plurality of output frames 714 determined for a first type of key block and value block; and removes at least one frame from the plurality of frames 714 in a second type of key block and value block.
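The distance-dependent rule can be illustrated with a short sketch. The names `apply_distance_rule` and `keep_far`, and the choice to measure the distance from a block's center frame to the query frame, are assumptions made for this example; the text does not specify how the block distance is measured or which frames are removed.

```python
import numpy as np

def apply_distance_rule(block_frames, block_centers, query_pos, threshold, keep_far=1):
    """Keep all output frames of near blocks; keep only `keep_far` frames of far blocks.

    block_frames: (L, B, d) output frames per block.
    block_centers: length-L sequence with the (assumed) center frame index of each block.
    Returns a list of (block_index, frame) pairs forming the expansion sequence.
    """
    selected = []
    for l, frames in enumerate(block_frames):
        near = abs(block_centers[l] - query_pos) < threshold   # first type: below the threshold
        kept = frames if near else frames[:keep_far]           # second type: drop surplus frames
        selected.extend((l, f) for f in kept)
    return selected

frames = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)   # L = 4 blocks, B = 2 frames each
sequence = apply_distance_rule(frames, [2, 6, 10, 14], query_pos=3, threshold=5)
```

In this example the two blocks nearest the query keep both of their output frames, while the two distant blocks contribute one frame each, so nearby context is represented at a higher resolution than distant context.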

[0163] In some alternative embodiments, the at least one processor stores the multiple frames 714 of a first type of key and value blocks, and combines the multiple frames of a second type of key and value blocks using one or a combination of average pooling and merging, through neural network processing, of the multiple output frames 714 determined for the key and value blocks. The neural network processing uses two linear transformations with trained parameters, separated by a non-linear activation function and arranged in a bottleneck structure, such that the first linear transformation projects the input to a smaller dimension, while the second linear transformation projects the output of the first linear transformation to the dimension of the query frame, key frame, and value frame.

[0164] For example, post-processing (PP) can be applied to further process the attention-based pooling output and to effectively merge the outputs of the multiple attention heads using a two-layer feedforward neural network with inner dimension d_in and outer dimension d_[v,k]:

[0165] [equation not reproduced in the source text],

[0166] and [equation not reproduced in the source text],

[0167] as well as [equation not reproduced in the source text],

[0168] where the symbols not reproduced here denote trainable weight matrices and bias vectors, and Concat_f denotes the concatenation of vectors along the feature dimension (for b = 1, ..., B). The post-processing results can be used to form the expansion sequence.
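The bottleneck post-processing can be sketched as follows. Because the equations are not available in this text, the sketch relies only on the prose description: the per-head outputs are concatenated along the feature dimension, projected down to d_in, passed through a non-linearity, and projected back to the frame dimension. The function name `post_process`, the ReLU activation, and the random weights are assumptions for illustration.

```python
import numpy as np

def post_process(a, W1, b1, W2, b2):
    """Merge B head outputs per block with a two-layer bottleneck feedforward network.

    a: (L, B, d) per-head pooling outputs; W1: (B*d, d_in); W2: (d_in, d).
    """
    L, B, d = a.shape
    concat = a.reshape(L, B * d)                 # concatenate heads along the feature dimension
    hidden = np.maximum(concat @ W1 + b1, 0.0)   # ReLU is an assumed choice of non-linearity
    return hidden @ W2 + b2                      # project back to the frame dimension d

rng = np.random.default_rng(0)
L, B, d, d_in = 3, 2, 4, 2                       # hypothetical sizes; d_in < B * d is the bottleneck
a = rng.normal(size=(L, B, d))
W1, b1 = rng.normal(size=(B * d, d_in)), np.zeros(d_in)
W2, b2 = rng.normal(size=(d_in, d)), np.zeros(d)
merged = post_process(a, W1, b1, W2, b2)         # one merged frame per block, shape (L, d)
```

Keeping d_in small limits the size of both weight matrices, which is the reason the bottleneck keeps the added cost of post-processing low.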

[0169] The combination of constrained self-attention and the expansion mechanism reduces the computational complexity of self-attention for long input sequences. The computational complexity estimate here is based on the number of floating-point multiplications in vector and matrix products, and is expressed using O(·) notation. For simplicity, scalar multiplications and additions are omitted from the estimate, as including these operations does not significantly change the relative complexity when comparing different methods. The computational complexity of self-attention based on the entire sequence is O(N^2 · d_model), where N denotes the length of the input sequence (such as input sequence 702) and d_model denotes the dimension of the attention model. The number of operations required for self-attention thus increases quadratically with the length of the input sequence. The complexity of constrained self-attention is O(N · R · d_model), where R is the size of the restricted window, which is usually significantly smaller than N. This results in fewer operations compared to full-sequence self-attention. The computational complexity of extended self-attention comprises the attention cost of restricted self-attention over the window together with the additional expansion sequence, plus the computational complexity of the expansion mechanism. The computational complexity of the attention-based pooling mechanism corresponds to the dot-product attention between the learned queries and the key blocks; the computed attention weights can be reused to summarize the value blocks. Post-processing adds a further cost (expression not reproduced in the source text) for processing the attention results of the key blocks and value blocks. To reduce this computational complexity, the feedforward neural network FF used in the post-processing stage can use a bottleneck with inner dimension d_in = 16.
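To make the comparison concrete, the sketch below counts multiplications for the three attention variants. It is a back-of-the-envelope estimate under assumptions: the function name `attention_multiplications` is hypothetical, the factor of 2 accounts for the score product and the weighted sum, the expansion sequence length L and all numeric values are illustrative, and the cost of the expansion mechanism itself is deliberately left out.

```python
def attention_multiplications(N, d_model, R=None, L=0):
    """Rough count of multiplications for the score and weighted-sum products.

    Each of the N queries attends to `span` frames; scores and weighted sums
    each cost span * d_model multiplications per query.
    R=None means full-sequence attention; L is the expansion sequence length.
    """
    span = N if R is None else min(R, N) + L
    return 2 * N * span * d_model

N, d_model, R, block_size = 1000, 256, 30, 20   # hypothetical values
L = N // block_size                             # one summary frame per block (assumed)

full = attention_multiplications(N, d_model)
restricted = attention_multiplications(N, d_model, R=R)
extended = attention_multiplications(N, d_model, R=R, L=L)
print(full, restricted, extended)
```

Under these assumed values the extended variant needs more multiplications than purely restricted attention, but remains more than an order of magnitude cheaper than full-sequence attention.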

[0170] Figure 7C illustrates an example of multi-resolution expanded self-attention 718 via attention-based pooling according to some embodiments of the present disclosure. Figure 7C is described with reference to Figure 7A and Figure 7B.

[0171] Performing attention-based pooling (as described above with reference to Figure 7A and Figure 7B) outputs multiple frames 720 (also referred to as output frames or expansion vectors). The output frames 720 are multi-resolution frames, where the resolution corresponding to each frame changes based on the position of the frame relative to the current query frame 702A within a time-constrained window 702B. The at least one processor is configured to analyze the output frames to generate an expansion sequence 720. To this end, the at least one processor is configured to determine the distance between each output frame and the current query frame 702A. If the distance is greater than a threshold, the corresponding output frame is interpreted as being far from the current query frame 702A, and the at least one processor is configured to discard it. Output frames 720 located at distances less than the threshold, on the other hand, are preserved. The expansion sequence 720 is generated in this way. By utilizing information from frames distant from the query frame together with the neighboring frames within the time-constrained window 702B, the context of the current query frame 702A is determined with lower complexity. In addition, the processor is configured to generate an output sequence based on the combination of the expansion sequence 720, the current query frame 702A, and the corresponding neighboring frames within the time-constrained window 702B.

[0172] Next, the implementation of the AI system 102 in ASR and AMT applications is described with reference to Figure 8A and Figure 8B.

[0173] Figure 8A is a block diagram 800 of the AI system 102 in an Automatic Speech Recognition (ASR) system 802 according to some embodiments of the present disclosure. The ASR system 802 also includes an input interface 804 and an output interface 806. The input interface 804 is configured to receive an acoustic signal representing at least a portion of a spoken utterance. The neural network 206 of the AI system 102 converts the acoustic signal into an input sequence. The neural network 206 also uses the extended self-attention module 208 to transform the input sequence into an output sequence. In some embodiments, multiple extended self-attention modules are used to transform the input sequence into the output sequence, wherein each extended self-attention module uses an expansion mechanism that can summarize long-range information, including relevant information. The transformed input sequence is converted into a transcription of the spoken utterance. The transcription of the spoken utterance is provided as output via the output interface 806.

[0174] Figure 8B is a block diagram 808 of the AI system 102 in an Automatic Machine Translation (AMT) system 810 according to some embodiments of the present disclosure. The AMT system 810 also includes an input interface 812 and an output interface 814. The input interface 812 is configured to receive an input signal representing spoken utterances in a first language. For example, the spoken utterances could be in English. The neural network 206 of the AI system 102 converts the input signal into an input sequence. The neural network 206 uses the extended self-attention module 208 to transform the input sequence into an output sequence. The transformed input sequence is converted into an output signal representing spoken utterances in a second language. For example, the second language could be German. The output signal representing spoken utterances in the second language is provided as output via the output interface 814.

[0175] Furthermore, AI systems (such as AI system 102) including extended self-attention modules (such as extended self-attention module 208) can be used in streaming applications. In such a scenario, input frames are received continuously. Using the extended self-attention module, a self-attention output with limited delay is generated for each of the input frames. To this end, the key expansion sequence and the value expansion sequence are extended whenever at least one new key frame block and one new value frame block have been received.
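The streaming behavior can be illustrated with a small buffer that extends the expansion sequences each time a complete block has arrived. This is a sketch under assumptions: the class name `StreamingExpansion` and its `push` method are hypothetical, and mean pooling stands in for whichever extraction function is actually used.

```python
import numpy as np

class StreamingExpansion:
    """Grow key/value expansion sequences as complete blocks arrive."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.pending_k, self.pending_v = [], []   # frames of the block still being filled
        self.k_exp, self.v_exp = [], []           # expansion sequences built so far

    def push(self, k_frame, v_frame):
        """Add one key/value frame; return True if the expansion sequences were extended."""
        self.pending_k.append(k_frame)
        self.pending_v.append(v_frame)
        if len(self.pending_k) < self.block_size:
            return False
        # A full block is available: summarize it (mean pooling is an assumed stand-in).
        self.k_exp.append(np.mean(self.pending_k, axis=0))
        self.v_exp.append(np.mean(self.pending_v, axis=0))
        self.pending_k, self.pending_v = [], []
        return True

stream = StreamingExpansion(block_size=3)
extended_flags = [stream.push(np.full(2, float(t)), np.full(2, float(t))) for t in range(7)]
```

Because a summary frame becomes available as soon as its block is complete, the delay introduced by the expansion mechanism is bounded by the block size rather than by the length of the whole input.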

[0176] Similarly, the disclosed AI system, including the extended self-attention module, can be implemented in various applications such as sound event detection, audio tagging systems, and sound source separation systems.

[0177] Figure 9 illustrates an exemplary scenario 900 for implementing the AI system 102 according to some other exemplary embodiments of the present disclosure. In the exemplary scenario, a user 902 may use a user device 904 to provide input. The user device 904 may include a smartphone, a tablet computer, a laptop computer, a smartwatch, a wearable device, a desktop device, or any other electronic device. The user 902 may request services from a digital assistant 906. The digital assistant 906 may include a virtual chatbot, an interactive voice response (IVR) system, etc. The user 902 may provide acoustic signals representing at least a portion of a spoken utterance to the ASR system 802 from the user device 904 via a network 116. The ASR system 802 may use the AI system 102 to provide a transcription of the spoken utterance. Furthermore, the transcription may be provided to the digital assistant 906. The digital assistant 906 may operate based on the received transcription and provide services to the user 902. For example, the service may correspond to hiring a vehicle 908. The vehicle 908 may include an autonomous vehicle, a manually driven vehicle, or a semi-autonomous vehicle. The vehicle 908 may be connected to the network 116. The transcription may include the pick-up and drop-off locations of the user 902. Furthermore, the user 902 may use the ASR system 802 to operate the vehicle 908. In some cases, the ASR system 802, the AMT system 810, or a combination thereof, implemented together with the AI system 102, may be used for operations related to the navigation system of the vehicle 908.

[0178] In some cases, AI system 102 can also provide speech-to-text documentation to user 902. For example, user 902 can provide speech to user device 904. User device 904 can communicate with AI system 102 to provide a transcription of the speech. AI system 102 can then provide a text document based on this transcription. This can help the user (e.g., user 902) write text or maintain documents via voice input.

[0179] In some other cases, user 902 may be traveling in a foreign country. User 902 may not be able to communicate with people in that foreign country in the corresponding foreign language. User 902 may rent vehicle 908, and the driver of vehicle 908 may not speak user 902's native language or may not know a common language for communication with user 902. In such cases, user 902 may provide input to user device 904 for machine translation of user 902's native language into the foreign language. This input may correspond to an input signal representing spoken utterances in the native language (e.g., English). The spoken utterances in the native language can be provided from user device 904 to the AMT system 810 via network 116. The AMT system 810 can use AI system 102 to translate the native language and provide the spoken utterances in the foreign language in a fast and efficient manner.

[0180] Figure 10 shows a block diagram of an AI system 1000 according to some embodiments of the present disclosure. The AI system 1000 corresponds to the AI system 102 of Figure 1. The AI system 1000 includes: an input interface 1002, a processor 1004, a memory 1006, a network interface controller (NIC) 1014, an output interface 1016, and a storage device 1020. The memory 1006 is configured to store a neural network 1008. The neural network 1008 includes an extended self-attention module 1010. In some example embodiments, the neural network 1008 has a transformer, conformer, or similar architecture that includes the extended self-attention module 1010 as part of an encoder, a decoder, or both.

[0181] The extended self-attention module 1010 is trained to transform an input sequence into a corresponding output sequence. The input sequence is transformed by comparing each input with different representations of the input sequence. When the processor 1004 executes instructions stored in the memory 1006, the extended self-attention module 1010 transforms an input from the input sequence by forming a representation of the input sequence. The representation is formed by combining a first part of the representation that depends on the position of the input in the input sequence with a second part of the representation that is independent of the position of the input. The first part varies with different inputs, while the second part is a compression of the input sequence that remains constant for all inputs from the input sequence. Furthermore, the input is transformed into a corresponding output by comparing it with the formed representation.

[0182] In some implementations, the extended self-attention module 1010 is configured to execute a selection function that selects a subset of the input sequence based on the position of the input to form a first part of a representation of the input sequence. The selection function accepts the position of the input as a parameter and returns a subset of the input sequence centered at the position of the input. The selection function may also accept the size of the input subset as another parameter.
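A selection function of this kind might look like the sketch below. The name `select_window` is hypothetical, the window size is assumed to be odd so that the subset is symmetric around the position, and clipping at the sequence boundaries is an assumption, since the text does not say how the ends of the sequence are handled.

```python
def select_window(sequence, position, size):
    """Return the subset of `sequence` centered at `position`, clipped at the sequence ends.

    `size` is the (odd) number of frames in an unclipped window.
    """
    half = size // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    return sequence[start:end]

frames = list(range(10))
middle = select_window(frames, position=4, size=5)   # two neighbors on each side
first = select_window(frames, position=0, size=5)    # clipped at the start of the sequence
```

The returned subset forms the position-dependent first part of the representation, while the compressed sequence supplies the position-independent second part.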

[0183] In some other embodiments, the extended self-attention module 1010 is configured to form compression by summarizing the input sequence using a mean pooling-based method. In a preferred embodiment, the summarization of the input sequence may use attention-based pooling with or without a post-processing stage.
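Mean pooling-based compression can be sketched in a few lines of NumPy. The function name `mean_pool_blocks` is hypothetical, and averaging a shorter final block over the frames it actually contains is an assumption for inputs whose length is not a multiple of the block size.

```python
import numpy as np

def mean_pool_blocks(X, block_size):
    """Compress X of shape (N, d) into ceil(N / block_size) frames by averaging each block."""
    starts = range(0, len(X), block_size)
    return np.stack([X[s:s + block_size].mean(axis=0) for s in starts])

X = np.arange(10, dtype=float).reshape(5, 2)   # five frames of dimension 2
compressed = mean_pool_blocks(X, block_size=2) # three summary frames
```

Mean pooling weights every frame in a block equally, whereas attention-based pooling learns which frames within a block matter most.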

[0184] The input interface 1002 is configured to accept input data 1024. In some embodiments, AI system 1000 uses NIC 1014 to receive input data 1024 via network 1022. In some cases, input data 1024 may be online data received via network 1022. In other cases, input data 1024 may be recorded data stored in storage device 1020. In some embodiments, storage device 1020 is configured to store training datasets for training neural network 1008.

[0185] In some example implementations, input data 1024 may include: an acoustic signal representing at least a portion of spoken utterance, an input signal representing spoken utterance in a first language, etc. The neural network 1008 can be configured to convert the acoustic signal into an input sequence, transform the input sequence into an output sequence using an extended self-attention module 1010, and convert the output sequence into a transcription of spoken utterance. The output of the transcription can be provided to the output device 1018 via an output interface 1016. Similarly, the neural network 1008 can be configured to convert the input signal into an input sequence, transform the input sequence into an output sequence using an extended self-attention module, and convert the output sequence into an output signal representing spoken utterance in a second language. The output signal can be provided to the output device 1018 via an output interface 1016.

[0186] Various embodiments of this disclosure provide an AI system, such as the AI system 1000, that provides extended self-attention. Extended self-attention improves the accuracy and modeling ability of constrained self-attention. Extended self-attention also helps reduce the computational complexity of self-attention for long input sequences. In this way, the computational cost and memory usage of speech processing systems (e.g., ASR system 802 and AMT system 810) grow non-quadratically with the length of the input sequence, thereby improving system efficiency in a feasible manner.

[0187] The following description provides only exemplary embodiments and is not intended to limit the scope, applicability, or configuration of this disclosure. Rather, the following description of exemplary embodiments will provide those skilled in the art with a description of what can be accomplished to implement one or more exemplary embodiments. Various changes to the function and arrangement of the elements are contemplated without departing from the spirit and scope of the disclosed subject matter set forth in the appended claims.

[0188] Specific details are set forth in the following description to provide a thorough understanding of the embodiments. However, those skilled in the art will understand that these embodiments can be practiced without these specific details. For example, systems, processes, and other elements of the disclosed subject matter may be shown as components in block diagram form to avoid obscuring these embodiments with unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail to avoid obscuring these embodiments. Moreover, the same reference numerals and designations in the various figures denote the same elements.

[0189] Furthermore, each implementation can be described as a process, depicted as a flowchart, flow diagram, data flow diagram, structure diagram, or block diagram. Although a flowchart can describe operations as sequential processes, many operations can be executed in parallel or simultaneously. Additionally, the order of operations can be rearranged. A process may terminate upon completion of its operations, but may have additional steps not discussed or included in the diagram. Moreover, not all operations in any particular described process may occur in all implementations. A process can correspond to a method, function, procedure, subroutine, subprogram, etc. When a process corresponds to a function, the termination of that function may correspond to the function returning to the calling function or the main function.

[0190] Furthermore, embodiments of the disclosed subject matter may be implemented, at least in part, either manually or automatically. Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments that perform the necessary tasks may be stored in a machine-readable medium. A processor may perform the necessary tasks.

[0191] The various methods or processes outlined herein can be encoded into software that can be executed on one or more processors employing any of a variety of operating systems or platforms. Furthermore, such software can be written using any of a variety of suitable programming languages and / or programming or scripting tools, and can also be compiled into executable machine language code or intermediate code that executes on a framework or virtual machine. In various implementations, the functionality of program modules can typically be combined or distributed as needed.

[0192] The embodiments of this disclosure can be embodied as methods for which examples have been provided. Actions performed as part of the method can be arranged in any suitable manner. Therefore, even though actions are shown as sequential in the exemplary embodiments, embodiments can be constructed that perform actions in a different order than illustrated, which may include performing some actions simultaneously. Furthermore, the use of ordinal terms such as "first" and "second" in the claims to modify claim elements does not by itself imply any priority, precedence, or order of one claim element over another, or the temporal order in which actions of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term), and thus to differentiate these claim elements.

[0193] Although this disclosure has been described with reference to specific preferred embodiments, it is to be understood that various other changes and modifications may be made within the spirit and scope of this disclosure. Therefore, the appended claims cover all such changes and modifications that fall within the true spirit and scope of this disclosure.

Claims

1. An artificial intelligence (AI) system for jointly interpreting inputs in an input sequence by exploring interdependencies of the inputs on each other, the AI system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the AI system to:
accept an input frame sequence;
process the input frame sequence using a neural network, the neural network including at least one extended self-attention module, the at least one extended self-attention module being trained to compute a corresponding output sequence based on the input frame sequence by: transforming each input frame in the input frame sequence into a corresponding query frame, a corresponding key frame, and a corresponding value frame, thereby resulting in a key frame sequence, a value frame sequence, and a query frame sequence having the same order; and performing attention calculations for individual query frames in the query frame sequence with respect to a portion of the key frame sequence and the value frame sequence that is constrained based on the position of the query frame in the query frame sequence, in combination with an expanded sequence of the key frames and an expanded sequence of the value frames, the expanded sequences of the key frames and the value frames being extracted by processing different frames in the key frame sequence and the value frame sequence using a predetermined extraction function; and
present the output sequence,
wherein the input frame sequence represents an ordered feature sequence describing information about audio input with a time dimension, and
wherein, in order to generate the expanded sequence of the key frames and the expanded sequence of the value frames through an expansion mechanism, the at least one processor is configured to: divide the key frame sequence and the value frame sequence into a key block sequence and a value block sequence, wherein each key block includes multiple key frames and each value block includes multiple value frames; and apply the predetermined extraction function to each key block in the key block sequence and to each value block in the value block sequence, so as to: compress the multiple key frames in the key block into a smaller predetermined number of key frames with the same dimension for the expansion sequence of the key frames; and compress the multiple value frames in the value block into a smaller predetermined number of value frames with the same dimension for the expansion sequence of the value frames.

2. The AI system of claim 1, wherein the processor is configured to use parallel computing to simultaneously compress at least some key blocks of the key block sequence and at least some value blocks of the value block sequence.

3. The AI system according to claim 1, wherein the predetermined extraction function is one or a combination of the following: a sampling function, an average pooling function, a max pooling function, attention-based pooling, and pooling based on a convolutional neural network (CNN), wherein:
the sampling function selects a single frame from the plurality of key frames in the key block and selects a corresponding frame from the plurality of value frames in the value block;
the average pooling function averages the elements of the multiple key frames in the key block and the elements of the multiple value frames in the value block;
the max pooling function selects the single key frame with the maximum energy from the plurality of key frames in the key block, and selects the corresponding frame from the plurality of value frames in the value block;
the attention-based pooling combines the multiple key frames in the key block and the multiple value frames in the value block according to weights determined by applying a trained query vector to the multiple key frames in the key block;
the CNN-based pooling applies convolutions with trained weights and kernel sizes similar to the block size to the key frame sequence and the value frame sequence;
the subsampling and the max pooling select a single key frame from the key frame block and the corresponding value frame from the value frame block, ignoring information contained in other frames;
the average pooling equally weights all key frames within the key frame block and all value frames within the value frame block; and
the attention-based pooling assigns relevance to the key frames in the key frame block or the value frames in the value frame block to obtain a weight distribution, and uses the weight distribution to calculate the weighted average of all the key frames for the key frame block and the weighted average of all the value frames for the value frame block.

4. The AI system of claim 1, wherein the predetermined extraction function is attention-based pooling, which uses a trained query vector to attend to each key block in the key block sequence to determine the weight distribution of the plurality of key frames in the key block, and uses weights selected according to the determined weight distribution to calculate the frames in the expanded sequence of the key frame sequence as a weighted average of the plurality of key frames in the key block.

5. The AI system according to claim 1, wherein the predetermined extraction function is attention-based pooling, which uses a trained query vector to attend to each key block in the key block sequence to determine the weight distribution of the plurality of key frames in the key block, and uses the same determined weight distribution to compute frames in the expanded sequence of the value frame sequence as a weighted average of the plurality of value frames in the value block.

6. The AI system according to claim 1, wherein the predetermined extraction function is attention-based pooling, wherein the attention-based pooling uses multiple trained query vectors to generate multiple weight distributions by attending to key frame blocks or value frame blocks, and uses the multiple weight distributions to compute multiple output frames corresponding to the weighted average of the key frame blocks and the weighted average of the value frame blocks,
wherein the processor is further configured to perform post-processing on the plurality of output frames in the key block and the plurality of output frames in the value block according to post-processing rules, to generate one or more frames for the expansion sequence of the key frames and the expansion sequence of the value frames,
wherein, in order to derive the expanded sequence for the key frames and the expanded sequence for the value frames, the post-processing rules include one or a combination of the following: saving the plurality of output frames determined for the key block and for the value block; combining the plurality of output frames determined for the key block and for the value block; and removing at least one output frame from the plurality of output frames determined for the key block and the value block, and
wherein the processor is further configured to adjust the post-processing rules for at least two different types of key blocks and value blocks, including: a first type of key blocks and value blocks, whose distance from the query frame under the attention calculation is less than a threshold; and a second type of key blocks and value blocks, whose distance from the query frame under the attention calculation is equal to or greater than the threshold.

7. The AI system of claim 6, wherein the processor is also configured to: save the plurality of frames determined for the key blocks and value blocks of the first type; and remove at least one frame from the plurality of frames in the key block and value block of the second type, and the neural network processing includes using two linear transformations with trained parameters separated by a nonlinear activation function and having a bottleneck structure, such that the first linear transformation projects the input to a smaller dimension, and the second linear transformation projects the output of the first linear transformation to the dimensions of the query frame, key frame, and value frame.

8. The AI system of claim 6, wherein the processor is also configured to: save the plurality of frames for the key blocks and value blocks of the first type; and combine, through neural network processing, the multiple frames in the key block and value block of the second type using one or a combination of average pooling and merging of the multiple output frames determined for the key block and the value block.

9. The AI system of claim 1, wherein the extended self-attention module is used in streaming applications where the input frame sequence is received sequentially, and, when at least one new key frame block and a new value frame block are generated, a self-attention output with finite delay is generated for each of the input frames by extending the extended sequences of the key frames and the value frames, wherein the key frames from the key frame sequence and the value frames from the value frame sequence are selected based on the position of the query frame in the query frame sequence.

10. The AI system of claim 1, wherein the extended self-attention module transforms the input frame sequence into the query frame sequence, the key frame sequence, and the value frame sequence, such that the attention calculation compares the query frame with a portion of the key frames from the key frame sequence and with the extended sequence transformed according to the key frame sequence to generate an output based on the sequence of value frames from the value frame sequence and the extended sequence of the value frames.

11. The AI system of claim 1, wherein the neural network includes a multi-layered extended self-attention module.

12. The AI system according to claim 1, wherein the processor is further configured to sequentially generate, in a processing pipeline, a first expansion sequence of the key frames and a first expansion sequence of the value frames via a first expansion mechanism, and a second expansion sequence of the key frames and a second expansion sequence of the value frames via a second expansion mechanism, wherein the first expansion sequences, having a first block size, serve as the input to the second expansion mechanism, having a second block size, so that the second expansion sequence of the key frames is produced from the first expansion sequence of the key frames, and the second expansion sequence of the value frames is produced from the first expansion sequence of the value frames.

13. The AI system according to claim 11, wherein the processor is also configured to: retain the frames of the first expansion sequence of the key frames and of the first expansion sequence of the value frames that correspond to key blocks and value blocks whose frame distance from the query frame, in the attention calculation, is less than a predefined threshold; and retain the frames of the second expansion sequence of the key frames and of the second expansion sequence of the value frames that correspond to key blocks and value blocks whose frame distance from the query frame, in the attention calculation, is greater than or equal to the predefined threshold.

14. The AI system of claim 1, wherein the neural network has a transformer or conformer architecture, which includes the extended self-attention module as part of an encoder, a decoder, or both.

15. The AI system according to claim 1, wherein the AI system comprises at least a part of one or a combination of an automatic speech recognition (ASR) system, a sound event detection system, an audio tagging system, a sound source separation system, and a machine translation system.
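
As a minimal sketch of the attention computation recited in claims 8 and 10, the following Python example lets each query frame attend to a position-restricted window of key and value frames combined with an average-pooled extended sequence. The function names, the `window` and `block_size` parameters, the use of scaled dot-product attention, and the choice of mean pooling over non-overlapping blocks are illustrative assumptions and are not taken from the claims.

```python
import math


def mean_pool(frames, block_size):
    """Average-pool non-overlapping blocks of frames into one summary frame each.

    Assumption: blocks are non-overlapping; the last block may be shorter.
    """
    pooled = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(f[d] for f in block) / len(block) for d in range(dim)])
    return pooled


def attend(query, keys, values):
    """Scaled dot-product attention of a single query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def extended_self_attention(queries, keys, values, window=1, block_size=2):
    """Attend to nearby frames plus pooled summaries of the full sequence.

    `window` (hypothetical) is the number of neighbouring frames on each side
    that form the position-restricted portion; `block_size` (hypothetical) is
    the number of frames summarised by each frame of the extended sequence.
    """
    key_summary = mean_pool(keys, block_size)
    value_summary = mean_pool(values, block_size)
    outputs = []
    for i, query in enumerate(queries):
        lo, hi = max(0, i - window), min(len(keys), i + window + 1)
        local_keys = keys[lo:hi] + key_summary        # restricted + extended keys
        local_values = values[lo:hi] + value_summary  # restricted + extended values
        outputs.append(attend(query, local_keys, local_values))
    return outputs
```

In this sketch each query compares against at most `2 * window + 1` nearby frames plus roughly `len(keys) / block_size` summary frames, rather than against every key frame. Applying `mean_pool` again to the first summary with a second block size would be one way to approximate the two-stage arrangement of claim 12, again as an assumption.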