A speech signal processing system and method based on a hierarchical attention-free model

By using a hierarchical attention-free model, combined with depthwise separable convolution and nonlinear gating units, the computational complexity of speech analysis is reduced, the efficiency problem of Transformer in long-term speech signal processing is solved, and real-time speech analysis on mobile devices is realized.

CN117894342B · Active · Publication Date: 2026-06-23 · HUNAN UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2024-01-23
Publication Date: 2026-06-23

Patent Timeline

23 Jan 2024: Application
23 Jun 2026: Publication (CN117894342B)

IPC: G10L25/66; G10L25/30

AI Tagging

Application Domain

Speech analysis

Technology Topics

Attention model; Engineering


AI Technical Summary

Technical Problem

In existing technologies, Transformer-based speech analysis methods have excessively high computational complexity and storage requirements when processing long-duration speech signals, resulting in slow system operation and making them difficult to deploy in real-time applications and resource-constrained environments.

Method used

A hierarchical attention-free model is adopted, including a speech preprocessing embedding module, a hierarchical attention-free module, and a merging unit. It uses depthwise separable convolution and residual connections, combined with nonlinear gating units, to reduce computational complexity and improve speech signal processing efficiency.

Benefits of technology

It achieves low computational complexity speech signal processing, enabling real-time analysis on mobile devices, and is suitable for efficient recognition of long-duration speech signals and applications in specific scenarios.


Abstract

The application provides a speech signal processing system and method based on a hierarchical attention-free model. The system comprises a speech preprocessing embedding module, which acquires a user's speech information and extracts a feature vector from it, and a hierarchical attention-free module comprising a plurality of attention-free layers, each layer comprising a plurality of AFFormer units. Each AFFormer unit comprises a token mixer and a channel mixer. The token mixer comprises a plurality of parallel depthwise separable convolution branches, which process the received feature vector to obtain the token information output. The channel mixer comprises a parallel nonlinear gating branch and linear branch, each of which processes the received token information; the optimal gating signal feature value is obtained from the processing results of the two branches, and the target information to be screened is obtained according to the gating signal feature value. The application can efficiently and accurately identify the speech features to be screened.


Description

Technical Field

[0001] This invention relates to the field of speech signal technology, and in particular to a speech signal processing system and method based on a hierarchical attention-free model.

Background Technology

[0002] Current speech analysis methods are mostly based on deep learning frameworks and use the powerful sequence modeling capabilities of the Transformer architecture to process speech signals. However, the self-attention mechanism of the Transformer causes computational and storage complexity to grow quadratically with the length of the speech sequence, which increases computational cost and model size, slows the system, and prolongs response time. As a result, such methods handle long speech signals poorly. Furthermore, coping with the increased computational and storage complexity requires higher-performance hardware, which limits deployment in real-time applications and resource-constrained environments. Designing a speech signal processing method that computes and stores more efficiently while maintaining detection performance, so that real-time speech analysis can run on various mobile devices, is therefore of significant practical value.

Summary of the Invention

[0003] The technical problem to be solved by this invention is: in view of the technical problems existing in the prior art, this invention provides a speech signal processing system and method based on a hierarchical attention-free model with low computational complexity and simple structure, so as to achieve efficient and accurate recognition of speech signals and practical application in specific scenarios.

[0004] To solve the above-mentioned technical problems, the technical solution proposed by this invention is as follows:

[0005] A speech signal processing system based on a hierarchical attention-free model includes:

[0006] The speech preprocessing embedding module is used to acquire the user's speech information and extract the feature vector of the user's speech information;

[0007] The hierarchical attention-free module includes multiple attention-free layers. Each layer includes multiple AFFormer units. Each AFFormer unit includes a token mixer and a channel mixer. The token mixer includes multiple parallel depthwise separable convolutional branches, which process the received feature vectors; the processing results of each depthwise separable convolutional branch are superimposed with the feature vectors to obtain the token information output. The channel mixer includes a parallel nonlinear gating branch and target information linear branch, which each process the received token information. The optimal gating signal feature value is obtained based on the processing results of the two branches, and the target information to be filtered is obtained based on the gating signal feature value.

[0008] Furthermore, the speech preprocessing embedding module is connected to a projection module, which is used to obtain a low-dimensional representation of the feature vector.

[0009] Furthermore, the hierarchical attention-free module also includes a merging unit located upstream of the AFFormer unit. The merging unit receives the output of the projection module and performs downsampling processing on the output of the projection module using a specified downsampling rate to obtain multi-granularity aggregated speech features.

[0010] Furthermore, the token mixer includes parallel 7×1 depthwise separable convolutional branches and 1×1 depthwise separable convolutional branches, used to receive the multi-granularity aggregated speech features, and to perform convolution processing on the multi-granularity aggregated speech features to obtain the convolution processing results of each branch, and to superimpose the convolution results of each branch with the multi-granularity aggregated speech features to obtain token information.

[0011] Furthermore, the formula for calculating the obtained token information is as follows:

[0012] $\hat{X} = X + \mathrm{DWConv}_{7\times 1}(\mathrm{LN}(X)) + \mathrm{DWConv}_{1\times 1}(\mathrm{LN}(X))$;

[0013] where $X$ and $\hat{X}$ represent the input and output of the token mixer, respectively; LN represents layer normalization; $\mathrm{DWConv}_{7\times 1}$ represents the one-dimensional 7×1 depthwise separable convolution branch; and $\mathrm{DWConv}_{1\times 1}$ represents the one-dimensional 1×1 depthwise separable convolution branch.

[0014] Furthermore, the nonlinear gating branch of the channel mixer includes a linear transformation unit and a GELU nonlinear activation unit. The linear transformation unit performs a linear transformation on the input token information and then inputs the processing result into the GELU nonlinear activation unit to obtain the gating signal feature value.

[0015] Furthermore, the gating signal feature value obtained by the nonlinear gating branch is multiplied element-wise with the target information to be screened obtained by the linear branch to obtain the optimal gating signal feature value.

[0016] Furthermore, the formula for calculating the optimal gating signal feature value is as follows:

[0017] $Y = \mathrm{GELU}(\mathrm{LN}(\hat{X})\,W_1) \odot (\mathrm{LN}(\hat{X})\,W_2)$;

[0018] where $W_1$ and $W_2$ represent the weights of the nonlinear gating branch and the linear branch, respectively; $\hat{X}$ and $Y$ represent the input and output of the channel mixer, respectively; $\odot$ represents element-wise multiplication; GELU represents the Gaussian error linear unit activation function; and LN represents layer normalization.

[0019] A detection method for a speech signal processing system based on a hierarchical attention-free model includes the following steps:

[0020] Step 1. Obtain the user's voice information and extract the feature vector of the user's voice information;

[0021] Step 2. The feature vector is simultaneously input into each depthwise separable convolutional branch of the token mixer in the hierarchical attention-free module for convolution processing, and the processing results of each depthwise separable convolutional branch are superimposed with the feature vector to obtain token information output; the token information is then input into the parallel nonlinear gating branch and the target information linear branch of the channel mixer for processing, and the optimal gating signal feature value is obtained based on the processing results of the two branches, and the target information to be filtered is obtained based on the gating signal feature value.

[0022] Furthermore, in step 2,

[0023] The method for obtaining the optimal gating signal feature values includes:

[0024] Step 201. The nonlinear gating branch acquires the token information, performs linear processing on the token information, and inputs the processing result into the GELU nonlinear activation unit to obtain the gating signal;

[0025] Step 202. The target linear branch obtains the token information and performs linear processing on the token information to obtain the target information to be filtered;

[0026] Step 203. Perform element-wise multiplication on the gating signal and the target information to be screened to obtain the optimal gating signal features.

[0027] Compared with the prior art, the advantages of the present invention are as follows:

[0028] 1. This invention separates spatial convolution and pointwise convolution through depthwise separable convolution in a hierarchical attention-free module, effectively reducing computational complexity. By connecting small separable convolution kernels to local inputs, it captures fine-grained local features. Furthermore, residual connections effectively alleviate the vanishing gradient problem and enhance the additive aggregation effect of multi-path sub-network features. By combining separable convolution units with residual connections, the system minimizes parameters and computational cost while increasing the network's perceptual range to achieve effective integration of global semantic information. Automatic semantic filtering and selection are implemented through nonlinear gating units, effectively suppressing irrelevant information, making it more suitable for speech analysis needs compared to standard fully connected layers in existing technologies.

[0029] 2. This invention further constructs a two-branch channel mixer. One branch performs a linear transformation followed by GELU nonlinear activation, and its output serves as the gating signal; the other branch performs only a linear transformation, and its output is the target information to be filtered. Then, the outputs of the two branches are multiplied element-wise to achieve gating. The two branches employ different linear transformation matrices, allowing the gating signal to selectively process each target channel. The gating signal and the target representation are learned together to achieve better gating performance.

Attached Figure Description

[0030] Figure 1 is a schematic diagram of the speech signal processing system based on a hierarchical attention-free model according to an embodiment of the present invention.

[0031] Figure 2 is a schematic diagram of the token mixer and channel mixer structure in the AFFormer module of this invention.

[0032] Figure 3 is a flowchart of a speech signal processing method based on a hierarchical attention-free model according to an embodiment of the present invention.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0034] Example 1

[0035] As shown in Figure 1, the speech signal processing system based on a hierarchical attention-free model in this embodiment includes:

[0036] (1) Speech preprocessing embedding module, used to obtain the user's speech information and extract the feature vector of the user's speech information.

[0037] In this embodiment, a pre-trained model is used for acoustic feature extraction from the acquired user speech. Currently popular pre-trained speech models based on self-supervised learning (SSL) include Wav2Vec 2.0, HuBERT, and WavLM. Given that the preprocessed dataset is cross-lingual, and earlier versions of Wav2Vec 2.0, HuBERT, and WavLM were trained only on English data, the Wav2Vec2 XLS-R pre-trained speech model is preferred for feature extraction from the acquired user speech information. The dimension of each extracted speech embedding is represented as D = L × N, where N is 1024 and L depends on the length of the entire speech signal. In this embodiment, L is set to 3200 to cover most speech samples. Speech features longer than 3200 frames are truncated, while speech features shorter than 3200 frames are padded with zeros.
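The fixed-length rule described above (truncate beyond 3200 frames, zero-pad below) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `pad_or_truncate` and the list-of-lists representation are assumptions, and only L = 3200 and N = 1024 come from the text.

```python
from typing import List

MAX_FRAMES = 3200   # L in the text
FEATURE_DIM = 1024  # N in the text


def pad_or_truncate(frames: List[List[float]],
                    max_frames: int = MAX_FRAMES,
                    feature_dim: int = FEATURE_DIM) -> List[List[float]]:
    """Return exactly max_frames rows: drop extra frames, zero-pad short inputs."""
    clipped = [list(row) for row in frames[:max_frames]]
    # range() is empty when the input is already long enough, so no padding is added.
    padding = [[0.0] * feature_dim for _ in range(max_frames - len(clipped))]
    return clipped + padding


# Small stand-in sizes keep the demonstration readable.
short_clip = [[1.0, 2.0]] * 3
fixed = pad_or_truncate(short_clip, max_frames=5, feature_dim=2)
print(len(fixed), fixed[-1])  # 5 [0.0, 0.0]
```

Padding is appended at the end of the sequence here; the text does not say whether padding is placed before or after the speech frames.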

[0038] In this embodiment, a projection module is provided after the speech preprocessing embedding module. The projection module is used to obtain a low-dimensional representation of the feature vector.

[0039] In practical speech signal processing, the acquired speech signals often have long durations and sparse annotations, leading to overfitting issues due to high-dimensional representations. Furthermore, the long speech duration also incurs a heavy computational load. To mitigate the risk of overfitting and reduce the computational burden on subsequent AFFormer modules, a projection layer is introduced to map the high-dimensional representation of features to a low-dimensional representation. In this embodiment, the projection layer is implemented using Conv1D (a one-dimensional convolutional neural network). Conv1D learns the local correlations of features through local connections, thereby extracting local texture information from the feature vectors, preserving more discriminative information compared to linear mapping. Compared to conventional fully connected layer mapping, Conv1D has higher parameter efficiency, faster computation, and ultimately yields a compact low-dimensional representation.

[0040] In a specific application embodiment, when the embedding X of the speech signal has dimension D = 3200 × 1024, the output of the projection layer has dimension D = 3200 × 8.
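As a rough illustration of a Conv1D projection that shrinks the feature dimension while preserving the sequence length, the sketch below uses small stand-in sizes (32 × 16 to 32 × 8 instead of 3200 × 1024 to 3200 × 8). The kernel size of 3, the zero "same" padding, the random weights, and the name `conv1d_same` are all assumptions made for the example; the text does not specify them.

```python
import random
from typing import List

Matrix = List[List[float]]  # shape (frames, channels)


def conv1d_same(x: Matrix, weight: List[List[List[float]]], bias: List[float]) -> Matrix:
    """1-D convolution over time with zero 'same' padding and stride 1.

    weight has shape (out_channels, kernel_size, in_channels); kernel_size is odd.
    """
    n_frames = len(x)
    in_ch = len(x[0])
    kernel = len(weight[0])
    half = kernel // 2
    out: Matrix = []
    for t in range(n_frames):
        row = []
        for o, w_o in enumerate(weight):
            acc = bias[o]
            for k in range(kernel):
                src = t + k - half
                if 0 <= src < n_frames:  # positions outside the signal count as zeros
                    acc += sum(w_o[k][c] * x[src][c] for c in range(in_ch))
            row.append(acc)
        out.append(row)
    return out


rng = random.Random(0)
frames, in_ch, out_ch, kernel = 32, 16, 8, 3  # stand-ins; kernel size is assumed
x = [[rng.uniform(-1, 1) for _ in range(in_ch)] for _ in range(frames)]
w = [[[rng.uniform(-0.1, 0.1) for _ in range(in_ch)] for _ in range(kernel)]
     for _ in range(out_ch)]
y = conv1d_same(x, w, [0.0] * out_ch)
print(len(y), len(y[0]))  # 32 8: same number of frames, fewer features per frame
```

Because each output frame draws on its neighbouring frames, the projection can pick up local correlations that a per-frame linear map would not see.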

[0041] (2) A hierarchical attention-free module, comprising multiple attention-free layers, each layer comprising multiple AFFormer units, each AFFormer unit comprising a token mixer and a channel mixer, the token mixer comprising multiple parallel depthwise separable convolutional branches, the depthwise separable convolutional branches being used to process the received feature vectors and superimposing the processing results of each depthwise separable convolutional branch with the feature vectors to obtain token information output; the channel mixer comprising parallel nonlinear gating branches and target information linear branches, the nonlinear gating branches and target information linear branches being used to process the received token information respectively, and obtaining the optimal gating signal feature value based on the processing results of the two branches, and obtaining the target information to be filtered based on the gating signal feature value.

[0042] To reduce the computational complexity of the Transformer variant model associated with long data and eliminate the inherent redundancy in speech data for feature aggregation, the hierarchical attention-free module in this embodiment also includes a merging unit. The merging unit is located upstream of the AFFormer unit and is used to receive the output of the projection module and downsample the output of the projection module using a specified downsampling rate to obtain multi-granularity speech features.

[0043] Furthermore, to enable the system to learn speech features at different granularities, each level of the hierarchical attention-free module in this embodiment includes a merging unit and multiple AFFormer units. The hierarchical attention-free module consists of multiple levels, and the output of each level serves as the input of the next level, so that the feature vector is learned at progressively different granularities.

[0044] As shown in Figure 1, the user's speech information is obtained through the speech preprocessing embedding module, and the feature vector of the user's speech is extracted. The speech feature vector is then input into the projection module to obtain its low-dimensional representation. Note that the projection reduces the feature dimension, not the sequence dimension. The output of the projection module is input into the hierarchical attention-free module to obtain the target information to be filtered. In a specific application embodiment, the hierarchical attention-free module includes three levels, and the number of levels can be increased according to actual needs. Each level contains one merging unit and one or two AFFormer units. The merging unit uses Conv1D to reduce the computational cost that Transformer variant models incur on long data and to eliminate the inherent redundancy in the speech data through feature aggregation. First, for long speech sequences, the computational complexity of the standard Transformer is high. The Conv1D in the merging unit of this embodiment efficiently aggregates local speech features; by adjusting the convolution kernel size and stride, the time series length is reduced, which lowers the computational load of subsequent modules. Second, speech signals have a multi-granularity hierarchical structure (phonemes, words, sentences, and so on), and different granularity levels carry different semantic information. The Conv1D in the merging unit can learn to extract speech features of different granularities, thereby aggregating multi-granularity features.

[0045] Specifically, by designing multi-stage merging units, each unit employs different downsampling rates, such as 4x or 2x, to reduce the time series length. This method reduces computational cost and allows for the learning of speech features at different levels of abstraction. The downstream AFFormer module can learn long-term dependent speech semantic representations based on the obtained multi-granularity speech features. Through multi-granularity and multi-level feature aggregation in the merging module, computational efficiency and the discriminativeness of speech representation can be effectively balanced. In specific experiments, the first, second, and third merging units downsampled the data with sampling factors of 4, 2, and 2, respectively, producing output dimensions D = 800×8, 400×8, and 200×8, effectively reducing the time series length and significantly reducing the computational cost of subsequent modules.
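The sequence lengths quoted above follow directly from the downsampling factors. The short sketch below reproduces that arithmetic; `merged_lengths` is a hypothetical helper, and it assumes each merging unit divides the length exactly by its factor.

```python
from typing import List, Sequence


def merged_lengths(seq_len: int, factors: Sequence[int]) -> List[int]:
    """Sequence length after each merging unit, assuming exact division by each factor."""
    lengths = []
    for factor in factors:
        if seq_len % factor != 0:
            raise ValueError(f"length {seq_len} is not divisible by factor {factor}")
        seq_len //= factor
        lengths.append(seq_len)
    return lengths


lengths = merged_lengths(3200, [4, 2, 2])
print(lengths)  # [800, 400, 200]

# A convolution's cost grows linearly with sequence length, so relative to the
# 3200-frame input each level processes this fraction of the positions:
print([n / 3200 for n in lengths])  # [0.25, 0.125, 0.0625]
```

With the feature dimension fixed at 8, these lengths give the 800×8, 400×8, and 200×8 outputs stated in the text.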

[0046] In this embodiment, the token mixer based on the MSDW structure includes a parallel 7×1 depthwise separable convolutional branch and 1×1 depthwise separable convolutional branch, which receive the multi-granularity aggregated speech features and perform convolution processing on them to obtain the convolution result of each branch. The convolution results of the branches are then superimposed with the multi-granularity speech features to obtain the token information.

[0047] As shown in Figure 2(a), in a specific application embodiment, the token mixer is designed using the MSDW structure. The MSDW-based token mixer has a two-branch convolutional topology: one branch is a 7×1 depthwise separable convolution, and the other is a 1×1 depthwise separable convolution. By separating spatial convolution from pointwise convolution, depthwise separable convolution effectively reduces the number of parameters and the computational cost. Small separable convolutional kernels connect only to local inputs and so capture fine-grained local features. Meanwhile, inverse residual connections alleviate the vanishing gradient problem and enhance the additive aggregation of features from the multi-path sub-networks. Combining separable convolutional modules with residual connections minimizes parameters and computational cost while expanding the network's perceptual context, achieving effective integration of global semantic information. This design combines the advantages of depthwise separable convolution and residual connections, making the network both efficient and capable of context modeling. Furthermore, abandoning the self-attention module avoids its expensive computations.
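To make the parameter saving concrete, the sketch below counts weights (biases ignored) for a standard 1-D convolution and for a depthwise separable one, using the 7-tap kernel and the 8-channel feature width mentioned in this embodiment. The function names are hypothetical, and whether the patent's branches include a pointwise step exactly as counted here is an assumption.

```python
def standard_conv1d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Every output channel has its own kernel over every input channel."""
    return kernel * c_in * c_out


def depthwise_separable_conv1d_params(kernel: int, c_in: int, c_out: int) -> int:
    """One kernel per input channel (depthwise) plus a 1x1 channel-mixing step (pointwise)."""
    depthwise = kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


print(standard_conv1d_params(7, 8, 8))             # 448
print(depthwise_separable_conv1d_params(7, 8, 8))  # 120
```

The gap widens as the channel count grows, because the standard count scales with kernel × C² while the separable count scales with kernel × C + C².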

[0048] In this embodiment, the formula for calculating the token information is:

[0049] $\hat{X} = X + \mathrm{DWConv}_{7\times 1}(\mathrm{LN}(X)) + \mathrm{DWConv}_{1\times 1}(\mathrm{LN}(X))$ (1)

[0050] where $X$ and $\hat{X}$ represent the input and output of the token mixer, respectively; LN represents layer normalization; $\mathrm{DWConv}_{7\times 1}$ represents the one-dimensional 7×1 depthwise separable convolution branch; and $\mathrm{DWConv}_{1\times 1}$ represents the one-dimensional 1×1 depthwise separable convolution branch.
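A minimal sketch of the token mixer follows, assuming the form output = X + DWConv7(LN(X)) + DWConv1(LN(X)), which matches the description of two parallel branches whose results are superimposed with the input. The helper names, the layer normalization without learned scale or shift, the zero "same" padding, and the omission of any pointwise step inside the branches are assumptions of this example, not details confirmed by the text.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape (frames, channels)


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each frame over its channels (no learned scale/shift in this sketch)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def depthwise_conv1d(x: Matrix, kernels: List[List[float]]) -> Matrix:
    """Per-channel 1-D convolution over time, zero 'same' padding, stride 1.

    kernels[c] is the filter for channel c; all filters share one odd length.
    """
    n_frames, n_ch = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * n_ch for _ in range(n_frames)]
    for t in range(n_frames):
        for c in range(n_ch):
            for k, w in enumerate(kernels[c]):
                src = t + k - half
                if 0 <= src < n_frames:  # frames outside the signal count as zeros
                    out[t][c] += w * x[src][c]
    return out


def token_mixer(x: Matrix, k7: List[List[float]], k1: List[List[float]]) -> Matrix:
    """Residual sum of the input and the two convolution branches."""
    normed = layer_norm(x)
    wide = depthwise_conv1d(normed, k7)    # 7-tap branch: local context over time
    narrow = depthwise_conv1d(normed, k1)  # 1-tap branch: per-frame rescaling
    return [[a + b + c for a, b, c in zip(rx, rw, rn)]
            for rx, rw, rn in zip(x, wide, narrow)]


x = [[1.0, 2.0], [3.0, 5.0], [0.0, -1.0]]
zero7, zero1 = [[0.0] * 7] * 2, [[0.0]] * 2
print(token_mixer(x, zero7, zero1) == x)  # True: with zero kernels only the residual path remains
```

The residual path means the block starts close to an identity mapping when the kernels are small, which is one reason such connections ease optimization.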

[0051] In this embodiment, the nonlinear gating branch of the channel mixer includes a linear transformation unit and a GELU nonlinear activation unit. The linear transformation unit performs a linear transformation on the input token information and then inputs the processing result into the GELU nonlinear activation unit to obtain the gating signal feature value.

[0052] In this embodiment, the gating signal feature value obtained by the nonlinear gating branch is multiplied element-wise with the target information to be screened obtained by the linear branch. The calculation formula for obtaining the optimal gating signal feature value is as follows:

[0053] $Y = \mathrm{GELU}(\mathrm{LN}(\hat{X})\,W_1) \odot (\mathrm{LN}(\hat{X})\,W_2)$ (2)

[0054] where $W_1$ and $W_2$ represent the weights of the nonlinear gating branch and the linear branch, respectively; $\hat{X}$ and $Y$ represent the input and output of the channel mixer, respectively; $\odot$ represents element-wise multiplication; GELU represents the Gaussian error linear unit activation function; and LN represents layer normalization.

[0055] As shown in Figure 2(b), in a specific application embodiment, the GEGLU channel mixer contains two branches: one branch performs a linear transformation followed by GELU nonlinear activation, and its output serves as the gating signal; the other branch performs only a linear transformation, and its output is the target information to be filtered. The outputs of the two branches are then multiplied element-wise to achieve gating. Because GELU is smooth, with a continuous derivative that is positive over most of its input range, gradients propagate stably through the gating signal branch during backpropagation, which helps the gating logic to be learned. The two branches use different linear transformation matrices, allowing the gating signal to process each target channel specifically. Learning the gating signal together with the target representation yields a better gating effect. Furthermore, the GELU nonlinearity gives the gating signal the ability to judge the importance of target information. Dimensions with small values in the gating signal indicate that the corresponding target information is irrelevant or redundant. Element-wise multiplication suppresses or filters out this redundant target information while retaining the relevant target information with larger gating values.
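The two-branch gating can be sketched as GELU(x W1) multiplied element-wise by (x W2). In this minimal example the input is assumed to be already layer-normalized, the weight matrices are hand-picked purely for illustration, and the names `gelu`, `linear`, and `geglu` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x: Matrix, w: Matrix) -> Matrix:
    """Matrix product x @ w for x of shape (frames, d_in) and w of shape (d_in, d_out)."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row)) for j in range(len(w[0]))]
            for row in x]


def geglu(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Gate the linear branch with the GELU branch, element by element."""
    gate = [[gelu(v) for v in row] for row in linear(x, w1)]  # nonlinear gating branch
    target = linear(x, w2)                                    # target information branch
    return [[g * t for g, t in zip(gr, tr)] for gr, tr in zip(gate, target)]


x = [[1.0, -2.0]]
w1 = [[3.0, 0.0], [0.0, 3.0]]  # gate pre-activations become [3, -6]
w2 = [[1.0, 0.0], [0.0, 1.0]]  # target branch passes x through unchanged
out = geglu(x, w1, w2)
print(out)  # first channel is kept nearly at full strength, second is driven close to zero
```

The second channel has a strongly negative gate pre-activation, so its gate value is almost zero and the corresponding target value is suppressed, which is the filtering behaviour described above.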

[0056] To verify the beneficial effects of this invention in practical use, the above-described system was used to process speech data; the experimental results are shown in Table 1. Across the different token mixers tested, the Gaussian error gated linear unit (GEGLU) generally produces good results, and the combination of the multi-scale depthwise separable (MSDW) module with GEGLU achieves the best performance. Notably, this combination requires approximately 1/20 of the multiply-accumulate operations of the standard Transformer, making it highly efficient computationally.

[0057] Table 1 Comparison of model accuracy, model size, and model runtime complexity

[0058]

[0059] Example 2

[0060] Clearly, this system can process any speech signal, and is particularly suitable for processing speech signals that are long in duration and sparsely annotated. This embodiment applies a speech signal processing system based on a hierarchical attention-free model to identify and analyze the speech characteristics of Alzheimer's patients, but this does not limit the scope of application of this system.

[0061] Alzheimer's disease (AD) is a common neurodegenerative disease. While it is incurable, early screening and diagnosis not only provide more treatment options but also help slow disease progression and improve patients' quality of life. However, accurate detection in the early stages of the disease presents a significant challenge due to patients' lack of relevant health knowledge. Early symptoms of Alzheimer's disease include poor logical thinking and frequent forgetting of words, indicating language impairment. Therefore, this system can be used for speech feature detection in AD. Specifically, it includes the following:

[0062] 1. Speech Preprocessing Embedding Module. To obtain rich Alzheimer's disease detection features from speech, a pre-trained model is used for acoustic feature extraction. The pre-trained speech model used in this embodiment is the XLS-R model.

[0063] 2. Projection Module. Considering that spontaneous speech from Alzheimer's patients is often lengthy and sparsely annotated, high-dimensional representations frequently lead to overfitting. Furthermore, lengthy speech inevitably incurs a heavy computational load. By introducing a projection layer, the high-dimensional representation is mapped to a low-dimensional representation, effectively reducing the risk of overfitting and decreasing the computational load on the subsequent AFFormer module.

[0064] 3. Merging Module. A multi-stage merging module is designed to combine the outputs of the projection module. Each module reduces the time series length at different downsampling rates (4x, 2x, etc.). This reduces computational cost and allows for learning speech features at different levels of abstraction. The downstream AFFormer module can then learn long-term dependent speech semantic representations based on these multi-granularity speech features. Through multi-granularity, multi-level feature aggregation in the merging module, a balance can be effectively struck between computational efficiency and the discriminative power of speech representation.

[0065] 4. AFFormer Module. Inspired by MetaFormer, this module uses a token mixer and a channel mixer designed for the AD detection task, replacing the computationally expensive self-attention module in the standard Transformer.

[0066] 4.1 Token Mixer: A token mixer is designed using the MSDW architecture to capture fine-grained local features of speech signals and achieve efficient integration of global semantic information.

[0067] 4.2 Channel Mixer: GEGLU is used to build the channel mixer. As shown in Figure 2(b), GEGLU comprises two branches, each employing a different linear transformation matrix, allowing the gating signal to selectively process each target channel. The gating signal is learned along with the target representation to achieve better gating performance. Element-wise multiplication suppresses or filters out redundant target information while retaining relevant target information with large feature values.

[0068] Based on the retained relevant target information with large feature values, it can be determined whether the person whose voice signal was collected is an AD patient.

[0069] As shown in Figure 3, the detection method of the speech signal processing system based on the hierarchical attention-free model in this embodiment includes the following steps:

[0070] Step 1. Obtain the user's voice information and extract the feature vector of the user's voice information;

[0071] Step 2. The feature vector is simultaneously input into each depthwise separable convolutional branch of the token mixer in the hierarchical attention-free module for convolution processing. The processing results of each depthwise separable convolutional branch are superimposed with the feature vector to obtain the token information output. The token information is then input into the parallel nonlinear gating branch and the target information linear branch of the channel mixer for processing. The optimal gating signal feature value is obtained based on the processing results of the two branches. The target information to be screened is obtained based on the gating signal feature value.

[0072] In this embodiment, the method for obtaining the optimal gating signal feature value includes:

[0073] Step 201. The nonlinear gating branch obtains the token information, performs linear processing on the token information, and inputs the processing result into the GELU nonlinear activation unit to obtain the gating signal;

[0074] Step 202. The target information linear branch obtains the token information and performs linear processing on it to obtain the target information to be screened;

[0075] Step 203. Perform element-wise multiplication of the gating signal with the target information to be screened to obtain the optimal gating signal feature value.
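Steps 201 to 203 can be sketched for a single token as follows. This is an illustrative sketch only: the function names, the [C_in][C_out] weight layout, and the absence of bias terms and of layer normalization are assumptions made for this example, and the exact (erf-based) form of GELU is used.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gelu(v: float) -> float:
    """Exact GELU: v times the standard normal CDF evaluated at v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(token: Vector, weight: Matrix) -> Vector:
    """Multiply a token of length C_in by a weight of shape [C_in][C_out]."""
    return [sum(token[i] * weight[i][o] for i in range(len(token)))
            for o in range(len(weight[0]))]


def geglu(token: Vector, w_gate: Matrix, w_target: Matrix) -> Vector:
    """Gate the linear target branch with the GELU-activated gating branch."""
    gate = [gelu(v) for v in linear(token, w_gate)]   # Step 201: gating signal
    target = linear(token, w_target)                  # Step 202: target information
    return [g * t for g, t in zip(gate, target)]      # Step 203: element-wise product
```

With identity weights, a channel whose gate pre-activation is strongly negative is scaled down to nearly zero, while a channel with a large positive pre-activation passes through almost unchanged. This is the sense in which the gate screens the target information.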

[0076] The method described in this embodiment can be executed by a single device, such as a computer or server, or it can be applied in a distributed scenario where multiple devices cooperate to complete the task. In a distributed scenario, one of the multiple devices may execute only one or more steps of the method described in this embodiment, and the multiple devices interact to complete the method. The processor can be implemented using a general-purpose CPU, microprocessor, application-specific integrated circuit, or one or more integrated circuits, and is used to execute relevant programs to implement the method described in this embodiment. The memory can be implemented using read-only memory (ROM), random access memory (RAM), static storage devices, and dynamic storage devices. The memory can store the operating system and other applications. When the method described in this embodiment is implemented through software or firmware, the relevant program code is stored in the memory and called and executed by the processor.

[0077] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the invention. Therefore, any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention should fall within the protection scope of the present invention.

Claims

1. A speech signal processing system based on a hierarchical attention-free model, characterized in that it comprises: a speech preprocessing embedding module, used to acquire the user's speech information and extract the feature vector of the user's speech information; and a hierarchical attention-free module, which includes multiple attention-free layers, each layer including multiple AFFormer units, and each AFFormer unit including a token mixer and a channel mixer. The token mixer includes multiple parallel depthwise separable convolutional branches, which process the received feature vector; the processing results of the depthwise separable convolutional branches are superimposed with the feature vector to obtain the token information output. The channel mixer includes a nonlinear gating branch and a target information linear branch arranged in parallel, each of which processes the received token information; the optimal gating signal feature value is obtained based on the processing results of the two branches, and the target information to be screened is obtained based on the gating signal feature value. The speech preprocessing embedding module is followed by a projection module, which is used to obtain a low-dimensional representation of the feature vector. The hierarchical attention-free module further includes a merging unit located upstream of the AFFormer units; the merging unit receives the output of the projection module and downsamples it at a specified downsampling rate to obtain multi-granularity aggregated speech features.

2. The speech signal processing system based on a hierarchical attention-free model according to claim 1, characterized in that the token mixer includes parallel 7×1 depthwise separable convolutional branches and 1×1 depthwise separable convolutional branches, which receive the multi-granularity aggregated speech features and perform convolution processing on them to obtain the convolution processing result of each branch; the convolution processing results of the branches are superimposed with the multi-granularity aggregated speech features to obtain the token information.

3. The speech signal processing system based on a hierarchical attention-free model according to claim 2, characterized in that the formula for calculating the token information is as follows: [...]; where [...] and [...] represent the input and output of the token mixer, respectively, LN represents layer normalization, [...] represents the 7×1 depthwise separable one-dimensional convolution branch, and [...] represents the 1×1 depthwise separable one-dimensional convolution branch.

4. The speech signal processing system based on a hierarchical attention-free model according to claim 1, characterized in that the nonlinear gating branch of the channel mixer includes a linear transformation unit and a GELU nonlinear activation unit; the linear transformation unit performs a linear transformation on the input token information and then inputs the processing result into the GELU nonlinear activation unit to obtain the gating signal feature value.

5. The speech signal processing system based on a hierarchical attention-free model according to claim 4, characterized in that the gating signal feature value obtained by the nonlinear gating branch is multiplied element-wise with the target information to be screened obtained by the linear branch to obtain the optimal gating signal feature value.

6. The speech signal processing system based on a hierarchical attention-free model according to claim 5, characterized in that the formula for calculating the optimal gating signal feature value is as follows: [...]; where W1 and W2 represent the weights of the nonlinear gating branch and the linear branch, respectively, [...] and [...] represent the input and output of the channel mixer, respectively, GELU represents the Gaussian error linear unit activation function, and LN represents layer normalization.

7. A detection method for a speech signal processing system based on a hierarchical attention-free model as described in any one of claims 1 to 6, characterized in that it includes the following steps: Step 1. Obtain the user's voice information and extract the feature vector of the user's voice information; Step 2. The feature vector is simultaneously input into each depthwise separable convolutional branch of the token mixer in the hierarchical attention-free module for convolution processing, and the processing results of each depthwise separable convolutional branch are superimposed with the feature vector to obtain the token information output; the token information is then input into the parallel nonlinear gating branch and target information linear branch of the channel mixer for processing, the optimal gating signal feature value is obtained based on the processing results of the two branches, and the target information to be screened is obtained based on the gating signal feature value.

8. The detection method of the speech signal processing system based on the hierarchical attention-free model according to claim 7, characterized in that, in step 2, the method for obtaining the optimal gating signal feature value includes: Step 201. The nonlinear gating branch obtains the token information, performs linear processing on the token information, and inputs the processing result into the GELU nonlinear activation unit to obtain the gating signal; Step 202. The target information linear branch obtains the token information and performs linear processing on it to obtain the target information to be screened; Step 203. Perform element-wise multiplication of the gating signal with the target information to be screened to obtain the optimal gating signal feature value.