Audio Feature Extraction Method and Apparatus

Grouping and regularizing the cepstral coefficients of an audio signal addresses the ineffective noise suppression of existing audio feature extraction technologies and thereby improves the accuracy of speech recognition.

CN116825124B · Active · Publication Date: 2026-06-30 · ZHEJIANG HUACHUANG VISION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG HUACHUANG VISION TECH CO LTD
Filing Date: 2023-06-26
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing audio feature extraction methods cannot effectively suppress noise when the audio signal is noisy, resulting in a high misrecognition rate in speech recognition.

Method used

The cepstral coefficients of the target audio signal are grouped into sub-bands, and the cepstral coefficients in each sub-band undergo global regularization followed by local regularization, which balances the energy distribution within the sub-band and suppresses noise.

Benefits of technology

It effectively suppresses noise in audio signals, improves the signal-to-noise ratio of features, and reduces the false recognition rate of speech recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides an audio feature extraction method and apparatus. The method includes: acquiring a cepstral coefficient set of a target audio signal, wherein the cepstral coefficient set records the cepstral coefficients of the target audio signal; grouping the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set, wherein the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients from the cepstral coefficient set; performing regularization processing on the cepstral coefficients in each sub-band of the first sub-band set to obtain a target sub-band set, wherein the sub-bands in the target sub-band set correspond one-to-one with the sub-bands in the first sub-band set; and determining the audio features of the target audio signal based on the target sub-band set. This invention solves the problem in related technologies where effective noise suppression is impossible during audio feature extraction.

Description

Technical Field

[0001] The present invention relates to the field of speech recognition, and more specifically, to an audio feature extraction method and apparatus.

Background Technology

[0002] Audio signals are a crucial signal type, widely used in speech recognition, speaker recognition, and emotion recognition. Feature extraction from audio signals is a critical step in these applications. Currently, mainstream speech feature extraction methods include MFCC, PLP, LPCC, and neural networks. These methods extract features based on the frequency and cepstral domains of the signal, achieving good results. However, these methods also have some limitations. For example, the effectiveness of extracted features deteriorates in the presence of noise, leading to a high false recognition rate when using the extracted features for speech recognition. Therefore, related technologies suffer from the problem of not being able to effectively suppress noise when extracting audio features.

[0003] There is currently no effective solution to the problem of ineffective noise suppression when extracting audio features in related technologies.

Summary of the Invention

[0004] This invention provides an audio feature extraction method and apparatus to at least solve the problem of ineffective noise suppression when extracting audio features in related technologies.

[0005] According to an embodiment of the present invention, an audio feature extraction method is provided, comprising: acquiring a cepstral coefficient set of a target audio signal, wherein the cepstral coefficient set records the cepstral coefficients of the target audio signal; grouping the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set, wherein the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients from the cepstral coefficient set; performing regularization processing on the cepstral coefficients in each sub-band of the first sub-band set to obtain a target sub-band set, wherein the sub-bands in the target sub-band set correspond one-to-one with the sub-bands in the first sub-band set; and determining the audio features of the target audio signal based on the target sub-band set.

[0006] In an exemplary embodiment, regularization is performed on the cepstral coefficients in each subband of the first subband set to obtain a target subband set. This includes: performing the following operations on the cepstral coefficients in each subband of the first subband set, wherein the subband in which the operation is performed is the current subband: performing global regularization on the cepstral coefficients in the current subband to obtain a current first subband, wherein the cepstral coefficients in the current first subband correspond one-to-one with the cepstral coefficients in the current subband; and performing local regularization on the cepstral coefficients in the current first subband to obtain a current target subband, wherein the cepstral coefficients in the current target subband correspond one-to-one with the cepstral coefficients in the current first subband.

[0007] In an exemplary embodiment, performing global regularization on the cepstral coefficients in the current sub-band to obtain the current first sub-band includes: determining the variance of all cepstral coefficients included in the current sub-band to obtain a first variance; determining a first regularization factor based on the first variance; and performing global regularization on the cepstral coefficients in the current sub-band based on the first regularization factor to obtain the current first sub-band.

[0008] In an exemplary embodiment, performing global regularization processing on the cepstral coefficients in the current sub-band according to the first regularization factor to obtain the current first sub-band includes: determining the quotient of each cepstral coefficient in the current sub-band with the first regularization factor as the cepstral coefficient in the current first sub-band.

[0009] In one exemplary embodiment, performing local regularization on the cepstral coefficients in the current first sub-band to obtain the current target sub-band includes: grouping the cepstral coefficients in the current first sub-band to obtain a second sub-band set, wherein the second sub-band set includes multiple sub-bands; performing local regularization on the cepstral coefficients in each sub-band of the second sub-band set to obtain a third sub-band set, wherein the third sub-band set includes multiple sub-bands; and reconstructing each sub-band in the third sub-band set to obtain the current target sub-band.

[0010] In one exemplary embodiment, local regularization is performed on the cepstral coefficients in each subband of the second subband set to obtain a third subband set. This includes: performing the following operations on the cepstral coefficients in each subband of the second subband set, wherein the subband in which the operation is performed is the current second subband: performing a smoothing operation on the cepstral coefficients in the current second subband to obtain a current third subband, wherein the cepstral coefficients in the current third subband correspond one-to-one with the cepstral coefficients in the current second subband; and performing local regularization on the cepstral coefficients in the current third subband to obtain a current fourth subband, wherein the third subband set includes the current fourth subband.

[0011] In an exemplary embodiment, smoothing the cepstral coefficients in the current second sub-band to obtain the current third sub-band includes: determining the average value and standard deviation of each set of cepstral coefficients in the current second sub-band, wherein each set of cepstral coefficients includes n cepstral coefficients, the i-th cepstral coefficient among the n cepstral coefficients represents the characteristics of the i-th audio frame among n audio frames in the target frequency range, the target audio signal includes the n audio frames, n is an integer greater than or equal to 1, and i is an integer greater than or equal to 1; if there is a target standard deviation less than or equal to a preset threshold among the multiple standard deviations corresponding to the multiple sets of cepstral coefficients, all cepstral coefficients in the target set of cepstral coefficients corresponding to the target standard deviation are replaced with the average value of the target set of cepstral coefficients to obtain the current third sub-band.
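The smoothing operation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a sub-band is represented as a list of n frames, each holding the same number of cepstral coefficients, so that one "set of cepstral coefficients" is a single column across all frames. The function name `smooth_subband`, the use of the population standard deviation, and the example threshold are hypothetical choices.

```python
from statistics import mean, pstdev


def smooth_subband(subband, threshold):
    """Flatten near-constant coefficient tracks in a sub-band.

    subband:   list of n frames, each a list of cepstral coefficients
               (assumed layout: rows are frames, columns are coefficients).
    threshold: preset standard-deviation threshold (value is an assumption).
    """
    smoothed = [list(frame) for frame in subband]  # do not modify the input
    width = len(subband[0])
    for i in range(width):
        # One "set": the i-th coefficient taken from every audio frame.
        column = [frame[i] for frame in subband]
        if pstdev(column) <= threshold:
            avg = mean(column)
            for frame in smoothed:
                frame[i] = avg
    return smoothed
```

For example, with a threshold of 0.5, a column whose values barely vary across frames is replaced by its mean, while a strongly varying column is left unchanged.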

[0012] In an exemplary embodiment, performing local regularization processing on the cepstral coefficients in the current third sub-band to obtain the current fourth sub-band includes: determining the variance of the cepstral coefficients corresponding to n audio frames in the current third sub-band to obtain a target variance set, wherein the target audio signal includes the n audio frames, and the variance in the target variance set corresponds one-to-one with the n audio frames, where n is an integer greater than or equal to 1; determining the local regularization factors corresponding to the n audio frames according to the target variance set to obtain n local regularization factors, wherein the n audio frames correspond one-to-one with the n local regularization factors; and performing local regularization processing on the cepstral coefficients in the current third sub-band according to the n local regularization factors to obtain the current fourth sub-band.

[0013] In an exemplary embodiment, the current fourth sub-band is obtained by performing local regularization processing on the cepstral coefficients in the current third sub-band according to the n local regularization factors, which includes: determining the cepstral coefficients in the current fourth sub-band as the quotient of the cepstral coefficients corresponding to each audio frame in the current third sub-band and the local regularization factors corresponding to the audio frames in the n local regularization factors.
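The per-frame local regularization described above can be sketched as follows. This is an illustrative sketch only: the text states that each local regularization factor is determined from the corresponding frame's variance but does not fix the mapping, so the code assumes the factor equals the population variance itself. The function name `local_regularize` and the `eps` guard against division by zero are hypothetical additions.

```python
from statistics import pvariance


def local_regularize(subband, eps=1e-12):
    """Divide each frame's coefficients by that frame's own factor.

    subband: list of n frames, each a list of cepstral coefficients.
    eps:     frames with variance at or below eps are left unchanged
             (an assumption; the text does not cover zero variance).
    """
    result = []
    for frame in subband:
        # Assumption: the local regularization factor is the variance itself.
        factor = pvariance(frame)
        if factor > eps:
            result.append([c / factor for c in frame])
        else:
            result.append(list(frame))
    return result
```

Each of the n frames gets its own factor, so the factors correspond one-to-one with the frames, as the text requires.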

[0014] In an exemplary embodiment, obtaining a set of cepstral coefficients of a target audio signal includes: dividing the target audio signal into n audio frames, wherein there is partial overlap between any adjacent audio frames in the n audio frames; obtaining a subset of cepstral coefficients corresponding to each of the n audio frames to obtain n subsets of cepstral coefficients, where n is an integer greater than or equal to 1 and i is an integer greater than or equal to 1, wherein the set of cepstral coefficients includes the n subsets of cepstral coefficients.

[0015] In an exemplary embodiment, obtaining the cepstral coefficient subset corresponding to the m-th audio frame among the n audio frames includes: windowing the m-th audio frame, where m is an integer greater than or equal to 1; performing a Fast Fourier Transform on the windowed m-th audio frame to obtain the spectral information of the m-th audio frame; determining the power spectrum of the m-th audio frame based on the spectral information of the m-th audio frame; and performing a Discrete Cosine Transform on the power spectrum of the m-th audio frame to obtain the cepstral coefficient subset corresponding to the m-th audio frame.

[0016] In an exemplary embodiment, determining the audio features of the target audio signal based on the target sub-band set includes: combining each target sub-band in the target sub-band set to form a first cepstral coefficient matrix; performing nonlinear transformation and filtering on the first cepstral coefficient matrix to obtain a target cepstral coefficient matrix; converting the target cepstral coefficient matrix into Mel-frequency cepstral coefficients; and determining the audio features of the target audio signal based on the Mel-frequency cepstral coefficients.

[0017] In an exemplary embodiment, grouping the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set includes: grouping the cepstral coefficients corresponding to each of the n audio frames to obtain r groups of cepstral coefficients corresponding to each audio frame; determining the first sub-band set based on the r groups of cepstral coefficients corresponding to each audio frame, wherein the j-th sub-band in the first sub-band set includes the j-th group of cepstral coefficients in the r groups of cepstral coefficients corresponding to each audio frame, where n is an integer greater than or equal to 1, r is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.

[0018] According to another embodiment of the present invention, an audio feature extraction device is also provided, comprising: an acquisition module, configured to acquire a set of cepstral coefficients of a target audio signal, wherein the set of cepstral coefficients records the cepstral coefficients of the target audio signal;

[0019] A grouping module is used to group the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set, wherein the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients in the cepstral coefficient set;

[0020] The regularization module is used to perform regularization processing on the cepstral coefficients in each subband of the first subband set to obtain the target subband set, wherein the subbands in the target subband set correspond one-to-one with the subbands in the first subband set;

[0021] The determination module is used to determine the audio characteristics of the target audio signal based on the target sub-band set.

[0022] According to yet another embodiment of the present invention, a computer-readable storage medium is also provided, wherein a computer program is stored therein, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.

[0023] According to yet another embodiment of the present invention, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.

[0024] This invention groups the cepstral coefficient set of the target audio signal to obtain a first sub-band set, and performs regularization processing on the cepstral coefficients in each sub-band of the first sub-band set to balance the energy distribution within the sub-band. This effectively suppresses noise in the audio signal, solves the problem of ineffective noise suppression when extracting audio features in related technologies, and achieves the effect of enhancing the signal-to-noise ratio of the extracted features.

Attached Figure Description

[0025] Figure 1 is a block diagram of the hardware structure of a mobile terminal for the audio feature extraction method according to an embodiment of the present invention;

[0026] Figure 2 is a flowchart of an audio feature extraction method according to an embodiment of the present invention;

[0027] Figure 3 is a schematic diagram of audio signal division according to an embodiment of the present invention;

[0028] Figure 4 is a schematic diagram illustrating the grouping of cepstral coefficients in the cepstral coefficient set according to an embodiment of the present invention;

[0029] Figure 5 is a schematic diagram illustrating the grouping of cepstral coefficients in the current first sub-band according to an embodiment of the present invention;

[0030] Figure 6 is a structural block diagram of an audio feature extraction device according to an embodiment of the present invention.

Detailed Implementation

[0031] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.

[0032] It should be noted that the terms "first," "second," etc., in the specification, claims, and drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0033] The method embodiments provided in this application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, Figure 1 is a block diagram of the hardware structure of a mobile terminal for the audio feature extraction method according to an embodiment of the present invention. As shown in Figure 1, the mobile terminal may include one or more processors 102 (only one is shown in Figure 1; the processor 102 may include, but is not limited to, a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. The mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in Figure 1 is for illustrative purposes only and does not limit the structure of the mobile terminal described above. For example, the mobile terminal may include more or fewer components than shown in Figure 1, or have a configuration different from that shown in Figure 1.

[0034] The memory 104 can be used to store computer programs, such as application software programs and modules, like the computer program corresponding to the audio feature extraction method in this embodiment of the invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the above-described method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0035] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the mobile terminal's communication provider. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.

[0036] This embodiment provides an audio feature extraction method. Figure 2 is a flowchart of an audio feature extraction method according to an embodiment of the present invention. As shown in Figure 2, the process includes the following steps:

[0037] Step S202: Obtain the cepstral coefficient set of the target audio signal, wherein the cepstral coefficient set records the cepstral coefficients of the target audio signal;

[0038] In this embodiment, the target audio signal is the audio signal to be recognized. To extract features from the target audio signal, it is first necessary to extract its cepstral coefficients, i.e., the cepstral coefficient set mentioned above.

[0039] The cepstral coefficients mentioned above are obtained by performing a Fourier transform on the audio signal to obtain spectral information, and then performing a Discrete Cosine Transform (DCT) on this spectral information. Cepstral coefficients are therefore a way to represent the characteristics of an audio signal in the frequency domain; each cepstral coefficient represents the characteristics of the audio signal within one frequency range.

[0040] In an optional embodiment, obtaining the cepstral coefficient set of the target audio signal includes: dividing the target audio signal into n audio frames, wherein there is partial overlap between any adjacent audio frames in the n audio frames; obtaining the cepstral coefficient subsets corresponding to the n audio frames respectively, to obtain n cepstral coefficient subsets, where n is an integer greater than or equal to 1 and i is an integer greater than or equal to 1, wherein the cepstral coefficient set includes the n cepstral coefficient subsets.

[0041] In this embodiment, when determining the cepstral coefficient set of the target audio signal, the target audio signal is first divided into multiple frames, i.e., n audio frames. When dividing the audio signal, the frame length of each audio frame can be set to a preset length, for example, the frame length of each audio frame is set to 20ms. Usually, the frame length of each audio frame is set to between 20 and 30ms.

[0042] After setting the frame length, the target audio signal is divided into segments according to the frame length in chronological order. There is some overlap between adjacent frames, and the size of the overlap can be preset. For example, it can be set that there is 50% overlap between two adjacent audio frames.

[0043] Optionally, the audio signal needs to be preprocessed before it is framed, such as removing DC components and pre-emphasizing.

[0044] Figure 3 is a schematic diagram of audio signal segmentation according to an embodiment of the present invention. Curve x(t) represents the target audio signal. The target audio signal is divided into n frames in chronological order. Each frame has a frame length of 20ms. There is a 50% overlap between any two adjacent frames. That is, the last 10ms of the previous frame and the first 10ms of the next frame cover the same segment of the audio signal.
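The framing described above (20ms frames with 50% overlap) can be sketched as follows. This is a minimal sketch under stated assumptions: the signal is a plain sequence of samples, the function name `split_into_frames` and its parameters are hypothetical, and trailing samples that do not fill a complete frame are simply dropped, which the text does not specify.

```python
def split_into_frames(signal, sample_rate, frame_ms=20.0, overlap=0.5):
    """Split samples into fixed-length, partially overlapping frames.

    signal:      sequence of audio samples in chronological order.
    sample_rate: samples per second.
    frame_ms:    frame length in milliseconds (20 ms in the example).
    overlap:     fraction shared by adjacent frames (0.5 means 50%).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))  # step between frame starts
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    frames = []
    # Assumption: an incomplete final frame is discarded rather than padded.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(list(signal[start:start + frame_len]))
    return frames
```

With a 20ms frame and 50% overlap the hop is 10ms, so the second half of each frame is identical to the first half of the next one.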

[0045] After dividing the target audio signal into n audio frames, the cepstral coefficients corresponding to each audio frame are determined respectively, that is, the subset of cepstral coefficients. Among them, each audio frame corresponds to a subset of cepstral coefficients, that is, there are n subsets of cepstral coefficients. The set composed of the n subsets of cepstral coefficients is determined as the cepstral coefficient set of the target audio signal. That is, the cepstral coefficient set of the above target audio signal includes n subsets of cepstral coefficients.

[0046] In an optional embodiment, obtaining the subset of cepstral coefficients corresponding to the m-th audio frame among the n audio frames includes: performing windowing processing on the m-th audio frame, where m is an integer greater than or equal to 1; performing fast Fourier transform on the windowed m-th audio frame to obtain the spectral information of the m-th audio frame; determining the power spectrum of the m-th audio frame according to the spectral information of the m-th audio frame; performing discrete cosine transform on the power spectrum of the m-th audio frame to obtain the subset of cepstral coefficients corresponding to the m-th audio frame.
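The four steps above (windowing, Fourier transform, power spectrum, DCT) can be sketched for a single frame as follows. This is an illustrative sketch, not the patented implementation. The window type is not named in the text, so a Hamming window is assumed; a direct O(N²) discrete Fourier transform stands in for the Fast Fourier Transform to keep the example dependency-free (it yields the same spectrum); and, following the text as written, the DCT is applied directly to the power spectrum. All function names are hypothetical.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (the window type is an assumption)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    """Window the frame, take a one-sided DFT, and return |X[k]|^2 / N."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct DFT bin; an FFT would compute the same value faster.
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def dct_ii(values):
    """Unnormalized DCT-II of a real sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (t + 0.5) / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def cepstral_coefficients(frame):
    """Cepstral coefficient subset for one audio frame, per the steps above."""
    return dct_ii(power_spectrum(frame))
```

Calling `cepstral_coefficients` on each of the n frames gives the n cepstral coefficient subsets that make up the cepstral coefficient set.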

[0047] In this embodiment, when obtaining the subsets of cepstral coefficients corresponding to the n audio frames respectively, a subset of cepstral coefficients is determined for each of the n audio frames. For any one audio frame (corresponding to the above m-th audio frame), a series of cepstral coefficients are obtained, denoted as c[0]_m, c[1]_m, ..., c[N - 1]_m.

[0048] Step S204: Group the cepstral coefficients in the set of cepstral coefficients to obtain a first sub-band set, where the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients in the set of cepstral coefficients.

[0049] In this embodiment, the cepstral coefficients in the set of cepstral coefficients are grouped to obtain multiple sub-bands, that is, the first sub-band set.

[0050] In an optional embodiment, grouping the cepstral coefficients in the set of cepstral coefficients to obtain a first sub-band set includes: grouping the cepstral coefficients corresponding to each of the n audio frames to obtain r groups of cepstral coefficients corresponding to each audio frame; determining the first sub-band set according to the r groups of cepstral coefficients corresponding to each audio frame, where the jth sub-band in the first sub-band set includes the jth group of cepstral coefficients in the r groups of cepstral coefficients corresponding to each audio frame, n is an integer greater than or equal to 1, r is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.

[0051] Sub-band division is usually performed on cepstral coefficients, that is, the cepstral coefficients of each frame are divided into multiple sub-bands, which is actually grouping the cepstral coefficients of each frame in order. For example, there are n audio frames, where a series of cepstral coefficients corresponding to the m-th audio frame are represented as c[0]_m, c[1]_m, ..., c[i]_m, ..., c[N - 1]_m, where 0 ≤ i < N, i is an integer, N is a positive integer, and m is any integer within [1, n].

[0052] That is, each audio frame corresponds to N cepstral coefficients, and c[i]_m represents the i-th cepstral coefficient among the N cepstral coefficients corresponding to the m-th audio frame in the n audio frames. c[i]_m and c[i]_k represent the characteristic information of different audio frames within the same frequency range, where 1 ≤ m ≤ n, 1 ≤ k ≤ n, that is, m and k are any two integers within [1, n], and m and k can be the same or different.

[0053] Figure 4 is a schematic diagram of grouping the cepstral coefficients in the set of cepstral coefficients according to an embodiment of the present invention. As shown in Figure 4, taking N=100, n=20, and r=5 as an example, the target audio signal is divided into 20 audio frames, each with 100 cepstral coefficients. The cepstral coefficients of each audio frame are grouped sequentially and evenly into 5 groups, each containing 20 consecutive cepstral coefficients. The first sub-band set is determined based on the 5 groups of cepstral coefficients corresponding to each of the n audio frames. The first sub-band set has 5 sub-bands, and each sub-band contains one group of cepstral coefficients from each audio frame. Specifically, the j-th sub-band in the first sub-band set includes the j-th group of cepstral coefficients from the r groups of cepstral coefficients corresponding to each audio frame, meaning the j-th sub-band includes n groups of cepstral coefficients. Taking the first sub-band in the first sub-band set as an example, the first group of cepstral coefficients from each audio frame is determined as the first sub-band.

[0054] The first sub-band in the first sub-band set includes the first group of cepstral coefficients of the first audio frame (c[0]_1 to c[19]_1), the first group of cepstral coefficients of the second audio frame (c[0]_2 to c[19]_2), ..., and the first group of cepstral coefficients of the n-th audio frame (c[0]_n to c[19]_n).
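The grouping in this example (N=100, n=20, r=5) can be sketched as follows. This is a minimal sketch assuming even grouping and a list-of-lists layout in which each row holds the N cepstral coefficients of one audio frame; the function name `group_into_subbands` is hypothetical.

```python
def group_into_subbands(coeffs, r):
    """Group per-frame cepstral coefficients into r sub-bands.

    coeffs: list of n frames, each a list of N cepstral coefficients.
    r:      number of groups per frame (N must be divisible by r here).

    Returns a list of r sub-bands; the j-th sub-band holds, for every
    frame, that frame's j-th group of consecutive coefficients.
    """
    n_coeffs = len(coeffs[0])
    if n_coeffs % r != 0:
        raise ValueError("this sketch only handles even grouping")
    size = n_coeffs // r
    return [
        [frame[j * size:(j + 1) * size] for frame in coeffs]
        for j in range(r)
    ]
```

With N=100, n=20, and r=5, the result is 5 sub-bands, each containing 20 frames of 20 coefficients, and the first sub-band holds coefficients 0 to 19 of every frame.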

[0055] Optionally, before partitioning the cepstral coefficient set, it is necessary to select an appropriate number of subbands and subband size. The optimal number and size of subbands may be affected by a variety of factors, including the characteristics of the audio signal, the background noise, and the speech recognition model used.

[0056] Here are three methods for selecting subbands:

1. Frequency distribution of the audio signal: The frequency distribution of audio signals is usually not uniform; some frequency ranges may have richer information, while others are relatively sparse. Therefore, the size of the subband can be determined based on the frequency distribution of the audio signal. For example, ranges with richer frequency distribution can be divided into smaller subbands, and ranges with sparser frequency distribution can be divided into larger subbands.

2. Background noise: Adjust the subband division scheme according to the frequency distribution of noise. For example, if a certain frequency range has high noise, this range can be divided into smaller subbands to better suppress noise.

3. Speech recognition model used: Different speech recognition models may be sensitive to different features, so the subband division scheme can be adjusted according to the model used.

[0057] Step S206: Regularize the cepstral coefficients in each subband of the first subband set to obtain the target subband set, wherein the subbands in the target subband set correspond one-to-one with the subbands in the first subband set;

[0058] In this embodiment, when the cepstral coefficients in each subband of the first subband set are regularized, a new subband is obtained by regularizing a subband. This new subband is a subband in the target subband set. Therefore, the subbands in the target subband set correspond one-to-one with the subbands in the first subband set.

[0059] In an optional embodiment, regularization is performed on the cepstral coefficients in each subband of the first subband set to obtain a target subband set. This includes: performing the following operations on the cepstral coefficients in each subband of the first subband set, where the subband in which the operation is performed is the current subband: performing global regularization on the cepstral coefficients in the current subband to obtain a current first subband, wherein the cepstral coefficients in the current first subband correspond one-to-one with the cepstral coefficients in the current subband; and performing local regularization on the cepstral coefficients in the current first subband to obtain a current target subband, wherein the cepstral coefficients in the current target subband correspond one-to-one with the cepstral coefficients in the current first subband.

[0060] In this embodiment, when performing regularization on the cepstral coefficients in each subband of the first subband set, the regularization is performed on a subband-by-subband basis, with different subbands being regularized independently. Therefore, the same operation is performed on each subband.

[0061] The process of regularizing the cepstral coefficients in one of the subbands (i.e., the current subband) is divided into two regularization processes: global regularization and local regularization. Global regularization is performed on the cepstral coefficients in the current subband to obtain the current first subband. Local regularization is performed on the cepstral coefficients in the current first subband to obtain the current target subband. The target subband set includes the current target subband.

[0062] In an optional embodiment, global regularization is performed on the cepstral coefficients in the current sub-band to obtain the current first sub-band, including: determining the variance of all cepstral coefficients included in the current sub-band to obtain a first variance; determining a first regularization factor based on the first variance; and performing global regularization on the cepstral coefficients in the current sub-band based on the first regularization factor to obtain the current first sub-band.

[0063] In this embodiment, regularization of the cepstral coefficients involves dividing each cepstral coefficient by a regularization factor to obtain new cepstral coefficients. Global regularization is then performed on the cepstral coefficients in the current sub-band, which means dividing all cepstral coefficients in the current sub-band by the same regularization factor (corresponding to the first regularization factor mentioned above). Specifically, global regularization of the cepstral coefficients in the current sub-band based on the first regularization factor to obtain the current first sub-band includes: determining the quotient of each cepstral coefficient in the current sub-band with the first regularization factor as the cepstral coefficient in the current first sub-band.

[0064] The first regularization factor is determined based on the variance of all cepstral coefficients in the current subband. First, the variance of all cepstral coefficients in the current subband is calculated, i.e., the first variance. After determining the first variance, the corresponding first regularization factor is determined based on the first variance.

[0065] Optionally, there is a correspondence between the first regularization factor and the first variance. This correspondence can be set in advance; for example, the larger the first variance, the larger the first regularization factor. Alternatively, the first variance can be directly determined as the first regularization factor.
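
The global regularization described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the first variance is used directly as the first regularization factor (one of the options mentioned above), the population variance is used, the sub-band is laid out as a list of rows, and the function name global_regularize and the small constant eps are hypothetical and do not appear in the original text.

```python
from statistics import pvariance


def global_regularize(subband, eps=1e-12):
    """Divide every coefficient of one sub-band by a single factor.

    subband: list of rows; each row holds one cepstral coefficient index
    across all audio frames (layout is an assumption of this sketch).
    """
    # First variance: variance over all coefficients in the sub-band.
    flat = [c for row in subband for c in row]
    first_variance = pvariance(flat)
    # Assumption: the first variance itself serves as the factor.
    # eps only guards against division by zero; it is not in the source.
    factor = max(first_variance, eps)
    regularized = [[c / factor for c in row] for row in subband]
    return regularized, factor


subband = [[1.0, 3.0], [5.0, 7.0]]
regularized, factor = global_regularize(subband)
print(factor, regularized)
```

Because the same factor is applied to every coefficient, the relative differences between coefficients inside the sub-band are preserved; only the overall scale changes.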

[0066] In an optional embodiment, performing local regularization on the cepstral coefficients in the current first sub-band to obtain the current target sub-band includes: grouping the cepstral coefficients in the current first sub-band to obtain a second sub-band set, wherein the second sub-band set includes multiple sub-bands; performing local regularization on the cepstral coefficients in each sub-band of the second sub-band set to obtain a third sub-band set, wherein the third sub-band set includes multiple sub-bands; and reconstructing each sub-band in the third sub-band set to obtain the current target sub-band.

[0067] In this embodiment, when performing local regularization on the cepstral coefficients in the current first sub-band, the current first sub-band is divided into smaller sub-bands, resulting in a second sub-band set. That is, the sub-bands in the second sub-band set are obtained by grouping the current first sub-band. Grouping the current first sub-band involves performing multi-scale wavelet decomposition on the cepstral coefficients in the current first sub-band, typically using methods such as Discrete Wavelet Transform (DWT) or Continuous Wavelet Transform (CWT) based on Daubechies wavelet (a type of wavelet function). Instead of average grouping, multi-scale decomposition is used, decomposing the current sub-band into multiple sub-bands of different scales, each containing cepstral coefficients at different frequency intervals. The wavelet transform calculation method achieves high accuracy while maintaining fast processing speed. Furthermore, the parameters of the wavelet transform can be adjusted according to different task requirements to obtain the optimal processing effect.

[0068] Figure 5 is a schematic diagram illustrating the grouping of cepstral coefficients in the current first sub-band according to an embodiment of the present invention. As shown in Figure 5, taking the first sub-band in Figure 4 as an example, assume that the first sub-band is decomposed into three sub-bands to obtain the second sub-band set. The second sub-band set includes sub-band 1, sub-band 2, and sub-band 3 in Figure 5. Each sub-band in the second sub-band set includes a set of cepstral coefficients for each of the n audio frames.

[0069] Local regularization is performed on the cepstral coefficients in each subband of the second subband set to obtain new cepstral coefficients and form a new subband, namely the subband in the third subband set. The subband in the third subband set is reconstructed into a subband, namely the current target subband, using wavelet inverse transform. The current target subband contains better feature information and less noise.
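
The decomposition and reconstruction steps can be illustrated with a minimal Python sketch. It uses the Haar wavelet, which is the simplest member of the Daubechies family, applied to a one-dimensional sequence of coefficients. The choice of Haar, the axis along which the transform is applied, the number of levels, and the function names are all assumptions of this sketch; the text above does not fix them.

```python
import math


def haar_step(x):
    """One level of the orthonormal Haar DWT on an even-length list."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_decompose(x, levels):
    """Return [approx_L, detail_L, ..., detail_1], coarsest scale first.

    len(x) must be divisible by 2 ** levels.
    """
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return [approx] + details[::-1]


def haar_reconstruct(bands):
    """Inverse of haar_decompose (the wavelet inverse transform)."""
    s = math.sqrt(2.0)
    approx = list(bands[0])
    for detail in bands[1:]:
        merged = []
        for a, d in zip(approx, detail):
            merged.extend([(a + d) / s, (a - d) / s])
        approx = merged
    return approx


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
bands = haar_decompose(signal, levels=2)  # three sub-bands of different scales
restored = haar_reconstruct(bands)
print([len(b) for b in bands], restored)
```

With two levels the sequence is split into three sub-bands of different scales. In the method described above, each of these sub-bands would be locally regularized before the inverse transform is applied; the sketch reconstructs them unchanged only to show that the round trip is lossless.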

[0070] In an optional embodiment, local regularization is performed on the cepstral coefficients in each subband of the second subband set to obtain a third subband set. This includes: performing the following operations on the cepstral coefficients in each subband of the second subband set, where the subband in which the operation is performed is the current second subband: performing a smoothing operation on the cepstral coefficients in the current second subband to obtain a current third subband, wherein the cepstral coefficients in the current third subband correspond one-to-one with the cepstral coefficients in the current second subband; and performing local regularization on the cepstral coefficients in the current third subband to obtain a current fourth subband, wherein the third subband set includes the current fourth subband.

[0071] In this embodiment, each sub-band in the second sub-band set is treated as a unit, and local regularization is performed independently on different sub-bands. Therefore, the same operation is performed on each sub-band.

[0072] The local regularization process for each sub-band in the second sub-band set (i.e., the current second sub-band mentioned above) involves two steps: smoothing the current second sub-band to obtain the current third sub-band; and performing local regularization on the current third sub-band to obtain the current fourth sub-band, which is a sub-band in the third sub-band set.

[0073] In an optional embodiment, smoothing the cepstral coefficients in the current second sub-band to obtain the current third sub-band includes: determining the average value and standard deviation of each set of cepstral coefficients in the current second sub-band, wherein each set of cepstral coefficients includes n cepstral coefficients, the i-th cepstral coefficient among the n cepstral coefficients represents the characteristics of the i-th audio frame among n audio frames in the target frequency range, the target audio signal includes the n audio frames, n is an integer greater than or equal to 1, and i is an integer greater than or equal to 1; if there is a target standard deviation less than or equal to a preset threshold among the multiple standard deviations corresponding to the multiple sets of cepstral coefficients, all cepstral coefficients in the target set of cepstral coefficients corresponding to the target standard deviation are replaced with the average value of the target set of cepstral coefficients to obtain the current third sub-band.

[0074] In this embodiment, when smoothing the cepstral coefficients in the current second sub-band, the current second sub-band includes multiple sets of cepstral coefficients. Assuming the target audio signal is divided into n audio frames, any set of cepstral coefficients includes n cepstral coefficients. These n cepstral coefficients correspond one-to-one with the n audio frames, and each cepstral coefficient represents a feature within the target frequency range. That is, different cepstral coefficients in a set represent the features of different audio frames within the same target frequency range. Taking sub-band 1 in Figure 5 as the current second sub-band as an example, sub-band 1 includes three sets of cepstral coefficients: the first set is $c[0]_1, c[0]_2, \ldots, c[0]_n$; the second set is $c[1]_1, c[1]_2, \ldots, c[1]_n$; and the third set is $c[2]_1, c[2]_2, \ldots, c[2]_n$.

[0075] When performing smoothing, the cepstral coefficients of different groups are smoothed independently. That is, the mean and standard deviation of each group of cepstral coefficients are determined, and the standard deviation of each group is compared with the preset threshold. If the standard deviation of the target group of cepstral coefficients (i.e., the target standard deviation) is less than or equal to the preset threshold, all cepstral coefficients in the target group are replaced with the mean of the target group.
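
The smoothing rule can be sketched as follows. The sketch assumes the population standard deviation (the text does not say whether the population or sample form is intended), and the function name smooth_subband and the example threshold are hypothetical.

```python
from statistics import mean, pstdev


def smooth_subband(groups, threshold):
    """Replace each low-variation group by its mean.

    groups: list of groups, each holding n coefficients (one per frame).
    threshold: the preset threshold on the standard deviation.
    """
    smoothed = []
    for group in groups:
        if pstdev(group) <= threshold:
            # Standard deviation at or below the threshold:
            # every coefficient in the group becomes the group mean.
            smoothed.append([mean(group)] * len(group))
        else:
            # Groups with larger variation are left unchanged.
            smoothed.append(list(group))
    return smoothed


groups = [[1.0, 1.1, 0.9, 1.0], [0.0, 4.0, 0.0, 4.0]]
result = smooth_subband(groups, threshold=0.5)
print(result)
```

In this example the first group varies only slightly across frames and is flattened to its mean, while the second group, whose standard deviation exceeds the threshold, keeps its original values.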

[0076] Smoothing operations can smooth out noise and highlight more meaningful features.

[0077] In an optional embodiment, local regularization processing is performed on the cepstral coefficients in the current third sub-band to obtain the current fourth sub-band, including: determining the variance of the cepstral coefficients corresponding to n audio frames in the current third sub-band to obtain a target variance set, wherein the target audio signal includes the n audio frames, and the variance in the target variance set corresponds one-to-one with the n audio frames, where n is an integer greater than or equal to 1; determining the local regularization factors corresponding to the n audio frames according to the target variance set to obtain n local regularization factors, wherein the n audio frames correspond one-to-one with the n local regularization factors; and performing local regularization processing on the cepstral coefficients in the current third sub-band according to the n local regularization factors to obtain the current fourth sub-band.

[0078] In this embodiment, local regularization of the current third subband is performed by dividing the cepstral coefficients in the current third subband by the corresponding regularization factor. When performing local regularization, the regularization factors corresponding to different cepstral coefficients may be different.

[0079] Optionally, the local regularization processing is done on a frame-by-frame basis, that is, different regularization factors correspond to different frames in the current third sub-band. In other words, each audio frame corresponds to a local regularization factor, and the cepstral coefficient corresponding to each audio frame is divided by the local regularization factor corresponding to that audio frame. That is, the quotient of the cepstral coefficient corresponding to each audio frame in the current third sub-band and the local regularization factor corresponding to the audio frame among the n local regularization factors is determined as the cepstral coefficient in the current fourth sub-band.

[0080] The local regularization factor for each audio frame is determined from the variance of that audio frame and the sum of the variances of all audio frames in the sub-band. That is, when the target audio signal is divided into n audio frames, the variance of the cepstral coefficients corresponding to each audio frame in the current third sub-band is calculated to obtain n variances, which form the target variance set. The sum of the variances in the target variance set is the target variance sum. The variance of each audio frame is divided by the target variance sum to obtain the local regularization factor of the corresponding audio frame.
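
A minimal Python sketch of this frame-by-frame local regularization is given below. It assumes the sub-band is stored as rows of coefficients with one column per audio frame, uses the population variance, and adds a guard for frames whose variance is zero; the layout, the guard constant eps, and the function name local_regularize are assumptions of the sketch and are not stated in the text.

```python
from statistics import pvariance


def local_regularize(subband, eps=1e-12):
    """Divide the coefficients of each frame by that frame's local factor.

    subband[k][i] is the k-th cepstral coefficient of audio frame i.
    """
    n = len(subband[0])
    # Target variance set: one variance per audio frame (per column).
    variances = [pvariance([row[i] for row in subband]) for i in range(n)]
    total = sum(variances)  # target variance sum
    if total <= 0.0:
        # Degenerate case (all variances zero): leave the sub-band unchanged.
        return [list(row) for row in subband], [1.0] * n
    # Local factor of frame i = variance of frame i / target variance sum.
    # eps avoids division by zero for a zero-variance frame (assumption).
    factors = [max(v / total, eps) for v in variances]
    out = [[row[i] / factors[i] for i in range(n)] for row in subband]
    return out, factors


subband = [[1.0, 2.0], [3.0, 8.0]]
out, factors = local_regularize(subband)
print(factors, out)
```

Here frame 0 has variance 1 and frame 1 has variance 9, so the local factors are 0.1 and 0.9. Because the factors are normalized by the target variance sum, they always add up to 1 when no guard is triggered.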

[0081] Local regularization balances the energy distribution of different frames within a sub-band, which can better suppress noise and improve the signal-to-noise ratio. Local regularization is a time-based local adaptive regularization that uses the variance of all frames within the sub-band as a reference for the regularization factor, which can better preserve the differences between frames and make the feature extraction results more repeatable and robust.

[0082] Step S208: Determine the audio features of the target audio signal based on the target sub-band set.

[0083] In this embodiment, determining the audio features of the target audio signal based on the target sub-band set includes: combining each target sub-band in the target sub-band set to form a first cepstral coefficient matrix; performing nonlinear transformation and filtering on the first cepstral coefficient matrix to obtain a target cepstral coefficient matrix; converting the target cepstral coefficient matrix into Mel cepstral coefficients; and determining the audio features of the target audio signal based on the Mel cepstral coefficients.

[0084] The target subband set is combined to form a new cepstral coefficient matrix. A nonlinear transformation, such as using a sigmoid or Gaussian function, is applied to the cepstral coefficient matrix. The transformed matrix is then filtered, for example, using a high-pass or low-pass filter, to suppress noise and enhance features. The filtered cepstral coefficients are converted to Mel-Cepstral coefficients. Principal component analysis (PCA) is performed on the Mel-Cepstral coefficients to reduce dimensionality. Feature selection is then performed on the dimensionality-reduced feature vectors to select the most representative feature subset. The extracted audio signal features are processed based on the cepstral coefficient matrix, allowing for integration with other speech signal processing techniques (such as deep learning and speech recognition) to improve the algorithm's scalability and versatility.

[0085] The filtering process can be either low-pass or high-pass filtering. Optionally, combining Gaussian functions and low-pass filters during nonlinear transformation and filtering can more accurately capture the frequency domain features of the audio signal and retain useful information while removing noise. After Mel-frequency cepstral coefficient transformation, principal component analysis for dimensionality reduction and feature selection can reduce feature dimensionality, lower computational complexity, and improve the efficiency and accuracy of subsequent recognition model training.
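
The nonlinear transformation and filtering step can be sketched as follows. The sketch applies a sigmoid to every element of the cepstral coefficient matrix and then a simple moving-average filter along each row as a stand-in for a low-pass filter. The moving-average filter, the window width, the filtering axis, and the function names are assumptions; the text only names sigmoid or Gaussian functions and low-pass or high-pass filtering in general, and the Mel conversion, PCA, and feature selection steps are not covered here.

```python
import math


def sigmoid(x):
    """Logistic function used as the nonlinear transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def moving_average(row, width=3):
    """Simple low-pass filter; edge positions use a shorter window."""
    half = width // 2
    out = []
    for i in range(len(row)):
        window = row[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def transform_and_filter(matrix, width=3):
    """Apply the sigmoid element-wise, then smooth each row."""
    return [moving_average([sigmoid(c) for c in row], width) for row in matrix]


matrix = [[0.0, 0.0, 0.0], [-2.0, 0.0, 2.0]]
target = transform_and_filter(matrix)
print(target)
```

A Gaussian function or a high-pass filter could be substituted in the same structure; which combination works best would depend on the task.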

[0086] Through the above embodiments, by grouping the cepstral coefficient set of the target audio signal to obtain the first sub-band set, and performing regularization processing on the cepstral coefficients in each sub-band of the first sub-band set, the energy distribution within the sub-band is balanced, which can effectively suppress noise in the audio signal. This solves the problem in related technologies that cannot effectively suppress noise when extracting audio features, and achieves the effect of enhancing the signal-to-noise ratio of the extracted features.

[0087] Furthermore, the above steps enhance the discriminative power of cepstral features. Discriminative power refers to the separability between features, that is, how effectively each feature can distinguish different audio signals. In the context of speech recognition, if a feature has high discriminative power, it can effectively distinguish different speech signals, or different speakers, different languages, different speech content, etc.

[0088] Furthermore, the feature extraction methods involved in the above embodiments do not require a large amount of training data and model training process, so they can be directly applied to unknown speech signal data, avoiding the overfitting problem that may occur during the training process.

[0089] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0090] This embodiment also provides an audio feature extraction device. Figure 6 is a structural block diagram of an audio feature extraction device according to an embodiment of the present invention. As shown in Figure 6, the device includes:

[0091] The acquisition module 602 is used to acquire the cepstral coefficient set of the target audio signal, wherein the cepstral coefficient set records the cepstral coefficients of the target audio signal;

[0092] Grouping module 604 is used to group the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set, wherein the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients in the cepstral coefficient set;

[0093] The regularization module 606 is used to perform regularization processing on the cepstral coefficients in each subband of the first subband set to obtain a target subband set, wherein the subbands in the target subband set correspond one-to-one with the subbands in the first subband set;

[0094] The determining module 608 is used to determine the audio features of the target audio signal based on the target sub-band set.

[0095] In an optional embodiment, the regularization module includes: a global regularization submodule, configured to perform the following operations on the cepstral coefficients in each subband of the first subband set, wherein the subband in which the operation is performed is the current subband: performing global regularization on the cepstral coefficients in the current subband to obtain the current first subband, wherein the cepstral coefficients in the current first subband correspond one-to-one with the cepstral coefficients in the current subband; and a local regularization submodule, configured to perform local regularization on the cepstral coefficients in the current first subband to obtain the current target subband, wherein the cepstral coefficients in the current target subband correspond one-to-one with the cepstral coefficients in the current first subband.

[0096] In an optional embodiment, the global regularization submodule includes: a first determining unit, configured to determine the variance of all cepstral coefficients included in the current subband to obtain a first variance; a second determining unit, configured to determine a first regularization factor based on the first variance; and a global regularization unit, configured to perform global regularization processing on the cepstral coefficients in the current subband based on the first regularization factor to obtain the current first subband.

[0097] In an optional embodiment, the global regularization unit is used to perform global regularization processing on the cepstral coefficients in the current subband by: determining the quotient of each cepstral coefficient in the current subband with the first regularization factor as the cepstral coefficient in the current first subband.

[0098] In an optional embodiment, the aforementioned local regularization submodule includes: a grouping unit, configured to group the cepstral coefficients in the current first subband to obtain a second subband set, wherein the second subband set includes multiple subbands; a local regularization unit, configured to perform local regularization processing on the cepstral coefficients in each subband of the second subband set to obtain a third subband set, wherein the third subband set includes multiple subbands; and a reconstruction unit, configured to reconstruct each subband in the third subband set to obtain the current target subband.

[0099] In an optional embodiment, the aforementioned local regularization unit includes: a smoothing subunit, configured to perform the following operation on the cepstral coefficients in each subband of the second subband set, wherein the subband in which the operation is performed is the current second subband: performing a smoothing operation on the cepstral coefficients in the current second subband to obtain a current third subband, wherein the cepstral coefficients in the current third subband correspond one-to-one with the cepstral coefficients in the current second subband; and a local regularization subunit, configured to perform local regularization processing on the cepstral coefficients in the current third subband to obtain a current fourth subband, wherein the third subband set includes the current fourth subband.

[0100] In an optional embodiment, the smoothing subunit is used to smooth the cepstral coefficients in the current second subband in the following manner: determining the mean and standard deviation of each set of cepstral coefficients in the current second subband, wherein each set of cepstral coefficients includes n cepstral coefficients, the i-th cepstral coefficient of the n cepstral coefficients represents the characteristics of the i-th audio frame in the n audio frames within the target frequency range, the target audio signal includes the n audio frames, n is an integer greater than or equal to 1, and i is an integer greater than or equal to 1; if there is a target standard deviation less than or equal to a preset threshold among the multiple standard deviations corresponding to the multiple sets of cepstral coefficients, all cepstral coefficients in the target set of cepstral coefficients corresponding to the target standard deviation are replaced with the mean of the target set of cepstral coefficients to obtain the current third subband.

[0101] In an optional embodiment, the aforementioned local regularization subunit is used to perform local regularization processing on the cepstral coefficients in the current third sub-band in the following manner: determining the variance of the cepstral coefficients corresponding to n audio frames in the current third sub-band to obtain a target variance set, wherein the target audio signal includes the n audio frames, and the variance in the target variance set corresponds one-to-one with the n audio frames, where n is an integer greater than or equal to 1; determining the local regularization factors corresponding to the n audio frames according to the target variance set to obtain n local regularization factors, wherein the n audio frames correspond one-to-one with the n local regularization factors; and performing local regularization processing on the cepstral coefficients in the current third sub-band according to the n local regularization factors to obtain the current fourth sub-band.

[0102] In an optional embodiment, the aforementioned local regularization subunit is used to perform local regularization processing on the cepstral coefficients in the current third sub-band according to the n local regularization factors to obtain the current fourth sub-band: the quotient of the cepstral coefficients corresponding to each audio frame in the current third sub-band and the local regularization factors corresponding to the audio frames in the n local regularization factors is determined as the cepstral coefficients in the current fourth sub-band.

[0103] In an optional embodiment, the acquisition module includes: a division submodule, configured to divide the target audio signal into n audio frames, wherein there is partial overlap between any adjacent audio frames among the n audio frames; and an acquisition submodule, configured to acquire cepstral coefficient subsets corresponding to the n audio frames respectively, to obtain n cepstral coefficient subsets, where n is an integer greater than or equal to 1 and i is an integer greater than or equal to 1, wherein the cepstral coefficient set includes the n cepstral coefficient subsets.

[0104] In an optional embodiment, the acquisition submodule includes: a windowing unit, used to window the m-th audio frame, where m is an integer greater than or equal to 1; a fast Fourier transform unit, used to perform a fast Fourier transform on the windowed m-th audio frame to obtain the spectral information of the m-th audio frame; a third determination unit, used to determine the power spectrum of the m-th audio frame based on the spectral information of the m-th audio frame; and a discrete cosine transform unit, used to perform a discrete cosine transform on the power spectrum of the m-th audio frame to obtain the cepstral coefficient subset corresponding to the m-th audio frame.
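
The framing, windowing, Fourier transform, power spectrum, and discrete cosine transform chain can be sketched in plain Python as follows. The Hamming window, the frame length and hop size, the normalization of the power spectrum, and all function names are assumptions of this sketch. A naive discrete Fourier transform is used in place of a fast Fourier transform for brevity; it produces the same spectrum, only more slowly. The discrete cosine transform is applied directly to the power spectrum because the text does not mention an intermediate logarithm.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Split into frames; hop < frame_len makes adjacent frames overlap."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame):
    """Apply a Hamming window (window type is an assumption)."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
            for k, x in enumerate(frame)]


def power_spectrum(frame):
    """One-sided power spectrum via a naive DFT (same result as an FFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec


def dct2(values):
    """Unnormalized DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t, v in enumerate(values))
            for k in range(n)]


def cepstral_subset(frame):
    """Window -> spectrum -> power spectrum -> DCT for one audio frame."""
    return dct2(power_spectrum(hamming(frame)))


signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
frames = split_frames(signal, frame_len=32, hop=16)
coeff_set = [cepstral_subset(f) for f in frames]
print(len(frames), len(coeff_set[0]))
```

With a frame length of 32 and a hop of 16, adjacent frames share half of their samples, which corresponds to the partial overlap between adjacent audio frames described for the division submodule.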

[0105] In an optional embodiment, the determining module includes: a combination submodule for combining the target subbands in the target subband set to form a first cepstral coefficient matrix; a processing submodule for performing nonlinear transformation and filtering on the first cepstral coefficient matrix to obtain a target cepstral coefficient matrix; a conversion submodule for converting the target cepstral coefficient matrix into Mel cepstral coefficients; and a first determining submodule for determining the audio features of the target audio signal based on the Mel cepstral coefficients.

[0106] In an optional embodiment, the grouping module includes: a grouping submodule, configured to group the cepstral coefficients corresponding to each of the n audio frames to obtain r groups of cepstral coefficients corresponding to each audio frame; and a second determining submodule, configured to determine the first sub-band set based on the r groups of cepstral coefficients corresponding to each audio frame, wherein the j-th sub-band in the first sub-band set includes the j-th group of cepstral coefficients in the r groups of cepstral coefficients corresponding to each audio frame, where n is an integer greater than or equal to 1, r is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.
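
The grouping performed by the grouping module can be illustrated with a short Python sketch. It assumes that each frame's coefficients are split into r contiguous groups of equal size, with any remainder assigned to the last group; the splitting rule, the frame-major layout, and the function name group_into_subbands are assumptions of this sketch.

```python
def group_into_subbands(coeff_set, r):
    """Build the first sub-band set from per-frame cepstral coefficients.

    coeff_set[i] is the list of cepstral coefficients of audio frame i.
    Returns r sub-bands; subbands[j][i] is the j-th group of frame i.
    """
    num_coeffs = len(coeff_set[0])
    size = num_coeffs // r
    if size == 0:
        raise ValueError("r must not exceed the number of coefficients")
    subbands = []
    for j in range(r):
        start = j * size
        # Assumption: the last group absorbs any leftover coefficients.
        stop = (j + 1) * size if j < r - 1 else num_coeffs
        subbands.append([frame[start:stop] for frame in coeff_set])
    return subbands


coeff_set = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],      # frame 1
    [7.0, 8.0, 9.0, 10.0, 11.0, 12.0],   # frame 2
]
subbands = group_into_subbands(coeff_set, r=3)
print(subbands)
```

Each resulting sub-band collects the same group index from every audio frame, so the j-th sub-band covers the same range of coefficient indices across all n frames.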

[0107] It should be noted that the above modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, but are not limited to: all the above modules are located in the same processor; or, the above modules are located in different processors in any combination.

[0108] Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.

[0109] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0110] Embodiments of the present invention also provide an electronic device including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the steps in any of the above method embodiments.

[0111] In one exemplary embodiment, the electronic device may further include a transmission device and an input / output device, wherein the transmission device is connected to the processor and the input / output device is connected to the processor.

[0112] Specific examples in this embodiment can be found in the examples described in the above embodiments and exemplary implementations, and will not be repeated here.

[0113] It is obvious to those skilled in the art that the modules or steps of the present invention described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. They can be implemented using computer-executable program code, and thus can be stored in a storage device for execution by a computing device. In some cases, the steps shown or described can be performed in a different order than those described herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any particular combination of hardware and software.

[0114] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, or improvements made within the principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. An audio feature extraction method, characterized by comprising: obtaining a cepstral coefficient set of a target audio signal, wherein the cepstral coefficient set records the cepstral coefficients of the target audio signal; grouping the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set, wherein the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients in the cepstral coefficient set; performing regularization processing on the cepstral coefficients in each sub-band of the first sub-band set to obtain a target sub-band set, wherein the sub-bands in the target sub-band set correspond one-to-one with the sub-bands in the first sub-band set; and determining the audio features of the target audio signal based on the target sub-band set; wherein performing regularization processing on the cepstral coefficients in each sub-band of the first sub-band set to obtain the target sub-band set includes: performing the following operations on the cepstral coefficients in each sub-band of the first sub-band set, where the sub-band on which the operations are performed is the current sub-band: determining the variance of all cepstral coefficients included in the current sub-band to obtain a first variance; determining a first regularization factor based on the first variance; determining the quotient of each cepstral coefficient in the current sub-band and the first regularization factor as a cepstral coefficient in the current first sub-band, wherein the cepstral coefficients in the current first sub-band correspond one-to-one with the cepstral coefficients in the current sub-band; performing local regularization on the cepstral coefficients in the current first sub-band to obtain the current target sub-band, wherein the cepstral coefficients in the current target sub-band correspond one-to-one with the cepstral coefficients in the current first sub-band; and determining the current target sub-band as a sub-band in the target sub-band set.

2. The method according to claim 1, characterized in that performing local regularization on the cepstral coefficients in the current first sub-band to obtain the current target sub-band includes: grouping the cepstral coefficients in the current first sub-band to obtain a second sub-band set, wherein the second sub-band set includes multiple sub-bands; performing local regularization on the cepstral coefficients in each sub-band of the second sub-band set to obtain a third sub-band set, wherein the third sub-band set includes multiple sub-bands; and reconstructing each sub-band in the third sub-band set to obtain the current target sub-band.

3. The method according to claim 2, characterized in that performing local regularization on the cepstral coefficients in each sub-band of the second sub-band set to obtain the third sub-band set includes: performing the following operations on the cepstral coefficients in each sub-band of the second sub-band set, where the sub-band on which the operations are performed is the current second sub-band: performing a smoothing operation on the cepstral coefficients in the current second sub-band to obtain the current third sub-band, wherein the cepstral coefficients in the current third sub-band correspond one-to-one with the cepstral coefficients in the current second sub-band; and performing local regularization on the cepstral coefficients in the current third sub-band to obtain the current fourth sub-band, wherein the third sub-band set includes the current fourth sub-band.

4. The method according to claim 3, characterized in that performing the smoothing operation on the cepstral coefficients in the current second sub-band to obtain the current third sub-band includes: determining the mean and standard deviation of each set of cepstral coefficients in the current second sub-band, wherein each set of cepstral coefficients includes n cepstral coefficients, and the i-th cepstral coefficient among the n cepstral coefficients represents the characteristics of the i-th audio frame among n audio frames in the target frequency range, wherein the target audio signal includes the n audio frames, n is an integer greater than or equal to 1, and i is an integer greater than or equal to 1; and if there is a target standard deviation less than or equal to a preset threshold among the multiple standard deviations corresponding to the multiple sets of cepstral coefficients, replacing all cepstral coefficients in the target set of cepstral coefficients corresponding to the target standard deviation with the mean of the target set of cepstral coefficients to obtain the current third sub-band.

5. The method according to claim 3, characterized in that performing local regularization on the cepstral coefficients in the current third sub-band to obtain the current fourth sub-band comprises: determining, for each of n audio frames, the variance of the cepstral coefficients corresponding to that audio frame in the current third sub-band, to obtain a target variance set, wherein the target audio signal includes the n audio frames, the variances in the target variance set correspond one-to-one with the n audio frames, and n is an integer greater than or equal to 1; determining, based on the target variance set, the local regularization factor corresponding to each of the n audio frames, to obtain n local regularization factors, wherein the n audio frames correspond one-to-one with the n local regularization factors; and performing local regularization on the cepstral coefficients in the current third sub-band according to the n local regularization factors to obtain the current fourth sub-band.

6. The method according to claim 5, characterized in that performing local regularization on the cepstral coefficients in the current third sub-band according to the n local regularization factors to obtain the current fourth sub-band comprises: determining each cepstral coefficient in the current fourth sub-band as the quotient of the cepstral coefficient corresponding to an audio frame in the current third sub-band and the local regularization factor, among the n local regularization factors, corresponding to that audio frame.
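
Claims 5 and 6 together describe a per-frame variance, a per-frame factor, and a division. A minimal sketch is given below, assuming a sub-band array with one row per cepstral coefficient and one column per audio frame. The claims say only that each factor is determined "based on" the variance; taking the square root of the variance plus a small constant `eps` is an assumption made here for illustration, and the name `local_regularize` is hypothetical.

```python
import numpy as np


def local_regularize(subband: np.ndarray, eps: float = 1e-8):
    """Divide each frame's coefficients by a per-frame factor.

    subband: shape (coefficients, frames).
    Returns (regularized sub-band, per-frame factors).
    """
    variances = subband.var(axis=0)        # target variance set, one per frame
    # ASSUMPTION: factor = sqrt(variance + eps); the claims do not fix this.
    factors = np.sqrt(variances + eps)
    # Quotient of each coefficient and its frame's factor (claim 6).
    return subband / factors, factors
```

With this choice of factor, each frame's coefficients end up with approximately unit variance within the sub-band, which is one way the energy distribution could be balanced.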

7. The method according to claim 1, characterized in that obtaining the cepstral coefficient set of the target audio signal comprises: dividing the target audio signal into n audio frames, wherein any two adjacent audio frames among the n audio frames partially overlap; and obtaining the cepstral coefficient subset corresponding to each of the n audio frames, to obtain n cepstral coefficient subsets, where n is an integer greater than or equal to 1 and i is an integer greater than or equal to 1, wherein the cepstral coefficient set includes the n cepstral coefficient subsets.
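
The overlapping framing step of claim 7 might look like the sketch below. The frame length and hop size are assumptions (the claim requires only that adjacent frames partially overlap), trailing samples that do not fill a whole frame are dropped, and the name `split_into_frames` is hypothetical.

```python
import numpy as np


def split_into_frames(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of equal length.

    Returns an array of shape (n_frames, frame_len). Requiring
    0 < hop < frame_len guarantees that adjacent frames overlap.
    """
    signal = np.asarray(signal)
    if not 0 < hop < frame_len:
        raise ValueError("hop must satisfy 0 < hop < frame_len")
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row m holds the sample indices m*hop ... m*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]
```

For example, a frame length of 4 samples with a hop of 2 gives a 50% overlap between neighbouring frames.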

8. The method according to claim 7, characterized in that obtaining the cepstral coefficient subset corresponding to the m-th audio frame among the n audio frames comprises: windowing the m-th audio frame, where m is an integer greater than or equal to 1; performing a fast Fourier transform on the windowed m-th audio frame to obtain spectral information of the m-th audio frame; determining the power spectrum of the m-th audio frame based on the spectral information of the m-th audio frame; and performing a discrete cosine transform on the power spectrum of the m-th audio frame to obtain the cepstral coefficient subset corresponding to the m-th audio frame.
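
The four steps of claim 8 (window, FFT, power spectrum, DCT) can be sketched with NumPy alone. Several choices below are assumptions not stated in the claim: a Hamming window, a one-sided FFT, power scaled by the frame length, and an unnormalized DCT-II. Conventional cepstral pipelines usually take a logarithm before the DCT; the sketch follows the claim wording literally and omits it. The names `dct_ii` and `frame_cepstrum` are hypothetical.

```python
import numpy as np


def dct_ii(x: np.ndarray) -> np.ndarray:
    """Unnormalized DCT-II: y[k] = sum_n x[n] * cos(pi * (2n + 1) * k / (2N))."""
    n_pts = len(x)
    k = np.arange(n_pts)[:, None]
    n = np.arange(n_pts)[None, :]
    basis = np.cos(np.pi * (2 * n + 1) * k / (2 * n_pts))
    return basis @ x


def frame_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Window -> FFT -> power spectrum -> DCT for a single audio frame."""
    windowed = frame * np.hamming(len(frame))      # ASSUMPTION: Hamming window
    spectrum = np.fft.rfft(windowed)               # one-sided spectrum
    power = np.abs(spectrum) ** 2 / len(frame)     # power spectrum (assumed scaling)
    return dct_ii(power)                           # cepstral coefficient subset
```

Applying `frame_cepstrum` to each of the n frames and stacking the results would give one cepstral coefficient subset per frame.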

9. The method according to claim 1, characterized in that determining the audio features of the target audio signal based on the target sub-band set comprises: combining the target sub-bands in the target sub-band set to form a first cepstral coefficient matrix; performing nonlinear transformation and filtering on the first cepstral coefficient matrix to obtain a target cepstral coefficient matrix; converting the target cepstral coefficient matrix into Mel-frequency cepstral coefficients; and determining the audio features of the target audio signal based on the Mel-frequency cepstral coefficients.

10. The method according to claim 1, characterized in that grouping the cepstral coefficients in the cepstral coefficient set to obtain the first sub-band set comprises: grouping the cepstral coefficients corresponding to each of n audio frames to obtain r groups of cepstral coefficients corresponding to each audio frame; and determining the first sub-band set based on the r groups of cepstral coefficients corresponding to each audio frame, wherein the j-th sub-band in the first sub-band set includes the j-th group among the r groups of cepstral coefficients corresponding to each audio frame, where n is an integer greater than or equal to 1, r is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.
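
One possible reading of the grouping in claim 10 is sketched below. It assumes the cepstral coefficient set is held as a matrix with one row per coefficient index and one column per audio frame, and that the r groups are contiguous, near-equal ranges of coefficient indices. The claim does not say how the groups are chosen, and the name `group_into_subbands` is hypothetical.

```python
import numpy as np
from typing import List


def group_into_subbands(coeffs: np.ndarray, r: int) -> List[np.ndarray]:
    """Split a (coefficients, frames) matrix into r sub-bands.

    Every frame's coefficients are cut at the same indices, so the j-th
    sub-band contains the j-th group of coefficients from every frame.
    """
    # ASSUMPTION: contiguous, near-equal groups along the coefficient axis.
    return np.array_split(coeffs, r, axis=0)
```

Because all frames are split at the same positions, stacking the returned sub-bands back together in order recovers the original matrix.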

11. An audio feature extraction device, characterized in that it comprises: an acquisition module, configured to acquire a cepstral coefficient set of a target audio signal, wherein the cepstral coefficient set records the cepstral coefficients of the target audio signal; a grouping module, configured to group the cepstral coefficients in the cepstral coefficient set to obtain a first sub-band set, wherein the first sub-band set includes multiple sub-bands, and each sub-band includes multiple cepstral coefficients in the cepstral coefficient set; a regularization module, configured to perform regularization processing on the cepstral coefficients in each sub-band of the first sub-band set to obtain a target sub-band set, wherein the sub-bands in the target sub-band set correspond one-to-one with the sub-bands in the first sub-band set; and a determining module, configured to determine the audio features of the target audio signal based on the target sub-band set; wherein the device is further configured to perform the following operations on the cepstral coefficients in each sub-band of the first sub-band set, the sub-band being operated on being the current sub-band: determining the variance of all cepstral coefficients included in the current sub-band to obtain a first variance; determining a first regularization factor based on the first variance; determining the quotient of each cepstral coefficient in the current sub-band and the first regularization factor as a cepstral coefficient in the current first sub-band, wherein the cepstral coefficients in the current first sub-band correspond one-to-one with the cepstral coefficients in the current sub-band; performing local regularization processing on the cepstral coefficients in the current first sub-band to obtain a current target sub-band, wherein the cepstral coefficients in the current target sub-band correspond one-to-one with the cepstral coefficients in the current first sub-band; and determining the current target sub-band as a sub-band in the target sub-band set.
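
The global regularization operations recited in claim 11 (first variance, first regularization factor, quotient) can be illustrated as follows. As with the local factors, the claim states only that the factor is determined "based on" the variance; the square root of the variance plus a small constant `eps` is an assumption made for this sketch, and the name `global_regularize` is hypothetical.

```python
import numpy as np


def global_regularize(subband: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Divide every coefficient in a sub-band by a single shared factor."""
    first_variance = subband.var()                 # variance over all coefficients
    # ASSUMPTION: factor = sqrt(variance + eps); the claim does not fix this.
    first_factor = np.sqrt(first_variance + eps)
    return subband / first_factor                  # the current first sub-band
```

The output has the same shape as the input, which preserves the one-to-one correspondence between coefficients, and it would then be passed on to the local regularization step.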

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.

13. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 10.