Audio encoding and decoding method, apparatus, device, medium and program product

By processing audio signals in layers and configuring transmission priorities, the efficiency and quality issues of audio encoding and decoding technology under different network bandwidth scenarios are solved, achieving flexible adaptation and efficient encoding and decoding.

CN119274562B · Active · Publication Date: 2026-06-26 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2022-06-15
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing audio encoding and decoding technologies are inefficient and produce poor voice quality under different network bandwidth scenarios, lack flexibility, and are particularly computationally complex under low bandwidth conditions, making them difficult to apply.

Method used

The audio signal is processed in layers, decomposed into low-frequency and high-frequency sub-band signals, and multi-level feature extraction and quantization encoding are performed. Transmission priority is configured according to the layers to flexibly adapt to different network bandwidths.

Benefits of technology

It improves encoding and decoding efficiency, reduces the encoding and decoding complexity at each layer, ensures audio quality under different network bandwidth scenarios, and is suitable for a variety of application scenarios.



Abstract

The application provides an audio encoding and decoding method, device, equipment, medium and program product; wherein the audio encoding method comprises: decomposing an audio signal to obtain a low-frequency sub-band signal and a high-frequency sub-band signal; performing multi-level feature extraction processing based on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain sub-band signal features corresponding to multiple levels respectively; performing quantization processing on the sub-band signal features corresponding to each level to obtain index values of the sub-band signal features; performing encoding processing on the index values of the sub-band signal features to obtain code streams corresponding to the levels; and configuring corresponding transmission priorities for the code streams corresponding to the multiple levels respectively; wherein the transmission priority is positively correlated with a decoding quality index of the code stream corresponding to the level. The application can be flexibly applied to different network bandwidth application scenarios, and the efficiency and quality of audio encoding and decoding are improved.

Description

[0001] This application is a divisional application of the Chinese patent application filed on June 15, 2022, with application number 202210681816.X and title "Audio Encoding and Decoding Method, Apparatus, Device, Medium and Program Product".

Technical Field

[0002] This application relates to audio processing technology, and more particularly to an audio encoding and decoding method, apparatus, device, medium, and program product.

Background Technology

[0003] Audio codec technology is a core technology in communication services, including remote audio and video calls. Traditional codec technologies are based on transformations in the time and frequency domains; for example, the various standard voice codec protocols fall into this category. Taking encoding as an example, both time-domain coding and frequency-domain coding are essentially compression methods based on signal processing. While maintaining a certain level of voice quality, it is difficult to significantly reduce the encoding bitrate, because the two constrain each other.

[0004] Artificial intelligence (AI) is a comprehensive technology in computer science, and its application in audio coding is increasing. For example, deep learning-based encoding and decoding technologies aim to achieve higher voice quality than traditional encoding and decoding technologies at low bit rates. However, the high computational complexity of AI-based encoding and decoding technologies affects encoding efficiency, and they are not suitable for low-bandwidth applications at higher bit rates.

[0005] In summary, related technologies currently offer no effective solution for improving encoding/decoding efficiency and voice quality in application scenarios with different network bandwidths.

Summary of the Invention

[0006] This application provides an audio encoding and decoding method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can be flexibly applied to application scenarios with different network bandwidths, improving the efficiency and quality of audio encoding and decoding.

[0007] The technical solution of this application embodiment is implemented as follows:

[0008] This application provides an audio encoding method, including:

[0009] The audio signal is decomposed to obtain low-frequency subband signals and high-frequency subband signals;

[0010] Based on the low-frequency sub-band signal and the high-frequency sub-band signal, feature extraction processing is performed at multiple levels to obtain the sub-band signal features corresponding to the multiple levels respectively.

[0011] The sub-band signal features corresponding to each level are quantized to obtain the index value of the sub-band signal features;

[0012] The index values of the sub-band signal features are encoded to obtain the code stream corresponding to the level;

[0013] Each of the multiple layers is configured with a corresponding transmission priority for its bitstream; wherein the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer.
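As an illustration of the priority configuration step, the following minimal sketch ranks layers by a decoding quality index and gives the layer with the larger index the larger priority value. The function name `assign_priorities`, the layer names, and the numeric quality indices are hypothetical; the application does not prescribe here how the index is computed or how priorities are represented.

```python
def assign_priorities(quality_index):
    """Map each layer to an integer priority; a larger quality index gets a larger priority.

    quality_index: dict mapping a layer name to its (hypothetical) decoding quality index.
    """
    # Sort layers from least to most important for decoding quality.
    ordered = sorted(quality_index, key=quality_index.get)
    # The position in the sorted order is the priority, so the mapping is monotonic.
    return {layer: rank for rank, layer in enumerate(ordered)}


# Hypothetical example: the low-frequency layer contributes most to decoded quality.
priorities = assign_priorities({"layer_1_low": 0.9, "layer_2_high": 0.4})
print(priorities)
```

Any monotonically increasing mapping would satisfy the stated positive correlation; a rank is used here only because it is the simplest one.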

[0014] This application provides an audio decoding method, including:

[0015] Decoding is performed on the bitstreams corresponding to multiple layers to obtain the index value of the bitstream corresponding to each layer; wherein, different layers correspond to different transmission priorities, and the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer;

[0016] The index value of the code stream corresponding to each level is inversely quantized to obtain the sub-band signal features corresponding to each level.

[0017] Feature reconstruction processing is performed on the sub-band signal features corresponding to each level to obtain the sub-band signal corresponding to each level.

[0018] The sub-band signals corresponding to the multiple layers are combined into an audio signal.
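The relationship between the quantization step of the encoding method and the inverse quantization step of the decoding method can be sketched as a round trip. The sketch below assumes a uniform scalar quantizer, which is a simplification chosen for illustration and not a quantizer specified by this application; the step size and feature values are hypothetical.

```python
def quantize(features, step):
    """Uniform scalar quantization: map each feature value to an integer index."""
    return [round(value / step) for value in features]


def dequantize(indices, step):
    """Inverse quantization: map each index back to a reconstructed feature value."""
    return [index * step for index in indices]


step = 0.25                  # hypothetical quantization step
features = [0.1, -0.6, 1.3]  # hypothetical sub-band signal features
indices = quantize(features, step)
restored = dequantize(indices, step)
print(indices, restored)
```

Only the integer index values need to be encoded into the code stream; with this quantizer the reconstruction error of each feature is at most half the step size.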

[0019] This application provides an audio encoding device, including:

[0020] The decomposition module is used to decompose the audio signal to obtain low-frequency subband signals and high-frequency subband signals;

[0021] The feature extraction module is used to perform multi-level feature extraction processing based on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain the sub-band signal features corresponding to the multiple levels respectively.

[0022] The quantization module is used to quantize the sub-band signal features corresponding to each level to obtain the index value of the sub-band signal features;

[0023] The encoding module is used to encode the index values of the sub-band signal features to obtain the code stream corresponding to the layer.

[0024] The configuration module is used to configure the corresponding transmission priority for the bitstreams corresponding to the multiple layers respectively; wherein the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer.

[0025] This application provides an audio decoding device, including:

[0026] The decoding module is used to decode the bitstreams corresponding to multiple layers respectively to obtain the index value of the bitstream corresponding to each layer; wherein, different layers correspond to different transmission priorities, and the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer;

[0027] The inverse quantization module is used to perform inverse quantization processing on the index value of the code stream corresponding to each level to obtain the sub-band signal features corresponding to each level.

[0028] The feature reconstruction module is used to perform feature reconstruction processing on the sub-band signal features corresponding to each level to obtain the sub-band signal corresponding to each level.

[0029] The synthesis module is used to synthesize the sub-band signals corresponding to the multiple layers into an audio signal.

[0030] This application provides an electronic device, including:

[0031] Memory, used to store executable instructions;

[0032] The processor, when executing executable instructions stored in the memory, implements the audio encoding method and audio decoding method provided in the embodiments of this application.

[0033] This application provides a computer-readable storage medium storing executable instructions, which, when executed by a processor, implement the audio encoding and audio decoding methods provided in this application.

[0034] This application provides a computer program product including computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the electronic device to perform the audio encoding and audio decoding methods described above in this application.

[0035] The embodiments of this application have the following beneficial effects:

[0036] By acquiring the sub-band signal features of the audio sub-band signal at each level in a layered manner, and encoding the corresponding sub-band signal features at each level, each level only needs to encode specific sub-band signal features, instead of encoding the features of the entire audio signal. This not only improves the efficiency of encoding and decoding but also reduces the encoding and decoding complexity of each level. Based on the importance of different levels of bitstream to decoding quality, different transmission priorities can be flexibly configured for different levels of bitstream, ensuring that more important bitstreams are transmitted first. This approach is applicable to application scenarios with different network bandwidths.

Attached Figure Description

[0037] Figure 1 is a schematic diagram comparing the spectrum at different code rates provided in the embodiments of this application;

[0038] Figure 2 is a schematic diagram of the architecture of the audio codec system 100 provided in the embodiments of this application;

[0039] Figure 3 is a schematic diagram of a voice communication link provided in an embodiment of this application;

[0040] Figure 4A is a schematic diagram of the structure of terminal 401 provided in an embodiment of this application;

[0041] Figure 4B is a schematic diagram of the structure of terminal 402 provided in an embodiment of this application;

[0042] Figures 5A-5G are flowcharts illustrating the audio encoding method provided in an embodiment of this application;

[0043] Figures 6A-6G are flowcharts illustrating the audio decoding method provided in an embodiment of this application;

[0044] Figure 7A is a schematic diagram illustrating a single-level encoding and decoding process provided in an embodiment of this application;

[0045] Figure 7B is a schematic diagram illustrating two levels of encoding and decoding provided in an embodiment of this application;

[0046] Figure 7C is a schematic diagram illustrating three levels of encoding and decoding provided in an embodiment of this application;

[0047] Figure 7D is a schematic diagram of the spectral response of the QMF filter bank provided in the embodiments of this application;

[0048] Figure 7E is a schematic diagram of the bandwidth extension provided in an embodiment of this application;

[0049] Figure 8A is a schematic diagram of a regular convolutional network and a dilated convolutional network provided in the embodiments of this application;

[0050] Figure 8B is a schematic diagram of the structure of a neural network for performing the first feature extraction process provided in an embodiment of this application;

[0051] Figure 8C is a schematic diagram of the structure of a neural network for performing the third feature extraction process provided in an embodiment of this application;

[0052] Figure 8D is a schematic diagram of the structure of a neural network for performing the first feature reconstruction process provided in an embodiment of this application;

[0053] Figure 8E is a schematic diagram of the structure of a neural network for performing the third feature reconstruction process provided in an embodiment of this application.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0055] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0056] In the following description, the terms "first, second, third" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0057] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0058] It is understood that in the embodiments of this application, data such as user information are involved. When the embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0059] Before providing a further detailed description of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained, and the nouns and terms involved in the embodiments of this application shall be interpreted as follows.

[0060] 1) Audio Coding: Audio coding is an application of data compression for audio signals containing speech. Audio coding uses speech-specific parameter estimation, employs audio signal processing techniques to model the speech signal, and combines general data compression algorithms to represent the generated modeling parameters in a compact bitstream.

[0061] 2) Quadrature Mirror Filter (QMF): Used to decompose an audio signal into two sub-band signals with equal bandwidth, namely, a high-frequency sub-band signal and a low-frequency sub-band signal.

[0062] 3) Scalable Coding: A technique for ensuring compatibility with different terminal devices and link bandwidths. Its characteristic is that the bitstream is layered; lower-level bitstreams can be decoded independently, while higher-level bitstreams enhance audio quality.

[0063] 4) Neural Network (NN): A mathematical model of an algorithm that mimics the behavioral characteristics of animal neural networks to perform distributed parallel information processing. Neural networks rely on the complexity of the system to process information by adjusting the relationships between a large number of interconnected nodes.

[0064] 5) Deep Learning (DL): A type of machine learning that combines low-level features to form more abstract high-level features to represent attribute categories or features, and is used to discover distributed feature representations of data.

[0065] 6) Entropy coding: This refers to coding that does not lose any information during the encoding process according to the entropy principle. Information entropy is the average amount of information in the source. Common entropy coding methods include Shannon coding, Huffman coding, and arithmetic coding.
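To make the notion of information entropy in item 6) concrete, the following sketch computes the average information per symbol for a short sequence of index values. The sequence is hypothetical and serves only as an illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Average information per symbol, in bits, of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sequence of quantization index values.
indices = [0, 0, 0, 0, 1, 1, 2, 3]
print(shannon_entropy(indices))
```

For this sequence the entropy is 1.75 bits per symbol, whereas a fixed-length code for four distinct symbols needs 2 bits per symbol; an entropy coder such as Huffman or arithmetic coding exploits this gap without losing information.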

[0066] The applicant found that, for audio coding based on signal processing, rate-distortion analysis combined with past standardization experience leads to the following conclusions: a bitrate of at least 0.75 bits per sample is required to provide ideal speech quality; for music encoding, a bitrate of at least 1.5 bits per sample is required. Typically, for a wideband speech signal with a sampling rate of 16000Hz, the encoded bitrate is 12kbps; for an ultrawideband speech signal with a sampling rate of 32000Hz, the encoded bitrate is 48kbps.
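The bitrates quoted above follow from multiplying the sampling rate by the number of bits per sample: 12kbps corresponds to 0.75 bits per sample at 16000Hz, and 48kbps corresponds to 1.5 bits per sample at 32000Hz. The helper name `bitrate_kbps` below is illustrative only.

```python
def bitrate_kbps(sample_rate_hz, bits_per_sample):
    """Encoded bitrate in kbps for a given sampling rate and bits per sample."""
    return sample_rate_hz * bits_per_sample / 1000


print(bitrate_kbps(16000, 0.75))  # 16000 Hz at 0.75 bits per sample
print(bitrate_kbps(32000, 1.5))   # 32000 Hz at 1.5 bits per sample
```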

[0067] See Figure 1, which is a schematic diagram comparing the spectrum at different code rates provided in the embodiments of this application. In Figure 1, 101 represents the spectrum of the original audio signal, i.e., the spectrum of the uncompressed audio signal; 102 represents the spectrum of the audio signal reconstructed from data encoded at a bitrate of 20kbps; and 103 represents the spectrum of the audio signal reconstructed from data encoded at a bitrate of 6kbps. As shown in Figure 1, the higher the bitrate of the encoded data, the closer the spectrum of the restored audio signal is to the spectrum of the original audio signal.

[0068] In audio encoding / decoding method 1 of related technologies, the encoder extracts typical speech features based on traditional signal processing methods, such as line spectral frequency (LSF) coefficients. For wideband speech, 10-16 LSF coefficients can be extracted per frame, along with the speech frame energy. The encoder compresses these speech features. The decoder decodes these speech features and calls a speech generation model, such as a generative network like WaveNet, to generate the speech signal. This method can achieve a bitrate below 2kbps. However, the computational complexity of the decoder is very high due to the need to call generative networks like WaveNet to generate the speech signal, posing a significant challenge for use on mobile terminals. Furthermore, the audio quality obtained through decoding is inferior to that obtained through signal processing methods.

[0069] In audio encoding / decoding method 2 of the related technology, the encoding end uses a convolutional network or autoencoder to convert the speech signal into a feature vector. Then, the feature vector is encoded. The decoding end is the reverse process of the encoding end. Using the feature vector obtained from decoding, the corresponding network is called to generate the speech signal.

[0070] The applicant found that the two methods mentioned above mainly generate feature vectors through audio signal analysis or encoding networks, and then encode these feature vectors. However, since both the encoding and decoding ends use deep learning networks, the computational complexity is very high, requiring huge computing resources to achieve audio encoding and decoding. This results in low encoding and decoding efficiency, making it unsuitable for application scenarios with different network bandwidths and lacking flexibility.

[0071] This application provides an audio encoding and decoding method, apparatus, electronic device, storage medium, and program product, which can be flexibly applied to application scenarios with different network bandwidths, improving the efficiency and quality of audio encoding and decoding. The following describes exemplary applications of the electronic device for audio encoding and decoding provided in this application. The electronic device for audio encoding and decoding provided in this application can be implemented as various types of user terminals such as laptops, tablets, desktop computers, set-top boxes, and mobile devices (e.g., mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable gaming devices, and in-vehicle terminals), or it can be implemented as a server. The following describes exemplary applications when the electronic device is implemented as a server.

[0072] The audio encoding and decoding methods provided in this application can be executed independently or collaboratively by a terminal or server. See Figure 2, which is a schematic diagram of the architecture of the audio encoding and decoding system 100 provided in this application embodiment, including terminal 401 and terminal 402. Terminal 401 is connected to terminal 402 through network 300, which can be a wide area network, a local area network, or a combination of both.

[0073] In some embodiments, taking a Voice over Internet Protocol (VoIP) conferencing system as an example, terminal 401 includes a conferencing client. When a user uses the conferencing client, an audio signal is generated. The encoder deployed in terminal 401 performs layered encoding processing on the audio signal to be encoded generated by terminal 401 to obtain multiple bitstreams corresponding to different layers. Then, different transmission priorities are configured for the obtained bitstreams, and the bitstreams are transmitted to terminal 402 according to the corresponding transmission priorities. The decoder deployed in terminal 402 performs layered decoding and synthesis processing on the received bitstreams to obtain an audio signal, and plays the decoded audio signal through the corresponding conferencing client in terminal 402.
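One way to picture transmission according to priority is a sender that, for a given bandwidth budget, sends bitstreams in descending priority and stops when the next one no longer fits. This greedy policy, the function name `select_streams`, and all layer names and bitrates below are assumptions made for illustration; the application only states that bitstreams are transmitted according to their configured priorities.

```python
def select_streams(streams, budget_kbps):
    """Pick bitstreams in descending priority until the bandwidth budget is used up.

    streams: list of (name, priority, bitrate_kbps) tuples; larger priority is sent first.
    """
    chosen = []
    remaining = budget_kbps
    for name, _priority, rate in sorted(streams, key=lambda s: s[1], reverse=True):
        if rate > remaining:
            break  # stop at the first layer that does not fit; lower priorities are not sent
        chosen.append(name)
        remaining -= rate
    return chosen


# Hypothetical layers: (name, priority, bitrate in kbps).
streams = [("layer_2", 1, 6.0), ("layer_1", 2, 6.0), ("layer_3", 0, 4.0)]
print(select_streams(streams, budget_kbps=12.0))
print(select_streams(streams, budget_kbps=8.0))
```

With a 12kbps budget the two most important layers are sent; with 8kbps only the most important one is, so the receiver can still decode a lower-quality signal.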

[0074] In addition to being applicable to meeting scenarios, the audio codec system 100 in this application embodiment can also be applied to voice and video chat in instant messaging clients, voice chat in game clients, and voice and video chat in online live streaming rooms.

[0075] In some embodiments, the server may be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Terminals 401 and 402 may be smartphones, tablets, laptops, desktop computers, smart speakers, smartwatches, in-vehicle terminals, etc., but are not limited to these.

[0076] In other embodiments, the embodiments of this application can be implemented with the aid of cloud technology, which refers to a hosting technology that unifies a series of resources such as hardware, software, and networks within a wide area network or local area network to realize the computation, storage, processing, and sharing of data.

[0077] Cloud technology is a general term encompassing network technology, information technology, integration technology, management platform technology, and application technology based on the cloud computing business model. It can form resource pools, allowing for on-demand use with flexibility and convenience. Cloud computing technology will become a crucial support. The backend services of cloud computing systems require substantial computing and storage resources.

[0078] In some embodiments, see Figure 3, which is a schematic diagram of a voice communication link provided in an embodiment of this application. An encoder can be deployed on the uplink client, and a decoder on the downlink client. The encoder acquires the audio signal from the uplink client and performs preprocessing enhancements on the audio signal, such as feature extraction. The enhanced speech features are then encoded and subjected to noise reduction performance improvement processing to obtain the bitstream, which is then transmitted to the downlink client where the decoder is located. The decoder in the downlink client decodes the received bitstream, performs noise reduction performance improvement processing, and further applies enhancements and sound effects to the decoded result, thereby restoring the original audio signal as closely as possible for playback on the downlink client.

[0079] It's important to note that, for backward compatibility, a transcoder (such as a combination of an NN decoder and a G.722 encoder) can be deployed in the background of the audio encoding / decoding system to achieve interoperability between the new encoder and existing encoders. For example, if the transmitting end uses a new NN encoder and the receiving end uses a G.722 decoder, the receiver's G.722 decoder cannot decode the bitstream directly transmitted by the transmitting end, which is encoded by the NN encoder. Therefore, in the background (i.e., on the server), the NN decoder can decode the bitstream obtained by the transmitting end through the NN encoder to generate an audio signal. Then, the G.722 encoder can be called to generate a specific bitstream, which is then sent to the receiving end for correct decoding.

[0080] See Figure 4A, which is a schematic diagram of the structure of terminal 401 provided in an embodiment of this application. The terminal 401 shown in Figure 4A includes at least one processor 410, a memory 430, and at least one network interface 420. The various components in terminal 401 are coupled together via a bus system 440. It is understood that the bus system 440 is used to implement communication between these components. In addition to a data bus, the bus system 440 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 440 in Figure 4A.

[0081] The processor 410 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0082] The memory 430 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state storage, hard disk drives, optical disk drives, etc. The memory 430 may optionally include one or more storage devices physically located away from the processor 410.

[0083] The memory 430 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 430 described in this application embodiment is intended to include any suitable type of memory.

[0084] In some embodiments, memory 430 is capable of storing data to support various operations, examples of which include programs, modules, and data structures or subsets or supersets thereof, as illustrated below.

[0085] Operating system 431 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, and driver layer, for implementing various basic business functions and handling hardware-based tasks.

[0086] The network communication module 432 is used to reach other computing devices via one or more (wired or wireless) network interfaces 420, such as Bluetooth, WiFi, and Universal Serial Bus (USB).

[0087] In some embodiments, the audio encoding apparatus provided in this application can be implemented in software. Figure 4A shows an audio encoding device 433 stored in memory 430. It can be software in the form of programs and plug-ins, including the following software modules: decomposition module 4331, feature extraction module 4332, quantization module 4333, encoding module 4334, and configuration module 4335. These modules are logical and can therefore be arbitrarily combined or further split according to the functions they implement.

[0088] See Figure 4B, which is a schematic diagram of the structure of terminal 402 provided in an embodiment of this application. The terminal 402 shown in Figure 4B includes at least one processor 450, a memory 470, and at least one network interface 460. The various components in the terminal 402 are coupled together via a bus system 480.

[0089] In some embodiments, the audio decoding device provided in this application can be implemented in software. Figure 4B shows an audio decoding device 473 stored in memory 470. It can be software in the form of programs and plug-ins, including the following software modules: decoding module 4731, inverse quantization module 4732, feature reconstruction module 4733, and synthesis module 4734. These modules are logical and can therefore be arbitrarily combined or further split according to the functions they implement.

[0090] The audio encoding and decoding methods provided in this application will be described below with reference to exemplary applications and implementations of the electronic devices provided in the embodiments of this application. It will be understood that the following methods can be executed individually or collaboratively by the terminal or server described above.

[0091] See Figure 5A, which is a flowchart illustrating the audio encoding method provided in the embodiments of this application. The method is explained below with reference to the steps shown in Figure 5A.

[0092] In step 101, the audio signal is decomposed to obtain low-frequency subband signal and high-frequency subband signal.

[0093] As an example, after the audio signal is acquired, it is first decomposed to obtain low-frequency subband signal and high-frequency subband signal.

[0094] In some embodiments, the audio signal is decomposed to obtain a low-frequency subband signal and a high-frequency subband signal, which can be achieved by: sampling the audio signal at a first sampling frequency to obtain a sampled signal; wherein the sampled signal includes multiple sample points sampled from the audio signal; performing low-pass filtering on the sampled signal, and performing a first downsampling on the obtained low-pass filtering result to obtain a low-frequency subband signal at a second sampling frequency; performing high-pass filtering on the sampled signal, and performing a second downsampling on the obtained high-pass filtering result to obtain a high-frequency subband signal at a second sampling frequency.

[0095] As an example, a QMF filter bank can be used to implement the decomposition process. See Figure 7D, which is a schematic diagram of the spectral response of the QMF filter bank provided in the embodiments of this application.

[0096] A QMF filter bank is a filter pair that includes a QMF analysis filter and a QMF synthesis filter.

[0097] A QMF analysis filter can decompose an input audio signal with a sampling rate of Fs into two sub-band signals with a sampling rate of Fs / 2: a QMF low-pass signal and a QMF high-pass signal. As shown in Figure 7D, 701 represents the spectral response of the low-pass filter $H_{Low}(z)$ of the QMF analysis filter, and 702 represents the spectral response of the high-pass filter $H_{High}(z)$ of the QMF analysis filter.

[0098] As an example, the relationship between the coefficients of the QMF low-pass filter and those of the QMF high-pass filter is given by the following formula:

[0099] $h_{High}(k) = (-1)^{k} h_{Low}(k)$ (Formula 1)

[0100] where $h_{High}(k)$ represents the k-th coefficient of the high-pass filter, and $h_{Low}(k)$ represents the k-th coefficient of the low-pass filter.
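As a minimal sketch of Formula 1, the following Python snippet derives high-pass coefficients from low-pass coefficients by alternating their signs. The function name and the two-tap prototype `h_low` are illustrative assumptions; this application does not specify the filter taps.

```python
import numpy as np

def qmf_highpass_from_lowpass(h_low):
    """Apply Formula 1: h_High(k) = (-1)^k * h_Low(k)."""
    h_low = np.asarray(h_low, dtype=float)
    k = np.arange(len(h_low))
    return ((-1.0) ** k) * h_low

# Hypothetical two-tap prototype, chosen only for illustration.
h_low = np.array([1.0, 1.0]) / np.sqrt(2.0)
h_high = qmf_highpass_from_lowpass(h_low)
```

Even-indexed coefficients are kept and odd-indexed coefficients are negated, which mirrors the low-pass response around one quarter of the sampling rate.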

[0101] The following explanation uses an example with a first sampling frequency Fs = 32000Hz and a single frame audio signal duration of 20ms. The first sampling frequency can also be 8000Hz, 16000Hz, 48000Hz, etc., and this application does not limit this.

[0102] First, a frame of audio signal x(n) (i.e., a continuous frame of analog signal) is sampled at Fs = 32000Hz to obtain a sampled signal (i.e., a discrete digital signal). As an example, the sampled signal consists of 640 sample points, or 640 sample values, obtained from the audio signal.

[0103] Secondly, a two-channel QMF analysis filter is used to perform low-pass filtering on the sampled signal at the first sampling frequency (i.e., 32000Hz). The low-pass filtering result is then subjected to the first downsampling to obtain the low-frequency subband signal $x_{LB}(n)$ at the second sampling frequency (i.e., 16000Hz).

[0104] Correspondingly, the two-channel QMF analysis filter is used to perform high-pass filtering on the sampled signal at the first sampling frequency (i.e., 32000Hz). The high-pass filtering result is then subjected to the second downsampling to obtain the high-frequency subband signal $x_{HB}(n)$ at the second sampling frequency (i.e., 16000Hz).

[0105] Low-pass filtering and high-pass filtering can be performed in parallel to improve filtering efficiency. The effective bandwidth of the low-frequency sub-band signal obtained after downsampling is 0-8kHz, and the effective bandwidth of the high-frequency sub-band signal is 8-16kHz. Since bandwidth is the difference between the highest and lowest frequencies contained in the signal, the bandwidths of the low-frequency and high-frequency sub-band signals are the same, both being 8kHz. After decomposition processing, both sub-band signals include 320 sample points.
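The decomposition described above can be sketched as follows. This is an illustration under stated assumptions, not the filter bank of this application: the two-tap prototype `h_low` and the helper name `qmf_analysis` are hypothetical, and a practical QMF bank would use a longer prototype filter.

```python
import numpy as np

FS = 32000                          # first sampling frequency (Hz)
FRAME_MS = 20                       # frame duration (ms)
FRAME_LEN = FS * FRAME_MS // 1000   # 640 sample points per frame

def qmf_analysis(frame, h_low):
    """Split one frame into low/high sub-bands at half the sampling rate."""
    h_low = np.asarray(h_low, dtype=float)
    h_high = ((-1.0) ** np.arange(len(h_low))) * h_low   # Formula 1
    half = len(frame) // 2
    start = len(h_low) - 1          # skip the filter's start-up samples
    # Filter, then keep every second sample (downsampling by 2).
    x_lb = np.convolve(frame, h_low)[start::2][:half]
    x_hb = np.convolve(frame, h_high)[start::2][:half]
    return x_lb, x_hb

h_low = np.array([1.0, 1.0]) / np.sqrt(2.0)   # assumed prototype filter
frame = np.ones(FRAME_LEN)                    # a constant (DC) test frame
x_lb, x_hb = qmf_analysis(frame, h_low)
```

Each sub-band holds 320 sample points, and the constant test frame lands entirely in the low band, as expected from a low-pass/high-pass split.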

[0106] The above method can accurately obtain the low-frequency and high-frequency sub-band signals corresponding to the audio signal, which facilitates further processing based on the accurate sub-band signals.

[0107] In step 102, feature extraction processing at multiple levels is performed based on the low-frequency subband signal and the high-frequency subband signal to obtain subband signal features corresponding to each level.

[0108] As an example, after the low-frequency subband signal and the high-frequency subband signal are obtained, in some embodiments, feature extraction processing can be performed at one or more levels based on the low-frequency subband signal and the high-frequency subband signal to obtain the subband signal features corresponding to each level.

[0109] As an example, feature extraction can be performed at one level (i.e., the first level) based on the low-frequency subband signal. Referring to Figure 7A, Figure 7A is a schematic diagram of a single-level encoding and decoding process provided in an embodiment of this application.

[0110] As shown in Figure 7A, after the low-frequency subband signal $x_{LB}(n)$ is obtained, first-level analysis (i.e., feature extraction processing) is first performed on $x_{LB}(n)$ to obtain the first low-frequency subband signal feature $F_{LB}(n)$.

[0111] Secondly, the first low-frequency subband signal feature $F_{LB}(n)$ is quantized and encoded to obtain the bitstream corresponding to the first level.

[0112] Next, the bitstream corresponding to the first level is transmitted to the decoding end, where it is decoded to obtain the decoding result $F'_{LB}(n)$. First-level synthesis is performed on the decoding result $F'_{LB}(n)$ to obtain $x'_{LB}(n)$, and the synthesis filter is called to upsample $x'_{LB}(n)$ to obtain an estimate of the low-frequency part of the original audio signal.

[0113] Correspondingly, a single-level encoding and decoding process can also be performed based on the high-frequency subband signal; the processing flow is similar to that shown in Figure 7A and will not be repeated here.

[0114] In some embodiments, feature extraction processing can be performed at two levels based on low-frequency subband signals and high-frequency subband signals to obtain the subband signal features corresponding to each level.

[0115] As an example, the first level of feature extraction can be performed based on the low-frequency subband signal, and the second level of feature extraction can be performed based on the high-frequency subband signal. Referring to Figure 7B, Figure 7B is a schematic diagram of two-level encoding and decoding provided in an embodiment of this application.

[0116] At the encoding end, first-level analysis is performed on the low-frequency subband signal to obtain the first low-frequency subband signal feature $F_{LB}(n)$, and second-level analysis is performed on the high-frequency subband signal to obtain the first high-frequency subband signal feature $F_{HB}(n)$. Next, $F_{LB}(n)$ and $F_{HB}(n)$ are quantized and encoded separately to obtain the bitstream corresponding to the first level and the bitstream corresponding to the second level, and the bitstreams of the two levels are transmitted to the decoding end.

[0117] At the decoding end, firstly, the received bitstreams of the two levels are decoded to obtain the decoding results for the first and second levels, respectively. Then, first-level synthesis is performed based on the first-level decoding result to obtain the first-level synthesis result $x'_{LB}(n)$, and second-level synthesis is performed based on the second-level decoding result to obtain the second-level synthesis result $x'_{HB}(n)$. Finally, the synthesis filter is called to upsample $x'_{LB}(n)$ and $x'_{HB}(n)$, and the two upsampled results are combined to obtain the estimate $x'(n)$ of the original audio signal.
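The final upsample-and-combine step at the decoding end can be illustrated with a two-tap QMF pair written in polyphase form. This is a sketch under assumptions: the Haar-like filters and the function names are hypothetical, and the sub-bands are fed back unchanged, so the reconstruction is exact. In the codec described here, the two synthesis results come from decoded features and would only approximate the original sub-bands.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def analysis(x):
    """Two-tap QMF analysis in polyphase form (assumed Haar-like filters)."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / SQRT2, (even - odd) / SQRT2

def synthesis(x_lb, x_hb):
    """Upsample both sub-bands by 2 and merge them into one full-rate frame."""
    out = np.empty(2 * len(x_lb))
    out[0::2] = (x_lb + x_hb) / SQRT2
    out[1::2] = (x_lb - x_hb) / SQRT2
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(640)        # stand-in for one 20 ms frame at 32 kHz
x_lb, x_hb = analysis(x)            # two 320-sample sub-bands
x_rec = synthesis(x_lb, x_hb)       # 640-sample estimate of the frame
```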

[0118] The following will illustrate, with reference to the accompanying figures, the two-level feature extraction process based on low-frequency subband signals and high-frequency subband signals.

[0119] Referring to Figure 5B, Figure 5B is a flowchart of an audio encoding method provided in an embodiment of this application. Based on Figure 5A, step 102 shown in Figure 5B can be implemented through steps 1021-1022, which are described below with reference to Figure 5B.

[0120] In step 1021, the first level of feature extraction processing is performed in the following manner: the first feature extraction processing is performed based on the low-frequency subband signal to obtain the first low-frequency subband signal features.

[0121] As an example, where first-level and second-level feature extraction processing are performed, the first feature extraction process at the first level is performed based on the low-frequency subband signal $x_{LB}(n)$ to obtain the first low-frequency subband signal feature $F_{LB}(n)$, where the dimension of the first low-frequency subband signal feature is smaller than the dimension of the low-frequency subband signal.

[0122] Here, the first feature extraction process can be implemented through signal processing, such as extracting waveform features or parameter features of low-frequency subband signals; or it can be implemented through neural networks, including but not limited to autoencoders (AE), fully connected networks (FC), long short-term memory networks (LSTM), convolutional neural networks (CNN) + long short-term memory networks (LSTM), and dilated convolutional networks (DCNN).

[0123] It should be noted that, in this embodiment, the optimal parameters can be obtained by jointly training the neural networks (e.g., dilated convolutional networks) at both the encoding and decoding ends through data collection. Since multiple open-source platforms for neural networks and deep learning exist in related technologies, developers only need to prepare data and set the corresponding network structure to train the neural network. After the server completes the training, the trained neural network can be put into use. The specific training process will not be elaborated here. This embodiment assumes that feature extraction processing is performed based on the optimal parameters of the trained neural network.

[0124] The first feature extraction process is explained below using a dilated convolutional network as an example. The dilated convolutional network itself is introduced first.

[0125] Referring to Figure 8A, Figure 8A is a schematic diagram of an ordinary convolutional network and a dilated convolutional network provided in the embodiments of this application. The purpose of a dilated convolutional network is to enlarge the receptive field while keeping the feature map size unchanged, thereby avoiding errors caused by upsampling and downsampling. As shown in Figure 8A, although the kernel size is 3x3 in both the ordinary convolutional network (shown in 801) and the dilated convolutional network (shown in 802), the ordinary convolutional network has a dilation rate of 1 and a receptive field of only 3, whereas the dilated convolutional network has a dilation rate of 2 and a receptive field of 5.

[0126] During convolution, the convolution kernel moves across a plane such as the one shown in Figure 8A, and this movement involves a stride rate. For example, if the kernel moves one grid cell at a time, the stride rate is 1. Furthermore, the number of channels indicates how many sets of convolution kernel parameters are used for the convolution. Generally, more channels result in more comprehensive signal analysis and higher accuracy; however, more channels also increase computational complexity. For example, a 1x320 feature vector convolved with 24 channels yields a 24x320 feature vector. The size of the dilated convolution kernel (e.g., for audio signals, the kernel is typically set to 1x3), the dilation rate, the stride rate, and the number of channels can be defined according to the specific application requirements.
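The receptive-field figures and the 1x320 to 24x320 example above can be checked with a short sketch. The helper names and the random weights are assumptions made for illustration only; a trained network would supply its own kernels.

```python
import numpy as np

def receptive_field(kernel_size, dilation):
    """Span covered by one dilated kernel along a single axis."""
    return dilation * (kernel_size - 1) + 1

def causal_dilated_conv1d(x, weights, dilation=1):
    """x: (length,), weights: (channels, kernel_size) -> (channels, length)."""
    channels, k = weights.shape
    pad = dilation * (k - 1)                  # left padding keeps it causal
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros((channels, len(x)))
    for j in range(k):
        # Tap j looks back (k - 1 - j) * dilation samples.
        start = j * dilation
        out += weights[:, j:j + 1] * xp[start:start + len(x)]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(320)                  # one 1x320 input vector
weights = rng.standard_normal((24, 3))        # 24 channels, 1x3 kernels
features = causal_dilated_conv1d(x, weights, dilation=2)
```

With a stride rate of 1 and left padding, the output keeps the input length, so only the channel dimension grows.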

[0127] The following will explain how to use a dilated convolutional network for the first feature extraction process.

[0128] Referring to Figure 5B, step 1021 shown in Figure 5B can be implemented through steps 10211-10214, which are described below with reference to Figure 5B.

[0129] In step 10211, the low-frequency subband signal is subjected to a first convolution process to obtain the first convolution feature.

[0130] Referring to Figure 8B, Figure 8B is a schematic diagram of the structure of a neural network used for performing the first feature extraction process, provided in an embodiment of this application.

[0131] As shown in Figure 8B, a 24-channel convolutional layer (e.g., a causal convolutional layer) is first called to perform the first convolution process on the low-frequency subband signal $x_{LB}(n)$, thereby expanding the 1*320 low-frequency subband signal $x_{LB}(n)$ into a 24*320 first convolutional feature.

[0132] In step 10212, the first convolutional features are subjected to the first pooling process to obtain the first pooled features.

[0133] As an example, referring to Figure 8B, after the first convolutional feature is obtained, the pooling layer is called to perform the first pooling process on the 24*320 first convolutional feature. For example, the pooling factor can be set to 2 and the activation function can be set to the ReLU function. The first pooling process yields a 24*160 first pooling feature.

[0134] In step 10213, the first pooling feature is subjected to a third downsampling process to obtain a third downsampling feature. The third downsampling process includes multiple cascaded downsampling operations.

[0135] As an example, referring to Figure 8B, after the first pooling feature is obtained, three cascaded downsampling layers are called to perform the third downsampling process on the first pooling feature.

[0136] These three downsampling layers correspond to three different downsampling factors. For example, the first downsampling layer has a downsampling factor of 4 and 48 channels; the second downsampling layer has a downsampling factor of 5 and 96 channels; and the third downsampling layer has a downsampling factor of 8 and 192 channels. Therefore, after this third downsampling process through these three downsampling layers, the 24*160 first pooling feature is successively transformed into 48*40, 96*8, and 192*1 third downsampling features.

[0137] During the third downsampling process, one or more dilated convolution processes can be performed based on the first pooling feature. For example, the size of each dilated convolution kernel can be set to 1*3, the stride rate can be set to 1, and the dilation rate can be set to 3. The embodiments of this application do not limit this.

[0138] In step 10214, the third downsampling feature is subjected to a second convolution process to obtain the first low-frequency subband signal feature.

[0139] As an example, referring to Figure 8B, after the 192*1 third downsampled feature is obtained, a 64-channel convolutional layer (e.g., a causal convolutional layer) is called to perform the second convolution process on it, yielding a 64-dimensional feature vector, namely the first low-frequency subband signal feature $F_{LB}(n)$.

[0140] As shown in Figure 8B, the dimension of the low-frequency subband signal $x_{LB}(n)$ is 320, and the dimension of the first low-frequency subband signal feature $F_{LB}(n)$ is 64. Therefore, the first feature extraction process reduces dimensionality, that is, it compresses the data.
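The shape changes across steps 10211-10214 can be traced with a few lines of arithmetic. The sketch below only tracks (channels, length) pairs using the example factors given above; it does not implement the layers themselves, and the function name is hypothetical.

```python
def trace_first_extraction(length=320):
    """Return the (channels, length) shape after each stage of Figure 8B."""
    shapes = [(1, length)]                 # low-frequency subband signal
    shapes.append((24, length))            # first convolution, 24 channels
    length //= 2                           # first pooling, factor 2
    shapes.append((24, length))
    for factor, channels in ((4, 48), (5, 96), (8, 192)):
        length //= factor                  # cascaded downsampling layers
        shapes.append((channels, length))
    shapes.append((64, length))            # second convolution, 64 channels
    return shapes

shapes = trace_first_extraction()
compression_ratio = 320 / 64               # input dimension / feature dimension
```

The product of the pooling and downsampling factors (2 x 4 x 5 x 8) equals 320, which is why the time axis collapses to a single point before the final 64-channel convolution.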

[0141] Using the above method, the first low-frequency sub-band signal features can be obtained quickly and efficiently with the help of neural networks. This not only provides accurate first low-frequency sub-band signal features but also enables data compression.

[0142] In step 1022, the second-level feature extraction process is performed as follows: the second feature extraction process is performed based on the high-frequency subband signal to obtain the first high-frequency subband signal features.

[0143] As an example, the second feature extraction process can be implemented based on the bandwidth extension method to obtain the first high-frequency sub-band signal features corresponding to the high-frequency sub-band signal. The dimension of the first high-frequency sub-band signal features is smaller than the dimension of the high-frequency sub-band signal.

[0144] Referring to Figure 7E, Figure 7E is a schematic diagram of bandwidth extension provided in an embodiment of this application. As shown in Figure 7E, firstly, core-layer encoding is performed on the original ultra-wideband signal at a low sampling frequency to obtain a reconstructed wideband signal. Secondly, the spectrum of the low-frequency part of the reconstructed wideband signal is copied to the high-frequency part of the ultra-wideband signal. Finally, gain control is applied to the copied high-frequency spectrum based on pre-recorded side information (e.g., information describing the energy correlation between high and low frequencies). Generally, a bit rate of only 1-2 kbps is sufficient to achieve the effect of doubling the sampling rate.

[0145] Referring to Figure 5C, Figure 5C is a flowchart of an audio encoding method provided in an embodiment of this application. Based on Figure 5B, step 1022 shown in Figure 5C can be implemented through steps 10221-10224, which are described below with reference to Figure 5C.

[0146] In step 10221, frequency domain transformation is performed on the first number of sample points to obtain the first number of transformation coefficients.

[0147] As an example, when performing second feature extraction processing based on the high-frequency subband signal, the number of sample points included in the high-frequency subband signal is first determined. For example, if the high-frequency subband signal includes 320 sample points, then the first number is 320. Next, frequency domain transformation processing is performed based on the 320 sample points of the high-frequency subband signal to obtain 320 transform coefficients.

[0148] Here, frequency domain transformation processing includes Discrete Cosine Transform (DCT) processing, Modified Discrete Cosine Transform (MDCT) processing, and Fast Fourier Transform (FFT) processing.

[0149] In some embodiments, frequency domain transformation processing is performed on a first number of sample points to obtain a first number of transformation coefficients. This can be achieved by: obtaining the reference frame high-frequency sub-band signal corresponding to the reference frame audio signal; and performing discrete cosine transform processing on the first number of sample points included in the high-frequency sub-band signal based on the first number of sample points in the reference frame high-frequency sub-band signal and the first number of sample points in the high-frequency sub-band signal to obtain the transformation coefficients corresponding to the first number of sample points included in the high-frequency sub-band signal.

[0150] As an example, firstly, the audio signal of the next frame or the previous frame of the current frame is obtained, and this obtained audio signal is used as the reference frame audio signal. Next, the reference frame audio signal is decomposed to obtain the reference frame high-frequency sub-band signal. The method for obtaining the reference frame high-frequency sub-band signal is similar to that for the current frame high-frequency sub-band signal, and will not be repeated here. The reference frame high-frequency sub-band signal also includes a first number of sample points, namely 320 sample points.

[0151] After obtaining 320 sample points of the high-frequency subband signal of the reference frame, the 320 sample points of the high-frequency subband signal of the reference frame are merged with the 320 sample points of the high-frequency subband signal of the current frame to obtain 640 sample points.

[0152] Based on these 640 sample points, MDCT processing is performed. For a 50% time-domain overlap window, the MDCT transform coefficients corresponding to the 320 sample points of the current frame's high-frequency subband signal can be calculated.

[0153] By determining the 320 MDCT transform coefficients corresponding to the high-frequency subband signal of the current frame using MDCT processing, key information of the high-frequency subband signal can be accurately extracted.
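The MDCT step above can be sketched with a direct (non-fast) transform that maps a 640-sample block to 320 coefficients. The sine window, the choice of the previous frame as the reference frame, and the random stand-in signals are assumptions for illustration; this application only states that a 50% time-domain overlap window is used.

```python
import numpy as np

def mdct(block):
    """Direct MDCT of a 2N-sample block, returning N coefficients."""
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    window = np.sin(np.pi * (n + 0.5) / two_n)   # assumed sine window
    basis = np.cos(np.pi / n_half * np.outer(k + 0.5, n + 0.5 + n_half / 2))
    return basis @ (window * block)

rng = np.random.default_rng(0)
reference_hb = rng.standard_normal(320)   # stand-in for the reference frame
current_hb = rng.standard_normal(320)     # stand-in for the current frame
coeffs = mdct(np.concatenate([reference_hb, current_hb]))
```

The direct form costs on the order of N squared operations per frame; a practical encoder would normally use an FFT-based MDCT instead.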

[0154] In step 10222, the first number of transformation coefficients are divided into multiple first sub-bands.

[0155] As an example, after obtaining 320 transform coefficients, these 320 transform coefficients are divided into multiple first sub-bands. For example, they can be divided into 8 first sub-bands, where each first sub-band includes the transform coefficients corresponding to a set of adjacent sample points.

[0156] Here, the first sub-band can be divided evenly so that each first sub-band includes the same number of transform coefficients. For example, if it is evenly divided into 8 first sub-bands, each first sub-band includes 320 / 8, or 40 transform coefficients. Alternatively, the first sub-band can be divided non-uniformly, such as the first sub-band with lower frequencies containing fewer transform coefficients (i.e., higher frequency resolution) and the first sub-band with higher frequencies containing more transform coefficients (i.e., lower frequency resolution).

[0157] In some embodiments, the Nyquist sampling theorem is taken into account (to recover the original signal without distortion from the sampled signal, the sampling frequency should be greater than twice the highest frequency of the original signal; when the sampling frequency is less than twice the highest frequency, the signal spectrum exhibits aliasing, and when it is greater than twice the highest frequency, the spectrum does not). The 320 transform coefficients corresponding to the high-frequency subband signal of the current frame cover a spectrum of 8-16kHz, but typical ultra-wideband voice communication does not necessarily require a spectrum up to 16kHz. Therefore, if the maximum frequency is set to 14kHz, only the first 240 transform coefficients need to be considered; correspondingly, if the first sub-bands are divided evenly, six first sub-bands are obtained.
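The figure of 240 coefficients follows from simple arithmetic, sketched below under the assumption that the 320 coefficients are spread uniformly over 8-16kHz (25Hz per coefficient). The function name is hypothetical.

```python
def coefficients_up_to(max_hz, band_lo_hz=8000, band_hi_hz=16000, n_coeffs=320):
    """Number of leading coefficients covering band_lo_hz..max_hz."""
    hz_per_coeff = (band_hi_hz - band_lo_hz) / n_coeffs   # 25 Hz each
    return int(round((max_hz - band_lo_hz) / hz_per_coeff))

kept = coefficients_up_to(14000)   # coefficients needed for 8-14 kHz
n_subbands = kept // 40            # even split into sub-bands of 40
```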

[0158] In step 10223, the following processing is performed for each first sub-band: the average value is calculated based on the second number of transformation coefficients in the first sub-band to obtain the first average energy corresponding to the first sub-band, and the first average energy is determined as the first sub-band spectral envelope corresponding to the first sub-band.

[0159] As an example, let's take the case of dividing 320 transform coefficients evenly into 8 first sub-bands. In this case, the number of transform coefficients included in each first sub-band, i.e., the second number, is 40.

[0160] For each of the eight first sub-bands, the average value of the 40 transformation coefficients in the first sub-band is calculated to obtain the first average energy corresponding to the first sub-band, and the first average energy is used as the spectral envelope of the first sub-band corresponding to that first sub-band.

[0161] As an example, the sum of squares of the 40 transform coefficients included in the first subband can be calculated, and the ratio of the obtained sum of squares to 40 can be determined as the first average energy, thereby obtaining the spectral envelope of the first subband.

[0162] In this way, the first sub-band spectral envelopes corresponding to the eight first sub-bands can be obtained.

[0163] In step 10224, the first sub-band spectral envelopes corresponding to the multiple first sub-bands are determined as the first high-frequency sub-band signal features.

[0164] As an example, after the first sub-band spectral envelopes corresponding to the eight first sub-bands are obtained, these eight spectral envelopes are determined as the first high-frequency sub-band signal feature $F_{HB}(n)$.

[0165] Using the above method, a high-frequency subband signal with a dimension of 320 can be converted into an 8-dimensional first high-frequency subband signal feature $F_{HB}(n)$, so that only a small amount of data is needed to represent the high-frequency subband signal, which helps improve coding efficiency.
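Steps 10222-10224 can be sketched as a reshape followed by a mean of squares. The function name and the synthetic coefficients are assumptions for illustration; an encoder would pass in the transform coefficients of the current frame.

```python
import numpy as np

def subband_envelope(coeffs, n_subbands=8):
    """Mean energy (mean of squares) of each evenly sized sub-band."""
    coeffs = np.asarray(coeffs, dtype=float)
    bands = coeffs.reshape(n_subbands, -1)   # 8 rows of 40 coefficients
    return np.mean(bands ** 2, axis=1)

# Synthetic input: every coefficient in sub-band i has the value i + 1.
coeffs = np.repeat(np.arange(1.0, 9.0), 40)
f_hb = subband_envelope(coeffs)              # 8-dimensional feature
```

For the non-uniform division mentioned earlier, the reshape would be replaced by slicing at explicit sub-band boundaries.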

[0166] As an example, feature extraction can be performed at the first level based on the low-frequency subband signal, at the second level based on the high-frequency subband signal, and at the third level based on both the low-frequency and high-frequency subband signals. Referring to Figure 7C, Figure 7C is a schematic diagram of three-level encoding and decoding provided in an embodiment of this application.

[0167] At the encoding end, first-level analysis is performed on the low-frequency subband signal to obtain the first low-frequency subband signal feature $F_{LB}(n)$; second-level analysis is performed on the high-frequency subband signal to obtain the first high-frequency subband signal feature $F_{HB}(n)$; third-level low-frequency analysis is performed based on the low-frequency subband signal and the first low-frequency subband signal feature to obtain the second low-frequency subband signal feature $F_{LB,e}(n)$; and third-level high-frequency analysis is performed based on the high-frequency subband signal to obtain the second high-frequency subband signal feature $F_{HB,e}(n)$. Next, $F_{LB}(n)$, $F_{HB}(n)$, $F_{LB,e}(n)$, and $F_{HB,e}(n)$ are quantized and encoded separately to obtain the bitstreams corresponding to the first, second, and third levels, and the bitstreams of the three levels are transmitted to the decoding end.

[0168] At the decoding end, firstly, the received bitstreams of the three levels are decoded to obtain the decoding results for the first, second, and third levels, respectively. Next, first-level synthesis is performed on the first-level decoding result to obtain the first-level synthesis result $x'_{LB}(n)$; low-frequency synthesis is performed on the decoding result of the low-frequency part of the third-level bitstream to obtain the third-level low-frequency synthesis result $x'_{LB,e}(n)$; and $x'_{LB}(n)$ and $x'_{LB,e}(n)$ are summed to obtain the summation result of the low-frequency part. Based on the second-level decoding result and the decoding result of the high-frequency part of the third-level bitstream, third-level high-frequency synthesis is performed to obtain the high-frequency synthesis result $x'_{HB}(n)$. Finally, the synthesis filter is called to upsample the summation result of the low-frequency part and $x'_{HB}(n)$, and the two upsampled results are combined to obtain the estimate $x'(n)$ of the original audio signal.

[0169] The following will illustrate, with reference to the accompanying figures, the method of feature extraction processing at three levels based on low-frequency subband signals and high-frequency subband signals.

[0170] Referring to Figure 5D, Figure 5D is a flowchart of an audio encoding method provided in an embodiment of this application. Based on Figure 5A, step 102 shown in Figure 5D can be implemented through steps 1023-1024, which are described below with reference to Figure 5D.

[0171] In step 1023, the third-level feature extraction process is performed as follows: based on the low-frequency sub-band signal and the first low-frequency sub-band signal features, the third feature extraction process is performed to obtain the second low-frequency sub-band signal features.

[0172] As an example, where the multiple levels include a third level, the feature extraction processes for the first and second levels are the same as steps 1021-1022 above and will not be repeated here.

[0173] The following section will use the implementation of the third feature extraction process through a neural network as an example to illustrate the third feature extraction process.

[0174] Referring to Figure 5D, step 1023 shown in Figure 5D can be implemented through steps 10231-10235, which are described below with reference to Figure 5D.

[0175] In step 10231, the low-frequency subband signal and the first low-frequency subband signal feature are concatenated to obtain a concatenated feature.

[0176] Referring to Figure 8C, Figure 8C is a schematic diagram of the structure of a neural network for performing the third feature extraction process, provided in an embodiment of this application.

[0177] As shown in Figure 8C, firstly, the low-frequency subband signal $x_{LB}(n)$ (with a dimension of 320) and the first low-frequency subband signal feature $F_{LB}(n)$ (with a dimension of 64) are concatenated to obtain a 384-dimensional concatenated feature.

[0178] In step 10232, the concatenated feature is subjected to a third convolution process to obtain the third convolutional feature.

[0179] As an example, after obtaining the concatenated features, a third convolution process is performed using a convolutional layer with 24 channels (e.g., a causal convolutional layer) based on the concatenated features, resulting in a 24*384 third convolutional feature.

[0180] In step 10233, the third convolutional feature is subjected to a second pooling process to obtain the second pooled feature.

[0181] As an example, referring to Figure 8C, after the third convolutional feature is obtained, the pooling layer is called to perform the second pooling process on the third convolutional feature. With a pooling factor of 2, a 24*192 second pooling feature is obtained.

[0182] In step 10234, the second pooling feature is subjected to a fourth downsampling process to obtain the fourth downsampling feature.

[0183] As an example, referring to Figure 8C, after the second pooling feature is obtained, three cascaded downsampling layers are called to perform the fourth downsampling process on the second pooling feature.

[0184] These three downsampling layers correspond to three different downsampling factors. For example, the first downsampling layer has a downsampling factor of 4 and 48 channels; the second downsampling layer has a downsampling factor of 6 and 96 channels; and the third downsampling layer has a downsampling factor of 8 and 192 channels. Therefore, after the fourth downsampling process through these three downsampling layers, the 24*192 second pooling feature is successively transformed into fourth downsampling features of 48*48, 96*8, and 192*1.

[0185] In step 10235, the fourth downsampling feature is subjected to a fourth convolution process to obtain the second low-frequency subband signal feature.

[0186] As an example, referring to Figure 8C, after the 192*1 fourth downsampled feature is obtained, a 28-channel convolutional layer (e.g., a causal convolutional layer) is called to perform the fourth convolution process on it, yielding a 28-dimensional feature vector, namely the second low-frequency subband signal feature $F_{LB,e}(n)$.

[0187] Since the dimension of the second low-frequency subband signal feature $F_{LB,e}(n)$ is 28 and the dimension of the first low-frequency subband signal feature $F_{LB}(n)$ is 64, the dimension of the second low-frequency subband signal feature is smaller than the dimension of the first low-frequency subband signal feature.
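The concatenation in step 10231 and the resulting lengths along the time axis can be sketched as follows. The zero-filled arrays are placeholders and the function name is hypothetical; only the dimensions and the example factors (2, then 4, 6, and 8) come from the description above.

```python
import numpy as np

x_lb = np.zeros(320)    # placeholder low-frequency subband signal
f_lb = np.zeros(64)     # placeholder first low-frequency feature
spliced = np.concatenate([x_lb, f_lb])

def trace_lengths(length, pool=2, factors=(4, 6, 8)):
    """Time-axis length after pooling and each cascaded downsampling layer."""
    lengths = [length]
    length //= pool
    lengths.append(length)
    for factor in factors:
        length //= factor
        lengths.append(length)
    return lengths

lengths = trace_lengths(len(spliced))
```

The second downsampling factor is 6 here rather than 5 because the concatenated input has 384 points instead of 320, and 2 x 4 x 6 x 8 equals 384.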

[0188] The purpose of the third-level third feature extraction process is to further extract features of the low-frequency subband signal. It should be noted that the second low-frequency subband signal feature obtained by the third feature extraction process reflects the residual between the reconstructed signal at the decoding end and the original low-frequency subband signal, based on the features of the first low-frequency subband signal. Therefore, the second low-frequency subband signal feature can also be called the low-frequency subband signal residual feature.

[0189] Using the above method, the second low-frequency subband signal features can be obtained quickly and efficiently with the help of neural networks. This not only provides accurate second low-frequency subband signal features but also enables data compression.

[0190] In step 1024, a fourth feature extraction process is performed based on the first high-frequency sub-band signal features to obtain the second high-frequency sub-band signal features.

[0191] As an example, the fourth feature extraction process at the third level can be implemented based on the second feature extraction process at the second level. That is, the fourth feature extraction process is performed based on the first high-frequency sub-band signal features obtained from the second feature extraction process to obtain the second high-frequency sub-band signal features.

[0192] See Figure 5E, which is a flowchart illustrating an audio encoding method provided in an embodiment of this application. Based on Figure 5D, step 1024 in Figure 5E can be implemented through steps 10241-10243, which are described below with reference to Figure 5E.

[0193] In step 10241, a third number of transformation coefficients are selected from the second number of transformation coefficients in the first sub-band, and the selected third number of transformation coefficients are determined as the second sub-band.

[0194] As an example, in the process of obtaining the first high-frequency sub-band signal feature, eight first sub-bands were obtained. For each first sub-band, a third number of transformation coefficients are selected from the 40 transformation coefficients included in that first sub-band, and the selected third number of transformation coefficients are determined as the second sub-band. The third number is half the second number; when the second number is 40, the third number is 20.

[0195] Here, since there are 8 first subbands, and a second subband is determined based on each first subband, we can eventually obtain 8 second subbands.

[0196] In step 10242, the average value is calculated based on the third number of transformation coefficients in the second sub-band to obtain the second average energy corresponding to the second sub-band, and the second average energy is determined as the second sub-band spectral envelope corresponding to the second sub-band.

[0197] As an example, after obtaining 8 second subbands, the average value is calculated for each second subband based on its 20 transformation coefficients. For example, the sum of squares of the 20 transformation coefficients included in the second subband can be calculated, and the ratio of this sum of squares to 20 can be determined as the second average energy, thereby obtaining the spectral envelope of the second subband.

[0198] In this way, the spectral envelopes of the second sub-bands corresponding to the eight second sub-bands can be obtained.

[0199] In step 10243, the second sub-band spectral envelope corresponding to each second sub-band is determined as the second high-frequency sub-band signal feature.

[0200] As an example, after obtaining the spectral envelopes of the eight second sub-bands respectively, the spectral envelopes of the eight second sub-bands are determined as the second high-frequency sub-band signal feature F_{HB,e}(n).
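Steps 10241-10243 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the 320 high-frequency transform coefficients are available as a flat list, and it assumes that the 20 coefficients selected from each first sub-band are simply its first 20 (the text does not state which coefficients are selected). The function name `second_subband_envelopes` is hypothetical.

```python
def second_subband_envelopes(coeffs, first_band_size=40, second_band_size=20):
    """Return one average-energy value per second sub-band.

    coeffs: flat list of high-frequency transform coefficients (e.g. 320).
    ASSUMPTION: each second sub-band is the first `second_band_size`
    coefficients of its first sub-band; the source does not specify
    which coefficients are selected.
    """
    if len(coeffs) % first_band_size != 0:
        raise ValueError("coefficient count must be a multiple of the first sub-band size")
    envelopes = []
    for start in range(0, len(coeffs), first_band_size):
        second_band = coeffs[start:start + second_band_size]
        # Second average energy: sum of squares divided by the coefficient count.
        energy = sum(c * c for c in second_band) / second_band_size
        envelopes.append(energy)
    return envelopes


# 320 synthetic coefficients -> 8 first sub-bands -> 8 envelope values.
example = [1.0] * 160 + [2.0] * 160
features = second_subband_envelopes(example)
print(len(features), features[0], features[-1])  # 8 1.0 4.0
```

The resulting 8 values play the role of the 8-dimensional second high-frequency sub-band signal feature in this sketch.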

[0201] The second feature extraction process maps the 320-dimensional high-frequency subband signal to an 8-dimensional first high-frequency subband signal feature, so the transform coefficients corresponding to every 40 sample points share one first subband, and some information is lost in this mapping. Therefore, in the fourth feature extraction process, a further spectral envelope of the high-frequency subband signal is conveyed for each first subband, with the transform coefficients corresponding to every 20 sample points sharing one second subband. This allows the spectral energy of the transform coefficients to be adjusted every 20 sample points, so that the second high-frequency subband signal feature has higher resolution and reflects the characteristics of the high-frequency subband signal more accurately.

[0202] In step 103, the sub-band signal features corresponding to each level are quantized to obtain the index value of the sub-band signal features.

[0203] As an example, after obtaining the sub-band signal features corresponding to each level, the sub-band signal features corresponding to each level are quantized to obtain the index value of the sub-band signal features corresponding to each level.

[0204] Here, quantization is used to digitize the subband signal features on the amplitude axis, and typically includes vector quantization or scalar quantization.

[0205] The principle of vector quantization is to combine multiple scalar data into a vector, divide the vector space into multiple sub-regions, and determine a representative vector for each sub-region. If a sub-band signal feature falls into a certain sub-region during quantization, the representative vector corresponding to that sub-region is used to replace the sub-band signal feature, that is, the sub-band signal feature is quantized into the representative vector.

[0206] The principle of scalar quantization is to divide the entire dynamic range into multiple sub-intervals, and each sub-interval is assigned a representative value. If the sub-band signal feature falls into a certain sub-interval during quantization, the representative value corresponding to that sub-interval is used to replace the sub-band signal feature, that is, the sub-band signal feature is quantized into the representative value.

[0207] As an example, to improve quantization efficiency, the quantization process can be completed by querying a predefined quantization table for the subband signal characteristics corresponding to each level.
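The scalar and vector quantization principles described above can be sketched with a table lookup. The tables below are hypothetical and chosen only for illustration, and the nearest-representative rule is one common way of partitioning the range or vector space into sub-intervals or sub-regions; the source does not fix a particular partition.

```python
def scalar_quantize(value, table):
    """Return the index of the representative value nearest to `value`."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))


def vector_quantize(vector, codebook):
    """Return the index of the representative vector nearest to `vector`
    (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Hypothetical quantization tables, for illustration only.
SCALAR_TABLE = [-1.0, -0.5, 0.0, 0.5, 1.0]
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

scalar_index = scalar_quantize(0.4, SCALAR_TABLE)
vector_index = vector_quantize((0.9, 0.2), CODEBOOK)
print(scalar_index, SCALAR_TABLE[scalar_index])  # 3 0.5
print(vector_index, CODEBOOK[vector_index])      # 1 (1.0, 0.0)
```

Only the index is passed on for encoding; the representative value or vector can be recovered from the same table by looking up that index.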

[0208] In step 104, the index values of the sub-band signal features are encoded to obtain the code stream corresponding to the layer.

[0209] As an example, after obtaining the index value of the sub-band signal feature corresponding to each level, the index value of the sub-band signal feature is encoded, for example, by entropy encoding, to obtain the bitstream corresponding to each level.

[0210] See Figure 5F, which is a flowchart illustrating an audio encoding method provided in an embodiment of this application. Based on Figure 5A, step 104 in Figure 5F can be implemented through steps 1041-1044, which are described below with reference to Figure 5F.

[0211] In step 1041, the index value of the first low-frequency signal feature is encoded to obtain the first code stream corresponding to the first level.

[0212] As an example, if feature extraction processing at the first and second levels is performed, and the sub-band signal feature corresponding to the first level is the first low-frequency sub-band signal feature, then encoding the index value of the first low-frequency signal feature can yield the first bitstream corresponding to the first level.

[0213] As an example, in the first level, assume that an average of 2.5 bits is used to quantize one parameter per frame of audio signal. Since the duration of one frame is 20 ms, this is equivalent to 0.125 bits/ms, or 125 bits/s after unit conversion. Since the first low-frequency sub-band signal feature is a 64-dimensional feature, meaning it contains 64 parameters, the average bit rate for encoding these 64 parameters is 64 * 125 bits/s = 8000 bits/s, or 8 kbps. Therefore, the first bit rate corresponding to the first bitstream is 8 kbps.

[0214] In step 1042, the index value of the first high-frequency signal feature is encoded to obtain the second code stream corresponding to the second level.

[0215] As an example, since the sub-band signal feature corresponding to the second level is the first high-frequency sub-band signal feature, the second code stream corresponding to the second level can be obtained by encoding the index value of the first high-frequency signal feature.

[0216] As an example, in the second level, assuming the average bit rate for quantizing one parameter per frame of audio signal is 5 bits, and since the first high-frequency subband signal has an 8-dimensional feature, which is equivalent to containing 8 parameters, the average bit rate for encoding these 8 parameters is 2 kbps. Therefore, the second bit rate corresponding to the second bitstream is 2 kbps.

[0217] Therefore, the bitrate of the first bitstream is greater than that of the second bitstream. Since the bitrate is positively correlated with the decoding quality index of the corresponding bitstream, the decoding quality index of the first bitstream corresponding to the first level is higher, while the decoding quality index of the second bitstream corresponding to the second level is relatively lower.

[0218] In step 1043, the index value of the second low-frequency signal feature is encoded to obtain the third code stream corresponding to the third level.

[0219] As an example, if a third level of feature extraction processing is also performed, since the sub-band signal features corresponding to the third level include the second low-frequency sub-band signal features, the index values of the second low-frequency signal features are encoded to obtain the third bitstream corresponding to the third level.

[0220] As an example, in the third level, assuming the average bit rate for quantizing a low-frequency subband signal feature parameter per frame of audio signal is 2.5 bits, and since the second low-frequency subband signal feature is a 28-dimensional feature, equivalent to containing 28 parameters, the average bit rate for encoding these 28 parameters is 3.5 kbps. Therefore, the third bit rate corresponding to the third bitstream is 3.5 kbps.

[0221] In step 1044, the index value of the second high-frequency signal feature is encoded to obtain the fourth code stream corresponding to the third level.

[0222] As an example, since the sub-band signal features corresponding to the third level also include the second high-frequency sub-band signal features, the index value of the second high-frequency signal features is encoded to obtain the fourth code stream corresponding to the third level.

[0223] As an example, in the third level, assuming the average bit rate for quantizing a high-frequency subband signal feature parameter per frame of audio signal is 5 bits, and since the second high-frequency subband signal feature is an 8-dimensional feature, equivalent to containing 8 parameters, the average bit rate for encoding these 8 parameters is 2 kbps. Therefore, the fourth bit rate corresponding to the fourth bitstream is 2 kbps.

[0224] Therefore, the first bitrate of the first bitstream (i.e., 8 kbps) is greater than the third bitrate of the third bitstream (i.e., 3.5 kbps), the third bitrate of the third bitstream (i.e., 3.5 kbps) is greater than the second bitrate of the second bitstream (i.e., 2 kbps), and the second bitrate of the second bitstream (i.e., 2 kbps) is equal to the fourth bitrate of the fourth bitstream (i.e., 2 kbps).

[0225] Since the bitrate is positively correlated with the decoding quality index of the corresponding bitstream, the decoding quality index of the first bitstream corresponding to the first level is the highest, that of the third bitstream corresponding to the third level is lower, and that of the second bitstream corresponding to the second level and the fourth bitstream corresponding to the third level is lower still.
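The four bit rates quoted above follow from the same arithmetic: bits per parameter, times the number of parameters, divided by the 20 ms frame duration. The short check below reproduces them; the helper name `bitrate_bps` and the dictionary keys are illustrative only.

```python
FRAME_MS = 20  # frame duration stated in the text


def bitrate_bps(bits_per_param, num_params, frame_ms=FRAME_MS):
    """Average bit rate in bits per second for one bitstream."""
    bits_per_frame = bits_per_param * num_params
    return bits_per_frame * 1000 / frame_ms


bitrates = {
    "first": bitrate_bps(2.5, 64),   # first low-frequency feature, 64 parameters
    "second": bitrate_bps(5, 8),     # first high-frequency feature, 8 parameters
    "third": bitrate_bps(2.5, 28),   # second low-frequency feature, 28 parameters
    "fourth": bitrate_bps(5, 8),     # second high-frequency feature, 8 parameters
}
for name, bps in bitrates.items():
    print(f"{name}: {bps / 1000} kbps")
```

With these figures, all four bitstreams together amount to 15.5 kbps, while the first bitstream alone needs 8 kbps.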

[0226] By using the above-mentioned layered encoding method, different bit rates can be used to progressively encode the features corresponding to different layers. Since each layer only needs to encode a portion of the sub-band signal features, the encoding efficiency can be improved, the encoding complexity can be reduced, and the flexibility to select different encoding methods based on different sub-band signal features can be provided.

[0227] In step 105, the transmission priority is configured for the bitstreams corresponding to the multiple layers.

[0228] As an example, after obtaining the bitstream corresponding to each level, a corresponding transmission priority is configured for the bitstreams of different levels. The transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the level. That is, the higher the decoding quality index of the bitstream, the higher the corresponding transmission priority, and it can be transmitted first when bandwidth is limited.

[0229] As an example, transmission priority can be configured through the Forward Error Correction (FEC) mechanism, which improves the stability and quality of transmitted data through redundant transmission. Here, bitstreams with higher transmission priority can be configured with a larger FEC redundancy factor, while bitstreams with lower transmission priority can be configured with a smaller FEC redundancy factor.

[0230] See Figure 5G, which is a flowchart illustrating an audio encoding method provided in an embodiment of this application. Based on Figure 5A, step 105 in Figure 5G can be implemented through steps 1051-1054, which are described below with reference to Figure 5G.

[0231] In step 1051, a first transmission priority is configured for the first bitstream corresponding to the first level.

[0232] As an example, if first-level and second-level encoding processing are performed, and the first bitstream corresponding to the first level is obtained by encoding the index value of the first low-frequency signal feature, then the first bitstream corresponding to the first level is configured with a first transmission priority.

[0233] In step 1052, a second transmission priority is configured for the second bitstream corresponding to the second level.

[0234] As an example, if a second level of encoding processing is performed, and the second bitstream corresponding to the second level is obtained by encoding the index value of the first high-frequency signal feature, then a second transmission priority is configured for the second bitstream corresponding to the second level.

[0235] Since the first bitrate corresponding to the first bitstream is greater than the second bitrate corresponding to the second bitstream, and the bitrate is positively correlated with the decoding quality index corresponding to the bitstream, the first bitstream is configured with a higher first transmission priority, and the second bitstream is configured with a lower second transmission priority.

[0236] In step 1053, a third transmission priority is configured for the third bitstream corresponding to the third level.

[0237] As an example, if a third level of encoding processing is also performed, and the third bitstream corresponding to the third level is obtained by encoding the index value of the second low-frequency signal feature, then a third transmission priority is configured for the third bitstream corresponding to the third level.

[0238] In step 1054, a fourth transmission priority is configured for the fourth bitstream corresponding to the third level.

[0239] As an example, the third level also corresponds to a fourth bitstream, which is obtained by encoding the index values of the second high-frequency signal features. Therefore, a fourth transmission priority is configured for the fourth bitstream corresponding to the third level.

[0240] Since the first bitrate of the first bitstream is greater than the third bitrate of the third bitstream, the third bitrate of the third bitstream is greater than the second bitrate of the second bitstream, the second bitrate of the second bitstream is equal to the fourth bitrate of the fourth bitstream, and the bitrate is positively correlated with the decoding quality metric corresponding to the bitstream, the first transmission priority is higher than the third transmission priority, the third transmission priority is higher than the second transmission priority, and the second and fourth transmission priorities are the same.
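The ordering of the four transmission priorities can be sketched as a ranking over bit rates, with equal bit rates sharing a rank. As in the text, the bit rate is used here as a stand-in for the decoding quality index; the function name `assign_priorities` and the convention that rank 1 is the highest priority are assumptions made for this illustration.

```python
def assign_priorities(bitrates):
    """Map stream name -> priority rank (1 = highest priority).

    Streams with equal bit rate share the same rank (dense ranking).
    """
    distinct = sorted(set(bitrates.values()), reverse=True)
    rank_of = {rate: rank for rank, rate in enumerate(distinct, start=1)}
    return {name: rank_of[rate] for name, rate in bitrates.items()}


bitrates_kbps = {"first": 8.0, "second": 2.0, "third": 3.5, "fourth": 2.0}
priorities = assign_priorities(bitrates_kbps)
print(priorities)  # {'first': 1, 'second': 3, 'third': 2, 'fourth': 3}
```

The first bitstream ranks above the third, the third above the second, and the second and fourth share a rank, matching the ordering described above.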

[0241] By configuring corresponding transmission priorities for different levels of bitstreams, it can be ensured that more important bitstreams are transmitted first when bandwidth is limited, thereby improving the flexibility of data transmission.
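One way to picture priority-first transmission under a bandwidth limit is a greedy selection: visit the bitstreams in priority order and send each one that still fits. This is only a sketch of a possible policy; the source does not prescribe a scheduling rule, and the function name `select_streams`, the tuple layout, and the budgets below are hypothetical.

```python
def select_streams(streams, budget_kbps):
    """Pick streams in priority order (1 = highest) while they fit the budget.

    streams: list of (name, priority, bitrate_kbps) tuples.
    ASSUMPTION: a simple greedy policy, used only for illustration.
    """
    chosen, used = [], 0.0
    # sorted() is stable, so equal-priority streams keep their listed order.
    for name, _priority, rate in sorted(streams, key=lambda s: s[1]):
        if used + rate <= budget_kbps:
            chosen.append(name)
            used += rate
    return chosen, used


STREAMS = [
    ("first", 1, 8.0),
    ("second", 3, 2.0),
    ("third", 2, 3.5),
    ("fourth", 3, 2.0),
]
print(select_streams(STREAMS, 8.0))   # (['first'], 8.0)
print(select_streams(STREAMS, 12.0))  # (['first', 'third'], 11.5)
print(select_streams(STREAMS, 15.5))  # (['first', 'third', 'second', 'fourth'], 15.5)
```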

[0242] In this embodiment, the sub-band signal features of the audio sub-band signal at each level are obtained in layers, and the sub-band signal features corresponding to each level are encoded in layers. In this way, each level only needs to encode specific sub-band signal features, instead of encoding the features of the entire audio signal. This not only improves the efficiency of encoding and decoding, but also reduces the encoding and decoding complexity of each level. According to the importance of the bitstream at different levels to the decoding quality, different transmission priorities are flexibly configured for the bitstream at different levels to ensure that the more important bitstream is transmitted first, which can be applied to application scenarios with different network bandwidths.

[0243] The audio decoding method provided in this application will be described below with reference to exemplary applications and implementations of the electronic devices provided in the embodiments of this application. It will be understood that the following method can be executed individually or collaboratively by the terminal or server described above.

[0244] See Figure 6A, which is a flowchart illustrating the audio decoding method provided in the embodiments of this application. The method is explained below with reference to the steps shown in Figure 6A.

[0245] In step 201, the bitstreams corresponding to multiple levels are decoded to obtain the index value of the bitstream corresponding to each level.

[0246] As an example, after receiving the bitstreams corresponding to multiple layers at the decoding end, the bitstreams corresponding to multiple layers are decoded to obtain the index value of the bitstream corresponding to each layer.

[0247] Here, if the decoding end receives a bitstream corresponding to one level, it decodes to obtain the index value of the bitstream corresponding to that level. If the decoding end receives bitstreams corresponding to multiple levels, it decodes to obtain the index values of the bitstreams corresponding to each of the multiple levels.

[0248] Different layers correspond to different transmission priorities, and the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer.

[0249] In step 202, the index values of the code streams corresponding to each level are inversely quantized to obtain the sub-band signal features corresponding to each level.

[0250] As an example, after obtaining the index value of the bitstream corresponding to each level, the sub-band signal characteristics corresponding to the index value of the bitstream can be obtained by querying the quantization table, thereby realizing inverse quantization processing.

[0251] As an example, if the index value of the code stream corresponding to each level from the first to the third level is obtained, then after inverse quantization, the following can be obtained: the first low-frequency subband signal feature F′_{LB}(n) corresponding to the first level, the first high-frequency sub-band signal feature F′_{HB}(n) corresponding to the second level, the second low-frequency sub-band signal feature F′_{LB,e}(n) corresponding to the third level, and the second high-frequency sub-band signal feature F′_{HB,e}(n) corresponding to the third level.

[0252] In step 203, feature reconstruction processing is performed on the sub-band signal features corresponding to each level to obtain the sub-band signal corresponding to each level.

[0253] As an example, after obtaining the sub-band signal features corresponding to each level, feature reconstruction processing is performed on the sub-band signal features corresponding to each level to obtain the sub-band signal corresponding to each level.

[0254] As an example, if the encoder performs feature extraction processing at only one level, resulting in a bitstream corresponding to that level, then the decoder, upon receiving this bitstream, decodes it to obtain its index value. Next, it performs inverse quantization on this index value to obtain the sub-band signal features corresponding to that level. Finally, based on these sub-band signal features, it performs feature reconstruction processing to obtain the corresponding sub-band signal.

[0255] As an example, see Figure 7A. If the encoding end only performs feature extraction processing at the first level, it ultimately obtains the first bitstream corresponding to the first level. The decoding end processes the first bitstream to obtain the first low-frequency sub-band signal features corresponding to the first level. Then, feature reconstruction processing is performed based on the first low-frequency sub-band signal features to obtain the first low-frequency sub-band signal.

[0256] Correspondingly, if the encoding end only performs the second level of feature extraction processing, the corresponding processing flow is similar to the above method, and will not be repeated here.

[0257] See Figure 6B, which is a flowchart illustrating an audio decoding method provided in an embodiment of this application. Based on Figure 6A, step 203 in Figure 6B can be implemented through steps 2031-2032, which are described below with reference to Figure 6B.

[0258] In step 2031, the first level of feature reconstruction processing is performed in the following manner: the first feature reconstruction processing is performed based on the features of the first low-frequency sub-band signal to obtain the first low-frequency sub-band signal.

[0259] As an example, if the sub-band signal features corresponding to the first and second levels are obtained, and the first level corresponds to the first low-frequency sub-band signal features, then, based on the first low-frequency sub-band signal features, the first feature reconstruction process is performed to obtain the first low-frequency sub-band signal x′_{LB}(n). The dimension of the first low-frequency sub-band signal is greater than the dimension of the first low-frequency sub-band signal features.

[0260] See Figure 6B. Step 2031 can be implemented through steps 20311-20314, which are described below with reference to Figure 6B.

[0261] In step 20311, the first low-frequency subband signal features are subjected to a first convolution process to obtain the first convolution features.

[0262] See Figure 8D, which is a schematic diagram of the structure of a neural network used for first feature reconstruction processing provided in an embodiment of this application.

[0263] As shown in Figure 8D, first, based on the first low-frequency sub-band signal features, a convolutional layer with 192 channels (e.g., a causal convolutional layer) is called to perform the first convolution processing, resulting in a 192*1 first convolutional feature.

[0264] In step 20312, the first convolutional features are subjected to a first upsampling process to obtain the first upsampled features.

[0265] As an example, see Figure 8D. After obtaining the first convolutional feature, three cascaded upsampling layers are called to perform the first upsampling process based on the first convolutional feature.

[0266] These three upsampling layers correspond to three different upsampling factors. For example, the first upsampling layer has an upsampling factor of 8 and 96 channels; the second upsampling layer has an upsampling factor of 5 and 48 channels; and the third upsampling layer has an upsampling factor of 4 and 24 channels. Therefore, after the first upsampling process through these three upsampling layers, the 192*1 first convolutional feature is successively transformed into 96*8, 48*40, and 24*160 first upsampling features.

[0267] In step 20313, the first upsampled feature is subjected to the first pooling process to obtain the first pooled feature.

[0268] As an example, after obtaining the first upsampled feature, the pooling layer is called to perform the first pooling process based on the first upsampled feature. When the pooling factor is 2, the first pooled feature of 24*320 is obtained.

[0269] In step 20314, the first pooling feature is subjected to a second convolution process to obtain the first low-frequency subband signal.

[0270] As an example, after obtaining the first pooling feature, a second convolution process is performed based on the first pooling feature using a convolutional layer with 1 channel (e.g., a causal convolutional layer), resulting in a 1*320 first low-frequency subband signal. It should be noted that the first low-frequency subband signal here is an estimate of the low-frequency subband signal at the encoding end.
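The feature sizes quoted in steps 20311-20314 can be checked with a small shape trace. The sketch below only tracks (channels, length) pairs and does not implement any neural network layer; the stage labels are illustrative, and the pooling stage is modelled as doubling the length because the text gives 24*160 before it and 24*320 after it with a factor of 2.

```python
def trace_first_reconstruction(length=1):
    """Return (stage, channels, length) after each stage of the first
    feature reconstruction network, following the sizes in the text."""
    shapes = [("conv1", 192, length)]
    # (stage label, upsampling factor, output channels) as listed in the text.
    for stage, factor, channels in (("up1", 8, 96), ("up2", 5, 48), ("up3", 4, 24)):
        length *= factor
        shapes.append((stage, channels, length))
    length *= 2  # pooling stage with factor 2 (160 -> 320 per the text)
    shapes.append(("pool", 24, length))
    shapes.append(("conv2", 1, length))  # 1-channel convolution keeps the length
    return shapes


for stage, channels, length in trace_first_reconstruction():
    print(f"{stage}: {channels}*{length}")
```

Starting from the 192*1 first convolutional feature, the trace ends at 1*320, the size of the reconstructed first low-frequency subband signal.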

[0271] Using the above method, the first low-frequency subband signal can be obtained quickly and efficiently with the help of neural networks.

[0272] In step 2032, the second-level feature reconstruction process is performed as follows: the second feature reconstruction process is performed based on the features of the first high-frequency sub-band signal to obtain the first high-frequency sub-band signal.

[0273] As an example, if the sub-band signal features corresponding to the first and second levels are obtained, and the second level corresponds to the first high-frequency sub-band signal features, then, based on the first high-frequency sub-band signal features, the second feature reconstruction process is performed to obtain the first high-frequency sub-band signal x′_{HB}(n). The dimension of the first high-frequency sub-band signal is greater than the dimension of the first high-frequency sub-band signal features.

[0274] See Figure 6C, which is a flowchart illustrating an audio decoding method provided in an embodiment of this application. Based on Figure 6B, step 2032 in Figure 6C can be implemented through steps 20321-20324, which are described below with reference to Figure 6C.

[0275] In step 20321, frequency domain transformation is performed on a first number of sample points included in the first low-frequency subband signal to obtain a first number of low-frequency transformation coefficients.

[0276] As an example, when performing second feature reconstruction processing based on the features of the first high-frequency sub-band signal, a first low-frequency sub-band signal is first obtained based on the first feature reconstruction processing. Then, frequency domain transformation processing is performed on a first number (e.g., 320) of sample points included in the first low-frequency sub-band signal to obtain 320 low-frequency transformation coefficients. Here, the frequency domain transformation processing can be FFT, DCT, MDCT, etc.

[0277] In step 20322, based on the latter half of the low-frequency transformation coefficients in the first number of low-frequency transformation coefficients, two spectrum copying processes are performed to obtain the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal.

[0278] As an example, after obtaining 320 low-frequency transform coefficients, the latter half of the low-frequency transform coefficients, namely the last 160 low-frequency transform coefficients, are selected from these 320 low-frequency transform coefficients. The selected last 160 low-frequency transform coefficients are then subjected to two spectral copying processes to obtain 320 new transform coefficients, which are used as the 320 reference transform coefficients included in the first high-frequency reference sub-band signal.

[0279] Here, the low-frequency transform coefficients in the latter half correspond to higher frequencies than the low-frequency transform coefficients in the first half.

[0280] It should be noted that audio signals have the following characteristics: the low-frequency spectrum has relatively more harmonics, while the high-frequency spectrum has relatively fewer harmonics. Therefore, to avoid the generated spectrum containing too many harmonics due to simple copying, the spectrum of the low-frequency transform coefficients corresponding to the higher-frequency sample points is copied here, that is, the spectrum of the low-frequency transform coefficients corresponding to the last 160 higher-frequency sample points.
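Step 20322 can be sketched as follows, under the assumption that "two spectrum copying processes" means placing two copies of the upper half of the low-frequency spectrum one after the other. The function name `build_reference_coefficients` and the synthetic input are hypothetical.

```python
def build_reference_coefficients(low_coeffs):
    """Build high-frequency reference coefficients by copying the upper
    half of the low-frequency transform coefficients twice.

    ASSUMPTION: the two copies are concatenated in order; the source
    only states that two spectral copying processes are performed.
    """
    n = len(low_coeffs)
    if n % 2 != 0:
        raise ValueError("expected an even number of coefficients")
    upper_half = low_coeffs[n // 2:]
    return upper_half + upper_half


# Synthetic coefficients 0.0 .. 319.0, so indices are easy to follow.
low = [float(k) for k in range(320)]
reference = build_reference_coefficients(low)
print(len(reference), reference[0], reference[159], reference[160])  # 320 160.0 319.0 160.0
```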

[0281] In step 20323, based on the multiple first sub-band spectrum envelopes corresponding to the characteristics of the first high-frequency sub-band signal, a first gain control is performed on a first number of reference transformation coefficients included in the first high-frequency reference sub-band signal to obtain a first number of first gain reference transformation coefficients.

[0282] As an example, after obtaining the 320 reference transform coefficients corresponding to the first high-frequency reference sub-band signal, the 320 reference transform coefficients included in the first high-frequency reference sub-band signal are subjected to first gain control based on the 8 first sub-band spectrum envelopes corresponding to the characteristics of the first high-frequency sub-band signal, thereby obtaining 320 first gain reference transform coefficients.

[0283] See Figure 6D, which is a flowchart illustrating an audio decoding method provided in an embodiment of this application. Based on Figure 6C, step 20323 in Figure 6D can be implemented through steps 203231-203236, which are described below with reference to Figure 6D.

[0284] In step 203231, the first number of reference transformation coefficients are divided into multiple first reference sub-bands.

[0285] As an example, when performing the first gain control, the 320 reference transform coefficients are first divided into multiple first reference sub-bands, which are the same number as the number of first sub-band spectral envelopes. That is, when the number of first sub-band spectral envelopes is 8, 8 first reference sub-bands are obtained here, and each first reference sub-band includes 40 reference transform coefficients.

[0286] In step 203232, multiple different first combinations are generated, each first combination including a first reference subband and a first subband spectral envelope, and for each first combination, the following process is performed: the first subband spectral envelope is determined as the first average energy.

[0287] As an example, after obtaining 8 first reference subbands, 8 first combinations are generated for the 8 first reference subbands and the 8 first subband spectral envelopes. Each first combination includes a first reference subband and a first subband spectral envelope. The first reference subbands and first subband spectral envelopes included in different first combinations are different.

[0288] For each of the eight first combinations, the first subband spectral envelope is first determined as the first average energy.

[0289] In step 203233, the average value is calculated based on the second number of reference transformation coefficients in the first reference sub-band to obtain the first reference average energy corresponding to the first reference sub-band.

[0290] As an example, the second quantity here is the ratio of the first quantity to the number of multiple first reference subbands. In the case where the first quantity is 320 and the number of first reference subbands is 8, the second quantity is 40.

[0291] The first reference average energy is obtained by averaging over the 40 reference transformation coefficients included in the first reference sub-band. The calculation is similar to that of the average energy described above (for example, the sum of squares of the 40 coefficients divided by 40) and will not be repeated here.

[0292] In step 203234, the ratio of the first average energy to the first reference average energy is determined, and the square root of the ratio is taken to obtain the first scaling factor.

[0293] As an example, after obtaining the first average energy and the first reference average energy corresponding to the first combination, the ratio of the first average energy to the first reference average energy is calculated, and the square root of the ratio is taken to obtain the first scaling factor.

[0294] In step 203235, the second number of reference transformation coefficients in the first reference sub-band are multiplied by the first scaling factor to obtain the second number of first gain reference transformation coefficients in the first reference sub-band.

[0295] As an example, after obtaining the first scaling factor, the 40 reference transformation coefficients in the first reference sub-band are all multiplied by the first scaling factor to obtain 40 first gain reference transformation coefficients.

[0296] In step 203236, the second number of first gain reference transformation coefficients in each first reference sub-band are merged to obtain the first number of first gain reference transformation coefficients.

[0297] As an example, since 8 first combinations are obtained in step 203232, and each first combination calculates 40 first gain reference transformation coefficients for a first reference sub-band, the 40 first gain reference transformation coefficients for the first reference sub-bands corresponding to these 8 first combinations are combined to obtain 320 first gain reference transformation coefficients.

[0298] By performing gain control in the above manner, the spectral energy of the virtually generated reference transform coefficients at the decoding end can be made closer to the spectral energy of the original transform coefficients at the encoding end, thus facilitating the accurate generation of estimates for high-frequency subband signals.
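Steps 203231-203236 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "average energy" means the mean of the squared coefficients, and the names `band_energy`, `gain_control`, and the `eps` guard against division by zero are hypothetical.

```python
import math

def band_energy(coeffs):
    # Assumption: "average energy" is the mean of the squared coefficients.
    return sum(c * c for c in coeffs) / len(coeffs)

def gain_control(ref_coeffs, envelopes, eps=1e-12):
    """Scale each reference sub-band so its average energy matches its envelope."""
    band_size = len(ref_coeffs) // len(envelopes)      # e.g. 320 // 8 = 40
    out = []
    for i, target_energy in enumerate(envelopes):
        band = ref_coeffs[i * band_size:(i + 1) * band_size]
        ref_energy = band_energy(band)                  # reference average energy
        # Scaling factor: square root of (target energy / reference energy).
        scale = math.sqrt(target_energy / max(ref_energy, eps))
        out.extend(c * scale for c in band)             # merge scaled sub-bands in order
    return out
```

After scaling, the average energy of each output sub-band equals the corresponding decoded envelope value, which is the property the gain control is meant to enforce.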

[0299] In step 20324, the first number of first gain reference transformation coefficients are subjected to inverse frequency domain transformation to obtain the first high-frequency subband signal.

[0300] As an example, after obtaining 320 first gain reference transform coefficients, performing an inverse frequency domain transform on these 320 first gain reference transform coefficients yields the first high-frequency subband signal. It should be noted that the first high-frequency subband signal here is an estimate of the high-frequency subband signal at the encoding end.

[0301] Since the inverse frequency domain transformation is performed based on the transform coefficients after gain control, the resulting first high-frequency subband signal is closer to the high-frequency subband signal at the encoding end, thereby improving the decoding quality of the high-frequency subband signal.

[0302] See Figure 6E, which is a flowchart of an audio decoding method provided in an embodiment of this application. Based on Figure 6A, step 203 shown in Figure 6E can be implemented through steps 2033-2034, which are described below with reference to Figure 6E.

[0303] In step 2033, the third-level feature reconstruction process is performed as follows: the second low-frequency sub-band signal features are subjected to the third feature reconstruction process to obtain the second low-frequency sub-band signal.

[0304] As an example, if the sub-band signal features corresponding to the third level are also obtained, and the third level corresponds to the second low-frequency sub-band signal features, then the third feature reconstruction processing is performed based on the second low-frequency sub-band signal features to obtain the second low-frequency sub-band signal $x'_{LB,e}(n)$.

[0305] Referring to Figure 6E, step 2033 can be implemented through steps 20331-20334, which are described below with reference to Figure 6E.

[0306] In step 20331, the second low-frequency subband signal features are subjected to a third convolution process to obtain the third convolution features.

[0307] See Figure 8E, which is a schematic diagram of the structure of a neural network for performing the third feature reconstruction processing provided in an embodiment of this application.

[0308] As shown in Figure 8E, based on the second low-frequency subband signal features, a convolutional layer with 192 channels (e.g., a causal convolutional layer) is called to perform the third convolution process, resulting in a 192*1 third convolutional feature.

[0309] In step 20332, the third convolutional features are subjected to a second upsampling process to obtain the second upsampled features.

[0310] As an example, referring to Figure 8E, after the third convolutional feature is obtained, three cascaded upsampling layers are called to perform the second upsampling process based on the third convolutional feature.

[0311] These three upsampling layers correspond to three different upsampling factors. For example, the first upsampling layer has an upsampling factor of 8 and 96 channels; the second upsampling layer has an upsampling factor of 5 and 48 channels; and the third upsampling layer has an upsampling factor of 4 and 24 channels. Therefore, after these three upsampling layers, the 192*1 third convolutional feature is successively transformed into 96*8, 48*40, and 24*160 second upsampling features.

[0312] In step 20333, the second upsampled feature is subjected to a second pooling process to obtain the second pooled feature.

[0313] As an example, after obtaining the second upsampled feature, the pooling layer is called to perform the second pooling process based on the second upsampled feature. When the pooling factor is 2, a second pooled feature of 24*320 is obtained.

[0314] In step 20334, the second pooling feature is subjected to a fourth convolution process to obtain the second low-frequency subband signal.

[0315] As an example, referring to Figure 8E, after the second pooling feature is obtained, a convolutional layer with 1 channel (e.g., a causal convolutional layer) is called to perform the fourth convolution process based on the second pooling feature, resulting in a 1*320 second low-frequency sub-band signal. It should be noted that the second low-frequency sub-band signal here is the residual signal of the first low-frequency sub-band signal.

[0316] Using the above method, the second low-frequency subband signal can be obtained quickly and efficiently with the help of neural networks.
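The feature sizes quoted for steps 20331-20334 can be checked with a small shape trace. This sketch only does arithmetic on (channels, length) pairs. It does not implement the layers themselves, and the function name `trace_shapes` is hypothetical. The doubling of the length at the pooling stage simply follows the stated 24*160 to 24*320 sizes.

```python
def trace_shapes(in_shape=(192, 1)):
    """Return (channels, length) after each stage described for Figure 8E."""
    channels, length = in_shape          # output of the third convolution
    shapes = [in_shape]
    # Three cascaded upsampling layers: (upsampling factor, output channels).
    for factor, out_channels in [(8, 96), (5, 48), (4, 24)]:
        length *= factor
        channels = out_channels
        shapes.append((channels, length))
    # Second pooling process with factor 2: the text gives 24*160 -> 24*320.
    length *= 2
    shapes.append((channels, length))
    # Fourth convolution with a single output channel.
    shapes.append((1, length))
    return shapes

print(trace_shapes())
# [(192, 1), (96, 8), (48, 40), (24, 160), (24, 320), (1, 320)]
```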

[0317] In step 2034, a fourth feature reconstruction process is performed based on the first high-frequency sub-band signal features and the second high-frequency sub-band signal features to obtain the second high-frequency sub-band signal.

[0318] As an example, if the sub-band signal features corresponding to the third level are also obtained, and the third level corresponds to the second high-frequency sub-band signal features, then the fourth feature reconstruction process is performed based on the first high-frequency sub-band signal features and the second high-frequency sub-band signal features to obtain the second high-frequency sub-band signal.

[0319] See Figure 6F, which is a flowchart of an audio decoding method provided in an embodiment of this application. Based on Figure 6E, step 2034 shown in Figure 6F can be implemented through steps 20341-20344, which are described below with reference to Figure 6F.

[0320] In step 20341, multiple first sub-bands corresponding to the first high-frequency sub-band signal characteristics are determined, and multiple second sub-bands corresponding to the second high-frequency sub-band signal characteristics are determined.

[0321] As an example, since the first high-frequency sub-band signal feature is determined based on multiple first sub-bands and the second high-frequency sub-band signal feature is determined based on multiple second sub-bands during the encoding process, the fourth feature reconstruction process first determines the multiple first sub-bands corresponding to the first high-frequency sub-band signal feature and then determines the multiple second sub-bands corresponding to the second high-frequency sub-band signal feature. The number of multiple first sub-bands is the same as the number of multiple second sub-bands. For example, here the number of both first and second sub-bands is 8.

[0322] In step 20342, the third sub-band spectral envelopes corresponding to multiple third sub-bands are determined based on the first sub-band spectral envelopes corresponding to the multiple first sub-bands and the second sub-band spectral envelopes corresponding to the multiple second sub-bands.

[0323] As an example, after obtaining 8 first sub-bands and 8 second sub-bands, the spectral envelopes of the 8 third sub-bands are determined based on the spectral envelopes of the first sub-bands and the spectral envelopes of the second sub-bands. The number of third sub-bands is the same as the number of second sub-bands and also the same as the number of first sub-bands, for example, 8 in each case.

[0324] In some embodiments, the third sub-band spectral envelopes corresponding to the multiple third sub-bands are determined as follows. Multiple different second combinations are generated, each including a first sub-band and the second sub-band corresponding to that first sub-band, and the following processing is performed for each second combination: the first sub-band spectral envelope corresponding to the first sub-band is determined as a first average energy, and the second sub-band spectral envelope corresponding to the second sub-band is determined as a second average energy; the first average energy is multiplied by a second quantity to obtain a first multiplication result; the second average energy is multiplied by a third quantity to obtain a second multiplication result; the second multiplication result is subtracted from the first multiplication result, the ratio of the subtraction result to the third quantity is determined as the third average energy corresponding to the third sub-band, and the third average energy is determined as the third sub-band spectral envelope corresponding to the third sub-band.

[0325] As an example, firstly, eight second combinations are generated, each of which includes a first subband and a corresponding second subband. The first and second subbands included in different second combinations are different.

[0326] Secondly, for each of the eight second combinations, the first sub-band spectrum envelope corresponding to the first sub-band is taken as the first average energy, and the first average energy is multiplied by the number of high-frequency transform coefficients (e.g., 40) included in the first sub-band to obtain the first multiplication result.

[0327] Correspondingly, the spectral envelope of the second sub-band is used as the second average energy. The second average energy is multiplied by the number of high-frequency transform coefficients (e.g., 20) included in the second sub-band to obtain the second multiplication result. Here, the third number is the number of high-frequency transform coefficients included in the second sub-band, and the third number is half of the second number.

[0328] Finally, the second multiplication result is subtracted from the first multiplication result. The ratio of the subtraction result to the third quantity is determined as the third average energy corresponding to the third subband, and the third average energy is determined as the spectral envelope of the third subband. The second quantity of high-frequency transform coefficients included in the first subband is the union of the high-frequency transform coefficients included in the second subband and the high-frequency transform coefficients included in the third subband. That is, the 40 high-frequency transform coefficients included in the first subband are the union of the 20 high-frequency transform coefficients included in the second subband and the 20 high-frequency transform coefficients included in the third subband.

[0329] As an example, the formula for calculating the third average energy is as follows:

[0330] $\text{third average energy} = \dfrac{40 \cdot F'_{HB}(n) - 20 \cdot F'_{HB,e}(n)}{20}$

[0331] Here, $F'_{HB}(n)$ represents the first average energy corresponding to the first sub-band in each second combination, and $F'_{HB,e}(n)$ represents the second average energy corresponding to the second sub-band in each second combination.

[0332] As an example, since eight second combinations are obtained, and each second combination yields a third subband spectral envelope, eight third subband spectral envelopes can be obtained based on these eight second combinations.

[0333] The above method can accurately obtain the spectral envelopes of multiple third subbands.
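The energy bookkeeping behind step 20342 can be sketched as below. A first sub-band of 40 coefficients is the union of a 20-coefficient second sub-band and a 20-coefficient third sub-band, so the third average energy is (40 × first average energy − 20 × second average energy) / 20. The function name `third_envelopes` and its parameter names are hypothetical, and the sketch assumes the envelopes are linear average energies.

```python
def third_envelopes(first_env, second_env, n_first=40, n_second=20):
    """Recover the average energy of the coefficients not covered by the second sub-band."""
    n_third = n_first - n_second          # 20 in the example; equals the "third quantity"
    result = []
    for e1, e2 in zip(first_env, second_env):
        total_energy = n_first * e1       # first multiplication result
        second_energy = n_second * e2     # second multiplication result
        result.append((total_energy - second_energy) / n_third)
    return result
```

With 8 first sub-band envelopes and 8 second sub-band envelopes as input, the function returns 8 third sub-band envelopes, one per second combination.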

[0334] In step 20343, the multiple second subband spectrum envelopes and multiple third subband spectrum envelopes are determined as the fourth subband spectrum envelope.

[0335] As an example, after obtaining 8 third sub-band spectral envelopes, both the 8 second sub-band spectral envelopes and the 8 third sub-band spectral envelopes are determined as fourth sub-band spectral envelopes, that is, 16 fourth sub-band spectral envelopes are obtained.
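One way to arrange the 16 fourth sub-band spectral envelopes is sketched below. The source does not state the ordering. This sketch assumes each second sub-band is the lower half of its first sub-band and each third sub-band is the upper half, so the envelopes are interleaved in frequency order. The function name `fourth_envelopes` is hypothetical.

```python
def fourth_envelopes(second_env, third_env):
    """Interleave second and third envelopes: [s0, t0, s1, t1, ...] (assumed ordering)."""
    merged = []
    for e2, e3 in zip(second_env, third_env):
        merged.extend([e2, e3])
    return merged
```

Under this assumption, the i-th fourth envelope lines up with the i-th 20-coefficient second reference sub-band used by the second gain control.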

[0336] In step 20344, based on the multiple fourth sub-band spectral envelopes, a second gain control is performed on a first number of reference transformation coefficients included in the first high-frequency reference sub-band signal to obtain a first number of second gain reference transformation coefficients.

[0337] As an example, after obtaining 16 fourth sub-band spectral envelopes, a second gain control is performed on the 320 reference transform coefficients included in the first high-frequency reference sub-band signal based on these 16 fourth sub-band spectral envelopes, thereby obtaining 320 second gain reference transform coefficients.

[0338] See Figure 6G, which is a flowchart of an audio decoding method provided in an embodiment of this application. Based on Figure 6F, step 20344 shown in Figure 6G can be implemented through steps 203441-203446, which are described below with reference to Figure 6G.

[0339] In step 203441, the first number of reference transformation coefficients are divided into multiple second reference sub-bands.

[0340] As an example, when performing the second gain control, the 320 reference transform coefficients are first divided into second reference subbands equal in number to the fourth subband spectral envelopes, that is, into 16 second reference subbands, each of which includes 20 reference transform coefficients.

[0341] In step 203442, multiple different third combinations are generated, each third combination including a second reference subband and a fourth subband spectral envelope, and for each third combination, the following processing is performed: the fourth subband spectral envelope is determined as the fourth average energy.

[0342] As an example, after obtaining 16 second reference subbands, 16 third combinations are generated for the 16 second reference subbands and the 16 fourth subband spectral envelopes. Each third combination includes a second reference subband and a fourth subband spectral envelope. The second reference subbands and fourth subband spectral envelopes included in different third combinations are different.

[0343] For each of the 16 third combinations, the fourth subband spectral envelope is first determined as the fourth average energy.

[0344] In step 203443, the average value is calculated based on the third number of reference transformation coefficients in the second reference sub-band to obtain the second reference average energy corresponding to the second reference sub-band.

[0345] As an example, the third quantity here is the ratio of the first quantity to the number of multiple second reference subbands. In the case where the first quantity is 320 and the number of second reference subbands is 16, the third quantity is 20.

[0346] The second reference average energy is obtained by averaging the 20 reference transformation coefficients included in the second reference sub-band. The calculation method for the second reference average energy is similar to that described above and will not be repeated here.

[0347] In step 203444, the ratio of the fourth average energy to the second reference average energy is determined, and the square root of the ratio is taken to obtain the second scaling factor.

[0348] As an example, after obtaining the fourth average energy and the second reference average energy corresponding to the third combination, the ratio of the fourth average energy to the second reference average energy is calculated, and the square root of the ratio is taken to obtain the second scaling factor.

[0349] In step 203445, the third number of reference transformation coefficients in the second reference sub-band are multiplied by the second scaling factor to obtain the third number of second gain reference transformation coefficients in the second reference sub-band.

[0350] As an example, after obtaining the second scaling factor, the 20 reference transformation coefficients in the second reference sub-band are all multiplied by the second scaling factor to obtain 20 second gain reference transformation coefficients.

[0351] In step 203446, the third number of second gain reference transformation coefficients in each second reference sub-band are combined to obtain the first number of second gain reference transformation coefficients.

[0352] As an example, since 16 third combinations are obtained in step 203442, and each third combination calculates 20 second gain reference transformation coefficients for a second reference sub-band, the 20 second gain reference transformation coefficients for the second reference sub-bands corresponding to these 16 third combinations are combined to obtain 320 second gain reference transformation coefficients.

[0353] By performing gain control in the above manner, the spectral energy of the virtually generated reference transform coefficients at the decoding end can be made closer to the spectral energy of the original transform coefficients at the encoding end, thus facilitating the accurate generation of estimates for high-frequency subband signals.

[0354] In step 20345, the first number of second gain reference transformation coefficients are subjected to inverse frequency domain transformation to obtain the second high-frequency subband signal.

[0355] As an example, after obtaining 320 second-gain reference transform coefficients, performing an inverse frequency domain transform on these 320 coefficients yields the second high-frequency subband signal. It should be noted that this second high-frequency subband signal is an estimate of the high-frequency subband signal at the encoding end.

[0356] Since the inverse frequency domain transformation is performed based on the transform coefficients after gain control, the resulting second high-frequency subband signal is closer to the high-frequency subband signal at the encoding end, thereby improving the decoding quality of the high-frequency subband signal.

[0357] In step 204, the sub-band signals corresponding to multiple levels are combined into an audio signal.

[0358] As an example, after obtaining the sub-band signals corresponding to multiple levels, these signals are combined into an audio signal. This synthesis process can be achieved by calling a QMF synthesis filter.

[0359] As an example, the QMF synthesis filter bank can be described in terms of $H_{Low}(z)$ and $H_{High}(z)$ of the QMF analysis filter bank, as follows:

[0360] $G_{Low}(z) = H_{Low}(z)$ (Formula 3)

[0361] $G_{High}(z) = (-1) \cdot H_{High}(z)$ (Formula 4)

[0362] Here, $G_{Low}(z)$ represents the spectral response of the low-pass signal of the QMF synthesis filter, $H_{Low}(z)$ represents the spectral response of the low-pass signal of the QMF analysis filter, $G_{High}(z)$ represents the spectral response of the high-pass signal of the QMF synthesis filter, and $H_{High}(z)$ represents the spectral response of the high-pass signal of the QMF analysis filter.

[0363] The low-pass signal (i.e., the low-frequency subband signal) and high-pass signal (i.e., the high-frequency subband signal) with a sampling rate of Fs / 2 recovered by the decoding end can be reconstructed into a signal with the same sampling rate Fs as the original audio signal after being processed by the QMF synthesis filter bank.
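Formulas 3 and 4 can be illustrated with a toy two-channel filter bank. The sketch below uses a 2-tap Haar filter pair purely as an assumed example; the source does not give the actual analysis filter coefficients. The helper names `fir`, `upsample2`, `qmf_analysis`, and `qmf_synthesis` are hypothetical.

```python
import math

def fir(x, h):
    """Causal FIR filter with zero initial state; output has the same length as x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def upsample2(x):
    """Insert a zero after every sample (rate Fs/2 -> Fs)."""
    out = []
    for v in x:
        out.extend([v, 0.0])
    return out

def qmf_analysis(x, h_low, h_high):
    """Filter, then keep every second sample (rate Fs -> Fs/2)."""
    return fir(x, h_low)[::2], fir(x, h_high)[::2]

def qmf_synthesis(low, high, h_low, h_high):
    g_low = list(h_low)                 # Formula 3: G_Low(z) = H_Low(z)
    g_high = [-c for c in h_high]       # Formula 4: G_High(z) = (-1) * H_High(z)
    y_low = fir(upsample2(low), g_low)
    y_high = fir(upsample2(high), g_high)
    return [a + b for a, b in zip(y_low, y_high)]

s = 1.0 / math.sqrt(2.0)
h_low, h_high = [s, s], [s, -s]         # assumed Haar pair, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
low, high = qmf_analysis(x, h_low, h_high)
y = qmf_synthesis(low, high, h_low, h_high)   # approximately x delayed by one sample
```

With this filter pair the aliasing terms of the two branches cancel, and the output reproduces the input with a one-sample delay at the original sampling rate.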

[0364] In some embodiments, after obtaining the sub-band signals corresponding to the first level and the second level respectively, the sub-band signals corresponding to the two levels are combined into an audio signal in the following manner: the first low-frequency sub-band signal at the second sampling frequency is subjected to a third upsampling process to obtain a third upsampling result at the first sampling frequency; the first high-frequency sub-band signal at the second sampling frequency is subjected to a fourth upsampling process to obtain a fourth upsampling result at the first sampling frequency; the third upsampling result and the fourth upsampling result are combined to obtain an audio signal.

[0365] As an example, the first level corresponds to the first low-frequency sub-band signal. The first low-frequency sub-band signal at the second sampling rate (e.g., 16000Hz) is subjected to a third upsampling process using a QMF synthesis filter to obtain a third upsampling result at the first sampling frequency (e.g., 32000Hz). Here, the second sampling frequency is half of the first sampling frequency.

[0366] Correspondingly, the second level corresponds to the first high-frequency sub-band signal. The first high-frequency sub-band signal at the second sampling rate (e.g., 16000Hz) is subjected to a fourth upsampling process by a QMF synthesis filter to obtain the fourth upsampling result at the first sampling frequency (e.g., 32000Hz).

[0367] Then, the third upsampling result and the fourth upsampling result are combined to obtain the audio signal. It should be noted that the audio signal here is an estimate of the original audio signal at the encoding end.

[0368] The above method can be used to obtain accurate audio signals.

[0369] In some embodiments, after obtaining the sub-band signals corresponding to the first, second, and third levels respectively, the sub-band signals corresponding to these three levels are combined into an audio signal in the following manner: the first low-frequency sub-band signal and the second low-frequency sub-band signal are summed to obtain a first summation result; the first summation result at the second sampling frequency is subjected to a fifth upsampling process to obtain a fifth upsampling result at the first sampling frequency; the second high-frequency sub-band signal at the second sampling frequency is subjected to a sixth upsampling process to obtain a sixth upsampling result at the first sampling frequency; and the fifth upsampling result and the sixth upsampling result are combined to obtain the audio signal.

[0370] As an example, the third level corresponds to the second low-frequency sub-band signal. First, the first low-frequency sub-band signal corresponding to the first level and the second low-frequency sub-band signal corresponding to the third level are summed to obtain the first summation result, which is the high-precision estimate of the low-frequency sub-band signal. Second, the first summation result at the second sampling rate (e.g., 16000Hz) is subjected to a fifth upsampling process using a QMF synthesis filter to obtain the fifth upsampling result at the first sampling frequency (e.g., 32000Hz).

[0371] Since the third level also corresponds to the second high-frequency sub-band signal, and the second high-frequency sub-band signal has higher resolution and accuracy than the first high-frequency sub-band signal, synthesis processing is performed based on the second high-frequency sub-band signal. For example, a QMF synthesis filter is used to perform a sixth upsampling process on the second high-frequency sub-band signal at the second sampling rate (e.g., 16000Hz) to obtain a sixth upsampling result at the first sampling frequency (e.g., 32000Hz).

[0372] Then, the fifth upsampling result and the sixth upsampling result are combined to obtain the audio signal. It should be noted that the audio signal here is an estimate of the original audio signal at the encoding end.

[0373] The above method can be used to obtain accurate audio signals.
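The two synthesis cases described above differ only in which sub-band signals are fed to the QMF synthesis filter. A minimal sketch of that selection is given below. The function name `subbands_for_synthesis` and the use of `None` to mark a missing third level are assumptions made for illustration.

```python
def subbands_for_synthesis(low1, high1, low2=None, high2=None):
    """Pick the low/high sub-band signals to pass to QMF synthesis.

    low1, high1: first low- and high-frequency sub-band signals (levels 1 and 2).
    low2, high2: second low-frequency (residual) and second high-frequency
                 sub-band signals from level 3, or None if level 3 is absent.
    """
    if low2 is not None and high2 is not None:
        # Three levels: add the residual to the first low-frequency estimate
        # and use the higher-resolution second high-frequency signal.
        low = [a + b for a, b in zip(low1, low2)]
        return low, high2
    # Two levels: use the first-level and second-level signals directly.
    return low1, high1
```

Both returned signals are at the second sampling frequency (e.g., 16000Hz) and are then upsampled and combined by the synthesis filter bank.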

[0374] In this embodiment, decoding and feature reconstruction are performed in layers, and synthesis is performed based on the sub-band signals corresponding to each layer. Since each layer only needs to process a portion of the sub-band signals, decoding efficiency can be improved while decoding complexity is reduced.

[0375] The following continues the description of an exemplary structure in which the audio encoding device 433 provided in the embodiments of this application is implemented as software modules. In some embodiments, as shown in Figure 4A, the software modules of the audio encoding device 433 stored in the memory 430 may include: a decomposition module 4331, used to decompose the audio signal to obtain low-frequency sub-band signals and high-frequency sub-band signals; a feature extraction module 4332, used to perform multi-level feature extraction based on the low-frequency sub-band signals and high-frequency sub-band signals to obtain sub-band signal features corresponding to multiple levels; a quantization module 4333, used to quantize the sub-band signal features corresponding to each level to obtain the index value of the sub-band signal features; an encoding module 4334, used to encode the index value of the sub-band signal features to obtain the bitstream corresponding to the level; and a configuration module 4335, used to configure the corresponding transmission priority for the bitstreams corresponding to multiple levels; wherein the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the level.

[0376] In the above scheme, the decomposition module 4331 is used to sample the audio signal at a first sampling frequency to obtain a sampled signal; wherein the sampled signal includes multiple sample points sampled from the audio signal; the sampled signal is subjected to low-pass filtering, and the obtained low-pass filtering result is subjected to a first downsampling process to obtain a low-frequency sub-band signal at a second sampling frequency; the sampled signal is subjected to high-pass filtering, and the obtained high-pass filtering result is subjected to a second downsampling process to obtain a high-frequency sub-band signal at a second sampling frequency; wherein the second sampling frequency is half of the first sampling frequency, and the bandwidth of the low-frequency sub-band signal and the high-frequency sub-band signal is the same.

[0377] In the above scheme, multiple layers include a first layer and a second layer; the feature extraction module 4332 is used to perform the feature extraction processing of the first layer in the following manner: performing the first feature extraction processing based on the low-frequency sub-band signal to obtain the first low-frequency sub-band signal feature; wherein, the dimension of the first low-frequency sub-band signal feature is smaller than the dimension of the low-frequency sub-band signal; and performing the feature extraction processing of the second layer in the following manner: performing the second feature extraction processing based on the high-frequency sub-band signal to obtain the first high-frequency sub-band signal feature; wherein, the dimension of the first high-frequency sub-band signal feature is smaller than the dimension of the high-frequency sub-band signal.

[0378] In the above scheme, the feature extraction module 4332 is used to perform a first convolution process on the low-frequency subband signal to obtain a first convolution feature; perform a first pooling process on the first convolution feature to obtain a first pooling feature; perform a third downsampling process on the first pooling feature to obtain a third downsampling feature; wherein the third downsampling process includes multiple cascaded downsampling; and perform a second convolution process on the third downsampling feature to obtain a first low-frequency subband signal feature.

[0379] In the above scheme, the high-frequency sub-band signal includes a first number of sample points, where the first number is an integer greater than 2; the feature extraction module 4332 is used to perform frequency domain transformation processing based on the first number of sample points to obtain a first number of transformation coefficients; divide the first number of transformation coefficients into multiple first sub-bands; and perform the following processing for each first sub-band: perform averaging processing based on a second number of transformation coefficients in the first sub-band to obtain a first average energy corresponding to the first sub-band, and determine the first average energy as the first sub-band spectral envelope corresponding to the first sub-band; wherein, the second number is the ratio of the first number to the number of multiple first sub-bands; and determine the first sub-band spectral envelopes corresponding to the multiple first sub-bands as features of the first high-frequency sub-band signal.
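The encoder-side envelope extraction described above can be sketched as follows. The sketch assumes that "average energy" is the mean of the squared transform coefficients in a sub-band; the function name `spectral_envelopes` is hypothetical.

```python
def spectral_envelopes(coeffs, num_bands=8):
    """Split the coefficients into equal sub-bands and return one average energy per band."""
    band_size = len(coeffs) // num_bands          # second number, e.g. 320 // 8 = 40
    envelopes = []
    for i in range(num_bands):
        band = coeffs[i * band_size:(i + 1) * band_size]
        # Assumption: average energy = mean of squared coefficients.
        envelopes.append(sum(c * c for c in band) / band_size)
    return envelopes
```

For 320 transform coefficients and 8 first sub-bands, the result is a vector of 8 values, which is far smaller than the 320-sample high-frequency sub-band signal it summarizes.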

[0380] In the above scheme, the feature extraction module 4332 is used to obtain the reference frame high-frequency sub-band signal corresponding to the reference frame audio signal; wherein, the reference frame audio signal is the previous frame or the next frame of the audio signal, and the reference frame high-frequency sub-band signal includes a first number of sample points; based on the first number of sample points in the reference frame high-frequency sub-band signal and the first number of sample points in the high-frequency sub-band signal, discrete cosine transform processing is performed on the first number of sample points included in the high-frequency sub-band signal to obtain the transform coefficients corresponding to the first number of sample points included in the high-frequency sub-band signal respectively.
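A transform that takes the sample points of two adjacent frames and yields one coefficient per sample point of the current frame is commonly realized as a modified discrete cosine transform (MDCT). The source only says "discrete cosine transform processing", so treating it as an MDCT, omitting any window, and placing the reference frame before the current frame are all assumptions of this sketch. The function name `mdct` is hypothetical.

```python
import math

def mdct(ref_frame, cur_frame):
    """Direct-form MDCT: 2N input samples (reference + current frame) -> N coefficients."""
    n = len(cur_frame)
    x = list(ref_frame) + list(cur_frame)         # assumed order: reference frame first
    coeffs = []
    for k in range(n):
        acc = 0.0
        for t in range(2 * n):
            acc += x[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
        coeffs.append(acc)
    return coeffs
```

With two 320-sample frames as input, the function returns 320 coefficients, matching the "first number" of transform coefficients used in the sub-band division. A practical codec would normally apply a window and use a fast algorithm rather than this quadratic-time loop.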

[0381] In the above scheme, multiple layers also include a third layer; the feature extraction module 4332 is used to perform the feature extraction processing of the third layer in the following manner: perform the third feature extraction processing based on the low-frequency sub-band signal and the first low-frequency sub-band signal features to obtain the second low-frequency sub-band signal features; perform the fourth feature extraction processing based on the first high-frequency sub-band signal features to obtain the second high-frequency sub-band signal features.

[0382] In the above scheme, the feature extraction module 4332 is used to concatenate the low-frequency sub-band signal and the first low-frequency sub-band signal features to obtain concatenated features; to perform a third convolution on the concatenated features to obtain a third convolution feature; to perform a second pooling on the third convolution feature to obtain a second pooling feature; to perform a fourth downsampling on the second pooling feature to obtain a fourth downsampling feature; wherein the fourth downsampling includes multiple cascaded downsampling; and to perform a fourth convolution on the fourth downsampling feature to obtain a second low-frequency sub-band signal feature; wherein the dimension of the second low-frequency sub-band signal feature is smaller than the dimension of the first low-frequency sub-band signal feature.

[0383] In the above scheme, the feature extraction module 4332 is used to perform the following processing for each first sub-band corresponding to the first high-frequency sub-band signal features: select a third number of transformation coefficients from the second number of transformation coefficients in the first sub-band, and determine the third number of transformation coefficients as the second sub-band; wherein the third number is half of the second number; perform averaging processing based on the third number of transformation coefficients in the second sub-band to obtain the second average energy corresponding to the second sub-band, and determine the second average energy as the second sub-band spectral envelope corresponding to the second sub-band; determine the second sub-band spectral envelope corresponding to each second sub-band as the second high-frequency sub-band signal feature.
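The per-sub-band processing above can be sketched as follows. This is an illustrative sketch only: the source does not say which half of the coefficients is selected or how the average energy is computed, so the choice of the lower half and of the mean of squared coefficients are assumptions, and the names `average_energy` and `second_envelopes` are hypothetical.

```python
def average_energy(coeffs):
    """Mean of squared coefficients (assumed definition of average energy)."""
    return sum(c * c for c in coeffs) / len(coeffs)


def second_envelopes(first_sub_bands):
    """Return one second sub-band spectral envelope per first sub-band."""
    envelopes = []
    for band in first_sub_bands:
        third_number = len(band) // 2       # half of the second number
        second_band = band[:third_number]   # assumption: keep the lower half
        envelopes.append(average_energy(second_band))
    return envelopes
```

For a first sub-band holding the coefficients 1, 3, 5, 7, the second sub-band under these assumptions is 1, 3 and its envelope is (1 + 9) / 2 = 5.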

[0384] In the above scheme, multiple layers include a first layer and a second layer, and the sub-band signal feature corresponding to the first layer is the first low-frequency sub-band signal feature, and the sub-band signal feature corresponding to the second layer is the first high-frequency sub-band signal feature; the encoding module 4334 is used to encode the index value of the first low-frequency signal feature to obtain the first bitstream corresponding to the first layer; and to encode the index value of the first high-frequency signal feature to obtain the second bitstream corresponding to the second layer; wherein, the bit rate of the first bitstream is greater than the bit rate of the second bitstream, and the bit rate is positively correlated with the decoding quality index of the corresponding bitstream.

[0385] In the above scheme, multiple layers include a third layer, and the sub-band signal features corresponding to the third layer are the second low-frequency sub-band signal features and the second high-frequency sub-band signal features; the encoding module 4334 is used to encode the index value of the second low-frequency signal feature to obtain the third bitstream corresponding to the third layer; and to encode the index value of the second high-frequency signal feature to obtain the fourth bitstream corresponding to the third layer; wherein, the bit rate of the first bitstream is greater than the bit rate of the third bitstream, the bit rate of the third bitstream is greater than the bit rate of the second bitstream, the bit rate of the second bitstream is equal to the bit rate of the fourth bitstream, and the bit rate is positively correlated with the decoding quality index of the corresponding bitstream.

[0386] In the above scheme, multiple layers include a first layer and a second layer. The first bitstream corresponding to the first layer is obtained by encoding the index value of the first low-frequency signal feature, and the second bitstream corresponding to the second layer is obtained by encoding the index value of the first high-frequency signal feature. The configuration module 4335 is used to configure a first transmission priority for the first bitstream corresponding to the first layer and configure a second transmission priority for the second bitstream corresponding to the second layer. The first transmission priority is higher than the second transmission priority.

[0387] In the above scheme, multiple layers include a third layer, and the third layer corresponds to a third bitstream and a fourth bitstream. The third bitstream is obtained by encoding the index value of the second low-frequency signal feature, and the fourth bitstream is obtained by encoding the index value of the second high-frequency signal feature. The configuration module 4335 is used to configure a third transmission priority for the third bitstream corresponding to the third layer and to configure a fourth transmission priority for the fourth bitstream corresponding to the third layer. Among them, the first transmission priority is higher than the third transmission priority, the third transmission priority is higher than the second transmission priority, and the second transmission priority is the same as the fourth transmission priority.
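The priority ordering above (first higher than third, third higher than second, second equal to fourth) can be sketched as a small configuration step. The sketch is illustrative: the `Bitstream` type, the convention that a smaller number means a higher priority, and the greedy `select_for_budget` policy are assumptions introduced here; the source specifies only the relative ordering of the priorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bitstream:
    name: str
    layer: int
    priority: int   # hypothetical convention: smaller value is sent first
    size_bits: int


def configure_priorities(sizes):
    """Attach the priority ordering first > third > second == fourth."""
    ranks = {"first": 0, "third": 1, "second": 2, "fourth": 2}
    layers = {"first": 1, "second": 2, "third": 3, "fourth": 3}
    return [Bitstream(n, layers[n], ranks[n], sizes[n]) for n in ranks]


def select_for_budget(streams, budget_bits):
    """Greedy selection in priority order (an assumed policy, not from the source)."""
    sent, used = [], 0
    for s in sorted(streams, key=lambda s: s.priority):
        if used + s.size_bits <= budget_bits:
            sent.append(s.name)
            used += s.size_bits
    return sent
```

With hypothetical sizes of 100, 20, 60 and 20 bits for the first to fourth bitstreams, a budget of 170 bits keeps only the first and third bitstreams, while a budget of 200 bits keeps all four.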

[0388] The following continues to describe an exemplary structure in which the audio decoding device 473 provided in the embodiments of this application is implemented as software modules. In some embodiments, as shown in Figure 4B, the software modules of the audio decoding device 473 stored in the memory 470 may include: a decoding module 4731, used to decode the bitstreams corresponding to multiple levels respectively, to obtain the index value of the bitstream corresponding to each level; wherein different levels correspond to different transmission priorities, and the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the level; an inverse quantization module 4732, used to perform inverse quantization processing on the index value of the bitstream corresponding to each level respectively, to obtain the sub-band signal features corresponding to each level; a feature reconstruction module 4733, used to perform feature reconstruction processing on the sub-band signal features corresponding to each level respectively, to obtain the sub-band signal corresponding to each level respectively; and a synthesis module 4734, used to synthesize the sub-band signals corresponding to multiple levels into an audio signal.
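The relationship between index values and sub-band signal features handled by the inverse quantization module can be illustrated with a table lookup. This is a minimal sketch assuming a scalar codebook with nearest-neighbor quantization; the source does not specify the quantizer, and the codebook values and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(features, codebook):
    """Encoder side: map each feature value to the index of the nearest codeword."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - f))
        for f in features
    ]


def dequantize(indices, codebook):
    """Decoder side: inverse quantization as a codebook lookup."""
    return [codebook[i] for i in indices]
```

The decoder needs only the index values and a copy of the same codebook to recover the approximate feature values.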

[0389] In the above scheme, multiple layers include a first layer and a second layer. The first layer corresponds to the first low-frequency sub-band signal features, and the second layer corresponds to the first high-frequency sub-band signal features. The feature reconstruction module 4733 is used to perform feature reconstruction processing of the first layer in the following manner: performing first feature reconstruction processing based on the first low-frequency sub-band signal features to obtain the first low-frequency sub-band signal; wherein, the dimension of the first low-frequency sub-band signal is greater than the dimension of the first low-frequency sub-band signal features. The feature reconstruction processing of the second layer is performed in the following manner: performing second feature reconstruction processing based on the first high-frequency sub-band signal features to obtain the first high-frequency sub-band signal; wherein, the dimension of the first high-frequency sub-band signal is greater than the dimension of the first high-frequency sub-band signal features.

[0390] In the above scheme, the feature reconstruction module 4733 is used to perform a first convolution process on the features of the first low-frequency sub-band signal to obtain a first convolution feature; perform a first upsampling process on the first convolution feature to obtain a first upsampled feature; wherein the first upsampling process includes multiple cascaded upsampling; perform a first pooling process on the first upsampled feature to obtain a first pooling feature; and perform a second convolution process on the first pooling feature to obtain the first low-frequency sub-band signal.

[0391] In the above scheme, the feature reconstruction module 4733 is used to perform frequency domain transformation processing on a first number of sample points included in the first low-frequency sub-band signal to obtain a first number of low-frequency transformation coefficients; perform spectrum copying twice based on the latter half of the first number of low-frequency transformation coefficients to obtain a first number of reference transformation coefficients included in the first high-frequency reference sub-band signal; wherein the frequency corresponding to the latter half of the low-frequency transformation coefficients is higher than the frequency corresponding to the former half of the low-frequency transformation coefficients; based on the multiple first sub-band spectral envelopes corresponding to the first high-frequency sub-band signal features, perform first gain control on the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal to obtain a first number of first gain reference transformation coefficients; and perform inverse frequency domain transformation processing on the first number of first gain reference transformation coefficients to obtain the first high-frequency sub-band signal.
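The spectrum copying step can be sketched as follows: the upper half of the low-frequency coefficients is copied twice, so that the first number of reference coefficients is obtained. Placing the two copies back to back is an assumption made for illustration, and the function name `replicate_spectrum` is hypothetical.

```python
def replicate_spectrum(low_coeffs):
    """Build N reference coefficients from the upper half of N low-band coefficients."""
    n = len(low_coeffs)
    if n % 2 != 0:
        raise ValueError("the first number of coefficients must be even")
    upper_half = list(low_coeffs[n // 2:])   # the higher-frequency half
    return upper_half + upper_half           # two copies, assumed to be concatenated
```

The output has the same length as the input, so the gain control that follows operates on the first number of reference coefficients.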

[0392] In the above scheme, the feature reconstruction module 4733 is used to divide a first number of reference transformation coefficients into multiple first reference sub-bands; wherein the number of first reference sub-bands is the same as the number of first sub-band spectral envelopes; generate multiple different first combinations, each first combination including a first reference sub-band and a first sub-band spectral envelope, and perform the following processing for each first combination: determine the first sub-band spectral envelope as the first average energy; perform averaging processing based on a second number of reference transformation coefficients in the first reference sub-band to obtain the first reference average energy corresponding to the first reference sub-band; wherein the second number is the ratio of the first number to the number of multiple first reference sub-bands; determine the ratio of the first average energy to the first reference average energy, and perform square root processing on the ratio to obtain a first scaling factor; multiply the second number of reference transformation coefficients in the first reference sub-band by the first scaling factor respectively to obtain a second number of first gain reference transformation coefficients in the first reference sub-band; merge the second number of first gain reference transformation coefficients in each first reference sub-band to obtain a first number of first gain reference transformation coefficients.
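The gain control above scales each reference sub-band so that its average energy matches the transmitted spectral envelope, using the square root of the energy ratio as the scaling factor. A minimal sketch follows, assuming that the average energy is the mean of squared coefficients; the guard for a zero-energy reference sub-band and the function name `apply_gain_control` are additions made here, not part of the source.

```python
import math


def apply_gain_control(ref_coeffs, envelopes):
    """Scale each reference sub-band to the average energy given by its envelope."""
    n_bands = len(envelopes)
    if n_bands == 0 or len(ref_coeffs) % n_bands != 0:
        raise ValueError("coefficients must divide evenly into the sub-bands")
    width = len(ref_coeffs) // n_bands            # the second number
    out = []
    for b, target_energy in enumerate(envelopes):
        band = ref_coeffs[b * width:(b + 1) * width]
        ref_energy = sum(c * c for c in band) / width
        # Scaling factor: square root of (target energy / reference energy).
        # The zero check is an added safeguard, not described in the source.
        scale = math.sqrt(target_energy / ref_energy) if ref_energy > 0 else 0.0
        out.extend(c * scale for c in band)
    return out
```

For reference coefficients 1, 1, 2, 2 split into two sub-bands with envelopes 4 and 1, the scaling factors are 2 and 0.5, giving 2, 2, 1, 1.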

[0393] In the above scheme, multiple layers also include a third layer, which corresponds to the second low-frequency sub-band signal features and the second high-frequency sub-band signal features; the feature reconstruction module 4733 is used to perform the feature reconstruction processing of the third layer in the following manner: perform the third feature reconstruction processing on the second low-frequency sub-band signal features to obtain the second low-frequency sub-band signal; perform the fourth feature reconstruction processing based on the first high-frequency sub-band signal features and the second high-frequency sub-band signal features to obtain the second high-frequency sub-band signal.

[0394] In the above scheme, the feature reconstruction module 4733 is used to perform a third convolution process on the features of the second low-frequency sub-band signal to obtain a third convolution feature; perform a second upsampling process on the third convolution feature to obtain a second upsampling feature; wherein the second upsampling process includes multiple cascaded upsampling; perform a second pooling process on the second upsampling feature to obtain a second pooling feature; and perform a fourth convolution process on the second pooling feature to obtain the second low-frequency sub-band signal; wherein the second low-frequency sub-band signal is the residual signal of the first low-frequency sub-band signal.

[0395] In the above scheme, the feature reconstruction module 4733 is used to determine multiple first sub-bands corresponding to the features of the first high-frequency sub-band signal, and to determine multiple second sub-bands corresponding to the features of the second high-frequency sub-band signal; wherein the number of multiple first sub-bands is the same as the number of multiple second sub-bands; based on the spectral envelopes of the first sub-bands and the spectral envelopes of the second sub-bands respectively, the spectral envelopes of the third sub-bands respectively are determined; wherein the number of multiple third sub-bands is the same as the number of multiple second sub-bands; the spectral envelopes of the second and third sub-bands are determined as the spectral envelopes of the fourth sub-band; based on the spectral envelopes of the fourth sub-bands, a second gain control is performed on a first number of reference transformation coefficients included in the first high-frequency reference sub-band signal to obtain a first number of second gain reference transformation coefficients; the first number of second gain reference transformation coefficients are subjected to inverse frequency domain transformation processing to obtain the second high-frequency sub-band signal.

[0396] In the above scheme, the feature reconstruction module 4733 is used to generate multiple different second combinations. Each second combination includes a first sub-band and the second sub-band corresponding to that first sub-band. For each second combination, the following processing is performed: the first sub-band spectral envelope corresponding to the first sub-band is determined as the first average energy, and the second sub-band spectral envelope corresponding to the second sub-band is determined as the second average energy; the first average energy is multiplied by a second quantity to obtain a first multiplication result; wherein the second quantity is the number of high-frequency transform coefficients included in the first sub-band; the second average energy is multiplied by a third quantity to obtain a second multiplication result; wherein the third quantity is the number of high-frequency transform coefficients included in the second sub-band, and the third quantity is half of the second quantity; the second multiplication result is subtracted from the first multiplication result, the ratio of the subtraction result to the third quantity is determined as the third average energy corresponding to the third sub-band, and the third average energy is determined as the third sub-band spectral envelope; wherein the high-frequency transform coefficients included in the first sub-band are the union of the following two: the high-frequency transform coefficients included in the second sub-band and the high-frequency transform coefficients included in the third sub-band.
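Because the first sub-band is the union of the second and third sub-bands, the total energy of the third sub-band is what remains after the total energy of the second sub-band is removed from that of the first. The following sketch computes the third sub-band spectral envelope on that basis, taking the first product minus the second product so that the result is non-negative. The function name `third_envelope` is hypothetical, and the average energy is assumed to be the mean of squared coefficients.

```python
def third_envelope(first_average_energy, second_average_energy, second_quantity):
    """Derive the third sub-band envelope from the first and second envelopes."""
    if second_quantity % 2 != 0:
        raise ValueError("the second quantity must be even")
    third_quantity = second_quantity // 2
    first_total = first_average_energy * second_quantity     # first multiplication result
    second_total = second_average_energy * third_quantity    # second multiplication result
    return (first_total - second_total) / third_quantity
```

As a check, a first sub-band with coefficients 1, 3, 5, 7 has an average energy of 21, and its second sub-band 1, 3 has an average energy of 5. The sketch gives (21 × 4 − 5 × 2) / 2 = 37, which equals the average energy of the remaining coefficients 5, 7.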

[0397] In the above scheme, the feature reconstruction module 4733 is used to divide the first number of reference transformation coefficients into multiple second reference sub-bands; wherein the number of second reference sub-bands is the same as the number of fourth sub-band spectral envelopes;

[0398] Multiple distinct third combinations are generated, each including a second reference sub-band and a fourth sub-band spectral envelope. For each third combination, the following processing is performed: the fourth sub-band spectral envelope is determined as the fourth average energy; the average of the third number of reference transformation coefficients in the second reference sub-band is calculated to obtain the second reference average energy corresponding to the second reference sub-band; where the third number is the ratio of the first number to the number of multiple second reference sub-bands; the ratio of the fourth average energy to the second reference average energy is determined, and the square root of the ratio is taken to obtain the second scaling factor; the third number of reference transformation coefficients in the second reference sub-band are multiplied by the second scaling factor to obtain the third number of second gain reference transformation coefficients in the second reference sub-band; the third number of second gain reference transformation coefficients in each second reference sub-band are combined to obtain the first number of second gain reference transformation coefficients.

[0399] In the above scheme, multiple layers include a first layer and a second layer. The first layer corresponds to the first low-frequency sub-band signal, and the second layer corresponds to the first high-frequency sub-band signal. The synthesis module 4734 is used to perform a third upsampling process on the first low-frequency sub-band signal at the second sampling frequency to obtain a third upsampling result at the first sampling frequency. The second sampling frequency is half of the first sampling frequency. The first high-frequency sub-band signal at the second sampling frequency is then subjected to a fourth upsampling process to obtain a fourth upsampling result at the first sampling frequency. The third upsampling result and the fourth upsampling result are then synthesized to obtain an audio signal.
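The synthesis above brings two sub-band signals at the second sampling frequency back to a single signal at the first sampling frequency. The following sketch illustrates the idea with a two-band Haar filter pair, which is a deliberately simple stand-in: the source does not specify the synthesis filters, and a practical codec would typically use longer quadrature mirror filters. The function names `analyze` and `synthesize` are hypothetical.

```python
import math

_SQRT2 = math.sqrt(2.0)


def analyze(samples):
    """Split a signal into low and high sub-bands at half the sampling rate."""
    if len(samples) % 2 != 0:
        raise ValueError("expected an even number of samples")
    pairs = [(samples[i], samples[i + 1]) for i in range(0, len(samples), 2)]
    low = [(a + b) / _SQRT2 for a, b in pairs]
    high = [(a - b) / _SQRT2 for a, b in pairs]
    return low, high


def synthesize(low, high):
    """Upsample both sub-bands by two and combine them into one signal."""
    out = []
    for l, h in zip(low, high):
        out.append((l + h) / _SQRT2)
        out.append((l - h) / _SQRT2)
    return out
```

With this filter pair, `synthesize(*analyze(x))` reproduces `x` up to floating-point rounding, and each sub-band holds half as many samples as the full-rate signal.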

[0400] In the above scheme, multiple layers also include a third layer, which corresponds to the second low-frequency sub-band signal and the second high-frequency sub-band signal; the synthesis module 4734 is used to sum the first low-frequency sub-band signal and the second low-frequency sub-band signal to obtain a first summation result; to perform a fifth upsampling process on the first summation result at the second sampling frequency to obtain a fifth upsampling result at the first sampling frequency; to perform a sixth upsampling process on the second high-frequency sub-band signal at the second sampling frequency to obtain a sixth upsampling result at the first sampling frequency; and to synthesize the fifth upsampling result and the sixth upsampling result to obtain an audio signal.

[0401] This application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the audio encoding and audio decoding methods described above in this application.

[0402] This application provides a computer-readable storage medium storing executable instructions. When the executable instructions are executed by a processor, the processor will execute the audio encoding method and audio decoding method provided in this application.

[0403] In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above-mentioned memories.

[0404] In some embodiments, executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

[0405] As an example, executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple collaborating files (e.g., files that store one or more modules, subroutines, or code sections).

[0406] As an example, executable instructions can be deployed to execute on a single computing device, or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.

[0407] In summary, in this embodiment, by acquiring the sub-band signal features of the audio sub-band signal at each level in a layered manner, and encoding the corresponding sub-band signal features at each level in a layered manner, each level only needs to encode specific sub-band signal features, instead of encoding the features of the entire audio signal. This not only improves the efficiency of encoding and decoding but also reduces the encoding and decoding complexity of each level. According to the importance of different levels of bitstream to decoding quality, different transmission priorities can be flexibly configured for different levels of bitstream, ensuring that more important bitstreams are transmitted first, which is applicable to application scenarios with different network bandwidths.

[0408] Furthermore, in this embodiment, decoding and feature reconstruction are performed in layers, and synthesis is performed based on the sub-band signals corresponding to each layer. Since each layer only needs to process a portion of the sub-band signals, decoding efficiency can be improved while decoding complexity is reduced.

[0409] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.

Claims

1. An audio encoding method, characterized in that, The method includes: The audio signal is decomposed to obtain low-frequency subband signals and high-frequency subband signals; Based on the low-frequency sub-band signal and the high-frequency sub-band signal, feature extraction processing is performed at multiple levels to obtain the sub-band signal features corresponding to the multiple levels respectively. Based on the sub-band signal characteristics corresponding to the multiple levels, the code streams corresponding to the multiple levels are determined, wherein the code stream of the first level includes the first code stream corresponding to the low-frequency sub-band signal, the code stream of the second level includes the second code stream corresponding to the high-frequency sub-band signal, and the code stream of the third level includes the third code stream corresponding to the low-frequency sub-band signal and the fourth code stream corresponding to the high-frequency sub-band signal. Configure the corresponding transmission priority for the bitstreams corresponding to the multiple layers.

2. The method according to claim 1, characterized in that, The transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer.

3. The method according to claim 1, characterized in that, The plurality of layers includes the first layer, the second layer, and the third layer; The process of performing multi-level feature extraction based on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain sub-band signal features corresponding to each of the multiple levels includes: The first-level feature extraction process is performed as follows: based on the low-frequency subband signal, a first feature extraction process is performed to obtain a first low-frequency subband signal feature; wherein, the dimension of the first low-frequency subband signal feature is smaller than the dimension of the low-frequency subband signal. The second-level feature extraction process is performed as follows: based on the high-frequency subband signal, a second feature extraction process is performed to obtain a first high-frequency subband signal feature; wherein, the dimension of the first high-frequency subband signal feature is smaller than the dimension of the high-frequency subband signal; The third-level feature extraction process is performed as follows: based on the low-frequency sub-band signal and the features of the first low-frequency sub-band signal, the third feature extraction process is performed to obtain the features of the second low-frequency sub-band signal. Based on the first high-frequency sub-band signal features, a fourth feature extraction process is performed to obtain the second high-frequency sub-band signal features.

4. The method according to claim 3, characterized in that, The first feature extraction process based on the low-frequency sub-band signal to obtain the first low-frequency sub-band signal features includes: The low-frequency subband signal is subjected to a first convolution process to obtain a first convolution feature; The first convolutional feature is subjected to a first pooling process to obtain the first pooling feature; The first pooling feature is subjected to a third downsampling process to obtain a third downsampling feature; wherein, the third downsampling process includes multiple cascaded downsampling operations; The third downsampling feature is subjected to a second convolution process to obtain the first low-frequency subband signal feature.

5. The method according to claim 3 or 4, characterized in that, The high-frequency subband signal includes a first number of sample points, where the first number is an integer greater than 2; The second feature extraction process based on the high-frequency sub-band signal to obtain the first high-frequency sub-band signal features includes: Based on the first number of sample points, frequency domain transformation is performed to obtain the first number of transformation coefficients; Divide the first number of transformation coefficients into multiple first sub-bands; For each of the first sub-bands, the following processing is performed: averaging is performed based on the second number of transformation coefficients in the first sub-band to obtain the first average energy corresponding to the first sub-band, and the first average energy is determined as the first sub-band spectral envelope corresponding to the first sub-band; wherein, the second number is the ratio of the first number to the number of the plurality of first sub-bands; The spectral envelopes of the first sub-bands corresponding to the plurality of first sub-bands are determined as the first high-frequency sub-band signal features.

6. The method according to claim 5, characterized in that, The step of performing frequency domain transformation processing on the first number of sample points to obtain the first number of transformation coefficients includes: Obtain the reference frame high-frequency sub-band signal corresponding to the reference frame audio signal; wherein, the reference frame audio signal is the previous frame or the next frame of the audio signal, and the reference frame high-frequency sub-band signal includes the first number of sample points; Based on the first number of sample points in the reference frame high-frequency subband signal and the first number of sample points in the high-frequency subband signal, discrete cosine transform processing is performed on the first number of sample points included in the high-frequency subband signal to obtain the transform coefficients corresponding to the first number of sample points included in the high-frequency subband signal.

7. The method according to claim 3, characterized in that, The step of performing a third feature extraction process based on the low-frequency sub-band signal and the features of the first low-frequency sub-band signal to obtain the features of the second low-frequency sub-band signal includes: The low-frequency sub-band signal and the features of the first low-frequency sub-band signal are spliced together to obtain spliced features; The spliced features are subjected to a third convolution process to obtain the third convolution feature; The third convolution feature is subjected to a second pooling process to obtain the second pooling feature; The second pooling feature is subjected to a fourth downsampling process to obtain a fourth downsampling feature; wherein, the fourth downsampling process includes multiple cascaded downsampling operations; The fourth downsampling feature is subjected to a fourth convolution process to obtain the second low-frequency subband signal feature; wherein the dimension of the second low-frequency subband signal feature is smaller than the dimension of the first low-frequency subband signal feature.

8. The method according to claim 5, characterized in that, The step of performing a fourth feature extraction process based on the first high-frequency sub-band signal features to obtain the second high-frequency sub-band signal features includes: For each of the first sub-bands corresponding to the characteristics of the first high-frequency sub-band signal, the following processing is performed: From the second number of transformation coefficients in the first sub-band, a third number of transformation coefficients are selected, and the third number of transformation coefficients is determined as the second sub-band; wherein, the third number is one-half of the second number; The average value is calculated based on the third number of transformation coefficients in the second sub-band to obtain the second average energy corresponding to the second sub-band, and the second average energy is determined as the spectral envelope of the second sub-band corresponding to the second sub-band. The spectral envelope of each second sub-band is determined as the second high-frequency sub-band signal feature.

9. The method according to claim 1, characterized in that, The multiple layers include the first layer, the second layer and the third layer, and the sub-band signal feature corresponding to the first layer is the first low-frequency sub-band signal feature, the sub-band signal feature corresponding to the second layer is the first high-frequency sub-band signal feature, and the sub-band signal feature corresponding to the third layer is the second low-frequency sub-band signal feature and the second high-frequency sub-band signal feature. The step of determining the code stream corresponding to each of the multiple levels based on the sub-band signal features of each level includes: The index value of the first low-frequency signal feature is encoded to obtain the first code stream corresponding to the first level; The index value of the first high-frequency signal feature is encoded to obtain the second code stream corresponding to the second level; Wherein, the bitrate of the first bitstream is greater than the bitrate of the second bitstream, and the bitrate is positively correlated with the decoding quality index of the corresponding bitstream; The index value of the second low-frequency signal feature is encoded to obtain the third code stream corresponding to the third level; The index value of the second high-frequency signal feature is encoded to obtain the fourth code stream corresponding to the third level; Wherein, the bitrate of the first bitstream is greater than the bitrate of the third bitstream, the bitrate of the third bitstream is greater than the bitrate of the second bitstream, the bitrate of the second bitstream is equal to the bitrate of the fourth bitstream, and the bitrate is positively correlated with the decoding quality index of the corresponding bitstream.

10. The method according to claim 1, characterized in that, The step of determining the code stream corresponding to each of the multiple levels based on the sub-band signal features of each level includes: The sub-band signal features corresponding to each level are quantized to obtain the index value of the sub-band signal features; The index values of the sub-band signal features are encoded to obtain the code stream corresponding to the level.

11. The method according to claim 1, characterized in that, The multiple layers include the first layer, the second layer, and the third layer. The first bitstream corresponding to the first layer is obtained by encoding the index value of the first low-frequency signal feature. The second bitstream corresponding to the second layer is obtained by encoding the index value of the first high-frequency signal feature. The third layer corresponds to the third bitstream and the fourth bitstream. The third bitstream is obtained by encoding the index value of the second low-frequency signal feature, and the fourth bitstream is obtained by encoding the index value of the second high-frequency signal feature. The configuration of transmission priorities for the bitstreams corresponding to the multiple layers includes: Configure a first transmission priority for the first bitstream corresponding to the first level; Configure a second transmission priority for the second bitstream corresponding to the second level; wherein, the first transmission priority is higher than the second transmission priority; Configure a third transmission priority for the third bitstream corresponding to the third level; Configure a fourth transmission priority for the fourth bitstream corresponding to the third level; Wherein, the first transmission priority is higher than the third transmission priority, the third transmission priority is higher than the second transmission priority, and the second transmission priority is the same as the fourth transmission priority.

12. The method according to claim 1, characterized in that decomposing the audio signal to obtain the low-frequency sub-band signal and the high-frequency sub-band signal includes: sampling the audio signal at a first sampling frequency to obtain a sampled signal, wherein the sampled signal includes multiple sample points sampled from the audio signal; performing low-pass filtering on the sampled signal, and performing a first downsampling process on the low-pass filtering result to obtain the low-frequency sub-band signal at a second sampling frequency; and performing high-pass filtering on the sampled signal, and performing a second downsampling process on the high-pass filtering result to obtain the high-frequency sub-band signal at the second sampling frequency; wherein the second sampling frequency is half of the first sampling frequency, and the low-frequency sub-band signal and the high-frequency sub-band signal have the same bandwidth.
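
The filter-then-downsample structure of claim 12 can be sketched with the simplest possible two-band split. The two-tap averaging and differencing filters below are an assumption chosen to keep the example short; the claim does not name the filters, and a real codec would typically use longer quadrature mirror filters.

```python
from typing import List, Sequence, Tuple

def fir_filter(x: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k], zero initial state."""
    return [
        sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
        for n in range(len(x))
    ]

def decimate_by_2(x: Sequence[float]) -> List[float]:
    """Keep every second sample, halving the sampling frequency."""
    return list(x[1::2])

def analyze(x: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split x into low and high sub-bands at half the input sampling rate."""
    low = decimate_by_2(fir_filter(x, [0.5, 0.5]))    # low-pass, then downsample
    high = decimate_by_2(fir_filter(x, [0.5, -0.5]))  # high-pass, then downsample
    return low, high

flat_low, flat_high = analyze([2.0, 2.0, 2.0, 2.0])   # constant (low-frequency) input
alt_low, alt_high = analyze([1.0, -1.0, 1.0, -1.0])   # alternating (high-frequency) input
```

A constant input lands entirely in the low sub-band and an alternating input entirely in the high sub-band, and each sub-band has half as many samples as the input.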

13. An audio decoding method, characterized in that the method includes: determining, based on the code streams corresponding to multiple levels, the sub-band signal features corresponding to each level, wherein different levels correspond to different transmission priorities, the code stream of the first level includes the first code stream corresponding to the low-frequency sub-band signal, the code stream of the second level includes the second code stream corresponding to the high-frequency sub-band signal, and the code stream of the third level includes the third code stream corresponding to the low-frequency sub-band signal and the fourth code stream corresponding to the high-frequency sub-band signal; performing feature reconstruction processing on the sub-band signal features corresponding to each level to obtain the sub-band signal corresponding to each level; and combining the sub-band signals corresponding to the multiple levels into an audio signal.

14. The method according to claim 13, characterized in that the transmission priority is positively correlated with the decoding quality index of the bitstream corresponding to the layer.

15. The method according to claim 13, characterized in that the multiple layers include a first layer, a second layer and a third layer, wherein the first layer corresponds to the first low-frequency sub-band signal features, the second layer corresponds to the first high-frequency sub-band signal features, and the third layer corresponds to the second low-frequency sub-band signal features and the second high-frequency sub-band signal features; and performing feature reconstruction processing on the sub-band signal features corresponding to each layer to obtain the sub-band signal corresponding to each layer includes: performing first-layer feature reconstruction, in which a first feature reconstruction process is performed based on the first low-frequency sub-band signal features to obtain the first low-frequency sub-band signal, wherein the dimension of the first low-frequency sub-band signal is greater than the dimension of the first low-frequency sub-band signal features; performing second-layer feature reconstruction, in which a second feature reconstruction process is performed based on the first high-frequency sub-band signal features to obtain the first high-frequency sub-band signal, wherein the dimension of the first high-frequency sub-band signal is greater than the dimension of the first high-frequency sub-band signal features; and performing third-layer feature reconstruction, in which a third feature reconstruction process is performed on the second low-frequency sub-band signal features to obtain the second low-frequency sub-band signal, and a fourth feature reconstruction process is performed based on the first high-frequency sub-band signal features and the second high-frequency sub-band signal features to obtain the second high-frequency sub-band signal.

16. The method according to claim 15, characterized in that performing the first feature reconstruction process based on the first low-frequency sub-band signal features to obtain the first low-frequency sub-band signal includes: performing a first convolution process on the first low-frequency sub-band signal features to obtain first convolution features; performing a first upsampling process on the first convolution features to obtain first upsampled features, wherein the first upsampling process includes multiple cascaded upsampling operations; performing a first pooling process on the first upsampled features to obtain first pooling features; and performing a second convolution process on the first pooling features to obtain the first low-frequency sub-band signal.
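
The order of operations in claim 16 (convolution, cascaded upsampling, pooling, convolution) and the resulting growth in dimension can be traced with a toy one-dimensional pipeline. Everything numeric here is hypothetical: the fixed smoothing kernel stands in for learned network weights, and the upsampling factors and pooling size are chosen only so that the output is longer than the input feature vector.

```python
from typing import List, Sequence

def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D convolution with zero padding so the output length equals the input length."""
    pad = len(kernel) // 2  # assumes an odd-length kernel
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]

def upsample_repeat(x: Sequence[float], factor: int) -> List[float]:
    """Upsample by repeating each value `factor` times."""
    return [v for v in x for _ in range(factor)]

def avg_pool(x: Sequence[float], size: int) -> List[float]:
    """Non-overlapping average pooling."""
    return [sum(x[i:i + size]) / size for i in range(0, len(x) - size + 1, size)]

def reconstruct(features: Sequence[float],
                factors: Sequence[int] = (2, 2, 2),
                pool: int = 2) -> List[float]:
    kernel = [0.25, 0.5, 0.25]           # placeholder for learned weights
    h = conv1d_same(features, kernel)    # first convolution
    for f in factors:                    # cascaded upsampling operations
        h = upsample_repeat(h, f)
    h = avg_pool(h, pool)                # pooling
    return conv1d_same(h, kernel)        # second convolution

signal = reconstruct([1.0, 2.0, 3.0, 4.0])
```

With these placeholder settings a 4-value feature vector becomes a 16-sample signal, matching the requirement in claim 15 that the reconstructed signal has a larger dimension than its features.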

17. The method according to claim 15 or 16, characterized in that performing the second feature reconstruction process based on the first high-frequency sub-band signal features to obtain the first high-frequency sub-band signal includes: performing frequency domain transformation processing based on the first number of sample points included in the first low-frequency sub-band signal to obtain the first number of low-frequency transformation coefficients; performing two spectrum copying processes based on the latter half of the first number of low-frequency transformation coefficients to obtain the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal, wherein the frequency of the sample points corresponding to the latter half of the low-frequency transformation coefficients is higher than the frequency of the sample points corresponding to the former half of the low-frequency transformation coefficients; performing first gain control on the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal, based on the multiple first sub-band spectral envelopes corresponding to the first high-frequency sub-band signal features, to obtain the first number of first gain reference transformation coefficients; and performing inverse frequency domain transformation on the first number of first gain reference transformation coefficients to obtain the first high-frequency sub-band signal.
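
The spectrum copying step in claim 17 can be sketched as follows, reading "two spectrum copying processes" as placing the upper half of the low-frequency coefficients twice in a row so that the reference spectrum again has the first number of coefficients. That reading, the function name and the even-length requirement are assumptions for this example.

```python
from typing import List, Sequence

def spectral_copy(low_coeffs: Sequence[float]) -> List[float]:
    """Build high-band reference coefficients from the upper half of the low band."""
    n = len(low_coeffs)
    if n % 2 != 0:
        raise ValueError("expected an even number of coefficients")
    upper_half = list(low_coeffs[n // 2:])  # the higher-frequency half
    return upper_half + upper_half          # copied twice: n coefficients again

reference = spectral_copy([1.0, 2.0, 3.0, 4.0])
```

The reference coefficients only supply spectral fine structure; their level is corrected afterwards by the gain control driven by the transmitted sub-band spectral envelopes.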

18. The method according to claim 17, characterized in that performing the first gain control on the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal, based on the multiple first sub-band spectral envelopes corresponding to the first high-frequency sub-band signal features, to obtain the first number of first gain reference transformation coefficients includes: dividing the first number of reference transformation coefficients into multiple first reference sub-bands, wherein the number of first reference sub-bands is the same as the number of first sub-band spectral envelopes; generating multiple distinct first combinations, each first combination including one first reference sub-band and one first sub-band spectral envelope, and performing the following processing for each first combination: determining the first sub-band spectral envelope as the first average energy; calculating an average over the second number of reference transformation coefficients in the first reference sub-band to obtain the first reference average energy corresponding to the first reference sub-band, wherein the second number is the ratio of the first number to the number of first reference sub-bands; determining the ratio of the first average energy to the first reference average energy, and taking the square root of the ratio to obtain the first scaling factor; and multiplying the second number of reference transformation coefficients in the first reference sub-band by the first scaling factor to obtain the second number of first gain reference transformation coefficients in the first reference sub-band; and combining the second number of first gain reference transformation coefficients in each first reference sub-band to obtain the first number of first gain reference transformation coefficients.
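
The per-sub-band gain control of claim 18 can be sketched as below. The sketch assumes that each spectral envelope is a linear average energy and that the reference average energy is the mean of the squared coefficients in the sub-band; the claim itself only speaks of an average, so that interpretation, the function name and the zero-energy guard are assumptions.

```python
import math
from typing import List, Sequence

def gain_control(ref_coeffs: Sequence[float],
                 envelopes: Sequence[float]) -> List[float]:
    """Scale each reference sub-band so its average energy matches its envelope."""
    n_bands = len(envelopes)
    band_len = len(ref_coeffs) // n_bands  # the "second number" of coefficients
    out: List[float] = []
    for b, target_energy in enumerate(envelopes):
        band = ref_coeffs[b * band_len:(b + 1) * band_len]
        ref_energy = sum(c * c for c in band) / band_len  # assumed: mean of squares
        # Scaling factor = sqrt(target average energy / reference average energy).
        scale = math.sqrt(target_energy / ref_energy) if ref_energy > 0 else 0.0
        out.extend(c * scale for c in band)
    return out

shaped = gain_control([1.0, 1.0, 2.0, 2.0], [4.0, 1.0])
```

The square root appears because the envelope is an energy while the scaling factor is applied to amplitudes.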

19. The method according to claim 15, characterized in that performing the third feature reconstruction process on the second low-frequency sub-band signal features to obtain the second low-frequency sub-band signal includes: performing a third convolution process on the second low-frequency sub-band signal features to obtain third convolution features; performing a second upsampling process on the third convolution features to obtain second upsampled features, wherein the second upsampling process includes multiple cascaded upsampling operations; performing a second pooling process on the second upsampled features to obtain second pooling features; and performing a fourth convolution process on the second pooling features to obtain the second low-frequency sub-band signal, wherein the second low-frequency sub-band signal is the residual signal of the first low-frequency sub-band signal.

20. The method according to claim 18, characterized in that performing the fourth feature reconstruction process based on the first high-frequency sub-band signal features and the second high-frequency sub-band signal features to obtain the second high-frequency sub-band signal includes: determining multiple first sub-bands corresponding to the first high-frequency sub-band signal features, and determining multiple second sub-bands corresponding to the second high-frequency sub-band signal features, wherein the number of first sub-bands is the same as the number of second sub-bands; determining, based on the first sub-band spectral envelopes corresponding to the multiple first sub-bands and the second sub-band spectral envelopes corresponding to the multiple second sub-bands, the third sub-band spectral envelopes corresponding to multiple third sub-bands, wherein the number of third sub-bands is the same as the number of second sub-bands; determining the multiple second sub-band spectral envelopes and the multiple third sub-band spectral envelopes as multiple fourth sub-band spectral envelopes; performing second gain control on the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal, based on the multiple fourth sub-band spectral envelopes, to obtain the first number of second gain reference transformation coefficients; and performing inverse frequency domain transformation on the first number of second gain reference transformation coefficients to obtain the second high-frequency sub-band signal.

21. The method according to claim 20, characterized in that determining the third sub-band spectral envelopes corresponding to the multiple third sub-bands, based on the first sub-band spectral envelopes corresponding to the multiple first sub-bands and the second sub-band spectral envelopes corresponding to the multiple second sub-bands, includes: generating multiple distinct second combinations, each second combination including one first sub-band and the second sub-band corresponding to that first sub-band, and performing the following processing for each second combination: determining the first sub-band spectral envelope corresponding to the first sub-band as the first average energy, and determining the second sub-band spectral envelope corresponding to the second sub-band as the second average energy; multiplying the first average energy by the second number to obtain a first multiplication result, wherein the second number is the number of high-frequency transformation coefficients included in the first sub-band; multiplying the second average energy by the third number to obtain a second multiplication result, wherein the third number is the number of high-frequency transformation coefficients included in the second sub-band, and the third number is half of the second number; subtracting the second multiplication result from the first multiplication result to obtain a subtraction result; determining the ratio of the subtraction result to the third number as the third average energy corresponding to the third sub-band; and determining the third average energy as the third sub-band spectral envelope corresponding to the third sub-band; wherein the second number of high-frequency transformation coefficients included in the first sub-band is the union of the high-frequency transformation coefficients included in the second sub-band and the high-frequency transformation coefficients included in the third sub-band.
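
The arithmetic of claim 21 can be checked with a short sketch. It reads the subtraction as the first multiplication result minus the second, which is the reading consistent with the first sub-band being the union of the second and third sub-bands: the total energy of the first sub-band, less the total energy of the second, leaves the total energy of the third. The function name and the sample coefficients are hypothetical.

```python
def third_band_energy(first_avg_energy: float,
                      second_avg_energy: float,
                      n_first: int) -> float:
    """Average energy of the third sub-band from the first and second envelopes."""
    n_second = n_first // 2                              # "third number" = half the "second number"
    total_first = first_avg_energy * n_first             # first multiplication result
    total_second = second_avg_energy * n_second          # second multiplication result
    # The third sub-band also holds n_second coefficients (n_first - n_second).
    return (total_first - total_second) / n_second

# Worked check: a first sub-band with coefficients [1, 1, 3, 3],
# whose second sub-band is [1, 1] and whose third sub-band is [3, 3].
first_env = (1 + 1 + 9 + 9) / 4   # average energy of the first sub-band
second_env = (1 + 1) / 2          # average energy of the second sub-band
third_env = third_band_energy(first_env, second_env, n_first=4)
```

Under this reading the decoder never needs the third envelope to be transmitted; it follows from the first-layer envelope and the second-layer refinement.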

22. The method according to claim 20, characterized in that performing the second gain control on the first number of reference transformation coefficients included in the first high-frequency reference sub-band signal, based on the multiple fourth sub-band spectral envelopes, to obtain the first number of second gain reference transformation coefficients includes: dividing the first number of reference transformation coefficients into multiple second reference sub-bands, wherein the number of second reference sub-bands is the same as the number of fourth sub-band spectral envelopes; generating multiple distinct third combinations, each third combination including one second reference sub-band and one fourth sub-band spectral envelope, and performing the following processing for each third combination: determining the fourth sub-band spectral envelope as the fourth average energy; calculating an average over the third number of reference transformation coefficients in the second reference sub-band to obtain the second reference average energy corresponding to the second reference sub-band, wherein the third number is the ratio of the first number to the number of second reference sub-bands; determining the ratio of the fourth average energy to the second reference average energy, and taking the square root of the ratio to obtain the second scaling factor; and multiplying the third number of reference transformation coefficients in the second reference sub-band by the second scaling factor to obtain the third number of second gain reference transformation coefficients in the second reference sub-band; and combining the third number of second gain reference transformation coefficients in each second reference sub-band to obtain the first number of second gain reference transformation coefficients.

23. The method according to claim 13, characterized in that the multiple layers include a first layer, a second layer and a third layer, the first layer corresponds to the first low-frequency sub-band signal, the second layer corresponds to the first high-frequency sub-band signal, and the third layer corresponds to the second low-frequency sub-band signal features and the second high-frequency sub-band signal features; and combining the sub-band signals corresponding to the multiple layers into an audio signal includes: summing the first low-frequency sub-band signal and the second low-frequency sub-band signal to obtain a first summation result; performing a fifth upsampling process on the first summation result at the second sampling frequency to obtain a fifth upsampling result at the first sampling frequency; performing a sixth upsampling process on the second high-frequency sub-band signal at the second sampling frequency to obtain a sixth upsampling result at the first sampling frequency; and combining the fifth upsampling result and the sixth upsampling result to obtain the audio signal.
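
The synthesis in claim 23 (sum the base and residual low bands, upsample both bands, combine) can be sketched for the simplest two-band case. The example assumes the encoder split the signal as low[m] = (x[2m] + x[2m+1]) / 2 and high[m] = (x[2m+1] - x[2m]) / 2; that split, the function names and the sample values are assumptions, and a real codec would use a matched synthesis filter bank instead.

```python
from typing import List, Sequence

def upsample_low(low: Sequence[float]) -> List[float]:
    """Double the rate of the low band: each value feeds two output samples."""
    return [v for l in low for v in (l, l)]

def upsample_high(high: Sequence[float]) -> List[float]:
    """Double the rate of the high band with alternating sign (assumed split)."""
    return [v for h in high for v in (-h, h)]

def synthesize(low1: Sequence[float],
               low2: Sequence[float],
               high2: Sequence[float]) -> List[float]:
    summed = [a + b for a, b in zip(low1, low2)]  # base low band + residual low band
    up_low = upsample_low(summed)                 # "fifth upsampling result"
    up_high = upsample_high(high2)                # "sixth upsampling result"
    return [a + b for a, b in zip(up_low, up_high)]

audio = synthesize(low1=[1.5, 3.0], low2=[0.5, 0.0], high2=[1.0, -1.0])
```

Here the second low-frequency sub-band signal acts as a residual that refines the first, which is why the two are summed before upsampling.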

24. The method according to claim 13, characterized in that determining the sub-band signal features corresponding to each level based on the code streams corresponding to the multiple levels includes: decoding the code streams corresponding to the multiple levels to obtain the index values of the code stream corresponding to each level; and performing inverse quantization on the index values of the code stream corresponding to each level to obtain the sub-band signal features corresponding to each level.

25. An audio encoding device, characterized in that the device includes: a decomposition module, used to decompose the audio signal to obtain a low-frequency sub-band signal and a high-frequency sub-band signal; a feature extraction module, used to perform multi-level feature extraction processing based on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain the sub-band signal features corresponding to the multiple levels respectively; an encoding module, used to determine the code streams corresponding to the multiple levels based on the sub-band signal features corresponding to the multiple levels respectively, wherein the code stream of the first level includes the first code stream corresponding to the low-frequency sub-band signal, the code stream of the second level includes the second code stream corresponding to the high-frequency sub-band signal, and the code stream of the third level includes the third code stream corresponding to the low-frequency sub-band signal and the fourth code stream corresponding to the high-frequency sub-band signal; and a configuration module, used to configure the corresponding transmission priorities for the code streams corresponding to the multiple levels.

26. An audio decoding device, characterized in that the device includes: a decoding module, used to determine the sub-band signal features corresponding to each of multiple layers based on the code streams corresponding to the multiple layers, wherein different layers correspond to different transmission priorities, the code stream of the first layer includes a first code stream corresponding to the low-frequency sub-band signal, the code stream of the second layer includes a second code stream corresponding to the high-frequency sub-band signal, and the code stream of the third layer includes a third code stream corresponding to the low-frequency sub-band signal and a fourth code stream corresponding to the high-frequency sub-band signal; a feature reconstruction module, used to perform feature reconstruction processing on the sub-band signal features corresponding to each layer to obtain the sub-band signal corresponding to each layer; and a synthesis module, used to synthesize the sub-band signals corresponding to the multiple layers into an audio signal.

27. An electronic device, characterized in that the electronic device includes: a memory, used to store executable instructions; and a processor that, when executing the executable instructions stored in the memory, implements the audio encoding method according to any one of claims 1 to 12 or the audio decoding method according to any one of claims 13 to 24.

28. A computer-readable storage medium, characterized in that the storage medium stores executable instructions that, when executed by a processor, implement the audio encoding method according to any one of claims 1 to 12 or the audio decoding method according to any one of claims 13 to 24.

29. A computer program product comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the audio encoding method according to any one of claims 1 to 12 or the audio decoding method according to any one of claims 13 to 24.

30. A method for processing a bitstream, characterized in that the bitstream is stored on a non-volatile computer-readable storage medium, and the bitstream is generated by the audio encoding method according to any one of claims 1 to 12 or decoded by the audio decoding method according to any one of claims 13 to 24.

31. A method for transmitting a code stream, characterized in that the code stream is generated by the audio encoding method according to any one of claims 1 to 12, and the code stream is transmitted.