Method, system, apparatus, electronic device and storage medium for multi-channel speech reconstruction
By decoding the bitstream of multiple voice signals and performing linear inverse transform processing, the problems of large data volume and high decoding complexity in asynchronous voice interaction among multiple people are solved, achieving high-quality voice signal reconstruction and smooth voice interaction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
- Filing Date
- 2023-05-09
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies require a large amount of data to be processed and high decoding complexity at the audio signal decoding end when multiple people are interacting asynchronously, which cannot meet the requirements for smooth audio interaction in special scenarios.
By acquiring multi-channel bitstream data of multiple speech signals, bitstream decoding is performed. The target fusion strategy is determined based on the comparison result between the number of paths and the preset number of fusion paths. The target number of speech features are fused and linear inverse transformation is performed to reconstruct the speech signal.
The voice signal reconstruction process has been optimized, improving audio quality, reducing data processing volume and decoding complexity, and enhancing the smoothness of multi-person synchronous voice interaction.
Figure CN116682440B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of Internet technology, and in particular to a multi-channel speech reconstruction method, a multi-channel speech reconstruction system, a multi-channel speech reconstruction device, an electronic device, a storage medium, and a computer program product.
Background Technology
[0002] With the popularization of multi-terminal device interconnection technology and the advancement of high-speed network transmission technology, voice data plays an increasingly important role in scenarios such as online teaching and audio-visual conferencing, specifically in the transmission and application of voice data between user terminals and servers, and between user terminals.
[0003] Currently, the usual practice is for the sending end to encode and compress the voice data and send the compressed data to the receiving end, where the receiving end decodes it to recover the voice data.
[0004] However, in existing methods for encoding and decoding multi-channel voice signals, when faced with asynchronous voice interaction among multiple users, the decoding end of the voice signal often has a large amount of data to process and high decoding complexity, which places high demands on the performance of the decoding end. This cannot meet the needs of users to perform smooth synchronous voice interaction among multiple users in special scenarios (such as when the number of voice signal input paths is large or the communication network is weak).
Summary of the Invention
[0005] This disclosure provides a multi-channel speech reconstruction method, a multi-channel speech reconstruction system, a multi-channel speech reconstruction device, an electronic device, a storage medium, and a computer program product, to at least solve the problems of large data volume and high decoding complexity in related technologies during voice interaction. The technical solution of this disclosure is as follows:
[0006] According to a first aspect of the present disclosure, a multi-channel speech reconstruction method is provided, comprising:
[0007] Acquire multi-channel bitstream data corresponding to multiple audio signals; the bitstream data is an encoded bitstream after encoding the audio features of the audio signals, and the audio features are signal features extracted from the corresponding audio signals after linear transformation processing;
[0008] The multi-stream data is decoded to obtain the speech features corresponding to each of the multi-stream data.
[0009] Based on the comparison between the number of paths corresponding to the multi-stream data and the preset number of fusion paths, the target fusion strategy is determined.
[0010] Based on the target fusion strategy, a target number of speech features are fused to obtain fused speech features; the target number is the number of paths whose speech features (those of at least some of the paths) are fused according to the target fusion strategy.
[0011] The fused speech features are subjected to inverse linear transform processing to obtain the reconstructed speech signal corresponding to the multi-channel speech signal.
[0012] In an exemplary embodiment, performing an inverse linear transform on the fused speech features to obtain the reconstructed speech signal corresponding to the multi-channel speech signals includes:
[0013] The fused speech features are subjected to time-frequency decoding to obtain corresponding time-frequency domain data; the time-frequency domain data is used to characterize the time-domain and frequency-domain properties of the fused speech features.
[0014] The time-frequency domain data is subjected to inverse time-frequency transform processing to obtain the speech signal with fused speech features, and the speech signal with fused speech features is used as the reconstructed speech signal corresponding to the multi-channel speech signal.
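As a rough illustration of the two steps above, the following sketch treats the time-frequency domain data as one complex DFT spectrum per frame and uses a plain inverse DFT as the inverse time-frequency transform. The names `time_frequency_decode`, `inverse_dft`, and `reconstruct`, the identity decoder, and the non-overlapping framing are all assumptions made for illustration; the disclosure does not fix the decoder or the transform.

```python
import cmath
from typing import List

Spectrum = List[complex]


def dft(frame: List[float]) -> Spectrum:
    """Forward DFT, used here only to build example input."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum: Spectrum) -> List[float]:
    """Inverse DFT; the real part is returned as time-domain samples."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def time_frequency_decode(fused_features: List[Spectrum]) -> List[Spectrum]:
    """Placeholder (assumption): the fused features already are per-frame spectra."""
    return fused_features


def reconstruct(fused_features: List[Spectrum]) -> List[float]:
    """Decode to time-frequency data, invert each frame, and concatenate the frames."""
    samples: List[float] = []
    for spectrum in time_frequency_decode(fused_features):
        samples.extend(inverse_dft(spectrum))
    return samples
```

In a real decoder the placeholder would be replaced by whatever time-frequency decoder matches the encoder, and overlapping frames would need windowing and overlap-add.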
[0015] In one exemplary embodiment, the speech features of each of the speech signals are expressed based on corresponding feature vectors; the bitstream data is an encoded bitstream obtained based on a vector coding strategy or a scalar coding strategy.
[0016] The step of decoding the multi-stream data to obtain the speech features corresponding to each of the multi-stream data includes:
[0017] Based on the decoding strategy corresponding to the encoding strategy of each channel of bitstream data, each channel of the multi-stream data is decoded separately to obtain its decoding feature vector, and the decoding feature vectors are used as the speech features corresponding to each of the multi-stream data.
[0018] The decoding strategy corresponding to the encoding strategy of the bitstream data includes a vector decoding strategy corresponding to the vector encoding strategy or a scalar decoding strategy corresponding to the scalar encoding strategy.
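The pairing of decoding strategy with encoding strategy can be sketched as a simple dispatch. This is a minimal sketch under stated assumptions: the stream layout (a dict with `strategy` and `payload` keys), the shared codebook, and the uniform step size are hypothetical, and classical codebook lookup and dequantization stand in for whatever decoders an implementation actually uses.

```python
from typing import Dict, List, Sequence


def vector_decode(indices: Sequence[int], codebook: Sequence[Sequence[float]]) -> List[float]:
    """Vector decoding: each index selects one codeword; codewords are concatenated."""
    features: List[float] = []
    for index in indices:
        features.extend(codebook[index])
    return features


def scalar_decode(levels: Sequence[int], step: float) -> List[float]:
    """Scalar decoding: uniform dequantization of each component."""
    return [level * step for level in levels]


def decode_stream(stream: Dict, codebook: Sequence[Sequence[float]], step: float) -> List[float]:
    """Pick the decoder that matches the strategy the encoder used for this stream."""
    if stream["strategy"] == "vector":
        return vector_decode(stream["payload"], codebook)
    if stream["strategy"] == "scalar":
        return scalar_decode(stream["payload"], step)
    raise ValueError(f"unknown encoding strategy: {stream['strategy']!r}")
```

Each stream carries its own strategy tag here, so streams encoded with different strategies can be decoded side by side before fusion.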
[0019] In an exemplary embodiment, determining the target fusion strategy based on the comparison result between the number of paths corresponding to the multi-stream data and the preset number of fusion paths includes:
[0020] If the comparison result indicates that the number of paths corresponding to the multi-stream data is less than or equal to the preset number of fusion paths, a preset first fusion strategy is determined as the target fusion strategy; the first fusion strategy is used to instruct the fusion of the decoding feature vectors corresponding to each of the multi-stream data to obtain a fused feature vector; or
[0021] If the comparison result shows that the number of paths corresponding to the multi-stream data is greater than the preset number of fusion paths, the preset second fusion strategy is determined as the target fusion strategy. The second fusion strategy is used to instruct, based on the sorting result among the multiple energy values corresponding to the speech signals of the multi-stream data, the filtering of the bitstream data of at least some paths from the paths corresponding to the multi-stream data, and the fusion of the decoding feature vectors corresponding to the filtered bitstream data to obtain a fused feature vector.
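The comparison described in the two branches above reduces to a single threshold test. The sketch below is illustrative only; the enum and function names are hypothetical.

```python
from enum import Enum


class FusionStrategy(Enum):
    FIRST = "fuse_all"            # fuse the feature vectors of every stream
    SECOND = "select_by_energy"   # filter streams by energy, then fuse


def choose_fusion_strategy(num_paths: int, preset_fusion_paths: int) -> FusionStrategy:
    """Return the target fusion strategy for the given number of incoming paths."""
    if num_paths < 1 or preset_fusion_paths < 1:
        raise ValueError("path counts must be positive")
    if num_paths <= preset_fusion_paths:
        return FusionStrategy.FIRST
    return FusionStrategy.SECOND
```

For example, with a preset of 4 fusion paths, 3 or 4 incoming streams select the first strategy and 5 incoming streams select the second.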
[0022] In an exemplary embodiment, fusing the decoding feature vectors corresponding to the multiple bitstream data to obtain a fused feature vector includes:
[0023] If the comparison result indicates that the number of paths in the multi-stream data is equal to the preset number of fusion paths, the decoding feature vectors corresponding to each of the multi-stream data are fused to obtain a fused feature vector; or
[0024] If the comparison result shows that the number of paths of the multi-stream data is less than the preset number of fusion paths, the decoding feature vectors corresponding to each of the multi-stream data are fused with the zero vector of the first number of paths to obtain a fused feature vector.
[0025] Wherein, the first path number is the difference between the preset fusion path number and the path number corresponding to the multi-stream data, and the zero vector has the same dimension as the decoding feature vector.
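The zero-vector padding can be sketched as follows, on the assumption that the fusion stage expects exactly the preset number of inputs. The `fuse` operator shown (element-wise sum) is a hypothetical stand-in, since the disclosure does not specify the fusion operation; with a learned fusion module the padded slots would simply carry no signal.

```python
from typing import List, Sequence

Vector = List[float]


def pad_to_preset(vectors: Sequence[Vector], preset_fusion_paths: int) -> List[Vector]:
    """Return exactly `preset_fusion_paths` vectors, appending zero vectors if short."""
    if not vectors:
        raise ValueError("at least one decoding feature vector is required")
    if len(vectors) > preset_fusion_paths:
        raise ValueError("more streams than fusion paths; the second strategy applies")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all decoding feature vectors must share one dimension")
    # First path number: preset number of fusion paths minus the actual number of paths.
    first_path_number = preset_fusion_paths - len(vectors)
    return [list(v) for v in vectors] + [[0.0] * dim for _ in range(first_path_number)]


def fuse(vectors: Sequence[Vector]) -> Vector:
    """Hypothetical fusion operator (assumption): element-wise sum."""
    return [sum(column) for column in zip(*vectors)]
```

When the number of paths equals the preset number, `first_path_number` is zero and the vectors are fused unchanged, which covers the first branch as a special case.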
[0026] In an exemplary embodiment, the sorting result based on the multiple energy values corresponding to the speech signals of the multi-stream data, the filtering of at least a portion of the path data from the number of paths corresponding to the multi-stream data, and the fusion of the decoding feature vectors corresponding to each of the filtered at least a portion of the path data to obtain a fused feature vector, includes:
[0027] The multi-stream data are sorted from largest to smallest according to the energy value corresponding to each stream data to obtain the sorting result; the energy value corresponding to each stream data is the energy value of the speech signal of the stream data.
[0028] The bitstream data ranked in the top positions of the sorting result, up to a first preset number, are determined as the filtered bitstream data; the first preset number is equal to the preset number of fusion paths;
[0029] The fused feature vector is obtained by fusing the decoding feature vectors corresponding to the first preset number of filtered bitstream data.
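The energy-based filtering in this embodiment can be sketched as a descending sort followed by truncation. The sketch assumes energy is the sum of squared samples and that each stream's energy value is available to the decoding end (for instance as side information); neither detail is stated in the disclosure, and the function names are hypothetical.

```python
from typing import List, Sequence


def signal_energy(samples: Sequence[float]) -> float:
    """Energy of one speech signal, taken here as the sum of squared samples."""
    return sum(s * s for s in samples)


def select_by_energy(energies: Sequence[float], preset_fusion_paths: int) -> List[int]:
    """Return the indices of the streams kept for fusion, highest energy first."""
    if preset_fusion_paths < 1:
        raise ValueError("preset_fusion_paths must be positive")
    # Sort stream indices from largest to smallest energy, then keep the top entries.
    ranked = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return ranked[:preset_fusion_paths]
```

The returned indices identify which decoding feature vectors are passed on to fusion; the remaining streams are dropped in this embodiment.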
[0030] In an exemplary embodiment, the sorting result based on the multiple energy values corresponding to the speech signals of the multi-stream data, the filtering of at least a portion of the path data from the number of paths corresponding to the multi-stream data, and the fusion of the decoding feature vectors corresponding to each of the filtered at least a portion of the path data to obtain a fused feature vector, includes:
[0031] The multi-stream data are sorted from largest to smallest according to the energy value corresponding to each stream data to obtain the sorting result; the energy value corresponding to each stream data is the energy value of the speech signal of the stream data.
[0032] The bitstream data ranked in the top positions of the sorting result, up to a second preset number, are determined as the filtered bitstream data; the second preset number is less than the preset number of fusion paths;
[0033] In the multi-stream data, the decoding feature vectors corresponding to the remaining stream data are fused to obtain a subclass fused feature vector; the remaining stream data are the stream data that do not belong to the filtered stream data among the multi-stream data.
[0034] The fused feature vector is obtained by fusing the decoding feature vectors corresponding to each of the filtered bitstream data together with the subclass fused feature vector.
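This second variant keeps the loudest streams individually and folds the remainder into one extra vector. The sketch below builds the list of inputs handed to the final fusion stage; `fuse` (element-wise sum) is again a hypothetical stand-in for the unspecified fusion operation, and the function and parameter names are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def fuse(vectors: Sequence[Vector]) -> Vector:
    """Hypothetical fusion operator (assumption): element-wise sum."""
    return [sum(column) for column in zip(*vectors)]


def build_fusion_inputs(
    vectors: Sequence[Vector],
    energies: Sequence[float],
    second_preset_number: int,
    preset_fusion_paths: int,
) -> List[Vector]:
    """Keep the top streams by energy and append one subclass vector for the rest."""
    if len(vectors) != len(energies):
        raise ValueError("one energy value is required per stream")
    if not 0 < second_preset_number < preset_fusion_paths:
        raise ValueError("second preset number must be below the preset fusion paths")
    if len(vectors) <= preset_fusion_paths:
        raise ValueError("this strategy applies only when streams exceed the preset")
    ranked = sorted(range(len(vectors)), key=lambda i: energies[i], reverse=True)
    kept = ranked[:second_preset_number]
    remaining = ranked[second_preset_number:]
    # The remaining streams are fused first into a single subclass fused feature vector.
    subclass_vector = fuse([vectors[i] for i in remaining])
    return [list(vectors[i]) for i in kept] + [subclass_vector]
```

Because the second preset number is below the preset number of fusion paths, the returned list never exceeds the preset number of inputs, while the quieter streams still contribute through the subclass vector.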
[0035] According to a second aspect of the present disclosure, a multi-channel speech reconstruction system is provided, comprising a plurality of encoding end devices and a decoding end device communicatively connected to each of the plurality of encoding end devices, wherein:
[0036] Each of the encoding end devices is configured to perform the following: acquire one speech signal; perform linear transformation processing on the speech signal to obtain a corresponding linearly transformed speech signal; extract speech features from the linearly transformed speech signal and encode the speech features to obtain a corresponding bitstream data.
[0037] The decoding end device is configured to: acquire multi-stream data from the plurality of encoding end devices; perform stream decoding on the multi-stream data to obtain speech features corresponding to each of the multi-stream data; determine a target fusion strategy based on a comparison between the number of paths corresponding to the multi-stream data and a preset number of fusion paths; fuse a target number of speech features based on the target fusion strategy to obtain fused speech features, where the target number is the number of paths whose speech features (those of at least some of the paths) are fused according to the target fusion strategy; and perform inverse linear transform on the fused speech features to obtain reconstructed speech signals corresponding to the multi-stream speech signals.
[0038] According to a third aspect of the present disclosure, a multi-channel speech reconstruction apparatus is provided, comprising:
[0039] The data acquisition unit is configured to acquire multi-channel bitstream data corresponding to multiple audio signals; the bitstream data is an encoded bitstream after encoding the audio features of the audio signals, and the audio features are signal features extracted from the corresponding audio signals after linear transformation.
[0040] The stream decoding unit is configured to perform stream decoding processing on the multi-stream data to obtain the speech features corresponding to each of the multi-stream data.
[0041] The strategy determination unit is configured to determine the target fusion strategy by comparing the number of paths corresponding to the multi-stream data with the preset number of fusion paths.
[0042] The feature fusion unit is configured to fuse a target number of speech features based on the target fusion strategy to obtain fused speech features; the target number is the number of paths whose speech features (those of at least some of the paths) are fused according to the target fusion strategy.
[0043] The signal reconstruction unit is configured to perform linear inverse transform processing on the fused speech features to obtain the reconstructed speech signal corresponding to the multi-channel speech signal.
[0044] According to a fourth aspect of the present disclosure, an electronic device is provided, comprising:
[0045] processor;
[0046] Memory for storing the executable instructions of the processor;
[0047] The processor is configured to execute the executable instructions to implement the multi-channel speech reconstruction method as described in any of the preceding claims.
[0048] According to a fifth aspect of the present disclosure, a computer-readable storage medium is provided, the computer-readable storage medium including a computer program that, when executed by a processor of an electronic device, enables the electronic device to perform the multi-channel speech reconstruction method as described in any of the preceding claims.
[0049] According to a sixth aspect of the present disclosure, a computer program product is provided, the computer program product including program instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the multi-channel speech reconstruction method as described in any of the preceding claims.
[0050] The technical solutions provided by the embodiments of this disclosure bring at least the following beneficial effects:
[0051] The method first acquires multi-stream data corresponding to multiple speech signals; where the stream data is the encoded stream after encoding the speech features of the speech signals, and the speech features are the signal features extracted from the corresponding speech signals after linear transformation processing; then, the multi-stream data is decoded to obtain the speech features corresponding to each of the multi-stream data; then, a target fusion strategy is determined based on the comparison between the number of paths corresponding to the multi-stream data and the preset number of fusion paths; then, based on the target fusion strategy, a target number of speech features are fused to obtain fused speech features; where the target number is the number of paths corresponding to the speech features of at least some paths fused according to the target fusion strategy; finally, the fused speech features are subjected to inverse linear transformation processing to obtain the reconstructed speech signals corresponding to the multiple speech signals. In this way, on the one hand, by using the speech signal after linear transformation to encode and decode speech features, and by using inverse linear transformation to reconstruct the fused speech features into a speech signal, data packet loss and data distortion can be effectively avoided during the encoding, decoding and reconstruction of the speech signal, thereby optimizing the speech signal reconstruction process and improving the audio quality of the reconstructed speech signal. 
On the other hand, by referring to the relationship between the number of paths of the multi-stream data and the preset number of fusion paths, the speech features of at least some paths are fused according to the corresponding fusion strategy and the speech signal is reconstructed from the fused speech features, thereby reducing the amount of data to be processed, lowering the complexity of the decoding process during voice interaction, and improving the fluency of multi-person synchronous voice interaction.
[0052] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0053] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.
[0054] Figure 1 is an application environment diagram illustrating a multi-channel speech reconstruction method according to an exemplary embodiment.
[0055] Figure 2 is a flowchart illustrating a multi-channel speech reconstruction method according to an exemplary embodiment.
[0056] Figure 3 is an interface diagram illustrating a step of fusing at least a portion of the number of path data streams according to an exemplary embodiment.
[0057] Figure 4 is a flowchart illustrating another step of fusing at least a portion of the number of path stream data, according to an exemplary embodiment.
[0058] Figure 5 is a flowchart illustrating a step of reconstructing a speech signal according to an exemplary embodiment.
[0059] Figure 6 is a flowchart illustrating a multi-channel speech reconstruction method according to another exemplary embodiment.
[0060] Figure 7 is a block diagram illustrating a multi-channel speech reconstruction method according to an exemplary embodiment.
[0061] Figure 8 is a block diagram of a multi-channel speech reconstruction system according to an exemplary embodiment.
[0062] Figure 9 is a block diagram of a multi-channel speech reconstruction apparatus according to another exemplary embodiment.
[0063] Figure 10 is a block diagram illustrating an electronic device for multi-channel voice reconstruction according to an exemplary embodiment.
[0064] Figure 11 is a block diagram illustrating a computer-readable storage medium for multi-channel speech reconstruction according to an exemplary embodiment.
[0065] Figure 12 is a block diagram illustrating a computer program product for multi-channel speech reconstruction according to an exemplary embodiment.
Detailed Implementation
[0066] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0067] The term "and / or" in the embodiments of this application refers to any and all possible combinations including one or more of the associated listed items. It should also be noted that, when used in this specification, "including / comprising" specifies the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, and / or components and / or groups thereof.
[0068] The terms "first," "second," etc., used in this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.
[0069] Furthermore, although the terms "first," "second," etc., are used repeatedly in this application to describe various operations (or various components, or various applications, or various instructions, or various data), these operations (or components, or applications, or instructions, or data) should not be limited by these terms. These terms are only used to distinguish one operation (or component, or application, or instruction, or data) from another operation (or component, or application, or instruction, or data). For example, a first voice signal can be called a second voice signal, and a second voice signal can be called a first voice signal; the only difference is the scope they encompass, but it does not depart from the scope of this application. Both the first voice signal and the second voice signal are collections of voice signals acquired by corresponding acquisition devices, only they are not acquired by the same acquisition device.
[0070] The multi-channel speech reconstruction method provided in this application can be applied to, for example, the application environment shown in Figure 1, in which the peer terminal 102a and the local terminal 102b communicate with the server 104 via a communication network. The data storage system can store the processed data in the server 104, or send the processed data stored in the server 104 to the terminal 102. The data storage system can be integrated on the server 104, or it can be placed in the cloud or on other network servers.
[0071] In some embodiments, referring to Figure 1, the local terminal 102b first acquires multi-channel bitstream data corresponding to multiple audio signals. The bitstream data is the encoded bitstream obtained by the peer terminal 102a after encoding the audio features of the audio signals. The audio features are signal features extracted by the peer terminal 102a from the corresponding audio signals after linear transformation. Then, the local terminal 102b performs bitstream decoding on the multi-channel bitstream data to obtain the audio features corresponding to each of the multiple bitstream data. Based on the comparison between the number of paths corresponding to the multi-channel bitstream data and the preset number of fusion paths, a target fusion strategy is determined. Based on the target fusion strategy, a target number of audio features are fused to obtain fused audio features. The target number is the number of paths corresponding to at least some of the audio features fused according to the target fusion strategy. Finally, the local terminal 102b performs an inverse linear transform on the fused audio features to obtain the reconstructed audio signals corresponding to the multiple audio signals.
[0072] In some embodiments, the peer terminal 102a and / or the local terminal 102b (such as a mobile terminal or a fixed terminal) can be implemented in various forms. The peer terminal 102a and the local terminal 102b can be mobile terminals, including mobile phones, smartphones, laptops, portable handheld devices, personal digital assistants (PDAs), tablet computers (PADs), etc., which can encode corresponding multi-channel bitstream data based on the voice characteristics of multiple voice signals, decode the multi-channel bitstream data, and fuse the decoded voice features into a corresponding reconstructed voice signal. Similarly, the peer terminal 102a and / or the local terminal 102b can be fixed terminals, including automated teller machines (ATMs), automated kiosks, digital TVs, desktop computers, fixed-line computers, etc., which can encode corresponding multi-channel bitstream data based on the voice characteristics of multiple voice signals, decode the multi-channel bitstream data, and fuse the decoded voice features into a corresponding reconstructed voice signal.
[0073] Hereinafter, it is assumed that the peer terminal 102a and / or the local terminal 102b are fixed terminals. However, those skilled in the art will understand that, if there are operations or elements specifically designed for mobile purposes, the construction according to the embodiments disclosed in this application can also be applied to mobile peer terminals 102a and / or local terminals 102b.
[0074] In some embodiments, the data processing components running on the peer terminal 102a and / or the local terminal 102b may load any of the various additional server applications and / or middleware applications that are being executed, such as HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), CGI (Common Gateway Interface), RDBMS (Relational Database Management System), etc.
[0075] In some embodiments, the peer terminal 102a and / or the local terminal 102b may be implemented using a separate data processor or a data processing cluster consisting of multiple data processors. The peer terminal 102a and / or the local terminal 102b may be adapted to run one or more application services or software components that provide the various additional server applications and / or middleware applications described in the foregoing disclosure.
[0076] In some embodiments, the application service may include a service interface for providing users with speech signal fusion (e.g., an operation interface for users to select multiple acquisition paths of speech signals to be fused, or a display interface for playing multiple reconstructed speech signals to users), and corresponding program services, etc. The software components may include, for example, an application (SDK) or client (APP) with functions for encoding and decoding coded bitstreams of speech features, and for fusing and reconstructing speech features.
[0077] In some embodiments, the application or client provided by the peer terminal 102a and / or the local terminal 102b, which has the functions of encoding and decoding the encoded bitstream of speech features and fusing and reconstructing speech features, includes a portal port that provides one-to-one application services to users in the foreground and multiple business systems that perform data processing in the background, so as to extend the application of speech encoding and decoding, speech feature fusion and reconstruction functions to the APP or client, so that users can use and access the speech encoding and decoding, speech feature fusion and reconstruction functions anytime and anywhere.
[0078] In some embodiments, the resource transfer function of an APP or client can be a computer program running in user mode to complete one or more specific tasks, which can interact with the user and has a visual user interface. The APP or client can include two parts: a graphical user interface (GUI) and an engine, which together provide users with a variety of application services in the form of a user interface in a digital client system.
[0079] In some embodiments, the user can input corresponding code data or control parameters to the APP or client through the input device in the peer terminal 102a and / or the local terminal 102b, so as to execute the application service of the computer program in the server 104 and display the application service in the user interface.
[0080] As an example, when a user needs to play a single audio signal reconstructed from multiple audio signals via the local terminal 102b of an electronic device, the user can input service information for obtaining multiple audio signals to the server 104 through the input device in the local terminal 102b. After the server 104 obtains the multi-stream data corresponding to the multiple audio signals generated by multiple peer terminals 102a based on the service information input by the user, the local terminal 102b obtains the multi-stream data corresponding to the multiple audio signals based on a preset network communication protocol, performs computer processing of fusion and reconstruction, and displays the finally processed reconstructed audio signal in the player of the local terminal 102b, so as to display the single reconstructed audio signal corresponding to the multiple audio signals to the user in real time. Optionally, the input method corresponding to the input device can be touch screen input, button input, voice input, or related control program input, etc.
[0081] In some embodiments, the operating system running on the APP or client may include various versions of Microsoft Windows®, Apple Macintosh® and / or Linux operating systems, various commercial or UNIX-like® operating systems (including but not limited to various GNU / Linux operating systems, Google Chrome® OS, etc.) and / or mobile operating systems such as iOS®, Windows® Phone, Android® OS, BlackBerry® OS, Palm® OS, and other online or offline operating systems, without specific limitations herein.
[0082] In some embodiments, as shown in Figure 2, a multi-channel speech reconstruction method is provided. Taking the application of the method to the peer terminal 102a and / or the local terminal 102b in Figure 1 as an example, the method includes the following steps:
[0083] Step S11: Obtain the multi-channel bitstream data corresponding to the multi-channel voice signals.
[0084] In one embodiment, the bitstream data is an encoded bitstream after encoding the speech features of the speech signal, and the speech features are signal features extracted from the corresponding speech signal after linear transformation.
[0085] In some embodiments, the voice signal is a signal from a voice source that is collected in real time by a voice acquisition device mounted on the peer terminal (i.e., the encoding terminal).
[0086] There can be one or more peer terminals, and each peer terminal is equipped with a voice acquisition device for acquiring a voice source signal.
[0087] For example, if the local terminal acquires X channels of bitstream data corresponding to X voice signals, this indicates that there are X peer terminals whose X corresponding voice acquisition devices have acquired signals from X voice sources, where X ≥ 1.
[0088] In some embodiments, after the peer terminal acquires the signal data of the corresponding multiple voice sources, it first performs linear transformation processing on the multiple signal data, then extracts the voice features of each channel from the linearly transformed multiple signal data, and finally performs feature encoding processing on the voice features of each channel to generate multi-channel code stream data corresponding to the multiple voice signals.
[0089] Linear transformation refers to a linear mapping from a linear space to itself, so that the transformed output data preserves the linear relationships (addition and scalar multiplication) of the original input data.
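The linearity property can be checked numerically. The sketch below uses an unnormalized DCT-II purely as an example of a linear transform and verifies that transforming a weighted sum of two signals gives the same result as the weighted sum of their transforms. The function names and the choice of DCT-II are illustrative assumptions.

```python
import math
from typing import List


def dct2(x: List[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def is_linear_on(x: List[float], y: List[float], a: float, b: float, tol: float = 1e-9) -> bool:
    """Check T(a*x + b*y) == a*T(x) + b*T(y) for the transform T = dct2."""
    combined_then_transformed = dct2([a * xi + b * yi for xi, yi in zip(x, y)])
    transformed_then_combined = [a * xk + b * yk for xk, yk in zip(dct2(x), dct2(y))]
    return all(
        abs(p - q) <= tol
        for p, q in zip(combined_then_transformed, transformed_then_combined)
    )
```

This property is what makes it plausible to combine several streams in the transformed domain and apply a single inverse transform afterwards, rather than inverting every stream separately.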
[0090] In some embodiments, the method by which the peer terminal performs linear transformation processing on the speech signal may include Fast Fourier Transform, (modified) Discrete Cosine Transform, Wavelet Transform, etc., without specific limitations.
[0091] In some embodiments, the peer terminal uses a pre-trained neural network model to extract features from each speech signal to obtain the speech features of each speech signal. The pre-trained neural network model may include a Gated Recurrent Unit (GRU) neural network based on an improved RNN, a residual network based on an improved CNN (such as ResNet, Wide ResNet-34), etc., and is not specifically limited here.
[0092] In some embodiments, the speech features of a speech signal include long-term speech features and short-term speech features of the speech data. Short-term speech features include low-level speech features such as timbre, pitch, and rhythm information of the speech signal expressed in a single speech frame. Long-term speech features include human voice content features of the speech signal expressed in multiple consecutive speech frames. Human voice content features belong to the speaker's information and are high-level speech features that contain more information such as feature context after processing the low-level speech features such as timbre, pitch, and rhythm information of the speech signal.
[0093] In some embodiments, high-level speech features are high-dimensional hidden layer features. High-level audio features may include, for example, Mel spectrograms of audio shock waves of a speech signal to reflect the rhythmic intensity of the speech signal, or MFCC features of its harmonics to reflect the timbre changes of the speech signal, or constant Q transform features of audio harmonics of a speech signal to reflect the pitch changes.
[0094] In some embodiments, the peer terminal can select the corresponding feature coding network according to the feature coding form, bit rate requirements, sound quality requirements, etc., and perform feature coding on the speech features of each speech signal to generate multi-channel bitstream data corresponding to multiple speech signals.
[0095] In some embodiments, the feature coding network prepared in advance by the peer terminal includes vector coding networks, scalar coding networks, etc.
[0096] Vector coding networks are used to perform vector transformations on feature vectors to encode them into feature bitstreams; scalar coding networks are used to perform scalar transformations on feature vectors to encode them into feature bitstreams.
[0097] In some embodiments, the local terminal establishes a communication network (communication protocol) connection with the remote terminal in advance. When the local terminal responds to the start of execution of this voice encoding and decoding method, the local terminal receives the multi-channel code stream data corresponding to the multi-channel voice signals sent by the remote terminal through the connected communication network (communication protocol).
[0098] In some embodiments, the communication network (communication protocol) supported between the local terminal and the peer terminal may include Ethernet, Bluetooth, Zigbee, Z-Wave, or the Smart Home Working Group standard (i.e., Project Connected Home over IP) that supports IPv6 networks, VoIP protocol (Voice over Internet Protocol, voice transmission based on Internet Protocol), etc.
[0099] Step S12: Perform code stream decoding on the multi-stream data to obtain the speech features corresponding to each of the multi-stream data.
[0100] In some embodiments, the local terminal uses a pre-prepared feature decoding network to perform bitstream decoding processing on the received multi-stream data to obtain the speech features corresponding to each bitstream data after decoding. The feature decoding network includes vector decoding networks corresponding to vector coding networks, scalar decoding networks corresponding to scalar coding networks, etc.
[0101] Before decoding each stream of data, the local terminal needs to identify the encoding method corresponding to each feature stream, and then select the decoding method of the feature decoding network that matches the encoding method. That is, if the encoding method corresponding to the feature stream is vector encoding, then the feature decoding network needs to be the corresponding vector decoding network, and the vector decoding method corresponding to the vector decoding network is applied to decode the feature stream; or, if the encoding method corresponding to the feature stream is scalar encoding, then the feature decoding network needs to be the corresponding scalar decoding network, and the scalar decoding method corresponding to the scalar decoding network is applied to decode the feature stream.
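As a non-limiting illustration of the decoder selection described above, the following Python sketch dispatches each stream to a decoder that matches its encoding method. The names `vector_decode`, `scalar_decode`, `DECODERS`, and `decode_stream`, the toy codebook, and the byte-level payload layout are all assumptions made for illustration; the disclosure does not specify the decoding networks at this level of detail.

```python
from typing import Callable, Dict, List


def vector_decode(payload: bytes) -> List[float]:
    # Hypothetical stand-in for a vector decoding network:
    # each byte is an index into a small table of 2-D codewords.
    codebook = [[0.0, 0.0], [1.0, 0.5], [-1.0, 0.25]]
    features: List[float] = []
    for index in payload:
        features.extend(codebook[index])
    return features


def scalar_decode(payload: bytes) -> List[float]:
    # Hypothetical stand-in for a scalar decoding network:
    # each byte is one feature dimension, uniformly quantized to [-1, 1].
    return [byte / 127.5 - 1.0 for byte in payload]


# Encoding method identified for a stream -> matching decoder.
DECODERS: Dict[str, Callable[[bytes], List[float]]] = {
    "vector": vector_decode,
    "scalar": scalar_decode,
}


def decode_stream(encoding: str, payload: bytes) -> List[float]:
    """Decode one stream with the decoder that matches its encoding method."""
    try:
        decoder = DECODERS[encoding]
    except KeyError:
        raise ValueError(f"unknown encoding method: {encoding!r}") from None
    return decoder(payload)
```

Keeping the mapping in one table means a stream whose encoding method is not recognized is rejected explicitly instead of being decoded with a mismatched network.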
[0102] Step S13: Determine the target fusion strategy based on the comparison results between the number of paths corresponding to the multi-stream data and the preset number of fusion paths.
[0103] In some embodiments, the local terminal determines the target fusion strategy from among a plurality of preset fusion strategies based on the comparison between the number of paths corresponding to the multi-stream data and the number of preset fusion paths.
[0104] In some embodiments, the preset number of fusion paths is a limit for voice signal fusion set by the development engineer on the local terminal. That is, only a maximum of the preset number of fusion paths can be fused at a time on the local terminal.
[0105] In some embodiments, the fusion strategy is used to instruct the local terminal to fuse the speech features of at least a portion of the paths in the multi-stream data. The paths and number of stream data to be fused are related to the corresponding fusion strategy.
[0106] Step S14: Based on the target fusion strategy, fuse the speech features of the target number to obtain fused speech features.
[0107] The target number is the number of paths corresponding to the fusion of speech features corresponding to at least some paths according to the target fusion strategy.
[0108] For example, suppose the comparison between the number of paths corresponding to the multi-stream data and the preset number of fusion paths yields result X, and that, among the preset fusion strategies, the target fusion strategy corresponding to result X is strategy Y. Strategy Y instructs the local terminal to fuse the speech features of P streams from the multi-stream data. Here, the P streams are the specific P streams of bitstream data that strategy Y designates within the multi-stream data, and P is the target number corresponding to strategy Y.
[0109] In a specific implementation scenario, the local terminal acquires the decoded speech features of M speech signals corresponding to the corresponding code streams. The preset number of fusion paths set by the local terminal is L. If M ≤ L, the local terminal uses a preset feature fusion network to fuse the decoded speech features of the M speech signals corresponding to the corresponding code streams based on the corresponding target fusion strategy, resulting in a fused single mixed speech feature. If M > L, the local terminal first selects L speech features (i.e., the target number corresponding to the target fusion strategy) from the decoded speech features of the M corresponding code streams based on the corresponding target fusion strategy, and then fuses the selected L speech features based on the preset feature fusion network, resulting in a fused single mixed speech feature.
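The comparison in the scenario above can be sketched as follows. This is a minimal illustration only: the function name `choose_fusion_strategy` and the labels `"first"` and `"second"` are hypothetical, and the sketch returns just the strategy label and the target number of speech features to fuse.

```python
from typing import Tuple


def choose_fusion_strategy(num_streams: int, preset_fusion_paths: int) -> Tuple[str, int]:
    """Return (strategy label, target number) for M streams and a preset limit L."""
    if num_streams < 1 or preset_fusion_paths < 1:
        raise ValueError("both counts must be at least 1")
    if num_streams <= preset_fusion_paths:
        # M <= L: all M decoded speech features are fused.
        return "first", num_streams
    # M > L: only L of the M decoded speech features are selected and fused.
    return "second", preset_fusion_paths
```

For instance, with M = 3 and L = 4 the sketch returns `("first", 3)`, and with M = 10 and L = 4 it returns `("second", 4)`.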
[0110] Step S15: Perform linear inverse transform processing based on the fused speech features to obtain the reconstructed speech signals corresponding to the multiple speech signals.
[0111] In some embodiments, the local terminal first performs linear decoding on one of the fused speech features to obtain the speech signal of that speech feature, and then performs inverse linear transform on the linearly decoded speech signal to obtain a reconstructed speech signal that fuses multiple speech signals. This reconstructed speech signal is the output speech signal played by the local terminal to the user, and this output speech signal contains speech features corresponding to the fused multiple speech signals.
[0112] In some embodiments, the inverse linear transformation of the linearly decoded speech signal performed by the local terminal needs to match the linear transformation method of the multiple speech signals to be fused by the remote terminal.
[0113] For example, if the peer terminal performs Fast Fourier Transform (FFT) processing on the multiple speech signals to be fused, then the local terminal performs Inverse Fast Fourier Transform (IFFT) processing on the fused and linearly decoded speech signals; or, if the peer terminal performs (improved) Discrete Cosine Transform (DCT) processing on the multiple speech signals to be fused, then the local terminal performs (improved) Inverse Discrete Cosine Transform (IDCT) processing on the fused and linearly decoded speech signals; or, if the peer terminal performs Wavelet Transform (WT) processing on the multiple speech signals to be fused, then the local terminal performs Inverse Wavelet Transform (IWT) processing on the fused and linearly decoded speech signals.
[0114] The speech encoding and decoding method provided in this embodiment can be applied to any scenario of processing speech data.
[0115] For example, scenarios involving the transmission of voice data include voice calls, video calls, multi-person voice conferences, and multi-person video conferences. In this embodiment, the terminal may include a local terminal and a peer terminal, each of which may include one or more terminal devices.
[0116] In some embodiments, both the local terminal and the remote terminal run an application client provided by the server. The application client stores various speech coding models (including various convolutional neural network models, deep learning neural network models, and feature coding network models) and speech decoding models (including various convolutional neural network models, deep learning neural network models, and feature decoding network models) trained by the server. The application client has the function of voice call.
[0117] During a multi-person synchronous voice call, the remote terminal uses an application client to invoke a voice coding model to encode the collected voice data and then sends the encoded voice features to the local terminal. The local terminal uses an application client to invoke a voice decoding model to decode and fuse the received voice features, obtaining the final voice information output to the user, thereby realizing the transmission and reconstruction of voice data between the remote and local terminals.
[0118] In the aforementioned speech encoding and decoding process, the terminal first acquires multi-stream data corresponding to multiple speech signals; wherein, the stream data is the encoded stream after encoding the speech features of the speech signals, and the speech features are the signal features extracted from the corresponding speech signals after linear transformation processing; then, the multi-stream data is decoded to obtain the speech features corresponding to each of the multi-stream data; then, based on the comparison between the number of paths corresponding to the multi-stream data and the preset number of fusion paths, a target fusion strategy is determined; then, based on the target fusion strategy, a target number of speech features are fused to obtain fused speech features; wherein, the target number is the number of paths corresponding to the speech features of at least some paths fused according to the target fusion strategy; finally, the fused speech features are subjected to inverse linear transformation processing to obtain the reconstructed speech signals corresponding to the multiple speech signals. In this way, on the one hand, by using the speech signal after linear transformation to encode and decode speech features, and by using inverse linear transformation to reconstruct the fused speech features into a speech signal, data packet loss and data distortion can be effectively avoided during the encoding, decoding and reconstruction of the speech signal, thereby optimizing the speech signal reconstruction process and improving the audio quality of the reconstructed speech signal. 
On the other hand, by referring to the relationship between the number of paths corresponding to the multi-stream data and the preset number of fusion paths, the speech features of at least some paths are fused according to the corresponding fusion strategy, and the speech signal is reconstructed based on the fused speech features, thereby reducing the amount of data to be processed and lowering the complexity of the decoding process during voice interaction, and improving the fluency of multi-person synchronous voice interaction.
[0119] Those skilled in the art will understand that the methods disclosed in the above-described specific embodiments can be implemented in more specific ways. For example, the implementation described above, which fuses the speech features of at least a portion of the decoded bitstream data corresponding to multiple bitstream data based on the relationship between the number of paths corresponding to the multi-path bitstream data and the preset number of fusion paths, is merely illustrative.
[0120] For example, the terminal extracts speech features from the signal obtained after linear transformation of the speech signal; or performs inverse linear transformation processing based on fused speech features, etc. These are just one set of methods. In actual implementation, there may be other transformation methods. For example, the original multi-channel speech signals and the reconstructed speech signals can be combined or integrated into another system, or some features can be ignored or not executed.
[0121] In one embodiment, the speech features of each speech signal correspond to form a feature vector.
[0122] In some embodiments, the peer terminal reduces the dimensionality of speech features by quantizing, or discretizing, the speech features extracted from each speech signal. That is, before the peer terminal encodes the speech features of multiple speech signals, it needs to compress the speech features to reduce the bit rate.
[0123] For example, the speech features of the second path's speech signal are H = {h1, h2, ..., hT}, which include multiple speech feature vectors. By quantizing the speech features, multiple adjacent speech feature vectors are quantized into the same feature vector, ultimately obtaining the quantized speech feature Q = {q1, q2, ..., qT}. Discretizing the speech features before feature encoding achieves low-bitrate encoding.
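The quantization described above, in which several nearby feature vectors map to the same quantized vector, can be sketched as a nearest-codeword lookup. This is a minimal illustration under stated assumptions: the codebook contents, the squared Euclidean distance, and the names `quantize` and `dequantize` are hypothetical and are not taken from the disclosure.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector h_t to the index of its nearest codeword."""
    indices = []
    for h in features:
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(h, codebook[i]))
        indices.append(nearest)
    return indices


def dequantize(indices: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Recover the quantized features Q = {q_1, ..., q_T} from the indices."""
    return [list(codebook[i]) for i in indices]
```

Only the indices need to be transmitted, which is what lowers the bit rate; feature vectors that fall near the same codeword become identical after quantization.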
[0124] In one embodiment, the bitstream data is an encoded bitstream obtained based on a vector encoding strategy or a scalar encoding strategy.
[0125] In some embodiments, the development engineer configures the encoding strategy for the bitstream data based on the encoding format, bit rate requirements, and sound quality requirements of each speech feature. When the peer terminal encodes the speech features, it uses a feature encoding network corresponding to the encoding strategy to perform the encoding of the speech features.
[0126] Among them, feature encoding networks include VQ (Vector Quantization) networks, SQ (Scalar Quantization) networks, etc.
[0127] In some embodiments, the VQ network is used to perform vector transformation on the feature vectors of speech features to encode them into a feature bitstream; the SQ network is used to perform scalar transformation on the feature vectors of speech features to encode them into a feature bitstream.
[0128] The advantages of VQ networks are high compression ratio, low bit rate requirement, and applicability to scenarios with large data volumes. However, their disadvantages are that individual feature vectors cannot be encoded independently, requiring commonalities between them. Furthermore, due to the high compression ratio of VQ networks, in high-concurrency speech data processing scenarios, a significant amount of speech features may be lost, resulting in lower sound quality.
[0129] SQ networks have a low compression rate and a high bit rate requirement, and are therefore suited to scenarios with small data volumes. Their advantage is that each dimension of a feature vector is treated as a single number that is represented with multiple bits and encoded independently, so the speech features are not required to share commonalities.
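For comparison with vector quantization, the following sketch shows scalar quantization in which every dimension of a feature vector is encoded independently. The uniform quantizer, the value range [-1, 1], the 8-bit default, and the names `sq_encode` and `sq_decode` are assumptions made for illustration only.

```python
from typing import List, Sequence


def sq_encode(vector: Sequence[float], bits: int = 8,
              lo: float = -1.0, hi: float = 1.0) -> List[int]:
    """Uniformly quantize each dimension to an integer code of `bits` bits."""
    levels = (1 << bits) - 1
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)  # values outside [lo, hi] saturate
        codes.append(round((clipped - lo) / (hi - lo) * levels))
    return codes


def sq_decode(codes: Sequence[int], bits: int = 8,
              lo: float = -1.0, hi: float = 1.0) -> List[float]:
    """Map integer codes back to approximate feature values."""
    levels = (1 << bits) - 1
    return [lo + code / levels * (hi - lo) for code in codes]
```

In this sketch a D-dimensional vector costs D × `bits` bits, which illustrates why scalar coding needs a higher bit rate than sending a single codebook index per vector.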
[0130] In one embodiment, the process of the terminal performing code stream decoding on multiple code stream data to obtain the speech features corresponding to each of the multiple code stream data may specifically include: performing code stream decoding on each code stream data according to the decoding strategy corresponding to the encoding strategy of each code stream data to obtain the decoding feature vector of the multiple code stream data, and using the decoding feature vector of the multiple code stream data as the speech features corresponding to each of the multiple code stream data.
[0131] In some embodiments, after configuring the encoding strategy for the bitstream data, the development engineer also needs to configure a decoding strategy that matches the encoding strategy, so that when the local terminal decodes the bitstream data, it can use the feature decoding network corresponding to the decoding strategy to perform the decoding processing of the bitstream data.
[0132] In some embodiments, the feature decoding network includes a vector decoding network corresponding to the VQ network, a scalar decoding network corresponding to the SQ network, etc.
[0133] Before decoding each stream of data, the local terminal needs to identify the encoding method corresponding to each feature stream, and then select the decoding method of the feature decoding network that matches the encoding method. That is, if the encoding method corresponding to the feature stream is vector encoding, then the feature decoding network needs to be the corresponding vector decoding network, and the vector decoding method corresponding to the vector decoding network is applied to decode the feature stream; or, if the encoding method corresponding to the feature stream is scalar encoding, then the feature decoding network needs to be the corresponding scalar decoding network, and the scalar decoding method corresponding to the scalar decoding network is applied to decode the feature stream.
[0134] In one embodiment, the terminal determines a target fusion strategy based on a comparison between the number of paths corresponding to the multi-stream data and the preset number of fusion paths, including: if the comparison result shows that the number of paths corresponding to the multi-stream data is less than or equal to the preset number of fusion paths, the terminal determines a preset first fusion strategy as the target fusion strategy.
[0135] The first fusion strategy is used to indicate the decoding feature vectors corresponding to the fusion of multiple bitstream data to obtain the fused feature vector.
[0136] Specifically, if the number of paths corresponding to the multi-stream data acquired by the local terminal is less than or equal to the preset number of fusion paths, the first fusion strategy is used to instruct the local terminal to fuse the decoding feature vectors corresponding to the multi-stream data into a single feature vector.
[0137] In the first fusion method, the local terminal fuses the decoding feature vectors corresponding to each of the multiple bitstream data to obtain a fused feature vector, including: when the comparison result shows that the number of paths of the multiple bitstream data is equal to the preset number of fusion paths, fusing the decoding feature vectors corresponding to each of the multiple bitstream data to obtain a fused feature vector.
[0138] In the second fusion method, the local terminal fuses the decoding feature vectors corresponding to each of the multiple bitstream data to obtain a fused feature vector, including: when the comparison result shows that the number of paths of the multiple bitstream data is less than the preset number of fusion paths, fusing the decoding feature vectors corresponding to each of the multiple bitstream data with the zero vector of the first number of paths to obtain a fused feature vector.
[0139] The first path number is the difference between the preset fusion path number and the path number corresponding to the multi-stream data, and the zero vector has the same dimension as the decoding feature vector.
[0140] As an example, if the number of paths corresponding to the multi-stream data acquired by the local terminal is M, the preset number of fusion paths is L, and M is less than L, then the local terminal generates zero vectors for L − M paths and pads the decoding feature vectors corresponding to the acquired multi-stream data with these L − M zero vectors. The local terminal then merges the decoding feature vectors corresponding to the multi-stream data and the L − M padded zero vectors into a single feature vector.
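The zero-vector padding in this example can be sketched as follows, with M decoded feature vectors padded up to the preset number of fusion paths L. The function names are hypothetical, and the element-wise sum in `fuse_by_sum` is only an assumed stand-in for the preset feature fusion network, whose structure the disclosure does not specify.

```python
from typing import List, Sequence


def pad_with_zero_vectors(decoded: Sequence[Sequence[float]],
                          preset_fusion_paths: int) -> List[List[float]]:
    """Pad M decoded feature vectors with zero vectors up to L paths."""
    m = len(decoded)
    if m == 0 or m > preset_fusion_paths:
        raise ValueError("expected 1 <= M <= L")
    dim = len(decoded[0])
    if any(len(v) != dim for v in decoded):
        raise ValueError("all feature vectors must share one dimension")
    # Zero vectors have the same dimension as the decoding feature vectors.
    padding = [[0.0] * dim for _ in range(preset_fusion_paths - m)]
    return [list(v) for v in decoded] + padding


def fuse_by_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    # Assumed stand-in for the feature fusion network: element-wise sum.
    return [sum(column) for column in zip(*vectors)]
```

Padding gives the fusion stage a fixed number of L inputs regardless of M; with the summation stand-in used here, the zero vectors leave the fused result unchanged.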
[0141] In another embodiment, the terminal determines the target fusion strategy based on the comparison result between the number of paths corresponding to the multi-stream data and the preset number of fusion paths, including: if the comparison result shows that the number of paths corresponding to the multi-stream data is greater than the preset number of fusion paths, the terminal determines the preset second fusion strategy as the target fusion strategy.
[0142] The second fusion strategy instructs the terminal to select at least some of the streams of the multi-stream data based on the ranking of the energy values of the speech signals corresponding to the multi-stream data, and to fuse the decoding feature vectors corresponding to the selected streams to obtain a fused feature vector.
[0143] In the third fusion method, see Figure 3, which is a flowchart illustrating an embodiment of fusing at least a portion of the path data in this application. In step S14, the terminal determines the target fusion strategy based on a comparison between the number of paths corresponding to the multi-path data and the preset number of fusion paths. This process can be implemented in the following way:
[0144] Step S141: Sort the multi-stream data according to the energy value of each stream data from largest to smallest to obtain the sorting result.
[0145] Among them, the energy value corresponding to each bitstream data is the energy value of the voice signal of the bitstream data.
[0146] In some embodiments, after the peer terminal acquires the collected multiple voice signals, the peer terminal calculates the energy value of each voice signal based on signal parameters such as power and signal period. Then, the peer terminal sends voice information (such as the bitstream data of each voice signal) containing the energy values of each voice signal to the local terminal. The local terminal then sorts the multiple bitstream data according to the energy values of each voice signal corresponding to each bitstream data from largest to smallest, and obtains the sorting result.
[0147] Step S142: Select the first preset number of bitstream data from the sorting results as the filtered bitstream data.
[0148] In some embodiments, the first preset number is equal to the preset number of fusion paths.
[0149] As an example, if the number of paths corresponding to the multi-stream data acquired by the local terminal is M, the preset number of fusion paths is L, and M is greater than L, then the local terminal sorts the energy values of the M voice signals to obtain the corresponding sorting results, and then determines the first L stream data in the sorting results as the filtered stream data.
[0150] Step S143: Merge the decoding feature vectors corresponding to the first preset number of filtered bitstream data to obtain a fused feature vector.
[0151] In some embodiments, the local terminal merges the decoding feature vectors corresponding to the first preset number of selected filter bitstream data into a single feature vector to obtain a fused feature vector.
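Steps S141 to S143 can be sketched as follows. The sum-of-squares energy in `signal_energy` is an assumption (the disclosure only says the energy is computed from signal parameters such as power and signal period, and that the peer terminal sends it to the local terminal), and the function names are hypothetical.

```python
from typing import List, Sequence


def signal_energy(samples: Sequence[float]) -> float:
    # Assumed energy measure: sum of squared sample values.
    return sum(x * x for x in samples)


def select_top_streams(energies: Sequence[float], count: int) -> List[int]:
    """Return the indices of the `count` highest-energy streams, highest first."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # sorted() is stable, so streams with equal energy keep their original order.
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return order[:count]
```

The decoding feature vectors at the returned indices are the ones passed on to fusion; with M streams and a preset limit L, `count` would be L.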
[0152] In the fourth fusion method, see Figure 4, which is a flowchart illustrating another embodiment of the fusion of at least a portion of the path data in this application. In step S14, the terminal determines the target fusion strategy based on a comparison between the number of paths corresponding to the multi-path data and the preset number of fusion paths. This process can be implemented in the following way:
[0153] Step S144: Sort the multi-stream data according to the energy value of each stream from largest to smallest to obtain the sorting result.
[0154] Step S144 is similar to step S141 in the above embodiment, and will not be described again here.
[0155] Step S145: Determine the first second-preset-number of bitstream data in the sorting result (that is, the top entries, up to the second preset number) as the filtered bitstream data.
[0156] In some embodiments, the second preset number is less than the preset number of fusion paths. For example, the second preset number is equal to the preset number of fusion paths minus one.
[0157] As an example, if the number of paths corresponding to the multi-stream data acquired by the local terminal is M, the preset number of fusion paths is L, and M is greater than L, then the local terminal sorts the energy values of the M voice signals to obtain the corresponding sorting results, and then determines the first L-1 stream data in the sorting results as the filtered stream data.
[0158] Step S146: Fuse the decoding feature vectors corresponding to the remaining bitstream data in the multi-stream data to obtain the subclass fused feature vector.
[0159] In one embodiment, the remaining bitstream data are bitstream data that do not belong to the filtered bitstream data among the multiple bitstream data.
[0160] In some embodiments, the local terminal fuses the bitstream data that does not belong to the filtered bitstream data from the multi-bitstream data into a single feature vector, and uses this fused feature vector as the subclass fused feature vector corresponding to the multi-bitstream data.
[0161] Step S147: Merge the decoding feature vectors and subclass fusion feature vectors corresponding to the filtered bitstream data to obtain the fusion feature vector.
[0162] In some embodiments, the local terminal merges the decoding feature vectors corresponding to the selected second preset number of filtered bitstream data with the subclass fusion feature vectors of the multiple bitstream data into a single feature vector.
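The fourth fusion method can be sketched as follows: the L − 1 highest-energy streams are kept as they are, the remaining streams are first fused into one subclass vector, and the resulting L vectors form the input to the final fusion. The function names are hypothetical, and the element-wise sum is only an assumed stand-in for the subclass fusion, which the disclosure does not specify.

```python
from typing import List, Sequence


def fuse_by_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    # Assumed stand-in for the subclass fusion: element-wise sum.
    return [sum(column) for column in zip(*vectors)]


def build_fusion_inputs(decoded: Sequence[Sequence[float]],
                        energies: Sequence[float],
                        preset_fusion_paths: int) -> List[List[float]]:
    """Return L vectors: the top L-1 streams plus one subclass fused vector."""
    if len(decoded) != len(energies):
        raise ValueError("one energy value is required per stream")
    if preset_fusion_paths < 1 or len(decoded) <= preset_fusion_paths:
        raise ValueError("this method applies only when M > L >= 1")
    order = sorted(range(len(decoded)), key=lambda i: energies[i], reverse=True)
    keep = preset_fusion_paths - 1           # second preset number, L - 1
    top, rest = order[:keep], order[keep:]   # filtered vs. remaining streams
    subclass = fuse_by_sum([decoded[i] for i in rest])
    return [list(decoded[i]) for i in top] + [subclass]
```

Compared with keeping only the top L streams, this variant does not discard the low-energy streams outright; they still contribute through the single subclass vector.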
[0163] In one embodiment, the speech features of the speech signal are extracted based on the first time-frequency domain data obtained after time-frequency transformation of the speech signal.
[0164] In some embodiments, after acquiring the collected voice signal, the peer terminal performs time-frequency transformation processing on the voice signal to form a transformed time-frequency domain signal. The peer terminal then extracts voice features from the time-frequency domain signal based on a preset feature extraction network.
[0165] Among them, time-frequency transformation processing methods include Fast Fourier Transform, (improved) Discrete Cosine Transform, Wavelet Transform, etc.
[0166] In some embodiments, before performing feature extraction on the transformed time-frequency domain signal, the peer terminal can first slice the time-frequency domain signal according to the time sequence corresponding to the speech signal to obtain multiple speech slice data, and then extract speech features from the time-frequency domain signal based on a preset feature extraction network.
[0167] In an exemplary embodiment, see Figure 5, which is a schematic flowchart of an embodiment of reconstructing a speech signal in this application. In step S15, the terminal performs an inverse linear transform on the fused speech features to obtain the reconstructed speech signal corresponding to the multiple speech signals. This process can be implemented in the following way:
[0168] Step S151: Perform time-frequency decoding on the fused speech features to obtain the corresponding time-frequency domain data after time-frequency decoding.
[0169] In some embodiments, the time-frequency domain data is used to characterize the time-domain and frequency-domain properties of the fused speech features.
[0170] In some embodiments, the time-frequency domain data is a time-frequency domain signal corresponding to the fused speech feature (i.e., a speech signal generated by fusion). The time-frequency domain signal can characterize the time-domain and frequency-domain properties of each fused speech feature.
[0171] Step S152: Perform inverse time-frequency transform on the time-frequency domain data to obtain a speech signal with fused speech features, and use the speech signal with fused speech features as the reconstructed speech signal corresponding to the multi-channel speech signal.
[0172] In some embodiments, the local terminal's inverse time-frequency transformation processing of time-frequency domain data needs to match the time-frequency transformation processing method of the peer terminal for the multiple voice signals to be fused.
[0173] For example, if the peer terminal performs Fast Fourier Transform (FFT) processing on the multiple speech signals to be fused, then the local terminal performs Inverse Fast Fourier Transform (IFFT) processing on the fused and time-frequency decoded time-frequency domain data; or, if the peer terminal performs (improved) Discrete Cosine Transform (DCT) processing on the multiple speech signals to be fused, then the local terminal performs (improved) Inverse Discrete Cosine Transform (IDCT) processing on the fused and time-frequency decoded time-frequency domain data; or, if the peer terminal performs Wavelet Transform (WT) processing on the multiple speech signals to be fused, then the local terminal performs Inverse Wavelet Transform (IWT) processing on the fused and time-frequency decoded time-frequency domain data.
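The requirement that the inverse transform match the forward transform can be sketched with a small lookup of transform pairs. This is an illustration under stated assumptions: a direct O(n²) discrete Fourier transform stands in for the FFT, a plain DCT-II with its scaled DCT-III inverse stands in for the (improved) DCT, the wavelet case is omitted, and the names `TRANSFORM_PAIRS` and `inverse_for` are hypothetical.

```python
import cmath
import math
from typing import Callable, Dict, List, Sequence, Tuple


def dft(frame: Sequence[float]) -> List[complex]:
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def dct(frame: Sequence[float]) -> List[float]:
    # Unnormalized DCT-II.
    n = len(frame)
    return [sum(frame[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
            for k in range(n)]


def idct(coeffs: Sequence[float]) -> List[float]:
    # Inverse of the unnormalized DCT-II (a DCT-III scaled by 2/n).
    n = len(coeffs)
    return [(coeffs[0] / 2 + sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5) * k)
                                 for k in range(1, n))) * 2 / n
            for t in range(n)]


# Forward transform used by the peer terminal -> (forward, matching inverse).
TRANSFORM_PAIRS: Dict[str, Tuple[Callable, Callable]] = {
    "fft": (dft, idft),
    "dct": (dct, idct),
}


def inverse_for(forward_name: str) -> Callable:
    """Return the inverse transform matching the peer terminal's forward transform."""
    return TRANSFORM_PAIRS[forward_name][1]
```

Applying `inverse_for(name)` to the output of the corresponding forward transform recovers the original frame up to floating-point error; pairing a forward transform with a different inverse would not.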
[0174] To more clearly illustrate the multi-channel speech reconstruction method provided in the embodiments of this disclosure, the following specific embodiment describes the method in detail. In an exemplary embodiment, reference is made to Figure 6 and Figure 7. Figure 6 is a flowchart illustrating a multi-channel speech reconstruction method according to another exemplary embodiment. Figure 7 is a block diagram illustrating a multi-channel speech reconstruction method according to another exemplary embodiment. The multi-channel speech reconstruction method is used in a peer terminal 102a and / or a local terminal 102b, and specifically includes the following:
[0175] Step S21: In a multi-person call scenario, the sending end acquires M voice signals collected by the voice acquisition device.
[0176] The sending end can be multiple terminal devices, such as multiple smartphones, tablets, etc. Each terminal device is equipped with a voice acquisition device (including a camera, recorder, etc.).
[0177] As an example, in a scenario of 10 people making an online call, there are 10 smartphones that use their built-in voice acquisition devices to collect the voice signals of their respective voice paths in real time, thus obtaining 10 voice signals.
[0178] Step S22: The transmitting end performs time-frequency transformation on the M-channel voice signals to obtain the transformed M-channel time-frequency domain signals.
[0179] The time-frequency transformation process involves the following steps: Each transmitter, based on a preset first time-frequency transformation network, first decomposes the corresponding acquired signal into multiple signal frames, and then performs time-frequency transformation on each signal frame to obtain the transformed time-frequency domain signals of multiple signal frames, denoted as TF(n).
[0180] Among them, the methods for time-frequency transformation using time-frequency transformation networks include Fast Fourier Transform, (modified) Discrete Cosine Transform, Wavelet Transform, etc.
[0181] Among them, the time-frequency transformation methods are all linear transformations. Because a linear transformation preserves superposition and scaling, operations such as adding or scaling speech features in the transform domain correspond to the same operations on the speech signals themselves, which is beneficial to subsequent multi-channel signal fusion and single-channel decoding.
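The linearity property can be checked numerically. The sketch below uses an unnormalized DCT-II as one example of a linear transform and verifies that transforming a sum of two signals gives the sum of their transforms, and that scaling commutes with the transform. The signals and the function name are illustrative assumptions.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a real sequence (a linear transform)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]

a = [1.0, 2.0, 0.0, -1.0]
b = [0.5, -1.0, 3.0, 2.0]

# Superposition: T(a + b) == T(a) + T(b)
mixed = [x + y for x, y in zip(a, b)]
sum_err = max(abs(p - (q + r)) for p, q, r in zip(dct2(mixed), dct2(a), dct2(b)))

# Scaling: T(3 * a) == 3 * T(a)
scaled = [3.0 * x for x in a]
scale_err = max(abs(p - 3.0 * q) for p, q in zip(dct2(scaled), dct2(a)))

print(sum_err < 1e-9, scale_err < 1e-9)
```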
[0182] Step S23: The transmitting end inputs the time-frequency domain signals of the M channels into the deep learning network for feature extraction to obtain the speech feature vectors of the M channels.
[0183] Among them, the deep learning network is a pre-set feature extraction network for each sending end, and the network structure of the feature extraction network can be a combination of CNN, GRU, etc.
[0184] The deep learning network performs deep feature extraction on the time-frequency domain signal of each signal frame to obtain highly compressed speech features, denoted as TFfeature(n), which is a feature vector.
[0185] Among them, speech features are expressed in the form of feature vectors, which are used to characterize audio features such as MFCC.
[0186] The highly compressed speech feature refers to the following: each time-frequency domain signal has P1 data points (e.g., P1 = 320, meaning the speech acquisition device initially has 320 sampling points, resulting in 320 data points), and the deep learning network compresses these P1 data points into P2 data points (e.g., P2 = 60, meaning the deep learning network downsamples the time-frequency domain signal, reducing the number of channels), where P1 > P2.
[0187] Step S24: The transmitting end performs feature encoding on the M-channel speech feature vectors according to the preset encoding network to obtain the M-channel feature bitstream.
[0188] The sending end selects the corresponding feature coding network based on the form of feature coding, bit rate requirements, and audio quality requirements.
[0189] Feature coding networks include VQ (vector quantization, i.e., vector coding) networks, SQ (scalar quantization, i.e., scalar coding) networks, and others.
[0190] The VQ network is used to perform vector transformation on the feature vectors to encode them into a feature bitstream, denoted as TFQ(n); the SQ network is used to perform scalar transformation on the feature vectors to encode them into a feature bitstream, denoted as TFQ(n).
[0191] Among them, VQ has the advantages of high compression ratio, low bit rate requirement, and can be used in scenarios with large data volume; the disadvantage is that each feature cannot be encoded independently, and features must have commonalities. In addition, due to the high compression ratio of VQ, more features are lost, resulting in lower sound quality.
[0192] SQ, by contrast, has a low compression ratio and a high bit rate requirement, and is suited to scenarios with small data volume; its advantage is that each number is represented by multiple bits, so the feature of each dimension is encoded independently and the features need not have commonalities.
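The trade-off between the two coding forms can be illustrated with classical quantizers. This is only a sketch: the source describes trained VQ and SQ networks, whereas the code below uses a fixed step size and a tiny hand-written codebook, both of which are hypothetical. It shows that SQ emits one index per dimension, while VQ emits a single index for the whole vector.

```python
def sq_encode(vec, step):
    """Scalar quantization: each dimension is rounded independently."""
    return [round(v / step) for v in vec]

def sq_decode(indices, step):
    return [i * step for i in indices]

def vq_encode(vec, codebook):
    """Vector quantization: index of the nearest codeword (squared distance)."""
    def dist(codeword):
        return sum((v - w) ** 2 for v, w in zip(vec, codeword))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def vq_decode(index, codebook):
    return list(codebook[index])

feature = [0.9, -0.2, 0.4]                     # toy feature vector
codebook = [                                   # hypothetical 4-entry codebook
    [1.0, 0.0, 0.5],
    [-1.0, 0.0, 0.0],
    [0.0, 1.0, -0.5],
    [0.0, 0.0, 0.0],
]

sq_idx = sq_encode(feature, step=0.25)         # three indices, one per dimension
vq_idx = vq_encode(feature, codebook)          # one index for the whole vector

print(sq_idx, sq_decode(sq_idx, 0.25))
print(vq_idx, vq_decode(vq_idx, codebook))
```

With four codewords, the VQ index fits in 2 bits for the whole vector, whereas SQ spends bits on every dimension but reconstructs each dimension more closely.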
[0193] Step S25: The sending end sends the M feature code streams to the receiving end through a preset network protocol.
[0194] In this process, the sending end transmits TFQ(n) over the network. Note that during a multi-party call, the feature streams of multiple parties are sent to the receiving end, so the different feature streams are denoted as TFQ1(n), TFQ2(n), ..., TFQM(n).
[0195] The preset network protocol is VoIP (Voice over Internet Protocol).
[0196] Step S26: The receiving end performs feature decoding on the M-channel feature streams according to the preset decoding network to obtain the decoded speech feature vectors of the M-channels.
[0197] The receiving end can be a single terminal device, such as a smartphone or a tablet.
[0198] The receiver is equipped with a feature decoding network, which includes VQ (vector decoding) network, SQ (scalar decoding) network, etc.
[0199] The receiving end selects a matching feature decoding network based on the encoding method of the acquired M feature streams. That is, if the encoding method of the feature stream is vector encoding, the feature decoding network needs to be the corresponding vector decoding network; or, if the encoding method of the feature stream is scalar encoding, the feature decoding network needs to be the corresponding scalar decoding network.
[0200] Among them, the feature decoding network performs feature decoding on TFQ1(n), TFQ2(n), ..., TFQM(n) to obtain the decoded multi-path features, denoted as TFDQ1(n), TFDQ2(n), ..., TFDQM(n).
[0201] Step S27: The receiver inputs the decoded M-channel speech feature vectors into the deep learning network for feature fusion to obtain the fused 1-channel speech feature vector.
[0202] The deep learning network here is a feature fusion network, and the network structure of the feature fusion network can be a combination of CNN, GRU, etc.
[0203] The deep learning network adds the M speech feature vectors together to obtain a fused 1-channel hybrid speech feature vector.
[0204] That is, the deep learning network fuses the M-path speech feature vectors (i.e., TFDQ1(n), TFDQ2(n), ..., TFDQM(n)) into a single mixed speech feature equivalent to multiple audio streams, denoted as MergedTFDQ(n).
[0205] The receiver has a preset maximum number of fusion paths L, meaning that the receiver can only fuse L speech feature vectors at a time.
[0206] In one approach, when M ≤ L, the receiver fuses the M speech feature vectors directly if M = L; if M < L, it pads the L - M unused input channels with all-zero features before fusion.
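The zero-padding case can be sketched as below. The fusion network itself is a trained model in the source; the element-wise sum used here is only a stand-in (an assumption chosen so that all-zero inputs have no effect), and the function names are hypothetical.

```python
def fuse(features):
    """Stand-in for the fusion network: element-wise sum of the feature vectors."""
    return [sum(column) for column in zip(*features)]

def pad_and_fuse(features, max_paths):
    """Pad M <= L decoded feature vectors with zero vectors up to L inputs, then fuse."""
    if len(features) > max_paths:
        raise ValueError("more paths than the preset maximum; use a filtering strategy")
    dim = len(features[0])
    zero_vectors = [[0.0] * dim for _ in range(max_paths - len(features))]
    padded = list(features) + zero_vectors
    return fuse(padded), len(padded)

decoded = [[1.0, 2.0], [0.5, -1.0]]            # M = 2 toy feature vectors
merged, num_inputs = pad_and_fuse(decoded, max_paths=4)
print(merged, num_inputs)
```

The fusion stage always sees exactly L inputs, which lets a fixed-size network handle any M up to L.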
[0207] In another approach, when M > L, the receiver sorts the M signals by energy in descending order, fuses the feature vectors of the first L signals, and discards the rest. For example, a recommended value for L is 10. Generally, it is difficult for a listener at the receiving end to distinguish the content when more than 10 voice signals occur simultaneously. Therefore, decoding more than 10 signal streams simultaneously is of little use in practical applications.
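A minimal sketch of this energy-based selection follows. How the receiver obtains each path's energy is not detailed here, so the energies are passed in as given values; the element-wise sum again stands in for the fusion network, and all names are hypothetical.

```python
def fuse(features):
    """Stand-in for the fusion network: element-wise sum of the feature vectors."""
    return [sum(column) for column in zip(*features)]

def select_top_paths(features, energies, max_paths):
    """Keep the max_paths highest-energy paths; return their features and indices."""
    order = sorted(range(len(features)), key=lambda i: energies[i], reverse=True)
    kept = order[:max_paths]
    return [features[i] for i in kept], kept

decoded = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # M = 3 toy feature vectors
energies = [0.1, 0.9, 0.5]                       # assumed per-path energies
selected, kept_indices = select_top_paths(decoded, energies, max_paths=2)
merged = fuse(selected)
print(kept_indices, merged)
```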
[0208] In yet another approach, when M > L, the receiver first sorts the M signals by energy in descending order and inputs the speech feature vectors of the L-th to M-th signals into a deep learning network for feature fusion, obtaining an intermediate fused speech feature vector, referred to as the (M+1)-th signal. Then, it inputs the speech feature vectors of the 1st to (L-1)-th signals together with that of the (M+1)-th signal into a deep learning network for feature fusion, obtaining the final fused one-channel speech feature vector.
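The two-stage variant can be sketched as follows, under the same assumptions as before: the energies are given, and an element-wise sum stands in for the trained fusion network (with a plain sum the two stages give the same result as summing everything, so the sketch only illustrates the grouping, not the benefit of a learned fusion). Names are hypothetical.

```python
def fuse(features):
    """Stand-in for the fusion network: element-wise sum of the feature vectors."""
    return [sum(column) for column in zip(*features)]

def two_stage_fuse(features, energies, max_paths):
    """Fuse ranks L..M first, then fuse ranks 1..L-1 with that intermediate result."""
    order = sorted(range(len(features)), key=lambda i: energies[i], reverse=True)
    ranked = [features[i] for i in order]
    head = ranked[:max_paths - 1]        # ranks 1 .. L-1, kept as individual inputs
    tail = ranked[max_paths - 1:]        # ranks L .. M, merged in the first stage
    intermediate = fuse(tail)            # the "(M+1)-th" feature vector
    second_stage_inputs = head + [intermediate]
    return fuse(second_stage_inputs), len(second_stage_inputs)

decoded = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]   # M = 4
energies = [0.1, 0.9, 0.5, 0.7]                              # assumed energies
merged, num_inputs = two_stage_fuse(decoded, energies, max_paths=3)
print(merged, num_inputs)
```

Unlike the discard-based approach, no path is dropped, and the second stage still receives exactly L inputs.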
[0209] Step S28: The receiver inputs the fused one-channel speech feature vector into the time-frequency speech decoding network for time-frequency transformation to obtain the transformed one-channel time-frequency domain signal.
[0210] The receiver is equipped with a time-frequency speech decoding network. The time-frequency speech decoding network performs time-frequency transformation on the one-channel speech feature vector MergedTFDQ(n) to obtain the transformed one-channel time-frequency domain signal denoted as TF'(n). The TF'(n) signal is similar to the TF(n) signal.
[0211] Step S29: The receiving end performs inverse time-frequency transformation on the transformed 1-channel time-frequency domain signal to obtain the inverse-transformed 1-channel voice signal.
[0212] The steps of the inverse time-frequency transformation are as follows: the receiving end, based on the preset inverse time-frequency transformation network, first combines the one-channel time-frequency domain signals of the multiple signal frames into a single one-channel time-frequency domain signal, and then performs an inverse time-frequency transformation on it to obtain the inverse-transformed one-channel speech signal.
[0213] Among them, the one-channel speech signal after inverse transformation is the final time-domain speech signal output to the user.
[0214] Specifically, the inverse time-frequency transform network needs to be matched with the time-frequency transform network. That is, if the time-frequency transform network is used to perform a Fast Fourier Transform (FFT) on the speech signal, then the inverse time-frequency transform network is used to perform an Inverse Fast Fourier Transform (IFFT) on the one-channel time-frequency domain signal; or, if the time-frequency transform network is used to perform a (modified) Discrete Cosine Transform (DCT) on the speech signal, then the inverse time-frequency transform network is used to perform the corresponding inverse (modified) Discrete Cosine Transform (IDCT) on the one-channel time-frequency domain signal; or, if the time-frequency transform network is used to perform a wavelet transform on the speech signal, then the inverse time-frequency transform network is used to perform an inverse wavelet transform on the one-channel time-frequency domain signal.
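The matching requirement can be illustrated by keeping each forward transform paired with its inverse and looking the pair up by name. The sketch below is an assumption-laden illustration: it covers only a naive DFT (what an FFT computes) and an unnormalized DCT-II with its exact inverse, omits the wavelet case, and inverts frame by frame before concatenating. The registry and function names are hypothetical.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def dct2(x):
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def idct2(coef):
    """Exact inverse of the unnormalized DCT-II above (a scaled DCT-III)."""
    n = len(coef)
    return [(coef[0] + 2.0 * sum(coef[k] * math.cos(math.pi * (t + 0.5) * k / n)
                                 for k in range(1, n))) / n
            for t in range(n)]

# Each forward transform is stored with its matching inverse.
TRANSFORM_PAIRS = {"fft": (dft, idft), "dct": (dct2, idct2)}

def reconstruct(tf_frames, transform_name):
    """Apply the inverse that matches the sender's transform and join the frames."""
    _, inverse = TRANSFORM_PAIRS[transform_name]
    signal = []
    for frame in tf_frames:
        signal.extend(inverse(frame))
    return signal

original = [0.0, 1.0, 0.5, -0.5, 0.25, -1.0, 0.75, 0.0]
frames = [original[:4], original[4:]]
errors = {}
for name, (forward, _) in TRANSFORM_PAIRS.items():
    restored = reconstruct([forward(f) for f in frames], name)
    errors[name] = max(abs(r - o) for r, o in zip(restored, original))
print(errors)
```

Using a mismatched inverse (for example, `idct2` on DFT coefficients) would not recover the signal, which is why the receiver must know which transform the sender applied.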
[0215] Among them, the time-frequency transform network, feature extraction network, feature encoding network, feature decoding network, feature fusion network, time-frequency speech decoding network, and time-frequency inverse transform network are all composed of corresponding network models.
[0216] The loss function for each network model can be: (improved) scale-invariant signal-to-noise ratio, spectral distance, temporal waveform distance, etc.
[0217] The loss function for each network model is used to describe the similarity between its input and output data.
[0218] For example, the input data of the time-frequency conversion module is the acquired speech signal, and the output data is the transformed time-frequency domain signal. The loss function is the time-domain waveform distance. If the time-domain waveform distance is larger, the similarity between the input data and the output data is smaller; if the time-domain waveform distance is smaller, the similarity between the input data and the output data is larger.
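Two of the similarity measures named above can be sketched as follows. The scale-invariant signal-to-noise ratio follows its commonly used definition (mean removal is omitted for brevity); the source does not define the "time-domain waveform distance", so the mean absolute difference used here is an assumption. During training, a higher SI-SNR or a lower distance indicates greater similarity.

```python
import math

def si_snr(reference, estimate):
    """Scale-invariant SNR in dB between a reference and an estimated signal."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to get the "target" component.
    target = [dot / ref_energy * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10(target_energy / noise_energy)

def waveform_distance(reference, estimate):
    """Assumed time-domain waveform distance: mean absolute sample difference."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)

reference = [1.0, -0.5, 0.25, 0.0]
estimate = [0.9, -0.4, 0.3, 0.1]
louder = [2.0 * e for e in estimate]   # same waveform shape, different gain

print(si_snr(reference, estimate), si_snr(reference, louder))
print(waveform_distance(reference, estimate), waveform_distance(reference, louder))
```

The SI-SNR is unchanged by the gain difference, while the waveform distance grows, which shows why the choice of loss affects what the network is penalized for.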
[0219] In the process of training each network model, the loss function of each network model is tested and an adversarial network is introduced to optimize each model. That is, a qualified model (the output similarity meets the requirements) is trained by using the loss function and the adversarial network.
[0220] The above scheme, on the one hand, encodes and decodes speech features extracted from the linearly transformed speech signal, and uses inverse linear transformation to reconstruct the fused speech features into a speech signal. This effectively avoids data packet loss and data distortion during the speech signal encoding, decoding, and reconstruction process, thereby optimizing the speech signal reconstruction process and improving the audio quality of the reconstructed speech signal. On the other hand, based on the comparison between the number of paths and the preset number of fusion paths, the speech features of at least some paths are fused according to the corresponding fusion strategy, and the speech signal is reconstructed from the fused speech features. This reduces the amount of data to be processed and lowers the complexity of the decoding process during voice interaction, improving the fluency of multi-person synchronous voice interaction.
[0221] It should be understood that, although the steps in the flowcharts of Figures 2-7 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figures 2-7 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but may be executed at different times. Their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0222] It is understood that the same / similar parts between the various embodiments of the methods described above in this specification can be referred to each other. Each embodiment focuses on the differences from other embodiments, and relevant parts can be referred to the description of other method embodiments.
[0223] Figure 8 is a block diagram of a multi-channel speech reconstruction system provided in an embodiment of this application. Referring to Figure 8, the multi-channel speech reconstruction system 10 includes: multiple encoding end devices 11, and a decoding end device 12 that is communicatively connected to the multiple encoding end devices 11 via a communication network 13.
[0224] Each encoding terminal device 11 is configured to: acquire one speech signal; perform time-frequency transformation processing on the speech signal to obtain a corresponding linearly transformed speech signal; and extract speech features from the linearly transformed speech signal and encode the speech features to obtain a corresponding bitstream data.
[0225] As shown in Figure 8, there are three encoding end devices 11 in this exemplary embodiment, namely device A, device B and device C. After obtaining their respective bitstream data, they send the corresponding three channels of bitstream data to the decoding end device 12 through a preset communication network 13.
[0226] In some embodiments, the communication network 13 (communication protocol) may include Ethernet, Bluetooth, Zigbee, Z-Wave, or the Smart Home Working Group standard (i.e., Project Connected Home over IP) that supports IPv6 networks, VoIP protocol (Voice over Internet Protocol), etc.
[0227] The decoding device 12 is configured to: acquire multi-stream data from multiple encoding devices; perform stream decoding on the multi-stream data to obtain speech features corresponding to each of the multi-stream data; determine a target fusion strategy based on a comparison between the number of paths corresponding to the multi-stream data and a preset number of fusion paths; fuse a target number of speech features based on the target fusion strategy to obtain fused speech features, where the target number is the number of paths, among at least some of the paths, whose speech features are fused according to the target fusion strategy; and perform inverse linear transform on the fused speech features to obtain reconstructed speech signals corresponding to the multiple speech signals.
[0228] Figure 9 is a block diagram of a multi-channel speech reconstruction device provided in an embodiment of this application. Referring to Figure 9, the multi-channel speech reconstruction device 20 includes: a data acquisition unit 21, a code stream decoding unit 22, a strategy judgment unit 23, a feature fusion unit 24, and a signal reconstruction unit 25.
[0229] The data acquisition unit 21 is configured to acquire multi-channel code stream data corresponding to multiple voice signals; the code stream data is an encoded code stream after encoding the voice features of the voice signals, and the voice features are signal features extracted from the corresponding voice signals after linear transformation.
[0230] The code stream decoding unit 22 is configured to perform code stream decoding processing on the multi-channel code stream data to obtain the speech features corresponding to each of the multi-channel code stream data.
[0231] The strategy judgment unit 23 is configured to determine the target fusion strategy by comparing the number of paths corresponding to the multi-stream data with the preset number of fusion paths.
[0232] The feature fusion unit 24 is configured to fuse a target number of speech features based on the target fusion strategy to obtain fused speech features; the target number is the number of paths, among at least some of the paths, whose speech features are fused according to the target fusion strategy.
[0233] The signal reconstruction unit 25 is configured to perform linear inverse transform processing on the fused speech features to obtain the reconstructed speech signal corresponding to the multi-channel speech signal.
[0234] In some embodiments, in the aspect of performing linear inverse transform processing on the fused speech features to obtain the reconstructed speech signal corresponding to the multi-channel speech signal, the signal reconstruction unit 25 is specifically used for:
[0235] The fused speech features are subjected to time-frequency decoding to obtain corresponding time-frequency domain data; the time-frequency domain data is used to characterize the time-domain and frequency-domain properties of the fused speech features.
[0236] The time-frequency domain data is subjected to inverse time-frequency transform to obtain the speech signal with fused speech features, and the speech signal with fused speech features is used as the reconstructed speech signal corresponding to the multi-channel speech signal.
[0237] In some embodiments, the speech features of each of the speech signals are expressed based on corresponding feature vectors; the bitstream data is an encoded bitstream obtained based on a vector coding strategy or a scalar coding strategy; in terms of performing bitstream decoding processing on the multiple bitstream data to obtain the speech features corresponding to each of the multiple bitstream data, the bitstream decoding unit 22 is specifically used for:
[0238] Based on the decoding strategy corresponding to the encoding strategy of each bitstream data, the bitstream data of each bitstream data is decoded respectively to obtain the decoding feature vector of the multi-bitstream data, and the decoding feature vector of the multi-bitstream data is used as the speech feature corresponding to each of the multi-bitstream data.
[0239] The decoding strategy corresponding to the encoding strategy of the bitstream data includes a vector decoding strategy corresponding to the vector encoding strategy or a scalar decoding strategy corresponding to the scalar encoding strategy.
[0240] In some embodiments, in determining the target fusion strategy based on the comparison result between the number of paths corresponding to the multi-stream data and the preset number of fusion paths, the strategy determination unit 23 is specifically used for:
[0241] If the comparison result indicates that the number of paths corresponding to the multi-stream data is less than or equal to the preset number of fusion paths, a preset first fusion strategy is determined as the target fusion strategy; the first fusion strategy is used to instruct the fusion of the decoding feature vectors corresponding to each of the multi-stream data to obtain a fused feature vector; or
[0242] If the comparison result shows that the number of paths corresponding to the multi-stream data is greater than the preset number of fusion paths, a preset second fusion strategy is determined as the target fusion strategy. The second fusion strategy is used to instruct filtering the bitstream data of at least some paths out of the paths corresponding to the multi-stream data, based on the sorting result among the multiple energy values corresponding to the speech signals of the multi-stream data, and fusing the decoding feature vectors corresponding to the filtered bitstream data of the at least some paths to obtain a fused feature vector.
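The comparison performed by the strategy determination unit 23 can be sketched as a simple dispatch. The enum and function names are hypothetical labels for the first and second fusion strategies described above; the strategies themselves are not implemented here.

```python
from enum import Enum

class FusionStrategy(Enum):
    FUSE_ALL = "first fusion strategy"              # paths <= preset fusion paths
    FILTER_BY_ENERGY = "second fusion strategy"     # paths > preset fusion paths

def choose_strategy(num_paths, preset_fusion_paths):
    """Pick the target fusion strategy from the path-count comparison."""
    if num_paths <= preset_fusion_paths:
        return FusionStrategy.FUSE_ALL
    return FusionStrategy.FILTER_BY_ENERGY

for paths in (3, 10, 12):
    print(paths, choose_strategy(paths, preset_fusion_paths=10).value)
```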
[0243] In some embodiments, in obtaining a fused feature vector by fusing the decoding feature vectors corresponding to the various bitstream data, the feature fusion unit 24 is further configured to:
[0244] If the comparison result indicates that the number of paths in the multi-stream data is equal to the preset number of fusion paths, the decoding feature vectors corresponding to each of the multi-stream data are fused to obtain a fused feature vector; or
[0245] If the comparison result shows that the number of paths of the multi-stream data is less than the preset number of fusion paths, the decoding feature vectors corresponding to each of the multi-stream data are fused with the zero vector of the first number of paths to obtain a fused feature vector.
[0246] Wherein, the first path number is the difference between the preset fusion path number and the path number corresponding to the multi-stream data, and the zero vector has the same dimension as the decoding feature vector.
[0247] In some embodiments, based on the sorting results among multiple energy values corresponding to the speech signals of the multi-stream data, at least a portion of the path data corresponding to the multi-stream data is filtered, and the decoding feature vectors corresponding to the filtered at least a portion of the path data are fused to obtain a fused feature vector. Specifically, the feature fusion unit 24 is further configured to:
[0248] The multi-stream data are sorted from largest to smallest according to the energy value corresponding to each stream data to obtain the sorting result; the energy value corresponding to each stream data is the energy value of the speech signal of the stream data.
[0249] The top-ranked bitstream data in the sorting result, up to a first preset number, are determined as the filtered bitstream data; the first preset number is equal to the preset number of fusion paths;
[0250] The fused feature vector is obtained by fusing the decoding feature vectors corresponding to the preset number of filtered bitstream data.
[0251] In some embodiments, based on the sorting results among multiple energy values corresponding to the speech signals of the multi-stream data, at least a portion of the path data corresponding to the multi-stream data is filtered, and the decoding feature vectors corresponding to the filtered at least a portion of the path data are fused to obtain a fused feature vector. Specifically, the feature fusion unit 24 is further configured to:
[0252] The multi-stream data are sorted from largest to smallest according to the energy value corresponding to each stream data to obtain the sorting result; the energy value corresponding to each stream data is the energy value of the speech signal of the stream data.
[0253] The top-ranked bitstream data in the sorting result, up to a second preset number, are determined as the filtered bitstream data; the second preset number is less than the preset number of fusion paths;
[0254] In the multi-stream data, the decoding feature vectors corresponding to the remaining stream data are fused to obtain a subclass fused feature vector; the remaining stream data are the stream data that do not belong to the filtered stream data among the multi-stream data.
[0255] The fused feature vector is obtained by fusing the decoding feature vectors corresponding to each of the filtered bitstream data and the fused feature vector of the subclass.
[0256] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0257] Figure 10 is a block diagram of an electronic device provided in an embodiment of this application. For example, the electronic device 30 can be a server, an electronic component, or a server array, etc. Referring to Figure 10, the electronic device 30 includes a processor 31, which may be a collection of processors, including one or more processors. The electronic device 30 also includes memory resources represented by a memory 32, on which computer programs, such as application programs, are stored. The computer programs stored in the memory 32 may include one or more modules, each corresponding to a set of executable instructions. Furthermore, the processor 31 is configured to implement the multi-channel speech reconstruction method described above when executing the computer program.
[0258] In some embodiments, electronic device 30 is a server, and the computing system within this server may run one or more operating systems, including any of the operating systems discussed above and any commercially available server operating system. Electronic device 30 may also run any of a variety of additional server applications and / or middleware applications, including HTTP (Hypertext Transfer Protocol) servers, FTP (File Transfer Protocol) servers, CGI (Common Gateway Interface) servers, super servers, database servers, etc. Exemplary database servers include, but are not limited to, commercially available database servers from companies such as IBM.
[0259] In some embodiments, the processor 31 typically controls the overall operation of the electronic device 30, such as operations associated with display, data processing, and data communication. The processor 31 may include one or more processors to execute computer programs to perform all or part of the steps of the methods described above. Furthermore, the processor 31 may include one or more modules to facilitate interaction between the processor 31 and other components. For example, the processor 31 may include a multimedia module to facilitate control of the interaction between the user electronic device 30 and the processor 31 using multimedia components.
[0260] In some embodiments, the processor component in processor 31 may also be referred to as a CPU (Central Processing Unit). The processor component may be an electronic chip with signal processing capabilities. The processor component may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor component may be any conventional processor. Furthermore, the processing component may be implemented using integrated circuit chips.
[0261] In some embodiments, memory 32 is configured to store various types of data to support operation of electronic device 30. Examples of such data include instructions for any application or method operating on electronic device 30, acquired data, messages, images, videos, etc. Memory 32 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, optical disk, or graphene storage.
[0262] In some embodiments, the memory 32 can be a memory stick, TF card, etc., and can store all the information in the electronic device 30, including the original input data, computer program, intermediate running results and final running results stored in the memory 32.
[0263] In some embodiments, the memory 32 stores and retrieves information according to the location specified by the processor 31. In some embodiments, the memory 32 enables the electronic device 30 to have a memory function and ensure normal operation.
[0264] In some embodiments, the memory 32 of the electronic device 30 can be classified into main memory (RAM) and auxiliary memory (external memory) according to its purpose. There are also classification methods that divide it into external memory and internal memory. External memory is typically magnetic media or optical discs, which can store information for a long time. RAM refers to the storage components on the motherboard, used to store currently executing data and programs, but it is only used for temporary storage of programs and data; the data will be lost when the power is turned off or disconnected.
[0265] In some embodiments, the electronic device 30 may further include: a power supply component 33 configured to perform power management of the electronic device 30, a wired or wireless network interface 34 configured to connect the electronic device 30 to a network, and an input / output (I / O) interface 35. The electronic device 30 may operate on an operating system stored in memory 32, such as Windows Server, MacOS X, Unix, Linux, FreeBSD, or similar.
[0266] In some embodiments, power supply component 33 provides power to various components of electronic device 30. Power supply component 33 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 30.
[0267] In some embodiments, the wired or wireless network interface 34 is configured to facilitate wired or wireless communication between the electronic device 30 and other devices. The electronic device 30 may access wireless networks based on communication standards, such as WiFi, carrier networks (such as 2G, 3G, 4G, or 5G), or combinations thereof.
[0268] In some embodiments, the wired or wireless network interface 34 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the wired or wireless network interface 34 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0269] In some embodiments, the input / output (I / O) interface 35 provides an interface between the processor 31 and peripheral interface modules, such as a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to, a home button, volume buttons, a power button, and a lock button.
[0270] Figure 11 is a block diagram of a computer-readable storage medium 40 provided in an embodiment of this application. The computer-readable storage medium 40 stores a computer program 41, which, when executed by a processor of an electronic device, implements the multi-channel speech reconstruction method described above.
[0271] If the integrated units of the various functional units in the various embodiments of this application are implemented as software functional units and sold or used as independent products, they can be stored in the computer-readable storage medium 40. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer-readable storage medium 40 includes a computer program 41, which includes several instructions to cause a computer device (which may be a personal computer, system server, or network device, etc.), an electronic device (e.g., MP3, MP4, etc., or a mobile phone, tablet computer, wearable device, etc., or a desktop computer, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of this application.
[0272] Figure 12 is a block diagram of a computer program product 50 provided in an embodiment of this application. The computer program product 50 includes program instructions 51, which can be executed by a processor of an electronic device to implement the multi-channel speech reconstruction method described above.
[0273] Those skilled in the art will understand that embodiments of this application may provide a multi-channel speech reconstruction method, a multi-channel speech reconstruction system 10, a multi-channel speech reconstruction apparatus 20, an electronic device 30, a computer-readable storage medium 40, or a computer program product 50. Therefore, this application may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application may take the form of a computer program product 50 embodied on one or more computer program instructions 51 (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0274] This application is described with reference to flowchart illustrations and/or block diagrams of a multi-channel speech reconstruction method, a multi-channel speech reconstruction system 10, a multi-channel speech reconstruction apparatus 20, an electronic device 30, a computer-readable storage medium 40, or a computer program product 50 according to embodiments of this application. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by the computer program product 50. These computer program products 50 can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing apparatus to produce a machine, such that the program instructions 51, when executed by the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0275] These computer program products 50 may also be stored in a computer-readable storage medium capable of directing a computer or other programmable data processing device to function in a particular manner, such that the program instructions 51 stored in the computer program product 50 produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0276] These program instructions 51 may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the program instructions 51 executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0277] It should be noted that the various methods, apparatuses, electronic devices, computer-readable storage media, computer program products, etc. described above may also include other implementation methods according to the description of the method embodiments. For specific implementation methods, please refer to the description of the relevant method embodiments, which will not be elaborated here.
[0278] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the claims.
[0279] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A multi-channel speech reconstruction method, characterized in that it comprises: acquiring multi-channel bitstream data corresponding to multiple speech signals, wherein each channel of bitstream data is an encoded bitstream obtained by encoding the speech features of the corresponding speech signal, and the speech features are signal features extracted from that speech signal after linear transformation; performing bitstream decoding on the multi-channel bitstream data to obtain the speech features corresponding to each channel of bitstream data; determining a target fusion strategy based on a comparison between the number of paths corresponding to the multi-channel bitstream data and a preset number of fusion paths; fusing a target number of speech features based on the target fusion strategy to obtain fused speech features, wherein the target number is the number of paths, being at least a portion of the paths, whose speech features are fused according to the target fusion strategy; and performing inverse linear transform processing on the fused speech features to obtain a reconstructed speech signal corresponding to the multiple speech signals.
2. The method according to claim 1, characterized in that the step of performing inverse linear transform processing on the fused speech features to obtain the reconstructed speech signal corresponding to the multiple speech signals comprises: performing time-frequency decoding on the fused speech features to obtain corresponding time-frequency domain data, wherein the time-frequency domain data characterizes the time-domain and frequency-domain properties of the fused speech features; and performing an inverse time-frequency transform on the time-frequency domain data to obtain a speech signal carrying the fused speech features, and using that speech signal as the reconstructed speech signal corresponding to the multiple speech signals.
3. The method according to claim 1, characterized in that the speech features of each speech signal are expressed as a corresponding feature vector, and each channel of bitstream data is an encoded bitstream obtained based on a vector coding strategy or a scalar coding strategy; the step of performing bitstream decoding on the multi-channel bitstream data to obtain the speech features corresponding to each channel of bitstream data comprises: decoding each channel of bitstream data based on the decoding strategy corresponding to its coding strategy to obtain the decoding feature vectors of the multi-channel bitstream data, and using the decoding feature vectors as the speech features corresponding to each channel of bitstream data; wherein the decoding strategy corresponding to the coding strategy of the bitstream data is a vector decoding strategy corresponding to the vector coding strategy or a scalar decoding strategy corresponding to the scalar coding strategy.
4. The method according to claim 3, characterized in that the step of determining the target fusion strategy based on the comparison between the number of paths corresponding to the multi-channel bitstream data and the preset number of fusion paths comprises: if the comparison result indicates that the number of paths corresponding to the multi-channel bitstream data is less than or equal to the preset number of fusion paths, determining a preset first fusion strategy as the target fusion strategy, wherein the first fusion strategy instructs fusing the decoding feature vectors corresponding to each channel of bitstream data to obtain a fused feature vector; or, if the comparison result indicates that the number of paths corresponding to the multi-channel bitstream data is greater than the preset number of fusion paths, determining a preset second fusion strategy as the target fusion strategy, wherein the second fusion strategy instructs selecting, based on a ranking of the energy values of the speech signals corresponding to the multi-channel bitstream data, the bitstream data of at least some paths from among the paths corresponding to the multi-channel bitstream data, and fusing the decoding feature vectors corresponding to the selected bitstream data to obtain a fused feature vector.
5. The method according to claim 4, characterized in that fusing the decoding feature vectors corresponding to each channel of bitstream data to obtain a fused feature vector comprises: if the comparison result indicates that the number of paths corresponding to the multi-channel bitstream data is equal to the preset number of fusion paths, fusing the decoding feature vectors corresponding to each channel of bitstream data to obtain the fused feature vector; or, if the comparison result indicates that the number of paths corresponding to the multi-channel bitstream data is less than the preset number of fusion paths, fusing the decoding feature vectors corresponding to each channel of bitstream data together with zero vectors for a first number of paths to obtain the fused feature vector; wherein the first number of paths is the difference between the preset number of fusion paths and the number of paths corresponding to the multi-channel bitstream data, and each zero vector has the same dimension as the decoding feature vectors.
6. The method according to claim 4, characterized in that selecting, based on the ranking of the energy values of the speech signals corresponding to the multi-channel bitstream data, the bitstream data of at least some paths, and fusing the decoding feature vectors corresponding to the selected bitstream data to obtain a fused feature vector, comprises: sorting the multi-channel bitstream data in descending order of the energy value corresponding to each channel of bitstream data to obtain a ranking result, wherein the energy value corresponding to each channel of bitstream data is the energy value of the speech signal of that bitstream data; determining the leading first preset number of bitstream data in the ranking result as the selected bitstream data, wherein the first preset number is equal to the preset number of fusion paths; and fusing the decoding feature vectors corresponding to the first preset number of selected bitstream data to obtain the fused feature vector.
7. The method according to claim 4, characterized in that selecting, based on the ranking of the energy values of the speech signals corresponding to the multi-channel bitstream data, the bitstream data of at least some paths, and fusing the decoding feature vectors corresponding to the selected bitstream data to obtain a fused feature vector, comprises: sorting the multi-channel bitstream data in descending order of the energy value corresponding to each channel of bitstream data to obtain a ranking result, wherein the energy value corresponding to each channel of bitstream data is the energy value of the speech signal of that bitstream data; determining the leading second preset number of bitstream data in the ranking result as the selected bitstream data, wherein the second preset number is less than the preset number of fusion paths; fusing the decoding feature vectors corresponding to the remaining bitstream data to obtain a sub-class fused feature vector, wherein the remaining bitstream data are the bitstream data among the multi-channel bitstream data that do not belong to the selected bitstream data; and fusing the decoding feature vectors corresponding to each of the selected bitstream data with the sub-class fused feature vector to obtain the fused feature vector.
8. A multi-channel speech reconstruction system, characterized in that it comprises multiple encoding end devices and a decoding end device communicatively connected to each of the multiple encoding end devices, wherein: each encoding end device is configured to acquire one speech signal, perform linear transformation processing on the speech signal to obtain a corresponding linearly transformed speech signal, extract speech features from the linearly transformed speech signal, and encode the speech features to obtain one corresponding channel of bitstream data; and the decoding end device is configured to acquire multi-channel bitstream data from the multiple encoding end devices, perform bitstream decoding on the multi-channel bitstream data to obtain the speech features corresponding to each channel of bitstream data, determine a target fusion strategy based on a comparison between the number of paths corresponding to the multi-channel bitstream data and a preset number of fusion paths, fuse a target number of speech features based on the target fusion strategy to obtain fused speech features, wherein the target number is the number of paths, being at least a portion of the paths, whose speech features are fused according to the target fusion strategy, and perform inverse linear transform processing on the fused speech features to obtain a reconstructed speech signal corresponding to the multiple speech signals.
9. A multi-channel speech reconstruction device, characterized in that it comprises: a data acquisition unit configured to acquire multi-channel bitstream data corresponding to multiple speech signals, wherein each channel of bitstream data is an encoded bitstream obtained by encoding the speech features of the corresponding speech signal, and the speech features are signal features extracted from that speech signal after linear transformation; a bitstream decoding unit configured to perform bitstream decoding on the multi-channel bitstream data to obtain the speech features corresponding to each channel of bitstream data; a strategy determination unit configured to determine a target fusion strategy based on a comparison between the number of paths corresponding to the multi-channel bitstream data and a preset number of fusion paths; a feature fusion unit configured to fuse a target number of speech features based on the target fusion strategy to obtain fused speech features, wherein the target number is the number of paths, being at least a portion of the paths, whose speech features are fused according to the target fusion strategy; and a signal reconstruction unit configured to perform inverse linear transform processing on the fused speech features to obtain a reconstructed speech signal corresponding to the multiple speech signals.
10. An electronic device, characterized in that it comprises: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the executable instructions to implement the multi-channel speech reconstruction method according to any one of claims 1 to 7.
11. A computer-readable storage medium, characterized in that, when a computer program stored in the computer-readable storage medium is executed by a processor of an electronic device, the electronic device is enabled to perform the multi-channel speech reconstruction method according to any one of claims 1 to 7.
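By way of illustration only, the fusion-strategy selection recited in claims 4 to 6 can be sketched in Python as follows. This is one possible reading under stated assumptions, not the claimed implementation: the claims do not define the fusion operator, so the sketch assumes that fusing means stacking the per-path decoding feature vectors into a fixed-size array with one row per fusion path. The names fuse_features, energies, and preset_paths are hypothetical, and the sub-class fusion of claim 7 is not covered.

```python
import numpy as np


def fuse_features(features, energies, preset_paths):
    """Fuse per-path decoding feature vectors into a (preset_paths, D) array.

    Assumption: "fusion" is modelled as row-wise stacking; the source does
    not specify the operator.
    """
    if len(features) == 0 or len(features) != len(energies):
        raise ValueError("need one energy value per non-empty feature list")

    vectors = [np.asarray(f, dtype=float) for f in features]
    num_paths = len(vectors)
    dim = vectors[0].shape[0]

    if num_paths <= preset_paths:
        # First fusion strategy: use every path. When there are fewer paths
        # than the preset number, pad with zero vectors of the same dimension;
        # the number of zero vectors is the difference between the two counts.
        padding = [np.zeros(dim) for _ in range(preset_paths - num_paths)]
        return np.stack(vectors + padding)

    # Second fusion strategy: rank paths by signal energy in descending order
    # and keep the leading preset_paths entries.
    ranking = sorted(range(num_paths), key=lambda i: energies[i], reverse=True)
    selected = ranking[:preset_paths]
    return np.stack([vectors[i] for i in selected])


if __name__ == "__main__":
    feats = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    energy = [0.1, 0.9, 0.5]
    print(fuse_features(feats, energy, 2))  # more paths than preset: top-2 by energy
    print(fuse_features(feats, energy, 4))  # fewer paths than preset: zero-padded
```

Keeping the output shape fixed at the preset number of fusion paths is a design choice of this sketch; it reflects the zero-vector padding of claim 5, which suggests that the downstream inverse transform expects a constant number of inputs.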