Method, training method and device for estimating room acoustic impulse response

CN116861182B · Active · Publication Date: 2026-06-19 · DINGTALK (CHINA) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DINGTALK (CHINA) INFORMATION TECH CO LTD
Filing Date
2023-06-14
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, field measurement methods are time-consuming and labor-intensive and not applicable to all scenarios, blind estimation methods cannot accurately reconstruct the acoustic impulse response of a room based on only a few acoustic parameters, and neural networks assume attenuation models, leading to inaccurate estimations.

Method used

By using a segmented room acoustic impulse response estimation model, feature vectors of speech signals are extracted and processed in segments. Neural networks are then used for blind estimation and splicing to generate accurate room acoustic impulse responses.

Benefits of technology

It reduces model complexity and computational difficulty, improves the estimation accuracy of room acoustic impulse response, and approximates the real response as closely as possible.

✦ Generated by Eureka AI based on patent content.

Figure CN116861182B_ABST

Abstract

This specification provides one or more embodiments of a method, training method, and apparatus for estimating the acoustic impulse response of a room. The method includes: extracting acoustic environment information related to the transmission environment of a speech signal acquired by a speech acquisition device, and generating a feature vector corresponding to the speech signal based on the acoustic environment information; segmenting the feature vector to obtain multiple vector segments; sequentially inputting the multiple vector segments into a pre-trained room acoustic impulse response estimation model to estimate the room acoustic impulse response, thereby obtaining impulse response vector segments corresponding to each vector segment; wherein the impulse response vector represents the acoustic characteristics of the speech signal transmitted from a sound source to the speech acquisition hardware; and concatenating the impulse response segments to obtain the room acoustic impulse response corresponding to the speech signal.

Description

Technical Field

[0001] One or more embodiments of this specification relate to the field of audio data processing, and more particularly to a method, training method and apparatus for estimating room acoustic impulse response.

Background Technology

[0002] Room impulse response (RIR) can fully represent the acoustic characteristics between the sound source and the receiving voice acquisition device, such as a microphone. These acoustic characteristics play a very important role in signal analysis and speech processing, such as augmented reality audio, speech quality assessment, speech dereverberation, and far-field speech recognition.

[0003] Related technologies can employ on-site measurements, using manual intervention and equipment to play a special signal (such as a frequency sweep signal) at the sound source location to extract the transfer function between the sound source location and the speech acquisition device. However, on-site measurement is time-consuming, labor-intensive, and impractical, and it cannot even be performed in every scenario. Therefore, blind estimation methods for room acoustic impulse response have emerged, estimating several acoustic parameters from the speech signal acquired by the speech acquisition device. However, based solely on a few acoustic parameters, it is impossible to reconstruct the acoustic features and obtain an accurate room acoustic impulse response.

Summary of the Invention

[0004] In view of this, one or more embodiments of this specification provide a method, training method, and apparatus for estimating the acoustic impulse response of a room, in order to solve the problems existing in the related art.

[0005] To achieve the above objectives, one or more embodiments of this specification provide the following technical solutions:

[0006] According to a first aspect of one or more embodiments of this specification, a method for estimating the acoustic impulse response of a room is provided, comprising:

[0007] From the acquired speech signal, acoustic environment information related to the transmission environment of the speech signal is extracted, and a feature vector corresponding to the speech signal is generated based on the acoustic environment information.

[0008] The feature vector is segmented to obtain multiple vector segments;

[0009] The multiple vector segments are input into the room acoustic impulse response estimation model to estimate the room acoustic impulse response, so as to obtain the impulse response vector segment corresponding to each vector segment; wherein, the impulse response vector is used to represent the acoustic characteristics of the speech signal transmitted from the sound source to the speech acquisition hardware; the room acoustic impulse response estimation model is a machine learning model trained with the feature vector of the speech signal sample as the training sample feature and the room acoustic impulse response corresponding to the speech signal sample as the sample label;

[0010] The impulse response vector segments are spliced together to obtain the room acoustic impulse response corresponding to the speech signal.

[0011] According to a second aspect of one or more embodiments of this specification, a training method for a room acoustic impulse response estimation model is proposed, comprising:

[0012] From the speech signal samples in the speech signal sample set, acoustic environment information related to the transmission environment of the speech signal samples is extracted, and feature vectors corresponding to the speech signals are generated based on the acoustic environment information; wherein, the speech signal sample set includes speech signal samples, and room acoustic impulse responses corresponding to the speech signal samples as sample labels;

[0013] The feature vector corresponding to the speech signal sample is segmented to obtain multiple vector segments corresponding to the speech signal sample.

[0014] Multiple vector segments corresponding to the speech signal sample are input into the room acoustic impulse response estimation model to perform room acoustic impulse response estimation. The impulse response vector segments corresponding to each vector segment are output and then concatenated to obtain the room acoustic impulse response corresponding to the speech signal sample.

[0015] The model loss is obtained based on the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response used as a label, and the room acoustic impulse response estimation model is trained based on the model loss.

[0016] According to a third aspect of one or more embodiments of this specification, an estimation apparatus for room acoustic impulse response is provided, comprising:

[0017] The voice acquisition module is used to extract acoustic environment information related to the transmission environment of the acquired voice signal, and generate a feature vector corresponding to the voice signal based on the acoustic environment information.

[0018] The segmentation module is used to segment the feature vector to obtain multiple vector segments;

[0019] The model estimation module is used to input the multiple vector segments into the room acoustic impulse response estimation model to estimate the room acoustic impulse response, so as to obtain the impulse response vector segment corresponding to each vector segment; wherein, the impulse response vector is used to represent the acoustic characteristics of the speech signal transmitted from the sound source to the speech acquisition hardware; the room acoustic impulse response estimation model is a machine learning model trained with the feature vector of the speech signal sample as the training sample feature and the room acoustic impulse response corresponding to the speech signal sample as the sample label;

[0020] The result output module is used to splice together the various impulse response segments to obtain the room acoustic impulse response corresponding to the speech signal.

[0021] According to a fourth aspect of one or more embodiments of this specification, a training apparatus for a room acoustic impulse response estimation model is provided, comprising:

[0022] The sample processing module is used to extract acoustic environment information related to the transmission environment of the speech signal samples from the speech signal samples in the speech signal sample set, and generate feature vectors corresponding to the speech signals based on the acoustic environment information; wherein, the speech signal sample set includes speech signal samples and room acoustic impulse responses corresponding to the speech signal samples as sample labels;

[0023] The feature segmentation module is used to segment the feature vector corresponding to the speech signal sample to obtain multiple vector segments corresponding to the speech signal sample.

[0024] The model calculation module is used to input multiple vector segments corresponding to the speech signal sample into the room acoustic impulse response estimation model, perform room acoustic impulse response estimation, output the impulse response vector segment corresponding to each vector segment, and splice the impulse response vector segments to obtain the room acoustic impulse response corresponding to the speech signal sample.

[0025] The model training module is used to obtain the model loss based on the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response as a label, and to train the room acoustic impulse response estimation model based on the model loss.

[0026] According to a fifth aspect of one or more embodiments of this specification, an electronic device is provided, comprising:

[0027] A processor;

[0028] A memory for storing processor-executable instructions;

[0029] wherein the processor implements the method as described in the first or second aspect by running the executable instructions.

[0030] According to a sixth aspect of one or more embodiments of this specification, a computer-readable storage medium is provided that stores computer instructions thereon, which, when executed by a processor, implement the steps of the method as described in the first or second aspect.

[0031] In the above technical solution, a segmented room acoustic impulse response estimation model is used. First, the feature vector corresponding to the speech signal is segmented. The room acoustic impulse response estimation model blindly estimates the impulse response segment corresponding to each vector segment. Then, the segments are spliced together to obtain the complete room acoustic impulse response. This utilizes the powerful modeling capabilities of neural networks to estimate the room acoustic impulse response accurately. Furthermore, the segmented calculation process allows each vector segment to share the network parameters in the room acoustic impulse response estimation model, which is equivalent to generating multiple sub-band networks with independent outputs. This greatly reduces the complexity and computational difficulty of the model, and approximates the real room acoustic impulse response to the greatest extent possible.

Attached Figure Description

[0032] Figure 1 is a schematic diagram of the architecture of an estimation system for room acoustic impulse response provided in an exemplary embodiment;

[0033] Figure 2 is a flowchart of a method for estimating the acoustic impulse response of a room, provided in an exemplary embodiment;

[0034] Figure 3 is a schematic diagram of the model structure of an estimation method for room acoustic impulse response provided in an exemplary embodiment;

[0035] Figure 4 is a schematic diagram of the model structure of another method for estimating the acoustic impulse response of a room, provided in an exemplary embodiment.

[0036] Figure 5 is a flowchart illustrating a training method for a room acoustic impulse response estimation model, provided in an exemplary embodiment.

[0037] Figure 6 is a schematic diagram of the model structure of another method for estimating the acoustic impulse response of a room, provided in an exemplary embodiment.

[0038] Figure 7 is a schematic diagram of the model structure of another method for estimating the acoustic impulse response of a room, provided in an exemplary embodiment.

[0039] Figure 8 is a schematic diagram of the structure of an estimation device for room acoustic impulse response provided in an exemplary embodiment;

[0040] Figure 9 is a schematic diagram of the structure of a training device for a room acoustic impulse response estimation model, provided in an exemplary embodiment.

[0041] Figure 10 is a schematic diagram of the structure of an electronic device provided in an exemplary embodiment.

Detailed Implementation

[0042] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.

[0043] It should be noted that the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification in other embodiments. In some other embodiments, the methods may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may be broken down into multiple steps in other embodiments; and multiple steps described in this specification may be combined into a single step in other embodiments.

[0044] The room acoustic impulse response can fully represent the acoustic characteristics between the sound source and the receiving voice acquisition device. These acoustic characteristics play a very important role in signal analysis and speech processing, such as augmented reality audio, speech quality assessment, speech dereverberation, and far-field speech recognition.

[0045] Related technologies can employ on-site measurements, using manual intervention and equipment to play a specific signal, such as a frequency sweep signal, at the sound source location to extract the transfer function between the sound source location and the voice acquisition device, thereby generating the corresponding room acoustic impulse response. However, on-site measurement methods are time-consuming, labor-intensive, and impractical, and cannot be performed in every real-world scenario.

[0046] Therefore, blind estimation methods for the acoustic impulse response of a room have emerged. Several important acoustic parameters, such as reverberation time or direct-to-reverberant ratio, can be estimated from the speech signal acquired by the speech acquisition device, and these parameters are assumed to represent the acoustic characteristics of the current environment. However, these few acoustic parameters alone are far from sufficient to reconstruct the acoustic characteristics of the current environment.

[0047] Some recent related technologies propose estimating the room acoustic impulse response with the powerful modeling capabilities of neural networks. However, because the room acoustic impulse response is very long (usually tens of thousands of samples), these approaches simplify the model by assuming that the reverberation follows a preset attenuation model, and therefore fail to obtain an accurate room acoustic impulse response.

[0048] In view of this, this specification proposes a segmented room acoustic impulse response estimation model, which divides the feature vector of the speech signal into segments, passes the resulting vector segments sequentially through the room acoustic impulse response estimation model to obtain impulse response segments, and then splices the various impulse response segments to obtain the room acoustic impulse response corresponding to the speech signal.

[0049] In implementation, acoustic environment information related to the transmission environment of the voice signal is extracted from the voice signal acquired by the voice acquisition device, and a feature vector corresponding to the voice signal is generated based on the acoustic environment information; the feature vector is segmented to obtain multiple vector segments; the multiple vector segments are sequentially input into a pre-trained room acoustic impulse response estimation model to estimate the room acoustic impulse response, so as to obtain the impulse response vector segment corresponding to each vector segment; the impulse response segments are concatenated to obtain the room acoustic impulse response corresponding to the voice signal.

[0050] In the above technical solution, a segmented room acoustic impulse response estimation model is used. First, the feature vector corresponding to the speech signal is segmented. The room acoustic impulse response estimation model blindly estimates the impulse response segment corresponding to each vector segment. Then, the segments are spliced together to obtain the complete room acoustic impulse response. This utilizes the powerful modeling capabilities of neural networks to estimate the room acoustic impulse response accurately. Furthermore, the segmented calculation process allows each vector segment to share the network parameters in the room acoustic impulse response estimation model, which is equivalent to generating multiple sub-band networks with independent outputs. This greatly reduces the complexity and computational difficulty of the model, and approximates the real room acoustic impulse response to the greatest extent possible.

[0051] Figure 1 is a schematic diagram of the architecture of an estimation system for the acoustic impulse response of a room, provided in an exemplary embodiment. As shown in Figure 1, the system may include a network 10, a server 11, and several electronic devices, such as terminal devices 12, 13, and 14.

[0052] Server 11 can be a physical server containing an independent host, or server 11 can be a virtual server, cloud server, etc., hosted by a host cluster. Terminal devices 12-14 are just one type of electronic device that users can use. In reality, users can obviously also use electronic devices such as mobile phones, tablets, laptops, PDAs (Personal Digital Assistants), wearable devices (such as smart glasses, smartwatches, etc.), etc., and one or more embodiments in this specification do not limit this. Network 10 can include various types of wired or wireless networks.

[0053] In one embodiment, the server 11 can cooperate with terminal devices 12-14; wherein, the terminal devices 12-14 can acquire voice signals through a voice acquisition device and upload the acquired voice signals to the server 11 via network 10, and then the server 11 processes the received voice signals based on the room acoustic impulse response estimation method of this specification to obtain the room acoustic impulse response. In another embodiment, the terminal devices 12-14 can independently implement the room acoustic impulse response estimation method of this specification; wherein, the terminal devices 12-14 acquire voice signals and process the acquired voice signals based on the room acoustic impulse response estimation method of this specification to obtain the room acoustic impulse response.

[0054] To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in this specification will be clearly and completely described below with reference to the accompanying drawings in the embodiments of this application.

[0055] Figure 2 is a flowchart of a method for estimating the acoustic impulse response of a room, provided in an exemplary embodiment. The method can be executed by a server or terminal device as described above, and includes the following steps.

[0056] S210. Extract acoustic environment information related to the transmission environment of the acquired speech signal from the acquired speech signal, and generate a feature vector corresponding to the speech signal based on the acoustic environment information.

[0057] Speech signals are acquired using pre-deployed speech acquisition equipment. The acquired speech signals can be reverberant speech signals containing reverberation components. The location of the sound source of the speech signal can be random or pre-set.

[0058] Acoustic environment information related to the transmission environment of the speech signal is extracted from the speech signal, and a feature vector (embedding) corresponding to the speech signal is generated based on the acoustic environment information. The acoustic environment information may include reverberation time, direct-to-reverberant ratio, attenuation parameters, distance to the signal source, and so on. Accordingly, the feature vector can be a single feature vector obtained jointly from multiple items of acoustic environment information, or multiple feature vectors can be obtained separately from the individual items of acoustic environment information and then combined.

[0059] There are many ways to extract feature vectors from speech signals. For example, as shown in Figure 3, a preset encoder can be used to directly extract the feature vector z from the speech signal.

[0060] It should be noted that since the speech signal acquired by the speech acquisition device is a time-domain signal, the acquired time-domain speech signal can first be converted into a spectral speech signal x through methods such as short-time Fourier transform. The spectral speech signal is then input into the encoder, and the output is a multi-dimensional feature vector z corresponding to the speech signal, which can be, for example, a 128-dimensional or 256-dimensional feature vector. For simplicity, the speech signals mentioned in the following embodiments are all spectral speech signals.
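The time-domain to spectral conversion described above can be sketched as follows. This is a minimal illustration only: the frame length, hop size, Hann window, 16 kHz sampling rate, and the use of the magnitude spectrum are all assumptions made for the example and are not specified in this document.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Return the magnitude spectrogram of a 1-D time-domain signal.

    frame_len and hop are illustrative values, not taken from the source.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequency bins: frame_len // 2 + 1 per frame.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a synthetic tone at 16 kHz (a hypothetical sampling rate).
fs = 16000
t = np.arange(fs) / fs
x_time = np.sin(2 * np.pi * 440.0 * t)
x_spec = stft_magnitude(x_time)   # spectral signal x, shape (frames, bins)
print(x_spec.shape)
```

The resulting two-dimensional array is the kind of input an encoder with two-dimensional convolutions could consume.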

[0061] It should be noted that the encoder can be a pre-trained neural network model, such as a Residual Neural Network (ResNet). A ResNet can include multiple residual convolutional blocks (ResConvBlocks), each including one or more convolutional computation processes (Conv->BN->ReLU). For example, as shown in Figure 4, the encoder includes a two-dimensional convolutional layer (Conv2d), a batch normalization layer (BatchNorm), a max pooling layer (MaxPool), two residual convolutional blocks (ResConvBlocks), an average pooling layer (AvgPool), and a linear layer (Linear).

[0062] S220. The feature vector is segmented to obtain multiple vector segments.

[0063] S230. The multiple vector segments are input into the room acoustic impulse response estimation model to estimate the room acoustic impulse response, so as to obtain the impulse response vector segment corresponding to each vector segment; wherein, the impulse response vector is used to represent the acoustic characteristics of the speech signal transmitted from the sound source to the speech acquisition hardware; the room acoustic impulse response estimation model is a machine learning model trained with the feature vector of the speech signal sample as the training sample feature and the room acoustic impulse response corresponding to the speech signal sample as the sample label.

[0064] S240. The impulse response segments are spliced together to obtain the room acoustic impulse response corresponding to the speech signal.

[0065] After obtaining the feature vector z corresponding to the speech signal through the encoder, the feature vector z can be input into a pre-trained room acoustic impulse response estimation model to estimate the room acoustic impulse response corresponding to the speech signal. However, the room acoustic impulse response often has tens of thousands of samples; for example, the room acoustic impulse response obtained from a 1-second speech signal at a 48 kHz sampling rate generally requires 48,000 samples to represent. Such a large dimension poses a significant challenge to the parameter setting, training, and inference of the room acoustic impulse response estimation model.

[0066] This application proposes a method for generating room acoustic impulse response in a segmented manner.

[0067] As shown in Figure 3, the feature vector z corresponding to the speech signal obtained by the encoder is segmented to obtain N vector segments, and the position of each vector segment is marked according to the order of the segments. The N vector segments are sequentially input into a pre-trained room acoustic impulse response estimation model to estimate the room acoustic impulse response, and the impulse response vector segment corresponding to each vector segment is calculated. Then, the obtained impulse response vector segments are concatenated according to the order of the position marks corresponding to each vector segment to obtain the complete room acoustic impulse response corresponding to the speech signal.
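The segment, estimate, and concatenate flow of this paragraph can be sketched as below. The function `toy_model` is a hypothetical placeholder for the trained estimation model, and the values N = 8, a 128-dimensional feature vector, and a 48,000-sample output are example figures chosen for illustration.

```python
import numpy as np

def segment(z, n_segments):
    """Split feature vector z into n_segments pieces, each tagged with its position (1..N)."""
    pieces = np.array_split(z, n_segments)
    return [(pos, piece) for pos, piece in enumerate(pieces, start=1)]

def toy_model(piece, out_len):
    """Hypothetical stand-in for the trained estimation model."""
    # Repeats the segment mean; a real model would be a trained network.
    return np.full(out_len, piece.mean())

def estimate_rir(z, n_segments, rir_len):
    out_len = rir_len // n_segments
    marked = segment(z, n_segments)
    outputs = {pos: toy_model(piece, out_len) for pos, piece in marked}
    # Concatenate the impulse response segments in the order of the position marks.
    return np.concatenate([outputs[pos] for pos in sorted(outputs)])

z = np.arange(128, dtype=float)      # e.g. a 128-dimensional feature vector
rir = estimate_rir(z, n_segments=8, rir_len=48000)
print(rir.shape)
```

Because the outputs are keyed by position mark, the segments could be processed in any order and still be spliced correctly.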

[0068] There are various ways to mark each vector segment based on its sequential order. For example, the position of each vector segment can be marked directly according to its arrangement in the feature vector, using numbers from 1 to N. Alternatively, as shown in Figure 3, positional encoding can be used to obtain positional identifiers representing the arrangement order of each vector segment in the feature vector. These positional identifiers are then added as position marks to the corresponding vector segments. Next, the vector segments, together with their position marks, are input into the room acoustic impulse response estimation model to estimate the room acoustic impulse response, calculating the impulse response vector segment corresponding to each position-marked vector segment. Finally, the obtained impulse response vector segments are concatenated according to the positional identifiers to obtain the complete room acoustic impulse response corresponding to the speech signal.
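One possible form of positional encoding is sketched below. The document does not say which encoding is used; the Transformer-style sinusoidal code, the element-wise addition to each segment, and zero-based positions are assumptions made for this example.

```python
import numpy as np

def sinusoidal_code(position, dim):
    """Transformer-style sinusoidal code for one position (dim must be even)."""
    i = np.arange(dim // 2)
    angle = position / (10000.0 ** (2 * i / dim))
    code = np.empty(dim)
    code[0::2] = np.sin(angle)   # even indices carry the sine terms
    code[1::2] = np.cos(angle)   # odd indices carry the cosine terms
    return code

def add_position_marks(segments):
    """Add a position code to each segment; segments has shape (N, dim)."""
    n, dim = segments.shape
    codes = np.stack([sinusoidal_code(p, dim) for p in range(n)])
    return segments + codes

segments = np.zeros((8, 16))     # 8 segments of a 128-dimensional vector
marked = add_position_marks(segments)
print(marked.shape)
```

An alternative with the same intent would be to append the integer index 1 to N to each segment instead of summing a code into it.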

[0069] The estimate produced by the room acoustic impulse response estimation model can be biased by the lack of context at the ends of vector segments, or by abrupt changes and artifacts where the resulting impulse response vector segments are spliced. To avoid this, a partial overlap approach can be used when segmenting the feature vector corresponding to the speech signal, so that adjacent vector segments partially overlap after segmentation. Correspondingly, after the room acoustic impulse response estimation model estimates the impulse response for each vector segment, adjacent impulse response vector segments also overlap. Therefore, when splicing the resulting impulse response vector segments in the order of their corresponding position marks, the overlapping portions can be used for the splicing. The size of the overlapping portion can be set according to actual needs.
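A minimal sketch of overlapped segmentation and splicing follows. The linear cross-fade over the shared samples is an assumption: the document only states that the overlapping portion can be used during splicing. For simplicity the same array is both split and re-joined here, whereas in the described method the split applies to the feature vector and the splice to the estimated impulse response segments.

```python
import numpy as np

def split_with_overlap(x, seg_len, overlap):
    """Cut x into segments of seg_len whose neighbours share `overlap` samples."""
    hop = seg_len - overlap
    starts = range(0, len(x) - seg_len + 1, hop)
    return [x[s : s + seg_len] for s in starts]

def splice_with_crossfade(segments, overlap):
    """Join segments, linearly cross-fading over the shared `overlap` samples."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = segments[0].astype(float)
    for seg in segments[1:]:
        seg = seg.astype(float)
        blended = out[-overlap:] * (1.0 - fade_in) + seg[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], blended, seg[overlap:]])
    return out

x = np.arange(22, dtype=float)
parts = split_with_overlap(x, seg_len=10, overlap=4)
y = splice_with_crossfade(parts, overlap=4)
print(len(parts), np.allclose(x, y))
```

When neighbouring segments agree on the shared samples, the cross-fade reproduces them unchanged; when they disagree, it blends them gradually instead of producing a step at the boundary.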

[0070] It should be noted that the segmentation of the feature vector and the concatenation of the calculated impulse response vectors corresponding to each segment can also be implemented by the room acoustic impulse response estimation model. After obtaining the feature vector corresponding to the speech signal, the feature vector is directly input into the room acoustic impulse response estimation model, and the room acoustic impulse response corresponding to the speech signal is output.

[0071] The room acoustic impulse response estimation model can be a machine learning model, and its model architecture can be set according to actual needs. For example, it can be a neural network (NN) model, etc.

[0072] For example, in one implementation, the machine learning model can be a generative neural network. The network architecture of the generative neural network can be configured according to actual needs. For example, as shown in Figure 4, the generative neural network includes three Long Short-Term Memory (LSTM) layers with Tanh activation.

[0073] Figure 5 shows a training method for a room acoustic impulse response estimation model, which may include the following steps.

[0074] S510. Extract acoustic environment information related to the transmission environment of the speech signal samples from the speech signal samples in the speech signal sample set, and generate feature vectors corresponding to the speech signals based on the acoustic environment information; wherein, the speech signal sample set includes speech signal samples and room acoustic impulse responses corresponding to the speech signal samples as sample labels;

[0075] S520. The feature vector corresponding to the speech signal sample is segmented to obtain multiple vector segments corresponding to the speech signal sample.

[0076] S530. Input multiple vector segments corresponding to the speech signal sample into the room acoustic impulse response estimation model, perform room acoustic impulse response estimation, output the impulse response vector segment corresponding to each vector segment, and splice the impulse response vector segments to obtain the room acoustic impulse response corresponding to the speech signal sample.

[0077] S540. Based on the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response as a label, obtain the model loss, and train the room acoustic impulse response estimation model based on the model loss.

[0078] The room acoustic impulse response estimation model can be trained based on pre-acquired speech signal samples from a speech signal sample set. This speech signal sample set may include a certain number of speech signal samples x(l), where l is the sample index. Inputting the speech signal sample x(l) into the encoder yields a feature vector z(l) corresponding to the speech signal sample x(l). This feature vector z(l) can serve as the training sample feature for that speech signal sample. Additionally, the speech signal sample set may also include the actual room acoustic impulse response h(l) corresponding to each speech signal sample, serving as the sample label for that speech signal sample.

[0079] As shown in Figure 4, an encoder can extract acoustic environment information related to the transmission environment of the speech signal samples from the speech signal samples in the speech signal sample set, and generate a feature vector z(l) corresponding to each speech signal sample based on the acoustic environment information.

[0080] Then, the generated feature vector z(l) corresponding to each speech signal sample x(l) is segmented to obtain multiple vector segments corresponding to each speech signal sample x(l). When segmenting the feature vector corresponding to the speech signal sample, a partial overlap method can be used, so that there is partial overlap between adjacent vector segments in the multiple vector segments obtained after segmentation.

[0081] The obtained vector slices are repeatedly input into the room acoustic impulse response estimation model. Each vector slice can share the model parameters of the room acoustic impulse response estimation model, and the impulse response vector slices corresponding to each vector slice can be calculated. Then, by concatenating the various impulse response vector slices, the room acoustic impulse response corresponding to the speech signal sample x(l) can be obtained.
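The idea of reusing one set of model parameters for every vector slice and then concatenating the outputs can be illustrated with a toy example. The linear map below stands in for the neural network purely for illustration; `make_shared_estimator`, `estimate_rir`, and the weight matrix are hypothetical names and values, not part of the source.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def make_shared_estimator(weights: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], Vector]:
    """Return a toy linear 'model'; the same weights serve every slice."""
    def estimate(vector_slice: Sequence[float]) -> Vector:
        # One output sample per weight row (a plain matrix-vector product).
        return [sum(w * x for w, x in zip(row, vector_slice)) for row in weights]
    return estimate


def estimate_rir(slices: Sequence[Sequence[float]],
                 estimate: Callable[[Sequence[float]], Vector]) -> Vector:
    """Run the shared model on each slice in order and concatenate the results."""
    rir: Vector = []
    for vector_slice in slices:
        rir.extend(estimate(vector_slice))
    return rir


# Toy weights mapping a 2-element slice to a 3-sample impulse response slice.
shared_model = make_shared_estimator([[1, 0], [0, 1], [1, 1]])
rir = estimate_rir([[1, 2], [3, 4]], shared_model)
```

Because only one weight matrix exists, the parameter count does not grow with the number of slices, which is the point the paragraph above makes about parameter sharing.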

[0082] To represent the positional relationships between the vector slices, a positional identifier can be added to each vector slice through positional encoding before inputting them into the room acoustic impulse response estimation model. Then, the vector slices containing the corresponding positional identifiers are repeatedly input into the room acoustic impulse response estimation model to calculate the impulse response vector slices containing positional identifiers corresponding to each vector slice. Finally, by concatenating the various impulse response vector slices, the room acoustic impulse response corresponding to the speech signal sample x(l) can be obtained.
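The text does not specify which positional encoding is used. As one possible illustration, the sketch below adds a sinusoidal encoding (the scheme popularized by Transformer models) to each vector slice; the choice of encoding, the base 10000, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def positional_encoding(position: int, dim: int) -> List[float]:
    """Sinusoidal code for one slice index: sin on even dims, cos on odd dims."""
    code = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code


def add_position(slices: Sequence[Sequence[float]]) -> List[List[float]]:
    """Add the positional code of slice k element-wise to slice k."""
    return [
        [x + p for x, p in zip(s, positional_encoding(k, len(s)))]
        for k, s in enumerate(slices)
    ]


# Two identical slices become distinguishable once their positions are encoded.
tagged = add_position([[0.0, 0.0], [0.0, 0.0]])
```

Since every slice passes through the same shared model, some form of position information is what lets the model produce a different impulse response slice for each position.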

[0083] Finally, based on the model output of the room acoustic impulse response estimation model, combined with the sample label h(l) of the speech signal sample, the model loss used to train the room acoustic impulse response estimation model can be obtained, so as to adjust the model parameters in the room acoustic impulse response estimation model and enter the next round of training.

[0084] The model loss used to train the room acoustic impulse response estimation model may include a first model loss and a second model loss. The first model loss is represented by the difference between the room acoustic impulse response output by the room acoustic impulse response estimation model and the corresponding room acoustic impulse response h(l) used as the sample label; the second model loss is represented by the squared difference between the room acoustic impulse response output by the room acoustic impulse response estimation model and the corresponding room acoustic impulse response h(l) used as the sample label.

[0085] Wherein, the first model loss L1 can be expressed as the signal-to-distortion ratio (SDR), and its corresponding loss function can be expressed as follows:

[0086]

[0087] where E[·] and ||·||_2 denote the expectation and the L2 norm, respectively.
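The SDR formula itself is not reproduced in the text above. For orientation only, the sketch below uses a common definition, SDR = 10·log10(||h||² / ||h − ĥ||²), and takes the negative batch mean as the loss so that a better estimate gives a lower loss. The exact form used in the patent may differ; `h_hat`, `eps`, and the function names are assumptions.

```python
import math
from typing import Sequence


def sdr_db(h: Sequence[float], h_hat: Sequence[float], eps: float = 1e-12) -> float:
    """Signal-to-distortion ratio in dB between label h and estimate h_hat."""
    signal = sum(a * a for a in h)                            # ||h||^2
    distortion = sum((a - b) ** 2 for a, b in zip(h, h_hat))  # ||h - h_hat||^2
    return 10.0 * math.log10((signal + eps) / (distortion + eps))


def sdr_loss(batch_h: Sequence[Sequence[float]],
             batch_h_hat: Sequence[Sequence[float]]) -> float:
    """Negative mean SDR over a batch (the mean approximates the expectation)."""
    values = [sdr_db(h, hh) for h, hh in zip(batch_h, batch_h_hat)]
    return -sum(values) / len(values)


# Hypothetical 3-sample responses: the closer estimate yields the lower loss.
label = [1.0, 0.0, 0.0]
loss_close = sdr_loss([label], [[0.9, 0.0, 0.0]])
loss_far = sdr_loss([label], [[0.5, 0.0, 0.0]])
```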

[0088] The second model loss L2 is calculated based on a differentiable direct-to-reverberant ratio (DRR), and its corresponding loss function can be expressed as follows:

[0089]

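[0090] where the DRR term is the DRR formula evaluated for Y(l) = h(l) or for Y(l) equal to the room acoustic impulse response output by the model, and l_d denotes the boundary sample that separates the direct sound from the remainder of the RIR signal.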
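The DRR-based loss formula is likewise not reproduced in the text. The sketch below assumes the usual definition of DRR as the energy up to the boundary sample l_d divided by the energy after it, in dB, and takes the mean squared difference between the DRR of the estimate and the DRR of the label. Whether sample l_d itself counts as direct sound, and the names `drr_db` and `drr_loss`, are assumptions. Plain Python is used for readability; in training this would be written with differentiable tensor operations.

```python
import math
from typing import Sequence


def drr_db(y: Sequence[float], l_d: int, eps: float = 1e-12) -> float:
    """Direct-to-reverberant ratio in dB; samples 0..l_d count as direct sound."""
    direct = sum(v * v for v in y[: l_d + 1])
    reverberant = sum(v * v for v in y[l_d + 1:])
    return 10.0 * math.log10((direct + eps) / (reverberant + eps))


def drr_loss(batch_h: Sequence[Sequence[float]],
             batch_h_hat: Sequence[Sequence[float]],
             l_d: int) -> float:
    """Mean squared difference between the DRR of each estimate and its label."""
    errors = [(drr_db(hh, l_d) - drr_db(h, l_d)) ** 2
              for h, hh in zip(batch_h, batch_h_hat)]
    return sum(errors) / len(errors)


# Hypothetical 4-sample response with the direct path in the first two samples.
label = [1.0, 0.0, 0.1, 0.0]
perfect = drr_loss([label], [label], l_d=1)
mismatch = drr_loss([label], [[1.0, 0.0, 0.5, 0.0]], l_d=1)
```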

[0091] As shown in Figure 6, to facilitate the training of the aforementioned room acoustic impulse response estimation model, a discriminator model can also be introduced. The room acoustic impulse response estimation model can be a generative neural network within a generative adversarial network (GAN). The room acoustic impulse response estimation model and the discriminator model are trained adversarially using this adversarial learning method.

[0092] The input to the discrimination model can be the room acoustic impulse response output by the aforementioned room acoustic impulse response estimation model. The discriminant model is used to determine whether the input room acoustic impulse response is the room acoustic impulse response output by the room acoustic impulse response estimation model, or the room acoustic impulse response used as a sample label. For example, in one embodiment, the output of the discriminant model can be 1 or 0, respectively, to indicate whether the input room acoustic impulse response is the room acoustic impulse response output by the room acoustic impulse response estimation model, or the room acoustic impulse response used as a sample label.

[0093] The model structure of the discrimination model can be set according to actual needs. For example, in one embodiment, the discrimination model may include two residual convolutional groups (ResConvBlocks), a one-dimensional convolutional layer (Conv1d), and an average pooling layer (AvgPool).

[0094] In this case, the model loss used to train the room acoustic impulse response estimation model may also include a third model loss; the third model loss is represented by the discrimination error of a preset discrimination model and can be called adversarial loss.

[0095] For example, in one implementation, the discriminant model can be a classification model. As shown in Figure 6, the room acoustic impulse response output by the above room acoustic impulse response estimation model is input into the classification model, and the third model loss can be represented by the classification loss of the classification model, for example by the cross-entropy loss.

[0096] Furthermore, to improve the accuracy of the discrimination model's results, the feature vector z(l) corresponding to each speech signal sample can be used as an additional label and input into the discrimination model. In this case, the input to the discrimination model can include the feature vector z(l) input to the room acoustic impulse response estimation model, the room acoustic impulse response output by the room acoustic impulse response estimation model, and the room acoustic impulse response h(l) used as a sample label. The discrimination model determines whether the input room acoustic impulse response is the one output by the room acoustic impulse response estimation model or the room acoustic impulse response h(l) used as a sample label. As shown in Figure 7, the discrimination model can concatenate the feature vector z(l) with its intermediate calculation results, and then pass the result through a one-dimensional convolutional layer and an average pooling layer to obtain the discrimination result.
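The conditioning step (concatenating z(l) with the intermediate results, then a one-dimensional convolution and average pooling) can be sketched in a heavily simplified form. The sketch assumes the intermediate result is a channels-by-frames array, that z is tiled along time and stacked as extra channels, that the convolution has kernel size 1, and that a sigmoid maps the pooled value to a score. None of these details come from the source, and `discriminate` is a hypothetical name.

```python
import math
from typing import List, Sequence


def discriminate(features: Sequence[Sequence[float]],
                 z: Sequence[float],
                 weights: Sequence[float],
                 bias: float = 0.0) -> float:
    """Score in (0, 1) from intermediate features conditioned on z.

    features: C channels x T frames; z: K conditioning values;
    weights: C + K coefficients of a kernel-size-1 convolution.
    """
    n_frames = len(features[0])
    # Tile each element of z along time and stack it as an extra channel.
    stacked: List[List[float]] = [list(ch) for ch in features]
    stacked += [[zk] * n_frames for zk in z]
    # Kernel-size-1 Conv1d: weighted sum across channels at each frame.
    conv = [sum(w * ch[t] for w, ch in zip(weights, stacked)) + bias
            for t in range(n_frames)]
    pooled = sum(conv) / n_frames           # average pooling over time
    return 1.0 / (1.0 + math.exp(-pooled))  # squash to a (0, 1) score


# Toy input: one feature channel with two frames, one conditioning value.
score = discriminate(features=[[1.0, 3.0]], z=[2.0], weights=[1.0, -1.0])
```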

[0097] It should be noted that in the generative adversarial neural network described above, the optimization objective for the room acoustic impulse response estimation model is to make the room acoustic impulse response output by the model as close as possible to the room acoustic impulse response h(l) used as the sample label. The function of the discriminant model, in contrast, is to distinguish whether the input room acoustic impulse response is the one output by the room acoustic impulse response estimation model or the room acoustic impulse response h(l) used as a sample label; therefore, the optimization objective of the discriminant model is exactly the opposite of the optimization objective of the room acoustic impulse response estimation model.

[0098] For example, the optimization objective of the discrimination model can be to minimize the error of the discrimination result, while the optimization objective of the room acoustic impulse response estimation model is for the room acoustic impulse response it outputs to be close to the room acoustic impulse response h(l) used as the sample label, which is equivalent to requiring the discrimination result of the discrimination model to have the largest possible error.

[0099] As mentioned above, the loss function corresponding to the third model loss L3 used in training the room acoustic impulse response estimation model can be expressed as follows:

[0100]

[0101] where the first term denotes the discrimination result obtained by the discriminant model when its input is the room acoustic impulse response output by the room acoustic impulse response estimation model together with z(l), and D(h(l), z(l)) denotes the discrimination result obtained by the discriminant model when its input is h(l) together with z(l).

[0102] Accordingly, the loss function corresponding to the fourth model loss L4 used in training the discriminative model can be expressed as follows:

[0103]
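The formulas for L3 and L4 are not reproduced in the text. As a hedged illustration of how two opposing objectives of this kind are commonly written, the sketch below uses the standard binary cross-entropy GAN losses. It assumes the discriminator outputs a probability that its input is the labelled (real) response, which is a convention chosen here rather than one stated in the source; `d_real` and `d_fake` are hypothetical names for D(h(l), z(l)) and for the discriminator output on the estimated response.

```python
import math


def generator_adv_loss(d_fake: float, eps: float = 1e-12) -> float:
    """Adversarial loss for the estimator: low when D is fooled (d_fake near 1)."""
    return -math.log(d_fake + eps)


def discriminator_loss(d_real: float, d_fake: float, eps: float = 1e-12) -> float:
    """Loss for the discriminator: low when d_real is near 1 and d_fake near 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


# A confident, correct discriminator has a low loss of its own ...
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)
# ... while the estimator's adversarial loss is high in that same situation.
fooled = generator_adv_loss(d_fake=0.9)
caught = generator_adv_loss(d_fake=0.1)
```

The two losses pull `d_fake` in opposite directions, which is the opposition between the two optimization objectives described above.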

[0104] Based on the above embodiments, the loss function used to train the room acoustic impulse response estimation model is expressed by the following formula:

[0105] L=αL1+βL2+γL3

[0106] Wherein, L represents the model loss used to train the room acoustic impulse response estimation model; L1 represents the first model loss; L2 represents the second model loss; L3 represents the third model loss; and α, β, and γ represent the weighting coefficients corresponding to L1, L2, and L3, respectively. If γ is 0, it indicates that no discriminant model was introduced when training the room acoustic impulse response estimation model.

[0107] It should be noted that the specific values of the weighting coefficients of L1, L2 and L3 in the above loss function are not specifically limited in this specification. In practical applications, they can be fixed values or adjustable dynamic values.
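The weighted combination L = αL1 + βL2 + γL3 can be written directly. The linear warm-up of γ below is only one hypothetical example of an "adjustable dynamic value"; the source does not prescribe any particular schedule, and the numeric values are placeholders.

```python
def total_loss(l1: float, l2: float, l3: float = 0.0,
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.0) -> float:
    """L = alpha * L1 + beta * L2 + gamma * L3; gamma = 0 disables the adversarial term."""
    return alpha * l1 + beta * l2 + gamma * l3


def linear_warmup(step: int, warmup_steps: int, gamma_max: float) -> float:
    """Hypothetical dynamic weight: ramp gamma from 0 to gamma_max, then hold."""
    if warmup_steps <= 0:
        return gamma_max
    return gamma_max * min(1.0, step / warmup_steps)


# Without a discriminator (gamma = 0) only L1 and L2 contribute.
without_gan = total_loss(1.0, 2.0, 3.0, alpha=1.0, beta=0.5, gamma=0.0)
# Halfway through a 10-step warm-up the adversarial term has half its final weight.
with_gan = total_loss(1.0, 2.0, 3.0, alpha=1.0, beta=0.5,
                      gamma=linear_warmup(5, 10, gamma_max=2.0))
```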

[0108] When training the room acoustic impulse response estimation model, the encoder and the room acoustic impulse response estimation model described above can together be used as the generator, the discrimination model described above can be used as the discriminator, and adversarial training can be performed based on the speech signal sample set.

[0109] The encoder can be a pre-trained neural network; the encoder can also be trained synchronously with the room acoustic impulse response estimation model using the model loss L described above during the adversarial training process described above.

[0110] For example, as shown in Figure 7, a speech signal sample x(l) from a speech signal sample set can be input into the encoder to output the feature vector z(l) corresponding to each speech signal sample x(l). The feature vector z(l) is segmented to obtain multiple vector segments, and a position identifier is added to each vector segment through position encoding. Each vector segment containing the position identifier is sequentially and repeatedly input into the room acoustic impulse response estimation model to obtain the impulse response vector segments corresponding to each vector segment. Finally, by concatenating the impulse response vector segments, the room acoustic impulse response corresponding to the speech signal sample x(l) is obtained. Then, the feature vector z(l), the room acoustic impulse response output by the room acoustic impulse response estimation model, and the room acoustic impulse response h(l) serving as the sample label are input into the discriminant model. Based on the discrimination results of the discriminant model, adversarial training is performed on the encoder, the room acoustic impulse response estimation model, and the discriminant model.

[0111] In the adversarial training process, the model loss used by the encoder and the room acoustic impulse response estimation model can be the model loss L as described above; the model loss used by the discriminator model can be the fourth model loss L4 as described above.

[0112] After training is complete, the discriminant model can be removed, and the trained encoder and room acoustic impulse response estimation model can be deployed. The speech signal acquired from the speech acquisition device is input into the trained encoder to generate a feature vector corresponding to the speech signal. The feature vector is then segmented to obtain multiple vector segments. These multiple vector segments, each containing a position identifier, are sequentially and repeatedly input into the trained room acoustic impulse response estimation model to obtain impulse response vector segments corresponding to each vector segment. Finally, by concatenating these impulse response vector segments, the room acoustic impulse response corresponding to the acquired speech signal is obtained and output.
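The deployment flow described above (encode, segment, tag with position, estimate per segment, concatenate) can be sketched end to end. The encoder and model below are toy stand-ins rather than trained networks, and the position identifier is simply the segment index appended to the segment, which is a simplification. All names and parameters are assumptions.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def estimate_room_ir(speech: Sequence[float],
                     encoder: Callable[[Sequence[float]], Sequence[float]],
                     model: Callable[[Vector], Vector],
                     seg_len: int,
                     hop: int) -> Vector:
    """Encode -> segment (with overlap) -> tag position -> estimate -> concatenate."""
    z = list(encoder(speech))
    rir: Vector = []
    start, index = 0, 0
    while start < len(z):
        seg = z[start:start + seg_len]
        seg += [0.0] * (seg_len - len(seg))      # zero-pad a short tail
        rir.extend(model(seg + [float(index)]))  # same model for every segment
        if start + seg_len >= len(z):
            break
        start += hop
        index += 1
    return rir


def toy_encoder(x: Sequence[float]) -> Vector:
    """Stand-in for the trained encoder."""
    return [abs(v) for v in x]


def toy_model(seg: Vector) -> Vector:
    """Stand-in for the trained estimation model; the last element is the position."""
    return [sum(seg[:-1]), seg[-1]]


rir = estimate_room_ir([1, -2, 3, -4, 5, -6], toy_encoder, toy_model, seg_len=4, hop=2)
```

No discriminator appears in this flow, which matches the statement that the discriminant model is removed after training.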

[0113] As can be seen from the technical solutions of the above embodiments, the embodiments of this application use a segmented room acoustic impulse response estimation model. First, the feature vectors corresponding to the speech signal are segmented. The room acoustic impulse response estimation model blindly estimates the impulse response segments corresponding to each vector segment. Then, the segments are spliced together to obtain the complete room acoustic impulse response. This utilizes the powerful modeling capabilities of neural networks to estimate the accurate room acoustic impulse response. Furthermore, through the segmented calculation process, each vector segment can share the network parameters in the room acoustic impulse response estimation model, which is equivalent to generating multiple sub-band networks with independent outputs. This greatly reduces the complexity and computational difficulty of the model, and approximates the real room acoustic impulse response to the greatest extent.

[0114] Corresponding to the aforementioned embodiments of the method for estimating the acoustic impulse response of a room, this application also provides embodiments of a device for estimating the acoustic impulse response of a room.

[0115] As shown in Figure 8, the room acoustic impulse response estimation device includes: a voice acquisition module 801, a segmentation processing module 802, a model estimation module 803, and a result output module 804.

[0116] The voice acquisition module 801 is used to extract acoustic environment information related to the transmission environment of the voice signal from the voice acquisition device, and generate a feature vector corresponding to the voice signal based on the acoustic environment information.

[0117] The segmentation processing module 802 is used to segment the feature vector to obtain multiple vector segments.

[0118] The model estimation module 803 is used to sequentially input the multiple vector slices into a pre-trained room acoustic impulse response estimation model to estimate the room acoustic impulse response, so as to obtain impulse response vector slices corresponding to each vector slice; wherein, the impulse response vector is used to represent the acoustic characteristics of the speech signal transmitted from the sound source to the speech acquisition hardware; the room acoustic impulse response estimation model is a machine learning model trained with the feature vector of the speech signal sample as the training sample feature and the room acoustic impulse response corresponding to the speech signal sample as the sample label;

[0119] The result output module 804 is used to splice together the various impulse response segments to obtain the room acoustic impulse response corresponding to the speech signal.

[0120] Optionally, the vector slices include an identifier for representing the order in which the vector slices are arranged in the feature vector.

[0121] Optionally, there may be partial overlap between adjacent vector slices among the plurality of vector slices.

[0122] Corresponding to the aforementioned embodiments of the training method for the room acoustic impulse response estimation model, this application also provides embodiments of the training apparatus for the room acoustic impulse response estimation model.

[0123] As shown in Figure 9, the training device for the room acoustic impulse response estimation model may include: a sample processing module 901, a feature segmentation module 902, a model calculation module 903, and a model training module 904.

[0124] The sample processing module 901 is used to extract acoustic environment information related to the transmission environment of the speech signal samples from the speech signal samples in the speech signal sample set, and generate feature vectors corresponding to the speech signals based on the acoustic environment information; wherein, the speech signal sample set includes speech signal samples and room acoustic impulse responses corresponding to the speech signal samples as sample labels;

[0125] The feature segmentation module 902 is used to segment the feature vector corresponding to the speech signal sample to obtain multiple vector segments corresponding to the speech signal sample.

[0126] The model calculation module 903 is used to input multiple vector slices corresponding to the speech signal sample into the room acoustic impulse response estimation model, perform room acoustic impulse response estimation, output impulse response vector slices corresponding to each vector slice, and splice the impulse response vector slices to obtain the room acoustic impulse response corresponding to the speech signal sample.

[0127] The model training module 904 is used to obtain the model loss based on the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response as a label, and to train the room acoustic impulse response estimation model based on the model loss.

[0128] Optionally, the model loss includes a first model loss and a second model loss;

[0129] The first model loss is represented by the difference between the room acoustic impulse response output by the room acoustic impulse response estimation model and the corresponding room acoustic impulse response used as sample labels;

[0130] The second model loss is represented by the squared difference between the room acoustic impulse response output by the room acoustic impulse response estimation model and the corresponding room acoustic impulse response used as sample labels.

[0131] Optionally, the room acoustic impulse response estimation model is a generative neural network in a generative adversarial neural network;

[0132] The model loss also includes a third model loss; the third model loss is represented by the discrimination error of a preset discrimination model; wherein, the discrimination model is used to take the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response used as sample labels as input, and determine whether the input room acoustic impulse response is the output of the room acoustic impulse response estimation model or the room acoustic impulse response used as sample labels;

[0133] The model training module 904 is used to perform adversarial learning training on the room acoustic impulse response estimation model and the discriminant model based on the model loss in an adversarial learning training manner.

[0134] Optionally, the discrimination model is used to take the feature vector input to the room acoustic impulse response estimation model, the output room acoustic impulse response, and the room acoustic impulse response used as sample labels as inputs, and to determine whether the input room acoustic impulse response is the output of the room acoustic impulse response estimation model or the room acoustic impulse response used as sample labels.

[0135] Optionally, the loss function used to train the room acoustic impulse response estimation model is expressed by the following formula:

[0136] L=αL1+βL2+γL3

[0137] Wherein, L represents the model loss used to train the room acoustic impulse response estimation model; L1 represents the first model loss; L2 represents the second model loss; L3 represents the third model loss; and α, β, and γ represent the weighting coefficients corresponding to L1, L2, and L3, respectively.

[0138] Optionally, the vector slices include an identifier for representing the order in which the vector slices are arranged in the feature vector.

[0139] Optionally, there may be partial overlap between adjacent vector slices among the plurality of vector slices.

[0140] In the above technical solution, a segmented room acoustic impulse response estimation model is used. First, the feature vectors corresponding to the speech signal are segmented. The room acoustic impulse response estimation model blindly estimates the impulse response segments corresponding to each vector segment. Then, the segments are spliced together to obtain the complete room acoustic impulse response. This utilizes the powerful modeling capabilities of neural networks to estimate the accurate room acoustic impulse response. Furthermore, the segmented calculation process allows each vector segment to share the network parameters in the room acoustic impulse response estimation model, which is equivalent to generating multiple sub-band networks with independent outputs. This greatly reduces the complexity and computational difficulty of the model, and approximates the real room acoustic impulse response to the greatest extent possible.

[0141] Figure 10 is a schematic structural diagram of a device provided in an exemplary embodiment. Referring to Figure 10, at the hardware level, the device includes a processor 102, an internal bus 104, a network interface 106, memory 108, and non-volatile memory 110, and may also include other necessary hardware. One or more embodiments of this specification can be implemented in software; for example, the processor 102 reads the corresponding computer program from the non-volatile memory 110 into memory 108 and then runs it. Of course, in addition to software implementation, one or more embodiments of this specification do not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the processing flow is not limited to individual logic units, but can also be hardware or logic devices.

[0142] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer, which can take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email sending and receiving device, game console, tablet computer, wearable device, or any combination of these devices.

[0143] In a typical configuration, a computer includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0144] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0145] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, quantum memory, graphene-based storage media, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0146] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.

[0147] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0148] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0149] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of one or more embodiments of this specification. The singular forms “a,” “said,” and “the” used in one or more embodiments of this specification and in the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more associated listed items.

[0150] It should be understood that although the terms first, second, third, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information without departing from the scope of one or more embodiments of this specification, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to a determination."

[0151] The above description is merely a preferred embodiment of one or more embodiments of this specification and is not intended to limit the scope of one or more embodiments of this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this specification should be included within the protection scope of one or more embodiments of this specification.

Claims

1. A method of estimating the acoustic impulse response of a room, characterized in that it comprises: extracting, from an acquired speech signal, acoustic environment information related to the transmission environment of the speech signal, and generating a feature vector corresponding to the speech signal based on the acoustic environment information, wherein the feature vector is a feature vector obtained based on multiple kinds of acoustic environment information, or the feature vector is obtained by aggregating multiple feature vectors obtained respectively from the various kinds of acoustic environment information; segmenting the feature vector to obtain multiple vector slices; sequentially inputting the multiple vector slices into a room acoustic impulse response estimation model to perform room acoustic impulse response estimation, so as to obtain the impulse response vector slices corresponding to each vector slice, wherein the room acoustic impulse response estimation model is a machine learning model trained with the feature vector of a speech signal sample as the training sample feature and the room acoustic impulse response corresponding to the speech signal sample as the sample label; and splicing the impulse response vector slices together to obtain the room acoustic impulse response corresponding to the speech signal, wherein the room acoustic impulse response is used to represent the acoustic characteristics of the speech signal transmitted from the sound source to the speech acquisition hardware.

2. The method of claim 1, wherein the vector slices include an identifier that indicates the order in which the vector slices are arranged in the feature vector.

3. The method of claim 1, wherein there is partial overlap between adjacent vector slices among the multiple vector slices.

4. A method of training a room acoustic impulse response estimation model, characterized in that it comprises: extracting, from the speech signal samples in a speech signal sample set, acoustic environment information related to the transmission environment of the speech signal samples, and generating feature vectors corresponding to the speech signal samples based on the acoustic environment information, wherein the speech signal sample set includes speech signal samples and room acoustic impulse responses corresponding to the speech signal samples as sample labels, and the feature vector is a feature vector obtained based on multiple kinds of acoustic environment information, or the feature vector is obtained by aggregating multiple feature vectors obtained respectively from the various kinds of acoustic environment information; segmenting the feature vector corresponding to the speech signal sample to obtain multiple vector slices corresponding to the speech signal sample; sequentially inputting the multiple vector slices corresponding to the speech signal sample into the room acoustic impulse response estimation model to perform room acoustic impulse response estimation, outputting the impulse response vector slices corresponding to each vector slice, and splicing the impulse response vector slices to obtain the room acoustic impulse response corresponding to the speech signal sample; and obtaining the model loss based on the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response used as a label, and training the room acoustic impulse response estimation model based on the model loss.

5. The method according to claim 4, characterized in that the model loss includes a first model loss and a second model loss; the first model loss is represented by the difference between the room acoustic impulse response output by the room acoustic impulse response estimation model and the corresponding room acoustic impulse response used as sample labels; and the second model loss is represented by the squared difference between the room acoustic impulse response output by the room acoustic impulse response estimation model and the corresponding room acoustic impulse response used as sample labels.

6. The method of claim 5, wherein the room acoustic impulse response estimation model is a generative neural network in a generative adversarial neural network; the model loss also includes a third model loss, and the third model loss is represented by the discrimination error of the discrimination model, wherein the discrimination model is used to take the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response used as sample labels as input, and to determine whether the input room acoustic impulse response is the output of the room acoustic impulse response estimation model or the room acoustic impulse response used as sample labels; and the training of the room acoustic impulse response estimation model based on the model loss includes: based on the model loss, training the room acoustic impulse response estimation model and the discrimination model adversarially using adversarial learning.

7. The method according to claim 6, characterized in that the discrimination model takes as input the feature vector input to the room acoustic impulse response estimation model, the room acoustic impulse response output by the room acoustic impulse response estimation model, and the room acoustic impulse response used as the sample label, and determines whether an input room acoustic impulse response is the output of the room acoustic impulse response estimation model or the room acoustic impulse response used as the sample label.

8. The method according to claim 6, characterized in that the loss function used to train the room acoustic impulse response estimation model is expressed by the following formula: ... wherein L represents the model loss used to train the room acoustic impulse response estimation model; L1 represents the first model loss; L2 represents the second model loss; L3 represents the third model loss; and the coefficients ... represent the weighting coefficients corresponding to L1, L2, and L3, respectively.
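The formula of claim 8 is not reproduced in this text. One form consistent with the definitions given there is a weighted sum of the three losses; the coefficient symbols λ1, λ2, and λ3 below are assumed names, since the original symbols are not available.

```latex
L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3
```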

9. The method according to claim 4, characterized in that each vector segment includes an identifier indicating the position of the vector segment within the feature vector.
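A minimal sketch of how an order identifier could be used: each segment is paired with its index, so that segments can be put back in order even if they are processed or received out of sequence. Representing the identifier as the first element of a tuple is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

IndexedSegment = Tuple[int, List[float]]


def split_with_ids(feature: Sequence[float], segment_len: int) -> List[IndexedSegment]:
    """Split a feature vector and tag each segment with its position index."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    starts = range(0, len(feature), segment_len)
    return [(idx, list(feature[s:s + segment_len])) for idx, s in enumerate(starts)]


def concatenate_in_order(segments: Sequence[IndexedSegment]) -> List[float]:
    """Sort segments by their identifier, then concatenate them."""
    out: List[float] = []
    for _, segment in sorted(segments, key=lambda item: item[0]):
        out.extend(segment)
    return out


feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
indexed = split_with_ids(feature, 2)
shuffled = [indexed[2], indexed[0], indexed[1]]  # simulate out-of-order arrival
restored = concatenate_in_order(shuffled)
```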

10. The method according to claim 4, characterized in that adjacent vector segments among the multiple vector segments partially overlap.
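Partial overlap between adjacent segments can be obtained by advancing the segment start by a hop smaller than the segment length. The sketch below assumes a fixed segment length and hop; `split_with_overlap` and its parameters are hypothetical, and the claim does not specify how much the segments overlap.

```python
from typing import List, Sequence


def split_with_overlap(feature: Sequence[float],
                       segment_len: int,
                       hop: int) -> List[List[float]]:
    """Split into segments of segment_len whose starts are hop apart (hop < segment_len)."""
    if not 0 < hop < segment_len:
        raise ValueError("overlap requires 0 < hop < segment_len")
    if not feature:
        return []
    segments: List[List[float]] = []
    start = 0
    while True:
        segments.append(list(feature[start:start + segment_len]))
        if start + segment_len >= len(feature):
            break  # the last segment reached the end of the feature vector
        start += hop
    return segments


feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
segments = split_with_overlap(feature, segment_len=4, hop=2)
```

With a segment length of 4 and a hop of 2, each segment shares its last two values with the first two values of the next segment.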

11. An apparatus for estimating an acoustic impulse response of a room, characterized by comprising: a speech acquisition module, used to extract acoustic environment information related to the transmission environment of an acquired speech signal, and to generate a feature vector corresponding to the speech signal based on the acoustic environment information, wherein the feature vector is either obtained from multiple kinds of acoustic environment information, or obtained by summing multiple feature vectors each obtained from a respective kind of acoustic environment information; a segmentation module, used to segment the feature vector to obtain multiple vector segments; a model estimation module, used to sequentially input the multiple vector segments into a room acoustic impulse response estimation model to estimate the room acoustic impulse response, so as to obtain the impulse response vector segment corresponding to each vector segment, wherein the room acoustic impulse response estimation model is a machine learning model trained with the feature vectors of speech signal samples as training sample features and the room acoustic impulse responses corresponding to the speech signal samples as sample labels; and a result output module, used to concatenate the impulse response vector segments to obtain the room acoustic impulse response corresponding to the speech signal, wherein the room acoustic impulse response represents the acoustic characteristics of the speech signal transmitted from the sound source to the speech acquisition module.

12. A training device for a room acoustic impulse response estimation model, characterized by comprising: a sample processing module, used to extract, from speech signal samples in a speech signal sample set, acoustic environment information related to the transmission environment of the speech signal samples, and to generate feature vectors corresponding to the speech signal samples based on the acoustic environment information; wherein the speech signal sample set includes the speech signal samples and, as sample labels, the room acoustic impulse responses corresponding to the speech signal samples, and each feature vector is either obtained from multiple kinds of acoustic environment information, or obtained by summing multiple feature vectors each obtained from a respective kind of acoustic environment information; a feature segmentation module, used to segment the feature vector corresponding to a speech signal sample to obtain multiple vector segments corresponding to the speech signal sample; a model calculation module, used to sequentially input the multiple vector segments corresponding to the speech signal sample into the room acoustic impulse response estimation model to perform room acoustic impulse response estimation, output the impulse response vector segment corresponding to each vector segment, and concatenate the impulse response vector segments to obtain the room acoustic impulse response corresponding to the speech signal sample; and a model training module, used to obtain a model loss based on the room acoustic impulse response output by the room acoustic impulse response estimation model and the room acoustic impulse response used as the sample label, and to train the room acoustic impulse response estimation model based on the model loss.

13. An electronic device, comprising: a processor; and a memory used to store processor-executable instructions; wherein the processor implements the method according to any one of claims 1-10 by executing the executable instructions.

14. A computer-readable storage medium, characterized in that it stores computer instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1-10.