Audio processing method and apparatus, model training method and apparatus, and electronic device

By performing spectral processing and model training on the audio recorded by the rotating device, and using a noise filtering model to process the audio recorded by the rotating device, the noise problem caused by the Doppler effect during rotation is solved, and a more accurate noise reduction effect is achieved.

WO2026130538A1 · PCT designated stage · Publication Date: 2026-06-25 · BEIJING CO WHEELS TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
BEIJING CO WHEELS TECH CO LTD
Filing Date
2025-12-19
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

The rotating sound receiver produces significant noise in the recorded audio due to the Doppler effect during rotation, and existing technologies struggle to accurately acquire the relative motion trajectory to effectively eliminate the noise.

Method used

The target audio recorded by the rotating device, the device rotation speed, and the ambient background audio are acquired. Spectrograms are extracted from the target audio and the ambient background audio and input, together with the rotation speed, into the noise filtering model. The model processes them with a feature extraction module, an attention mechanism module, and a correction module to generate a denoised spectrogram, which is then converted into denoised audio.

Benefits of technology

It achieves more precise noise reduction for audio recorded from rotating devices, improves noise cancellation, and reduces frequency shift noise caused by the Doppler effect.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to the technical field of computer processing. Provided are an audio processing method and apparatus, a model training method and apparatus, and an electronic device. The audio processing method comprises: acquiring target audio recorded by means of a rotating device, a device rotation speed and environmental background audio; separately performing spectrum extraction processing on the target audio and the environmental background audio, so as to obtain a target spectrogram and a background audio spectrogram; inputting the target spectrogram, the device rotation speed and the background audio spectrogram into a noise filtering model, so as to obtain a noise-reduced spectrogram, wherein the noise filtering model is used for performing noise-filtering processing on the target spectrogram on the basis of the device rotation speed and the background audio spectrogram and outputting the processed noise-reduced spectrogram; and performing audio conversion processing on the noise-reduced spectrogram, so as to obtain noise-reduced audio corresponding to the target audio. By means of the present application, more accurate noise reduction can be implemented, thereby improving the effect of noise cancellation for audio recorded by means of a rotating device.

Description

Audio processing methods, model training methods, devices and electronic equipment

[0001] Cross-reference to related applications

[0002] This application claims priority to Chinese Patent Application No. 202411897517.5, filed on December 20, 2024, entitled "Audio Processing Method, Model Training Method, Apparatus and Electronic Device", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of computer processing technology, and in particular to an audio processing method, a model training method, an apparatus, and an electronic device.

Background Technology

[0004] The Doppler effect refers to the change in the frequency of the wave received by an observer when there is relative motion between the wave source and the observer, thus producing noise. Therefore, when there is relative motion between the sound receiver and the sound source device, the sound receiver will receive more noise in the audio due to the Doppler effect.

[0005] In particular, for rotating sound receivers, the faster the position changes and the faster the rotation speed, the more drastic the changes in the relative distance and direction of the sound receiver and the sound source device will be. As a result, the sound receiver will receive sound waves with large frequency shifts due to the severe Doppler effect, leading to a problem of high noise in the audio recorded by the sound receiver.
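The relationship described above between rotation speed and frequency shift can be illustrated with a minimal sketch. It is not taken from the application: the geometry (a receiver on an arm rotating about the origin, a stationary source on the x-axis outside the circle), the function names, and the numeric values are all assumptions chosen for illustration. It uses the standard moving-observer Doppler formula f_obs = f_src * (1 + v / c), where v is the receiver's speed component toward the source.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C


def observed_frequency(f_src, rpm, radius, source_dist, t):
    """Frequency heard at time t by a receiver on a rotating arm.

    Assumed geometry: the arm rotates about the origin, the stationary
    source sits at (source_dist, 0), and source_dist > radius.
    """
    omega = 2.0 * math.pi * rpm / 60.0  # angular speed in rad/s
    px = radius * math.cos(omega * t)
    py = radius * math.sin(omega * t)
    vx = -radius * omega * math.sin(omega * t)
    vy = radius * omega * math.cos(omega * t)
    dx, dy = source_dist - px, -py  # vector from receiver to source
    v_toward = (vx * dx + vy * dy) / math.hypot(dx, dy)
    return f_src * (1.0 + v_toward / SPEED_OF_SOUND)


def max_shift(f_src, rpm, radius, source_dist, steps=10000):
    """Largest absolute frequency shift over one full revolution."""
    period = 60.0 / rpm
    return max(
        abs(observed_frequency(f_src, rpm, radius, source_dist, period * k / steps) - f_src)
        for k in range(steps)
    )


if __name__ == "__main__":
    for rpm in (60, 120, 240):
        print(rpm, "rpm ->", round(max_shift(1000.0, rpm, 0.5, 3.0), 2), "Hz")
```

Under these assumptions the peak shift is f_src * radius * omega / c, so doubling the rotation speed doubles the peak frequency shift.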

[0006] Currently, noise in the audio recorded by the sound receiver is typically eliminated by collecting the relative motion trajectory between the sound receiver and the sound source device. However, collecting the relative motion trajectory between the sound receiver and the sound source device is difficult, making it hard to obtain accurate relative motion trajectory data, resulting in poor effectiveness of noise elimination based on relative motion trajectory.

Summary of the Invention

[0007] To overcome the problems existing in related technologies, this application provides an audio processing method, a model training method, an apparatus, and an electronic device.

[0008] According to a first aspect of the embodiments of this application, an audio processing method is provided, the method comprising:

[0009] The target audio recorded by the rotating device, the device rotation speed, and the ambient background audio are acquired. The ambient background audio refers to the background sound of the environment in which the rotating device is located when it is rotating.

[0010] Spectrum extraction processing is performed separately on the target audio and the ambient background audio to obtain a target spectrogram and a background audio spectrogram, respectively.

[0011] The target spectrogram, the device rotation speed, and the background audio spectrogram are input into the noise filtering model to obtain a noise-reduced spectrogram. The noise filtering model is used to perform noise filtering on the target spectrogram based on the device rotation speed and the background audio spectrogram and to output the processed noise-reduced spectrogram.

[0012] Audio conversion processing is performed on the noise-reduced spectrogram to obtain the noise-reduced audio corresponding to the target audio.

[0013] Optionally, the noise filtering model includes a feature extraction module, an attention mechanism module, and a correction module; the step of inputting the target spectrogram, the device rotation speed, and the background audio spectrogram into the noise filtering model to obtain the denoised spectrogram includes:

[0014] The target spectrogram and the background audio spectrogram are input into the feature extraction module, and fusion feature extraction is performed on the target spectrogram and the background audio spectrogram to obtain spectral feature data;

[0015] The spectral feature data and the device rotation speed are input into the attention mechanism module, and the spectral feature data is processed by channel enhancement or suppression according to the device rotation speed to obtain spectral feature adjustment data.

[0016] The spectral feature adjustment data is input into the correction module, and nonlinear mapping processing is performed on the spectral feature adjustment data to obtain the noise-reduced spectrogram.

[0017] Optionally, the step of fusing features from the target spectrogram and the background audio spectrogram to obtain spectral feature data includes:

[0018] The target spectrogram and the background audio spectrogram are input into the convolutional layer of the feature extraction module, and the target spectrogram and the background audio spectrogram are convolved to obtain the fused feature data of the target spectrogram and the background audio spectrogram.

[0019] The fused feature data is input into the normalization layer in the feature extraction module to perform normalization processing on the fused feature data, thereby obtaining spectral feature data.

[0020] Optionally, the step of performing channel enhancement or suppression processing on the spectral feature data according to the device rotation speed through the attention mechanism module to obtain spectral feature adjustment data includes:

[0021] The spectral feature data and the device rotation speed are input into the dot product attention layer in the attention mechanism module. The device rotation speed is used to weight the spectral feature data to obtain the spectral feature adjustment data.

[0022] Optionally, the step of performing nonlinear mapping processing on the spectral feature adjustment data through the correction module to obtain the noise-reduced spectrogram includes:

[0023] The spectral feature adjustment data is input into the rectified linear unit (ReLU) layer in the correction module, and the noise-reduced spectrogram is output.

[0024] Optionally, the noise filtering model is obtained by training a preset neural network on multiple pieces of training data. Each piece of training data includes sample audio, a sample rotation speed, and sample environmental background audio. The sample audio is audio of the sound source device recorded by the rotating device. Each sample audio is recorded by the rotating device at a different distance from the sound source device and/or at a different rotation speed. The sample rotation speed is the rotation speed of the rotating device when the sample audio is recorded. The sample environmental background audio is obtained by extracting the background sound from environmental audio. The environmental audio is audio of the sound source device recorded, while the rotating device rotates, by a reference recording device at a fixed distance from the sound source device.

[0025] According to a second aspect of the embodiments of this application, a model training method is provided, the method comprising:

[0026] Acquire multiple training data sets, each of which includes sample audio, sample rotation speed, and sample environmental background audio.

[0027] Spectrum extraction processing is performed on the sample audio and the sample environmental background audio in each piece of training data to obtain a sample spectrogram corresponding to the sample audio and a sample background audio spectrogram corresponding to the sample environmental background audio.

[0028] Each piece of training data is updated so that it includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio;

[0029] The preset neural network is trained on the multiple pieces of updated training data, outputting a predicted denoised spectrogram, until the loss function converges, resulting in a noise filtering model. This noise filtering model is used to perform noise filtering on the target spectrogram of audio recorded by the rotating device, based on the device rotation speed and the background audio spectrogram, and to output a processed denoised spectrogram. The denoised spectrogram undergoes audio conversion processing to obtain the denoised audio corresponding to the target audio.

[0030] The loss function represents the difference between the predicted denoised spectrogram output by the preset neural network model during training and the high-quality spectrogram corresponding to the sample spectrogram. The high-quality spectrogram is the spectrogram of the audio sample after denoising processing that meets the requirements.

[0031] The sample audio is audio of the sound source device recorded by the rotating device. Each sample audio is recorded by the rotating device at a different distance from the sound source device and/or at a different rotation speed. The sample rotation speed is the rotation speed of the rotating device when the sample audio is recorded. The sample ambient background audio is obtained by extracting the background sound from ambient audio. The ambient audio is audio of the sound source device recorded, while the rotating device rotates, by a reference recording device at a fixed distance from the sound source device.
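The loss function and the convergence condition of the training method can be sketched as follows. The application only states that the loss represents the difference between the predicted denoised spectrogram and the high-quality spectrogram; the choice of mean squared error, the convergence test, and the names `mse_loss` and `has_converged` are assumptions made for illustration.

```python
def mse_loss(predicted, reference):
    """Mean squared error between two spectrograms of identical shape.

    Each spectrogram is a list of frames; each frame is a list of bin values.
    """
    total, count = 0.0, 0
    for p_frame, r_frame in zip(predicted, reference):
        for p, r in zip(p_frame, r_frame):
            total += (p - r) ** 2
            count += 1
    return total / count


def has_converged(loss_history, tolerance=1e-4, patience=3):
    """True once the last `patience` loss changes are all below `tolerance`."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(a - b) < tolerance for a, b in zip(recent, recent[1:]))
```

A training loop would append the loss of each epoch to `loss_history` and stop once `has_converged` returns true. Other difference measures, such as mean absolute error, would fit the description equally well.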

[0032] According to a third aspect of the embodiments of this application, an audio processing apparatus is provided, the apparatus comprising:

[0033] The first acquisition module is used to acquire the target audio recorded by the rotating device, the device rotation speed, and the ambient background audio, wherein the ambient background audio refers to the background sound of the environment in which the rotating device is rotating.

[0034] The first spectrum extraction module is used to perform spectrum extraction processing on the target audio and the environmental background audio separately, to obtain the target spectrogram and the background audio spectrogram, respectively;

[0035] The model processing module is used to input the target spectrogram, the device rotation speed, and the background audio spectrogram into the noise filtering model to obtain the noise-reduced spectrogram. The noise filtering model is used to perform noise filtering processing on the target spectrogram based on the device rotation speed and the background audio spectrogram and to output the processed noise-reduced spectrogram.

[0036] The audio conversion module is used to perform audio conversion processing on the noise-reduced spectrogram to obtain the noise-reduced audio corresponding to the target audio.

[0037] According to a fourth aspect of the embodiments of this application, a model training apparatus is provided, the apparatus comprising:

[0038] The second acquisition module is used to acquire multiple training data, each of which includes sample audio, sample rotation speed and sample environmental background audio.

[0039] The second spectrum extraction module is used to perform spectrum extraction processing on the sample audio and the sample environmental background audio in each piece of training data to obtain the sample spectrogram corresponding to the sample audio and the sample background audio spectrogram corresponding to the sample environmental background audio.

[0040] The update module is used to update each of the training data, and each updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio.

[0041] The model training module is used to train a preset neural network on the multiple pieces of updated training data, outputting a predicted denoised spectrogram, until the loss function converges, thus obtaining a noise filtering model. This noise filtering model is used to perform noise filtering on the target spectrogram of audio recorded by the rotating device, based on the device rotation speed and the background audio spectrogram, and to output the processed denoised spectrogram. The denoised spectrogram undergoes audio conversion processing to obtain the denoised audio corresponding to the target audio.

[0042] The loss function represents the difference between the predicted denoised spectrogram output by the preset neural network model during training and the high-quality spectrogram corresponding to the sample spectrogram. The high-quality spectrogram is the spectrogram of the audio sample after denoising processing that meets the requirements.

[0043] The sample audio is audio of the sound source device recorded by the rotating device. Each sample audio is recorded by the rotating device at a different distance from the sound source device and/or at a different rotation speed. The sample rotation speed is the rotation speed of the rotating device when the sample audio is recorded. The sample ambient background audio is obtained by extracting the background sound from ambient audio. The ambient audio is audio of the sound source device recorded, while the rotating device rotates, by a reference recording device at a fixed distance from the sound source device.

[0044] According to a fifth aspect of the embodiments of this application, an electronic device is provided, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the audio processing method of the first aspect or the steps of the model training method of the second aspect.

[0045] According to a sixth aspect of the embodiments of this application, the embodiments of this application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the audio processing method of the first aspect or the steps of the model training method of the second aspect.

[0046] The technical solutions provided by the embodiments of this application bring at least the following beneficial effects:

[0047] The target audio recorded by the rotating device, the device rotation speed, and the ambient background audio are acquired, and spectrum extraction processing is performed on both the target audio and the ambient background audio to obtain a target spectrogram corresponding to the target audio and a background audio spectrogram corresponding to the ambient background audio. The target spectrogram, the device rotation speed, and the background audio spectrogram are then input into a noise filtering model to obtain a denoised spectrogram. Audio conversion processing is performed on the denoised spectrogram to obtain the denoised audio corresponding to the target audio, achieving effective noise reduction of the target audio. Moreover, the device rotation speed and the ambient background audio are easier to determine accurately than the relative motion trajectory between the sound receiver and the sound source device. Because the noise filtering model learns the complex relationship between the device rotation speed and the noise generated by factors such as the Doppler effect and the rotation of the device, noise filtering of the target audio recorded by the rotating device can be performed based on the device rotation speed and the ambient background audio, achieving more accurate noise reduction and improving the noise cancellation effect on audio recorded by rotating devices.

[0048] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0049] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application, and do not constitute an undue limitation of this application.

[0050] Figure 1 is a schematic diagram of the implementation environment of an audio processing method provided in an embodiment of this application;

[0051] Figure 2 is a flowchart of an audio processing method provided in an embodiment of this application;

[0052] Figure 3 is a schematic diagram of the noise filtering model provided in an embodiment of this application;

[0053] Figure 4 is a flowchart of another audio processing method provided in an embodiment of this application;

[0054] Figure 5 is a flowchart of a model training method provided in an embodiment of this application;

[0055] Figure 6 is a block diagram of an audio processing device provided in an embodiment of this application;

[0056] Figure 7 is a block diagram of a model training device provided in an embodiment of this application;

[0057] Figure 8 is a block diagram of an electronic device provided in an embodiment of this application.

Specific Implementation

[0058] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0059] Please refer to Figure 1, which illustrates a schematic diagram of the implementation environment for an audio processing method provided in this application embodiment. As shown in Figure 1, the implementation environment may include a processing device 10 and a rotating device 20. The rotating device 20 refers to a device capable of rotational movement, used to receive audio signals, i.e., for recording audio. Alternatively, a recording device may be installed on the rotating device 20 for recording audio. Figure 1 illustrates an example where a recording device 201 is installed on the rotating device 20.

[0060] The rotating device 20 and the processing device 10 can be connected via a wired or wireless connection. A wired connection means that the rotating device 20 and the processing device 10 can be connected via a data transmission cable. The rotating device 20 is used to transmit recorded audio to the processing device 10 via the data transmission cable. A wireless connection means that the rotating device 20 and the processing device 10 can be connected via a wireless network. The rotating device 20 is used to transmit recorded audio to the processing device 10 via a wireless network. Optionally, the wireless network can be Bluetooth, Wi-Fi, or various communication networks.

[0061] In one alternative scenario, the rotating device 20 can be a robotic arm 20 on a device such as a robot or engineering machine. A recording device 201 is mounted on the robotic arm 20. The recording device 201 can record ambient sound waves to generate audio during the stationary or rotating motion of the rotating device 20 and output it to the processing device 10. In another alternative scenario, the rotating device 20 can be a vehicle wheel 20. A recording device 201 is mounted on the wheel 20. The recording device 201 can record ambient sound waves to generate audio during the rotation of the wheel 20 and output it to the processing device 10.

[0062] In some embodiments of this application, as shown in Figure 1, the implementation environment may further include a sound source device 30. The sound source device 30 is used to emit sound. The rotating device 20 can capture the sound from the sound source device 30 and record it as audio.

[0063] Further optionally, referring to Figure 1, the implementation environment may also include a reference recording device 40. The reference recording device 40 can be placed at a fixed position to capture the ambient background audio of the rotating device 20. Because the reference recording device 40 is at a fixed position, the Doppler noise in the audio it captures can be reduced to some extent.

[0064] Here, ambient background audio refers to the background sound of the environment in which the rotating device 20 is located when it rotates. For example, when the sound source device 30 is present, the reference recording device 40 can be set at a fixed distance from the sound source device 30 to prevent the reference recording device 40 from capturing audio with Doppler noise caused by the Doppler effect.

[0065] Please refer to Figure 2, which shows a flowchart of an audio processing method provided in an embodiment of this application. The audio processing method can be applied to the implementation environment shown in Figure 1 and executed by the processing device 10. Optionally, the processing device 10 can be an electronic device or a component within an electronic device. Optionally, the electronic device can be a mobile phone, tablet, or computer, etc. The component in the electronic device that executes the audio processing method can be a processor, microcontroller, etc. As shown in Figure 2, the audio processing method includes:

[0066] Step 201: Obtain the target audio recorded by the rotating device, the device rotation speed, and the ambient background audio.

[0067] In this embodiment, the ambient background audio refers to the audio of the background sound of the environment in which the rotating device is located when it rotates. Optionally, the environment in which the rotating device is located is relatively fixed. Correspondingly, the ambient background audio of the rotating device is also relatively fixed. Therefore, the ambient background audio of the rotating device can be pre-collected and stored. Based on this, the processing device can directly obtain the target audio recorded by the rotating device, the current rotation speed of the rotating device, and the pre-stored ambient background audio.

[0068] In another alternative scenario, the environment in which the rotating device is located may change. Accordingly, it is necessary to acquire the ambient background audio of the rotating device in real time during the recording of the target audio, so that the processing device 10 can obtain the target audio recorded by the rotating device, the current rotation speed of the rotating device, and the acquired ambient background audio.

[0069] It should be noted that the acquisition process for the ambient background audio is the same whether it is pre-acquired or acquired in real time. Optionally, a reference recording device fixed in the environment of the rotating device can record first ambient audio while the rotating device is rotating and send it to the processing device. The reference recording device can also record second ambient audio while the rotating device is not rotating and send it to the processing device. Upon receiving the first and second ambient audio, the processing device can perform background sound extraction using spectral subtraction, extracting from the first ambient audio the audio components that differ from the second ambient audio, thus obtaining the ambient background audio. The ambient background audio can therefore be understood, to some extent, as the audio components that differ between the first and second ambient audio.
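The background sound extraction by spectral subtraction can be sketched as follows. This is a minimal illustration, not the procedure of the application: it assumes both recordings have already been converted to magnitude spectrograms, that the second ambient audio is summarised by its per-bin average, and that negative results are clamped to a floor. The function names and sample values are hypothetical.

```python
def mean_spectrum(spectrogram):
    """Average magnitude per frequency bin over all frames."""
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0])
    return [sum(frame[b] for frame in spectrogram) / n_frames for b in range(n_bins)]


def spectral_subtraction(first_mag, second_mag, floor=0.0):
    """Subtract the average spectrum of second_mag from each frame of first_mag.

    first_mag:  magnitude spectrogram recorded while the device rotates
    second_mag: magnitude spectrogram recorded while the device is at rest
    Both are lists of frames; every frame is a list of per-bin magnitudes.
    Results below `floor` are clamped so magnitudes never go negative.
    """
    reference = mean_spectrum(second_mag)
    return [[max(m - r, floor) for m, r in zip(frame, reference)] for frame in first_mag]


# Toy data: the middle bin is much louder while the device rotates.
rotating = [[1.0, 4.0, 1.0], [1.2, 5.0, 0.8]]
at_rest = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
background = spectral_subtraction(rotating, at_rest)
```

In this toy example only the components that are stronger during rotation survive the subtraction, which matches the reading of the ambient background audio as the difference between the two recordings.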

[0070] Step 202: Perform spectrum extraction processing separately on the target audio and the ambient background audio to obtain the target spectrogram and the background audio spectrogram, respectively.

[0071] In this embodiment of the application, the processing device can perform spectrum extraction processing on the target audio to obtain the target spectrogram corresponding to the target audio, and perform spectrum extraction processing on the ambient background audio to obtain the background audio spectrogram corresponding to the ambient background audio.

[0072] In one optional implementation, the processing device may employ a target spectrum extraction algorithm to perform spectrum extraction processing on the target audio to obtain a target spectrogram, and then employ the same algorithm to perform spectrum extraction processing on the ambient background audio to obtain a background audio spectrogram. The target spectrum extraction algorithm may be a short-time Fourier transform (STFT), a Mel-frequency cepstral coefficient (MFCC) algorithm, or a linear predictive coding (LPC) algorithm, etc.

[0073] In another alternative implementation, the processing device can sequentially perform frame segmentation, windowing, and Fourier transform processing on the target audio to obtain the target spectrogram. Similarly, the processing device can sequentially perform frame segmentation, windowing, and Fourier transform processing on the ambient background audio to obtain the background audio spectrogram.
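The frame segmentation, windowing, and Fourier transform sequence can be sketched as follows. This is an illustrative sketch only: the frame length, hop size, Hann window, and function names are assumptions, and a naive discrete Fourier transform from the standard library stands in for the optimised FFT routine a real implementation would use.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive discrete Fourier transform, non-negative frequency bins only."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_len=64, hop=32):
    """Split samples into overlapping frames, window each, return magnitudes."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        frames.append([abs(c) for c in dft(frame)])
    return frames


# A pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2.0 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
```

Each row of `spec` is one time frame and each column one frequency bin; for the test tone the strongest bin in every frame is the one matching the tone frequency. The same function would be applied to the target audio and to the ambient background audio.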

[0074] Step 203: Input the target spectrogram, the device rotation speed, and the background audio spectrogram into the noise filtering model to obtain the noise-reduced spectrogram.

[0075] The noise filtering model is used to filter noise from the target spectrogram based on the device rotation speed and the background audio spectrogram, and outputs the processed noise-reduced spectrogram. The noise-reduced spectrogram can be converted back into noise-reduced audio, which is the target audio after noise filtering.
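The conversion of a noise-reduced spectrogram back into audio is not detailed in this passage. One common approach, shown here purely as an assumption, is an inverse Fourier transform of each frame followed by overlap-add. The sketch also assumes that the model outputs magnitudes only and that phases are reused from the target audio; `apply_phase`, `inverse_dft`, and `overlap_add` are hypothetical names.

```python
import cmath
import math


def apply_phase(magnitudes, phases):
    """Combine denoised magnitudes with phases (assumed taken from the target audio)."""
    return [
        [cmath.rect(m, p) for m, p in zip(mag_frame, phase_frame)]
        for mag_frame, phase_frame in zip(magnitudes, phases)
    ]


def inverse_dft(half_spectrum, frame_len):
    """Rebuild a real frame from its non-negative frequency bins."""
    # Restore the negative-frequency bins via conjugate symmetry.
    mirrored = [
        half_spectrum[k].conjugate()
        for k in range(frame_len - len(half_spectrum), 0, -1)
    ]
    full = list(half_spectrum) + mirrored
    return [
        sum(c * cmath.exp(2j * math.pi * k * i / frame_len) for k, c in enumerate(full)).real
        / frame_len
        for i in range(frame_len)
    ]


def overlap_add(complex_frames, frame_len=64, hop=32):
    """Inverse-transform each frame and sum the frames at their hop offsets."""
    out = [0.0] * (hop * (len(complex_frames) - 1) + frame_len)
    for idx, spectrum in enumerate(complex_frames):
        for i, value in enumerate(inverse_dft(spectrum, frame_len)):
            out[idx * hop + i] += value
    return out
```

With a periodic Hann analysis window and a hop of half the frame length, overlapping windows sum to one, so samples covered by two frames are reconstructed without further scaling. Other window and hop choices would need a normalisation step.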

[0076] Optionally, as shown in Figure 3, the noise filtering model 300 includes a feature extraction module 301, an attention mechanism module 302, and a correction module 303. The feature extraction module 301 performs fusion feature extraction on the target spectrogram and the background audio spectrogram to obtain spectral feature data. The attention mechanism module 302 performs channel enhancement or suppression processing on the spectral feature data according to the device rotation speed to obtain spectral feature adjustment data. The correction module 303 performs nonlinear mapping processing on the spectral feature adjustment data to obtain a denoised spectrogram.

[0077] Based on this, as shown in Figure 4, step 203, which inputs the target spectrogram, the device rotation speed, and the background audio spectrogram into the noise filtering model to obtain the noise-reduced spectrogram, may include:

[0078] Step 401: Input the target spectrogram and background audio spectrogram into the feature extraction module, perform fusion feature extraction on the target spectrogram and background audio spectrogram, and obtain spectral feature data.

[0079] In some embodiments of this application, fusion feature extraction can be understood as performing feature extraction on the target spectrogram to obtain a first spectral feature, and performing feature extraction on the background audio spectrogram to obtain a second spectral feature, and then merging the first and second spectral features to obtain spectral feature data with a more comprehensive feature representation. Optionally, the first spectral feature may include the spectral energy distribution feature, power spectrum feature, etc. of the target spectrogram. Similarly, the second spectral feature may include the spectral energy distribution feature, power spectrum feature, etc. of the background audio spectrogram. Here, the spectral energy distribution feature refers to the energy distribution feature at different frequencies in the spectrogram. The power spectrum feature refers to the power distribution feature at different frequencies in the spectrogram.

[0080] Optionally, the process by which the processing device extracts fused features from the target spectrogram and the background audio spectrogram to obtain spectral feature data may include: sequentially performing convolution and normalization processing on the target spectrogram and the background audio spectrogram to obtain spectral feature data. Specifically, the processing device may perform convolution processing on the target spectrogram and the background audio spectrogram to obtain fused feature data of the target spectrogram and the background audio spectrogram. Then, the fused feature data is normalized to obtain spectral feature data.

[0081] For example, as shown in Figure 3, the feature extraction module 301 may include a convolutional layer (CNN) 3011 and a normalization layer 3012. The convolutional layer 3011 is used to perform convolution processing on the input data. The normalization layer 3012 is used to normalize the input data. The normalization layer 3012 may be an instance normalization (IN) layer to accelerate model training and improve the model's generalization ability. Alternatively, the normalization layer 3012 may be a batch normalization (BN) layer.

[0082] Based on this, the processing device sequentially performs convolution and normalization processing on the target spectrogram and the background audio spectrogram to obtain spectral feature data. The process can be as follows: The processing device inputs the target spectrogram and the background audio spectrogram into the convolutional layer 3011 of the feature extraction module 301, performs convolution processing on the target spectrogram and the background audio spectrogram, and obtains fused feature data of the target spectrogram and the background audio spectrogram. The processing device then inputs the fused feature data into the normalization layer 3012 of the feature extraction module 301, performs normalization processing on the fused feature data, and obtains the spectral feature data. It should be noted that in some implementations, the specific structures of the convolutional layer 3011 and the normalization layer 3012 can be set according to actual needs.
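The convolution and normalization sequence described above can be sketched as follows. This is a minimal illustration, not the structure of layers 3011 and 3012 themselves: the function names `conv2d` and `instance_norm`, the 2x2 kernel weights, and the toy spectrogram values are all assumptions chosen for readability, and instance normalization is used because [0081] names it as one option.

```python
import math

def conv2d(inputs, kernels):
    """Valid 2-D convolution (cross-correlation, as in most deep-learning
    libraries). inputs: [in_ch][H][W]; kernels: [out_ch][in_ch][kh][kw]."""
    h, w = len(inputs[0]), len(inputs[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for kernel in kernels:
        fmap = [[0.0] * (w - kw + 1) for _ in range(h - kh + 1)]
        for i in range(h - kh + 1):
            for j in range(w - kw + 1):
                # Summing over input channels is what fuses the two spectrograms.
                fmap[i][j] = sum(
                    kernel[c][di][dj] * inputs[c][i + di][j + dj]
                    for c in range(len(inputs))
                    for di in range(kh)
                    for dj in range(kw)
                )
        out.append(fmap)
    return out

def instance_norm(fmaps, eps=1e-5):
    """Normalize each channel to zero mean and (nearly) unit variance."""
    normed = []
    for fmap in fmaps:
        values = [v for row in fmap for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normed.append([[(v - mean) * scale for v in row] for row in fmap])
    return normed

# Toy 3x4 magnitude spectrograms (frequency bins x frames); values are made up.
target_spec = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 7.0]]
background_spec = [[0.5, 0.5, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 2.0, 2.0]]

# One output channel mixing both inputs with a 2x2 kernel (arbitrary weights).
kernels = [[[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]]

fused = conv2d([target_spec, background_spec], kernels)  # fused feature data
spectral_features = instance_norm(fused)                 # spectral feature data
```

Treating the two spectrograms as two input channels of one convolution is one way to obtain fused features in a single step; a trained model would learn the kernel weights rather than fix them.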

[0083] Step 402: Input the spectral feature data and device rotation speed into the attention mechanism module, and perform channel enhancement or suppression processing on the spectral feature data according to the device rotation speed to obtain spectral feature adjustment data.

[0084] In this embodiment, the device rotation speed can be used as the key vector K and value V under the attention mechanism, and the spectral feature data can be used as the query vector Q. The attention mechanism module can use the attention mechanism to learn the correlation between the device rotation speed and the spectral feature data, thereby realizing the weighted summation processing of the spectral feature data according to the device rotation speed to eliminate the noise influence caused by the Doppler effect, the rotation of the rotating device itself, etc.

[0085] In some embodiments, channel enhancement or suppression processing of spectral feature data based on device rotation speed can be understood as follows: based on device rotation speed, suppressing the portion of channel data in spectral feature data that is associated with device rotation speed, and enhancing the portion of channel data that is weakly associated with device rotation speed, thereby reducing noise spectral data in spectral feature data and eliminating noise effects caused by Doppler effect, rotation of the rotating device itself, etc.

[0086] Optionally, as shown in Figure 3, the attention mechanism module 302 may include a dot-product attention layer 3021. Correspondingly, the process by which the processing device performs channel enhancement or suppression processing on the spectral feature data according to the device rotation speed through the attention mechanism module to obtain spectral feature adjustment data may include: inputting the spectral feature data and the device rotation speed into the dot-product attention layer 3021 in the attention mechanism module 302, and performing weighted processing on the spectral feature data according to the device rotation speed to obtain the spectral feature adjustment data.

[0087] In the dot-product attention layer 3021, the device rotation speed can be used as the key vector K and the value V, and the spectral feature data as the query vector Q. A dot product operation is performed on the key vector K (device rotation speed) and the query vector Q (spectral feature data) to calculate the similarity between them, generating an attention weight map, i.e., a weight tensor. The attention weight map includes the weight corresponding to each channel in the query vector Q; this weight represents the channel similarity. Then, the product of the attention weight map and the value V (device rotation speed) is calculated to obtain the attention feature map. Finally, the attention feature map and the query vector Q (spectral feature data) are added together to obtain the spectral feature adjustment data.
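The sequence of operations in the dot-product attention layer can be sketched as follows. The text does not specify tensor shapes, how the scalar rotation speed is turned into a vector, or how scores are normalized into weights, so the speed embedding, the scaling by the square root of the dimension, and the softmax over channels are all assumptions made for illustration; `speed_attention` is a hypothetical name.

```python
import math

def speed_attention(features, speed_key, speed_value):
    """features: one flattened vector per channel (the query Q).
    speed_key / speed_value: vectors derived from the rotation speed (K, V)."""
    dim = len(speed_key)
    # Dot product of K with each channel of Q gives one similarity per channel.
    scores = [sum(q * k for q, k in zip(channel, speed_key)) / math.sqrt(dim)
              for channel in features]
    # Softmax over channels (an assumption; the text only says "weight").
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weight map times V gives the attention feature map; add it back to Q.
    adjusted = [[q + w * v for q, v in zip(channel, speed_value)]
                for channel, w in zip(features, weights)]
    return weights, adjusted

rotation_speed = 2.0  # hypothetical rotation speed, revolutions per second
# Hypothetical fixed embedding of the scalar speed into a 3-element vector.
speed_vector = [rotation_speed, rotation_speed ** 2, 1.0]

features = [[0.1, 0.2, 0.3], [1.0, 0.5, -0.2], [-0.4, 0.0, 0.9]]
weights, adjusted = speed_attention(features, speed_vector, speed_vector)
```

Here the second channel has the largest dot product with the speed vector and therefore receives the largest weight. Whether a large weight ends up enhancing or suppressing a channel depends on the learned projections in a trained model.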

[0088] Step 403: Input the spectral feature adjustment data into the correction module, and perform nonlinear mapping processing on the spectral feature adjustment data to obtain the noise-reduced spectrogram.

[0089] Optionally, the correction module can perform nonlinear mapping processing on the spectral features according to the target mapping function to obtain a denoised spectrogram. By introducing nonlinear mapping processing into the model, the model can better capture complex and nonlinear relationships in the input data, thereby enabling the model to learn more complex feature representations and improving the accuracy of model processing. The target mapping function can be a Rectified Linear Unit (ReLU) activation function or a sigmoid activation function, etc.

[0090] In some embodiments of this application, as shown in Figure 3, the correction module 303 includes a rectified linear unit layer 3031. Based on this, the process by which the processing device performs nonlinear mapping processing on the spectral feature adjustment data to obtain a denoised spectrogram may include: inputting the spectral feature adjustment data into the rectified linear unit layer 3031 in the correction module 303, and outputting a denoised spectrogram. The rectified linear unit layer 3031 is used to call the ReLU activation function to perform nonlinear mapping processing on the spectral feature adjustment data to obtain the denoised spectrogram.
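The two target mapping functions named in [0089] can be illustrated with a short sketch. The helper name `correct` and the sample values are assumptions; the definitions of ReLU and sigmoid themselves are standard.

```python
import math

def relu(x):
    """Rectified linear unit: negative values are mapped to zero."""
    return x if x > 0.0 else 0.0

def sigmoid(x):
    """Logistic sigmoid: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def correct(adjusted, mapping=relu):
    """Apply the chosen mapping element-wise to the adjustment data."""
    return [[mapping(v) for v in row] for row in adjusted]

adjusted = [[-0.5, 0.2], [1.5, -2.0]]  # made-up spectral feature adjustment data
denoised = correct(adjusted)           # negative entries become 0.0
```

ReLU suits magnitude spectrograms because magnitudes cannot be negative. A sigmoid would be the natural choice if the output were interpreted as a mask between 0 and 1, which is an interpretation rather than something the text states.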

[0091] For example, as shown in Figure 3, the noise filtering model 300 includes a feature extraction module 301, an attention mechanism module 302, and a correction module 303. The feature extraction module 301 includes a convolutional layer 3011 and a normalization layer 3012. The attention mechanism module 302 includes a dot-product attention layer 3021. The correction module 303 includes a rectified linear unit layer 3031. Furthermore, the noise filtering model 300 also includes an input layer 304. The input layer 304 receives the target spectrogram, device rotation speed, and background audio spectrogram, and inputs these data to the convolutional layer 3011, causing the convolutional layer 3011 to output fused feature data of the target spectrogram and background audio spectrogram to the normalization layer 3012. Then, the normalization layer 3012 outputs spectral feature data to the dot-product attention layer 3021, causing the dot-product attention layer 3021 to output spectral feature adjustment data to the rectified linear unit layer 3031, which in turn outputs a denoised spectrogram.

[0092] Step 204: Perform audio conversion processing on the noise-reduced spectrogram to obtain the noise-reduced audio corresponding to the target audio.

[0093] In this application, audio conversion processing is used to convert the spectrogram back into audio. Optionally, the processing device may employ the Inverse Short-Time Fourier Transform (ISTFT) to perform audio conversion processing on the denoised spectrogram to obtain the denoised audio corresponding to the target audio.
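The conversion from a spectrogram back to audio can be sketched with a small STFT and ISTFT pair. This is an illustrative implementation under stated assumptions: a periodic Hann window, 50% overlap, a naive DFT instead of an FFT, and weighted overlap-add normalization. The function names and the frame size are hypothetical, and a real system would normally use a library routine.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def dft(frame, inverse=False):
    """Naive discrete Fourier transform (or its inverse)."""
    n = len(frame)
    sign = 1.0 if inverse else -1.0
    out = [sum(x * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t, x in enumerate(frame)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def stft(signal, size, hop):
    """Frame the signal, window each frame, and transform it."""
    win = hann(size)
    return [dft([s * w for s, w in zip(signal[i:i + size], win)])
            for i in range(0, len(signal) - size + 1, hop)]

def istft(frames, size, hop):
    """Inverse-transform each frame and overlap-add with window normalization."""
    win = hann(size)
    length = hop * (len(frames) - 1) + size
    audio = [0.0] * length
    norm = [0.0] * length
    for idx, spectrum in enumerate(frames):
        frame = dft(spectrum, inverse=True)
        for t in range(size):
            audio[idx * hop + t] += frame[t].real * win[t]
            norm[idx * hop + t] += win[t] ** 2
    # Samples where the squared windows sum to zero cannot be recovered.
    return [a / n if n > 1e-8 else 0.0 for a, n in zip(audio, norm)]

audio = [math.sin(0.4 * t) for t in range(32)]
spec = stft(audio, size=8, hop=4)       # complex spectrogram, one list per frame
restored = istft(spec, size=8, hop=4)   # audio reconstructed from the spectrogram
```

In the method itself, the spectrogram passed to the inverse transform would be the denoised spectrogram from step 203 rather than an unmodified one. If the model outputs magnitudes only, a phase must be supplied, for example the phase of the target spectrogram; that choice is not specified in the text.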

[0094] In this embodiment, the target audio recorded by the rotating device, the device rotation speed, and the environmental background audio are acquired. Spectrum extraction processing is then performed on the target audio and the environmental background audio separately to obtain the target spectrogram corresponding to the target audio and the background audio spectrogram corresponding to the environmental background audio. The target spectrogram, device rotation speed, and background audio spectrogram are then input into the noise filtering model to obtain a denoised spectrogram. Furthermore, audio conversion processing is performed on the denoised spectrogram to obtain the denoised audio corresponding to the target audio, achieving effective noise reduction of the target audio. Moreover, the device rotation speed and environmental background audio are easier to determine accurately than the relative motion trajectory between the sound receiver and the sound source device, and the noise filtering model can learn the complex relationship between the device rotation speed and the noise generated by factors such as the Doppler effect and the rotation of the device itself. By applying the noise filtering model to the target audio recorded by the rotating device based on the device rotation speed and environmental background audio, more accurate noise reduction can be achieved to a certain extent, improving the noise cancellation effect on audio recorded by the rotating device.

[0095] In some embodiments of this application, the noise filtering model can be a model obtained by training a preset neural network based on multiple training data. Each training data includes: sample audio, sample rotation speed, and sample environmental background audio.

[0096] Wherein, the sample rotation speed is the rotation speed of the rotating device when recording the sample audio. The sample environmental background audio is the audio obtained by extracting background sound from the environmental audio (i.e., the aforementioned first environmental audio). The environmental audio is the audio obtained by a reference recording device at a fixed distance from the sound source device recording the sound emitted by the sound source device while the rotating device is rotating. In other words, the environmental audio is the ambient sound of the environment in which the rotating device is located, collected by the reference recording device while the rotating device is rotating and the sound source device is emitting sound.

[0097] Optionally, the process of generating the sample environment background audio may include: acquiring a first environmental audio obtained by recording the sound emitted by the sound source device while the rotating device is rotating; and acquiring a second environmental audio obtained by recording the sound emitted by the sound source device while the rotating device is not rotating. Then, based on the first and second environmental audios, spectral subtraction can be used to extract background sound, thereby extracting audio that differs from the second environmental audio from the first environmental audio, to obtain the sample environment background audio.
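The background extraction by spectral subtraction can be sketched on magnitude spectrograms as follows. The function name `spectral_subtract`, the zero floor, and the toy values are assumptions; the text does not state how negative differences are handled.

```python
def spectral_subtract(first_mag, second_mag, floor=0.0):
    """Per-bin magnitude subtraction, clamped at `floor` so that no
    magnitude becomes negative."""
    return [[max(a - b, floor) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first_mag, second_mag)]

# Toy magnitude spectrograms (frames x frequency bins); values are illustrative.
first_env = [[1.0, 0.8, 0.3], [0.9, 1.2, 0.2]]   # recorded while rotating
second_env = [[1.0, 0.2, 0.4], [0.7, 0.2, 0.2]]  # recorded while not rotating

# What remains is the part of the first recording absent from the second.
background = spectral_subtract(first_env, second_env)
```

To turn the result back into audio, a phase is needed in addition to the magnitudes; reusing the phase of the first environmental audio is a common choice, stated here as an assumption.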

[0098] The sample audio is the audio obtained by the rotating device recording the sound emitted by the sound source device. The sample audios differ from one another in the distance between the rotating device and the sound source device at the time of recording, and / or in the rotation speed of the rotating device at the time of recording.

[0099] Optionally, multiple sample audio files can be obtained by repeatedly adjusting the distance between the rotating device and the sound source device, recording the sound emitted by the sound source device after each adjustment. Alternatively, multiple sample audio files can be obtained by repeatedly adjusting both the distance between the rotating device and the sound source device, and adjusting the rotation speed of the rotating device, recording the sound emitted by the sound source device after each adjustment.

[0100] Further, optionally, the training process of the noise filtering model may include steps 001 to 004.

[0101] In step 001, multiple training data are acquired, each training data including sample audio, sample rotation speed, and sample environmental background audio.

[0102] Optionally, the processing device can adjust the rotating device multiple times as described above to obtain multiple sample audios collected by the rotating device, and obtain the sample rotation speed and sample environment background audio when the rotating device collects each sample audio.

[0103] In step 002, the sample audio and sample environmental background audio in each training data are subjected to spectrum extraction processing to obtain the sample spectrum map corresponding to the sample audio and the sample background audio spectrum map corresponding to the sample environmental background audio.

[0104] In this embodiment of the application, the processing device can perform spectrum extraction processing on the sample audio to obtain the sample spectrum map corresponding to the sample audio, and perform spectrum extraction processing on the sample environmental background audio to obtain the sample background audio spectrum map corresponding to the sample environmental background audio.

[0105] In one optional implementation, the processing device may employ a target spectrum extraction algorithm to perform spectrum extraction processing on the sample audio to obtain a sample spectrogram, and then employ the same algorithm to perform spectrum extraction processing on the sample environmental background audio to obtain a sample background audio spectrogram. The target spectrum extraction algorithm may be a short-time Fourier transform, a Mel-frequency cepstral coefficient algorithm, or a linear predictive coding algorithm, etc.

[0106] In another alternative implementation, the processing device can sequentially perform frame segmentation, windowing, and Fourier transform processing on the sample audio to obtain the sample spectrogram. Similarly, the processing device can sequentially perform frame segmentation, windowing, and Fourier transform processing on the sample ambient background audio to obtain the sample background audio spectrogram.
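The frame segmentation, windowing, and Fourier transform sequence can be sketched as follows. The frame size, hop size, Hann window, and function names are assumptions, and a naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math

def frame_signal(signal, size, hop):
    """Split the signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def apply_window(frame):
    """Multiply the frame by a periodic Hann window to reduce leakage."""
    n = len(frame)
    return [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
            for i, x in enumerate(frame)]

def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of the frame's DFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, size=16, hop=8):
    """One magnitude spectrum per frame."""
    return [magnitude_spectrum(apply_window(f))
            for f in frame_signal(signal, size, hop)]

# Test tone with 4 cycles per 16-sample frame, so energy should peak in bin 4.
tone = [math.sin(2.0 * math.pi * 4 * t / 16) for t in range(64)]
spec = spectrogram(tone)
```

The same function would be applied to the sample audio and to the sample environmental background audio, so that both spectrograms share one time and frequency grid.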

[0107] In step 003, each training data is updated. Each updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio.

[0108] In this embodiment, the processing device can sequentially update each training data point to obtain an updated training data point. Each updated training data point includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio.

[0109] In step 004, the preset neural network is trained based on multiple updated training data to output a predicted noise reduction spectrum until the loss function converges, thus obtaining the noise filtering model.

[0110] In some embodiments of this application, the processing device can input the multiple updated training data into the preset neural network to obtain the predicted denoised spectrogram output by the preset neural network for each training data. Based on the loss function, the predicted denoised spectrogram corresponding to each training data, and the corresponding high-quality spectrogram, the loss value corresponding to each training data is calculated, so as to obtain the loss values of the multiple training data. Then, the parameters of the preset neural network can be adjusted according to the loss values of the training data until the loss function converges, indicating that training of the preset neural network is complete. The trained preset neural network is the noise filtering model.
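The training loop described in this paragraph can be sketched with a deliberately tiny model. Everything specific here is an assumption made for illustration: `predict` is a hypothetical stand-in for the preset neural network that subtracts a speed-scaled share of the background spectrum, the loss is plain mean squared error, and the learning rate, target loss, and data values are arbitrary.

```python
def predict(weights, sample_spec, speed, background_spec):
    """Toy stand-in for the network: one learnable weight per frequency bin."""
    return [s - w * speed * b
            for s, w, b in zip(sample_spec, weights, background_spec)]

def mse(pred, ref):
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)

def train(data, bins, lr=0.05, target_loss=1e-6, max_epochs=2000):
    weights = [0.0] * bins
    mean_loss = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for sample_spec, speed, background_spec, clean_spec in data:
            pred = predict(weights, sample_spec, speed, background_spec)
            total += mse(pred, clean_spec)
            for i in range(bins):
                # Gradient of the MSE with respect to weights[i].
                grad = (-2.0 * (pred[i] - clean_spec[i])
                        * speed * background_spec[i] / bins)
                weights[i] -= lr * grad
        mean_loss = total / len(data)
        if mean_loss < target_loss:  # stop once the loss has converged
            break
    return weights, mean_loss

# Each item: (sample spectrum, rotation speed, background spectrum,
# high-quality spectrum). The samples were built with a true weight of 0.3.
data = [
    ([1.3, 2.15], 1.0, [1.0, 0.5], [1.0, 2.0]),
    ([0.8, 2.1], 2.0, [0.5, 1.0], [0.5, 1.5]),
]
weights, final_loss = train(data, bins=2)
```

The loop shows the three ingredients named in the text: a prediction per training item, a loss against the high-quality spectrogram, and a parameter update repeated until a convergence condition holds.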

[0111] In an alternative implementation, the multiple training data can be divided into a training set and a validation set. The processing device can input the training data in the training set into the preset neural network to obtain the predicted denoised spectrogram output by the preset neural network for each of these training data. Furthermore, the processing device can input the training data in the validation set into the preset neural network to obtain the predicted denoised spectrogram output by the preset neural network for each of these training data. Based on the loss function, the predicted denoised spectrogram corresponding to each training data in the validation set, and the corresponding high-quality spectrogram, the loss value for each training data in the validation set is calculated, so as to obtain the loss values of multiple training data. The parameters of the preset neural network can then be adjusted based on these loss values until the loss function converges, indicating that training of the preset neural network is complete. The trained preset neural network is the noise filtering model.

[0112] The loss function characterizes the difference between the predicted denoised spectrogram output by the preset neural network model during training and the high-quality spectrogram corresponding to the sample spectrogram. The high-quality spectrogram is the spectrogram of the sample audio after denoising according to requirements. For example, the high-quality spectrogram could be the spectrogram of the second environmental audio corresponding to the sample background audio spectrogram. The second environmental audio is the audio recorded by a reference recording device while the rotating device is not rotating, capturing the sound emitted by the sound source device. This second environmental audio can be considered noise-free audio and can be used as the sample audio after denoising according to requirements.

[0113] Optionally, loss function convergence means that, during training, the difference between the predicted denoised spectrogram output by the preset neural network and the high-quality spectrogram corresponding to the sample spectrogram is less than a preset difference. Alternatively, it means that the preset neural network has been trained a predetermined number of times. In some embodiments, when convergence is defined by the difference being less than a preset difference, convergence can be determined based on the loss values of multiple training data. For example, convergence can be determined when the loss values of the multiple training data are all less than the target loss value. For another example, convergence can be determined when the proportion of loss values that are less than the target loss value, among the loss values of the multiple training data, is greater than a target proportion.
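The two loss-based convergence checks can be written as small predicates. The function names and thresholds are hypothetical, and the proportion rule is read here as requiring the share of sufficiently small losses to exceed the target proportion, which is an interpretation of the text.

```python
def all_below(losses, target_loss):
    """Converged when every loss value is below the target loss value."""
    return all(loss < target_loss for loss in losses)

def proportion_below(losses, target_loss, target_proportion):
    """Converged when the share of loss values below the target loss value
    exceeds the target proportion."""
    share = sum(1 for loss in losses if loss < target_loss) / len(losses)
    return share > target_proportion

losses = [0.01, 0.02, 0.2, 0.03]  # made-up loss values for four training data
strict = all_below(losses, target_loss=0.05)
relaxed = proportion_below(losses, target_loss=0.05, target_proportion=0.7)
```

With these values the strict rule fails because one loss is 0.2, while the proportion rule passes because three of four losses are below 0.05. The proportion rule tolerates a few hard samples at the cost of a weaker guarantee.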

[0114] In some embodiments of this application, the loss function can satisfy:

[0115] Here, Loss represents the loss value output by the loss function. MSE_Loss represents the mean squared error loss function. SI_SNR represents the scale-invariant signal-to-noise ratio, which is used to measure the quality of the denoised audio signal. A higher scale-invariant signal-to-noise ratio indicates better audio quality. α and β are weight values, usually set to constants according to the actual situation.
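The formula referred to in [0114] is not reproduced in this text. One form consistent with the definitions above, given as an assumption rather than as the application's own expression, is a weighted combination in which the SI_SNR term is subtracted, so that a higher scale-invariant signal-to-noise ratio lowers the loss:

```latex
\mathrm{Loss} = \alpha \cdot \mathrm{MSE\_Loss} - \beta \cdot \mathrm{SI\_SNR}
```

The sign of the second term is the main uncertainty. If the original formula adds the two terms, the SI_SNR term would need to be defined with the opposite sign for minimization to improve audio quality.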

[0116] Optionally, the mean squared error loss function MSE_Loss satisfies:

[0117] Where y represents the high-quality spectrogram, ŷ represents the predicted denoised spectrogram output by the preset neural network, and N represents the total number of points in the sample spectrogram.
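The formula referred to in [0116] is not reproduced in this text. The standard definition of mean squared error, which matches the quantities described above, is shown below. The symbol ŷ for the predicted denoised spectrogram and the index i over spectrogram points are notational assumptions.

```latex
\mathrm{MSE\_Loss} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```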

[0118] Optionally, the scale-invariant signal-to-noise ratio SI_SNR satisfies:

[0119] Where s represents the second environmental audio corresponding to the high-quality spectrogram, and ŝ represents the predicted noise-reduced audio obtained by inversely transforming the predicted noise-reduced spectrogram.
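The formula referred to in [0118] is not reproduced in this text. The sketch below uses the commonly used definition of scale-invariant signal-to-noise ratio, in which the estimate is projected onto the reference and the ratio of projected energy to residual energy is expressed in decibels. The mean removal, the small constant `eps`, the default weights, and the subtraction of the SI_SNR term in `combined_loss` are assumptions, not statements of the application's formulas.

```python
import math

def mse_loss(y, y_hat):
    """Mean squared error between reference and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def si_snr(s, s_hat, eps=1e-12):
    """Scale-invariant SNR in dB between reference s and estimate s_hat."""
    # Remove the mean so a constant offset does not affect the score.
    mean_s = sum(s) / len(s)
    mean_h = sum(s_hat) / len(s_hat)
    s = [v - mean_s for v in s]
    s_hat = [v - mean_h for v in s_hat]
    # Project the estimate onto the reference to get the target component.
    scale = sum(a * b for a, b in zip(s_hat, s)) / (sum(a * a for a in s) + eps)
    target = [scale * a for a in s]
    noise = [h - t for h, t in zip(s_hat, target)]
    ratio = sum(t * t for t in target) / (sum(n * n for n in noise) + eps)
    return 10.0 * math.log10(ratio + eps)

def combined_loss(y, y_hat, s, s_hat, alpha=1.0, beta=0.01):
    # Assumption: SI-SNR is subtracted so that better audio lowers the loss.
    return alpha * mse_loss(y, y_hat) - beta * si_snr(s, s_hat)

reference = [math.sin(0.3 * t) for t in range(50)]
close = [v + 0.1 * math.cos(1.7 * t) for t, v in enumerate(reference)]
far = [v + 0.5 * math.cos(1.7 * t) for t, v in enumerate(reference)]
```

Because the estimate is projected onto the reference, multiplying the estimate by a constant gain leaves the score unchanged, which is the sense in which the measure is scale-invariant.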

[0120] In summary, the audio processing method provided in this application acquires the target audio recorded by the rotating device, the device rotation speed, and the environmental background audio. It then performs spectrum extraction processing on the target audio and the environmental background audio separately to obtain the target spectrogram corresponding to the target audio and the background audio spectrogram corresponding to the environmental background audio. The target spectrogram, device rotation speed, and background audio spectrogram are then input into the noise filtering model to obtain a denoised spectrogram. Furthermore, audio conversion processing is performed on the denoised spectrogram to obtain the denoised audio corresponding to the target audio, achieving effective noise reduction of the target audio. Moreover, the device rotation speed and environmental background audio are easier to determine accurately than the relative motion trajectory between the sound receiver and the sound source device. Because the noise filtering model has learned the complex relationship between the device rotation speed and the noise generated by factors such as the Doppler effect and the rotation of the device itself, it can perform noise filtering processing on the target audio recorded by the rotating device based on the device rotation speed and environmental background audio. This can achieve more accurate noise reduction to a certain extent and improve the noise cancellation effect on audio recorded by the rotating device.

[0121] Please refer to Figure 5, which shows a flowchart of a model training method provided in an embodiment of this application. The model training method can be applied to the implementation environment shown in Figure 1 and executed by the processing device 10. Optionally, the processing device 10 can be an electronic device or a component in an electronic device. Optionally, the electronic device can be a mobile phone, tablet, or computer, etc. The component in the electronic device that executes the model training method can be a processor, microcontroller, etc. As shown in Figure 5, the model training method includes:

[0122] Step 501: Obtain multiple training data sets, each including sample audio, sample rotation speed, and sample environmental background audio.

[0123] The explanation and implementation of this step can be found in the explanation and implementation of step 001 above, and will not be repeated in this embodiment.

[0124] Step 502: Perform spectrum extraction processing on the sample audio and sample environment background audio in each training data to obtain the sample spectrum map corresponding to the sample audio and the sample background audio spectrum map corresponding to the sample environment background audio.

[0125] The explanation and implementation of this step can be found in the explanation and implementation of step 002 above, and will not be repeated in this embodiment.

[0126] Step 503: Update each training data. Each updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio.

[0127] The explanation and implementation of this step can be found in the explanation and implementation of step 003 above, and will not be repeated in this embodiment.

[0128] Step 504: Train the preset neural network based on multiple updated training data, output the predicted noise reduction spectrum, until the loss function converges, and obtain the noise filtering model.

[0129] The explanation and implementation of this step can be found in the explanation and implementation of step 004 above, and will not be repeated here. Specifically, the noise filtering model is used to perform noise filtering processing on the target spectrogram of audio recorded by the rotating device based on the device rotation speed and the background audio spectrogram, and to output the processed noise-reduced spectrogram. The noise-reduced spectrogram is then subjected to audio conversion processing to obtain the noise-reduced audio corresponding to the target audio. It should be noted that the specific processing procedure for audio noise reduction using the noise filtering model can be found in the audio processing method provided in the embodiments of this application, and will not be repeated here.

[0130] The loss function represents the difference between the predicted denoised spectrum output by the preset neural network model during training and the high-quality spectrum corresponding to the sample spectrum. The high-quality spectrum is the spectrum of the audio after the sample audio has undergone denoising processing that meets the requirements.

[0131] The sample audio is the audio obtained by the rotating device recording the sound emitted by the sound source device. The sample audios differ from one another in the distance between the rotating device and the sound source device at the time of recording, and / or in the rotation speed of the rotating device at the time of recording. The sample rotation speed is the rotation speed of the rotating device when recording the sample audio. The sample environmental background audio is the audio obtained by extracting background sound from the environmental audio. The environmental audio is the audio recorded by the reference recording device, located at a fixed distance from the sound source device, while the rotating device is rotating.

[0132] In summary, the model training method provided in this application acquires multiple training data and performs spectrum extraction processing on the sample audio and sample environmental background audio in each training data to obtain the sample spectrogram corresponding to the sample audio and the sample background audio spectrogram corresponding to the sample environmental background audio. Each training data is then updated so that each updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio. A preset neural network is trained based on the multiple updated training data to output a predicted denoised spectrogram until the loss function converges, resulting in a noise filtering model. Using this noise filtering model, noise filtering processing can be performed on the target spectrogram of audio recorded by the rotating device based on the device rotation speed and the background audio spectrogram, outputting a processed denoised spectrogram. The denoised spectrogram undergoes audio conversion processing to obtain the denoised audio corresponding to the target audio, achieving effective noise reduction of the target audio. Furthermore, the device rotation speed and the environmental background audio are easier to determine accurately than the relative motion trajectory between the sound receiver and the sound source device. Because the noise filtering model has learned the complex relationship between the device rotation speed and the noise generated by factors such as the Doppler effect and the rotation of the device itself, it can perform noise filtering processing on the target audio recorded by the rotating device based on the device rotation speed and the environmental background audio. This can achieve more accurate noise reduction to a certain extent and improve the noise cancellation effect on audio recorded by the rotating device.

[0133] The audio processing apparatus provided in this application embodiment can execute the audio processing method provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the audio processing method.

[0134] Please refer to Figure 6, which shows a block diagram of an audio processing device provided in an embodiment of this application. As shown in Figure 6, the audio processing device 600 includes: a first acquisition module 601, a first spectrum extraction module 602, a model processing module 603, and an audio conversion module 604.

[0135] The first acquisition module 601 is used to acquire the target audio recorded by the rotating device, the device rotation speed, and the ambient background audio. The ambient background audio refers to the background sound of the environment in which the rotating device is rotating.

[0136] The first spectrum extraction module 602 is used to perform spectrum extraction processing on the target audio and the environmental background audio respectively, to obtain the target spectrum map and the background audio spectrum map respectively.

[0137] The model processing module 603 is used to input the target spectrum, equipment rotation speed and background audio spectrum into the noise filtering model to obtain the noise-reduced spectrum. The noise filtering model is used to perform noise filtering on the target spectrum based on the equipment rotation speed and background audio spectrum and output the processed noise-reduced spectrum.

[0138] The audio conversion module 604 is used to perform audio conversion processing on the noise-reduced spectrogram to obtain the noise-reduced audio corresponding to the target audio.

[0139] Optionally, the noise filtering model includes a feature extraction module, an attention mechanism module, and a correction module; the model processing module 603 is also used for:

[0140] The target spectrogram and background audio spectrogram are input into the feature extraction module, and the target spectrogram and background audio spectrogram are fused to extract features, thereby obtaining spectral feature data;

[0141] The spectral feature data and device rotation speed are input into the attention mechanism module. Based on the device rotation speed, the spectral feature data is processed to perform channel enhancement or suppression to obtain spectral feature adjustment data.

[0142] The spectral feature adjustment data is input into the correction module, and nonlinear mapping processing is performed on the spectral feature adjustment data to obtain the noise-reduced spectrum.

[0143] Optionally, the model processing module 603 is also used to: sequentially perform convolution and normalization processing on the target spectrogram and the background audio spectrogram to obtain spectral feature data.

[0144] Optionally, the model processing module 603 is also used for:

[0145] The target spectrogram and background audio spectrogram are input into the convolutional layer of the feature extraction module. The target spectrogram and background audio spectrogram are convolved to obtain the fused feature data of the target spectrogram and background audio spectrogram.

[0146] The fused feature data is input into the normalization layer in the feature extraction module to perform normalization processing, thereby obtaining spectral feature data.

[0147] Optionally, the model processing module 603 is further configured to: input the spectral feature data and the device rotation speed into the dot product attention layer in the attention mechanism module, and perform weighted processing on the spectral feature data according to the device rotation speed to obtain spectral feature adjustment data.

[0148] Optionally, the model processing module 603 is also used to: input the spectral feature adjustment data into the rectified linear unit layer in the correction module, and output a denoised spectrogram.

[0149] In summary, the audio processing apparatus provided in this application acquires the target audio recorded by the rotating device, the device rotation speed, and the environmental background audio. It then performs spectrum extraction processing on the target audio and the environmental background audio separately to obtain a target spectrogram corresponding to the target audio and a background audio spectrogram corresponding to the environmental background audio. The target spectrogram, device rotation speed, and background audio spectrogram are then input into the noise filtering model to obtain a denoised spectrogram. Furthermore, audio conversion processing is performed on the denoised spectrogram to obtain the denoised audio corresponding to the target audio, achieving effective noise reduction of the target audio. Moreover, the device rotation speed and environmental background audio are easier to determine accurately than the relative motion trajectory between the sound receiver and the sound source device. Because the noise filtering model has learned the complex relationship between the device rotation speed and the noise generated by factors such as the Doppler effect and the rotation of the device itself, it can perform noise filtering processing on the target audio recorded by the rotating device based on the device rotation speed and environmental background audio. This can achieve more accurate noise reduction to a certain extent and improve the noise cancellation effect on audio recorded by the rotating device.

[0150] The model training apparatus provided in the embodiments of this application can execute the model training method provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the model training method.

[0151] Please refer to Figure 7, which shows a block diagram of a model training device provided in an embodiment of this application. As shown in Figure 7, the model training device 700 includes: a second acquisition module 701, a second spectrum extraction module 702, an update module 703, and a model training module 704.

[0152] The second acquisition module 701 is used to acquire multiple pieces of training data, each including sample audio, a sample rotation speed, and sample environmental background audio.

[0153] The second spectrum extraction module 702 is used to perform spectrum extraction processing on the sample audio and the sample environmental background audio in each piece of training data to obtain the sample spectrogram corresponding to the sample audio and the sample background audio spectrogram corresponding to the sample environmental background audio.

[0154] The update module 703 is used to update each training data. Each updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio.
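
The update described above amounts to replacing the two audio signals in each piece of training data with their spectrograms while keeping the rotation speed. A minimal sketch follows; the class names, field names, and the `extract_spectrum` callable are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Spectrogram = List[List[float]]


@dataclass
class TrainingSample:
    """One piece of training data before the update (raw audio)."""
    sample_audio: List[float]
    sample_rotation_speed: float
    sample_background_audio: List[float]


@dataclass
class UpdatedTrainingSample:
    """One piece of training data after the update (spectrograms)."""
    sample_spectrogram: Spectrogram
    sample_rotation_speed: float
    sample_background_spectrogram: Spectrogram


def update_training_data(
    samples: List[TrainingSample],
    extract_spectrum: Callable[[List[float]], Spectrogram],
) -> List[UpdatedTrainingSample]:
    """Apply spectrum extraction to both audio fields of every sample."""
    return [
        UpdatedTrainingSample(
            sample_spectrogram=extract_spectrum(s.sample_audio),
            sample_rotation_speed=s.sample_rotation_speed,
            sample_background_spectrogram=extract_spectrum(s.sample_background_audio),
        )
        for s in samples
    ]
```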

[0155] The model training module 704 is used to train a preset neural network based on the multiple pieces of updated training data, with the network outputting a predicted denoised spectrogram, until the loss function converges, thereby obtaining a noise filtering model. The noise filtering model is used to perform noise filtering processing on the target spectrogram corresponding to the target audio recorded by the rotating device according to the device rotation speed and the background audio spectrogram, and to output the processed denoised spectrogram. Audio conversion processing is then performed on the denoised spectrogram to obtain the denoised audio corresponding to the target audio.

[0156] The loss function represents the difference between the predicted denoised spectrogram output by the preset neural network during training and the high-quality spectrogram corresponding to the sample spectrogram. The high-quality spectrogram is the spectrogram of the sample audio after denoising processing that meets the requirements.
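
The application does not specify the form of the loss function or of the preset neural network. As one possible illustration, the sketch below assumes a mean squared error loss and replaces the network with a toy one-parameter model, `predicted = sample - alpha * speed * background`, trained by gradient descent until the change in loss falls below a tolerance. All function names, the model form, and the hyperparameters are assumptions made for this example.

```python
def mse_loss(predicted, reference):
    """Mean squared difference between two spectrograms of equal shape."""
    squared = [(p - r) ** 2
               for p_row, r_row in zip(predicted, reference)
               for p, r in zip(p_row, r_row)]
    return sum(squared) / len(squared)


def predict(sample_spec, speed, background_spec, alpha):
    """Toy stand-in for the preset neural network (one trainable scalar)."""
    return [[t - alpha * speed * b for t, b in zip(t_row, b_row)]
            for t_row, b_row in zip(sample_spec, background_spec)]


def gradient(sample_spec, speed, background_spec, reference, alpha):
    """Derivative of mse_loss with respect to alpha for the toy model."""
    predicted = predict(sample_spec, speed, background_spec, alpha)
    terms = [2.0 * (p - r) * (-speed * b)
             for p_row, r_row, b_row in zip(predicted, reference, background_spec)
             for p, r, b in zip(p_row, r_row, b_row)]
    return sum(terms) / len(terms)


def train_until_converged(samples, learning_rate=0.1, tolerance=1e-12, max_epochs=1000):
    """samples: (sample_spec, speed, background_spec, high_quality_spec) tuples.
    Stops when the loss changes by less than `tolerance` between epochs."""
    alpha = 0.0
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_epochs):
        loss = sum(mse_loss(predict(t, s, b, alpha), c)
                   for t, s, b, c in samples) / len(samples)
        if abs(previous_loss - loss) < tolerance:
            break
        grad = sum(gradient(t, s, b, c, alpha) for t, s, b, c in samples) / len(samples)
        alpha -= learning_rate * grad
        previous_loss = loss
    return alpha, loss
```

Here the high-quality spectrogram plays the role of the reference in `mse_loss`, and convergence is judged by the change in loss between epochs. Other loss functions and stopping criteria would fit the description equally well.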

[0157] The sample audio is audio of the sound source device recorded by the rotating device. The individual sample audios are recorded by the rotating device at different distances from the sound source device and/or at different rotation speeds. The sample rotation speed is the rotation speed of the rotating device when recording the sample audio. The sample environmental background audio is the audio obtained by extracting the background sound from the ambient audio. The ambient audio is the audio recorded by a reference recording device at a fixed distance from the sound source device while the rotating device is rotating.

[0158] In summary, the model training device provided in this application acquires multiple pieces of training data and performs spectrum extraction processing on the sample audio and the sample environmental background audio in each piece of training data to obtain the sample spectrogram corresponding to the sample audio and the sample background audio spectrogram corresponding to the sample environmental background audio. Each piece of training data is then updated so that it includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio. A preset neural network is trained based on the multiple pieces of updated training data, with the network outputting a predicted denoised spectrogram, until the loss function converges, resulting in a noise filtering model. Using the noise filtering model, noise filtering processing can be performed on the target spectrogram corresponding to the target audio recorded by the rotating device based on the device rotation speed and the background audio spectrogram, and the processed denoised spectrogram can be output. Audio conversion processing of the denoised spectrogram yields the denoised audio corresponding to the target audio, achieving effective noise reduction of the target audio. Furthermore, the device rotation speed and the environmental background audio are easier to determine accurately than the relative motion trajectory between the sound receiver and the sound source device. Because the noise filtering model learns the complex relationship between the device rotation speed and the noise generated by factors such as the Doppler effect and the device's own rotation, noise filtering processing can be performed on the target audio recorded by the rotating device based on the device rotation speed and the environmental background audio. This can achieve more accurate noise reduction to a certain extent and improve the noise cancellation effect on audio recorded by the rotating device.

[0159] This application also provides an electronic device. As shown in Figure 8, the electronic device 800 includes: a processor 801, a memory 802, and a computer program stored in the memory 802 and executable on the processor 801. When the computer program is executed by the processor 801, it implements the various processes of the above-described audio processing method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here.

[0160] This application also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the various processes of the above-described audio processing method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0161] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

Claims

1. An audio processing method, comprising: acquiring target audio recorded by a rotating device, a device rotation speed, and ambient background audio, wherein the ambient background audio is the background sound of the environment in which the rotating device is located while it rotates; performing spectrum extraction processing on the target audio and the ambient background audio separately to obtain a target spectrogram and a background audio spectrogram; inputting the target spectrogram, the device rotation speed, and the background audio spectrogram into a noise filtering model to obtain a denoised spectrogram, wherein the noise filtering model is used to perform noise filtering processing on the target spectrogram based on the device rotation speed and the background audio spectrogram and to output the processed denoised spectrogram; and performing audio conversion processing on the denoised spectrogram to obtain denoised audio corresponding to the target audio.

2. The method according to claim 1, wherein the noise filtering model includes a feature extraction module, an attention mechanism module, and a correction module; and inputting the target spectrogram, the device rotation speed, and the background audio spectrogram into the noise filtering model to obtain the denoised spectrogram comprises: inputting the target spectrogram and the background audio spectrogram into the feature extraction module, and performing fusion feature extraction on the target spectrogram and the background audio spectrogram to obtain spectral feature data; inputting the spectral feature data and the device rotation speed into the attention mechanism module, and performing channel enhancement or suppression processing on the spectral feature data according to the device rotation speed to obtain spectral feature adjustment data; and inputting the spectral feature adjustment data into the correction module, and performing nonlinear mapping processing on the spectral feature adjustment data to obtain the denoised spectrogram.

3. The method according to claim 2, wherein performing fusion feature extraction on the target spectrogram and the background audio spectrogram to obtain the spectral feature data comprises: inputting the target spectrogram and the background audio spectrogram into a convolutional layer in the feature extraction module, and performing convolution processing on the target spectrogram and the background audio spectrogram to obtain fused feature data of the target spectrogram and the background audio spectrogram; and inputting the fused feature data into a normalization layer in the feature extraction module, and performing normalization processing on the fused feature data to obtain the spectral feature data.

4. The method according to claim 2 or 3, wherein performing channel enhancement or suppression processing on the spectral feature data according to the device rotation speed to obtain the spectral feature adjustment data comprises: inputting the spectral feature data and the device rotation speed into a dot product attention layer in the attention mechanism module, and performing weighted processing on the spectral feature data according to the device rotation speed to obtain the spectral feature adjustment data.

5. The method according to claim 2 or 3, wherein performing nonlinear mapping processing on the spectral feature adjustment data to obtain the denoised spectrogram comprises: inputting the spectral feature adjustment data into a rectified linear unit (ReLU) layer in the correction module, and outputting the denoised spectrogram.

6. A model training method, comprising: acquiring multiple pieces of training data, each of which includes sample audio, a sample rotation speed, and sample environmental background audio; performing spectrum extraction processing on the sample audio and the sample environmental background audio in each piece of training data to obtain a sample spectrogram corresponding to the sample audio and a sample background audio spectrogram corresponding to the sample environmental background audio; updating each piece of training data, wherein each piece of updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio; and training a preset neural network based on the multiple pieces of updated training data, with the network outputting a predicted denoised spectrogram, until a loss function converges, to obtain a noise filtering model, wherein the noise filtering model is used to perform noise filtering processing on a target spectrogram corresponding to target audio recorded by a rotating device based on a device rotation speed and a background audio spectrogram and to output the processed denoised spectrogram, and audio conversion processing is performed on the denoised spectrogram to obtain denoised audio corresponding to the target audio.

7. An audio processing apparatus, comprising: a first acquisition module, used to acquire target audio recorded by a rotating device, a device rotation speed, and ambient background audio, wherein the ambient background audio is the background sound of the environment in which the rotating device is located while it rotates; a first spectrum extraction module, used to perform spectrum extraction processing on the target audio and the ambient background audio separately to obtain a target spectrogram and a background audio spectrogram; a model processing module, used to input the target spectrogram, the device rotation speed, and the background audio spectrogram into a noise filtering model to obtain a denoised spectrogram, wherein the noise filtering model is used to perform noise filtering processing on the target spectrogram based on the device rotation speed and the background audio spectrogram and to output the processed denoised spectrogram; and an audio conversion module, used to perform audio conversion processing on the denoised spectrogram to obtain denoised audio corresponding to the target audio.

8. A model training apparatus, comprising: a second acquisition module, used to acquire multiple pieces of training data, each of which includes sample audio, a sample rotation speed, and sample environmental background audio; a second spectrum extraction module, used to perform spectrum extraction processing on the sample audio and the sample environmental background audio in each piece of training data to obtain a sample spectrogram corresponding to the sample audio and a sample background audio spectrogram corresponding to the sample environmental background audio; an update module, used to update each piece of training data, wherein each piece of updated training data includes: the sample spectrogram corresponding to the sample audio, the sample rotation speed, and the sample background audio spectrogram corresponding to the sample environmental background audio; and a model training module, used to train a preset neural network based on the multiple pieces of updated training data, with the network outputting a predicted denoised spectrogram, until a loss function converges, to obtain a noise filtering model, wherein the noise filtering model is used to perform noise filtering processing on a target spectrogram corresponding to target audio recorded by a rotating device according to a device rotation speed and a background audio spectrogram and to output the processed denoised spectrogram, and audio conversion processing is performed on the denoised spectrogram to obtain denoised audio corresponding to the target audio.

9. An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the audio processing method according to any one of claims 1 to 5, or implements the steps of the model training method according to claim 6.

10. A readable storage medium, wherein a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the audio processing method according to any one of claims 1 to 5, or the steps of the model training method according to claim 6.