Data processing method and device for audio-video call, electronic device and medium
By detecting noise and unknown audio in the captured call audio and processing them according to user instructions, the method reduces noise interference in audio and video calls, improving call quality and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BAIDU INT TECH (SHENZHEN) CO LTD
- Filing Date
- 2022-12-13
- Publication Date
- 2026-06-19
AI Technical Summary
Noise interference during audio and video calls leads to a poor user experience, and existing technologies have failed to effectively handle noise during calls.
By detecting the presence of noise and unknown audio in the target audio, the system outputs prompts and removes or weakens the noise according to user instructions, thereby improving the quality of audio and video calls.
It effectively removes noise and unknown audio, improving the call quality and user experience of audio and video calls.
Figure CN116013342B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and more particularly to the field of data processing technology, specifically to a data processing method, apparatus, electronic device, and medium for audio and video calls.
Background Technology
[0002] With the continuous development of audio and video calling technology, more and more apps (applications) for audio and video calls, as well as collaborative office software that supports calling functions, have emerged. However, noise often inevitably interferes with the call process.
[0003] In related technologies, the audio obtained by capturing audio from each party in a call is played directly to the other party in that call.
Summary of the Invention
[0004] This disclosure provides a data processing method, apparatus, electronic device, and medium for audio and video calls.
[0005] According to a first aspect of this disclosure, a data processing method for audio and video calls is provided, comprising:
[0006] Acquire the target audio obtained by capturing audio from the target caller;
[0007] Detect whether the target audio contains a first type of audio data, that is, audio data belonging to noise;
[0008] If so, output a first prompt message in the call interface, wherein the first prompt message asks whether to remove the first type of audio data for the target caller;
[0009] In response to a removal instruction obtained based on the first prompt message, remove the first type of audio data from the specified audio before the specified audio is played;
[0010] The specified audio is the audio captured from the target caller that is to be played at the counterpart end of the target caller.
[0011] Optionally, it also includes:
[0012] Detect whether there is a second type of audio data belonging to unknown audio in the target audio; wherein, the unknown audio is audio that is neither noise nor belongs to the user of the target caller;
[0013] If present, a second prompt message is output in the call interface; wherein, the second prompt message is used to prompt whether to perform weakening processing on the second type of audio data for the target caller;
[0014] In response to the weakening instruction obtained based on the second prompt information, the specified audio is weakened for the second type of audio data before the specified audio is played.
[0015] Optionally, acquiring the target audio obtained by audio capture of the target caller includes:
[0016] Acquire the target audio obtained by audio capture of the target caller during a specified call phase;
[0017] The specified call phase includes the call phase before the call begins, and / or the call phase during the call.
[0018] Optionally, the call interface is the call interface of the target caller, and / or the call interface of the counterparty of the target caller.
[0019] Optionally, detecting whether there is a first type of audio data belonging to noise in the target audio includes:
[0020] Based on a predetermined noise feature library, detect whether there is a first type of audio data belonging to noise in the target audio;
[0021] The noise feature library contains audio features of audio data that belong to noise.
[0022] Optionally, detecting whether there is a first type of audio data belonging to noise in the target audio based on a predetermined noise feature library includes:
[0023] Obtain the various audio data obtained after performing a specified audio decomposition on the target audio; wherein, the specified audio decomposition is a method of decomposing according to different sound sources;
[0024] Based on a predetermined noise feature library and the audio features of each audio data, it is determined whether there is a first type of audio data belonging to noise in the target audio.
[0025] Optionally, detecting whether there is a second type of audio data belonging to unknown audio in the target audio includes:
[0026] Based on a predetermined user feature library and the audio features of other audio data, detect whether there is a second type of audio data belonging to unknown audio in the target audio;
[0027] The other audio data refers to the audio data in the target audio other than the first type of audio data that belongs to noise;
[0028] The user feature library contains audio features of users belonging to the target caller.
[0029] Optionally, the method further includes:
[0030] In response to the weakening instruction obtained based on the second prompt information, the audio features of the second type of audio data are added to the noise feature library.
[0031] According to a second aspect of this disclosure, a data processing apparatus for audio and video calls is provided, comprising:
[0032] The acquisition module is used to acquire the target audio obtained by capturing audio from the target caller.
[0033] The first detection module is used to detect whether there is a first type of audio data that belongs to noise in the target audio;
[0034] The first output module is used to output a first prompt message in the call interface if such audio data is present; wherein the first prompt message is used to prompt whether to remove the first type of audio data for the target caller;
[0035] The removal module is used to, in response to the removal instruction obtained based on the first prompt information, perform removal processing on the specified audio for the first type of audio data before the specified audio is played;
[0036] The specified audio is the audio obtained by collecting audio from the target caller and is to be played by the other end of the target caller.
[0037] According to a third aspect of this disclosure, an electronic device is provided, comprising:
[0038] At least one processor; and
[0039] A memory communicatively connected to the at least one processor; wherein,
[0040] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform any of the data processing methods for audio and video calls described above.
[0041] According to a fourth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform any of the data processing methods for audio and video calls described above.
[0042] According to a fifth aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements any of the data processing methods for audio and video calls described above.
[0043] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0044] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0045] Figure 1 is a flowchart of the data processing method for audio and video calls provided in this disclosure;
[0046] Figure 2 is another flowchart of the data processing method for audio and video calls provided in this disclosure;
[0047] Figure 3 is a schematic diagram of an embodiment of the data processing method for audio and video calls provided in this disclosure;
[0048] Figure 4 is a structural diagram of the data processing apparatus for audio and video calls provided in this disclosure;
[0049] Figure 5 is a block diagram of an electronic device used to implement the data processing method for audio and video calls according to the embodiments of this disclosure.
Detailed Implementation
[0050] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0051] With the development of audio and video call technology, users have increasingly higher requirements for the quality of audio and video calls. During calls in apps or collaborative office software that support call functions, unavoidable noise often interferes with the call process.
[0052] In related technologies, the audio of the caller is usually recorded directly and then played back to the other end of the call without processing the noise generated during the call, resulting in a poor user experience.
[0053] Based on this, the present disclosure provides a data processing method, apparatus, electronic device and medium for audio and video calls, so as to remove noise in audio and video calls in a user-friendly manner, thereby improving call quality and enhancing user experience during audio and video calls.
[0054] The following section first introduces the data processing methods for audio and video calls provided in this disclosure.
[0055] The data processing method for audio and video calls disclosed herein can be applied to electronic devices. For example, the electronic device can be a server or a terminal device, such as a mobile phone or computer; this disclosure does not limit the specific form of the electronic device. Furthermore, the data processing method for audio and video calls provided herein can be applied to both video call scenarios and voice call scenarios. In other words, any call scenario involving audio transmission falls under the category of audio and video call scenarios, and the method provided herein can be applied to improve call quality during audio and video calls.
[0056] Specifically, the entity executing the data processing method for audio and video calls can be a data processing device for audio and video calls. For example, when the data processing method for audio and video calls is applied to a terminal device, the data processing device can be functional software running on the terminal device, such as a client for making audio and video calls; the data processing device can also be a plugin for an existing client, such as a plugin in a collaborative office client that supports call functionality. For example, when the data processing method for audio and video calls is applied to a server, the data processing device can be a computer program running on the server, such as a functional module in the server-side program corresponding to a collaborative office client that supports call functionality.
[0057] The data processing method for audio and video calls provided in this disclosure may include the following steps:
[0058] Acquire the target audio obtained by capturing audio from the target caller;
[0059] Detect whether the target audio contains a first type of audio data, that is, audio data belonging to noise;
[0060] If so, output a first prompt message in the call interface, wherein the first prompt message asks whether to remove the first type of audio data for the target caller;
[0061] In response to a removal instruction obtained based on the first prompt message, remove the first type of audio data from the specified audio before the specified audio is played;
[0062] The specified audio is the audio captured from the target caller that is to be played at the counterpart end of the target caller.
[0063] In this solution, after the target audio captured from the target caller is acquired, if the target audio contains the first type of audio data, that is, noise, a first prompt message is output in the call interface asking whether to remove the noise. In response to a removal instruction obtained based on that prompt, the first type of audio data is removed from the specified audio before it is played, which removes noise from the target caller's audio. The solution therefore removes noise from audio and video calls in a user-friendly manner, improving call quality and user experience.
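As an illustration only, the steps above can be sketched in Python as follows. The `Component` type, the `is_noise` detector, and the `ask_user` callback are hypothetical names introduced for this sketch and are not defined in this disclosure; detection and prompting are passed in as functions so that any implementation can be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Component:
    """One separated sound source in the captured audio (hypothetical type)."""
    label: str
    samples: List[float]


def process_outgoing_audio(
    components: Sequence[Component],
    is_noise: Callable[[Component], bool],
    ask_user: Callable[[str], bool],
) -> List[Component]:
    """Detect noise, prompt, and remove the noise only if the user agrees."""
    noisy = [c for c in components if is_noise(c)]
    if not noisy:
        # Nothing detected: no prompt is shown and the audio is unchanged.
        return list(components)
    if ask_user("Noise detected. Remove it from this caller's audio?"):
        # Removal instruction received: drop the noise before playback.
        return [c for c in components if not is_noise(c)]
    # The user declined: the audio is played as captured.
    return list(components)
```

In this sketch the prompt is shown only when noise is actually detected, and the audio is modified only after an explicit removal instruction, which mirrors the user-controlled behavior described above.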
[0064] The following description, in conjunction with the accompanying drawings, provides an exemplary method for data processing in audio and video calls provided in this disclosure.
[0065] As shown in Figure 1, the present disclosure provides a data processing method for audio and video calls, which may include the following steps.
[0066] S101: Obtain the target audio obtained by audio capture of the target caller;
[0067] Before processing audio and video call data, the method first acquires the target audio captured from the target caller. The subsequent steps then process the audio that is captured from the target caller and played at the other end of the call. The duration of the target audio is not limited, provided that it is long enough for its audio features to be represented accurately.
[0068] Understandably, in one implementation, the target caller can be any one of the multiple callers involved in the call. That is, each caller in the call can be considered the target caller, thus executing the scheme of this disclosure. In another implementation, the target caller can be the speaker among the multiple callers involved in the call. For example, in an audio / video call involving parties A, B, and C, if party A is the speaker at a certain moment, then party A can be the target caller; if party B is the speaker at a certain moment, then party B can be the target caller. Furthermore, if the data processing method for audio / video calls disclosed herein is applied to a terminal device, the terminal device can be the device on the target caller's side.
[0069] For example, in one implementation, acquiring the target audio obtained by audio capture of the target caller includes:
[0070] Acquire the target audio obtained by audio capture of the target caller during a specified call phase;
[0071] The specified call phase includes the call phase before the call begins, and / or the call phase during the call.
[0072] An audio and video call usually has multiple call phases, such as the phase before the call begins, the phase during the call, and the phase after the call ends. To improve call quality, the target audio can be acquired by capturing audio from the target caller within a specified call phase. The specified call phase can be the phase before the call begins, the phase during the call, or both. Furthermore, for scenarios where the audio and video generated during the call are used after the call ends, the specified call phase can also be the phase after the call ends. In that case, the audio and video of the call can be backed up, and after the call ends, the backed-up audio can be processed and used as the target audio.
[0073] By acquiring the target audio within a specified call phase, subsequent processing of the target audio at each phase can be flexibly performed as needed. When acquiring target audio from multiple phases, processing can be performed on the target audio from multiple phases, which can improve the audio quality received by the other end of the call and further enhance the call quality during audio and video calls.
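One possible way to represent the specified call phase, given only as a sketch, is a set of combinable flags. The names `CallPhase` and `should_capture` are hypothetical and do not appear in this disclosure.

```python
from enum import Flag, auto


class CallPhase(Flag):
    """Call phases named in the description (hypothetical enum)."""
    BEFORE_CALL = auto()
    DURING_CALL = auto()
    AFTER_CALL = auto()


def should_capture(current: CallPhase, specified: CallPhase) -> bool:
    """Return True when the current phase is one of the specified phases."""
    return bool(current & specified)
```

Because the flags can be combined with `|`, a configuration such as `CallPhase.BEFORE_CALL | CallPhase.DURING_CALL` expresses the "and / or" wording of the specified call phase directly.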
[0074] S102: Detect whether there is a first type of audio data belonging to noise in the target audio;
[0075] After the target audio is obtained, in order to remove noise from the specified audio, it is first determined whether the target audio contains noise, that is, whether first-type audio data belonging to noise is present in the target audio. The first-type audio data can then be removed based on the detection result.
[0076] It should be noted that any implementation capable of detecting whether there is a first type of audio data belonging to noise in the target audio can be applied to the embodiments of this disclosure.
[0077] Optionally, in one implementation, detecting whether there is a first type of audio data belonging to noise in the target audio includes:
[0078] Based on a predetermined noise feature library, detect whether there is a first type of audio data belonging to noise in the target audio;
[0079] The noise feature library contains audio features of audio data that belong to noise.
[0080] To identify whether the target audio contains the first type of audio data, a pre-established noise feature library can be used. According to this disclosure, noise is a short-lived, transient high-frequency sound disturbance that differs significantly from the voice of a speaker in an audio / video call. The noise feature library can therefore be built from common everyday noises, such as the sound of a power drill or a hair dryer, and using it allows the first type of audio data in the target audio to be identified quickly.
[0081] For example, in one implementation, detecting whether there is a first type of audio data belonging to noise in the target audio based on a predetermined noise feature library includes:
[0082] Obtain the various audio data obtained after performing a specified audio decomposition on the target audio; wherein, the specified audio decomposition is a method of decomposing according to different sound sources;
[0083] Based on a predetermined noise feature library and the audio features of each audio data, it is determined whether there is a first type of audio data belonging to noise in the target audio.
[0084] When a predetermined noise feature library is used to detect whether the target audio contains the first type of audio data, the target audio may be a mixture of multiple sounds, such as noise and a speaker's voice. The target audio can therefore first be decomposed into individual pieces of audio data, each with its own audio features. The audio features of each piece of audio data are then compared with the features of each noise in the noise feature library to identify whether the target audio contains audio data belonging to noise. During the decomposition, the target audio can be separated into individual pieces of audio data according to sound frequency; other methods can also be used, and no limitation is imposed here.
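The frequency-based decomposition mentioned above can be illustrated with a minimal sketch that splits a signal into frequency bands. This is an assumption-laden example, not the decomposition used by this disclosure: it uses a naive discrete Fourier transform written with the Python standard library (suitable only for short signals), and the function name `split_by_band` and the band edges are hypothetical. A practical system would more likely use an FFT library or a source-separation model.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft(x: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum: Sequence[complex]) -> List[float]:
    """Inverse DFT, keeping only the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_by_band(
    x: Sequence[float],
    sample_rate: float,
    bands: Sequence[Tuple[float, float]],
) -> List[List[float]]:
    """Return one time-domain signal per (low_hz, high_hz) band."""
    n = len(x)
    spectrum = dft(x)
    parts = []
    for low_hz, high_hz in bands:
        masked = []
        for k in range(n):
            # Bins above n/2 mirror the negative frequencies of a real signal.
            freq = min(k, n - k) * sample_rate / n
            masked.append(spectrum[k] if low_hz <= freq < high_hz else 0j)
        parts.append(idft_real(masked))
    return parts
```

Each returned band can then be treated as one piece of audio data whose features are compared with the noise feature library.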
[0085] For example, the audio features of each piece of audio data can be matched against the features of each noise in the noise feature library. If the audio features of any piece of audio data match a noise feature in the library, that piece of audio data is detected as first-type audio data belonging to noise. A match is successful when the similarity between the audio features of the audio data and the noise feature exceeds a certain threshold. The similarity can be calculated from the feature vectors of the audio data and of the noise using existing techniques, which are not elaborated here. Furthermore, if a piece of audio data is not identified as noise by the predetermined noise feature library but its audio features exhibit a short-term, transient high-frequency sound disturbance, that piece of audio data can still be identified as first-type audio data, and its audio features can be added to the noise feature library as a noise feature.
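The threshold-based matching can be sketched as follows. This disclosure does not name a similarity measure, so cosine similarity between feature vectors is assumed here; the function names, the example library entries, and the threshold value of 0.9 are all hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_noise(
    feature: Sequence[float],
    noise_library: Dict[str, Sequence[float]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the best-matching noise name, or None if no score exceeds the threshold."""
    best_name, best_score = None, threshold
    for name, reference in noise_library.items():
        score = cosine_similarity(feature, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

A piece of audio data for which `match_noise` returns a name would be treated as first-type audio data; a result of `None` means the library did not identify it as noise.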
[0086] Performing the specified audio decomposition on the target audio yields the individual pieces of audio data it contains. Using the audio features of each piece of audio data and the predetermined noise feature library, it is possible to determine accurately whether the target audio contains first-type audio data belonging to noise, and which pieces of audio data are that noise.
[0087] S103: If present, output the first prompt message in the call interface;
[0088] The first prompt message is used to prompt whether to remove the first type of audio data for the target caller.
[0089] If the target audio is found to contain first-type audio data belonging to noise, then, in order to improve audio and video call quality and enhance the user experience, a prompt message asking whether to remove that first-type audio data can be displayed on the call interface.
[0090] For example, in one implementation, the call interface is the call interface of the target caller, and / or the call interface of the counterparty of the target caller.
[0091] Since both the target caller and the counterpart of the target caller benefit from better call quality for the target caller's audio, this implementation can output the first prompt message in the call interface of the target caller, of the counterpart, or of both. This lets either party control whether to improve the call quality for the target caller. Specifically, a pop-up window can be displayed in the call interface asking the caller whether to perform noise removal. The first prompt message can be placed anywhere within the call interface, as long as it does not affect the caller's experience; no limitation is imposed here.
[0092] It should be noted that, when the data processing method for audio and video calls provided in this disclosure is applied to a server and the target audio contains first-type audio data belonging to noise, the server can output the first prompt message to the call interface of every user on the multiple clients of the audio and video call, or only to the call interface of a single user of the call. When the method is applied to a client and the target audio contains first-type audio data belonging to noise, the client can output the first prompt message to the call interface of that client's user. In either case, the removal instruction obtained based on the first prompt message can subsequently be responded to in order to remove the first-type audio data.
[0093] Specifically, if the solution disclosed herein is applied to a server, when the presence of first-type audio data belonging to noise in the target audio is detected, a first prompt message can be output to the call interface of each party in the audio / video call, i.e., the target party and its counterpart. Alternatively, the first prompt message can be output to the call interface of any party in the audio / video call, i.e., the target party or its counterpart. If the solution disclosed herein is applied to a client in a terminal device, and if the client is a client targeting the target party, then when the presence of first-type audio data belonging to noise in the target audio is detected, a first prompt message for the first-type audio data can be output to the target party's call interface.
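The choice of which call interfaces receive the first prompt message can be sketched as a small routing function. The function name, the `"server"` and `"client"` strings, and the `only` parameter are hypothetical and serve only to illustrate the server-side and client-side cases described above.

```python
from typing import List, Optional


def prompt_recipients(
    deployment: str,
    target: str,
    counterparts: List[str],
    only: Optional[str] = None,
) -> List[str]:
    """Return the parties whose call interface should show the first prompt."""
    everyone = [target, *counterparts]
    if deployment == "client":
        # A client on the target caller's device prompts only its own user.
        return [target]
    if deployment == "server":
        if only is None:
            # Prompt every party in the call.
            return everyone
        if only not in everyone:
            raise ValueError(f"{only!r} is not a party in this call")
        # Prompt a single chosen party.
        return [only]
    raise ValueError(f"unknown deployment: {deployment!r}")
```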
[0094] The call interface can be the call interface of the target party and / or the counterparty of the target party. After the first prompt message is displayed on the call interface, the target party and / or the counterparty of the target party can respond to the first prompt message and choose whether to issue a removal command. If either party issues a removal command, subsequent removal of the first type of audio data that belongs to noise can be performed on that party. The first prompt message disclosed in this disclosure is used to prompt either party whether to remove the first type of audio data that belongs to noise. Users can choose to remove noise according to their own wishes, which can improve the call quality of audio and video calls and enhance the user experience.
[0095] S104: In response to the removal instruction obtained based on the first prompt information, before the specified audio is played, the specified audio is subjected to removal processing for the first type of audio data;
[0096] The specified audio is the audio obtained by collecting audio from the target caller and is to be played by the other end of the target caller.
[0097] After the first prompt message is output in the call interface, the caller receiving the first prompt message can issue a removal command based on the first prompt message. The solution provided in this disclosure can respond to the removal command obtained based on the first prompt message, and can remove the first type of audio data that belongs to noise in the specified audio before playing the specified audio, thereby improving the call quality of audio and video calls.
[0098] For example, the party receiving the first prompt message can issue a removal command by clicking, long-pressing, swiping, or performing a specified operation. This disclosure can then respond to the removal command and remove the first type of audio data before the specified audio is played. It should be noted that the first prompt message may include not only a prompt asking whether to remove the first type of audio data, but also a prompt guiding the user on how to issue the removal command, so that the user can issue the removal command based on the first prompt message, thereby removing the first type of audio data from the specified audio.
[0099] When removing the first type of audio data, filtering can be used to remove the first type of audio data that is noise in the specified audio. Of course, any method that can remove the first type of audio data is applicable to this disclosure, and the specific method for removing the first type of audio data is not limited here.
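This disclosure states that filtering can be used without naming a particular filter. As a minimal sketch, and purely as an assumption, a moving-average low-pass filter is shown below because the disclosure characterizes noise as a high-frequency disturbance; the function name and window size are hypothetical, and a practical system would likely use a purpose-designed filter or a spectral method.

```python
from typing import List, Sequence


def moving_average(samples: Sequence[float], window: int = 2) -> List[float]:
    """Low-pass filter: average each sample with up to window - 1 predecessors."""
    if window < 1:
        raise ValueError("window must be at least 1")
    filtered = []
    running_sum = 0.0
    for i, sample in enumerate(samples):
        running_sum += sample
        if i >= window:
            # Drop the sample that just left the averaging window.
            running_sum -= samples[i - window]
        # Early samples are averaged over however many samples exist so far.
        filtered.append(running_sum / min(i + 1, window))
    return filtered
```

With a window of 2, a disturbance that alternates sign on every sample cancels out, while a slowly varying signal passes through largely unchanged.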
[0100] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0101] In this solution, after the target audio captured from the target caller is acquired, if the target audio contains the first type of audio data, that is, noise, a first prompt message is output in the call interface asking whether to remove the noise. In response to a removal instruction obtained based on that prompt, the first type of audio data is removed from the specified audio before it is played, which removes noise from the target caller's audio. The solution therefore removes noise from audio and video calls in a user-friendly manner, improving call quality and user experience.
[0102] Optionally, in another embodiment of this disclosure, as shown in Figure 2, the data processing method for audio and video calls provided in this disclosure further includes steps S201-S203;
[0103] S201: Detect whether there is a second type of audio data belonging to unknown audio in the target audio;
[0104] The unknown audio refers to audio that is neither noise nor belongs to the user of the target caller;
[0105] In some scenarios, the target audio may contain other unknown audio besides noise and the target caller's audio, such as audio data generated by network anomalies. In this case, unknown audio can also interfere with audio and video calls. Therefore, it is also possible to detect whether there is a second type of audio data that belongs to unknown audio in the target audio.
[0106] Optionally, detecting whether there is a second type of audio data belonging to unknown audio in the target audio includes:
[0107] Based on a predetermined user feature library and the audio features of other audio data, detect whether there is a second type of audio data belonging to unknown audio in the target audio;
[0108] The other audio data refers to the audio data in the target audio other than the first type of audio data that belongs to noise;
[0109] The user feature library contains audio features of users belonging to the target caller.
[0110] When detecting whether there is a second type of audio data belonging to unknown audio in the target audio, a predetermined user feature library and audio features of other audio data can be used to detect whether there is a second type of audio data in the target audio. Specifically, other audio data refers to audio data in the target audio other than the first type of audio data, which includes user audio data and may also include unknown audio. When identifying unknown audio, each audio feature of other audio data can be matched with each audio feature of the user in the predetermined user feature library. If any audio feature of other audio data fails to match any of the user's audio features, then the audio data to which that audio feature belongs is unknown audio.
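The matching against the user feature library can be sketched as follows: any piece of non-noise audio data that matches none of the user's features is reported as unknown. Cosine similarity, the threshold of 0.8, and all names in the example are assumptions made for illustration and are not specified by this disclosure.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_unknown_audio(
    other_audio: Dict[str, Sequence[float]],
    user_library: Sequence[Sequence[float]],
    threshold: float = 0.8,
) -> List[str]:
    """Return names of non-noise audio data that match no user feature."""
    unknown = []
    for name, feature in other_audio.items():
        matches_user = any(
            cosine_similarity(feature, reference) > threshold
            for reference in user_library
        )
        if not matches_user:
            unknown.append(name)
    return unknown
```

The input `other_audio` is assumed to already exclude the first type of audio data, as the description requires.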
[0111] Additionally, it should be noted that a unique user feature library can be pre-established for each target caller. For the target audio of a given target caller, the user feature library corresponding to that target caller can be used to detect whether the target audio contains any second type of audio data belonging to unknown audio.
[0112] The user feature library of the target caller can be pre-built. It can be obtained through machine learning or deep learning. For example, based on a pre-trained neural network voice recognition model, the voice of the target caller in daily use of software can be modeled to extract the audio features of the target caller, thereby obtaining the user feature library of the target caller.
[0113] By using the user feature library, audio features in the feature library can be matched with audio features of other audio data, thereby quickly detecting whether there is a second type of audio data belonging to unknown audio in the target audio.
[0114] S202: If present, output the second prompt message on the call interface;
[0115] The second prompt message is used to indicate whether to perform weakening processing on the second type of audio data for the target caller.
[0116] If the target audio contains second-type audio data that belongs to an unknown audio category, a second prompt message can be output in the call interface. It should be noted that the method for outputting the second prompt message in the call interface can be similar to the method for outputting the first prompt message described above.
[0117] If the target audio contains both type 1 and type 2 audio data, the first and second prompt messages can be output simultaneously on the call interface. Alternatively, they can be output sequentially in either order. Furthermore, the placement of the first and second prompt messages on the call interface is not limited and can be flexibly adjusted as needed.
[0118] S203: In response to the weakening instruction obtained based on the second prompt information, weakening processing is performed on the specified audio for the second type of audio data before the specified audio is played.
[0119] After the second prompt message is output in the call interface, the caller who receives the second prompt message can issue a weakening command based on the second prompt message. The solution provided in this disclosure can respond to the weakening command obtained based on the second prompt message and weaken the second type of audio data before the specified audio is played, thereby improving the call quality of audio and video calls.
[0120] For example, the party receiving the second prompt message can issue a weakening command by clicking, long-pressing, swiping, or performing another specified operation. This disclosure can then respond to the weakening command and weaken the second type of audio data before the specified audio is played. It should be noted that the second prompt message may include not only a prompt asking whether to weaken the second type of audio data, but also a prompt guiding the user on how to issue the weakening command, so that the user can issue the weakening command based on the second prompt message, thereby weakening the second type of audio data in the specified audio.
[0121] When weakening the second type of audio data, filtering can be used to weaken the second type of audio data that belongs to unknown audio in the specified audio. The second type of audio data can also be removed entirely according to the user's instructions. Any method that can weaken the second type of audio data is applicable to this disclosure, and the specific weakening method is not limited here.
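The following sketch illustrates one possible weakening strategy. It does not use the filtering mentioned above; instead it assumes that the specified audio has already been decomposed into separate sample lists and scales the second type of audio data by a gain factor before remixing. The function name `weaken_and_mix` and the default gain of 0.2 are hypothetical. A gain of 0.0 corresponds to complete removal.

```python
def weaken_and_mix(components, weaken_ids, gain=0.2):
    """Mix decomposed audio data sample by sample, scaling weakened items.

    components: dict mapping an audio-data id to a list of float samples.
    weaken_ids: set of ids to weaken (the second type of audio data).
    gain: scale applied to weakened items; 0.0 removes them entirely.
    """
    if not components:
        return []
    length = max(len(samples) for samples in components.values())
    mixed = [0.0] * length
    for audio_id, samples in components.items():
        scale = gain if audio_id in weaken_ids else 1.0
        for i, sample in enumerate(samples):
            mixed[i] += scale * sample
    return mixed


# Hypothetical decomposed audio: the user's voice plus one unknown sound.
parts = {"user": [0.5, -0.5], "unknown": [1.0, 1.0]}
print(weaken_and_mix(parts, {"unknown"}, gain=0.5))
```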
[0122] Optionally, the method further includes:
[0123] In response to the weakening instruction obtained based on the second prompt information, the audio features of the second type of audio data are added to the noise feature library.
[0124] Understandably, if the party receiving the second prompt message issues a weakening instruction, the second type of audio data, which belongs to unknown audio, will not play a role in audio and video calls. At this time, the second type of audio data can be identified as the first type of audio data, that is, the unknown audio can be identified as noise, and the audio features of the second type of audio data can be added to the noise feature library to expand the noise feature library. This will improve the accuracy of noise identification when using the noise library to identify noise in the future.
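A minimal sketch of this library expansion follows. It assumes that the noise feature library is an in-memory list of feature vectors. The class `NoiseFeatureLibrary` and the function `handle_weakening_instruction` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NoiseFeatureLibrary:
    """Assumed in-memory store of feature vectors for known noise."""
    features: list = field(default_factory=list)

    def add(self, feature):
        """Add a feature vector, skipping exact duplicates."""
        feature = list(feature)
        if feature not in self.features:
            self.features.append(feature)


def handle_weakening_instruction(unknown_features, noise_library):
    """On a weakening instruction, treat the unknown audio as noise from now on.

    Returns the size of the library after the features have been added.
    """
    for feature in unknown_features:
        noise_library.add(feature)
    return len(noise_library.features)


library = NoiseFeatureLibrary(features=[[0.9, 0.1]])
print(handle_weakening_instruction([[0.2, 0.7], [0.9, 0.1]], library))
```

In this sketch, the duplicate check keeps the library from growing when the same unknown audio is weakened repeatedly.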
[0125] By detecting the second type of audio data of unknown audio, outputting a second prompt message, and subsequently weakening the second type of audio data, it is possible to further identify useless audio in the target audio and weaken or even remove the second type of audio data. This can weaken unknown audio in audio and video calls in a friendly way, thereby further improving the call quality and enhancing the user experience during audio and video calls.
[0126] It should be noted that the terms "first" and "second" in "first type of audio data, second type of audio data, first prompt information and second prompt information" are only used to distinguish different audio data and prompt information in terms of naming, and do not have any limiting meaning.
[0127] To facilitate understanding of the methods provided in this disclosure, a specific example is provided below to illustrate the methods provided in this disclosure.
[0128] To implement the method provided in this disclosure, there are two preprocessing stages: noise library construction and user voice feature library construction.
[0129] Noise database construction: this stage builds a noise database and corresponds to the method described above for building a noise feature library. Noise has relatively obvious characteristics: it is typically a short-duration, transient, high-frequency sound disturbance, and it differs significantly from the tone of the speaker's voice during a call. Examples include the sound of a power drill during renovations and the sound of a hair dryer. A feature database of everyday noises can be created and stored for subsequent noise identification. The noise database construction process does not rely on user information; common everyday noises can be labeled and stored to build a noise feature database.
[0130] User voice feature database construction: this stage builds a user voice feature database and corresponds to the method described above for building a user feature library. Different people have different voice characteristics. When establishing a voice feature database for any given user, a pre-trained neural network voice recognition model can be used. The voices collected from the user's daily software usage are used as input to the neural network voice recognition model to extract various voice features of that user, and these features are then categorized to obtain the voice feature database for that user. As the duration of a user's software usage increases, the collected user voices become richer, and the user voice feature database contains a wider range of user voice features, which can improve the accuracy of subsequent noise reduction processing.
[0131] After the preprocessing stage is completed, the aforementioned noise library and user voice feature library can be used for data processing:
[0132] Before the call begins, ambient noise is detected. Typically, there's a dialing phase before the call starts, where the user waits to connect – the pre-call phase. This solution fully utilizes this idle period by collecting audio from the caller's surroundings and matching it against a pre-established noise database and user voice feature database to identify noise. If noise is detected, the user is prompted to remove it. If unknown audio is detected, further detection is performed to determine if it's noise, or a user-friendly graphical interface prompts the user that it might be noise and suggests attenuation. This removes noise and unknown audio present before the call, improving audio and video call quality and enhancing the user experience. Collecting ambient audio corresponds to the target audio collected during the pre-call phase; matching the collected audio with the noise database and user voice feature database corresponds to the noise and unknown audio detection methods; and the user-friendly graphical interface corresponds to the steps of outputting the first and second prompts in the call interface.
[0133] Noise detection and processing during calls: Due to the short-lived nature of noise, some noise may not be detected before the call begins but may appear during the call, severely degrading call quality. Therefore, noise detection can be performed during the call, using the noise database and user voice feature database from the previous step. When collecting audio data during the call, audio can be collected at specific time intervals. The collected audio features, along with the noise and user feature databases, are used to identify noise and unknown sounds. Subsequently, user-friendly prompts can be used to remove or weaken noise or unknown sounds, thereby improving call quality and user experience. This corresponds to the steps described above: identifying noise and unknown audio using the noise and user feature databases, outputting first and second prompts on the call interface, and then removing noise and weakening unknown audio.
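The interval-based collection during a call can be sketched as follows. The sketch only computes which sample ranges would be analyzed. The function name `sample_windows` and the interval, window length and sample rate are hypothetical values, not taken from this disclosure.

```python
def sample_windows(total_samples, sample_rate, interval_s, window_s):
    """Return (start, end) sample indices of the windows to analyze.

    A window of window_s seconds is taken every interval_s seconds.
    Windows that would run past the end of the audio are skipped.
    """
    step = int(interval_s * sample_rate)
    size = int(window_s * sample_rate)
    if step <= 0 or size <= 0:
        raise ValueError("interval and window must cover at least one sample")
    windows = []
    start = 0
    while start + size <= total_samples:
        windows.append((start, start + size))
        start += step
    return windows


# Ten seconds of 16 kHz audio, analyzing 2 s of audio every 4 s.
print(sample_windows(160000, 16000, interval_s=4, window_s=2))
```

Each window in this sketch would then be passed to the noise and unknown audio detection described above.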
[0134] Audio data processing after a call ends: The end of a call does not mean the audio and video data is free of noise. Since most software supporting audio and video calls backs up the call content, and this backed-up content is played multiple times, it is essential to detect and process noise in the backed-up audio and video data. At this point, the same methods described above can be used to identify whether noise or unknown audio is present in the backed-up audio and video data. A user-friendly prompt can then be provided to the user using the backed-up audio and video data to remove existing noise and reduce unknown audio, thereby improving the quality of the backed-up audio and video data and enhancing the user experience.
[0135] The data processing method for audio and video calls disclosed herein detects noise and unknown audio before, during, and after a call by building a noise database and establishing a user voice feature database. It removes noise and weakens unknown audio in a user-friendly manner, thereby improving call quality and enhancing user experience in all aspects of audio and video calls.
[0136] A data processing method for audio and video calls provided in this disclosure is described in detail below with reference to a specific embodiment.
[0137] As shown in Figure 3, in the embodiments of this disclosure, a data processing method for audio and video calls may include: sound source detection, spectrogram analysis, and sound quality enhancement.
[0138] Sound source detection; that is, the steps described above for collecting audio from the target caller and obtaining the target audio; the sound source can contain a variety of sounds, such as microphone sound, human speech, the sound of electric drills, and the sound of knocking on the ceiling.
[0139] Spectral analysis; that is, analyzing the detected sound source to obtain the various sounds contained in the sound source; corresponding to the above steps of performing specified audio decomposition on the target audio to obtain various audio data; in this embodiment, the various sounds in the sound source can be decomposed by frequency to obtain multiple sounds; that is, by recognizing the various sound contents of the sound source, the audio is stripped and analyzed, and audio with different characteristics is distinguished to obtain various audio.
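The frequency-based decomposition mentioned in this embodiment can be illustrated with the sketch below. It is a simplification under stated assumptions: it uses a naive discrete Fourier transform (quadratic cost, suitable only for short illustrative signals), the band edges are hypothetical, and it does not perform the decomposition by sound source described elsewhere in this disclosure.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sample list."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Naive inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_by_band(signal, sample_rate, band_edges_hz):
    """Split a signal into one component per [low, high) frequency band."""
    n = len(signal)
    spectrum = dft(signal)
    components = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        masked = []
        for k, value in enumerate(spectrum):
            # Bins k and n - k carry the same physical frequency.
            freq = min(k, n - k) * sample_rate / n
            masked.append(value if low <= freq < high else 0)
        components.append(inverse_dft(masked))
    return components


# A 1 Hz tone plus a 3 Hz tone, sampled at 8 Hz for one second.
tone_low = [math.sin(2 * math.pi * 1 * t / 8) for t in range(8)]
tone_high = [math.sin(2 * math.pi * 3 * t / 8) for t in range(8)]
mixture = [a + b for a, b in zip(tone_low, tone_high)]
low_part, high_part = split_by_band(mixture, sample_rate=8, band_edges_hz=[0, 2, 5])
```

In this example, `low_part` approximates the 1 Hz tone and `high_part` approximates the 3 Hz tone, up to floating-point error.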
[0140] Audio quality enhancement; that is, improving the sound quality of audio and video calls, which can include multiple steps: voice recognition, noise labeling, noise modeling, filtering, and audio output.
[0141] Among them, voice recognition involves extracting the audio features corresponding to each audio item in the audio analysis results. Using a pre-established user voice feature database, the audio belonging to the speaker in each audio item can be identified. When performing voice recognition, it is possible to analyze separately according to different call scenarios. For example, in a teaching scenario, there is usually only one party speaking, so only the audio of that party needs to be identified. In a scenario of mutual communication, there are multiple parties speaking. In this case, there can be an audio analysis result for each party speaking, as well as the user voice feature database corresponding to that party's user. Each audio item in each audio analysis result can then be analyzed.
[0142] Noise labeling refers to the process of identifying and labeling noise in audio samples after human voice recognition, which may contain noise and unknown audio. This can be achieved by comparing the noise features of a pre-established noise database with those of other audio samples besides the recognized user audio.
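Noise labeling against a labeled noise database can be sketched as a nearest-match lookup. This is an illustrative sketch only: it assumes that each database entry is a label mapped to one feature vector, that closeness is measured by Euclidean distance, and that the function name `label_noise` and the distance limit 0.5 are hypothetical.

```python
import math


def label_noise(feature, noise_database, max_distance=0.5):
    """Return the label of the closest noise entry, or None if none is close.

    noise_database: dict mapping a noise label to its feature vector.
    max_distance: assumed Euclidean distance limit for a valid match.
    """
    best_label = None
    best_distance = max_distance
    for label, reference in noise_database.items():
        distance = math.dist(feature, reference)
        if distance <= best_distance:
            best_label, best_distance = label, distance
    return best_label


# Hypothetical two-dimensional features for two everyday noises.
noise_db = {"electric drill": [0.9, 0.1], "hair dryer": [0.1, 0.9]}
print(label_noise([0.85, 0.15], noise_db))
print(label_noise([0.5, 0.5], noise_db))
```

In this sketch, audio that receives no label and was not recognized as the user's voice would be the candidate unknown audio handled in the noise modeling step.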
[0143] Noise modeling involves identifying unknown audio besides noise and user audio. If the user issues a command to weaken the unknown audio, the audio features of the unknown audio can be added to the noise database to expand the features of the noise database in the call scenario.
[0144] Filtering involves removing identified noise through filtering and weakening unknown audio that is not part of human voice or noise. During noise removal, the audio segments identified as belonging to human voice are merged from the analyzed audio to obtain the subsequent output audio, thus completing the noise removal process for the call's audio and video.
[0145] Audio output refers to sending the filtered audio to the audio receiver. Streaming media technology can be used to repackage the audio content and send it to the receiver in segments.
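Sending the filtered audio in segments can be sketched as below. The sketch covers only the splitting step, not any particular streaming media protocol. The function name `segment_audio` and the segment layout, a sequence number plus a payload, are assumptions made for illustration.

```python
def segment_audio(samples, segment_size):
    """Split filtered audio into numbered segments for transmission.

    Each segment is a dict with a sequence number and a payload, so that
    the receiver can restore the original order.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        {"seq": start // segment_size, "payload": samples[start:start + segment_size]}
        for start in range(0, len(samples), segment_size)
    ]


def reassemble(segments):
    """Rebuild the sample list from segments received in any order."""
    ordered = sorted(segments, key=lambda segment: segment["seq"])
    return [sample for segment in ordered for sample in segment["payload"]]


segments = segment_audio([0, 1, 2, 3, 4], segment_size=2)
print(segments)
```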
[0146] The data processing method for audio and video calls disclosed herein can detect the sound of a sound source and obtain the individual audio components through audio analysis. Subsequently, the sound quality of each audio component can be improved, noise can be removed in a user-friendly manner, and unknown audio can be weakened. This method can remove noise in audio and video calls in a user-friendly manner, thereby comprehensively improving the call quality during audio and video calls and enhancing the user experience.
[0147] Based on the above method embodiments, this disclosure also provides a data processing apparatus for audio and video calls. As shown in Figure 4, the device includes:
[0148] The acquisition module 410 is used to acquire the target audio obtained by audio capture of the target caller;
[0149] The first detection module 420 is used to detect whether there is a first type of audio data that belongs to noise in the target audio;
[0150] The first output module 430 is used to output a first prompt message in the call interface if the condition exists; wherein the first prompt message is used to prompt whether to perform noise removal processing on the target caller for the first type of audio data.
[0151] The removal module 440 is configured to, in response to a removal instruction obtained based on the first prompt information, perform removal processing on the specified audio for the first type of audio data before the specified audio is played.
[0152] The specified audio is the audio obtained by collecting audio from the target caller and is to be played by the other end of the target caller.
[0153] In this solution, after obtaining the target audio from the target caller, if the target audio contains noise (Type I audio data), a user-friendly prompt message is displayed on the call interface to ask if noise removal is desired. Responding to the removal command based on the prompt message, the specified audio undergoes noise removal processing for Type I audio data before playback, thus achieving noise reduction for the target caller's audio. Therefore, this solution effectively removes noise from audio and video calls in a user-friendly manner, improving call quality and user experience.
[0154] Optionally, the device further includes:
[0155] The second detection module is used to detect whether there is a second type of audio data belonging to unknown audio in the target audio; wherein, the unknown audio is audio that is not noise and does not belong to the user of the target caller;
[0156] The second output module is used to output a second prompt message in the call interface if it exists; wherein the second prompt message is used to prompt whether to perform weakening processing on the second type of audio data for the target caller;
[0157] A weakening module is used to weaken the specified audio for the second type of audio data before the specified audio is played, in response to a weakening instruction obtained based on the second prompt information.
[0158] Optionally, the acquisition module is specifically used for:
[0159] Acquire the target audio obtained by audio capture of the target caller during a specified call phase;
[0160] The specified call phase includes the call phase before the call begins, and / or the call phase during the call.
[0161] Optionally, the call interface is the call interface of the target caller, and / or the call interface of the counterparty of the target caller.
[0162] Optionally, the first detection module includes:
[0163] The detection submodule is used to detect whether there is any first-class audio data belonging to noise in the target audio based on a predetermined noise feature library;
[0164] The noise feature library contains audio features of audio data that belong to noise.
[0165] Optionally, the detection submodule is specifically used for:
[0166] Obtain the various audio data obtained after performing a specified audio decomposition on the target audio; wherein, the specified audio decomposition is a method of decomposing according to different sound sources;
[0167] Based on a predetermined noise feature library and the audio features of each audio data, it is determined whether there is a first type of audio data belonging to noise in the target audio.
[0168] Optionally, the second detection module is specifically used for:
[0169] Based on the audio features of a pre-defined user feature library and other audio data, detect whether there is a second type of audio data belonging to unknown audio in the target audio;
[0170] The other audio data refers to the audio data in the target audio other than the first type of audio data that belongs to noise.
[0171] The user feature database contains audio features of users belonging to the target caller.
[0172] Optionally, the device further includes:
[0173] An adding module is used to add the audio features of the second type of audio data to the noise feature library in response to the weakening instruction obtained based on the second prompt information.
[0174] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0175] This disclosure provides an electronic device, including:
[0176] At least one processor; and
[0177] A memory communicatively connected to the at least one processor; wherein,
[0178] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform any of the data processing methods for audio and video calls described above.
[0179] This disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to perform any of the data processing methods for audio and video calls described above.
[0180] This disclosure provides a computer program product, including a computer program that, when executed by a processor, implements any of the data processing methods for audio and video calls described above.
[0181] Figure 5 shows a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0182] As shown in Figure 5, device 500 includes a computing unit 501, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 502 or a computer program loaded from storage unit 508 into random access memory (RAM) 503. RAM 503 may also store various programs and data required for the operation of device 500. The computing unit 501, ROM 502, and RAM 503 are interconnected via bus 504. Input / output (I / O) interface 505 is also connected to bus 504.
[0183] Multiple components in device 500 are connected to I / O interface 505, including: input unit 506, such as keyboard, mouse, etc.; output unit 507, such as various types of monitors, speakers, etc.; storage unit 508, such as disk, optical disk, etc.; and communication unit 509, such as network card, modem, wireless transceiver, etc. Communication unit 509 allows device 500 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0184] The computing unit 501 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as data processing methods for audio and video calls. For example, in some embodiments, the data processing methods for audio and video calls can be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program can be loaded and / or installed on device 500 via ROM 502 and / or communication unit 509. When the computer program is loaded into RAM 503 and executed by the computing unit 501, one or more steps of the data processing methods for audio and video calls described above can be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform data processing methods for audio and video calls by any other suitable means (e.g., by means of firmware).
[0185] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0186] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0187] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0188] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0189] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0190] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other.
[0191] The server can be a cloud server, a server for a distributed system, or a 5G server that incorporates blockchain technology.
[0192] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.
[0193] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A data processing method for audio and video calls, comprising: Acquire the target audio obtained by audio capture of the target caller during a specified call phase; Based on a predetermined noise feature library, detect whether there is any first-class audio data belonging to noise in the target audio; If present, a first prompt message is output in the call interface; wherein, the first prompt message is used to prompt whether to perform noise removal processing on the target caller’s first type of audio data; In response to the removal instruction obtained based on the first prompt information, the specified audio is subjected to removal processing for the first type of audio data before the specified audio is played; The specified audio is the audio obtained by audio acquisition of the target caller and is to be played by the other end of the target caller. The method further includes: Based on the audio features of a predetermined user feature library and other audio data, the system detects whether there is a second type of audio data belonging to unknown audio in the target audio; wherein, the unknown audio is audio that is not noise and does not belong to the user of the target caller; the other audio data is: audio data in the target audio other than the first type of audio data belonging to noise; the user feature library corresponds to the target caller and contains audio features belonging to the user of the target caller; If present, a second prompt message is output in the call interface; wherein, the second prompt message is used to prompt whether to perform weakening processing on the second type of audio data for the target caller; In response to the weakening instruction obtained based on the second prompt information, the specified audio is weakened for the second type of audio data before the specified audio is played; In response to the weakening instruction obtained based on the second prompt information, the audio features of the second type of audio data are added to the noise feature library.
2. The method according to claim 1, wherein, The designated call phase includes the call phase before the call begins, and / or, during the call.
3. The method according to claim 1, wherein, The call interface is the call interface of the target caller, and / or the call interface of the counterparty of the target caller.
4. The method according to claim 1, wherein detecting whether there is first-class audio data belonging to noise in the target audio based on a predetermined noise feature library includes: Obtain the various audio data obtained after performing a specified audio decomposition on the target audio; wherein, the specified audio decomposition is a method of decomposing according to different sound sources; Based on a predetermined noise feature library and the audio features of each audio data, it is determined whether there is a first type of audio data belonging to noise in the target audio.
5. A data processing device for audio and video calls, comprising: The acquisition module is used to acquire the target audio obtained by audio capture of the target caller within a specified call phase; The first detection module is used to detect whether there is a first type of audio data belonging to noise in the target audio based on a predetermined noise feature library; The first output module is used to output a first prompt message in the call interface if the condition exists; wherein the first prompt message is used to prompt whether to perform noise removal processing on the target caller for the first type of audio data; The removal module is used to, in response to the removal instruction obtained based on the first prompt information, perform removal processing on the specified audio for the first type of audio data before the specified audio is played; The specified audio is the audio obtained by collecting audio from the target caller and is to be played by the other end of the target caller. The device further includes: The second detection module is used to detect whether there is a second type of audio data belonging to unknown audio in the target audio based on the audio features of a predetermined user feature library and other audio data; wherein, the unknown audio is audio that is not noise and does not belong to the user of the target caller; the other audio data is: audio data in the target audio other than the first type of audio data belonging to noise; the user feature library corresponds to the target caller and contains audio features belonging to the user of the target caller; The second output module is used to output a second prompt message in the call interface if it exists; wherein the second prompt message is used to prompt whether to perform weakening processing on the second type of audio data for the target caller; A weakening module is used to respond to a weakening instruction obtained based on the second prompt information, and to perform weakening processing on the specified audio for the second type of audio data before the specified audio is played; An adding module is used to add the audio features of the second type of audio data to the noise feature library in response to the weakening instruction obtained based on the second prompt information.
6. An electronic device, comprising: At least one processor; and A memory communicatively connected to the at least one processor; wherein, The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the data processing method for audio and video calls as described in any one of claims 1-4.
7. A non-transitory computer-readable storage medium storing computer instructions, wherein, The computer instructions are used to cause the computer to perform the data processing method for audio and video calls according to any one of claims 1-4.
8. A computer program product comprising a computer program that, when executed by a processor, implements the data processing method for audio and video calls according to any one of claims 1-4.