Audio similarity determination method and apparatus, and storage medium
A technique for determining audio similarity, applied in the field of communications, which addresses problems of existing solutions such as narrow applicability, inapplicability to certain audio, and the inability to extract MIDI feature files, thereby achieving the effect of improved applicability
Pending Publication Date: 2018-05-11
TENCENT TECH (SHENZHEN) CO LTD
Problems solved by technology
[0004] During research and practice of the prior art, the inventors of the present invention found that, since a MIDI feature file mainly reflects the pitch and frequency of the audio at each sampling point, for songs the MIDI features will be more obv...
Abstract
Embodiments of the present invention disclose an audio similarity determination method and apparatus, and a storage medium. The method may comprise: performing normalization processing and high-pass filtering on first audio data and second audio data respectively, determining the short-time energy distributions of the first audio data and the second audio data respectively, and calculating the similarity between the first audio data and the second audio data based on the obtained short-time energy distributions. The scheme of the embodiments of the present invention not only can effectively and accurately calculate the similarity but can also be applied to most application scenarios, improving the applicability of the scheme.
Technology Topic: Speech recognition; Energy distribution
Examples
Example Embodiment
[0041] Embodiment 1.
[0042] In this embodiment, description will be made from the perspective of an apparatus for determining audio similarity, which may specifically be integrated in a server or other equipment.
[0043] A method for determining audio similarity, comprising: acquiring first audio data and second audio data; respectively performing normalization processing and high-pass filtering on the first audio data and the second audio data to obtain first filtered data corresponding to the first audio data and second filtered data corresponding to the second audio data; determining the short-term energy distributions of the first filtered data and the second filtered data respectively, to obtain first distribution information corresponding to the first filtered data and second distribution information corresponding to the second filtered data; and calculating the similarity between the first audio data and the second audio data based on the first distribution information and the second distribution information.
[0044] As shown in Figure 1b, the specific process of the method for determining audio similarity may be as follows:
[0045] 101. Acquire first audio data and second audio data.
[0046] For example, the first audio file may be acquired, the first audio data may be extracted from the first audio file, and the second audio file may be acquired, and the second audio data may be extracted from the second audio file, and so on.
[0047] Optionally, in order to reduce interference, reduce differences between audio files caused by interference, and improve the accuracy of the calculation, the audio files can be transcoded and their parameter formats unified when extracting the audio data; that is, optionally, the step of "acquiring the first audio data and the second audio data" may include:
[0048] obtaining the first audio file, transcoding the first audio file according to a preset transcoding strategy, setting preset parameters in the transcoded first audio file according to a preset parameter setting rule, and extracting the first audio data from the first audio file after the parameters are set;
[0049] and obtaining the second audio file, transcoding the second audio file according to the preset transcoding strategy, setting preset parameters in the transcoded second audio file according to the preset parameter setting rule, and extracting the second audio data from the second audio file after the parameters are set.
[0050] The preset transcoding strategy and preset parameter setting rules can be set according to actual application requirements. For example, the audio files (including the first audio file and the second audio file) can be converted into the uncompressed wav (a sound file format) format, and the parameters can be set to: a sampling frequency of 44100 Hz, a bit rate of 96k, and a mono channel, and so on.
[0051] For example, take the case where the first audio file is the original audio file of a character's dubbing and the second audio file is a user audio file recorded by the user for that character. The original audio file and the user audio file can each be converted into the uncompressed wav format, the sampling frequency of the converted files set to 44100 Hz, the bit rate to 96k, and the channel to mono, and so on. Then, the audio data is extracted from the original audio file after the parameters are set to obtain the first audio data, and from the user audio file after the parameters are set to obtain the second audio data.
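As a concrete illustration of this transcoding step, the following Python sketch (a hypothetical helper, assuming the ffmpeg binary is installed and on the PATH) converts an input file to uncompressed 16-bit PCM wav at a 44100 Hz sampling frequency with a mono channel, mirroring the transcoding instruction given later in Embodiment 2:

    import subprocess

    def transcode_to_wav(src_path, dst_path):
        # Convert any input audio into uncompressed 16-bit PCM wav,
        # 44100 Hz sampling frequency, mono channel.
        subprocess.run([
            "ffmpeg", "-y",          # overwrite the output if it exists
            "-i", src_path,          # input audio file
            "-ar", "44100",          # sampling frequency
            "-ac", "1",              # mono channel
            "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
            dst_path,
        ], check=True)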
[0052] 102. Perform normalization processing and high-pass filtering on the first audio data and the second audio data, respectively, to obtain first filtered data corresponding to the first audio data and second filtered data corresponding to the second audio data. For example, it can be as follows:
[0053] (1) The first audio data and the second audio data are sampled respectively to obtain the first sampling point set corresponding to the first audio data and the second sampling point set corresponding to the second audio data, specifically as follows:
[0054] A1. Sampling the first audio data to obtain a first set of sampling points.
[0055] The sampling method can be determined according to actual application requirements. For example, a signed number can be read every 16 bits as one sampling point, and the obtained sampling points added to the same set to obtain the first sampling point set.
[0056] For another example, it is also possible to perform sampling by reading a signed number every 8 bits as a sampling point, and adding the obtained sampling points to the same set to obtain the first sampling point set, and so on.
[0057] A2. Sampling the second audio data to obtain a second set of sampling points.
[0058] Similar to the sampling of the first audio data, the second audio data can be sampled in various ways. For example, a signed number can be read every 16 bits as one sampling point, and the obtained sampling points added to the same set to obtain the second sampling point set. For another example, a signed number can be read every 8 bits as one sampling point, and so on; the sampling method can be set according to actual application requirements.
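A minimal sketch of the 16-bit sampling just described, assuming the audio has already been transcoded to 16-bit signed PCM wav (the function name is hypothetical):

    import wave
    import numpy as np

    def sample_wav(path):
        # Read every 16 bits as one signed number (one sampling point)
        # and collect all sampling points in a single array (the "set").
        with wave.open(path, "rb") as wf:
            raw = wf.readframes(wf.getnframes())
        return np.frombuffer(raw, dtype=np.int16)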
[0059] It should be noted that, the steps A1 and A2 may be executed in no particular order.
[0060] (2) Normalize all sampling points in the first sampling point set and all sampling points in the second sampling point set, respectively, to obtain first processed data corresponding to the first sampling point set and second processed data corresponding to the second sampling point set; for example, the details may be as follows:
[0061] B1. Perform normalization processing on all sampling points in the first sampling point set to obtain first processed data.
[0062] For example, the absolute maximum value (also referred to as the maximum value, i.e., max-value) over all sampling points in the first sampling point set may be calculated, and then all sampling points in the first sampling point set normalized by this absolute maximum value to obtain the first processed data.
[0063] Here, normalization refers to converting the signals of these sampling points into a unified standard form. For example, since the amplitudes of the sampling points are distributed over a relatively wide range, normalization can be used to adjust the amplitudes of these sampling points into a preset interval, and so on. That is, the step of "normalizing all sampling points in the first sampling point set by the absolute maximum value to obtain the first processed data" may specifically be:
[0064] The amplitudes of all the sampling points in the first sampling point set are adjusted to be within a preset interval according to the maximum absolute value, so as to obtain the first processed data.
[0065] The preset interval can be set according to actual application requirements. For example, taking the preset interval as [-1, 1], the following formula can be used to normalize all sampling points in the first sampling point set:
[0066] x_t(i) = x(i) / max_value
[0067] where x_t(i) is the normalized amplitude of the i-th sampling point, whose range is [-1, 1], x(i) is the original amplitude of the i-th sampling point, whose range is generally [-32768, 32767], and max_value is the absolute maximum value calculated above.
[0068] After the amplitudes of all sampling points in the first sampling point set are adjusted according to the above normalization formula, the first processed data x(n) is obtained.
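A minimal sketch of this normalization, under the assumption that the samples are 16-bit signed integers and at least one sample is nonzero (function name hypothetical):

    import numpy as np

    def normalize(samples):
        # x_t(i) = x(i) / max_value: scale amplitudes into [-1, 1]
        # by the absolute maximum over all sampling points.
        max_value = np.max(np.abs(samples))
        return samples.astype(np.float64) / max_value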
[0069] B2. Perform normalization processing on all sampling points in the second sampling point set to obtain second processed data.
[0070] For example, the absolute maximum value (also referred to as the maximum value, i.e., max-value) over all sampling points in the second sampling point set may be calculated, and then all sampling points in the second sampling point set normalized by this absolute maximum value to obtain the second processed data.
[0071] As before, normalization refers to converting the signals of these sampling points into a unified standard form, for example adjusting the amplitudes of the sampling points into a preset interval. That is, the step of "normalizing all sampling points in the second sampling point set by the absolute maximum value to obtain the second processed data" may specifically be:
[0072] The amplitudes of all the sampling points in the second sampling point set are adjusted to be within the preset interval according to the maximum absolute value, so as to obtain the second processed data.
[0073] The preset interval can be set according to actual application requirements. For example, taking the preset interval as [-1, 1], the following formula can be used to normalize all sampling points in the second sampling point set:
[0074] x_t(i) = x(i) / max_value
[0075] where x_t(i) is the normalized amplitude of the i-th sampling point, whose range is [-1, 1], and x(i) is the original amplitude of the i-th sampling point, whose range is generally [-32768, 32767].
[0076] After the amplitudes of all sampling points in the second sampling point set are adjusted according to the above normalization formula, the second processed data is obtained. Since the normalization formula used here is the same as the one used for the first processed data, x(n) is also used in this step to denote the second processed data. It should be understood that the parameters in the formulas of the embodiments of the present invention are generic rather than tied to specific data; for example, x(n) here denotes whatever data is obtained by adjusting the amplitudes of all sampling points in a given sampling point set according to the above normalization formula, not only the first processed data or the second processed data. Subsequent parameters such as y(n) are treated similarly and will not be explained again.
[0077] It should be noted that, the execution of steps B1 and B2 may be performed in no particular order.
[0078] (3) Filter the first processed data and the second processed data by using a high-pass filter, respectively, to obtain first filtered data corresponding to the first audio data and second filtered data corresponding to the second audio data. For example, it can be as follows:
[0079] C1. Use a high-pass filter to filter the first processed data to obtain the first filtered data.
[0080] For example, a first-order high-pass filter can be directly used, for example, a first-order high-pass filter of 6dB/octave is used to filter the first processed data to obtain the first filtered data.
[0081] In addition, the average power spectrum of a speech signal (such as the first processed data) is affected by glottal excitation and radiation from the mouth and nose: after the speech signal is radiated from the lips, its high-frequency end is attenuated by 6 dB/octave above 800 Hz. Therefore, optionally, before filtering, the speech signal may be boosted; this boosting is called "pre-emphasis". The purpose of pre-emphasis is to strengthen the high-frequency part, weaken the low frequencies, and flatten the signal spectrum so that subsequent spectrum analysis and channel parameter analysis can be performed. That is, the step of "using a high-pass filter to filter the first processed data to obtain the first filtered data" may specifically be:
[0082] Pre-emphasis is performed on the first processed data, and a high-pass filter is used to filter the pre-emphasized first processed data to obtain first filtered data.
[0083] For example, a first-order high-pass filter, such as a 6 dB/octave first-order high-pass filter, can be used to pre-emphasize the first processed data and filter the pre-emphasized first processed data to obtain the first filtered data, expressed by the formula:
[0084] y(n)=1.0*x(n)-u*x(n-1)
[0085] Wherein, in this step, y(n) is the first filtered data, x(n) is the first processed data, and u is the pre-emphasis coefficient. The value of u can be determined according to the requirements of the actual application, and the value range of u is [0.9, 1.0], for example, it can be 0.9375, and so on.
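A minimal sketch of this pre-emphasis filter, assuming the input is the normalized sample array and taking the first output sample equal to the first input sample since it has no predecessor (an assumption; the patent does not specify the boundary handling):

    import numpy as np

    def pre_emphasis(x, u=0.9375):
        # y(n) = 1.0*x(n) - u*x(n-1), with u in [0.9, 1.0].
        y = np.empty_like(x)
        y[0] = x[0]              # no predecessor for the first sample
        y[1:] = x[1:] - u * x[:-1]
        return y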
[0086] C2. Use a high-pass filter to filter the second processed data to obtain second filtered data.
[0087]For example, a first-order high-pass filter may be directly used, for example, a first-order high-pass filter of 6dB/octave is used to filter the second processed data to obtain the second filtered data.
[0088] Optionally, in order to strengthen the high-frequency part, weaken the low frequencies, and flatten the signal spectrum so that spectrum analysis and channel parameter analysis can be performed later, the second processed data can also be pre-emphasized; that is, the step of "using a high-pass filter to filter the second processed data to obtain the second filtered data" may include:
[0089] The second processed data is pre-emphasized, and a high-pass filter is used to filter the pre-emphasized second processed data to obtain second filtered data.
[0090] For example, a first-order high-pass filter, such as a 6 dB/octave first-order high-pass filter, can be used to pre-emphasize the second processed data and filter the pre-emphasized second processed data to obtain the second filtered data, expressed by the formula:
[0091] y(n)=1.0*x(n)-u*x(n-1)
[0092] Wherein, in this step, y(n) is the second filtered data, x(n) is the second processed data, and u is the pre-emphasis coefficient. The value of u can be determined according to the requirements of the actual application, and the value range of u is [0.9, 1.0], for example, it can be 0.9375, and so on.
[0093] It should be noted that, the steps C1 and C2 may be executed in no particular order.
[0094] 103. Determine the short-term energy distributions of the first filtered data and the second filtered data respectively, and obtain first distribution information corresponding to the first filtered data and second distribution information corresponding to the second filtered data; for example, the details can be as follows:
[0095] (1) Determine the short-term energy distribution of the first filtered data to obtain first distribution information.
[0096] Optionally, since the first filtered data is very long and difficult to process at one time, the first filtered data may be processed in segments. For example, the first filtered data can be segmented, the short-term energy distribution of each segment determined separately, and the short-term energy distributions of all segments aggregated to obtain the first distribution information, and so on.
[0097] Optionally, since the segmented data has no obvious periodicity, subsequent convolution is inconvenient. Therefore, a Hamming window can be used when segmenting; in this way, the segmented data has obvious periodicity, with the data in one window representing one cycle. That is, the step of "determining the short-term energy distribution of the first filtered data to obtain the first distribution information" can specifically be as follows:
[0098] A Hamming window function is obtained, a dot-multiplication operation is performed on the first filtered data (multiplying the data with itself point by point, i.e., squaring it), and the result of the operation is convolved with the Hamming window function to obtain the first distribution information.
[0099] For the n-th frame signal y_n(m) of audio data y(n) (such as the first filtered data), the following relationship holds:
[0100] y_n(m) = w(n - m) * y(m)
[0101] where 0 ≤ m ≤ N - 1
[0102] w(n) = 1 for 0 ≤ n ≤ N - 1, and w(n) = 0 otherwise
[0103] where n = 0, 1T, 2T, ..., N is the frame length, and T is the frame shift length.
[0104] If the short-term energy of the n-th frame signal y_n(m) is denoted by e_n, then it is expressed by the formula:
[0105] e_n = sum_{m=0}^{N-1} [w(n - m) * y(m)]^2
[0106] Therefore, the short-term energy E_n of the audio data y(n) (such as the first filtered data) is:
[0107] E_n = sum_m y^2(m) * h(n - m)
[0108] Among them, h(n-m) is the Hamming window function (referred to as Hamming window).
[0109] It should be noted that after the Hamming window is applied, the data in the middle of the window is reflected while the data on both sides is attenuated and lost. Therefore, when doing the convolution, the window can only be moved by 1/3 or 1/2 of its length at a time; in this way, the data lost by the previous frame or the previous two frames is reflected in the window again, which avoids data loss.
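A minimal sketch of this short-time energy computation, assuming a 1024-point Hamming window and a hop of half the window (both hypothetical choices; the text only requires a hop of 1/3 or 1/2 of the window):

    import numpy as np

    def short_time_energy(y, win_len=1024, hop_ratio=0.5):
        # E_n = sum_m y^2(m) * h(n - m): convolve the squared signal
        # with the Hamming window, then keep one value per hop so
        # successive windows overlap and no data is lost at the edges.
        h = np.hamming(win_len)
        energy = np.convolve(y ** 2, h, mode="same")
        hop = int(win_len * hop_ratio)
        return energy[::hop]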
[0110] (2) Determine the short-term energy distribution of the second filtered data to obtain second distribution information.
[0111] Similar to the processing of the first filtered data, since the second filtered data is very long and difficult to process at one time, the second filtered data can be processed in segments. For example, the second filtered data may be segmented, the short-term energy distribution of each segment determined separately, and the short-term energy distributions of all segments aggregated to obtain the second distribution information, and so on.
[0112] Optionally, in order to make the data obtained by segmentation have obvious periodicity and facilitate subsequent convolution, a Hamming window can be used to perform segmentation, wherein the data in one window represents one cycle. That is, the step "determine the short-term energy distribution of the second filtered data, and obtain the second distribution information" can be specifically as follows:
[0113] A Hamming window function is obtained (consistent with the Hamming window function used for the first filtered data), a dot-multiplication operation is performed on the second filtered data, and the result of the operation is convolved with the Hamming window function to obtain the second distribution information, expressed by the formula:
[0114] E_n = sum_m y^2(m) * h(n - m)
[0115] where y(n) is the second filtered data and h(n-m) is the Hamming window function (referred to as the Hamming window); for the detailed analysis of this formula, refer to the analysis for the first filtered data above, which is not repeated here.
[0116] It should be noted that, similar to the processing of the first filtered data, after the Hamming window is applied, the window can only be moved by 1/3 or 1/2 of its length at a time during convolution in order to avoid data loss, so that the data lost by the previous frame or the previous two frames is reflected in the window again.
[0117] It should also be noted that, in step 103, steps (1) and (2) may be executed in no particular order.
[0118] 104. Calculate the similarity between the first audio data and the second audio data based on the first distribution information and the second distribution information.
[0119] Since the first distribution information and the second distribution information are both data matrices, the similarity between the first audio data and the second audio data can be obtained by calculating the cosine similarity between the two data matrices. Cosine similarity is a method that evaluates how similar two vectors are by calculating the cosine of the angle between them. That is, the step of "calculating the similarity between the first audio data and the second audio data based on the first distribution information and the second distribution information" may include:
[0120] The cosine similarity between the first distribution information and the second distribution information is calculated to obtain the similarity between the first audio data and the second audio data.
[0121] It should be noted that, since the lengths of the first audio data and the second audio data may differ, in order to facilitate the subsequent calculation of the cosine similarity between the first distribution information and the second distribution information, zeros may be appended to the end of the shorter of the two so that the numbers of sampling points of the first audio data and the second audio data are kept consistent.
[0122] Among them, the cosine similarity formula is as follows:
[0123] Similarity = cos(theta) = (A · B) / (||A|| * ||B||)
[0124] where A is the vector of the short-term energy distribution of the first audio data, that is, the vector of the first distribution information; B is the vector of the short-term energy distribution of the second audio data, that is, the vector of the second distribution information; and Similarity is the similarity between the first audio data and the second audio data, which in the embodiments of the present invention mainly refers to the similarity in intonation of the two pieces of audio data (ignoring the interference of timbre).
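A minimal sketch of this cosine similarity with the zero-padding described above (function name hypothetical):

    import numpy as np

    def cosine_similarity(a, b):
        # Append zeros to the shorter distribution so lengths match,
        # then compute Similarity = (A . B) / (||A|| * ||B||).
        n = max(len(a), len(b))
        a = np.pad(a, (0, n - len(a)))
        b = np.pad(b, (0, n - len(b)))
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))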
[0125] Optionally, since most recordings generally have a long silence segment at the beginning and/or end, and this silence segment is of little significance for calculating similarity, the silence segments can be removed before calculation in order to reduce the amount of computation and improve efficiency; that is, before the step of "calculating the cosine similarity between the first distribution information and the second distribution information", the method for determining audio similarity may further include:
[0126] The first effective distribution information is obtained by removing the silence segments at the beginning and the end of the first distribution information; and the second effective distribution information is obtained by removing the silence segments at the beginning and the end of the second distribution information.
[0127] At this time, the step of "calculating the cosine similarity between the first distribution information and the second distribution information" may specifically be: calculating the cosine similarity between the first effective distribution information and the second effective distribution information.
[0128] The first and last silence segments are the sampling points at the head and tail of the audio data whose energy values are lower than a preset threshold. The preset threshold can be set according to actual application requirements; for example, sampling points whose energy value is lower than 0.025 can be treated as mute, and the beginning and end of the audio scanned against this threshold to remove the leading and trailing silence segments and obtain an effective short-term energy distribution, which is not repeated here.
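A minimal sketch of this silence removal, applied to a short-term energy distribution with the example threshold of 0.025 (function name hypothetical):

    import numpy as np

    def trim_silence(energy, threshold=0.025):
        # Drop leading and trailing points whose energy is below the
        # mute threshold, keeping the effective distribution between.
        voiced = np.where(energy >= threshold)[0]
        if len(voiced) == 0:
            return energy[:0]  # the whole signal is silent
        return energy[voiced[0]:voiced[-1] + 1]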
[0129] It can be seen from the above that, in this embodiment, normalization processing and high-pass filtering can be performed on the first audio data and the second audio data respectively, their short-term energy distributions determined respectively, and the similarity between the first audio data and the second audio data calculated based on the obtained short-term energy distributions. Because the short-term energy of various kinds of audio data, such as songs or speech signals, changes noticeably over time, and short-term energy effectively reflects the magnitude of the signal amplitude, sound versus silence, and so on, the similarity of two pieces of audio data can be calculated effectively by this solution even when the audio data is a speech signal. Therefore, compared with the existing solution, this solution can not only calculate the similarity effectively and accurately but can also be applied to most application scenarios, which greatly improves its applicability.
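Putting the pieces together, the following sketch chains the hypothetical helpers defined above (sample_wav, normalize, pre_emphasis, short_time_energy, trim_silence, cosine_similarity) into the full pipeline of this embodiment; it assumes both inputs have already been transcoded to 16-bit mono wav:

    def audio_similarity(wav_a, wav_b):
        # sample -> normalize -> pre-emphasize/filter -> short-time
        # energy -> trim silence -> cosine similarity.
        distributions = []
        for path in (wav_a, wav_b):
            x = normalize(sample_wav(path))
            y = pre_emphasis(x)
            distributions.append(trim_silence(short_time_energy(y)))
        return cosine_similarity(distributions[0], distributions[1])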
Example Embodiment
[0130] Embodiment 2.
[0131] The method described in the preceding embodiment is described in further detail below by way of example.
[0132] In this embodiment, description is given taking as an example that the apparatus for determining audio similarity is integrated in a server, the first audio file is an original audio file (i.e., an original dubbing file), and the second audio file is a user audio file.
[0133] As shown in Figure 2a, the specific process of the method for determining audio similarity can be as follows:
[0134] 201. The server acquires an original audio file, extracts first audio data from the original audio file, and acquires a user audio file, and extracts second audio data from the user audio file.
[0135] For example, after obtaining the original audio file, the server can transcode the original audio file according to a preset transcoding strategy, such as converting the original audio file into the uncompressed wav format, and set the preset parameters in the transcoded original audio file according to the preset parameter setting rule, such as setting the sampling frequency of the original audio file to 44100 Hz, the bit rate to 96k, and the channel to mono, and so on. Then, the first audio data is extracted from the original audio file after the parameters are set.
[0136] Similarly, after obtaining the user audio file, the server can transcode the user audio file according to the preset transcoding strategy, for example converting the user audio file into the uncompressed wav format, and set the preset parameters in the transcoded user audio file according to the preset parameter setting rule, such as setting the sampling frequency of the user audio file to 44100 Hz, the bit rate to 96k, and the channel to mono, and so on. Then, the second audio data is extracted from the user audio file after the parameters are set.
[0137] The preset transcoding strategy and the preset parameter setting rule may be set according to actual application requirements, which will not be repeated here. For example, the transcoding instruction may be as follows:
[0138] ./ffmpeg -y -i local_file -ar 44100 -ac 1 -acodec pcm_s16le wav_file
[0139] The manner of acquiring the original audio file and the user audio file can be determined according to the requirements of the actual application scenario. For example, taking the original audio file as the original dubbing of character A in a certain game K, the original dubbing of character A can be obtained from a specified local storage or other storage device to obtain the original audio file, and the user audio file can be obtained by receiving speech recorded by the user. For example, referring to Figure 2b, the user can record by tapping "click to record" in the interface and following the line prompt in the interface, "the script has started, the hunting time has started la la la"; after receiving the user's recording, the server saves it as the user audio file.
[0140] Optionally, in order to help users dub better, a listening entry for the "original audio file" can also be provided in this interface. For example, referring to the "listen to the original sound" trigger key in Figure 2b, the user can click or slide this trigger key to listen to the original dubbing file of character A, which is not repeated here.
[0141] 202. The server normalizes the first audio data to obtain the first processed data, and then executes step 203.
[0142] For example, it can be as follows:
[0143] (1) The server samples the first audio data to obtain a first set of sampling points.
[0144] The sampling method can be determined according to actual application requirements. For example, a signed number can be read every 16 bits as one sampling point, and the obtained sampling points added to the same set to obtain the first sampling point set.
[0145] Figure 2c is a schematic diagram of the sampling result obtained by sampling certain audio data by reading a signed number every 16 bits as one sampling point.
[0146] (2) The server normalizes all the sampling points in the first sampling point set to obtain the first processed data.
[0147] For example, the absolute maximum value (also referred to as the maximum value, i.e., max-value) over all sampling points in the first sampling point set may be calculated, and then, according to this absolute maximum value, the amplitudes of all sampling points in the first sampling point set adjusted into the preset interval to obtain the first processed data.
[0148] The preset interval can be set according to actual application requirements. For example, taking the preset interval as [-1, 1], the following formula can be used to normalize all sampling points in the first sampling point set:
[0149] x_t(i) = x(i) / max_value
[0150] where x_t(i) is the normalized amplitude of the i-th sampling point, whose range is [-1, 1], and x(i) is the original amplitude of the i-th sampling point, whose range is generally [-32768, 32767].
[0151] After adjusting the amplitudes of all sampling points in the first sampling point set according to the above normalization formula, the first processed data x(n) is obtained. For example, Figure 2d is a schematic diagram of the result obtained after normalizing the sampling points in Figure 2c.
[0152] 203. The server uses a high-pass filter to filter the first processed data to obtain the first filtered data, and then executes step 204.
[0153] Since the average power spectrum of a speech signal (such as the first processed data) is affected by glottal excitation and radiation from the mouth and nose, the high-frequency end is attenuated by 6 dB/octave above 800 Hz after the speech signal is radiated from the lips. Therefore, a first-order high-pass filter (such as a 6 dB/octave first-order high-pass filter) can be used to pre-emphasize the first processed data (which weakens the low frequencies and flattens the signal spectrum for subsequent spectrum analysis and channel parameter analysis), and this high-pass filter, such as a 6 dB/octave first-order high-pass filter, used to filter the pre-emphasized first processed data to obtain the first filtered data. The formula is expressed as:
[0154] y(n)=1.0*x(n)-u*x(n-1)
[0155] Wherein, in this step, y(n) is the first filtered data, x(n) is the first processed data, and u is the pre-emphasis coefficient. The value of u can be determined according to the requirements of the actual application, and the value range of u is [0.9, 1.0], for example, it can be 0.9375, and so on.
[0156] For example, see Figure 2e, which is a schematic diagram of the effect obtained by filtering the normalized result of the sampling points in Figure 2c (i.e., the filtering result).
[0157] 204. The server determines the short-term energy distribution of the first filtered data to obtain first distribution information.
[0158] Optionally, since the first filtered data is very long and difficult to process at one time, the first filtered data may be processed in segments. For example, the server may segment the first filtered data, determine the short-term energy distribution of each segment separately, and aggregate the short-term energy distributions of all segments to obtain the first distribution information, and so on.
[0159] Optionally, since the segmented data has no obvious periodicity, subsequent convolution is inconvenient. Therefore, a Hamming window can be used when segmenting; in this way, the segmented data has obvious periodicity, with the data in one window representing one cycle. That is, the step of "the server determines the short-term energy distribution of the first filtered data and obtains the first distribution information" may specifically be as follows:
[0160] The server obtains the Hamming window function, performs a dot-multiplication operation on the first filtered data, and convolves the result of the operation with the Hamming window function to obtain the first distribution information, expressed by the formula:
[0161] E_n = sum_m y^2(m) * h(n - m)
[0162] where h(n-m) is the Hamming window function and y_n(m) is the n-th frame signal of audio data y(n) (such as the first filtered data), satisfying the relationship:
[0163] y_n(m) = w(n - m) * y(m)
[0164] where 0 ≤ m ≤ N - 1
[0165] w(n) = 1 for 0 ≤ n ≤ N - 1, and w(n) = 0 otherwise
[0166] where n = 0, 1T, 2T, ..., N is the frame length, and T is the frame shift length.
[0167] For the specific derivation process of the short-term energy distribution formula, reference may be made to the foregoing embodiments, and details are not described herein again.
[0168] The Hamming window function can be determined according to actual application requirements. For example, the following Hamming window function can be used:
[0169] h(n, a) = (1 - a) - a * cos(2 * PI * n / (M - 1)), 0 ≤ n ≤ M - 1
[0170] where h(n, a) is the value of the Hamming window with window parameter a at the n-th point (called the Hamming window function in the embodiments of the present invention), PI is pi, M is the size of the window, and a is a constant whose value can be determined according to actual application requirements; for example, a can be 0.46, and so on.
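A short numerical check of this window function with a = 0.46 (the window size M = 1024 is an assumed example); with this parameter the formula coincides with the standard Hamming window:

    import numpy as np

    M = 1024                     # window size (assumed example)
    a = 0.46                     # Hamming window parameter
    n = np.arange(M)
    h = (1 - a) - a * np.cos(2 * np.pi * n / (M - 1))
    assert np.allclose(h, np.hamming(M))  # matches numpy's Hamming window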
[0171] When a = 0.46, the effect of the Hamming window can be as shown in Figure 2f. In addition, after applying the Hamming window to the filtered data in Figure 2e, the obtained short-term energy distribution of the filtered data can be as shown in Figure 2g (if the filtered data is the first filtered data, Figure 2g shows the first distribution information; if the filtered data is the second filtered data, Figure 2g shows the second distribution information).
[0172] It should be noted that after the Hamming window is applied, the data in the middle of the window is reflected while the data on both sides is attenuated and lost. Therefore, when doing the convolution, the window can only be moved by 1/3 or 1/2 of its length at a time; in this way, the data lost by the previous frame or the previous two frames is reflected in the window again, which avoids data loss.
[0173] In addition, it should be noted that after the first distribution information corresponding to the original audio file is obtained, the first distribution information can be saved. In this way, if the similarity between another user audio file and the original audio file needs to be calculated later, the saved first distribution information can be called directly without recalculation, which reduces the occupation of computing resources and improves computing efficiency.
[0174] 205. The server performs normalization processing on the second audio data to obtain the second processed data, and then executes step 206.
[0175] For example, it can be as follows:
[0176] (1) The server samples the second audio data to obtain a second set of sampling points.
[0177] Similar to the sampling of the first audio data, the second audio data can be sampled in various ways. For example, a signed number can be read every 16 bits as one sampling point, and the obtained sampling points added to the same set to obtain the second sampling point set. For details, refer to Figure 2c and step 202, which are not repeated here.
[0178] (2) Normalize all the sampling points in the second sampling point set to obtain the second processed data.
[0179] For example, the absolute maximum value (also called the maximum value, i.e., max-value) over all sampling points in the second sampling point set can be calculated, and then, according to this absolute maximum value, the amplitudes of all sampling points in the second sampling point set adjusted into the preset interval to obtain the second processed data.
[0180] The preset interval can be set according to actual application requirements. For example, taking the preset interval as [-1, 1], the following formula can be used to normalize all sampling points in the second sampling point set:
[0181] x_t(i) = x(i) / max_value
[0182] where x_t(i) is the normalized amplitude of the i-th sampling point, whose range is [-1, 1], and x(i) is the original amplitude of the i-th sampling point, whose range is generally [-32768, 32767].
[0183] After adjusting the amplitudes of all sampling points in the second sampling point set according to the above normalization formula, the second processed data is obtained. For details, refer to Figure 2d and step 202, which are not repeated here.
[0184] It should be noted that, since the normalization processing formula used here is the same as that in step 202, in this step, x(n) is also used to represent the second processed data.
[0185] 206. The server uses a high-pass filter to filter the second processed data to obtain second filtered data.
[0186] For example, the filtering method is similar to that used for the first processed data (see step 203); that is, a first-order high-pass filter, for example a 6 dB/octave first-order high-pass filter, can be used directly to filter the second processed data to obtain the second filtered data.
[0187] Since the average power spectrum of a speech signal (such as the second processed data) is affected by glottal excitation and radiation from the mouth and nose, the high-frequency end is attenuated by 6 dB/octave above 800 Hz after the speech signal is radiated from the lips. Therefore, a first-order high-pass filter (such as a 6 dB/octave first-order high-pass filter) can be used to pre-emphasize the second processed data (which weakens the low frequencies and flattens the signal spectrum for subsequent spectrum analysis and channel parameter analysis), and this high-pass filter, such as a 6 dB/octave first-order high-pass filter, used to filter the pre-emphasized second processed data to obtain the second filtered data. The formula is expressed as:
[0188] y(n)=1.0*x(n)-u*x(n-1)
[0189] Wherein, in this step, y(n) is the second filtered data, x(n) is the second processed data, and u is the pre-emphasis coefficient. The value of u can be determined according to the actual application requirements. The value range of u is [0.9, 1.0], for example, it can be 0.9375, etc. For details, please refer to Figure 2e , and will not be repeated here.
[0190] Wherein, steps 202 and 205 may be executed in no particular order.
[0191] 207. The server determines the short-term energy distribution of the second filtered data, and obtains second distribution information.
[0192] Similar to the processing of the first filtered data, since the second filtered data is very long and difficult to process at one time, the second filtered data can be processed in segments. For example, the second filtered data may be segmented, the short-term energy distribution of each segment determined separately, and the short-term energy distributions of all segments aggregated to obtain the second distribution information, and so on.
[0193] Optionally, in order to make the data obtained by segmentation have obvious periodicity and facilitate subsequent convolution, a Hamming window can be used to perform segmentation, wherein the data in one window represents one cycle. That is, the step "the server determines the short-term energy distribution of the second filtered data, and obtains the second distribution information" can be specifically as follows:
[0194] The server obtains the Hamming window function, performs a dot product operation on the second filtered data, and convolves the result obtained by the operation with the Hamming window function to obtain second distribution information. The formula is expressed as:
[0195] E_n = sum_m y^2(m) * h(n - m)
[0196] where y(n) is the second filtered data and h(n-m) is the Hamming window function (referred to as the Hamming window); the detailed analysis of this formula is given in step 204 and is not repeated here.
[0197] It should be noted that, similar to the processing of the first filtered data, after the Hamming window is applied, the window can only be moved by 1/3 or 1/2 of its length at a time during convolution in order to avoid data loss, so that the data lost by the previous frame or the previous two frames is reflected in the window again.
[0198] 208. The server removes the first and last silence segments of the first distribution information to obtain first valid distribution information; and removes the first and last silence segments of the second distribution information to obtain second valid distribution information.
[0199] The first and last silence segments are the sampling points at the head and tail of the audio data whose energy values are lower than a preset threshold. The preset threshold can be set according to actual application requirements; for example, sampling points whose energy value is lower than 0.025 can be treated as mute, and the beginning and end of the audio scanned against this threshold to remove the leading and trailing silence segments and obtain an effective short-term energy distribution. For example, referring to Figure 2h, the part marked by the rectangular box in the figure is the silence segment, which can be removed. Removing the first and last silence segments of the first distribution information yields the first effective distribution information, for example as shown in Figure 2I; in the same way, removing the first and last silence segments of the second distribution information yields the second effective distribution information, for example as shown in Figure 2J.
[0200] 209. The server calculates the cosine similarity between the first effective distribution information and the second effective distribution information to obtain the similarity between the first audio data and the second audio data.
[0201] It should be noted that, since the lengths of the first audio data and the second audio data may differ, in order to facilitate the subsequent calculation of the cosine similarity between the first distribution information and the second distribution information, zeros may be appended to the end of the shorter of the two so that the numbers of sampling points of the first audio data and the second audio data are kept consistent.
[0202] Among them, the cosine similarity formula is as follows:
[0203] Similarity = cos(theta) = (A · B) / (||A|| * ||B||)
[0204] where A is the vector of the short-term energy distribution of the first audio data, that is, the vector of the first distribution information; B is the vector of the short-term energy distribution of the second audio data, that is, the vector of the second distribution information; and Similarity is the similarity between the first audio data and the second audio data, which in the embodiments of the present invention mainly refers to the similarity in intonation of the two pieces of audio data (ignoring the interference of timbre).
[0205] Optionally, after the similarity between the first audio data and the second audio data is obtained, further processing can be performed according to actual application requirements, for example scoring the second audio file (i.e., the user audio file) based on the similarity. Taking the game dubbing in step 201 as an example, see Figure 2k: when the user triggers "click to record" and completes the recording, the terminal can calculate the similarity between the recording (i.e., the user audio file) and the original audio file in the background. During the calculation, the terminal interface can display the calculation progress (such as 62%) and prompt information, such as "intensive calculation...", to remind the user to wait. After the similarity between the user audio file and the original audio file is obtained, the similarity can be displayed on the interface. Optionally, a corresponding score may also be calculated based on the similarity; the specific scoring standard may be determined according to the actual application scenario, which is not repeated here.
[0206] As can be seen from the above, in this embodiment the first audio data and the second audio data can be extracted from the original audio file and the user audio file respectively, the two pieces of audio data subjected to normalization processing and high-pass filtering respectively, their short-term energy distributions determined respectively, and the similarity between the first audio data and the second audio data calculated based on the obtained short-term energy distributions. Because the short-term energy of various kinds of audio data, such as songs or speech signals, changes noticeably over time, and short-term energy effectively reflects the magnitude of the signal amplitude, sound versus silence, and so on, the similarity of two pieces of audio data can be calculated effectively by this scheme even when the audio data is a speech signal. Therefore, compared with the existing scheme, this scheme can not only calculate the similarity effectively and accurately but can also be applied to most application scenarios, which greatly improves its applicability.
Example Embodiment
[0207] Embodiment 3.
[0208] In order to better implement the above method, an embodiment of the present invention further provides an apparatus for determining audio similarity, and the apparatus for determining audio similarity may specifically be integrated in a server or other equipment.
[0209] For example, as shown in Figure 3a, the apparatus for determining audio similarity may include an acquiring unit 301, a filtering unit 302, a determining unit 303, and a calculating unit 304, as follows:
[0210] (1) Acquisition unit 301;
[0211] The acquiring unit 301 is configured to acquire first audio data and second audio data.
[0212] For example, the obtaining unit 301 can be specifically configured to obtain a first audio file and extract the first audio data from the first audio file, and to obtain a second audio file and extract the second audio data from the second audio file, and so on.
[0213] Optionally, in order to reduce interference, reduce differences between audio files caused by interference, and improve the accuracy of the calculation, the audio files can be transcoded and their parameter formats unified when extracting the audio data; that is:
[0214] The obtaining unit 301 may be specifically configured to obtain the first audio file, transcode the first audio file according to a preset transcoding strategy, set preset parameters in the transcoded first audio file according to a preset parameter setting rule, and extract the first audio data from the first audio file after the parameters are set.
[0215] And the obtaining unit 301 may be specifically configured to obtain the second audio file, transcode the second audio file according to the preset transcoding strategy, set preset parameters in the transcoded second audio file according to the preset parameter setting rule, and extract the second audio data from the second audio file after the parameters are set.
[0216] The preset transcoding strategy and preset parameter setting rules can be set according to actual application requirements. For example, the audio files can be converted into the uncompressed wav format, and the parameters set to: a sampling frequency of 44100 Hz, a bit rate of 96k, and a mono channel, and so on.
[0217] (2) filtering unit 302;
[0218] The filtering unit 302 is used to perform normalization processing and high-pass filtering on the first audio data and the second audio data, respectively, to obtain the first filtered data corresponding to the first audio data and the second filtered data corresponding to the second audio data. data.
[0219] For example, the filtering unit 302 may include a sampling subunit, a normalizing subunit, and a filtering subunit, which may be specifically as follows:
[0220] The sampling subunit can be used to sample the first audio data and the second audio data respectively to obtain a first sampling point set corresponding to the first audio data and a second sampling point set corresponding to the second audio data.
[0221] For example, the sampling subunit may be specifically used to sample the first audio data to obtain the first set of sampling points, and to sample the second audio data to obtain the second set of sampling points.
[0222] The sampling method can be determined according to actual application requirements. For example, the first audio data can be sampled by reading a signed number every 16 bits as one sampling point, and the obtained sampling points added to the same set to obtain the first sampling point set; similarly, the second audio data can be sampled by reading a signed number every 16 bits as one sampling point, and the obtained sampling points added to the same set to obtain the second sampling point set, and so on. It should be noted that the sampling manner of the first audio data and that of the second audio data should be consistent.
[0223] The normalization subunit can be used to perform normalization processing on all sampling points in the first sampling point set and all sampling points in the second sampling point set, respectively, to obtain the first processed data corresponding to the first sampling point set, and the second processed data corresponding to the second sampling point set.
[0224] For example, the normalization subunit can be specifically used to calculate the absolute maximum value (max-value) over all sampling points in the first sampling point set and, according to this absolute maximum value, adjust the amplitudes of all sampling points in the first sampling point set into the preset interval to obtain the first processed data; and to calculate the absolute maximum value (max-value) over all sampling points in the second sampling point set and, according to this absolute maximum value, adjust the amplitudes of all sampling points in the second sampling point set into the preset interval to obtain the second processed data.
[0225] Wherein, the preset interval can be set according to actual application requirements, for example, the preset interval can be specifically set as [-1, 1], see the foregoing method embodiments for details, and will not be repeated here.
[0226] The filtering subunit can be used to filter the first processed data and the second processed data respectively by using a high-pass filter, so as to obtain the first filtered data corresponding to the first audio data and the second filtered data corresponding to the second audio data.
[0227] For example, the filtering subunit can be specifically used to pre-emphasize the first processed data and use a high-pass filter to filter the pre-emphasized first processed data to obtain the first filtered data; and to pre-emphasize the second processed data and use a high-pass filter to filter the pre-emphasized second processed data to obtain the second filtered data corresponding to the second audio data, and so on.
[0228] (3) determining unit 303;
[0229] The determining unit 303 is used to determine the short-term energy distribution of the first filtered data and the second filtered data respectively, and obtain first distribution information corresponding to the first filtered data and second distribution information corresponding to the second filtered data .
[0230] For example, the determining unit can be specifically used to obtain a Hamming window function; perform a dot-multiplication operation on the first filtered data and convolve the result of the operation (that is, the result obtained by performing the dot-multiplication operation on the first filtered data) with the Hamming window function to obtain the first distribution information corresponding to the first filtered data; and perform a dot-multiplication operation on the second filtered data and convolve the result of the operation (that is, the result obtained by performing the dot-multiplication operation on the second filtered data) with the Hamming window function to obtain the second distribution information corresponding to the second filtered data.
[0231] It should be noted that after the Hamming window is applied, in order to avoid data loss, the window can only be moved by 1/3 or 1/2 of its length at a time during convolution, so that the data lost by the previous frame or the previous two frames is reflected in the window again.
[0232] (4) calculation unit 304;
[0233] The calculation unit 304 is configured to calculate the similarity between the first audio data and the second audio data based on the first distribution information and the second distribution information.
[0234] Since the first distribution information and the second distribution information are both data matrices, the similarity between the first audio data and the second audio data can be obtained by calculating the cosine similarity between the two data matrices, namely:
[0235] The calculating unit 304 may be specifically configured to calculate the cosine similarity between the first distribution information and the second distribution information, and obtain the similarity between the first audio data and the second audio data.
[0236] It should be noted that, since the lengths of the first audio data and the second audio data may differ, in order to facilitate the subsequent calculation of the cosine similarity between the first distribution information and the second distribution information, zeros may be appended to the end of the shorter of the two so that the numbers of sampling points of the first audio data and the second audio data are kept consistent.
[0237] Optionally, since most recordings generally have a long silence segment at the beginning and/or end, and this silence segment is of little significance for calculating similarity, the silence segments can be removed before calculation in order to reduce the amount of computation and improve efficiency; that is, optionally, as shown in Figure 3b, the apparatus for determining audio similarity may further include an intercepting unit 305, as follows:
[0238] The intercepting unit 305 is configured to remove the first and last silence segments of the first distribution information to obtain first valid distribution information, and remove the first and last silence segments of the second distribution information to obtain second valid distribution information.
[0239] At this time, the calculation unit 304 may be specifically configured to calculate the cosine similarity between the first effective distribution information and the second effective distribution information, and obtain the similarity between the first audio data and the second audio data.
[0240] The first and last silence segments are the sampling points at the head and tail of the audio data whose energy values are lower than a preset threshold. The preset threshold can be set according to actual application requirements; for example, sampling points whose energy value is lower than 0.025 can be treated as mute, and the beginning and end of the audio scanned against this threshold to remove the leading and trailing silence segments and obtain an effective short-term energy distribution, which is not repeated here.
[0241] During specific implementation, the above units may be implemented as independent entities, or may be arbitrarily combined to be implemented as the same or several entities, see the foregoing method embodiments for details, and will not be repeated here.
[0242] As can be seen from the above, in the apparatus for determining audio similarity of this embodiment, the filtering unit 302 can perform normalization processing and high-pass filtering on the first audio data and the second audio data respectively, the determining unit 303 then determines their short-term energy distributions respectively, and the calculating unit 304 calculates the similarity between the first audio data and the second audio data based on the obtained short-term energy distributions. Because the short-term energy of various kinds of audio data, such as songs or speech signals, changes noticeably over time, and short-term energy effectively reflects the magnitude of the signal amplitude, sound versus silence, and so on, the similarity of two pieces of audio data can be calculated effectively by this scheme even when the audio data is a speech signal. Therefore, compared with the existing scheme, this scheme can not only calculate the similarity effectively and accurately but can also be applied to most application scenarios, which greatly improves its applicability.