Method and apparatus for verifying voice identity, electronic device and storage medium
By identifying speech segments with the same phoneme sequence and generating feature curves in speech identity testing, and calculating similarity, the problem of low accuracy in speech identity testing is solved, and higher accuracy in speech identity testing is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- VOICEAI TECH CO LTD
- Filing Date
- 2022-08-10
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, the accuracy of voice identity verification is not high, and it cannot fully reflect the speaker's voice characteristics, resulting in low reliability of the identity verification results.
By acquiring the speech segment to be tested and the sample speech segment, speech sub-segments with the same phoneme sequence are identified, feature curves are extracted and generated, and the similarity between the feature curves is calculated to determine the identity test result.
It improves the accuracy of speech identity verification, and can determine whether different speech segments come from the same speaker based on the individual speech characteristics and similarity of the speakers in continuous speech.
Description
Technical Field
[0001] This application relates to the field of audio processing, and more specifically, to a method, apparatus, electronic device, and storage medium for verifying speech identity.
Background Technology
[0002] Speech identity verification refers to the comparative analysis of two input speech samples to determine whether they originate from the same person. Currently, speech identity verification relies on the phonetic characteristics of individual syllables (single vowels or single characters). However, monosyllable phonetic characteristics cannot reflect the full picture of a speaker's speech features, resulting in low accuracy and reliability of the verification results. Therefore, how to comprehensively and accurately reflect the speech features of a human speaker to improve the accuracy of speech identity verification is a pressing issue.
Summary of the Invention
[0003] In view of the above problems, embodiments of this application propose a method, apparatus, electronic device and storage medium for verifying voice identity, in order to address the above problems.
[0004] In a first aspect, embodiments of this application provide a method for verifying speech identity. The method includes: acquiring a speech segment to be verified and a sample speech segment; determining a first speech sub-segment in the speech segment to be verified and a second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments having the same phoneme sequence; extracting a first feature of the first speech sub-segment and a second feature of the second speech sub-segment, and determining a first feature curve based on the first feature and a second feature curve based on the second feature; determining the similarity between the first feature curve and the second feature curve, and determining the verification result of the identity between the speech segment to be verified and the sample speech segment based on the similarity.
[0005] In a second aspect, embodiments of this application provide a speech identity verification device, comprising: an acquisition module for acquiring a speech segment to be verified and a sample speech segment; a speech sub-segment determination module for determining a first speech sub-segment in the speech segment to be verified and a second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments having the same phoneme sequence; a feature extraction module for extracting a first feature of the first speech sub-segment and a second feature of the second speech sub-segment, and determining a first feature curve based on the first feature and a second feature curve based on the second feature; and a determination module for determining the similarity between the first feature curve and the second feature curve, and determining the verification result of the identity between the speech segment to be verified and the sample speech segment based on the similarity.
[0006] In some embodiments, the speech identity verification device further includes: a timestamp determination module, configured to determine the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve respectively; and an alignment module, configured to align the same phonemes in the first feature curve and the second feature curve according to the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve.
[0007] In some embodiments, the first feature curve includes a first fundamental frequency feature curve, and the second feature curve includes a second fundamental frequency feature curve; the determining module includes: a peak point sequence first determining unit, configured to determine a first peak point sequence based on each peak point in the first fundamental frequency feature curve, and to determine a second peak point sequence based on each peak point in the second fundamental frequency feature curve; a position information determining unit, configured to determine first position information of each peak point in the first peak point sequence and second position information of each peak point in the second peak point sequence; a first difference determining unit, configured to determine the mean of the position deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence and the variance of the position deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence based on the first position information and the second position information; and a first similarity determining unit, configured to determine the similarity between the first fundamental frequency feature curve and the second fundamental frequency feature curve based on the mean and the variance.
[0008] In some embodiments, the first feature curve includes a first zero-crossing rate feature curve, and the second feature curve includes a second zero-crossing rate feature curve; the determination module includes: a peak point sequence second determination unit, configured to determine a third peak point sequence based on each peak point in the first zero-crossing rate feature curve, and to determine a fourth peak point sequence based on each peak point in the second zero-crossing rate feature curve; a quantity acquisition unit, configured to acquire a first quantity of peak points in the third peak point sequence and a second quantity of peak points in the fourth peak point sequence; a quantity difference determination unit, configured to determine the quantity difference between the first quantity and the second quantity; and a second similarity determination unit, configured to determine the similarity between the first zero-crossing rate feature curve and the second zero-crossing rate feature curve based on the quantity difference.
[0009] In some embodiments, the first feature curve includes a first energy feature curve, and the second feature curve includes a second energy feature curve; the determining module includes: a third peak point sequence determining unit, configured to determine a fifth peak point sequence based on each peak point in the first energy feature curve, and a sixth peak point sequence based on each peak point in the second energy feature curve; an intensity value determining unit, configured to determine the intensity value corresponding to each peak point in the fifth peak point sequence and the intensity value corresponding to each peak point in the sixth peak point sequence; a second difference determining unit, configured to determine a first mean and a first variance of the intensity values corresponding to each peak point in the fifth peak point sequence, and to determine a second mean and a second variance of the intensity values corresponding to each peak point in the sixth peak point sequence; a third difference determining unit, configured to determine the mean difference between the first mean and the second mean, and to determine the variance difference between the first variance and the second variance; and a third similarity determining unit, configured to determine the similarity between the first energy feature curve and the second energy feature curve based on the mean difference and the variance difference.
[0010] In some embodiments, the determining module further includes: a normalization module, configured to normalize the first energy characteristic curve to obtain a normalized first energy characteristic curve, and to normalize the second energy characteristic curve to obtain a normalized second energy characteristic curve.
[0011] In some embodiments, the determining module further includes: a determining unit, configured to determine whether the similarity is greater than a similarity threshold; and an identity determining unit, configured to confirm that the speech segment to be verified and the sample speech segment belong to the same object if the similarity is greater than the similarity threshold.
[0012] In a third aspect, embodiments of this application provide an electronic device, including: a processor; and a memory, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the method for verifying voice identity as described above is implemented.
[0013] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-readable instructions thereon, which, when executed by a processor, implement the method for verifying speech identity as described above.
[0014] In this application, two speech sub-segments with the same phoneme sequence are identified from the speech segment to be tested and the sample speech segment. Then, speech features are extracted from these two sub-segments, generating two corresponding feature curves. The similarity between the two feature curves is then determined, and finally, the similarity is used to determine the identity test result between the speech segment to be tested and the sample speech segment. This application can determine whether different speech segments come from the same speaker based on the individual speech characteristics and similarity exhibited by the speakers in continuous speech segments, thus improving the accuracy of speech identity verification.
[0015] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit the invention.
Attached Figure Description
[0016] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. It is obvious that the drawings described below are merely some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0017] Figure 1 is a schematic flowchart illustrating a method for verifying speech identity according to an embodiment of this application.
[0018] Figure 2 is a schematic flowchart illustrating the specific steps of step 140 according to an embodiment of this application.
[0019] Figure 3 is a schematic flowchart illustrating the specific steps of step 140 in another embodiment of this application.
[0020] Figure 4 is a schematic diagram illustrating the specific steps of step 140 in another embodiment of this application.
[0021] Figure 5 is a flowchart illustrating a method for verifying voice identity according to another embodiment of this application.
[0022] Figure 6 is a block diagram of a device for verifying voice identity according to an embodiment of this application.
[0023] Figure 7 is a hardware structure diagram of an electronic device according to an embodiment of this application.
[0024] The accompanying drawings have illustrated specific embodiments of the present invention, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the inventive concept in any way, but rather to illustrate the concept of the invention to those skilled in the art through specific embodiments.
Detailed Implementation
[0025] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this application more comprehensive and complete, and to fully convey the concept of the exemplary embodiments to those skilled in the art.
[0026] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.
[0027] Figure 1 is a flowchart illustrating a method for verifying voice identity according to an embodiment of this application. The method can be executed by an electronic device with computing capabilities, such as a desktop computer, laptop computer, or other terminal device. The method can also be interactively executed by a processing system including a server and a terminal. As shown in Figure 1, the method includes the following steps:
[0028] Step 110: Obtain the speech segment to be tested and the sample speech segment.
[0029] A sample speech segment refers to a speech segment whose source (i.e., the speaker) is known, while a speech segment to be tested refers to a speech segment whose source is unknown. The source of the speech segment to be tested can be determined by comparing it with the sample speech segment.
[0030] Step 120: Determine the first speech sub-segment in the speech segment to be tested and the second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments with the same phoneme sequence.
[0031] In some embodiments, the first speech sub-segment and the second speech sub-segment can be determined through speech recognition technology. In one approach, speech recognition technology can be used to perform phoneme recognition on the speech segment to be tested and the sample speech segment respectively, so as to obtain the phoneme content of the speech segment to be tested and the phoneme content of the sample speech segment.
[0032] A phoneme is the smallest speech unit divided according to the natural attributes of speech, and is divided into two categories: vowels and consonants. Phonemes are analyzed according to the pronunciation actions in a syllable. One action constitutes one phoneme. For example, the Chinese syllable "ā" has only one phoneme "ā"; "ài" has two phonemes, namely "à" and "i"; "dài" has three phonemes, namely "d", "à" and "i". Optionally, the phoneme can also be an English phoneme, or a phoneme corresponding to other languages, which is not specifically limited here.
[0033] Speech sub-segments with the same phoneme sequence are sub-segments of two speech segments in which each phoneme of one sub-segment occupies the same position as the same phoneme in the other sub-segment. For example, suppose one speech segment includes a speech sub-segment composed of "abcde", where each of "a" to "e" corresponds to a phoneme, and arranging the phonemes of this sub-segment in order yields a first phoneme sequence; another speech segment also includes a speech sub-segment composed of "abcde", corresponding to the same phonemes, and arranging the phonemes of that sub-segment in order yields a second phoneme sequence. The first phoneme sequence and the second phoneme sequence are then exactly the same.
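As an illustration only, the search for sub-segments with the same phoneme sequence can be sketched as a longest-common-contiguous-run search over two recognized phoneme lists. The function name `longest_common_phoneme_run` and the choice of the longest run are assumptions made for this sketch; the application does not prescribe a particular search procedure.

```python
from typing import List, Tuple


def longest_common_phoneme_run(a: List[str], b: List[str]) -> Tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of the longest contiguous
    phoneme run shared by both sequences; length is 0 if nothing is shared."""
    best = (0, 0, 0)
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


# Hypothetical phoneme transcriptions of the two speech segments.
tested = ["n", "i", "h", "a", "o", "m", "a"]
sample = ["t", "a", "h", "a", "o", "m", "e"]
start_t, start_s, length = longest_common_phoneme_run(tested, sample)
shared = tested[start_t:start_t + length]
```

The returned indices refer to positions in the phoneme lists; mapping them back to audio time ranges would additionally require the phoneme timestamps produced by the recognizer.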
[0034] In some embodiments, the speech segment to be tested and the sample speech segment may contain noise introduced during collection. Therefore, in one approach, before determining the first speech sub-segment and the second speech sub-segment, noise reduction processing is performed on the speech segment to be tested and the sample speech segment respectively. Optionally, the noise can be identified and removed according to its characteristics. Because noise has chaotic amplitudes and frequencies, portions with chaotic amplitudes and frequencies can be located in the spectrograms of the speech segment to be tested and the sample speech segment and deleted, so that only the portions that are not noise are retained.
[0035] As another approach, the collected speech segment to be tested and the sample speech segment include both speech and non-speech portions. To improve the accuracy of identifying the first and second speech sub-segments, voice activity detection (VAD) can be performed on both the speech segment to be tested and the sample speech segment to determine the active speech portions within them. VAD, also known as endpoint detection, distinguishes between speech portions and non-speech portions (also called silence segments) in an audio clip, removing the silence and retaining the speech. Therefore, after performing VAD on the speech segment to be tested and the sample speech segment, the non-speech portions can be filtered out, retaining only the active speech. Thus, in determining the first and second speech sub-segments, only the active speech in the speech segment to be tested and the sample speech segment needs to be considered; the non-speech portions can be ignored.
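The filtering described above can be illustrated with a very simple energy-threshold detector. This is only a sketch under stated assumptions: the function names, the non-overlapping framing, and the fixed `threshold` parameter are hypothetical, and practical VAD systems typically use more robust decision rules.

```python
from typing import List


def active_frames(samples: List[float], frame_len: int, threshold: float) -> List[bool]:
    """Flag each non-overlapping frame as active when its mean energy exceeds threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags


def keep_active(samples: List[float], frame_len: int, threshold: float) -> List[float]:
    """Concatenate only the active frames; a trailing partial frame is dropped."""
    kept: List[float] = []
    for k, active in enumerate(active_frames(samples, frame_len, threshold)):
        if active:
            kept.extend(samples[k * frame_len:(k + 1) * frame_len])
    return kept


# Silence, a short burst of signal, then silence again.
audio = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
speech_only = keep_active(audio, frame_len=4, threshold=0.01)
```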
[0036] Step 130: Extract the first feature of the first speech sub-segment and the second feature of the second speech sub-segment respectively, and determine the first feature curve based on the first feature and the second feature curve based on the second feature.
[0037] Speech features can include characteristic parameters such as fundamental frequency, energy, and zero-crossing rate. Furthermore, speech features can also include formant trends, Mel-frequency cepstral coefficients, and harmonics. The zero-crossing rate (ZCR) refers to the number of times the speech signal crosses a zero point (changing from positive to negative or vice versa) in each frame of the speech signal. The number of times the speech signal crosses zero per unit time is called the zero-crossing rate; the zero-crossing rate over a long period is called the average zero-crossing rate.
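As a minimal sketch of the per-frame zero-crossing count described above, the following code counts sign changes between consecutive samples in each frame. The function names, the non-overlapping framing, and the convention of treating a sample equal to zero as non-negative are assumptions of this example.

```python
from typing import List


def zero_crossings(frame: List[float]) -> int:
    """Count sign changes between consecutive samples (0 is treated as non-negative)."""
    count = 0
    for prev, curr in zip(frame, frame[1:]):
        if (prev >= 0) != (curr >= 0):
            count += 1
    return count


def zcr_curve(samples: List[float], frame_len: int) -> List[int]:
    """Zero-crossing count of each non-overlapping frame, in frame order."""
    return [
        zero_crossings(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# First frame alternates sign on every sample, second frame never changes sign.
curve = zcr_curve([1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0], frame_len=4)
```

Dividing each count by the frame duration would turn the counts into a rate per unit time.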
[0038] In one approach, speech signal processing algorithms, such as the autocorrelation method, the cepstral method, and linear predictive coding (LPC), can be used to automatically calculate the fundamental frequency, energy, and zero-crossing rate of each speech frame of the first speech sub-segment. Then, the first feature curve corresponding to the first speech sub-segment can be determined based on the obtained fundamental frequency, energy, and zero-crossing rate of the first speech sub-segment.
[0039] The speech features of the second speech sub-segment can be extracted in a similar manner, which will not be elaborated here.
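For illustration, the autocorrelation approach to per-frame fundamental frequency, together with a per-frame energy value, might be sketched as follows. The function names, the search range given by `f_min` and `f_max`, and the use of an unnormalized autocorrelation are assumptions of this sketch, not details taken from this application.

```python
import math
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Short-time energy: sum of squared samples in the frame."""
    return sum(x * x for x in frame)


def f0_autocorr(frame: List[float], sample_rate: float,
                f_min: float = 50.0, f_max: float = 500.0) -> float:
    """Estimate F0 as sample_rate / lag, where lag maximises the autocorrelation
    within the lag range implied by [f_min, f_max]. Returns 0.0 if no positive
    autocorrelation value is found (for example, for a silent frame)."""
    min_lag = max(int(sample_rate / f_max), 1)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic 200 Hz tone sampled at 8 kHz, used as a stand-in for a voiced frame.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
estimated_f0 = f0_autocorr(tone, sample_rate=8000)
```

Collecting these per-frame values in frame order gives the fundamental frequency and energy feature curves for a sub-segment.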
[0040] Step 140: Determine the similarity between the first feature curve and the second feature curve, and determine the test result of the identity between the speech segment to be tested and the sample speech segment based on the similarity.
[0041] In some embodiments, the similarity between the first feature curve and the second feature curve can be the curve similarity between the first feature curve and the second feature curve. One approach is to calculate the similarity based on the peak points corresponding to the first feature curve and the second feature curve, and the feature parameters corresponding to the peak points. Optionally, the similarity between the first feature curve and the second feature curve can be determined by calculating the Hausdorff distance or the Frechet distance.
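The two distance measures mentioned above can be sketched for curves represented as lists of (time, value) points. This is an illustrative sketch: the point representation and the function names are assumptions, and the Frechet variant shown is the discrete form computed by dynamic programming.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def hausdorff(p: List[Point], q: List[Point]) -> float:
    """Symmetric Hausdorff distance between two point sets."""
    def directed(a: List[Point], b: List[Point]) -> float:
        return max(min(math.dist(x, y) for y in b) for x in a)
    return max(directed(p, q), directed(q, p))


def discrete_frechet(p: List[Point], q: List[Point]) -> float:
    """Discrete Frechet distance between two polylines (dynamic programming)."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


# Two flat curves one unit apart in value.
curve_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
curve_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
d_h = hausdorff(curve_a, curve_b)
d_f = discrete_frechet(curve_a, curve_b)
```

Both functions return a distance, so a smaller value corresponds to a higher similarity; how the distance is converted into a similarity score is left open here.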
[0042] Alternatively, the peak points of the first and second characteristic curves can be counted separately, and then the similarity between the first and second characteristic curves can be calculated based on the difference in the number of peak points.
[0043] In some embodiments, the identity verification result may include a verification result indicating that the speech segment to be verified and the sample speech segment are from the same person (referred to as the first verification result for ease of description), and a verification result indicating that the speech segment to be verified and the sample speech segment are not from the same person (referred to as the second verification result for ease of description). Therefore, if, according to the above process, the similarity between the first feature curve and the second feature curve is lower than a similarity threshold, the identity verification result is determined to be the second verification result.
[0044] In this application, two speech sub-segments with the same phoneme sequence are identified from the speech segment to be tested and the sample speech segment. Then, speech features are extracted from these two sub-segments, generating two corresponding feature curves. The similarity between the two feature curves is then determined, and finally, the similarity is used to determine the identity test result between the speech segment to be tested and the sample speech segment. This application can determine whether different speech segments come from the same speaker based on the individual speech characteristics and similarity exhibited by the speakers in continuous speech segments, thus improving the accuracy of speech identity verification.
[0045] In some embodiments, prior to step 140, the method further includes: determining the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve, respectively; and aligning the same phonemes in the first feature curve and the second feature curve according to the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve.
[0046] Since the speech rate and rhythm of the first and second speech sub-segments with the same phoneme sequence may differ, and their pronunciation durations may also differ, the first and second feature curves need to be aligned. Alignment facilitates determining the identity between the speech segment to be tested and the sample speech segment based on the first and second feature curves, and reduces the influence of factors unrelated to speaker identity.
[0047] One approach is to align the first and second feature curves using phonemes as the alignment reference. Optionally, the timestamps of each phoneme in the first and second feature curves are determined, and the pronunciation duration of each phoneme is derived from these timestamps. Then, the pronunciation durations of each phoneme in the first and second feature curves are adjusted to be the same. Optionally, this can be achieved by stretching phonemes with shorter pronunciation durations or compressing phonemes with longer pronunciation durations, using interpolation or decimation (sample extraction) methods. This ensures that the lengths of the first and second feature curves are the same, and that the pronunciation durations of each phoneme in the first and second feature curves are the same, i.e., the feature curves of corresponding phonemes are aligned end-to-end.
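A minimal sketch of phoneme-referenced alignment is given below, assuming each curve is a list of per-frame values and each phoneme is described by hypothetical (start, end) frame indices with an exclusive end. Linear interpolation to a fixed number of frames per phoneme is one possible choice; the names `resample` and `align_by_phoneme` are illustrative only.

```python
from typing import List, Tuple


def resample(values: List[float], new_len: int) -> List[float]:
    """Linearly interpolate values onto new_len evenly spaced points."""
    if new_len == 1 or len(values) == 1:
        return [values[0]] * new_len
    scale = (len(values) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * scale
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def align_by_phoneme(curve: List[float], boundaries: List[Tuple[int, int]],
                     frames_per_phoneme: int) -> List[float]:
    """Stretch or compress every phoneme's portion of the curve to the same length.
    Each phoneme is assumed to span at least one frame."""
    aligned: List[float] = []
    for start, end in boundaries:
        aligned.extend(resample(curve[start:end], frames_per_phoneme))
    return aligned


# The same two phonemes spoken with different durations in each sub-segment.
first = align_by_phoneme([1.0, 3.0, 5.0, 5.0, 5.0, 5.0], [(0, 2), (2, 6)], 3)
second = align_by_phoneme([1.0, 2.0, 3.0, 5.0, 5.0], [(0, 3), (3, 5)], 3)
```

After this step both curves have the same length and each phoneme occupies the same index range in both, so they can be compared point by point.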
[0048] Optionally, similarity can be determined based on the aligned first feature curve and the aligned second feature curve, thereby enabling the identification of the identity between the speech segment to be tested and the sample speech segment based on the similarity.
[0049] Alternatively, the first and second speech sub-segments can be aligned first, and then the first and second feature curves can be determined based on the aligned first and second speech sub-segments.
[0050] In some embodiments, as shown in Figure 2, the first characteristic curve includes a first fundamental frequency characteristic curve, and the second characteristic curve includes a second fundamental frequency characteristic curve; step 140 includes the following steps:
[0051] Step 210: Determine the first peak point sequence based on each peak point in the first fundamental frequency characteristic curve, and determine the second peak point sequence based on each peak point in the second fundamental frequency characteristic curve.
[0052] Optionally, the fundamental frequency feature information of the first speech sub-segment and the fundamental frequency information of the second speech sub-segment can be extracted using methods such as autocorrelation algorithm, parallel processing method, cepstral method, and simplified inverse filtering method. The fundamental frequency feature curve is a feature curve determined based on the fundamental frequency features of each frame of the speech signal.
[0053] A peak point sequence refers to a sequence obtained by sorting all the peak points in the fundamental frequency characteristic curve according to their order on the curve. For example, if the first fundamental frequency characteristic curve has four peak points, A, B, C, and D, and their order on the curve is first, third, fourth, and second, then sorting them in order yields the first peak point sequence, "ADBC". The process for determining the second peak point sequence is the same as that for the first, and will not be repeated here.
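As a sketch, a peak point sequence can be obtained by scanning the curve in order and keeping the indices of local maxima. The definition of a peak used here (an interior point strictly higher than both neighbours) and the function name `peak_indices` are assumptions; the application does not fix a particular peak criterion.

```python
from typing import List


def peak_indices(curve: List[float]) -> List[int]:
    """Indices of strict local maxima, returned in their order along the curve."""
    return [
        i for i in range(1, len(curve) - 1)
        if curve[i - 1] < curve[i] > curve[i + 1]
    ]


# Hypothetical per-frame fundamental frequency values.
f0_curve = [0.0, 2.0, 1.0, 3.0, 1.0, 0.0, 4.0, 0.0]
peaks = peak_indices(f0_curve)
```

Because the scan proceeds from the start of the curve to the end, the resulting list is already sorted by position, which matches the ordering described above.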
[0054] Step 220: Determine the first position information of each peak point in the first peak point sequence and the second position information of each peak point in the second peak point sequence.
[0055] In one approach, the first position information can be the timestamp information of each peak point in the first peak point sequence, and the second position information can be the timestamp information of each peak point in the second peak point sequence. Optionally, the first fundamental frequency characteristic curve and the second fundamental frequency characteristic curve can be arranged according to the speech duration of the first speech sub-segment and the second speech sub-segment to obtain a first peak point sequence and a second peak point sequence with the same start time. Then, the timestamp of each peak point in the rearranged first peak point sequence (i.e., the time at which each peak point occurs) and the timestamp of each peak point in the rearranged second peak point sequence can be determined.
[0056] Alternatively, the first position information can be the fundamental frequency information corresponding to each peak point in the first peak point sequence, and the second position information can be the fundamental frequency information of each peak point in the second peak point sequence.
[0057] Step 230: Based on the first position information and the second position information, determine the mean of the positional deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence, and the variance of the positional deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence.
[0058] As one approach, when the first position information is the timestamp information of each peak point in the first peak point sequence and the second position information is the timestamp information of each peak point in the second peak point sequence, the positional deviation of each corresponding peak point can be calculated based on the timestamp information of each peak point in the first peak point sequence and the timestamp information of the corresponding peak points in the second peak point sequence. Then, the mean and variance of this positional deviation are calculated based on these deviations. Here, a corresponding peak point refers to the point in the first peak point sequence that corresponds to the peak point in the second peak point sequence. For example, the corresponding peak point of the first peak point in the first peak point sequence is the first peak point in the second peak point sequence. Optionally, if the number of peak points in the first peak point sequence is different from the number of peak points in the second peak point sequence, then the corresponding peak points refer to the two peak points that are closest in position in the first and second peak point sequences.
[0059] Alternatively, the positional deviation of each corresponding peak point can be calculated using the fundamental frequency value of each corresponding peak point in the first peak point sequence and the second peak point sequence, and then the mean and variance can be calculated based on the positional deviation of each corresponding peak point.
[0060] Step 240: Determine the similarity between the first fundamental frequency characteristic curve and the second fundamental frequency characteristic curve based on the mean and the variance.
[0061] One approach is to perform a weighted calculation on the mean and variance, and then use the weighted value as the similarity between the first and second fundamental frequency characteristic curves. Optionally, the weighted value can be mapped to the interval [0,1], or to 0-100%. The required mapping interval can be set according to actual needs, and no specific limitation is made here. Optionally, the weighted calculation on the mean and variance can be a weighted average of the mean and variance.
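Steps 220 to 240 can be illustrated with the following sketch, which takes peak timestamps as position information. The equal weights, the use of absolute deviations, and the mapping 1 / (1 + score) onto the interval (0, 1] are assumptions chosen for the example; the application leaves the weights and the mapping open.

```python
from statistics import fmean, pvariance
from typing import List


def peak_position_similarity(pos_a: List[float], pos_b: List[float],
                             w_mean: float = 0.5, w_var: float = 0.5) -> float:
    """Similarity in (0, 1] from the mean and variance of peak position deviations."""
    if not pos_a or not pos_b:
        return 0.0
    if len(pos_a) == len(pos_b):
        # Same number of peaks: the k-th peak corresponds to the k-th peak.
        pairs = list(zip(pos_a, pos_b))
    else:
        # Different counts: pair each peak with the closest peak in the other sequence.
        pairs = [(a, min(pos_b, key=lambda b: abs(b - a))) for a in pos_a]
    deviations = [abs(a - b) for a, b in pairs]
    score = w_mean * fmean(deviations) + w_var * pvariance(deviations)
    # Assumed mapping: zero deviation gives 1.0, larger deviations approach 0.
    return 1.0 / (1.0 + score)


# Peak timestamps in seconds for two aligned fundamental frequency curves.
sim_same = peak_position_similarity([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
sim_shifted = peak_position_similarity([0.1, 0.5, 0.9], [0.2, 0.6, 1.0])
```

With this mapping, identical peak positions give a similarity of 1.0, and a uniform shift lowers the score through the mean term while the variance term stays near zero.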
[0062] In other embodiments, as shown in Figure 3, the first characteristic curve includes a first zero-crossing rate characteristic curve, and the second characteristic curve includes a second zero-crossing rate characteristic curve; step 140 includes the following steps:
[0063] Step 310: Determine the third peak point sequence based on each peak point in the first zero-crossing rate characteristic curve, and determine the fourth peak point sequence based on each peak point in the second zero-crossing rate characteristic curve.
[0064] The specific implementation of step 310 can be found in step 210, and will not be repeated here.
[0065] Step 320: Obtain the first number of peak points in the third peak point sequence and the second number of peak points in the fourth peak point sequence.
[0066] Because different people pronounce the same phoneme differently due to their pronunciation habits, the balance of voiceless and voiced sounds varies between speakers, and the zero-crossing rate can be used to indicate whether a speaker produces more voiceless or more voiced sounds in a speech segment. A high zero-crossing rate indicates voiceless sounds, while a low zero-crossing rate indicates voiced sounds. Since the third and fourth peak point sequences are the peak point sequences corresponding to the first and second zero-crossing rate characteristic curves, respectively, the probability that two speech segments containing the same phoneme originate from the same speaker can be determined by counting the number of peak points in the third and fourth peak point sequences.
[0067] Step 330: Determine the difference between the first quantity and the second quantity.
[0068] Optionally, the quantity difference can be the absolute value of the difference between the value of the first quantity and the value of the second quantity.
[0069] Step 340: Determine the similarity between the first zero-crossing rate characteristic curve and the second zero-crossing rate characteristic curve based on the quantity difference.
[0070] One approach is to map the value of the quantity difference to the interval [0,1], or to 0-100%. The required mapping interval can be set according to actual needs, and no specific limitation is made here. Optionally, the smaller the quantity difference, the higher the similarity between the first zero-crossing rate characteristic curve and the second zero-crossing rate characteristic curve; conversely, the larger the quantity difference, the lower the similarity.
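As a non-limiting illustration, steps 310 to 340 can be sketched in Python as follows, taking the zero-crossing rate characteristic curves as given lists of values. The helper names `peak_indices` and `count_similarity`, the definition of a peak point as a strict local maximum, and the mapping 1/(1+d) are assumptions made for this sketch only.

```python
def peak_indices(curve):
    """Indices of strict local maxima (one simple definition of a peak point)."""
    return [
        i for i in range(1, len(curve) - 1)
        if curve[i - 1] < curve[i] > curve[i + 1]
    ]


def count_similarity(first_count, second_count):
    """Map the absolute quantity difference to (0, 1].

    A difference of zero gives 1.0; larger differences give lower values.
    """
    quantity_difference = abs(first_count - second_count)
    return 1.0 / (1.0 + quantity_difference)


# Hypothetical zero-crossing rate curves for two speech sub-segments.
first_zcr = [0.10, 0.40, 0.20, 0.50, 0.10, 0.30, 0.05]
second_zcr = [0.10, 0.35, 0.20, 0.45, 0.15, 0.10, 0.05]

first_quantity = len(peak_indices(first_zcr))
second_quantity = len(peak_indices(second_zcr))
similarity = count_similarity(first_quantity, second_quantity)
```

Any monotonically decreasing mapping of the quantity difference would satisfy the relationship described above; the reciprocal form is chosen here only because it needs no tuning constant.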
[0071] In other embodiments, as shown in Figure 4, the first characteristic curve includes a first energy characteristic curve, and the second characteristic curve includes a second energy characteristic curve; step 140 includes the following steps:
[0072] Step 410: Determine the fifth peak point sequence based on each peak point in the first energy characteristic curve, and determine the sixth peak point sequence based on each peak point in the second energy characteristic curve.
[0073] The specific implementation of step 410 can be found in step 210, and will not be repeated here.
[0074] In some embodiments, before step 410, the method further includes: normalizing the first energy characteristic curve to obtain a normalized first energy characteristic curve, and normalizing the second energy characteristic curve to obtain a normalized second energy characteristic curve.
[0075] Optionally, since the amplitude of the energy feature curve is related to the volume of the speech, in order to eliminate the influence of the volume on the intensity value of the peak point, the first energy feature curve and the second energy feature curve are normalized before determining the fifth peak point sequence and the sixth peak point sequence.
[0076] One approach to normalizing the first and second energy characteristic curves is to pre-set an intensity threshold, limiting the maximum intensity values of the first and second energy characteristic curves to within the threshold, thereby adjusting the amplitude of the first and second energy characteristic curves.
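As a non-limiting illustration, one possible reading of this normalization is to scale each energy characteristic curve so that its maximum intensity value equals the preset intensity threshold. The function name `normalize_energy`, the default threshold of 1.0, and the choice of uniform scaling are assumptions made for this Python sketch.

```python
def normalize_energy(curve, intensity_threshold=1.0):
    """Scale a curve so its maximum intensity equals intensity_threshold.

    Curves that are empty or entirely non-positive are returned unchanged,
    since there is no meaningful peak to scale against.
    """
    peak = max(curve, default=0.0)
    if peak <= 0.0:
        return list(curve)
    factor = intensity_threshold / peak
    return [value * factor for value in curve]
```

Under this scaling, a quiet recording and a loud recording with the same energy contour produce the same normalized curve, so that volume no longer affects the intensity values of the peak points.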
[0077] Step 420: Determine the intensity value corresponding to each peak point in the fifth peak point sequence and the intensity value corresponding to each peak point in the sixth peak point sequence.
[0078] As one approach, the first and second energy characteristic curves can be the short-time energy characteristic curves of the first and second speech sub-segments. The short-time energy can be regarded as the square of the speech signal passed through a linear filter. Short-time energy can distinguish between voiced and unvoiced sounds, determine whether a segment contains speech or silence, demarcate initials and finals, and demarcate connected characters. The intensity value corresponding to each peak point can be regarded as a short-time intensity value, which can be used to determine whether different speech segments originate from the same person.
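As a non-limiting illustration, short-time energy can be computed frame by frame as the sum of squared samples. The Python sketch below assumes a rectangular analysis window; the function name `short_time_energy` and the parameters `frame_len` and `hop` are introduced here for illustration and do not appear in this application.

```python
def short_time_energy(samples, frame_len, hop):
    """Sum of squared samples in each frame (rectangular window).

    Frames start every `hop` samples; a trailing partial frame is dropped.
    """
    if frame_len < 1 or hop < 1:
        raise ValueError("frame_len and hop must be positive")
    return [
        sum(s * s for s in samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
```

The resulting list, indexed by frame, is one possible form of the energy characteristic curve whose peak points are examined in step 420.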
[0079] Step 430: Determine the first mean and first variance of the intensity values corresponding to each peak point in the fifth peak point sequence, and determine the second mean and second variance of the intensity values corresponding to each peak point in the sixth peak point sequence.
[0080] Optionally, the mean and variance of the intensity values in the energy characteristic curve of a speech segment can reflect the pronunciation habits of the speaker in that speech segment. The first mean is obtained by summing the intensity values corresponding to all peak points in the fifth peak point sequence and dividing the sum by the number of peak points in the fifth peak point sequence, as shown in the formula: $E = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $E$ is the first mean, $x_1, x_2, \ldots, x_{n-1}, x_n$ are the intensity values corresponding to each peak point, and $n$ is the number of peak points in the fifth peak point sequence. The first variance is the sum of the squared differences between the intensity value corresponding to each peak point in the fifth peak point sequence and the first mean, divided by the number of peak points in the fifth peak point sequence, as shown in the formula: $S^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - E)^2$, where $S^2$ is the first variance. The second mean and the second variance are determined in the same way as the first mean and the first variance, and the details will not be repeated here.
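As a non-limiting illustration, the mean and variance of the peak intensity values can be computed in Python as follows. The function name `peak_intensity_stats` and the definition of a peak point as a strict local maximum are assumptions of this sketch; the variance is the population variance, that is, the sum of squared differences divided by the number of peak points.

```python
from statistics import fmean, pvariance


def peak_intensity_stats(energy_curve):
    """Return (mean, variance) of the intensity values at the peak points."""
    peaks = [
        energy_curve[i]
        for i in range(1, len(energy_curve) - 1)
        if energy_curve[i - 1] < energy_curve[i] > energy_curve[i + 1]
    ]
    if not peaks:
        raise ValueError("the curve contains no peak points")
    # fmean sums the values and divides by their count;
    # pvariance divides the sum of squared differences by the count.
    return fmean(peaks), pvariance(peaks)
```

For example, a curve with peak intensities 1.0 and 3.0 has a mean of 2.0 and a variance of 1.0.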
[0081] Step 440: Determine the mean difference between the first mean and the second mean, and determine the variance difference between the first variance and the second variance.
[0082] As one method, the mean difference is the absolute value of the difference between the first mean and the second mean; the variance difference is the absolute value of the difference between the first variance and the second variance.
[0083] Step 450: Determine the similarity between the first energy characteristic curve and the second energy characteristic curve based on the mean difference and the variance difference.
[0084] Optionally, the mean difference can roughly reflect the similarity between the first energy characteristic curve and the second energy characteristic curve, but it is not comprehensive. Therefore, the variance difference needs to be added to jointly determine the similarity between the first energy characteristic curve and the second energy characteristic curve.
[0085] One approach is to perform a weighted calculation on the mean difference and variance difference, and then use the weighted value as the similarity between the first and second energy characteristic curves. Optionally, the weighted value can be mapped to the interval [0,1], or to 0-100%. The required mapping interval can be set according to actual needs, and no specific limitation is made here. Optionally, the weighted calculation on the mean difference and variance difference can be performed by taking a weighted average of the mean difference and variance difference.
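As a non-limiting illustration, the energy-based similarity can be sketched in Python as follows. The function name `energy_similarity`, the default weights, and the mapping 1/(1+x) onto the interval [0,1] are assumptions made for this sketch; each argument is a (mean, variance) pair of peak intensity values for one energy characteristic curve.

```python
def energy_similarity(first_stats, second_stats, w_mean=0.5, w_var=0.5):
    """Similarity of two energy curves from their peak-intensity statistics."""
    first_mean, first_var = first_stats
    second_mean, second_var = second_stats
    # Absolute differences, as described for the mean and variance differences.
    mean_diff = abs(first_mean - second_mean)
    var_diff = abs(first_var - second_var)
    # Weighted average of the two differences.
    weighted = (w_mean * mean_diff + w_var * var_diff) / (w_mean + w_var)
    # Hypothetical mapping to (0, 1]: no difference gives 1.0.
    return 1.0 / (1.0 + weighted)
```

The weights allow the mean difference and the variance difference to contribute unequally; suitable values would need to be chosen according to actual needs.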
[0086] Figure 5 is a flowchart illustrating a method for verifying voice identity according to an embodiment of this application. The method can be executed by an electronic device with computing capabilities, such as a desktop computer, laptop computer, or other terminal device. The method can also be executed interactively by a processing system including a server and a terminal. As shown in Figure 5, the method includes the following steps:
[0087] Step 510: Obtain the speech segment to be tested and the sample speech segment.
[0088] Step 520: Determine the first speech sub-segment in the speech segment to be tested and the second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments with the same phoneme sequence.
[0089] Step 530: Extract the first feature of the first speech sub-segment and the second feature of the second speech sub-segment respectively, and determine the first feature curve based on the first feature and the second feature curve based on the second feature.
[0090] For steps 510-530, refer to steps 110-130; the details will not be repeated here.
[0091] Step 540: Determine the similarity between the first feature curve and the second feature curve, and determine whether the similarity is greater than a similarity threshold.
[0092] As one approach, following the methods shown in Figures 2-4, feature curves can be obtained for different features of the first and second speech sub-segments, the differences between these features can be calculated from the respective feature curves, and the similarity between the first and second feature curves can finally be determined by weighting these differences. For example, a weighted calculation can be performed on the mean and variance of the positional deviations of corresponding peak points in the first and second fundamental frequency feature curves, the difference in the number of peak points in the first and second zero-crossing rate feature curves, and the mean difference and variance difference of the intensity values of peak points in the first and second energy feature curves. The resulting weighted value is then mapped to the interval [0,1], and the mapped value represents the similarity between the first and second feature curves.
[0093] Optionally, the correspondence between the weighted calculated values and the similarity can be preset, and then the similarity between the first feature curve and the second feature curve can be determined based on the weighted calculated values. The similarity threshold can be set according to actual needs and is not specifically limited here.
[0094] Step 550: If the similarity is greater than the similarity threshold, then it is confirmed that the speech segment to be tested and the sample speech segment belong to the same object.
[0095] When the similarity is greater than the similarity threshold, the first feature curve and the second feature curve can be considered to be the same. Since the first feature curve is determined based on the first speech sub-segment and the second feature curve is determined based on the second speech sub-segment, and the first and second speech sub-segments come from the speech segment to be tested and the sample speech segment respectively, it can be determined that the speech segment to be tested and the sample speech segment belong to the same object.
[0096] Optionally, since the speaker's identity in the sample speech segment is known, the speaker's identity in the speech segment to be examined can be determined. This method can be used in voiceprint identification.
[0097] In the scheme of this application, the similarity between the determined first feature curve and the second feature curve is compared with a similarity threshold. When the similarity is greater than the similarity threshold, it is confirmed that the speech segment to be tested and the sample speech segment belong to the same object, thereby improving the accuracy of speech identity testing.
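As a non-limiting illustration, steps 540 and 550 can be sketched in Python as follows. The function names `overall_similarity` and `belongs_to_same_object`, the dictionary keys, the weights, and the mapping 1/(1+x) are assumptions made for this sketch; the per-feature differences are taken as already computed from the fundamental frequency, zero-crossing rate, and energy feature curves.

```python
def overall_similarity(differences, weights):
    """Weight the per-feature differences and map the result to (0, 1]."""
    total_weight = sum(weights[name] for name in differences)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        weights[name] * value for name, value in differences.items()
    ) / total_weight
    return 1.0 / (1.0 + weighted)


def belongs_to_same_object(similarity, similarity_threshold):
    """Step 550: same object only if similarity exceeds the threshold."""
    return similarity > similarity_threshold


# Hypothetical difference values and equal weights.
differences = {
    "f0_deviation_mean": 0.10,
    "f0_deviation_variance": 0.02,
    "zcr_peak_count_difference": 1.0,
    "energy_mean_difference": 0.05,
    "energy_variance_difference": 0.01,
}
weights = {name: 1.0 for name in differences}
result = belongs_to_same_object(overall_similarity(differences, weights), 0.8)
```

Because the differences have different units and ranges, a practical implementation would likely need to scale them or tune the weights before combining them; this sketch leaves that choice open.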
[0098] Figure 6 is a block diagram of a voice identity verification device according to an embodiment of this application. As shown in Figure 6, the voice identity verification device 600 includes: an acquisition module 610, a speech segment determination module 620, a feature extraction module 630, and a determination module 640.
[0099] The acquisition module 610 is used to acquire the speech segment to be tested and the sample speech segment; the speech segment determination module 620 is used to determine the first speech sub-segment in the speech segment to be tested and the second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments with the same phoneme sequence; the feature extraction module 630 is used to extract the first feature of the first speech sub-segment and the second feature of the second speech sub-segment, and determine the first feature curve based on the first feature and the second feature curve based on the second feature; the determination module 640 is used to determine the similarity between the first feature curve and the second feature curve, and determine the test result of the identity of the speech segment to be tested and the sample speech segment based on the similarity.
[0100] In some embodiments, the voice identity verification device 600 further includes: a timestamp determination module, configured to determine the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve respectively; and an alignment module, configured to align the same phonemes in the first feature curve and the second feature curve according to the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve.
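As a non-limiting illustration, the alignment can be sketched in Python by resampling the portion of each feature curve that belongs to a phoneme so that the same phoneme occupies the same number of points in both curves. The function names `resample` and `align_phonemes`, the use of linear interpolation, the representation of timestamps as (start, end) frame indices, and the choice of the longer duration as the common length are all assumptions made for this sketch.

```python
def resample(values, target_len):
    """Linearly interpolate `values` onto `target_len` evenly spaced points."""
    if target_len < 1 or not values:
        raise ValueError("values must be non-empty and target_len positive")
    if len(values) == 1 or target_len == 1:
        return [float(values[0])] * target_len
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * step
        lo = min(int(pos), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[lo + 1] * frac)
    return out


def align_phonemes(first_curve, first_bounds, second_curve, second_bounds):
    """Stretch each phoneme to a common duration in both curves.

    Each bounds list holds one (start, end) frame index pair per phoneme,
    in the same phoneme order for both curves.
    """
    if len(first_bounds) != len(second_bounds):
        raise ValueError("both curves must cover the same phoneme sequence")
    aligned_first, aligned_second = [], []
    for (s1, e1), (s2, e2) in zip(first_bounds, second_bounds):
        target = max(e1 - s1, e2 - s2)  # assumed common duration
        aligned_first.extend(resample(first_curve[s1:e1], target))
        aligned_second.extend(resample(second_curve[s2:e2], target))
    return aligned_first, aligned_second
```

After alignment, both curves have the same length and each phoneme starts and ends at the same index in both, so that peak positions in the two curves can be compared directly.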
[0101] In some embodiments, the first feature curve includes a first fundamental frequency feature curve, and the second feature curve includes a second fundamental frequency feature curve; the determining module 640 includes: a peak point sequence first determining unit, configured to determine a first peak point sequence based on each peak point in the first fundamental frequency feature curve, and to determine a second peak point sequence based on each peak point in the second fundamental frequency feature curve; a position information determining unit, configured to determine first position information of each peak point in the first peak point sequence and second position information of each peak point in the second peak point sequence; a first difference determining unit, configured to determine, based on the first position information and the second position information, the mean of the position deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence, and the variance of the position deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence; and a first similarity determining unit, configured to determine the similarity between the first fundamental frequency feature curve and the second fundamental frequency feature curve based on the mean and the variance.
[0102] In other embodiments, the first feature curve includes a first zero-crossing rate feature curve, and the second feature curve includes a second zero-crossing rate feature curve; the determining module 640 includes: a peak point sequence second determining unit, configured to determine a third peak point sequence based on each peak point in the first zero-crossing rate feature curve, and to determine a fourth peak point sequence based on each peak point in the second zero-crossing rate feature curve; a quantity acquisition unit, configured to acquire a first quantity of peak points in the third peak point sequence and a second quantity of peak points in the fourth peak point sequence; a quantity difference determining unit, configured to determine the quantity difference between the first quantity and the second quantity; and a second similarity determining unit, configured to determine the similarity between the first zero-crossing rate feature curve and the second zero-crossing rate feature curve based on the quantity difference.
[0103] In other embodiments, the first feature curve includes a first energy feature curve, and the second feature curve includes a second energy feature curve; the determining module 640 includes: a third peak point sequence determining unit, configured to determine a fifth peak point sequence based on each peak point in the first energy feature curve, and a sixth peak point sequence based on each peak point in the second energy feature curve; an intensity value determining unit, configured to determine the intensity value corresponding to each peak point in the fifth peak point sequence and the intensity value corresponding to each peak point in the sixth peak point sequence; a second difference determining unit, configured to determine a first mean and a first variance of the intensity values corresponding to each peak point in the fifth peak point sequence, and a second mean and a second variance of the intensity values corresponding to each peak point in the sixth peak point sequence; a third difference determining unit, configured to determine the mean difference between the first mean and the second mean, and the variance difference between the first variance and the second variance; and a third similarity determining unit, configured to determine the similarity between the first energy feature curve and the second energy feature curve based on the mean difference and the variance difference.
[0104] In some embodiments, the determining module 640 further includes a normalization module, configured to normalize the first energy characteristic curve to obtain a normalized first energy characteristic curve, and to normalize the second energy characteristic curve to obtain a normalized second energy characteristic curve.
[0105] In some embodiments, the determining module 640 further includes: a determining unit, configured to determine whether the similarity is greater than a similarity threshold; and an identity determining unit, configured to confirm that the speech segment to be examined and the sample speech segment belong to the same object if the similarity is greater than the similarity threshold.
[0106] According to one aspect of the embodiments of this application, an electronic device is also provided. As shown in Figure 7, the electronic device 700 includes a processor 710 and one or more memories 720. The one or more memories 720 are used to store program instructions executed by the processor 710. When the processor 710 executes the program instructions, it implements the above-mentioned voice identity verification method.
[0107] Furthermore, the processor 710 may include one or more processing cores. The processor 710 runs or executes instructions, programs, code sets, or instruction sets stored in the memory 720, and retrieves data stored in the memory 720. Optionally, the processor 710 may be implemented using at least one hardware form selected from Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 710 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor and may be implemented using a separate communication chip.
[0108] According to one aspect of the embodiments of this application, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, causing the computer device to perform the methods of any of the above embodiments.
[0109] According to one aspect of this application, a computer-readable storage medium is also provided, which may be included in the electronic device described in the above embodiments; or it may exist independently and not assembled into the electronic device. The computer-readable storage medium carries computer-readable instructions that, when executed by a processor, implement the methods in any of the above embodiments.
[0110] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such transmitted data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.
[0111] The units described in the embodiments of this application can be implemented in software or hardware, and the described units can also be located in a processor. The names of these units do not necessarily limit the specific unit itself.
[0112] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.
[0113] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. Each block in a flowchart or block diagram may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0114] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein.
[0115] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
Claims
1. A method for verifying voice identity, characterized in that the method includes: obtaining a speech segment to be tested and a sample speech segment; determining a first speech sub-segment in the speech segment to be tested and a second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments with the same phoneme sequence, and the phoneme sequence includes multiple phonemes arranged consecutively; extracting a first feature of the first speech sub-segment and a second feature of the second speech sub-segment, and determining a first feature curve based on the first feature and a second feature curve based on the second feature, wherein the first feature curve includes a first fundamental frequency feature curve, a first zero-crossing rate feature curve and a first energy feature curve, and the second feature curve includes a second fundamental frequency feature curve, a second zero-crossing rate feature curve and a second energy feature curve; determining the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve; determining the pronunciation duration of each phoneme in the same phoneme sequence based on the timestamps of each phoneme in the first feature curve, and determining the pronunciation duration of each phoneme in the same phoneme sequence based on the timestamps of each phoneme in the second feature curve; aligning the same phonemes in the first feature curve and the second feature curve so that the pronunciation durations of the same phonemes in the first feature curve and the second feature curve are adjusted to be the same; and determining the similarity between the aligned first feature curve and the aligned second feature curve, and determining the test result of the identity of the speech segment to be tested and the sample speech segment based on the similarity; wherein determining the similarity between the aligned first feature curve and the aligned second feature curve includes: performing a weighted calculation on the mean and variance of the positional deviations of corresponding peak points in the aligned first fundamental frequency characteristic curve and the aligned second fundamental frequency characteristic curve, the difference in the number of peak points in the aligned first zero-crossing rate characteristic curve and the aligned second zero-crossing rate characteristic curve, and the mean difference and variance difference of the intensity values of the peak points in the aligned first energy characteristic curve and the aligned second energy characteristic curve; and mapping the weighted value to [0, 1] to obtain the similarity between the aligned first feature curve and the aligned second feature curve.
2. The method of claim 1, wherein the mean and variance of the positional deviations of corresponding peak points in the aligned first fundamental frequency characteristic curve and the aligned second fundamental frequency characteristic curve are determined according to the following process: determining a first peak point sequence based on each peak point in the aligned first fundamental frequency characteristic curve, and determining a second peak point sequence based on each peak point in the aligned second fundamental frequency characteristic curve; determining first position information of each peak point in the first peak point sequence and second position information of each peak point in the second peak point sequence; and determining, based on the first position information and the second position information, the mean of the positional deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence, and the variance of the positional deviations of each corresponding peak point in the first peak point sequence and the second peak point sequence.
3. The method of claim 1, wherein the difference in the number of peak points in the aligned first zero-crossing rate characteristic curve and the aligned second zero-crossing rate characteristic curve is determined according to the following process: determining a third peak point sequence based on each peak point in the aligned first zero-crossing rate characteristic curve, and determining a fourth peak point sequence based on each peak point in the aligned second zero-crossing rate characteristic curve; obtaining a first number of peak points in the third peak point sequence and a second number of peak points in the fourth peak point sequence; and determining the difference between the first number and the second number as the difference in the number of peak points in the aligned first zero-crossing rate characteristic curve and the aligned second zero-crossing rate characteristic curve.
4. The method of claim 1, wherein the mean difference and variance difference of the intensity values of the peak points in the aligned first energy characteristic curve and the aligned second energy characteristic curve are determined according to the following process: determining a fifth peak point sequence based on each peak point in the aligned first energy characteristic curve, and determining a sixth peak point sequence based on each peak point in the aligned second energy characteristic curve; determining the intensity value corresponding to each peak point in the fifth peak point sequence and the intensity value corresponding to each peak point in the sixth peak point sequence; determining a first mean and a first variance of the intensity values corresponding to each peak point in the fifth peak point sequence, and determining a second mean and a second variance of the intensity values corresponding to each peak point in the sixth peak point sequence; determining the mean difference between the first mean and the second mean as the mean difference of the intensity values of the peak points in the aligned first energy characteristic curve and the aligned second energy characteristic curve; and determining the variance difference between the first variance and the second variance as the variance difference of the intensity values of the peak points in the aligned first energy characteristic curve and the aligned second energy characteristic curve.
5. The method of claim 4, wherein, before determining the fifth peak point sequence based on each peak point in the aligned first energy characteristic curve and determining the sixth peak point sequence based on each peak point in the aligned second energy characteristic curve, the method further includes: normalizing the aligned first energy characteristic curve to obtain a normalized first energy characteristic curve, and normalizing the aligned second energy characteristic curve to obtain a normalized second energy characteristic curve.
6. The method of claim 1, wherein determining the test result of the identity of the speech segment to be tested and the sample speech segment based on the similarity includes: determining whether the similarity is greater than a similarity threshold; and if the similarity is greater than the similarity threshold, confirming that the speech segment to be tested and the sample speech segment belong to the same object.
7. A device for verifying voice identity, characterized in that the device includes: an acquisition module, used to acquire a speech segment to be tested and a sample speech segment; a speech segment determination module, used to determine a first speech sub-segment in the speech segment to be tested and a second speech sub-segment in the sample speech segment, wherein the first speech sub-segment and the second speech sub-segment are speech sub-segments with the same phoneme sequence, and the phoneme sequence includes multiple phonemes arranged consecutively; a feature extraction module, used to extract a first feature of the first speech sub-segment and a second feature of the second speech sub-segment, and to determine a first feature curve based on the first feature and a second feature curve based on the second feature, wherein the first feature curve includes a first fundamental frequency feature curve, a first zero-crossing rate feature curve and a first energy feature curve, and the second feature curve includes a second fundamental frequency feature curve, a second zero-crossing rate feature curve and a second energy feature curve; a timestamp determination module, used to determine the timestamps of each phoneme in the first feature curve and the timestamps of each phoneme in the second feature curve, to determine the pronunciation duration of each phoneme in the same phoneme sequence based on the timestamps of each phoneme in the first feature curve, and to determine the pronunciation duration of each phoneme in the same phoneme sequence based on the timestamps of each phoneme in the second feature curve; an alignment module, used to align the same phonemes in the first feature curve and the second feature curve so that the pronunciation durations of the same phonemes in the first feature curve and the second feature curve are adjusted to be the same; and a determination module, used to determine the similarity between the aligned first feature curve and the aligned second feature curve, and to determine the test result of the identity of the speech segment to be tested and the sample speech segment based on the similarity; wherein determining the similarity between the aligned first feature curve and the aligned second feature curve includes: performing a weighted calculation on the mean and variance of the positional deviations of corresponding peak points in the aligned first fundamental frequency characteristic curve and the aligned second fundamental frequency characteristic curve, the difference in the number of peak points in the aligned first zero-crossing rate characteristic curve and the aligned second zero-crossing rate characteristic curve, and the mean difference and variance difference of the intensity values of the peak points in the aligned first energy characteristic curve and the aligned second energy characteristic curve; and mapping the weighted value to [0, 1] to obtain the similarity between the aligned first feature curve and the aligned second feature curve.
8. An electronic device, comprising: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the method as described in any one of claims 1-6.
9. A computer-readable storage medium having stored thereon computer-readable instructions that, when executed by a processor, implement the method as described in any one of claims 1-6.