Speaker recognition method and apparatus for audio

By extracting and clustering features from audio frames, combined with centroid vectors and smoothing, the problem of low accuracy in speaker recognition methods in complex scenarios in existing technologies is solved, and higher recognition accuracy is achieved.

CN116312565B · Active · Publication Date: 2026-06-16 · SF TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SF TECH CO LTD
Filing Date: 2021-12-21
Publication Date: 2026-06-16

Figure CN116312565B_ABST

Abstract

The application provides a speaker recognition method and device for audio, the speaker recognition method for audio comprising: obtaining sound features of a plurality of first audio frames in audio to be recognized and a plurality of preset speaker sound features; determining a first speaker recognition result of each first audio frame according to the sound features of the plurality of first audio frames and the plurality of preset speaker sound features; performing smoothing processing on the first speaker recognition results of the plurality of first audio frames to obtain a second speaker recognition result; and updating the preset speaker sound features according to the second speaker recognition result and performing speaker recognition again to obtain a target speaker recognition result. The application can obtain a more accurate speaker recognition result, thereby improving the accuracy of the speaker recognition method for audio.

Description

Technical Field

[0001] This application mainly relates to the field of speech technology, specifically to a method and apparatus for speaker recognition in audio.

Background Technology

[0002] Dynamic speaker separation primarily involves identifying the number of speakers in a long audio clip with multiple speakers and detecting the start and end timestamps of each speaker's speech. This addresses the question of "Who speaks when," enabling quick retrieval and location of specific speaker segments. It forms the foundation for subsequent modules such as speech recognition and voiceprint recognition, and is widely used in customer service and meeting scenarios. However, current audio speaker recognition methods have relatively low accuracy, especially in complex scenarios where speakers overlap.

[0003] In other words, the accuracy of existing speaker recognition methods in audio is relatively low.

Summary of the Invention

[0004] This application provides a method and apparatus for speaker recognition in audio, aiming to solve the problem of low accuracy in existing speaker recognition methods for audio.

[0005] In a first aspect, this application provides a speaker recognition method for audio, the method comprising:

[0006] Acquire the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified;

[0007] The first speaker recognition result for each first audio frame is determined based on the sound features of the plurality of first audio frames and the plurality of preset speaker sound features;

[0008] The first speaker recognition results of the multiple first audio frames are smoothed to obtain a second speaker recognition result;

[0009] The preset speaker voice features are updated based on the second speaker recognition result, and speaker recognition is performed again to obtain the target speaker recognition result.

[0010] Optionally, acquiring the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified includes:

[0011] The audio to be identified is divided into multiple audio segments, wherein each audio segment contains at least two first audio frames;

[0012] Cluster the multiple audio segments to obtain multiple audio segment sets;

[0013] The speaker voice features corresponding to each set of audio segments are determined based on each set of audio segments, thereby obtaining the multiple preset speaker voice features.

[0014] Optionally, the step of determining the speaker voice features corresponding to each of the audio segment sets based on each of the audio segment sets to obtain the plurality of preset speaker voice features includes:

[0015] Obtain the centroid vector of the audio segment set;

[0016] Calculate the similarity between each audio segment in the audio segment set and the centroid vector;

[0017] Obtain multiple target audio segments whose similarity to the centroid vector is higher than a first preset similarity;

[0018] Based on the sound features of the multiple target audio segments, the speaker's voice features corresponding to the audio segment set are determined, and the multiple preset speaker voice features are obtained.

[0019] Optionally, the step of updating the preset speaker voice features based on the second speaker recognition result and performing speaker recognition again to obtain the target speaker recognition result includes:

[0020] Obtain audio frames belonging to the same speaker from the second speaker recognition result;

[0021] The sound features of audio frames belonging to the same speaker are fused to obtain the speaker's sound features for each speaker.

[0022] The speaker voice features of each speaker are redefined as the preset speaker voice features, and speaker recognition and smoothing are performed to obtain the target speaker recognition result.
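As an illustrative sketch only, the update step described above can be pictured as grouping frame features by their smoothed speaker label and fusing each group. The text does not fix a fusion rule, so element-wise averaging is assumed here; the function names, the two-dimensional feature vectors, and the speaker labels are hypothetical.

```python
def fuse(vectors):
    """Fuse feature vectors by element-wise averaging (assumed fusion rule)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def update_preset_features(frame_features, smoothed_speakers):
    """Group frame features by smoothed speaker label and fuse each group.

    frame_features: list of feature vectors, one per audio frame.
    smoothed_speakers: dict mapping frame index -> speaker label taken from
    the second (smoothed) speaker recognition result.
    """
    grouped = {}
    for index, speaker in smoothed_speakers.items():
        grouped.setdefault(speaker, []).append(frame_features[index])
    return {speaker: fuse(vectors) for speaker, vectors in grouped.items()}


# Toy data: four frames, two speakers (values are made up for illustration).
frame_features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
smoothed_speakers = {0: "speaker1", 1: "speaker1", 2: "speaker2", 3: "speaker2"}
updated = update_preset_features(frame_features, smoothed_speakers)
```

The returned dictionary would then take the place of the preset speaker voice features for the second round of recognition and smoothing.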

[0023] Optionally, determining the first speaker recognition result for each first audio frame based on the sound features of the plurality of first audio frames and the plurality of preset speaker sound features includes:

[0024] Each of the plurality of first audio frames is taken in turn as a target audio frame;

[0025] The sound features of the target audio frame are matched with the sound features of the plurality of preset speakers to obtain the matching speaker that matches the target audio frame;

[0026] The matching speaker of each of the first audio frames is determined as the first speaker recognition result of each of the first audio frames.

[0027] Optionally, the smoothing process of the first speaker recognition results of multiple first audio frames to obtain the second speaker recognition result includes:

[0028] Frames are extracted from the multiple first audio frames using a preset time window and a preset step size to obtain multiple first frame sets. Each first frame set includes multiple second audio frames, the number of which is less than the number of first audio frames. The second audio frames in the same first frame set are consecutive in time.

[0029] Each of the multiple second audio frames in each of the first frame sets is smoothed to obtain the second speaker recognition result.

[0030] Optionally, the step of smoothing multiple second audio frames in each of the first frame sets to obtain the second speaker recognition result includes:

[0031] The overlapping audio frames in the first frame set are removed to obtain the second frame set, wherein the number of speakers matched by the overlapping audio frames is at least two;

[0032] Determine whether a target speaker exists in the second frame set, wherein the number of second audio frames in the first frame set that match the target speaker is greater than a preset value;

[0033] If the target speaker exists in the second frame set, the speaker of each second audio frame in the second frame set is identified as the target speaker.

[0034] In a second aspect, this application provides an audio speaker recognition device, the device comprising:

[0035] The acquisition unit is used to acquire the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified.

[0036] The determining unit is configured to determine the first speaker recognition result of each first audio frame based on the sound features of the plurality of first audio frames and the plurality of preset speaker sound features;

[0037] A smoothing processing unit is used to smooth the first speaker recognition results of multiple first audio frames to obtain a second speaker recognition result;

[0038] The update iteration unit is used to update the preset speaker voice features based on the second speaker recognition result and perform speaker recognition again to obtain the target speaker recognition result.

[0039] Optionally, the acquisition unit is configured to:

[0040] The audio to be identified is divided into multiple audio segments, wherein each audio segment contains at least two first audio frames;

[0041] Cluster the multiple audio segments to obtain multiple audio segment sets;

[0042] The speaker voice features corresponding to each set of audio segments are determined based on each set of audio segments, thereby obtaining the multiple preset speaker voice features.

[0043] Optionally, the acquisition unit is configured to:

[0044] Obtain the centroid vector of the audio segment set;

[0045] Calculate the similarity between each audio segment in the audio segment set and the centroid vector;

[0046] Obtain multiple target audio segments whose similarity to the centroid vector is higher than a first preset similarity;

[0047] Based on the sound features of the multiple target audio segments, the speaker's voice features corresponding to the audio segment set are determined, and the multiple preset speaker voice features are obtained.

[0048] Optionally, the update iteration unit is used for:

[0049] Obtain audio frames belonging to the same speaker from the second speaker recognition result;

[0050] The sound features of audio frames belonging to the same speaker are fused to obtain the speaker's sound features for each speaker.

[0051] The speaker voice features of each speaker are redefined as the preset speaker voice features, and speaker recognition and smoothing are performed to obtain the target speaker recognition result.

[0052] Optionally, the determining unit is configured to:

[0053] Each of the plurality of first audio frames is taken in turn as a target audio frame;

[0054] The sound features of the target audio frame are matched with the sound features of the plurality of preset speakers to obtain the matching speaker that matches the target audio frame;

[0055] The matching speaker of each of the first audio frames is determined as the first speaker recognition result of each of the first audio frames.

[0056] Optionally, the smoothing processing unit is used for:

[0057] Frames are extracted from the multiple first audio frames using a preset time window and a preset step size to obtain multiple first frame sets. Each first frame set includes multiple second audio frames, the number of which is less than the number of first audio frames. The second audio frames in the same first frame set are consecutive in time.

[0058] Each of the multiple second audio frames in each of the first frame sets is smoothed to obtain the second speaker recognition result.

[0059] Optionally, the smoothing processing unit is used for:

[0060] The overlapping audio frames in the first frame set are removed to obtain the second frame set, wherein the number of speakers matched by the overlapping audio frames is at least two;

[0061] Determine whether a target speaker exists in the second frame set, wherein the number of second audio frames in the first frame set that match the target speaker is greater than a preset value;

[0062] If the target speaker exists in the second frame set, the speaker of each second audio frame in the second frame set is identified as the target speaker.

[0063] In a third aspect, this application provides a computer device, the computer device comprising:

[0064] One or more processors;

[0065] Memory; and

[0066] One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the speaker recognition method for audio according to any implementation of the first aspect.

[0067] In a fourth aspect, this application provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the speaker recognition method for audio according to any implementation of the first aspect.

[0068] This application provides a method and apparatus for speaker recognition in audio. The speaker recognition method includes: acquiring sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be recognized; determining a first speaker recognition result for each first audio frame based on the sound features of the multiple first audio frames and the multiple preset speaker sound features; smoothing the first speaker recognition results of the multiple first audio frames to obtain a second speaker recognition result; updating the preset speaker sound features based on the second speaker recognition result and performing speaker recognition again to obtain a target speaker recognition result. This application first determines the first recognition result based on the preset speaker sound features and the audio to be recognized. After obtaining the first speaker recognition results for each frame, it smooths the first speaker recognition results and uses the smoothed results to update the preset speaker sound features before performing speaker recognition again. This results in a more accurate speaker recognition result, thereby improving the accuracy of the audio speaker recognition method.

Attached Figure Description

[0069] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0070] Figure 1 is a schematic diagram of a speaker recognition system for audio provided in an embodiment of this application;

[0071] Figure 2 is a schematic flowchart of an embodiment of the speaker recognition method for audio provided in this application;

[0072] Figure 3 is a schematic diagram of an embodiment of the audio speaker recognition device provided in this application;

[0073] Figure 4 is a schematic diagram of an embodiment of the computer device provided in this application.

Detailed Implementation

[0074] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0075] In the description of this application, it should be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships based on the orientation or positional relationships shown in the accompanying drawings, are used only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, features defined with "first" and "second" may explicitly or implicitly include one or more features. In the description of this application, "a plurality of" means two or more, unless otherwise explicitly specified.

[0076] In this application, the term "exemplary" is used to mean "used as an example, illustration, or description." Any embodiment described as "exemplary" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use this application. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that this application can be made without using these specific details. In other instances, well-known structures and processes are not described in detail to avoid obscuring the description of this application with unnecessary detail. Therefore, this application is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.

[0077] This application provides a speaker recognition method and apparatus for audio, which will be described in detail below.

[0078] Refer to Figure 1, a schematic diagram of an audio speaker recognition system provided in an embodiment of this application. The audio speaker recognition system may include a computer device 100, in which an audio speaker recognition device is integrated.

[0079] In this embodiment, the computer device 100 can be a standalone server, a server network, or a server cluster. For example, the computer device 100 described in this embodiment includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers. The cloud server is composed of a large number of computers or network servers based on cloud computing.

[0080] In this embodiment, the computer device 100 described above can be a general-purpose computer device or a special-purpose computer device. In specific implementations, the computer device 100 can be a desktop computer, a portable computer, a network server, a handheld computer (Personal Digital Assistant, PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc. This embodiment does not limit the type of computer device 100.

[0081] Those skilled in the art will understand that the application environment shown in Figure 1 is merely one application scenario of the solution in this application and does not constitute a limitation on the application scenarios of the solution. Other application environments may include more or fewer computer devices than shown in Figure 1; for example, only one computer device is shown in Figure 1. It is understood that the speaker recognition system for audio may also include one or more other computer devices capable of processing data, which is not specifically limited here.

[0082] In addition, as shown in Figure 1, the speaker recognition system for audio may also include a memory 200 for storing data.

[0083] It should be noted that the scenario diagram of the audio speaker recognition system shown in Figure 1 is merely an example. The audio speaker recognition system and scenario described in this application embodiment are intended to more clearly illustrate the technical solutions of this application embodiment and do not constitute a limitation on the technical solutions provided in this application embodiment. As those skilled in the art will know, with the evolution of audio speaker recognition systems and the emergence of new business scenarios, the technical solutions provided in this application embodiment are also applicable to similar technical problems.

[0084] First, this application provides an audio speaker recognition method and apparatus. The audio speaker recognition method includes: acquiring sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be recognized; determining a first speaker recognition result for each first audio frame based on the sound features of the multiple first audio frames and the multiple preset speaker sound features; smoothing the first speaker recognition results of the multiple first audio frames to obtain a second speaker recognition result; updating the preset speaker sound features based on the second speaker recognition result and performing speaker recognition again to obtain a target speaker recognition result.

[0085] As shown in Figure 2, which is a schematic flowchart of an embodiment of the audio speaker recognition method in this application, the method includes the following steps S201 to S204:

[0086] S201. Obtain the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified.

[0087] In this embodiment of the application, before acquiring the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified, the process may include: acquiring the audio to be identified.

[0088] The audio to be identified can be customer service recordings, meeting recordings, etc. Each preset speaker voice feature is the voice feature of a single speaker, represented by an i-vector (ivector). For example, the multiple preset speaker voice features are: ivector1 for speaker 1; ivector2 for speaker 2; and ivector3 for speaker 3.

[0089] In this embodiment of the application, obtaining the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified may include:

[0090] (1) Divide the audio to be identified into multiple audio segments.

[0091] Each audio segment must contain at least two first audio frames. The number of first audio frames in each audio segment can be the same or different. For example, if the audio to be identified is 10 seconds long and each frame plays for 25ms, then there are a total of 400 audio frames. If the audio to be identified is divided into 10 audio segments, then each audio segment contains 40 audio frames. The length of the audio segment can be set according to the specific situation.
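The arithmetic in this example can be checked with a short sketch. It assumes non-overlapping 25 ms frames and contiguous segments of near-equal size; the helper name `split_into_segments` is hypothetical and the patent text does not prescribe how the segment boundaries are chosen.

```python
def split_into_segments(num_frames, num_segments):
    """Split frame indices 0..num_frames-1 into contiguous, near-equal segments."""
    base, extra = divmod(num_frames, num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments get one additional frame when the
        # frame count does not divide evenly.
        size = base + (1 if i < extra else 0)
        segments.append(range(start, start + size))
        start += size
    return segments


audio_ms = 10_000   # 10 seconds of audio
frame_ms = 25       # each frame covers 25 ms (non-overlapping frames assumed)
num_frames = audio_ms // frame_ms
segments = split_into_segments(num_frames, 10)
segment_sizes = [len(segment) for segment in segments]
```

With these values, `num_frames` is 400 and every entry of `segment_sizes` is 40, matching the figures given in the paragraph above.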

[0092] In this embodiment, the sound features of the first audio frames can be MFCC (Mel-Frequency Cepstral Coefficients) features. Sound features are extracted from the audio frames of the audio to be identified, yielding the sound features of the multiple first audio frames.

[0093] (2) Cluster multiple audio segments to obtain multiple audio segment sets.

[0094] In this embodiment, the multiple audio segments can be clustered, based on a preset number K, using algorithms such as K-MEANS, K-MEDOIDS, CLARANS, BIRCH, CURE, or CHAMELEON to obtain K audio segment sets. The preset number K is the number of clusters. Audio segments within the same audio segment set are relatively similar.

[0095] Specifically, the preset number can be determined from manual experience. For example, the preset number K for customer service recordings is 2. Preferably, the multiple audio segments are clustered using multiple different cluster numbers to obtain multiple clustering results. The sum of squared errors (SSE) of each clustering result is calculated, the optimal number of clusters is determined from these SSE values using the elbow method, and this optimal number of clusters is set as the preset number. At the optimal number of clusters, the curve formed by plotting SSE on the ordinate against the number of clusters on the abscissa has the highest curvature. Of course, in other embodiments, the optimal number of clusters can also be obtained using the silhouette coefficient method.

[0096] The core idea of the elbow method is that as the number of clusters K increases, the sample partitioning becomes more refined, and the aggregation degree of each cluster gradually increases, thus the sum of squared errors (SSE) naturally decreases. Furthermore, when the number of clusters K is less than the optimal number of clusters, increasing the number of clusters K significantly increases the aggregation degree of each cluster, resulting in a large decrease in SSE. However, when the number of clusters K reaches the optimal number of clusters, the aggregation degree gain from further increasing the number of clusters K decreases rapidly, so the decrease in SSE slows down sharply. Then, as the number of clusters K continues to increase, the decrease tends to level off. In other words, the relationship between SSE and the number of clusters K resembles the shape of an elbow, and the K value corresponding to this elbow is the optimal number of clusters for the data. This is why the method is called the elbow method.
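One simple way to locate the elbow, offered here only as a sketch, is to pick the K at which the drop in SSE slows the most, that is, where the discrete second difference of the SSE curve is largest. This is an assumed stand-in for "highest curvature"; the function name `elbow_k` and the SSE values are made up for illustration, and the K values are assumed to be consecutive integers.

```python
def elbow_k(sse_by_k):
    """Return the K whose SSE curve bends most sharply.

    sse_by_k: dict mapping cluster count K -> SSE of that clustering.
    The bend at K is (drop from K-1 to K) minus (drop from K to K+1),
    so the first and last K cannot be selected.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Illustrative SSE values only: large drops up to K = 3, small drops afterwards.
sse_by_k = {1: 1000.0, 2: 700.0, 3: 200.0, 4: 180.0, 5: 170.0, 6: 165.0}
optimal_k = elbow_k(sse_by_k)
```

For the illustrative values above, the largest bend occurs at K = 3, which would then be used as the preset number.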

[0097] (3) Determine the speaker voice features corresponding to each audio segment set based on each audio segment set to obtain multiple preset speaker voice features.

[0098] In a specific embodiment, speaker voice features corresponding to each audio segment set are determined based on each audio segment set to obtain multiple preset speaker voice features, which may include:

[0099] (1) Obtain the centroid vector of the audio segment set.

[0100] Here, the centroid vector of the audio segment set is the average of the sound features of each audio segment in the set. This can be obtained when clustering multiple audio segments. The sound features of each audio segment are represented using an xvector.

[0101] (2) Calculate the similarity between each audio segment in the audio segment set and the centroid vector.

[0102] In this application example, the similarity between the sound features of each audio segment and the centroid vector can be represented by the cosine similarity between the sound features of each audio segment and the centroid vector.

[0103] (3) Obtain multiple target audio segments whose similarity to the centroid vector is higher than the first preset similarity.

[0104] In this embodiment of the application, the first preset similarity can be 80%, 70%, etc., and can be set according to the specific situation.

[0105] (4) Based on the sound features of the multiple target audio segments, determine the speaker voice features corresponding to the audio segment set, and obtain the multiple preset speaker voice features.

[0106] Specifically, the sound features of the multiple target audio segments are fused to obtain the speaker voice features corresponding to the audio segment set. Selecting the target audio segments with high similarity for fusion and using the result as the preset speaker voice features can improve the accuracy of the preset speaker voice features. Of course, in other embodiments, the sound features of all audio segments in the audio segment set can be fused, without selecting by similarity, to obtain the preset speaker voice features.
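Steps (1) to (4) can be sketched as follows. The centroid is the mean of the segment features and the similarity is cosine similarity, as stated above; the fusion rule is not specified in the text, so averaging the selected segments is an assumption, as is the fallback to the centroid when no segment passes the threshold. All names and the toy two-dimensional vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def preset_speaker_feature(segment_vectors, min_similarity=0.8):
    """Derive one preset speaker feature from one audio segment set."""
    centroid = mean_vector(segment_vectors)                      # step (1)
    targets = [v for v in segment_vectors                        # steps (2), (3)
               if cosine_similarity(v, centroid) > min_similarity]
    # Step (4): fuse by averaging (assumed); fall back to the centroid if
    # no segment exceeds the threshold (also an assumption).
    return mean_vector(targets) if targets else centroid


# Toy segment set: two similar segments and one outlier.
segment_set = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
feature = preset_speaker_feature(segment_set, min_similarity=0.8)
```

In this toy set the outlier `[0.0, 1.0]` has a cosine similarity of roughly 0.5 to the centroid and is excluded, so the fused feature is the average of the first two segments only.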

[0107] S202. Determine the first speaker recognition result for each first audio frame based on the sound features of multiple first audio frames and multiple preset speaker sound features.

[0108] In one specific embodiment, determining the first speaker recognition result for each first audio frame based on the sound features of multiple first audio frames and multiple preset speaker sound features includes:

[0109] (1) Determine multiple first audio frames as target audio frames respectively.

[0110] (2) Match the sound features of the target audio frame with the sound features of multiple preset speakers to obtain the matching speaker that matches the target audio frame.

[0111] In this embodiment, the sound features of the target audio frame are input into a personal speech detection model together with each of the multiple preset speaker sound features. For each preset speaker sound feature, the model outputs a detection result indicating whether the sound features of the target audio frame match it, thereby identifying the matching speaker of the target audio frame. A typical voice activity detection (VAD) system uses a frame-level classifier on sound features to make a speech / non-speech decision for each audio frame. The personal speech detection model (Personal VAD, speaker-conditioned voice activity detection) generates frame-level category labels for three categories: non-speech (ns), target speaker speech (tss), and non-target speaker speech (ntss). When the personal speech detection model outputs target speaker speech, the sound features of the target audio frame are determined to match the preset speaker sound features.

[0112] The training data for the personal speech detection model consists of labeled audio files, their text, and speaker names. Assume there are 5 speakers with 10 audio files each. For each speaker, the first 5 audio files are combined to extract the speaker's voiceprint feature (ivector), and the remaining 5 audio files are used as training data. Forced alignment is applied to the training audio and text, yielding a phoneme label for each audio frame. During network training, the input to the network is the acoustic features extracted from each frame together with the voiceprint feature (ivector) representing the speaker. The network is a deep network model, such as CNN + BiLSTM. The network output is 2-dimensional: the first dimension represents non-target speakers and the second dimension represents the target speaker, each as a probability value. After training, at test time, the dimension with the higher probability value determines the label assigned to the frame.

[0113] (3) The matched speaker of each first audio frame is determined as the first speaker recognition result of each first audio frame.

[0114] For example, the preset speaker voice features are as follows: the preset speaker voice feature for speaker 1 is ivector1; the preset speaker voice feature for speaker 2 is ivector2; and the preset speaker voice feature for speaker 3 is ivector3.

[0115] The sound features of the target audio frame are input into the personal speech detection model along with the preset speaker sound features of speaker 1 (ivector1), and the detection result is the target speaker's speech (TSS). The sound features of the target audio frame are input into the personal speech detection model along with the preset speaker sound features of speaker 2 (ivector2), and the detection result is the target speaker's speech (TSS). The sound features of the target audio frame are input into the personal speech detection model along with the preset speaker sound features of speaker 3 (ivector3), and the detection result is the non-target speaker's speech (NTSS). Therefore, the matching speaker of the target audio frame is (speaker 1, speaker 2). Since the target audio frame has two matching speakers, it is an overlapping audio frame.

[0116] For example, there are 5 audio frames, and the first speaker identification results for the 5 audio frames are: Speaker 1; (Speaker 1, Speaker 2); Speaker 1; Speaker 1; Speaker 1. Among them, the second audio frame is an overlapping audio frame.
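The per-frame matching logic in the examples above can be sketched as a loop over the preset speaker features. The real detector is the trained personal speech detection model; `toy_detect` below is only a hypothetical stand-in that thresholds a dot product so the example can run, and the three-dimensional vectors are made up.

```python
def toy_detect(frame_feature, speaker_feature, threshold=0.5):
    """Stand-in for the personal speech detection model (NOT the real model).

    Returns "tss" (target speaker speech) when the dot product reaches the
    threshold, otherwise "ntss" (non-target speaker speech).
    """
    score = sum(x * y for x, y in zip(frame_feature, speaker_feature))
    return "tss" if score >= threshold else "ntss"


def match_speakers(frame_feature, preset_features, detect=toy_detect):
    """Return every preset speaker the detector accepts for this frame."""
    return [speaker for speaker, feature in preset_features.items()
            if detect(frame_feature, feature) == "tss"]


preset_features = {
    "speaker1": [1.0, 0.0, 0.0],   # plays the role of ivector1
    "speaker2": [0.0, 1.0, 0.0],   # plays the role of ivector2
    "speaker3": [0.0, 0.0, 1.0],   # plays the role of ivector3
}
frame_feature = [0.7, 0.7, 0.0]
matched = match_speakers(frame_feature, preset_features)
is_overlapping = len(matched) >= 2   # two or more matches: overlapping frame
```

Here the frame matches speaker 1 and speaker 2 but not speaker 3, so it is flagged as an overlapping audio frame, mirroring the example in the text.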

[0117] S203. Smooth the first speaker recognition results of multiple first audio frames to obtain the second speaker recognition results.

[0118] In one specific embodiment, smoothing the first speaker recognition results of multiple first audio frames to obtain second speaker recognition results may include:

[0119] (1) Using a preset time window and a preset step size, frames are extracted from the multiple first audio frames to obtain multiple first frame sets.

[0120] Each first frame set includes multiple second audio frames, the number of which is less than the number of first audio frames, and the second audio frames in the same first frame set are consecutive in time.

[0121] In one specific embodiment, the length of the preset time window is greater than the preset step size. For example, the preset time window is 10 frames and the preset step size is 5 frames. The preset time window and preset step size can be set according to specific circumstances.
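A minimal sketch of this windowing, using the example values of a 10-frame window and a 5-frame step. The function name `sliding_windows` is hypothetical, and dropping trailing frames that do not fill a whole window is an assumption, since the text does not say how the tail is handled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Sequence[T], window: int = 10, step: int = 5) -> List[List[T]]:
    """Cut a frame sequence into time-consecutive, possibly overlapping sets.

    Assumption: a trailing remainder shorter than `window` is omitted.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(frames) - window
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, step)]


# 25 first audio frames, identified here simply by their index.
first_frame_sets = sliding_windows(list(range(25)), window=10, step=5)
```

Because the window is longer than the step, adjacent first frame sets share frames, so each frame in the interior is examined in more than one window.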

[0122] (2) Smooth the multiple second audio frames in each first frame set to obtain the second speaker recognition result.

[0123] In this embodiment of the application, multiple second audio frames in each first frame set are smoothed to obtain the second speaker recognition result, including:

[0124] (1) Remove the overlapping audio frames in the first frame set to obtain the second frame set, wherein the number of speakers matched in the overlapping audio frames is at least two.

[0125] For example, given 5 audio frames, the first speaker identification results for each frame are: Speaker 1; (Speaker 1, Speaker 2); Speaker 1; Speaker 1; Speaker 1. The second audio frame is an overlapping frame. Removing the overlapping frames and retaining only the audio of a single speaker improves the accuracy of subsequent smoothing processes, thereby enhancing speaker identification accuracy.

[0126] (2) Determine whether there is a target speaker in the second frame set, wherein the number of second audio frames in the first frame set that match the target speaker is greater than a preset value.

[0127] Specifically, the matching speaker of each second audio frame in the first frame set is obtained, the number of second audio frames for each matching speaker is counted, and each count is compared with a preset value. If the count for every matching speaker is less than the preset value, it is determined that there is no target speaker in the second frame set, and this second frame set does not require smoothing. If there is a matching speaker whose count is greater than the preset value, that matching speaker is identified as the target speaker. For example, the matching speakers of the second audio frames in the first frame set are: Speaker 1; (Speaker 1, Speaker 2); Speaker 1; Speaker 1; Speaker 1. Not counting the overlapping frame, Speaker 1's second audio frames account for 80% (4 of 5) of the total number of second audio frames in the first frame set. The preset value is a percentage of the total number of second audio frames in the first frame set. For example, if the preset value is 95% of the total number of second audio frames in the first frame set, then there is no target speaker in this first frame set. If the preset value is 70% of the total number of second audio frames in the first frame set, then the target speaker in this first frame set is Speaker 1.
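The target-speaker check can be sketched as below, reproducing the 95% and 70% cases from the example. The function name `find_target_speaker` and the tuple representation of frame labels are hypothetical; the ratio is taken against the size of the first frame set, as in the example.

```python
from collections import Counter
from typing import List, Optional, Tuple

# One label per frame: a tuple of the matched speakers for that frame.
FrameLabel = Tuple[str, ...]


def find_target_speaker(
    first_frame_set: List[FrameLabel], preset_ratio: float
) -> Optional[str]:
    """Return the target speaker of a window, or None if there is none."""
    # Drop overlapping frames (two or more matched speakers); frames with no
    # matched speaker are also ignored here, which is an assumption.
    second_frame_set = [label for label in first_frame_set if len(label) == 1]
    counts = Counter(label[0] for label in second_frame_set)
    if not counts:
        return None
    speaker, count = counts.most_common(1)[0]
    if count / len(first_frame_set) > preset_ratio:
        return speaker
    return None


window = [
    ("speaker1",),
    ("speaker1", "speaker2"),  # overlapping frame
    ("speaker1",),
    ("speaker1",),
    ("speaker1",),
]
no_target = find_target_speaker(window, preset_ratio=0.95)
target = find_target_speaker(window, preset_ratio=0.70)
```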

[0128] (3) If the target speaker exists in the second frame set, then the speaker of each second audio frame in the second frame set is determined as the target speaker.

[0129] In this embodiment of the application, if the target speaker exists in the second frame set, it indicates that the target speaker has been speaking throughout that period of time, so frames attributed to other speakers are likely misjudged and need to be changed to the target speaker. The speaker of every second audio frame in the second frame set is therefore set to the target speaker.

[0130] Furthermore, if the target speaker exists in the second frame set, multiple second audio frames that do not belong to the target speaker are defined as the third frame set. It is then determined whether the third frame set contains multiple consecutive audio frames of a preset length. If not, the speaker of the second audio frame in the second frame set is identified as the target speaker. The preset length can be 10 frames, 9 frames, etc., depending on the specific situation. When the target speaker exists in the second frame set, it is further determined whether other audio frames are consecutive. If other audio frames are not consecutive, the probability of other sounds being misidentified is higher, and the recognition result of other sounds is changed to the target speaker, improving the accuracy of speaker recognition.
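A sketch of this consecutive-run check, under stated assumptions: a run of non-target frames counts as "of a preset length" when it is at least that long, consecutiveness is judged within the second frame set (after overlapping frames have been removed), and when such a run exists the labels are left unchanged, since the text does not specify that case. The function names are hypothetical.

```python
from typing import List, Tuple

FrameLabel = Tuple[str, ...]


def longest_non_target_run(labels: List[FrameLabel], target: str) -> int:
    """Length of the longest run of consecutive frames not labeled as target."""
    longest = current = 0
    for label in labels:
        if label != (target,):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def smooth_second_frame_set(
    second_frame_set: List[FrameLabel], target: str, preset_length: int = 10
) -> List[FrameLabel]:
    """Relabel all frames as the target unless another speaker has a long run."""
    if longest_non_target_run(second_frame_set, target) >= preset_length:
        # Assumption: a sufficiently long run is treated as genuine speech.
        return list(second_frame_set)
    return [(target,)] * len(second_frame_set)


isolated = [("speaker1",)] * 4 + [("speaker2",)] + [("speaker1",)] * 5
sustained = [("speaker1",)] * 7 + [("speaker2",)] * 3
smoothed = smooth_second_frame_set(isolated, "speaker1", preset_length=3)
kept = smooth_second_frame_set(sustained, "speaker1", preset_length=3)
```

In the first window the single speaker 2 frame is isolated and is relabeled; in the second window speaker 2 holds three consecutive frames, which reaches the illustrative preset length, so the labels are kept.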

[0131] After processing multiple first audio frames as described above, the second speaker recognition results for multiple first audio frames can be obtained. After removing the misidentified audio frames from the first speaker recognition results, the second speaker recognition results become more accurate.

[0132] S204. Update the preset speaker voice features based on the second speaker recognition result and perform speaker recognition again to obtain the target speaker recognition result.

[0133] In this embodiment of the application, updating the preset speaker voice features based on the second speaker recognition result and performing speaker recognition again to obtain the target speaker recognition result may include:

[0134] (1) Obtain audio frames belonging to the same speaker from the second speaker identification results.

[0135] In one specific embodiment, overlapping audio frames in the second speaker identification result are removed, and audio frames belonging to the same speaker are obtained from the second speaker identification result after removing overlapping audio frames. For example, the audio frames belonging to speaker 1 are audio frames 1 to 15; the audio frames belonging to speaker 2 are audio frames 16 to 30.

[0136] (2) The sound features of audio frames belonging to the same speaker are fused to obtain the speaker sound features of each speaker.

[0137] For example, the audio frames belonging to speaker 1 are audio frames 1 to 15; the audio frames belonging to speaker 2 are audio frames 16 to 30. The sound features of audio frames 1 to 15 are fused to obtain the speaker sound features of speaker 1; the sound features of audio frames 16 to 30 are fused to obtain the speaker sound features of speaker 2.
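The fusion step can be sketched as follows. The text does not state how features are fused, so element-wise averaging is used here purely as an assumption; `fuse_features` and the feature values are hypothetical.

```python
from typing import Dict, List, Sequence


def fuse_features(
    frame_features: Sequence[Sequence[float]], frame_speakers: Sequence[str]
) -> Dict[str, List[float]]:
    """Group frame features by speaker and fuse each group into one vector.

    Assumption: "fusion" is an element-wise mean; other schemes are possible.
    """
    grouped: Dict[str, List[Sequence[float]]] = {}
    for feature, speaker in zip(frame_features, frame_speakers):
        grouped.setdefault(speaker, []).append(feature)
    return {
        speaker: [sum(column) / len(features) for column in zip(*features)]
        for speaker, features in grouped.items()
    }


# Frames 1 to 15 belong to speaker 1, frames 16 to 30 to speaker 2.
frame_speakers = ["speaker1"] * 15 + ["speaker2"] * 15
frame_features = [[1.0, 0.0]] * 15 + [[0.0, 2.0]] * 15  # made-up values
updated_presets = fuse_features(frame_features, frame_speakers)
```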

[0138] (3) The voice features of each speaker are redefined as preset speaker voice features and speaker recognition and smoothing are performed to obtain the target speaker recognition result.

[0139] Specifically, the third speaker recognition result of each first audio frame is determined based on the sound features of multiple first audio frames and multiple updated preset speaker sound features; the third speaker recognition results of multiple first audio frames are smoothed to obtain the fourth speaker recognition result.

[0140] Furthermore, it is determined whether the similarity between the fourth speaker identification result and the second speaker identification result is greater than a second preset similarity. If the similarity is greater, the fourth speaker identification result is identified as the target speaker identification result. The second preset similarity can be 90%, 85%, etc., set according to specific circumstances. If the similarity is not greater than the second preset similarity, the preset speaker voice features are updated based on the fourth speaker identification result, and speaker identification and smoothing are performed again to obtain the target speaker identification result.
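The update-and-repeat loop can be sketched as below. Several points are assumptions rather than statements from the application: similarity is measured as the fraction of frames with identical labels, each later round is compared with the round before it, and `max_rounds` is added as a safety limit. The callables `recognize_and_smooth` and `update_presets` are hypothetical placeholders for the steps described earlier.

```python
from typing import Callable, Dict, List

Labels = List[str]
Presets = Dict[str, List[float]]


def agreement(a: Labels, b: Labels) -> float:
    """Fraction of frames whose labels are identical in both results."""
    if not a or len(a) != len(b):
        raise ValueError("results must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def refine(
    second_result: Labels,
    recognize_and_smooth: Callable[[Presets], Labels],
    update_presets: Callable[[Labels], Presets],
    second_preset_similarity: float = 0.9,
    max_rounds: int = 10,  # assumption: the text gives no iteration limit
) -> Labels:
    """Repeat update, recognition and smoothing until the result stabilizes."""
    previous = second_result
    for _ in range(max_rounds):
        presets = update_presets(previous)
        current = recognize_and_smooth(presets)
        if agreement(current, previous) > second_preset_similarity:
            return current
        previous = current
    return previous


# Toy run with scripted recognition results instead of a real model.
scripted = iter([
    ["s1", "s1", "s2", "s2"],  # round 1: 75% agreement with the start
    ["s1", "s1", "s2", "s2"],  # round 2: 100% agreement with round 1
])
result = refine(
    second_result=["s1", "s2", "s2", "s2"],
    recognize_and_smooth=lambda presets: next(scripted),
    update_presets=lambda labels: {},
)
```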

[0141] To better implement the audio speaker recognition method in the embodiments of this application, the embodiments of this application also provide, on the basis of that method, an audio speaker recognition device. As shown in Figure 3, the speaker recognition device 300 for audio includes:

[0142] The acquisition unit 301 is used to acquire the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified;

[0143] The determining unit 302 is used to determine the first speaker recognition result of each first audio frame based on the sound features of multiple first audio frames and multiple preset speaker sound features;

[0144] The smoothing processing unit 303 is used to smooth the first speaker recognition results of multiple first audio frames to obtain the second speaker recognition results.

[0145] The update iteration unit 304 is used to update the preset speaker voice features based on the second speaker recognition result and perform speaker recognition again to obtain the target speaker recognition result.

[0146] Optionally, the acquisition unit 301 is used for:

[0147] The audio to be identified is divided into multiple audio segments, wherein each audio segment contains at least two first audio frames;

[0148] Clustering multiple audio segments yields multiple sets of audio segments;

[0149] Based on each set of audio segments, the speaker's voice features corresponding to each set of audio segments are determined, resulting in multiple preset speaker voice features.

[0150] Optionally, the acquisition unit 301 is used for:

[0151] Obtain the centroid vector of the audio segment set;

[0152] Calculate the similarity between each audio segment in the audio segment set and its centroid vector;

[0153] Obtain multiple target audio segments whose similarity to the centroid vector is higher than the first preset similarity;

[0154] Based on the sound features of multiple target audio segments, the speaker's voice features corresponding to the audio segment set are determined, resulting in multiple preset speaker voice features.
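The centroid-based selection can be sketched as follows. Cosine similarity, the 0.8 threshold, averaging the selected segments, and the fallback when no segment passes the threshold are all assumptions for illustration; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def speaker_feature_from_cluster(
    segment_features: Sequence[Vector], first_preset_similarity: float = 0.8
) -> List[float]:
    """Derive one speaker feature from the segments closest to the centroid."""
    center = centroid(segment_features)
    targets = [
        v for v in segment_features if cosine(v, center) > first_preset_similarity
    ]
    if not targets:
        # Assumption: fall back to all segments if none passes the threshold.
        targets = list(segment_features)
    return centroid(targets)


# Made-up segment features: three similar segments and one outlier.
cluster = [[1.0, 0.0], [1.0, 0.2], [0.9, 0.1], [0.0, 1.0]]
speaker_feature = speaker_feature_from_cluster(cluster, first_preset_similarity=0.8)
```

Here the outlier segment falls below the threshold and is excluded, so the resulting feature is computed from the three segments nearest the centroid.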

[0155] Optionally, update iteration unit 304 is used for:

[0156] Obtain audio frames belonging to the same speaker from the second speaker identification results;

[0157] The sound features of audio frames belonging to the same speaker are fused to obtain the speaker's sound features for each speaker.

[0158] The speaker voice features of each speaker are redefined as preset speaker voice features, and speaker recognition and smoothing are performed to obtain the target speaker recognition result.

[0159] Optionally, determining unit 302 is used for:

[0160] Multiple first audio frames are respectively identified as target audio frames;

[0161] The sound features of the target audio frame are matched with the sound features of multiple preset speakers to obtain the matching speaker that matches the target audio frame;

[0162] The speaker matched in each first audio frame is determined as the first speaker recognition result for each first audio frame.

[0163] Optionally, the smoothing unit 303 is used for:

[0164] Audio is extracted from multiple first audio frames based on a preset time window and a preset step size to obtain multiple sets of first frames. Each set of first frames includes multiple second audio frames, the number of second audio frames is less than the number of first audio frames, and the second audio frames in the same set of first frames are continuous in time.

[0165] The second speaker recognition results are obtained by smoothing multiple second audio frames in each first frame set.

[0166] Optionally, the smoothing unit 303 is used for:

[0167] Remove the overlapping audio frames from the first set of frames to obtain the second set of frames, wherein the number of speakers matched in the overlapping audio frames is at least two.

[0168] Determine whether the target speaker exists in the second frame set, wherein the number of second audio frames matching the target speaker in the first frame set is greater than a preset value;

[0169] If the target speaker exists in the second set of frames, then the speaker of the second audio frame in the second set of frames is identified as the target speaker.

[0170] This application also provides a computer device that integrates any of the audio speaker recognition devices provided in this application. The computer device includes:

[0171] One or more processors;

[0172] Memory; and

[0173] One or more applications, wherein the applications are stored in memory and configured to be executed by a processor, wherein the steps of the audio speaker recognition method in any of the embodiments described above are performed.

[0174] Figure 4 shows a structural schematic diagram of the computer device involved in the embodiments of this application. Specifically:

[0175] The computer device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will understand that the computer device structure shown in the figures does not constitute a limitation on the computer device, and may include more or fewer components than shown, or combine certain components, or have different component arrangements. Wherein:

[0176] Processor 401 is the control center of the computer device. It connects various parts of the computer device via various interfaces and lines, and performs various functions and processes data by running or executing software programs and / or modules stored in memory 402, and by calling data stored in memory 402, thereby providing overall monitoring of the computer device. Optionally, processor 401 may include one or more processing cores; processor 401 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. Preferably, processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the aforementioned modem processor may not be integrated into processor 401.

[0177] The memory 402 can be used to store software programs and modules. The processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, application programs required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

[0178] The computer device also includes a power supply 403 that supplies power to the various components. Preferably, the power supply 403 can be logically connected to the processor 401 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 403 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components.

[0179] The computer device may also include an input unit 404, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

[0180] Although not shown, the computer device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 401 in the computer device loads the executable files corresponding to the processes of one or more applications into the memory 402 according to the following instructions, and the processor 401 runs the applications stored in the memory 402 to realize various functions, as follows:

[0181] Acquire the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified; determine the first speaker recognition result of each first audio frame based on the sound features of multiple first audio frames and multiple preset speaker sound features; smooth the first speaker recognition results of multiple first audio frames to obtain the second speaker recognition result; update the preset speaker sound features based on the second speaker recognition result and perform speaker recognition again to obtain the target speaker recognition result.

[0182] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by instructions, or by instructions controlling related hardware. These instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.

[0183] Therefore, embodiments of this application provide a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), a disk, or an optical disk, etc. A computer program is stored thereon, and the computer program is loaded by a processor to execute the steps in any of the audio speaker recognition methods provided in embodiments of this application. For example, the computer program loaded by the processor can execute the following steps:

[0184] Acquire the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified; determine the first speaker recognition result of each first audio frame based on the sound features of multiple first audio frames and multiple preset speaker sound features; smooth the first speaker recognition results of multiple first audio frames to obtain the second speaker recognition result; update the preset speaker sound features based on the second speaker recognition result and perform speaker recognition again to obtain the target speaker recognition result.

[0185] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the detailed descriptions of other embodiments above, which will not be repeated here.

[0186] In practice, each of the above units or structures can be implemented as an independent entity or can be arbitrarily combined to be implemented as the same or several entities. For the specific implementation of each of the above units or structures, please refer to the previous method embodiments, which will not be repeated here.

[0187] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0188] The present application provides a detailed description of an audio speaker recognition method and apparatus. Specific examples have been used to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present application. At the same time, those skilled in the art will recognize that there will be changes in the specific implementation methods and application scope based on the ideas of the present application. Therefore, the content of this specification should not be construed as a limitation of the present application.

Claims

1. A speaker recognition method for audio, characterized in that the speaker recognition method for audio includes:
acquiring the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified;
determining the first speaker recognition result for each first audio frame based on the sound features of the plurality of first audio frames and the plurality of preset speaker sound features;
smoothing the first speaker recognition results of the multiple first audio frames to obtain second speaker recognition results, wherein:
based on a preset time window and a preset step size, audio is extracted from the multiple first audio frames to obtain multiple first frame sets; the first frame sets include multiple second audio frames, the number of second audio frames is less than the number of first audio frames, and the second audio frames in the same first frame set are temporally continuous;
overlapping audio frames in the first frame sets are removed to obtain a second frame set, wherein the number of matched speakers in the overlapping audio frames is at least two;
it is determined whether there is a target speaker in the second frame set, wherein the existence of a target speaker is determined when the number of second audio frames in the first frame set that match the target speaker is greater than a preset value;
if the target speaker exists in the second frame set, the speaker of the second audio frame in the second frame set is determined as the target speaker; and
updating the preset speaker voice features based on the second speaker recognition result, and performing speaker recognition and smoothing again to obtain the target speaker recognition result.

2. The speaker recognition method for audio according to claim 1, characterized in that the acquisition of sound features from multiple first audio frames and multiple preset speaker sound features in the audio to be identified includes:
dividing the audio to be identified into multiple audio segments, wherein each audio segment contains at least two first audio frames;
clustering the multiple audio segments to obtain multiple audio segment sets; and
determining the speaker voice features corresponding to each audio segment set based on each audio segment set, thereby obtaining the multiple preset speaker voice features.

3. The speaker recognition method for audio according to claim 2, characterized in that the step of determining the speaker voice features corresponding to each audio segment set based on each audio segment set to obtain the plurality of preset speaker voice features includes:
obtaining the centroid vector of the audio segment set;
calculating the similarity between each audio segment in the audio segment set and the centroid vector;
obtaining multiple target audio segments whose similarity to the centroid vector is higher than a first preset similarity; and
determining, based on the sound features of the multiple target audio segments, the speaker voice features corresponding to the audio segment set, thereby obtaining the multiple preset speaker voice features.

4. The speaker recognition method for audio according to claim 1, characterized in that the step of updating the preset speaker voice features based on the second speaker recognition result and performing speaker recognition and smoothing processing again to obtain the target speaker recognition result includes:
obtaining audio frames belonging to the same speaker from the second speaker recognition result;
fusing the sound features of the audio frames belonging to the same speaker to obtain the speaker voice features of each speaker; and
redefining the speaker voice features of each speaker as the preset speaker voice features, and performing speaker recognition and smoothing to obtain the target speaker recognition result.

5. The speaker recognition method for audio according to claim 1, characterized in that the step of determining the first speaker recognition result for each first audio frame based on the sound features of the plurality of first audio frames and the plurality of preset speaker sound features includes:
identifying the plurality of first audio frames respectively as target audio frames;
matching the sound features of the target audio frame with the plurality of preset speaker sound features to obtain the matching speaker that matches the target audio frame; and
determining the matching speaker of each of the first audio frames as the first speaker recognition result of each of the first audio frames.

6. An audio speaker recognition device, characterized in that the speaker recognition device for audio includes:
an acquisition unit, used to acquire the sound features of multiple first audio frames and multiple preset speaker sound features in the audio to be identified;
a determining unit, configured to determine the first speaker recognition result of each first audio frame based on the sound features of the plurality of first audio frames and the plurality of preset speaker sound features;
a smoothing processing unit, used to smooth the first speaker recognition results of the multiple first audio frames to obtain second speaker recognition results, wherein:
audio is extracted from the multiple first audio frames based on a preset time window and a preset step size to obtain multiple first frame sets; each first frame set includes multiple second audio frames, the number of second audio frames is less than the number of first audio frames, and the second audio frames in the same first frame set are temporally continuous;
overlapping audio frames in the first frame sets are removed to obtain a second frame set, wherein the number of matched speakers in the overlapping audio frames is at least two;
it is determined whether a target speaker exists in the second frame set, wherein the existence of a target speaker is determined when the number of second audio frames matching the target speaker in the first frame set is greater than a preset value;
if the target speaker exists in the second frame set, the speaker of the second audio frame in the second frame set is identified as the target speaker; and
an update iteration unit, used to update the preset speaker voice features according to the second speaker recognition result and perform speaker recognition and smoothing processing again to obtain the target speaker recognition result.

7. A computer device, characterized in that the computer device includes:
one or more processors;
a memory; and
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the speaker recognition method for audio as described in any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that it stores a computer program, which is loaded by a processor to perform the steps of the speaker recognition method for audio according to any one of claims 1 to 5.