A voiceprint recognition method and device

By dividing the spectrogram into different frequency bands and utilizing feature extraction networks with different time resolutions, the problems of redundant parameters and high computational cost in neural network voiceprint extraction methods are solved, achieving a voiceprint recognition effect with high recognition rate and low computational cost.

CN114974256B · Active · Publication Date: 2026-06-26 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2021-02-24
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

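Existing neural network voiceprint extraction methods extract features indiscriminately across all frequency bands, so reaching a high recognition rate requires a large number of channels, which brings numerous redundant parameters and a high computational cost.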

Method used

The spectrogram is divided into sub-spectrograms of different frequency bands, and feature extraction networks with different temporal resolutions extract feature information from the high-frequency and low-frequency bands respectively. The high-band and low-band feature extraction networks are differentiated in temporal resolution by setting parameters such as convolution kernel size, number of channels, and stride. Combined with dilated convolution modules and information synchronization between the networks, this reduces the computational load and improves the recognition rate.

Benefits of technology

By employing differentiated feature extraction and information fusion, high-recognition-rate voiceprint recognition was achieved, while simultaneously reducing computational load and improving the efficiency and accuracy of voiceprint recognition.

✦ Generated by Eureka AI based on patent content.

Figure CN114974256B_ABST

Abstract

The application relates to a voiceprint recognition method and device. The method comprises the following steps: obtaining a spectrogram of a speech signal, and dividing the spectrogram into a plurality of sub-spectrograms of different frequency bands; using feature extraction networks with different time resolutions to extract feature information of the plurality of sub-spectrograms, wherein the time resolution of a first feature extraction network used for extracting feature information of a high-frequency sub-spectrogram is greater than the time resolution of a second feature extraction network used for extracting feature information of a low-frequency sub-spectrogram; and fusing the feature information extracted by the feature extraction networks with different time resolutions into a voiceprint of the speech signal.

Description

Technical Field

[0001] This application relates to the field of intelligent voice processing technology, and in particular to a voiceprint recognition method and apparatus.

Background Technology

[0002] As a biometric feature, voiceprints are widely used in various speech processing tasks, such as speaker identification. The process of voiceprint extraction typically involves converting a speech signal of variable or fixed length into a vector of fixed length. This vector is characterized by strong separability and high stability, and can uniquely identify the speaker.

[0003] With the widespread application of deep learning technology, neural network-based voiceprint extraction methods have become the mainstream approach. Common neural network voiceprint extraction methods first calculate the power spectrum or Mel-spectrum of the speaker's audio, then input this spectrum into a neural network model, which outputs the voiceprint. These methods perform indiscriminate feature extraction across all frequency bands. To achieve a high recognition rate, a large number of channels are often required to capture effective features from the power spectrum or Mel-spectrum, resulting in numerous redundant parameters in the neural network model and a high computational load.

[0004] Therefore, there is an urgent need in related technologies for a voiceprint recognition method with high recognition rate and low computational cost.

Summary of the Invention

[0005] In view of this, a voiceprint recognition method and device are proposed.

[0006] In a first aspect, embodiments of this application provide a voiceprint recognition method, the method comprising:

[0007] Acquire the spectrogram of the speech signal and divide the spectrogram into multiple sub-spectrograms of different frequency bands;

[0008] Feature information of the multiple sub-spectrograms is extracted using feature extraction networks with different time resolutions. The time resolution of the first feature extraction network used to extract feature information of high-frequency sub-spectrograms is greater than that of the second feature extraction network used to extract feature information of low-frequency sub-spectrograms.

[0009] The feature information extracted by the feature extraction networks with different time resolutions is fused into the voiceprint of the speech signal.

[0010] The aforementioned method considers not only the differences in energy distribution along the frequency dimension of the spectrogram but also the differences in energy distribution along the temporal dimension. In addition to dividing the spectrogram into sub-spectrograms of different frequency bands, it provides feature extraction networks with different temporal resolutions to extract features from the sub-spectrograms of the different frequency bands. Specifically, a feature extraction network with higher temporal resolution can be used to extract the rich, rapidly changing patterns in the higher frequency bands of the spectrogram, while a feature extraction network with lower temporal resolution can be used to extract the simpler, slowly changing patterns in the lower frequency bands.

[0011] According to the first possible implementation of the first aspect, the feature extraction network includes a convolutional neural network. Correspondingly, the time resolution of the first feature extraction network, used to extract feature information of the high-frequency sub-spectrogram, being greater than the time resolution of the second feature extraction network, used to extract feature information of the low-frequency sub-spectrogram, includes:

[0012] The kernel size of the first feature extraction network is smaller than the kernel size of the second feature extraction network.

[0013] In this embodiment, when the feature extraction network includes a convolutional neural network, the temporal resolution of the first feature extraction network and the second feature extraction network is adjusted by setting the kernel size of the convolutional neural network.

[0014] According to the second possible implementation of the first aspect, the temporal resolution of the first feature extraction network being greater than that of the second feature extraction network further includes at least one of the following: the number of channels of the first feature extraction network is greater than the number of channels of the second feature extraction network;

[0015] The stride of the first feature extraction network is smaller than the stride of the second feature extraction network.

[0016] In this embodiment, when the feature extraction network includes a convolutional neural network, the difference in temporal resolution between the first feature extraction network and the second feature extraction network can be achieved by setting parameters such as the number of channels and stride of the convolutional neural network.

[0017] According to the third possible implementation of the first aspect, the feature extraction network includes multiple serially connected sub-networks, wherein each subsequent sub-network is used to extract feature information from the output of the preceding sub-network.

[0018] In this embodiment, multiple sub-networks are used for feature extraction, resulting in richer and more accurate extracted features.

[0019] According to the fourth possible implementation of the first aspect, the extraction of feature information from the multiple sub-spectrograms using feature extraction networks with different time resolutions includes:

[0020] Feature information of the multiple sub-spectrograms is extracted using feature extraction networks with different time resolutions, and the output results of the sub-networks in the first and second feature extraction networks are synchronized to each other's sub-networks.

[0021] In this embodiment of the application, considering the correlation between the high-frequency and low-frequency parts of the spectrogram, information synchronization can be achieved between sub-networks, so that the first feature extraction network and the second feature extraction network can absorb useful information from each other while extracting features.

[0022] According to the fifth possible implementation of the first aspect, the step of synchronizing the output results of the sub-networks in the first feature extraction network and the second feature extraction network to each other's sub-networks includes:

[0023] The outputs of the subnetworks in the first feature extraction network and the second feature extraction network are synchronized to each other's subnetworks according to a dynamically generated ratio, which is determined based on the correlation of the outputs of the subnetworks.

[0024] In this embodiment, a synchronization information ratio is set so that the first feature extraction network and the second feature extraction network can obtain more useful information from each other.

[0025] According to the sixth possible implementation of the first aspect, the sub-network includes a neural network composed of at least one dilated convolution module connected in series, wherein the dilated convolution module includes a neural network module based on dilated convolution.

[0026] In this embodiment of the application, the receptive field of the sub-network can be expanded while maintaining the same amount of computation.

[0027] According to the seventh possible implementation of the first aspect, the step of fusing the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal includes:

[0028] The feature information extracted by the feature extraction networks with different time resolutions is adapted in the time dimension.

[0029] The adapted feature information is spliced together to generate the voiceprint of the speech signal.

[0030] In this embodiment of the application, a method for feature information fusion is provided.
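The two fusion steps above (adaptation in the time dimension, then splicing) can be illustrated with a minimal NumPy sketch. The text here does not state how the adaptation is performed; repeating each low-resolution frame so that both feature maps share the same number of frames is an assumption made only for illustration, and the function name `fuse_features` and the `(channels, frames)` layout are hypothetical.

```python
import numpy as np

def fuse_features(high_feat, low_feat):
    """Fuse two feature maps laid out as (channels, frames).

    Assumption: the high-resolution map has an integer multiple of the
    frames of the low-resolution map, and adaptation is done by repeating
    each low-resolution frame (nearest-neighbour upsampling).
    """
    t_high, t_low = high_feat.shape[1], low_feat.shape[1]
    if t_high % t_low != 0:
        raise ValueError("T_high must be an integer multiple of T_low")
    ratio = t_high // t_low
    # Adapt in the time dimension: (C_low, T_low) -> (C_low, T_high)
    low_adapted = np.repeat(low_feat, ratio, axis=1)
    # Splice along the channel axis: (C_high + C_low, T_high)
    return np.concatenate([high_feat, low_adapted], axis=0)
```

Downsampling the high-resolution features to the low-resolution frame rate would be an equally valid adaptation; the choice affects only which time axis the fused voiceprint is expressed on.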

[0031] According to the eighth possible implementation of the first aspect, after fusing the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal, the method includes:

[0032] Determine the voiceprints corresponding to multiple other speech signals within the same speech period as the speech signal;

[0033] The average value of the voiceprints corresponding to the multiple speech signals within the speech period is determined, and the average value is used as the voiceprint result corresponding to the speech period.

[0034] In this embodiment of the application, the voiceprint results corresponding to each speech frame can be averaged over time to obtain a voiceprint recognition result that does not depend on the length of the speech period.
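As a minimal sketch of the averaging step, assuming each speech signal in the period has already been mapped to a fixed-length voiceprint vector (the function name `average_voiceprint` and the `(N, D)` array layout are illustrative, not from the application):

```python
import numpy as np

def average_voiceprint(voiceprints):
    """Average N voiceprints of dimension D, given as an (N, D) array."""
    prints = np.asarray(voiceprints, dtype=float)
    if prints.ndim != 2 or prints.shape[0] == 0:
        raise ValueError("expected a non-empty (N, D) array")
    # The mean over the first axis has shape (D,) regardless of N,
    # so the result does not depend on how long the speech period is.
    return prints.mean(axis=0)
```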

[0035] In a second aspect, embodiments of this application provide a voiceprint recognition device, comprising:

[0036] The spectrogram segmentation module is used to acquire the spectrogram of the speech signal and divide the spectrogram into multiple sub-spectrograms of different frequency bands;

[0037] The feature extraction module is used to extract feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, wherein the time resolution of the first feature extraction network used to extract feature information of high-frequency sub-spectrograms is greater than the time resolution of the second feature extraction network used to extract feature information of low-frequency sub-spectrograms.

[0038] The feature fusion module is used to fuse the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal.

[0039] According to the first possible implementation of the second aspect, the feature extraction network includes a convolutional neural network. Correspondingly, the time resolution of the first feature extraction network, used to extract feature information of the high-frequency sub-spectrogram, being greater than the time resolution of the second feature extraction network, used to extract feature information of the low-frequency sub-spectrogram, includes:

[0040] The kernel size of the first feature extraction network is smaller than the kernel size of the second feature extraction network.

[0041] According to the second possible implementation of the second aspect, the temporal resolution of the first feature extraction network is greater than the temporal resolution of the second feature extraction network, and further includes at least one of the following:

[0042] The number of channels in the first feature extraction network is greater than the number of channels in the second feature extraction network;

[0043] The stride of the first feature extraction network is smaller than the stride of the second feature extraction network.

[0044] According to the third possible implementation of the second aspect, the feature extraction network includes multiple serially connected sub-networks, wherein each subsequent sub-network is used to extract feature information from the output of the preceding sub-network.

[0045] According to the fourth possible implementation of the second aspect, the feature extraction module is specifically used for:

[0046] Feature information of the multiple sub-spectrograms is extracted using feature extraction networks with different time resolutions, and the output results of the sub-networks in the first and second feature extraction networks are synchronized to each other's sub-networks.

[0047] According to the fifth possible implementation of the second aspect, the feature extraction module is further configured to:

[0048] The outputs of the subnetworks in the first feature extraction network and the second feature extraction network are synchronized to each other's subnetworks according to a dynamically generated ratio, which is determined based on the correlation of the outputs of the subnetworks.

[0049] According to the sixth possible implementation of the second aspect, the sub-network includes a neural network consisting of at least one dilated convolution module connected in series, wherein the dilated convolution module includes a neural network module based on dilated convolution.

[0050] According to the seventh possible implementation of the second aspect, the feature fusion module is specifically used for:

[0051] The feature information extracted by the feature extraction networks with different time resolutions is adapted in the time dimension.

[0052] The adapted feature information is spliced together to generate the voiceprint of the speech signal.

[0053] According to the eighth possible implementation of the second aspect, the feature fusion module is further configured to:

[0054] Determine the voiceprints corresponding to multiple other speech signals within the same speech period as the speech signal;

[0055] The average value of the voiceprints corresponding to multiple voice signals within the speech period is determined, and the average value is used as the voiceprint result corresponding to the speech period.

[0056] In a third aspect, embodiments of this application provide a terminal device, including:

[0057] A processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement, when executing the instructions, one or more of the voiceprint recognition methods described in the first aspect or various possible implementations of the first aspect.

[0058] In a fourth aspect, embodiments of this application provide a non-volatile computer-readable storage medium storing computer program instructions thereon, characterized in that the computer program instructions, when executed by a processor, implement one or more of the voiceprint recognition methods described in the first aspect or various possible implementations of the first aspect.

[0059] In a fifth aspect, embodiments of this application provide a computer program product including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code is run in an electronic device, the processor in the electronic device executes one or more of the voiceprint recognition methods described in the first aspect or various possible implementations of the first aspect.

[0060] These and other aspects of this application will become more apparent in the description of the following embodiment(s).

Attached Figure Description

[0061] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this application together with the specification and serve to explain the principles of this application.

[0062] Figure 1 shows a schematic diagram of the structure of a voiceprint recognition device 100 according to an embodiment of this application.

[0063] Figure 2 shows a schematic flowchart of a voiceprint recognition method according to an embodiment of this application.

[0064] Figure 3 shows a spectrogram according to an embodiment of this application.

[0065] Figure 4 shows a schematic diagram of the structure of a voiceprint recognition device 100 according to an embodiment of this application.

[0066] Figure 5 shows a schematic diagram of a sub-network structure according to an embodiment of this application.

[0067] Figure 6 shows a schematic diagram of a synchronization sub-network structure according to an embodiment of this application.

[0068] Figure 7 shows a schematic diagram of the feature fusion module according to an embodiment of this application.

[0069] Figure 8 shows the module structure of a terminal device 1200 according to an embodiment of this application.

Detailed Implementation

[0070] Various exemplary embodiments, features, and aspects of this application will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0071] The term "exemplary" as used herein means "serving as an example, embodiment, or illustration." Any embodiment illustrated herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments. The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate, and this is merely a way of distinguishing objects with the same properties in the embodiments of this application.

[0072] Furthermore, to better illustrate this application, numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that this application can be implemented even without certain specific details. In some instances, methods and means well-known to those skilled in the art have not been described in detail, in order to highlight the main points of this application.

[0073] As shown in Figure 1, this application provides a voiceprint recognition device 100. The voiceprint recognition device 100 can be applied to any terminal device requiring voiceprint recognition functionality, such as mobile smartphones, computers (including laptops and desktop computers), tablet computers, personal digital assistants (PDAs), smart facilities (including smart ticket machines, smart turnstiles, smart service robots, etc.), smart home appliances (including smart speakers, smart TVs, smart robot vacuums, etc.), and smart wearable devices (including smartwatches, smart glasses, smart bracelets, smart necklaces, etc.). As shown in Figure 1, the voiceprint recognition device 100 may include a spectrogram segmentation module 101, a feature extraction module 103, and a feature fusion module 105.

[0074] The spectrogram segmentation module 101 can divide the spectrogram corresponding to the speech signal into multiple sub-spectrograms of different frequency bands. Figure 1 illustrates a division into sub-spectrograms of two frequency bands: a high-frequency band and a low-frequency band. Of course, the embodiments of this application do not limit the number of frequency bands into which the spectrogram is divided.

[0075] The feature extraction module 103 can use feature extraction networks with different time resolutions to extract feature information of the multiple sub-spectrograms respectively. The time resolution of the first feature extraction network 1031 used to extract feature information of high-frequency sub-spectrograms is greater than that of the second feature extraction network 1035 used to extract feature information of low-frequency sub-spectrograms.

[0076] The feature fusion module 105 can be used to fuse the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal.

[0077] The voiceprint recognition method described in this application will be explained in detail below with reference to the accompanying drawings. Figure 2 This is a schematic flowchart of an embodiment of the voiceprint recognition method provided in this application. Although this application provides method operation steps as shown in the following embodiments or figures, the method may include more or fewer operation steps based on conventional or non-inventive effort. For steps that do not have a logically necessary causal relationship, the execution order of these steps is not limited to the execution order provided in the embodiments of this application. In actual voiceprint recognition processes or when the device executes the method, it can be executed in the order shown in the embodiments or figures, or in parallel (e.g., in a parallel processor or multi-threaded processing environment).

[0078] Specifically, one embodiment of the voiceprint recognition method provided in this application is shown in Figure 2. The method may include:

[0079] S201: Obtain the spectrogram of the speech signal and divide the spectrogram into multiple sub-spectrograms of different frequency bands.

[0080] In this embodiment, the speech signal is the speech signal of the voiceprint to be identified. In practical applications, the original audio obtained may include not only speech signals but also other sounds, such as music and noise. Therefore, before acquiring the speech signal, Voice Activity Detection (VAD) can be performed on the acquired original audio to identify the speech signal. Of course, preprocessing such as speech denoising and speech enhancement can also be included, which is not limited here.
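The application does not specify a particular VAD algorithm. Purely as an illustration of the preprocessing step, the following sketch uses a simple frame-energy threshold, which is one common baseline; the function name `energy_vad`, the frame and hop lengths (400 and 160 samples, i.e. 25 ms and 10 ms at an assumed 16 kHz sampling rate), and the -40 dB threshold are all assumptions.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop_len=160, threshold_db=-40.0):
    """Return one boolean per frame: True where the frame is likely speech.

    A frame is kept when its energy is within `threshold_db` decibels
    of the loudest frame in the recording (illustrative rule only).
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    energy = np.array([
        np.sum(signal[i * hop_len: i * hop_len + frame_len] ** 2)
        for i in range(num_frames)
    ])
    # Small constant avoids log10(0) on completely silent frames.
    energy_db = 10.0 * np.log10(energy + 1e-12)
    return energy_db > energy_db.max() + threshold_db
```

An energy threshold cannot distinguish speech from loud music or noise, so a practical system would likely use a more robust detector.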

[0081] Time-domain analysis and frequency-domain analysis of speech signals are two important methods in speech analysis. However, both methods have their limitations. Time-domain analysis does not provide a direct visual representation of the frequency characteristics of the speech signal, while frequency-domain analysis lacks information on how the speech signal changes over time. Therefore, related technologies have developed the spectrogram, which displays both time-domain and frequency-domain information of a speech signal by using a two-dimensional plane to represent three-dimensional information. The horizontal axis of the spectrogram represents time, the vertical axis represents frequency, and the value at each coordinate point represents the energy of the speech signal. Figure 3 shows the spectrogram of the speech signal “Is the UV radiation strong these five days?”, with a frequency range of 0-8000 Hz. The smaller the gray level, the higher the energy, and the larger the gray level, the lower the energy.
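A power spectrogram of this kind can be sketched with a short-time Fourier transform in NumPy. The parameters below are assumptions for illustration: a 16 kHz sampling rate (which gives the 0-8000 Hz range mentioned for Figure 3), 25 ms Hann-windowed frames, and a 10 ms hop; the function name `power_spectrogram` is hypothetical.

```python
import numpy as np

def power_spectrogram(signal, sample_rate=16000, frame_ms=25.0, hop_ms=10.0):
    """Return (spec, freqs); spec has shape (num_bins, num_frames)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    # Slice the signal into overlapping windowed frames: (num_frames, frame_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)
    # Rows are frequency bins (vertical axis), columns are time frames.
    spec = (np.abs(spectrum) ** 2).T
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return spec, freqs
```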

[0082] It should be noted that the spectrogram described in the embodiments of this application includes not only the power spectrum, but also other forms after the power spectrum has been transformed. For example, the power spectrum can be filtered by a triangular filter to obtain the Mel spectrum, and this application does not limit this.
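As an illustration of the triangular-filter step, the sketch below builds a Mel filter bank and applies it to a power spectrogram. The Hz-to-Mel formula used (2595 log10(1 + f/700)) is one widely used convention, not something specified by the application; the number of filters (40) and all function names are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(freqs, num_filters=40):
    """Triangular filters of shape (num_filters, num_bins)."""
    freqs = np.asarray(freqs, dtype=float)
    # Filter edges are equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(freqs[0]), hz_to_mel(freqs[-1]),
                             num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((num_filters, len(freqs)))
    for m in range(num_filters):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        # Triangle: rises from left to center, falls to right, zero elsewhere.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(power_spec, freqs, num_filters=40):
    """Map a (num_bins, num_frames) power spectrogram to (num_filters, num_frames)."""
    return mel_filterbank(freqs, num_filters) @ power_spec
```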

[0083] As shown in Figure 3, the energy of human voice is mainly concentrated in the low-frequency band, with less energy at higher frequencies. Furthermore, the energy patterns in the low-frequency band are simple, mostly horizontal lines with long durations and slow changes, while the energy patterns in the high-frequency band are rich, including many diagonal lines and curves at different angles, with shorter durations and rapid changes. It is therefore evident that the energy characteristics of the low-frequency and high-frequency bands in the spectrogram differ significantly. Based on this, in this embodiment, the spectrogram of the speech signal can be divided into multiple sub-spectrograms of different frequency bands, and a different feature extraction network can be used to process each of these sub-spectrograms. For example, as shown in Figure 3, the spectrogram can be divided into two sub-spectrograms: a high-frequency band and a low-frequency band. The high-frequency sub-spectrogram is input into the first feature extraction network 1031, and the low-frequency sub-spectrogram is input into the second feature extraction network 1035.

[0084] In one example, the spectrogram shown in Figure 3 can be divided into two sub-spectrograms covering [0, 4 kHz] and (4 kHz, 8 kHz], using 4 kHz as the frequency division point. In another example, the spectrogram can also be divided into three sub-spectrograms covering [0, 2 kHz], (2 kHz, 4 kHz], and (4 kHz, 8 kHz], using 2 kHz and 4 kHz as the frequency division points. This application does not impose any restrictions on the selection of the frequency division points or the number of frequency bands.
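The band division in the two examples above can be sketched as a simple row selection on the spectrogram array. The function name `split_bands` and the `(num_bins, num_frames)` layout are assumptions; the interval convention (closed on the upper edge) follows the examples.

```python
import numpy as np

def split_bands(spec, freqs, split_points_hz):
    """Split a (num_bins, num_frames) spectrogram into sub-spectrograms.

    Bands are [0, p1], (p1, p2], ..., (pk, max], lowest band first.
    """
    freqs = np.asarray(freqs, dtype=float)
    edges = [-np.inf] + sorted(split_points_hz) + [np.inf]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Select the frequency bins that fall into the half-open band (lo, hi].
        mask = (freqs > lo) & (freqs <= hi)
        bands.append(spec[mask, :])
    return bands
```

For instance, `split_bands(spec, freqs, [4000])` yields the two-band division, and `split_bands(spec, freqs, [2000, 4000])` yields the three-band division.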

[0085] S203: Feature extraction networks with different time resolutions are used to extract feature information of the multiple sub-spectrograms respectively, wherein the time resolution of the first feature extraction network 1031 used to extract feature information of high-frequency sub-spectrograms is greater than the time resolution of the second feature extraction network 1035 used to extract feature information of low-frequency sub-spectrograms.

[0086] As mentioned above, in the time dimension, the energy in the low-frequency band changes slowly and has simple patterns, while the energy in the high-frequency band changes rapidly and has rich patterns. Based on this, in this embodiment, a first feature extraction network 1031 with a higher time resolution can be used to extract feature information of the sub-spectrogram in the higher frequency band, while a second feature extraction network 1035 with a lower time resolution can be used to extract feature information of the sub-spectrogram in the lower frequency band. The time resolution can include the temporal granularity at which a sub-spectrogram is processed: the smaller the temporal granularity, the higher the time resolution and the finer the processing, but the larger the computational load; conversely, the larger the temporal granularity, the lower the time resolution and the coarser the processing, but the smaller the computational load. Specifically, the feature extraction network can include a convolutional neural network, whose parameters can include kernel size, number of channels, and stride. In one embodiment of this application, correspondingly, the time resolution of the first feature extraction network 1031, used to extract feature information of the high-frequency sub-spectrogram, being greater than the time resolution of the second feature extraction network 1035, used to extract feature information of the low-frequency sub-spectrogram, includes:

[0087] The kernel size of the first feature extraction network 1031 is smaller than the kernel size of the second feature extraction network 1035.

[0088] Regarding the kernel size, the energy of the high-frequency sub-spectrogram changes more rapidly and requires finer-grained differentiation in the time dimension, which calls for a smaller kernel size. Conversely, the energy of the low-frequency sub-spectrogram changes relatively slowly and does not require fine-grained differentiation in the time dimension, which allows a larger kernel size. Therefore, the kernel size of the first feature extraction network 1031 can be set smaller than the kernel size of the second feature extraction network 1035.

[0089] Furthermore, the temporal resolution of the first feature extraction network being greater than that of the second feature extraction network also includes at least one of the following:

[0090] The number of channels in the first feature extraction network 1031 is greater than the number of channels in the second feature extraction network 1035;

[0091] The stride of the first feature extraction network 1031 is smaller than the stride of the second feature extraction network 1035.

[0092] Regarding the number of channels, the energy distribution patterns of the high-frequency sub-spectrogram are relatively rich, so more channels are needed to capture them. In contrast, the energy distribution patterns of the low-frequency sub-spectrogram are relatively simple, so fewer channels suffice. Therefore, the number of channels in the first feature extraction network 1031 can be set to be greater than the number of channels in the second feature extraction network 1035.

[0093] Regarding the stride, a stride greater than 1 is equivalent to downsampling, which reduces the computational load of the feature extraction network. The high-frequency sub-spectrogram exhibits more rapid energy changes and richer information; therefore, a smaller stride is more suitable for the first feature extraction network 1031. The low-frequency sub-spectrogram exhibits relatively slow energy changes and less information; therefore, a larger stride is more suitable for the second feature extraction network 1035, reducing its computational load. Furthermore, information synchronization can be performed between the first feature extraction network 1031 and the second feature extraction network 1035. To achieve time alignment, the stride of the second feature extraction network 1035 can be set to an integer multiple of the stride of the first feature extraction network 1031.
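To make the effect of kernel size and stride on the time axis concrete, the sketch below applies the standard output-length formula for a one-dimensional convolution and checks the integer-multiple stride relationship used for time alignment. The configuration values (kernel sizes 3 and 5, 64 and 32 channels, strides 1 and 3) and the names `BranchConfig`, `conv1d_output_length`, and `frames_per_low_frame` are illustrative assumptions, not parameters from the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BranchConfig:
    kernel_size: int
    channels: int
    stride: int

def conv1d_output_length(num_frames, kernel_size, stride, dilation=1, padding=0):
    """Standard output length of a 1-D convolution along the time axis."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (num_frames + 2 * padding - effective_kernel) // stride + 1

# Illustrative values only: the high band gets a smaller kernel, more
# channels, and a smaller stride than the low band.
high_band = BranchConfig(kernel_size=3, channels=64, stride=1)
low_band = BranchConfig(kernel_size=5, channels=32, stride=3)

def frames_per_low_frame(high, low):
    """Number of high-band output frames that span one low-band output frame."""
    if low.stride % high.stride != 0:
        raise ValueError("low-band stride must be an integer multiple "
                         "of the high-band stride for time alignment")
    return low.stride // high.stride
```

With 300 input frames and "same"-style padding, these illustrative settings give 300 output frames for the high band and 100 for the low band, so every low-band frame lines up with exactly three high-band frames.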

[0094] In one embodiment of this application, the feature extraction network may include multiple serially connected sub-networks, wherein each subsequent sub-network is used to extract feature information from the output of the preceding sub-network. Figure 4 shows a schematic diagram of the module structure of one embodiment of the first feature extraction network 1031 and the second feature extraction network 1035. As shown in Figure 4, the first / second feature extraction network may include N first / second sub-networks, namely first / second sub-network 1, first / second sub-network 2, ..., first / second sub-network N. First / second sub-network 2 extracts feature information from the output of first / second sub-network 1 and passes the extracted feature information to first / second sub-network 3, which in turn extracts feature information from it, and so on. In this embodiment, the more first / second sub-networks there are, the richer and more accurate the extracted features; conversely, the fewer first / second sub-networks there are, the less computation is required. Therefore, the number of first / second sub-networks can be set flexibly according to the performance requirements of the voiceprint extraction device and the computing resources available on the device. For example, for low-resource, low-power devices such as smartwatches, smart head-mounted devices, and smart glasses, the number of first / second sub-networks can be set relatively small. On the other hand, for devices with abundant computing resources and high requirements for voiceprint recognition accuracy, such as automatic voice ticket vending machines and smart robots, the number of first / second sub-networks can be set relatively large.

[0095] As described above, the first feature extraction network 1031 and the second feature extraction network 1035 can include convolutional neural networks. Therefore, when the first feature extraction network 1031 and the second feature extraction network 1035 include multiple sub-networks, each sub-network can include a convolutional neural network. In one embodiment of this application, the sub-network can include a neural network composed of at least one dilated convolution module connected in series, wherein the dilated convolution module includes a neural network module based on dilated convolution. Dilated convolution is based on the idea of downsampling and can expand the receptive field of the sub-network while maintaining the same computational load. The receptive field is an important indicator for evaluating neural networks; it refers to the mapping relationship between a network layer and the original input data. The larger the receptive field, the wider the span of original input data that influences the network layer, which generally benefits the model's performance.
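
The claim that dilation enlarges the receptive field without adding computation can be checked with a short calculation. This is a minimal sketch for stride-1 layers; the function names and the dilation schedule `[1, 2, 4]` are illustrative assumptions, not values taken from this application.

```python
from typing import Sequence


def receptive_field(kernel: int, dilations: Sequence[int]) -> int:
    """Receptive field (in input frames) of stacked stride-1 1-D convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel - 1) * dilation
    return field


def taps_per_output(kernel: int, dilations: Sequence[int]) -> int:
    """Multiplications per output frame and channel pair; independent of dilation."""
    return kernel * len(dilations)


plain = receptive_field(3, [1, 1, 1])      # three ordinary layers
dilated = receptive_field(3, [1, 2, 4])    # same layers with growing dilation
print(plain, dilated, taps_per_output(3, [1, 1, 1]), taps_per_output(3, [1, 2, 4]))
```

Both stacks use the same number of kernel taps, but the dilated stack sees roughly twice as many input frames.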

[0096] Figure 5 shows a schematic diagram of the module structure of one embodiment of the sub-network. As shown in Figure 5, the dilated convolution module can be composed of multiple one-dimensional convolutional neural networks based on dilated convolution and activation functions, such as the ReLU function. Additionally, the dilated convolution module may also include residual structures; dilated convolution modules with residual structures can achieve faster model convergence. It should be noted that Figure 5 shows a sub-network consisting of two dilated convolution modules. Of course, the sub-network can also consist of other numbers of dilated convolution modules. This application does not limit the number of dilated convolution modules included in the sub-network, nor does it limit the structure of the dilated convolution modules.
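
A dilated convolution module with a residual structure can be sketched as follows. This is a simplified, single-channel illustration in plain Python rather than the structure of Figure 5: the zero padding, the shared kernel weights, and all function names are assumptions made for the example.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], weights: Sequence[float],
                   dilation: int) -> List[float]:
    """Single-channel 1-D dilated convolution with zero padding (odd kernel)."""
    half = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t + (i - half) * dilation   # taps are `dilation` frames apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def residual_dilated_module(x: Sequence[float], weights: Sequence[float],
                            dilations: Sequence[int]) -> List[float]:
    """Stack of dilated conv + ReLU layers with a skip connection around them."""
    hidden = list(x)
    for dilation in dilations:
        hidden = relu(dilated_conv1d(hidden, weights, dilation))
    return [a + b for a, b in zip(x, hidden)]   # residual addition


print(residual_dilated_module([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0], [1, 2]))
```

With the identity kernel used in the usage line, the stacked layers pass the input through unchanged, so the residual output is simply twice the input; a trained kernel would of course behave differently.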

[0097] It should be noted that the dilated convolution module can include multiple one-dimensional convolutional neural networks, and the kernel size, number of channels, stride, and other parameters of each convolutional neural network can be set separately. When setting the stride, for low-frequency segment spectrograms, the stride of the first convolutional layer of the dilated convolution module (that is, the one-dimensional convolutional neural network 1 in Figure 5) can be set relatively large, while the stride of the other convolutional layers is 1, to reduce information loss. Furthermore, when the number of dilated convolution modules is large, the stride of each dilated convolution module can be set individually. For example, for low-frequency segment spectrograms, the stride of the first convolutional layer of dilated convolution module 1 can be set to 3, and the stride of the other convolutional layers and of the other dilated convolution modules can be set to 1. In this way, the computational cost of the sub-network is reduced while information loss is kept small. Based on the structure shown in Figure 4, the parameters of the first feature extraction network 1031 and the second feature extraction network 1035 can be set as follows:

[0098] First feature extraction network 1031: Number of first sub-networks = 4, Number of dilated convolutional modules = 2, Convolutional kernel size = 3*3, Number of channels = 256, Stride = 1;

[0099] Second feature extraction network 1035: Number of second sub-networks = 4, number of dilated convolutional modules = 2, kernel size = 7*7, number of channels = 64, stride of the first dilated convolutional module = 2, and stride of the rest = 1.
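
The two example parameter sets can be written down as configuration objects and checked against the relationships this application describes between the two networks (smaller kernel, and more channels or a smaller stride, for the first network). This is a minimal sketch; the class `BranchConfig`, its field names, and the helper `has_higher_time_resolution` are hypothetical, and only the side length of the 3*3 and 7*7 kernels is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchConfig:
    num_subnetworks: int
    num_dilated_modules: int
    kernel_size: int          # side length; the text writes 3*3 and 7*7
    channels: int
    first_module_stride: int  # stride of the first dilated convolution module
    other_stride: int = 1     # stride of the remaining modules


FIRST_1031 = BranchConfig(4, 2, kernel_size=3, channels=256, first_module_stride=1)
SECOND_1035 = BranchConfig(4, 2, kernel_size=7, channels=64, first_module_stride=2)


def has_higher_time_resolution(first: BranchConfig, second: BranchConfig) -> bool:
    """Smaller kernel, plus at least one of: more channels, smaller stride."""
    smaller_kernel = first.kernel_size < second.kernel_size
    more_channels = first.channels > second.channels
    smaller_stride = first.first_module_stride < second.first_module_stride
    return smaller_kernel and (more_channels or smaller_stride)


print(has_higher_time_resolution(FIRST_1031, SECOND_1035))
```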

[0100] In this embodiment of the application, when the first feature extraction network 1031 and the second feature extraction network 1035 include multiple sub-networks, during the process of extracting feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, the output results of the sub-networks in the first feature extraction network 1031 and the second feature extraction network 1035 can be synchronized to each other's sub-networks. This allows the first sub-network to incorporate the feature information of the low-frequency sub-spectrogram during feature extraction, and the second sub-network to incorporate the feature information of the high-frequency sub-spectrogram during feature extraction. As shown in Figure 1, in one embodiment of this application, the time information synchronization network 1037 can be used to synchronize the information of the first feature extraction network 1031 to the second feature extraction network 1035, and to synchronize the information of the second feature extraction network 1035 to the first feature extraction network 1031.

[0101] Specifically, in one embodiment of this application, the time information synchronization network 1037 may further include multiple synchronization sub-networks, the number of which matches the sum of the numbers of the first sub-networks and the second sub-networks. For example, as shown in Figure 4, the number of first synchronization sub-networks matches the number of first sub-networks, and the number of second synchronization sub-networks matches the number of second sub-networks. In one example, the first synchronization sub-network 1 is used to synchronize the output of the first sub-network 1 to the second sub-network 2 as the input of the second sub-network 2; the second synchronization sub-network 2 is used to synchronize the output of the second sub-network 1 to the first sub-network 2 as the input of the first sub-network 2. Since the number of channels, stride, and other parameters of the first feature extraction network 1031 and the second feature extraction network 1035 may differ, the outputs of the sub-networks can be adapted in the time and channel dimensions during synchronization, so that the outputs of the first sub-network are adapted to the second sub-network, and the outputs of the second sub-network are adapted to the first sub-network.
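
The data flow of two serially connected branches with cross-synchronization between stages can be sketched as a simple loop. This is a structural illustration only: the function `run_two_branches`, its parameters, and the scalar stand-ins used in the usage lines are assumptions, with every sub-network, synchronization sub-network, and fusion step passed in as an ordinary callable.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
Stage = Callable[[T], T]


def run_two_branches(x_high: T, x_low: T,
                     first_subnets: Sequence[Stage],
                     second_subnets: Sequence[Stage],
                     sync_first_to_second: Sequence[Stage],
                     sync_second_to_first: Sequence[Stage],
                     fuse: Callable[[T, T], T]) -> Tuple[T, T]:
    """Run N stages per branch; the output of stage j feeds stage j+1 of the other branch."""
    high, low = x_high, x_low
    n = len(first_subnets)
    for j in range(n):
        out_high = first_subnets[j](high)
        out_low = second_subnets[j](low)
        if j < n - 1:
            # Adapt each output to the other branch, then fuse it into that branch's input.
            high = fuse(out_high, sync_second_to_first[j](out_low))
            low = fuse(out_low, sync_first_to_second[j](out_high))
        else:
            high, low = out_high, out_low
    return high, low


# Scalar stand-ins: each "sub-network" is a toy function, synchronization is identity.
identity = lambda v: v
result = run_two_branches(1, 1,
                          [lambda v: v + 1] * 2, [lambda v: v * 2] * 2,
                          [identity], [identity],
                          fuse=lambda a, b: a + b)
print(result)
```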

[0102] Figure 6 shows a schematic diagram of the module structure of one embodiment of the first synchronization sub-network j and the second synchronization sub-network j. As shown in Figure 6, the one-dimensional convolutional neural network 1 in the first synchronization sub-network j is used to adapt the output of the first sub-network j in the time dimension, so that the output of the first sub-network j is aligned with the output of the second sub-network j in both the time and channel dimensions. In a specific example of adaptation in the time dimension, when the stride of the first sub-network j is S1 and the stride of the second sub-network j is S2, the stride S of the first convolutional layer of the first synchronization sub-network j (the one-dimensional convolutional neural network 1 shown in Figure 6) can be set as follows:

[0103]
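
The expression for S is not reproduced in this text. As a hedged illustration only, one choice that is consistent with the integer-multiple constraint on the strides described earlier is S = S2 / S1; this relation is an assumption, not the formula of this application, and the helper names below are hypothetical.

```python
def sync_stride(s1: int, s2: int) -> int:
    """Assumed stride of the first synchronization layer: S = S2 / S1."""
    if s1 <= 0 or s2 % s1 != 0:
        raise ValueError("S2 is expected to be an integer multiple of S1")
    return s2 // s1


def frames_after_stride(n_frames: int, stride: int) -> int:
    """Frame count after a strided layer, ignoring kernel edge effects."""
    return n_frames // stride


input_frames = 300                                    # hypothetical input length
first_out = frames_after_stride(input_frames, 1)      # first sub-network j, S1 = 1
second_out = frames_after_stride(input_frames, 2)     # second sub-network j, S2 = 2
adapted = frames_after_stride(first_out, sync_stride(1, 2))
print(adapted == second_out)
```

Under this assumption the adapted output of the first sub-network j has the same number of time frames as the output of the second sub-network j.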

[0104] In this embodiment of the application, during adaptation in the channel dimension, the channel adaptation method can be determined based on the subsequent data fusion method. For example, as shown in Figure 6, after the output of the first sub-network j has been adapted in the time and channel dimensions by the first synchronization sub-network j, the adapted output can be fused with the output of the second sub-network j before being input into the second sub-network (j+1). In this embodiment, the data fusion method can include pointwise multiplication and channel-wise concatenation. Pointwise multiplication requires the same data dimensions, i.e., the same matrix size and number of channels. Based on this, during the channel dimension adaptation of the first synchronization sub-network j, if the subsequent data fusion method is pointwise multiplication, the number of channels in the first convolutional layer of the first synchronization sub-network j can be set to be the same as the number of channels in the second sub-network j. On the other hand, if the subsequent data fusion method is channel-wise concatenation, the number of channels in the first convolutional layer of the first synchronization sub-network j is not limited.
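
The different shape requirements of the two fusion methods can be made concrete with a small example. In this sketch a feature map is a list of channels, each channel a list of time frames; that layout and the function names are assumptions for illustration.

```python
from typing import List

FeatureMap = List[List[float]]   # [channel][time frame]


def pointwise_multiply(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Requires identical channel count and identical number of time frames."""
    if len(a) != len(b) or any(len(ca) != len(cb) for ca, cb in zip(a, b)):
        raise ValueError("pointwise multiplication needs identical dimensions")
    return [[x * y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def concat_channels(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Only the time dimension must match; channel counts may differ."""
    if len(a[0]) != len(b[0]):
        raise ValueError("channel concatenation needs matching time length")
    return a + b


two_channels = [[1.0, 2.0], [3.0, 4.0]]
one_channel = [[5.0, 6.0]]
print(pointwise_multiply(two_channels, [[2.0, 2.0], [1.0, 1.0]]))
print(len(concat_channels(two_channels, one_channel)))
```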

[0105] It should be noted that when the spectrogram is divided into sub-spectrograms of two or more frequency bands, the corresponding number of feature extraction networks is also two or more. Based on this, during information synchronization, a time synchronization module can be set between the feature extraction networks corresponding to adjacent frequency bands, or a time synchronization module can be set between any two feature extraction networks; this application does not impose any restrictions in this respect.

[0106] Furthermore, in this embodiment, during the synchronization of the output results of the sub-networks, the output results of the sub-networks in the first feature extraction network 1031 and the second feature extraction network 1035 can be synchronized to each other's sub-networks according to a dynamically generated ratio. This ratio can be determined based on the correlation between the output results of the sub-networks. That is, in some cases, it is not necessary to synchronize all of a sub-network's output to the other sub-network. When the correlation between the output of one sub-network and the output of the other sub-network is high, a higher proportion of the output can be synchronized to the other sub-network; conversely, when the correlation is low, a lower proportion of the output can be synchronized. In the exemplary structure shown in Figure 6, the one-dimensional convolutional neural network 2 can dynamically generate the ratio while keeping the input and output dimensions unchanged. Specifically, the one-dimensional convolutional neural network 2 can generate the ratio in combination with an activation function, such as the sigmoid function, where each value in the ratio can be a dynamic probability value between 0 and 1. Since the output of the one-dimensional convolutional neural network 1 has the same data dimensions as the ratio, the output of the one-dimensional convolutional neural network 1 can be multiplied point by point with the output of the one-dimensional convolutional neural network 2 (the ratio) and then fused with the output of the second sub-network j.
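
The ratio-based synchronization can be sketched as a sigmoid gate applied to the adapted features. In this simplified illustration the one-dimensional convolutional neural network 2 is replaced by a hypothetical one-tap convolution with a single `weight` and `bias`; these parameters and the function names are assumptions, not part of this application.

```python
import math
from typing import List

FeatureMap = List[List[float]]   # [channel][time frame]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def dynamic_ratio(adapted: FeatureMap, weight: float, bias: float) -> FeatureMap:
    """One-tap convolution followed by sigmoid; same dimensions in and out."""
    return [[sigmoid(weight * v + bias) for v in channel] for channel in adapted]


def gated_output(adapted: FeatureMap, weight: float, bias: float) -> FeatureMap:
    """Point-by-point product of the adapted features with their ratio."""
    ratio = dynamic_ratio(adapted, weight, bias)
    return [[v * r for v, r in zip(ch, rch)] for ch, rch in zip(adapted, ratio)]


adapted = [[0.0, 2.0, -4.0]]
print(gated_output(adapted, weight=0.0, bias=0.0))   # every ratio value is 0.5 here
```

The gated result would then be fused with the output of the second sub-network j, for example by pointwise multiplication or channel-wise concatenation.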

[0107] In this embodiment, the structure of the second synchronization sub-network j is similar to that of the first synchronization sub-network j, and will not be described again here. It should be noted that when the stride of the second sub-network j is greater than that of the first sub-network j, in order to achieve temporal alignment, the first convolutional layer of the second synchronization sub-network j needs to perform upsampling. Therefore, the first convolutional layer may include a one-dimensional deconvolutional neural network. Of course, when the stride of the second sub-network j is 1, the first convolutional layer of the second synchronization sub-network j may include a one-dimensional convolutional neural network.
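
The upsampling performed by a one-dimensional deconvolution (transposed convolution) can be illustrated in a few lines. This is a single-channel sketch with hypothetical names and weights; it only shows how the number of time frames grows with the stride.

```python
from typing import List, Sequence


def conv_transpose1d(x: Sequence[float], weights: Sequence[float],
                     stride: int) -> List[float]:
    """Single-channel 1-D transposed convolution without padding."""
    kernel = len(weights)
    out = [0.0] * ((len(x) - 1) * stride + kernel)
    for t, value in enumerate(x):
        for i, w in enumerate(weights):
            out[t * stride + i] += value * w   # scatter each input frame
    return out


low_rate = [1.0, 2.0, 3.0]                      # output of a stride-2 branch
upsampled = conv_transpose1d(low_rate, [1.0, 1.0], stride=2)
print(upsampled)
```

With a kernel of length 2 and a stride of 2, each low-rate frame is spread over two output frames, so the frame count doubles and matches a stride-1 branch.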

[0108] It should be noted that Figure 6 shows only one exemplary structure of the first / second synchronization sub-network. In other embodiments, the first / second synchronization sub-network may also include more one-dimensional convolutional neural networks, activation functions, and so on. This application does not limit the specific structure of the first / second synchronization sub-network, as long as information synchronization can be achieved.

[0109] S205: The feature information extracted by the feature extraction networks with different time resolutions is fused into the voiceprint of the speech signal.

[0110] After feature information has been extracted from the sub-spectrograms of different frequency bands using feature extraction networks with different time resolutions, the feature information can be fused into the voiceprint of the speech signal. Specifically, the fusion steps mainly include time adaptation, channel concatenation, and vector conversion. Figure 7 shows a schematic diagram of the module structure of one embodiment of the feature fusion module 105. As shown in Figure 7, the output of the first sub-network N is input to the time adaptation module 1051, which aligns the output of the first sub-network N with the output of the second sub-network N in the time dimension. In one example, the time adaptation module 1051 may include a one-dimensional convolutional neural network. Of course, in other embodiments, the time adaptation module 1051 may instead be set on the side of the second sub-network N, so that the output of the second sub-network N is aligned with the output of the first sub-network N in the time dimension; this application does not impose any limitations here. After adaptation in the time dimension, the adapted output can be input to the channel splicing module 1053, which concatenates the data of the channels. Then, since the voiceprint is generally a one-dimensional vector with a preset length, a vector conversion module 1055 is needed to convert the output into a one-dimensional vector with the preset length.
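
The three fusion steps (time adaptation, channel concatenation, vector conversion) can be sketched end to end. This is a simplified stand-in rather than the modules 1051, 1053, and 1055 themselves: time adaptation is done by averaging groups of frames instead of a learned one-dimensional convolution, vector conversion is a time average followed by a hypothetical fixed `projection` matrix, and all names are assumptions.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]   # [channel][time frame]


def time_adapt(feature: FeatureMap, factor: int) -> FeatureMap:
    """Average every `factor` frames; stand-in for a strided 1-D convolution."""
    return [[sum(ch[i:i + factor]) / factor
             for i in range(0, len(ch) - factor + 1, factor)]
            for ch in feature]


def to_vector(feature: FeatureMap, projection: Sequence[Sequence[float]]) -> List[float]:
    """Average over time, then project to a vector of preset length."""
    pooled = [sum(ch) / len(ch) for ch in feature]
    return [sum(w * p for w, p in zip(row, pooled)) for row in projection]


def fuse(high: FeatureMap, low: FeatureMap, factor: int,
         projection: Sequence[Sequence[float]]) -> List[float]:
    aligned = time_adapt(high, factor)            # step 1: time adaptation
    if len(aligned[0]) != len(low[0]):
        raise ValueError("branches are not aligned in time")
    stacked = aligned + low                       # step 2: channel concatenation
    return to_vector(stacked, projection)         # step 3: vector conversion


voiceprint = fuse(high=[[1.0, 3.0, 5.0, 7.0]], low=[[2.0, 4.0]], factor=2,
                  projection=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(voiceprint)
```

The number of rows in `projection` sets the preset length of the resulting vector, here 3.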

[0111] In practical applications, a speech signal is typically processed frame by frame. For example, a 3-second speech signal can be processed at a frame length of 10ms and a frame shift of 5ms. Processing the spectrogram corresponding to one frame of speech signal yields one voiceprint, so processing a 3-second speech signal can yield approximately 600 voiceprints. In this embodiment, the vector averaging module 1057 can be used to determine the voiceprints corresponding to multiple other speech signals within the same speech time period as the given speech signal (e.g., one frame of speech signal). The average value of the voiceprints corresponding to the multiple speech signals within the speech time period is determined, and this average value is used as the voiceprint result corresponding to the speech time period. That is, the voiceprint results corresponding to each speech frame are averaged over time, ultimately yielding a voiceprint recognition result independent of the entire time length.
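
The frame arithmetic and the averaging step can be checked with a short example. This is a minimal sketch; the function names are hypothetical, and the frame count uses the common convention of counting only complete frames.

```python
from typing import List, Sequence


def frame_count(duration_ms: int, frame_ms: int, shift_ms: int) -> int:
    """Number of complete frames in a signal of the given duration."""
    return (duration_ms - frame_ms) // shift_ms + 1


def average_voiceprint(voiceprints: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the per-frame voiceprints."""
    n = len(voiceprints)
    return [sum(column) / n for column in zip(*voiceprints)]


frames = frame_count(3000, frame_ms=10, shift_ms=5)   # approximately 600
mean_print = average_voiceprint([[1.0, 2.0], [3.0, 4.0]])
print(frames, mean_print)
```

The averaged vector has the same length regardless of how many frames contribute to it, which is why the result does not depend on the duration of the speech.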

[0112] Based on the voiceprint recognition device 100 shown in Figure 1, optionally, in one embodiment of this application, the feature extraction network includes a convolutional neural network. Correspondingly, the time resolution of the first feature extraction network for extracting high-frequency segment spectrogram feature information being greater than the time resolution of the second feature extraction network for extracting low-frequency segment spectrogram feature information includes the following:

[0113] The kernel size of the first feature extraction network is smaller than the kernel size of the second feature extraction network.

[0114] Optionally, in one embodiment of this application, the temporal resolution of the first feature extraction network is greater than that of the second feature extraction network, and it further includes at least one of the following:

[0115] The number of channels in the first feature extraction network is greater than the number of channels in the second feature extraction network;

[0116] The step size of the first feature extraction network is smaller than the step size of the second feature extraction network.

[0117] Optionally, in one embodiment of this application, the feature extraction network includes multiple serially connected sub-networks, wherein each subsequent sub-network is used to extract feature information from the output of the previous sub-network.

[0118] Optionally, in one embodiment of this application, the feature extraction module is specifically used for:

[0119] Feature information of the multiple sub-spectrograms is extracted using feature extraction networks with different time resolutions, and the output results of the sub-networks in the first and second feature extraction networks are synchronized to each other's sub-networks.

[0120] Optionally, in one embodiment of this application, the feature extraction module is further configured to:

[0121] The outputs of the subnetworks in the first feature extraction network and the second feature extraction network are synchronized to each other's subnetworks according to a dynamically generated ratio, which is determined based on the correlation of the outputs of the subnetworks.

[0122] Optionally, in one embodiment of this application, the sub-network includes a neural network composed of at least one dilated convolution module connected in series, wherein the dilated convolution module includes a neural network module based on dilated convolution.

[0123] Optionally, in one embodiment of this application, the feature fusion module is specifically used for:

[0124] The feature information extracted by the feature extraction networks with different time resolutions is adapted in the time dimension.

[0125] The adapted feature information is spliced together to generate the voiceprint of the speech signal.

[0126] Optionally, in one embodiment of this application, the feature fusion module is further configured to:

[0127] Determine the voiceprints corresponding to multiple other speech signals within the same speech time period as the speech signal;

[0128] The average value of the voiceprints corresponding to multiple voice signals within the speech period is determined, and the average value is used as the voiceprint result corresponding to the speech period.

[0129] The voiceprint recognition device 100 according to the embodiments of this application can be used to execute the methods described in the embodiments of this application. The above and other operations and / or functions of each module in the voiceprint recognition device 100 are respectively for implementing the corresponding processes of the methods provided in the above embodiments. For the sake of brevity, they will not be described again here.

[0130] It should also be noted that the embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the accompanying drawings of the device embodiments provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0131] An embodiment of this application also provides a terminal device 1200 for implementing the functions of the voiceprint recognition device 100 in the system architecture diagram shown in Figure 1. The terminal device 1200 can be a physical device or a cluster of physical devices, or it can be a virtualized device, such as at least one cloud virtual machine in a cloud computing cluster. For ease of understanding, this application illustrates the structure of the terminal device 1200 as a physical device.

[0132] Figure 8 provides a structural schematic diagram of a terminal device 1200. As shown in Figure 8, the device 1200 includes a bus 1201, a processor 1202, a communication interface 1203, and a memory 1204. The processor 1202, memory 1204, and communication interface 1203 communicate via the bus 1201. The bus 1201 can be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of representation, the bus is drawn as a single thick line in Figure 8, but this does not indicate that there is only one bus or one type of bus. The communication interface 1203 is used for external communication. For example, it can be used to acquire spectrograms of speech signals and to output identified voiceprint information to other modules.

[0133] The processor 1202 may be a central processing unit (CPU). The memory 1204 may include volatile memory, such as random access memory (RAM). The memory 1204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, HDD, or SSD.

[0134] The memory 1204 stores executable code, and the processor 1202 executes the executable code to perform the aforementioned voiceprint recognition method.

[0135] Specifically, when the embodiment shown in Figure 1 is implemented and the modules of the voiceprint recognition device 100 described in the embodiment of Figure 1 are implemented by software, the software or program code required to perform the functions of the spectrogram segmentation module 101, the feature extraction module 103, and the feature fusion module 105 in Figure 1 is stored in the memory 1204. The processor 1202 executes the program code corresponding to each module stored in the memory 1204, such as the program code corresponding to the feature extraction module 103 and the feature fusion module 105, to extract feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, and to fuse the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal.

[0136] The terminal device can be a smartwatch, but is not limited to smartwatches. In some embodiments, the terminal device can also be a smart bracelet, glasses, head-mounted electronic device, goggles, smartphone, PDA, laptop, etc., that includes an ECG device. In some embodiments, the terminal device may also include a serial interface such as an RS-232 interface. This serial interface can be connected to other devices, such as audio external playback devices like smart speakers, enabling the terminal device and the audio external playback device to collaborate in playing audio and video.

[0137] It can be understood that the structure illustrated in Figure 8 does not constitute a specific limitation on the terminal device 1200. In other embodiments of this application, the terminal device may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

[0138] Embodiments of this application provide a non-volatile computer-readable storage medium storing computer program instructions thereon, which, when executed by a processor, implement the above-described method.

[0139] Embodiments of this application provide a computer program product including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in a processor of an electronic device, the processor in the electronic device performs the above-described method.

[0140] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example (but not limited to), electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital video disc (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.

[0141] The computer-readable program instructions or code described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.

[0142] The computer program instructions used to perform the operations of this application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as "C" or similar languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized by utilizing state information from computer-readable program instructions. These electronic circuits can execute computer-readable program instructions to implement various aspects of this application.

[0143] Various aspects of this application are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0144] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0145] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.

[0146] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved.

[0147] It should also be noted that each block in the block diagram and / or flowchart, as well as combinations of blocks in the block diagram and / or flowchart, can be implemented using hardware (such as circuits or ASICs (Application Specific Integrated Circuits)) that performs the corresponding function or action, or using a combination of hardware and software, such as firmware.

[0148] Although the invention has been described herein in conjunction with various embodiments, those skilled in the art will understand and implement other variations of the disclosed embodiments by reviewing the accompanying drawings, disclosure, and appended claims in carrying out the claimed invention. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit can implement several functions listed in the claims. While different dependent claims may recite certain measures, this does not mean that these measures cannot be combined to produce good results.

[0149] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A voiceprint recognition method, characterized in that it comprises: acquiring a spectrogram of a speech signal and dividing the spectrogram into multiple sub-spectrograms of different frequency bands; extracting feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, wherein the time resolution of the first feature extraction network used to extract feature information of the high-frequency sub-spectrogram is greater than the time resolution of the second feature extraction network used to extract feature information of the low-frequency sub-spectrogram, and the time resolution includes the temporal granularity of sub-spectrogram processing, wherein the smaller the temporal granularity, the higher the time resolution; and fusing the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal.

2. The method according to claim 1, characterized in that the feature extraction network includes a convolutional neural network; correspondingly, the time resolution of the first feature extraction network used to extract feature information of the high-frequency sub-spectrogram being greater than the time resolution of the second feature extraction network used to extract feature information of the low-frequency sub-spectrogram includes: the convolution kernel size of the first feature extraction network being smaller than the convolution kernel size of the second feature extraction network.

3. The method according to claim 2, characterized in that the time resolution of the first feature extraction network being greater than the time resolution of the second feature extraction network further includes at least one of the following: the number of channels of the first feature extraction network is greater than the number of channels of the second feature extraction network; the stride of the first feature extraction network is smaller than the stride of the second feature extraction network.
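How kernel size and stride affect time resolution in claims 2 and 3 can be seen in a single-channel sketch. The function `conv1d_time` and the specific kernels below are assumptions for illustration only; the number of channels named in claim 3 is not modeled here.

```python
def conv1d_time(frames, kernel, stride):
    """Unpadded 1-D convolution (cross-correlation, as in CNN layers)
    of a frame sequence along the time axis."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, frames[i:i + k]))
        for i in range(0, len(frames) - k + 1, stride)
    ]


frames = [float(t) for t in range(16)]

# Hypothetical settings: small kernel and stride for the high-frequency branch,
# larger kernel and stride for the low-frequency branch.
high_out = conv1d_time(frames, kernel=[0.5, 0.5], stride=1)
low_out = conv1d_time(frames, kernel=[0.25] * 4, stride=2)
```

For an input of T frames, kernel size k, and stride s, the output has floor((T - k) / s) + 1 steps. With T = 16 this gives 15 steps for the first setting and 7 for the second, so the smaller kernel and stride leave each output step covering a shorter stretch of time.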

4. The method according to any one of claims 1-3, characterized in that the feature extraction network includes multiple sub-networks connected in series, wherein each subsequent sub-network is used to extract feature information from the output of the preceding sub-network.

5. The method according to claim 4, characterized in that extracting the feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions includes: extracting the feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, and synchronizing the output results of the sub-networks in the first feature extraction network and the second feature extraction network to each other's sub-networks.

6. The method according to claim 5, characterized in that synchronizing the output results of the sub-networks in the first feature extraction network and the second feature extraction network to each other's sub-networks includes: synchronizing the output results of the sub-networks in the first feature extraction network and the second feature extraction network to each other's sub-networks according to a dynamically generated ratio, wherein the ratio is determined based on the correlation of the output results of the sub-networks.
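Claim 6 does not fix how the ratio is computed from the correlation. The sketch below makes one possible choice purely for illustration: it assumes the two sub-network outputs have already been aligned to the same length, takes the absolute Pearson correlation as the ratio, and adds each output to the other branch scaled by that ratio. The names `pearson` and `synchronize` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def synchronize(high_out, low_out):
    """Exchange information between two branches.

    Assumption: the ratio is |Pearson correlation|, so strongly correlated
    outputs are mixed more heavily; the patent text only states that the
    ratio depends on the correlation.
    """
    ratio = abs(pearson(high_out, low_out))
    new_high = [h + ratio * l for h, l in zip(high_out, low_out)]
    new_low = [l + ratio * h for h, l in zip(high_out, low_out)]
    return new_high, new_low, ratio


new_high, new_low, ratio = synchronize([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Because the ratio is recomputed from the current outputs on every call, it changes with the input, which is the sense in which it is dynamically generated in this sketch.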

7. The method according to claim 4, characterized in that the sub-network includes a neural network formed by at least one dilated convolution module connected in series, wherein the dilated convolution module includes a neural network module based on dilated convolution.
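As a rough illustration of the dilated convolution named in claim 7, the following single-channel sketch applies a kernel whose taps are spaced `dilation` frames apart and computes the receptive field of several such layers connected in series. The helper names and the dilation values are assumptions, not taken from the patent.

```python
def dilated_conv1d(frames, kernel, dilation):
    """Unpadded 1-D dilated convolution: kernel taps are `dilation` frames apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # number of input frames one output step covers
    return [
        sum(kernel[j] * frames[i + j * dilation] for j in range(k))
        for i in range(len(frames) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated layers connected in series."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


frames = [float(t) for t in range(10)]
out = dilated_conv1d(frames, kernel=[1.0, 1.0, 1.0], dilation=2)
field = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

Three layers with kernel size 3 and dilations 1, 2, and 4 cover 15 input frames while using only 9 weights per channel, which is the usual motivation for dilated convolution when the parameter count matters.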

8. The method according to claim 1, characterized in that fusing the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal includes: adapting, in the time dimension, the feature information extracted by the feature extraction networks with different time resolutions; and splicing the adapted feature information together to generate the voiceprint of the speech signal.
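One simple way to read the time-dimension adaptation and splicing of claim 8 is sketched below. The adaptation used here, averaging each channel over time so that branches with different numbers of time steps yield fixed-length vectors, is an assumption chosen for brevity; the claim does not prescribe it. The names `mean_over_time` and `fuse` are hypothetical.

```python
def mean_over_time(features):
    """Collapse each channel (row) to its mean across time steps."""
    return [sum(row) / len(row) for row in features]


def fuse(high_feat, low_feat):
    """Adapt both branches in the time dimension, then splice (concatenate) them."""
    return mean_over_time(high_feat) + mean_over_time(low_feat)


# Hypothetical branch outputs: 2 channels x 4 steps and 1 channel x 2 steps.
high_feat = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
low_feat = [[10.0, 20.0]]
voiceprint = fuse(high_feat, low_feat)
```

After adaptation both branches no longer depend on their own number of time steps, so the spliced vector has one entry per channel regardless of the differing time resolutions.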

9. The method according to claim 1, characterized in that, after fusing the feature information extracted by the feature extraction networks with different time resolutions into the voiceprint of the speech signal, the method further includes: determining the voiceprints corresponding to multiple other speech signals within the same speech period as the speech signal; and determining the average value of the voiceprints corresponding to the multiple speech signals within the speech period, and using the average value as the voiceprint result corresponding to the speech period.
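The averaging step of claim 9 amounts to an element-wise mean over the voiceprints obtained within one speech period. A minimal sketch, assuming each voiceprint is a fixed-length list of floats and using the hypothetical name `average_voiceprint`:

```python
def average_voiceprint(voiceprints):
    """Element-wise mean of equal-length voiceprint vectors from one speech period."""
    if not voiceprints:
        raise ValueError("at least one voiceprint is required")
    dim = len(voiceprints[0])
    if any(len(v) != dim for v in voiceprints):
        raise ValueError("all voiceprints must have the same length")
    count = len(voiceprints)
    return [sum(v[i] for v in voiceprints) / count for i in range(dim)]


# Three hypothetical voiceprints from speech signals in the same speech period.
period_result = average_voiceprint([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```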

10. A voiceprint recognition device, characterized in that it comprises: a spectrogram segmentation module, used to acquire a spectrogram of a speech signal and divide the spectrogram into multiple sub-spectrograms of different frequency bands; a feature extraction module, used to extract feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, wherein the time resolution of a first feature extraction network used to extract feature information of a high-frequency sub-spectrogram is greater than the time resolution of a second feature extraction network used to extract feature information of a low-frequency sub-spectrogram, the time resolution includes the time granularity at which a sub-spectrogram is processed, and the smaller the time granularity, the higher the time resolution; and a feature fusion module, used to fuse the feature information extracted by the feature extraction networks with different time resolutions into a voiceprint of the speech signal.

11. The device according to claim 10, characterized in that the feature extraction network includes a convolutional neural network; correspondingly, the time resolution of the first feature extraction network used to extract feature information of the high-frequency sub-spectrogram being greater than the time resolution of the second feature extraction network used to extract feature information of the low-frequency sub-spectrogram includes: the convolution kernel size of the first feature extraction network being smaller than the convolution kernel size of the second feature extraction network.

12. The device according to claim 11, characterized in that the time resolution of the first feature extraction network being greater than the time resolution of the second feature extraction network further includes at least one of the following: the number of channels of the first feature extraction network is greater than the number of channels of the second feature extraction network; the stride of the first feature extraction network is smaller than the stride of the second feature extraction network.

13. The device according to any one of claims 10-12, characterized in that the feature extraction network includes multiple sub-networks connected in series, wherein each subsequent sub-network is used to extract feature information from the output of the preceding sub-network.

14. The device according to claim 13, characterized in that the feature extraction module is specifically used to: extract the feature information of the multiple sub-spectrograms using feature extraction networks with different time resolutions, and synchronize the output results of the sub-networks in the first feature extraction network and the second feature extraction network to each other's sub-networks.

15. The device according to claim 14, characterized in that the feature extraction module is further used to: synchronize the output results of the sub-networks in the first feature extraction network and the second feature extraction network to each other's sub-networks according to a dynamically generated ratio, wherein the ratio is determined based on the correlation of the output results of the sub-networks.

16. The device according to claim 13, characterized in that the sub-network includes a neural network formed by at least one dilated convolution module connected in series, wherein the dilated convolution module includes a neural network module based on dilated convolution.

17. The device according to claim 10, characterized in that the feature fusion module is specifically used to: adapt, in the time dimension, the feature information extracted by the feature extraction networks with different time resolutions; and splice the adapted feature information together to generate the voiceprint of the speech signal.

18. The device according to claim 10, characterized in that the feature fusion module is further used to: determine the voiceprints corresponding to multiple other speech signals within the same speech period as the speech signal; and determine the average value of the voiceprints corresponding to the multiple speech signals within the speech period, and use the average value as the voiceprint result corresponding to the speech period.

19. A terminal device, characterized in that it comprises: a processor; and a memory used to store processor-executable instructions; wherein the processor is configured to implement the method according to any one of claims 1-9 when executing the instructions.

20. A non-volatile computer-readable storage medium storing computer program instructions, characterized in that, when the computer program instructions are executed by a processor, the method according to any one of claims 1-9 is implemented.

21. A computer program product, characterized in that it includes computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device performs the method according to any one of claims 1-9.