Speech fundamental frequency detection method and device based on biological hearing inspiration
By employing biomimetic multi-channel bandpass filtering technology, combined with the biological cochlear multi-channel processing mechanism, periodic structural evidence is labeled and confidence is calculated. This solves the problems of detection accuracy and stability of existing fundamental frequency detection technologies in complex speech waveforms and noisy environments, achieving higher detection accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG SCI-TECH UNIV
- Filing Date
- 2026-05-20
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application relates to the field of speech signal processing, and in particular to a speech fundamental frequency detection method and device based on bio-auditory inspiration.
Background Technology
[0002] The fundamental frequency (F0) of speech is a core parameter characterizing human pitch perception and a basic technical indicator for scenarios such as speech analysis, speech synthesis, speaker recognition, prosodic modeling, and pathological speech assessment. The essence of F0 detection is to accurately estimate the basic period of vocal cord vibration from a continuous speech signal and convert it into a continuous and stable F0 trajectory. Natural speech exhibits harmonic superposition, formant modulation, non-stationary changes, and rapid switching between voiced and unvoiced sounds, so the actual waveform differs significantly from an ideal periodic signal. Therefore, achieving stable, accurate, and low-false-alarm F0 detection has always been a key technical challenge in the field of speech signal processing.
[0003] Current mainstream fundamental frequency detection technologies fall into three categories: traditional rule-based methods, data-driven deep learning methods, and auditory-inspired methods. Traditional rule-based methods rely on short-time frame analysis: they extract periodic features through autocorrelation, difference functions, or cepstral analysis, and then output the fundamental frequency after thresholding and smoothing. Among representative methods, YIN reduces octave errors through a difference function and normalization, while pYIN introduces a probabilistic model and a hidden Markov model to improve trajectory continuity. However, these methods depend on fixed frame lengths and hand-crafted rules, and their local periodic cues are prone to failure on complex speech waveforms. It is difficult for them to balance missed detections in voiced regions against false alarms in unvoiced regions, so the overall trajectory stability is insufficient.
[0004] Data-driven deep learning methods directly learn the nonlinear mapping relationship between speech features and fundamental frequency. Models such as CREPE, PENN, and SwiftF0 have high accuracy on public datasets and are adaptable to complex noise. However, these methods are highly dependent on the distribution of training data, have limited generalization ability, and are black-box mechanisms internally, lacking interpretability and failing to demonstrate the reasoning basis from speech signal to fundamental frequency. They also have application limitations in low-power, high-reliability speech perception scenarios.
[0005] Auditory-inspired methods draw on the frequency decomposition of the cochlea and the phase-locking mechanism of the auditory nerve. They estimate the fundamental frequency through multi-channel filtering, temporal structure analysis, and periodicity fusion, which aligns better with the way human hearing processes periodic sounds. However, existing auditory-inspired methods still have significant shortcomings: they do not fully use event-driven neural coding representations, are weak at integrating periodic structures across frequency channels, lack accurate identification and boundary correction of the phonation process, are not robust in low-frequency noise environments, and struggle to achieve detection accuracy, trajectory stability, and false alarm suppression at the same time.
[0006] In summary, there is an urgent need for a speech fundamental frequency detection scheme that overcomes the inability of existing technologies to simultaneously provide strong interpretability, controllable false alarms and missed detections, noise robustness, and trajectory stability, thereby improving the overall performance of fundamental frequency detection.
Summary of the Invention
[0007] This application provides a speech fundamental frequency detection method and device based on bio-auditory inspiration. It adopts a biomimetic multi-channel bandpass filter that follows the channel-wise processing mechanism of the cochlea and preserves the temporal details of the signal. It marks 1/2 structure evidence for periodic structures and calculates their confidence, which suppresses octave errors, distinguishes stable periodic structures from noise disturbances, and thus generates a continuous and stable speech fundamental frequency trajectory.
[0008] In a first aspect, embodiments of this application provide a speech fundamental frequency detection method based on bio-auditory inspiration, the method comprising:
[0009] Acquire the speech signal to be detected, construct multiple frequency channels corresponding to different frequencies within a preset fundamental frequency search range, and input the speech signal to be detected into each frequency channel. In each frequency channel, extract the bandpass filter output signal of the corresponding frequency from the speech signal to be detected, and detect extreme events in the bandpass filter output signal to obtain an event pulse sequence. For each event pulse in the event pulse sequence, take the subsequent event pulse whose time interval is closest to the target period to form a candidate pulse pair, and arrange the candidate pulse pairs in chronological order to form a candidate pulse pair sequence. Take the periodically continuous candidate pulse pairs in the candidate pulse pair sequence as periodic structure candidate intervals. Calculate the periodic structure confidence based on the event pulse intervals and the smoothness of the event pulse amplitude changes in the local neighborhood of each periodic structure candidate interval, and mark the periodic structure candidate intervals that have 1/2 structure evidence. A periodic structure candidate interval has 1/2 structure evidence if an event pulse exists in its midpoint position interval. The local neighborhood is the range, centered on the periodic structure candidate interval within the candidate pulse pair sequence, that corresponds to a preset multiple of the period. Obtain all periodic structure candidate intervals corresponding to the current frame in each frequency channel as the candidate frequency set of the current frame. Based on the periodic structure confidence of each periodic structure candidate interval and the 1/2 structure evidence marks, obtain the fundamental frequency estimate of the current frame from the candidate frequency set of the current frame. Perform phonation process recognition on the fundamental frequency estimates of multiple consecutive frames to obtain a continuous speech fundamental frequency trajectory.
[0010] In a second aspect, embodiments of this application provide a speech fundamental frequency detection device based on bio-auditory inspiration, comprising: an acquisition module, configured to acquire the speech signal to be detected, construct multiple frequency channels corresponding to different frequencies within a preset fundamental frequency search range, and input the speech signal to be detected into each frequency channel; a confidence calculation module, configured to extract, in each frequency channel, the bandpass filter output signal of the corresponding frequency from the speech signal to be detected, detect extreme events in the bandpass filter output signal to obtain an event pulse sequence, take for each event pulse the subsequent event pulse whose time interval is closest to the target period to form a candidate pulse pair, arrange the candidate pulse pairs in chronological order to form a candidate pulse pair sequence, take the periodically continuous candidate pulse pairs in the candidate pulse pair sequence as periodic structure candidate intervals, calculate the periodic structure confidence based on the event pulse intervals and the smoothness of the event pulse amplitude changes in the local neighborhood of each periodic structure candidate interval, and mark the periodic structure candidate intervals that have 1/2 structure evidence, wherein a periodic structure candidate interval has 1/2 structure evidence if an event pulse exists in its midpoint position interval, and the local neighborhood is the range, centered on the periodic structure candidate interval within the candidate pulse pair sequence, that corresponds to a preset multiple of the period; a fundamental frequency estimation module, configured to obtain all periodic structure candidate intervals corresponding to the current frame in each frequency channel as the candidate frequency set of the current frame, and to obtain the fundamental frequency estimate of the current frame from that set based on the periodic structure confidence of each periodic structure candidate interval and the 1/2 structure evidence marks; and a fundamental frequency trajectory output module, configured to perform phonation process recognition on the fundamental frequency estimates of multiple consecutive frames to obtain a continuous speech fundamental frequency trajectory.
[0011] In a third aspect, embodiments of this application provide an electronic device including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the speech fundamental frequency detection method based on bio-auditory inspiration.
[0012] The main contributions and innovations of this invention are as follows. This application employs a biomimetic multi-channel bandpass filter that is consistent with the cochlea's channel-wise frequency decomposition of sound. This helps preserve the temporal details of each frequency band and keeps phase distortion from interfering with the subsequent periodicity analysis. This application uses a three-state biomimetic spiking neuron to detect extreme events, with peak and valley values as the judgment criteria. Noise thresholds and refractory periods are set to eliminate invalid pulses generated by stationary noise and to suppress repeated firing within a single cycle, so that the event pulse sequence contains only the triggering information of effective periodic structures. This application filters periodic candidate intervals based on three conditions (period tolerance, consistency between adjacent periods, and amplitude intensity) and marks 1/2 structure evidence, which eliminates accidentally periodic and unreliable low-amplitude pulse pairs. The 1/2 structure evidence identifies half-cycle support relationships, suppressing octave errors and improving the accuracy of the periodicity judgment.
[0013] Details of one or more embodiments of this application are set forth in the following drawings and description to make other features, objects and advantages of this application more readily apparent.
Description of the Drawings
[0014] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
Figure 1 is a schematic diagram of a speech fundamental frequency detection method based on bio-auditory inspiration according to an embodiment of this application;
Figure 2 is a schematic diagram of a biomimetic spiking neuron according to an embodiment of this application;
Figure 3 is a structural block diagram of a speech fundamental frequency detection device based on bio-auditory inspiration according to an embodiment of this application;
Figure 4 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application.
Detailed Description
[0015] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.
[0016] It should be noted that the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification in other embodiments. In some other embodiments, the methods may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may be broken down into multiple steps in other embodiments; and multiple steps described in this specification may be combined into a single step in other embodiments.
[0017] Example 1: This application provides a speech fundamental frequency detection method based on bio-auditory inspiration. It employs a biomimetic multi-channel bandpass filter that follows the channel-wise processing mechanism of the cochlea and preserves the temporal details of the signal, marks 1/2 structure evidence for periodic structures, and calculates their confidence. This suppresses octave errors, distinguishes stable periodic structures from noise disturbances, and thus generates a continuous and stable speech fundamental frequency trajectory. Specifically, referring to Figure 1, the method includes the following. The speech signal to be detected is acquired, multiple frequency channels corresponding to different frequencies are constructed within a preset fundamental frequency search range, and the speech signal to be detected is input into each frequency channel. In each frequency channel, the bandpass filter output signal of the corresponding frequency is extracted from the speech signal to be detected, extreme events in the bandpass filter output signal are detected to obtain an event pulse sequence, and for each event pulse the subsequent event pulse whose time interval is closest to the target period is taken to form a candidate pulse pair. The candidate pulse pairs are arranged in chronological order to form a candidate pulse pair sequence, and the periodically continuous candidate pulse pairs in that sequence are taken as periodic structure candidate intervals. The periodic structure confidence is calculated based on the event pulse intervals and the smoothness of the event pulse amplitude changes in the local neighborhood of each periodic structure candidate interval, and the periodic structure candidate intervals that have 1/2 structure evidence are marked. A periodic structure candidate interval has 1/2 structure evidence if an event pulse exists in its midpoint position interval. The local neighborhood is the range, centered on the periodic structure candidate interval within the candidate pulse pair sequence, that corresponds to a preset multiple of the period. All periodic structure candidate intervals corresponding to the current frame in each frequency channel are obtained as the candidate frequency set of the current frame. Based on the periodic structure confidence of each periodic structure candidate interval and the 1/2 structure evidence marks, the fundamental frequency estimate of the current frame is obtained from the candidate frequency set of the current frame. Phonation process recognition is performed on the fundamental frequency estimates of multiple consecutive frames to obtain a continuous speech fundamental frequency trajectory.
[0018] In the current embodiment, multiple frequency channels are constructed based on a preset fundamental frequency search range. In each frequency channel, a bandpass filter with a different center frequency is used to process the speech signal to be detected, thereby completing multi-resolution frequency decomposition, which conforms to the channel-based processing mechanism of the biological cochlea for sound signals.
[0019] Furthermore, in the frequency channel, a bandpass response consisting of a low-pass structural unit and a high-pass structural unit connected in series is used to filter the speech signal to be detected. Both the low-pass and high-pass structural units are iteratively updated based on their internal state variables. The low-pass structural unit outputs its internal state quantity, and the high-pass structural unit outputs the product of the state change and the time constant. The time constants of the low-pass and high-pass structural units can be set to be the same or matched, and are determined according to the target frequency to obtain the bandpass filtered output signal corresponding to the center frequency.
[0020] Specifically, the series connection of low-pass and high-pass structural units ensures that each frequency channel retains the time-domain details of the signal within the corresponding frequency band, avoiding phase distortion from interfering with subsequent periodic analysis.
[0021] Specifically, the internal state update of the low-pass / high-pass structural unit can be represented as:
[0022] $$\Delta s_t = \frac{x_t - s_{t-1}}{\tau}, \qquad s_t = s_{t-1} + \Delta s_t$$
where $t$ is the time index, $x_t$ is the speech signal to be detected, $s$ is the internal state, $\tau$ is the time constant, $s_t$ is the internal state at time $t$, and $\Delta s_t$ is the state change at time $t$. In the low-pass structural unit, the low-pass filter result at time $t$ can be represented as $y_t^{\mathrm{LP}} = s_t$. In the high-pass structural unit, the high-pass filter result at time $t$ can be represented as $y_t^{\mathrm{HP}} = \tau \, \Delta s_t$.
[0023] In other words, by setting different time constants and filter cascade parameters for different frequency channels, multiple bandpass filter output signals with different center frequencies and different frequency selection intensities can be obtained. Specifically, different filter orders, thresholds and amplitude gating ranges can be set for different frequencies to take into account the response characteristics of low-frequency and high-frequency channels.
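The series low-pass / high-pass channel described above can be sketched as follows. This is a minimal illustration rather than the implementation of this application: the first-order update rule, the function name `bandpass_channel`, and the choice of time constant `sample_rate / (2 * pi * center_hz)` (in samples) are assumptions made for the example. Only the two output rules (low-pass output equals the internal state, high-pass output equals the state change multiplied by the time constant) are taken from the description.

```python
import math

def bandpass_channel(x, center_hz, sample_rate):
    """One frequency channel: low-pass unit followed by high-pass unit.

    Assumption: both units share one time constant derived from the
    channel's center frequency, expressed in samples.
    """
    tau = sample_rate / (2.0 * math.pi * center_hz)
    s_lp = 0.0  # internal state of the low-pass unit
    s_hp = 0.0  # internal state of the high-pass unit
    out = []
    for sample in x:
        # Low-pass unit: update the state, output the state itself.
        delta_lp = (sample - s_lp) / tau
        s_lp += delta_lp
        # High-pass unit: update the state, output state change * tau.
        delta_hp = (s_lp - s_hp) / tau
        s_hp += delta_hp
        out.append(delta_hp * tau)
    return out
```

With these assumptions, a constant input decays to zero at the output, and a tone at the center frequency passes with a much larger amplitude than a tone far above it, which is the band-selective behavior the channel is meant to provide.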
[0024] In the current embodiment, in the step of obtaining an event pulse sequence by detecting extreme events in the bandpass filter output signal, a biomimetic spiking neuron is constructed. The biomimetic spiking neuron includes a resting state, a decision state, and a firing state. The biomimetic spiking neuron defaults to the resting state. In the resting state, the bandpass filter output signal is continuously detected. When an extreme event is detected to be about to occur, it switches to the decision state. In the decision state, it is determined whether the extreme event meets the preset extreme value confirmation condition. If it does, it switches to the firing state; if it does not, it switches to the resting state. In the firing state, an event pulse corresponding to the extreme event is output, and after a preset refractory period, it switches to the resting state. The biomimetic spiking neuron uses the peak and valley values in the bandpass filter output signal as extreme events.
[0025] Specifically, a schematic diagram of the biomimetic spiking neuron is shown in Figure 2. The extreme event is determined based on the direction of the signal gradient. In the resting state, when the gradient of the bandpass filter output signal is positive and the signal value is greater than the preset noise threshold, a peak extreme event is about to occur; when the gradient of the bandpass filter output signal is negative and the signal value is greater than the preset noise threshold, a valley extreme event is about to occur. This eliminates invalid pulses generated by stationary noise and helps ensure that the event pulse sequence contains only the triggering information of effective periodic structures.
[0026] Specifically, setting a refractory period suppresses repeated firing within a single cycle.
[0027] Specifically, each extreme event output by the bionic spiking neuron includes time location information and amplitude information. Based on the time location information, the extreme events are sorted in chronological order to obtain the event pulse sequence.
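The three-state neuron can be illustrated with a small state machine. The sketch below rests on several assumptions that are not specified in the description: the extreme value confirmation condition is taken to be "the gradient reverses and the extremum magnitude reaches a hypothetical `confirm_threshold`", the valley test is applied to the negative side of the signal (value below `-noise_threshold`), and the refractory period is counted in samples. The function and parameter names are illustrative only.

```python
from enum import Enum

class State(Enum):
    REST = 0
    DECIDE = 1
    FIRE = 2

def detect_events(y, noise_threshold, refractory, confirm_threshold=None):
    """Return event pulses as (sample_index, amplitude) for peaks and valleys."""
    if confirm_threshold is None:
        confirm_threshold = noise_threshold  # assumed default
    events = []
    state = State.REST
    direction = 0  # +1 while approaching a peak, -1 while approaching a valley
    wait = 0       # remaining refractory samples
    for n in range(1, len(y)):
        grad = y[n] - y[n - 1]
        if state is State.FIRE:
            # Refractory period: ignore the input, then return to rest.
            wait -= 1
            if wait <= 0:
                state = State.REST
        elif state is State.REST:
            if grad > 0 and y[n] > noise_threshold:
                state, direction = State.DECIDE, 1
            elif grad < 0 and y[n] < -noise_threshold:
                state, direction = State.DECIDE, -1
        elif grad * direction < 0:
            # DECIDE: the gradient reversed, so sample n-1 was the extremum.
            if abs(y[n - 1]) >= confirm_threshold:
                events.append((n - 1, y[n - 1]))
                state, wait = State.FIRE, refractory
            else:
                state = State.REST
    return events
```

Because each event carries both a time position and an amplitude, the returned list is already the chronologically ordered event pulse sequence for the channel.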
[0028] In the current embodiment, the reciprocal of the frequency corresponding to the frequency channel is used as the target period of the current frequency channel.
[0029] In other words, for a frequency channel with frequency $f_c$, the target period is $T_c = 1 / f_c$.
[0030] For example, denote the current event pulse as $e_i$ and its time position as $t_i$. Traverse all event pulses that follow $e_i$ and find the event pulse $e_j$, with time position $t_j$, that minimizes $|(t_j - t_i) - T_c|$, where $T_c$ is the target period of the current frequency channel. The pair $(e_i, e_j)$ is then taken as the candidate pulse pair corresponding to $e_i$. Traversing the complete event pulse sequence yields all candidate pulse pairs, which are sorted by time to form the candidate pulse pair sequence.
[0031] Specifically, obtaining candidate pulses based on the target period can directly match the target periodicity of the current frequency channel, avoiding undirected traversal search, significantly reducing the amount of computation, while preserving the accuracy of the candidate pulse pair matching the target period, providing a reliable foundation for subsequent periodic structure screening.
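The pairing rule can be sketched in a few lines. The function name `candidate_pairs` and the representation of event pulses as a sorted list of time positions in seconds are assumptions made for this illustration; a practical implementation would likely restrict the search to a window around one target period instead of scanning every later pulse.

```python
def candidate_pairs(event_times, channel_hz):
    """Pair each event pulse with the later pulse closest to one target period.

    event_times: chronologically sorted time positions in seconds.
    Returns a chronologically sorted list of (t_start, t_end) pairs.
    """
    target = 1.0 / channel_hz  # target period of this frequency channel
    pairs = []
    for i, t_i in enumerate(event_times):
        later = event_times[i + 1:]
        if not later:
            continue  # the last pulse has no successor
        t_j = min(later, key=lambda t: abs((t - t_i) - target))
        pairs.append((t_i, t_j))
    pairs.sort()
    return pairs
```

For a 100 Hz channel and pulses every 5 ms, each pulse is paired with the pulse two positions later, since that interval matches the 10 ms target period.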
[0032] In the current embodiment, in the step of taking the periodically continuous candidate pulse pairs in the candidate pulse pair sequence as periodic structure candidate intervals, a candidate pulse pair is considered periodically continuous when all of the following hold: the difference between the period of the current candidate pulse pair and the target period is less than a first preset tolerance range; the difference between the average period of the current candidate pulse pair together with its adjacent candidate pulse pairs and the target period is less than a second preset tolerance range; and the endpoint amplitude intensity of the current candidate pulse pair, or the endpoint amplitude intensity of its overlapping candidate pulse pairs, is greater than the intensity threshold.
[0033] Furthermore, the amplitude condition is satisfied if either the endpoint amplitude intensity of the current candidate pulse pair or the endpoint amplitude intensity of its overlapping candidate pulse pairs is greater than the intensity threshold.
[0034] Specifically, the purpose of using the difference between the current candidate pulse pair and the target period being less than the first preset tolerance range as a judgment condition is to retain only candidate pulse pairs whose deviation from the target period falls within the first preset tolerance range, in order to eliminate pulse pairs that are obviously mismatched with the corresponding frequency channel.
[0035] Specifically, the objective of using the difference between the average period of the current candidate pulse pair and the adjacent candidate pulse pairs and the target period being less than the second preset tolerance range as the judgment condition is to examine the neighboring candidate pulse pairs on the left and right sides of the current candidate pulse pair and require them to maintain a similar periodic structure in the local time neighborhood, thereby avoiding the occasional periodicity of the candidate pulse pairs.
[0036] Specifically, requiring the endpoint amplitude of the candidate pulse pair, or the pulse amplitude of its overlapping candidate pulse pairs, to be greater than the intensity threshold eliminates low-reliability candidate pulse pairs whose overall amplitude is too low.
[0037] Specifically, other candidate pulse pairs that overlap on the time axis and have similar periods are considered as overlapping candidate pulse pairs of the current candidate pulse pair.
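The three screening conditions can be combined as in the following sketch. The tuple layout `(t_start, t_end, amp_start, amp_end)`, the use of the smaller absolute endpoint amplitude as the "endpoint amplitude intensity", the one-neighbor-per-side averaging, and all names are assumptions for illustration; the description does not fix these details.

```python
def periodic_candidates(pairs, target, tol_pair, tol_mean, min_amp):
    """Keep candidate pulse pairs that are periodically continuous.

    pairs: chronological list of (t_start, t_end, amp_start, amp_end).
    """
    kept = []
    for k, (t0, t1, a0, a1) in enumerate(pairs):
        period = t1 - t0
        # Current pair plus its left and right neighbors (where present).
        near = pairs[max(0, k - 1):k + 2]
        mean_period = sum(p[1] - p[0] for p in near) / len(near)
        # Overlapping pairs: overlap on the time axis and have a similar period.
        overlapping = [
            p for j, p in enumerate(pairs)
            if j != k and p[0] < t1 and p[1] > t0
            and abs((p[1] - p[0]) - period) < tol_pair
        ]
        strong = min(abs(a0), abs(a1)) > min_amp or any(
            min(abs(p[2]), abs(p[3])) > min_amp for p in overlapping
        )
        if (abs(period - target) < tol_pair
                and abs(mean_period - target) < tol_mean
                and strong):
            kept.append(pairs[k])
    return kept
```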
[0038] In the current embodiment, a preset time range adjacent to the center time position of the candidate periodic structure interval is defined as the midpoint position interval. The event pulses existing in the midpoint position interval are regarded as half-cycle pulses. If the difference between the average amplitude of the endpoints of the current candidate periodic structure interval and the amplitude of the corresponding half-cycle pulse endpoint is less than a preset amplitude tolerance, then the current candidate periodic structure interval satisfies the half-cycle judgment condition. When the proportion of the candidate periodic structure intervals that satisfy the half-cycle judgment condition in the corresponding local neighborhood reaches a preset threshold, it is considered that the current candidate periodic structure interval has 1 / 2 structural evidence.
[0039] For example, the preset threshold in this scheme is 60%. That is, if at least 60% of the candidate pulse pairs in the local neighborhood meet the half-cycle judgment condition, the current periodic structure candidate interval is determined to have 1/2 structure evidence, and the interval is marked accordingly.
[0040] Specifically, the 1 / 2 structure evidence is used to characterize the existence of a half-cycle support relationship within the candidate interval, and can be used for hierarchical re-judgment and octave error suppression in subsequent frame-level candidate integration.
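The half-cycle test and the neighborhood vote can be sketched as below. The interval layout `(t_start, t_end, amp_start, amp_end)`, the comparison of absolute amplitudes, and the names `half_cycle_ok` and `has_half_structure` are assumptions for this example; the 60% ratio is the example value given in the description.

```python
def half_cycle_ok(interval, events, mid_window, amp_tol):
    """True if an event pulse near the interval midpoint has a matching amplitude.

    interval: (t_start, t_end, amp_start, amp_end); events: list of (time, amplitude).
    """
    t0, t1, a0, a1 = interval
    mid = 0.5 * (t0 + t1)
    mean_amp = 0.5 * (abs(a0) + abs(a1))
    return any(
        abs(t - mid) <= mid_window and abs(abs(a) - mean_amp) < amp_tol
        for t, a in events
    )

def has_half_structure(neighbourhood, events, mid_window, amp_tol, ratio=0.6):
    """Mark 1/2 structure evidence when enough intervals pass the half-cycle test."""
    if not neighbourhood:
        return False
    hits = sum(half_cycle_ok(iv, events, mid_window, amp_tol) for iv in neighbourhood)
    return hits / len(neighbourhood) >= ratio
```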
[0041] In the current embodiment, the confidence level of the periodic structure candidate interval is the product of the frequency score and the amplitude score. The formula for calculating the frequency score is as follows:
[0042] In this formula, $S_f$ is the frequency score, $N$ is the total number of candidate pulse pairs in the corresponding local neighborhood, $i$ is the candidate pulse pair index, $\alpha$ is the attenuation coefficient, $f_i$ is the frequency of the $i$-th candidate pulse pair, and $f_c$ is the frequency of the corresponding frequency channel. The relative error set is obtained by calculating the relative amplitude error of every candidate pulse pair in the corresponding local neighborhood, and the amplitude change set is obtained by calculating the amplitude change of every candidate pulse pair in that neighborhood. The median of the relative error set and the volatility of the amplitude change set are then obtained. The base score of the current periodic structure candidate interval is calculated as follows:
[0043] In this formula, $S_b$ is the base score, $i$ is the candidate pulse pair index, $N$ is the total number of candidate pulse pairs, and $w_1$, $w_2$, $w_3$ are preset weight parameters. The product of the base score and the intensity factor is used as the amplitude score of the current periodic structure candidate interval, expressed by the formula:
[0044] $$S_a = S_b \cdot \gamma$$
[0045] where $S_a$ is the amplitude score and $\gamma$ is the intensity factor, obtained by mapping the average endpoint amplitude of all candidate pulse pairs within the corresponding local neighborhood onto a linear interval. The preset parameters involved are the minimum smoothing score and the maximum smoothing score of the candidate pulse pairs, and the lower limit and the upper limit of the linear interval.
[0046] Specifically, frequency scoring is first used to quantify the periodic matching stability of candidate intervals of periodic structures, thus avoiding distortion in frequency channel extraction.
[0047] Specifically, let the event pulse amplitudes at the two ends of the $i$-th candidate pulse pair be $a_i^{(1)}$ and $a_i^{(2)}$, respectively. The relative amplitude error $r_i$ of the pair is computed from these two amplitudes.
[0048] Here, $r_i$ is the relative amplitude error, and $a_i^{(1)}$ and $a_i^{(2)}$ are the amplitudes of the event pulses at the two ends of the $i$-th candidate pulse pair.
[0049] The amplitude change $d_i$ of the pair is computed from the same two amplitudes.
[0050] Here, $d_i$ is the amplitude change, and $a_i^{(1)}$ and $a_i^{(2)}$ are the amplitudes of the event pulses at the two ends of the $i$-th candidate pulse pair.
[0051] For the relative error set $R = \{r_i\}$ and the amplitude change set $D = \{d_i\}$, the median of the relative error set is $\operatorname{median}(R)$, and the volatility of the amplitude change set measures the spread of the values $d_i$ around their average value $\bar{d}$.
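The scoring chain (frequency score, amplitude statistics, base score, intensity factor, confidence) can be illustrated as follows. Only the overall structure comes from the description: the confidence is the product of the frequency score and the amplitude score, and the amplitude score is the product of the base score and the intensity factor. Every concrete functional form below is an assumption chosen for the example, because the displayed formulas are not available in this text: the exponential frequency penalty, the relative error normalized by the larger endpoint amplitude, the signed amplitude change, the population standard deviation as the volatility, the weighted-penalty base score, and all default parameter values.

```python
import math
from statistics import median, pstdev

def frequency_score(pair_freqs, channel_hz, alpha=5.0):
    # ASSUMED form: mean exponential penalty on the relative frequency deviation.
    return sum(
        math.exp(-alpha * abs(f - channel_hz) / channel_hz) for f in pair_freqs
    ) / len(pair_freqs)

def amplitude_stats(pair_amps):
    # pair_amps: list of (amp_start, amp_end). ASSUMED definitions of both sets.
    rel_err = [abs(a - b) / (max(abs(a), abs(b)) or 1e-12) for a, b in pair_amps]
    change = [b - a for a, b in pair_amps]
    return median(rel_err), pstdev(change)

def confidence(pair_freqs, pair_amps, channel_hz,
               weights=(1.0, 0.5, 0.5), amp_range=(0.0, 1.0), gain_range=(0.5, 1.0)):
    med_err, volatility = amplitude_stats(pair_amps)
    w0, w1, w2 = weights
    # ASSUMED base score: start from w0 and subtract weighted penalties, clipped to [0, 1].
    base = min(1.0, max(0.0, w0 - w1 * med_err - w2 * volatility))
    # Intensity factor: linear map of the mean endpoint amplitude into gain_range.
    mean_amp = sum(0.5 * (abs(a) + abs(b)) for a, b in pair_amps) / len(pair_amps)
    lo, hi = amp_range
    frac = min(1.0, max(0.0, (mean_amp - lo) / (hi - lo)))
    intensity = gain_range[0] + frac * (gain_range[1] - gain_range[0])
    amplitude_score = base * intensity
    return frequency_score(pair_freqs, channel_hz) * amplitude_score
```

Under these assumptions, a neighborhood whose pairs all match the channel frequency and have equal, full-scale endpoint amplitudes scores 1.0, and any frequency deviation or amplitude irregularity lowers the confidence.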
[0052] In the current embodiment, the time window of the current frame is obtained, and the periodic structure candidate intervals whose time centers fall within that time window are taken as the candidate frequency set of the current frame, where the time position of the midpoint of a periodic structure candidate interval is used as its time center. Each periodic structure candidate interval in the candidate frequency set of the current frame is taken as a candidate fundamental frequency layer, and the corresponding periodic structure confidence is taken as the fundamental frequency base score of that layer. The frequency of the candidate fundamental frequency layer with the highest fundamental frequency base score is taken as the initial fundamental frequency. Based on approximate harmonic relationships with the initial fundamental frequency, the candidate frequency set of the current frame is divided into multiple candidate groups, and the group base score of each candidate group is calculated. The candidate group with the highest group base score is selected as the winning candidate group, and the frequency corresponding to the periodic structure candidate interval with the highest fundamental frequency base score in the winning candidate group is used as the fundamental frequency estimate of the current frame. When calculating the group base score, the highest fundamental frequency base score in the candidate group is used as the first score. If the candidate group does not contain a periodic structure candidate interval marked with 1/2 structure evidence, the first score is used as the group base score. If the candidate group contains a periodic structure candidate interval marked with 1/2 structure evidence, the sum of the first score and a preset support score is used as the group base score.
[0053] Specifically, the advantage of this is that when a certain harmonic layer itself has a high confidence level and there is supporting evidence of a 1 / 2 structure with a lower harmonic, it will receive an additional score boost, thereby avoiding misjudging the half-frequency as the fundamental frequency. This directly suppresses the common octave error problem in fundamental frequency detection from a mechanism perspective, and improves the accuracy of fundamental frequency estimation.
[0054] Specifically, using the sum of the first score and the preset support score as the group base score allows candidate groups with 1 / 2 structure evidence to receive more reasonable scores, making it easier for the candidate group corresponding to the true fundamental frequency to win in inter-group comparisons. It also prevents higher harmonics from being output as a result of octave confusion, further strengthening the suppression of octave errors. At the same time, no additional complex post-processing correction steps are required, so the overall computational complexity of the algorithm is controlled while detection accuracy is ensured.
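The group scoring described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `Candidate` type, the `RATIOS` harmonic layers, the tolerance `rel_tol`, the grouping rule in `group_key`, and the value `support=0.10` are all assumptions chosen for illustration, since the specification does not fix how the approximate harmonic relationship is evaluated or what the preset support score is.

```python
from dataclasses import dataclass

# Hypothetical harmonic layers, expressed as ratios to the initial fundamental frequency.
RATIOS = (0.25, 0.5, 1.0, 2.0, 3.0, 4.0)


@dataclass
class Candidate:
    freq_hz: float       # frequency of the periodic structure candidate interval
    confidence: float    # periodic structure confidence, used as the base score
    half_evidence: bool  # True if the interval is marked with 1/2 structure evidence


def group_key(freq_hz, f0_init, rel_tol=0.08):
    """Assumed grouping rule: snap the ratio to the nearest preset harmonic layer."""
    ratio = freq_hz / f0_init
    best = min(RATIOS, key=lambda r: abs(ratio - r) / r)
    if abs(ratio - best) / best <= rel_tol:
        return ("layer", best)
    return ("free", freq_hz)  # not harmonically related: forms its own group


def estimate_f0(candidates, support=0.10):
    """Return the frame's fundamental frequency estimate, or None if there are no candidates."""
    if not candidates:
        return None
    # Initial fundamental frequency: the candidate with the highest base score.
    f0_init = max(candidates, key=lambda c: c.confidence).freq_hz
    groups = {}
    for c in candidates:
        groups.setdefault(group_key(c.freq_hz, f0_init), []).append(c)

    def group_score(members):
        first = max(m.confidence for m in members)  # the "first score"
        bonus = support if any(m.half_evidence for m in members) else 0.0
        return first + bonus

    winner = max(groups.values(), key=group_score)
    return max(winner, key=lambda c: c.confidence).freq_hz


frame = [
    Candidate(110.0, 0.80, False),
    Candidate(220.0, 0.78, True),
    Candidate(218.0, 0.70, False),
]
print(estimate_f0(frame))               # group with marked evidence wins
print(estimate_f0(frame, support=0.0))  # without the support score
```

In this toy frame the group that contains a marked interval overtakes the group with the single highest base score only because of the support score; with `support=0.0` the result falls back to the initial fundamental frequency.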
[0055] Specifically, a short-time sliding window analysis is performed along the time axis to obtain the time window corresponding to the current frame.
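As a rough illustration of the framing step, the following Python sketch generates sliding time windows and keeps the periodic structure candidate intervals whose time centers fall inside a window. The window length `win_s`, the hop `hop_s`, and the tuple layout `(start, end, frequency)` are assumptions made for this example; the specification does not state these values.

```python
def frame_windows(duration_s, win_s=0.025, hop_s=0.010):
    """Yield (start, end) times of short-time windows sliding along the time axis."""
    n = 0
    while n * hop_s + win_s <= duration_s + 1e-12:
        start = n * hop_s
        yield (start, start + win_s)
        n += 1


def candidates_in_window(intervals, window):
    """Keep intervals (start, end, freq_hz) whose midpoint lies inside the window."""
    w_start, w_end = window
    return [iv for iv in intervals if w_start <= (iv[0] + iv[1]) / 2 < w_end]


intervals = [
    (0.000, 0.010, 100.0),
    (0.018, 0.028, 100.0),
    (0.030, 0.035, 200.0),
]
windows = list(frame_windows(0.05))
first_frame_set = candidates_in_window(intervals, windows[0])
```

Here `first_frame_set` plays the role of the candidate frequency set of the first frame.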
[0056] In the current embodiment, a frame-by-frame fundamental frequency estimation sequence is formed by the fundamental frequency estimates of consecutive frames in chronological order. The frame sequence in which the change in the fundamental frequency estimate of consecutive frames in the frame-by-frame fundamental frequency estimation sequence is less than a first preset threshold, the change in the corresponding periodic structure confidence is less than a second preset threshold, and the periodic structure confidence of the first frame is greater than the phonation threshold is taken as the phonation segment. The area covered by the phonation segment is taken as the voiced region, and the area not covered by the phonation segment is taken as the unvoiced region or silent region. The corresponding fundamental frequency estimate of the voiced region and the preset invalidation mark of the unvoiced region or silent region are output to obtain the continuous speech fundamental frequency trajectory.
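The segmentation rule above can be sketched in Python as follows. This is only an illustrative reading of the paragraph: the threshold values `max_df`, `max_dconf`, and `onset_conf`, the minimum segment length `min_len`, and the use of `0.0` as the invalid mark are assumptions that do not come from the specification.

```python
def voiced_trajectory(f0, conf, max_df=15.0, max_dconf=0.3,
                      onset_conf=0.5, min_len=2, invalid=0.0):
    """Return a trajectory with F0 inside voiced segments and `invalid` elsewhere."""
    n = len(f0)
    out = [invalid] * n
    i = 0
    while i < n:
        # A segment may only start on a frame whose confidence exceeds the phonation threshold.
        if conf[i] <= onset_conf:
            i += 1
            continue
        j = i
        # Extend while both the F0 change and the confidence change stay small.
        while (j + 1 < n
               and abs(f0[j + 1] - f0[j]) < max_df
               and abs(conf[j + 1] - conf[j]) < max_dconf):
            j += 1
        if j - i + 1 >= min_len:  # assumed: a segment needs at least min_len frames
            out[i:j + 1] = f0[i:j + 1]
        i = j + 1
    return out


f0 = [0.0, 120.0, 122.0, 121.0, 240.0, 119.0, 0.0]
conf = [0.1, 0.8, 0.82, 0.79, 0.4, 0.3, 0.1]
trajectory = voiced_trajectory(f0, conf)
```

In this example only frames 1 to 3 form a voiced segment; the jump to 240 Hz ends the segment, and the remaining low-confidence frames receive the invalid mark.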
[0057] Furthermore, for each phonation segment, if the frequency difference between an edge frame of the phonation segment and the corresponding main frequency is greater than a preset correction threshold, the edge frame is taken as a frame to be corrected, and a frequency whose difference from the main frequency is less than the preset correction threshold is re-acquired from the candidate frequency set of that edge frame as its fundamental frequency estimate. If the candidate frequency set contains no frequency whose difference is less than the preset correction threshold, the edge frame is deleted from the phonation segment.
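A minimal Python sketch of this edge correction is given below. Several points are assumptions rather than statements of the specification: the main frequency is taken here to be the median of the segment's estimates, the closest qualifying candidate is chosen when several qualify, only the first and last frames are treated as edge frames, and the threshold `max_dev` is an arbitrary illustrative value.

```python
from statistics import median


def correct_segment_edges(f0, candidates, max_dev=20.0):
    """Correct or drop the edge frames of one non-empty segment.

    f0: per-frame estimates of the segment.
    candidates: per-frame lists of candidate frequencies.
    Returns a list of (frame_offset, f0) for the frames that are kept.
    """
    main = median(f0)  # assumed definition of the segment's main frequency
    f0 = list(f0)
    keep = [True] * len(f0)
    for idx in {0, len(f0) - 1}:  # first and last frame of the segment
        if abs(f0[idx] - main) <= max_dev:
            continue
        near = [c for c in candidates[idx] if abs(c - main) < max_dev]
        if near:
            f0[idx] = min(near, key=lambda c: abs(c - main))
        else:
            keep[idx] = False  # no usable candidate: remove the edge frame
    return [(i, v) for i, v in enumerate(f0) if keep[i]]


segment = [240.0, 121.0, 120.0, 122.0, 180.0]
cands = [[240.0, 119.0], [121.0], [120.0], [122.0], [180.0, 60.0]]
fixed = correct_segment_edges(segment, cands)
```

In this example the first frame is replaced by its 119 Hz candidate, while the last frame has no candidate near the main frequency and is dropped.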
[0058] Furthermore, if the number of interval frames between two phonation segments is less than a preset number of frames, the frequency of each interval frame is predicted from the frequency change trend of the two phonation segments. For each interval frame, the frequency closest to the predicted value is taken from the candidate frequency set of that frame as its fundamental frequency estimate, and the two phonation segments and the interval frames between them are merged into one phonation segment.
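The gap bridging step can be sketched as follows. The specification does not say how the frequency change trend is modeled, so this example assumes plain linear interpolation between the last frame of the earlier segment and the first frame of the later one; `max_gap` and the fallback to the predicted value when a gap frame has no candidates are likewise assumptions.

```python
def bridge_gap(left_f0, right_f0, gap_candidates, max_gap=3):
    """Merge two segments across a short gap, or return None if the gap is too long.

    gap_candidates: one list of candidate frequencies per gap frame.
    """
    gap = len(gap_candidates)
    if gap >= max_gap:
        return None
    a, b = left_f0[-1], right_f0[0]
    filled = []
    for k, cands in enumerate(gap_candidates, start=1):
        pred = a + (b - a) * k / (gap + 1)  # assumed linear trend across the gap
        if cands:
            filled.append(min(cands, key=lambda c: abs(c - pred)))
        else:
            filled.append(pred)  # assumed fallback when no candidate exists
    return list(left_f0) + filled + list(right_f0)


merged = bridge_gap(
    [118.0, 120.0],
    [126.0, 127.0],
    [[60.0, 121.5, 243.0], [124.5, 62.0]],
)
```

For the two gap frames the predictions are 122 Hz and 124 Hz, so the candidates 121.5 Hz and 124.5 Hz are selected and the octave-related candidates are ignored.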
[0059] Example 2. Based on the same concept, and referring to Figure 3, this application also proposes a speech fundamental frequency detection device based on bio-auditory inspiration, comprising:
The acquisition module, which is used to acquire the speech signal to be detected, construct multiple frequency channels corresponding to different frequencies within a preset fundamental frequency search range, and input the speech signal to be detected into each frequency channel respectively.
The confidence calculation module, which is used to extract, in each frequency channel, the bandpass filter output signal of the corresponding frequency from the speech signal to be detected, detect the extreme events in the bandpass filter output signal to obtain the event pulse sequence, and, for each event pulse in the event pulse sequence, obtain the subsequent event pulse whose time interval is closest to the target period as a candidate pulse pair. Multiple candidate pulse pairs are arranged in chronological order to form a candidate pulse pair sequence. The candidate pulse pairs with continuous periods in the candidate pulse pair sequence are taken as periodic structure candidate intervals. The periodic structure confidence is calculated based on the event pulse interval and the smoothness of the event pulse amplitude change in the local neighborhood of the periodic structure candidate interval, and the periodic structure candidate intervals with 1 / 2 structure evidence are marked. If there is an event pulse in the midpoint position interval of a periodic structure candidate interval, that periodic structure candidate interval has 1 / 2 structure evidence. The local neighborhood is the range, corresponding to a preset period multiple, delineated in the candidate pulse pair sequence with the periodic structure candidate interval as its center.
The fundamental frequency estimation module, which is used to obtain all periodic structure candidate intervals corresponding to the current frame in each frequency channel as the candidate frequency set of the current frame. Based on the periodic structure confidence of each periodic structure candidate interval and the 1 / 2 structure evidence marked on the periodic structure candidate intervals, the fundamental frequency estimate of the current frame is obtained from the candidate frequency set of the current frame.
The fundamental frequency trajectory output module, which is used to obtain a continuous speech fundamental frequency trajectory by performing vocalization process recognition on the fundamental frequency estimates of multiple consecutive current frames.
[0060] Example 3. This embodiment also provides an electronic device. Referring to Figure 4, the electronic device includes a memory 404 and a processor 402, wherein the memory 404 stores a computer program and the processor 402 is configured to run the computer program to perform the steps in any of the above method embodiments.
[0061] Specifically, the processor 402 may include a central processing unit (CPU), or an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.
[0062] Memory 404 may include a mass storage device for data or instructions. By way of example and not limitation, memory 404 may include a hard disk drive (HDD), a floppy disk drive, a solid-state drive (SSD), flash memory, an optical disk drive, a magneto-optical disk drive, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 404 may include removable or non-removable (or fixed) media. Where appropriate, memory 404 may be internal or external to a data processing device. In a particular embodiment, memory 404 is non-volatile memory. In a particular embodiment, memory 404 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable read-only memory (PROM), an erasable read-only memory (EPROM), an electrically erasable read-only memory (EEPROM), an electrically alterable read-only memory (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM). The DRAM may be fast page mode dynamic random-access memory (FPMDRAM), extended data out dynamic random-access memory (EDODRAM), synchronous dynamic random-access memory (SDRAM), etc.
[0063] The memory 404 can be used to store or cache various data files that need to be processed and / or communicated, as well as possible computer program instructions executed by the processor 402.
[0064] The processor 402 reads and executes computer program instructions stored in the memory 404 to implement any of the bio-auditory-inspired speech fundamental frequency detection methods in the above embodiments.
[0065] Optionally, the electronic device may further include a transmission device 406 and an input / output device 408, wherein the transmission device 406 is connected to the processor 402, and the input / output device 408 is connected to the processor 402.
[0066] The transmission device 406 can be used to receive or send data via a network. Specific examples of the network described above may include wired or wireless networks provided by the communication provider of the electronic device. In one example, the transmission device includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 406 may be a Radio Frequency (RF) module used for wireless communication with the Internet.
[0067] The input / output device 408 is used to input or output information. In this embodiment, the input information may be a voice signal to be detected, and the output information may be a continuous voice fundamental frequency trajectory, etc.
[0068] Optionally, in this embodiment, the processor 402 can be configured to perform the following steps via a computer program: Acquire the speech signal to be detected, construct multiple frequency channels corresponding to different frequencies within a preset baseband search range, and input the speech signal to be detected into each frequency channel respectively; The bandpass filter output signal of the corresponding frequency in the speech signal to be detected is extracted in the frequency channel. Extreme events in the bandpass filter output signal are detected to obtain an event pulse sequence. In the event pulse sequence, the subsequent event pulse with the time interval closest to the target period is obtained for each event pulse as a candidate pulse pair. Multiple candidate pulse pairs are arranged in chronological order to form a candidate pulse pair sequence. The candidate pulse pairs with continuous periods in the candidate pulse pair sequence are taken as periodic structure candidate intervals. The confidence of periodic structure is calculated based on the event pulse interval and the smoothness of the event pulse amplitude change in the local neighborhood of the periodic structure candidate interval. Periodic structure candidate intervals with 1 / 2 structure evidence are marked. If there is an event pulse in the midpoint position interval of the periodic structure candidate interval, it means that the corresponding periodic structure candidate interval has 1 / 2 structure evidence. The range corresponding to the preset period multiple is delineated in the candidate pulse pair sequence with the periodic structure candidate interval as the center as the corresponding local neighborhood. All periodic structure candidate intervals corresponding to the current frame in each frequency channel are obtained as the candidate frequency set of the current frame. 
Based on the periodic structure confidence of each periodic structure candidate interval and the 1 / 2 structure evidence marked by the periodic structure candidate interval, the fundamental frequency estimate of the current frame is obtained in the candidate frequency set of the current frame. By recognizing the speech process from the fundamental frequency estimates of multiple consecutive current frames, a continuous speech fundamental frequency trajectory can be obtained.
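Two of the steps listed above, forming candidate pulse pairs and marking 1 / 2 structure evidence, can be illustrated with a short Python sketch. It is deliberately simplified and rests on assumptions: event pulses are represented only by their times in seconds, the width of the midpoint position interval is set by a hypothetical `rel_width` fraction of the pair's period, and the continuity filtering and confidence calculation are omitted.

```python
def candidate_pulse_pairs(times, target_period):
    """Pair each event pulse with the later pulse whose spacing is closest to the target period.

    Returns a list of (i, j) index pairs in chronological order.
    """
    pairs = []
    for i, t in enumerate(times[:-1]):
        j = min(range(i + 1, len(times)),
                key=lambda k: abs(times[k] - t - target_period))
        pairs.append((i, j))
    return pairs


def has_half_structure_evidence(times, pair, rel_width=0.1):
    """True if an event pulse lies in the midpoint position interval of the pair."""
    i, j = pair
    mid = (times[i] + times[j]) / 2
    half_width = rel_width * (times[j] - times[i])  # assumed interval width
    return any(abs(times[k] - mid) <= half_width for k in range(i + 1, j))


# Event pulses every 5 ms, examined by a 100 Hz channel (target period 10 ms).
times = [0.000, 0.005, 0.010, 0.015, 0.020]
pairs_100 = candidate_pulse_pairs(times, 1 / 100.0)
marked = [has_half_structure_evidence(times, p) for p in pairs_100]
```

With pulses spaced 5 ms apart, the 100 Hz channel pairs pulses that are 10 ms apart, and each such pair has a pulse at its midpoint, so it is marked; the final pair, which spans only 5 ms, is not.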
[0069] It should be noted that the specific examples in this embodiment can refer to the examples described in the above embodiments and optional implementations, and will not be repeated here.
[0070] Generally, various embodiments can be implemented in hardware or dedicated circuitry, software, logic, or any combination thereof. Some aspects of the invention can be implemented in hardware, while others can be implemented by firmware or software executed by a controller, microprocessor, or other computing device, but the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flowcharts, or using some other graphical representation, it should be understood that, by way of non-limiting example, these blocks, apparatuses, systems, techniques, or methods described herein can be implemented in hardware, software, firmware, dedicated circuitry or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.
[0071] Embodiments of the present invention can be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets, and / or macros, can be stored in any device-readable data storage medium, and they include program instructions for performing specific tasks. The computer program product may include one or more computer-executable components configured to perform the embodiments when the program is run. The one or more computer-executable components may be at least one piece of software code or a portion thereof. Additionally, it should be noted in this respect that any block of the logic flow shown in Figure 4 may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions. Software can be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard disks or floppy disks, and optical media such as DVDs and their data variants, CDs, etc. The physical medium is a non-transitory medium.
[0072] Those skilled in the art should understand that the technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments have been described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0073] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A speech fundamental frequency detection method based on bio-auditory inspiration, characterized in that it comprises the following steps:
acquiring the speech signal to be detected, constructing multiple frequency channels corresponding to different frequencies within a preset fundamental frequency search range, and inputting the speech signal to be detected into each frequency channel respectively;
extracting, in each frequency channel, the bandpass filter output signal of the corresponding frequency from the speech signal to be detected; detecting extreme events in the bandpass filter output signal to obtain an event pulse sequence; for each event pulse in the event pulse sequence, obtaining the subsequent event pulse whose time interval is closest to the target period as a candidate pulse pair; arranging multiple candidate pulse pairs in chronological order to form a candidate pulse pair sequence; taking the candidate pulse pairs with continuous periods in the candidate pulse pair sequence as periodic structure candidate intervals; calculating the periodic structure confidence of each periodic structure candidate interval, and marking the periodic structure candidate intervals with 1 / 2 structure evidence, wherein if there is an event pulse in the midpoint position interval of a periodic structure candidate interval, that periodic structure candidate interval has 1 / 2 structure evidence;
obtaining all periodic structure candidate intervals corresponding to the current frame in each frequency channel as the candidate frequency set of the current frame, and, based on the periodic structure confidence of each periodic structure candidate interval and the 1 / 2 structure evidence marked on the periodic structure candidate intervals, obtaining the fundamental frequency estimate of the current frame from the candidate frequency set of the current frame;
performing vocalization process recognition on the fundamental frequency estimates of multiple consecutive current frames to obtain a continuous speech fundamental frequency trajectory.
2. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, In the step of obtaining the event pulse sequence by detecting extreme events in the bandpass filter output signal, a bionic spiking neuron is constructed. The bionic spiking neuron includes a resting state, a decision state, and a firing state. The bionic spiking neuron defaults to the resting state and continuously monitors the bandpass filter output signal in the resting state. When it detects that an extreme event is about to occur, it switches to the decision state. In the decision state, it is determined whether the extreme event meets the preset extreme value confirmation condition. If it does, the neuron enters the firing state; otherwise, it returns to the resting state. In the firing state, the event pulse corresponding to the extreme event is output, and after a preset refractory period the neuron returns to the resting state. The bionic spiking neuron uses the peak and valley values in the bandpass filter output signal as extreme events.
3. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, The target period of the current frequency channel is determined by the reciprocal of the frequency corresponding to that channel.
4. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, In the step of taking the candidate pulse pairs with continuous periods in the candidate pulse pair sequence as periodic structure candidate intervals, the current candidate pulse pair is considered a candidate pulse pair with continuous periods when: the difference between the target period and the average period of the current candidate pulse pair and its adjacent candidate pulse pairs is less than a first preset tolerance range; the difference between the period of the current candidate pulse pair and the target period is less than a second preset tolerance range; and the endpoint amplitude intensity of the current candidate pulse pair, or the endpoint amplitude intensity of the corresponding overlapping candidate pulse pair, is greater than an intensity threshold.
5. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, Centered on the candidate periodic structure interval, a range corresponding to a preset period multiple is defined within the candidate pulse pair sequence as the corresponding local neighborhood. The confidence level of the periodic structure is calculated based on the event pulse interval and the smoothness of the event pulse amplitude change within the local neighborhood of the candidate periodic structure interval. A preset time range adjacent to the center time position of the candidate periodic structure interval is defined as the midpoint position interval. Event pulses existing in the midpoint position interval are considered as half-period pulses. If the difference between the mean amplitude of the endpoints of the current candidate periodic structure interval and the amplitude of the corresponding half-period pulse endpoint is less than a preset amplitude tolerance, then the current candidate periodic structure interval meets the half-period judgment condition. When the proportion of candidate periodic structure intervals that meet the half-period judgment condition within the corresponding local neighborhood reaches a preset threshold, it is considered that the current candidate periodic structure interval has 1 / 2 structural evidence.
6. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, The periodic structure confidence of a periodic structure candidate interval is the product of a frequency score and an amplitude score.
The frequency score is calculated by the following formula: [formula omitted in source]
where the symbols of the formula, which are likewise omitted in the source, denote in order: the frequency score; N, the total number of candidate pulse pairs in the corresponding local neighborhood; the candidate pulse pair index; the attenuation coefficient; the frequency of the corresponding candidate pulse pair; and the frequency of the corresponding frequency channel.
The relative error set is obtained by calculating the relative amplitude errors between the periodic structure candidate interval and the corresponding overlapping candidate pulse pairs. The amplitude change set is obtained by calculating the amplitude changes of all candidate pulse pairs in the corresponding local neighborhood. The median of the relative error set and the volatility of the amplitude change set are then obtained.
The base score of the current periodic structure candidate interval is calculated by the following formula: [formula omitted in source]
where the symbols, omitted in the source, denote in order: the base score; the candidate pulse pair index; the total number of candidate pulse pairs; and the preset weight parameters.
The product of the base score and the intensity factor is used as the amplitude score of the current periodic structure candidate interval, expressed by the following formula: [formula omitted in source]
where the intensity factor is the value obtained by mapping the endpoint average amplitude of all candidate pulse pairs within the corresponding local neighborhood to the linear interval, and the remaining symbols, omitted in the source, denote in order: the preset minimum smoothing score of a candidate pulse pair; the preset maximum smoothing score of a candidate pulse pair; the lower limit of the linear interval; the upper limit of the linear interval; and the amplitude score.
7. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, The time window of the current frame is obtained. Periodic structure candidate intervals whose time centers fall within the time window of the current frame are taken as the candidate frequency set of the current frame. Each periodic structure candidate interval in the candidate frequency set of the current frame is designated as a candidate fundamental frequency layer, and the corresponding periodic structure confidence is used as the fundamental frequency base score of that candidate fundamental frequency layer. The frequency corresponding to the candidate fundamental frequency layer with the highest fundamental frequency base score is taken as the initial fundamental frequency. Based on the approximate harmonic relationship corresponding to the initial fundamental frequency, the candidate frequency set of the current frame is divided into multiple candidate groups, and the group base score of each candidate group is calculated. The candidate group with the highest group base score is selected as the winning candidate group. The frequency corresponding to the periodic structure candidate interval with the highest fundamental frequency base score in the winning candidate group is used as the fundamental frequency estimate of the current frame. The time position of the midpoint of a periodic structure candidate interval is used as its time center. When calculating the group base score, the highest fundamental frequency base score in the candidate group is used as the first score. If the candidate group does not contain a periodic structure candidate interval marked with 1 / 2 structure evidence, the first score is used as the group base score. If the candidate group contains a periodic structure candidate interval marked with 1 / 2 structure evidence, the sum of the first score and the preset support score for 1 / 2 structure evidence is used as the group base score.
8. The speech fundamental frequency detection method based on bio-auditory inspiration according to claim 1, characterized in that, A frame-by-frame fundamental frequency estimation sequence is formed by the fundamental frequency estimates of consecutive frames in chronological order. The frame sequence in which the change in the fundamental frequency estimate between consecutive frames of the frame-by-frame fundamental frequency estimation sequence is less than a first preset threshold, the change in the corresponding periodic structure confidence is less than a second preset threshold, and the periodic structure confidence of the first frame is greater than the phonation threshold is taken as the phonation segment. The area covered by the phonation segment is taken as the voiced region, and the area not covered by the phonation segment is taken as the unvoiced region or silent region. The corresponding fundamental frequency estimates of the voiced region and the preset invalidation mark of the unvoiced region or silent region are output to obtain the continuous speech fundamental frequency trajectory.
9. A speech fundamental frequency detection device based on bio-auditory inspiration, characterized in that it comprises:
The acquisition module, which is used to acquire the speech signal to be detected, construct multiple frequency channels corresponding to different frequencies within a preset fundamental frequency search range, and input the speech signal to be detected into each frequency channel respectively.
The confidence calculation module, which is used to extract, in each frequency channel, the bandpass filter output signal of the corresponding frequency from the speech signal to be detected, detect the extreme events in the bandpass filter output signal to obtain the event pulse sequence, and, for each event pulse in the event pulse sequence, obtain the subsequent event pulse whose time interval is closest to the target period as a candidate pulse pair. Multiple candidate pulse pairs are arranged in chronological order to form a candidate pulse pair sequence. The candidate pulse pairs with continuous periods in the candidate pulse pair sequence are taken as periodic structure candidate intervals. The periodic structure confidence of each periodic structure candidate interval is calculated, and the periodic structure candidate intervals with 1 / 2 structure evidence are marked. If there is an event pulse in the midpoint position interval of a periodic structure candidate interval, that periodic structure candidate interval has 1 / 2 structure evidence.
The fundamental frequency estimation module, which is used to obtain all periodic structure candidate intervals corresponding to the current frame in each frequency channel as the candidate frequency set of the current frame. Based on the periodic structure confidence of each periodic structure candidate interval and the 1 / 2 structure evidence marked on the periodic structure candidate intervals, the fundamental frequency estimate of the current frame is obtained from the candidate frequency set of the current frame.
The fundamental frequency trajectory output module, which is used to obtain a continuous speech fundamental frequency trajectory by performing vocalization process recognition on the fundamental frequency estimates of multiple consecutive current frames.
10. An electronic device comprising a memory and a processor, characterized in that, The memory stores a computer program, and the processor is configured to run the computer program to perform a speech fundamental frequency detection method based on bio-auditory inspiration as described in any one of claims 1-8.