A dual-model fusion dynamic quality-aware voiceprint speaker recognition method
This dynamic quality-aware voiceprint recognition method based on dual-model fusion applies coherent demodulation and phase unwrapping to ultrasonic and speech signals. It addresses the weak attack resistance and poor environmental adaptability of voiceprint recognition, achieving highly secure and adaptive recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HESHI THINKING (BEIJING) TECHNOLOGY CO LTD
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing voiceprint recognition technology has weak anti-attack capabilities, cannot verify whether the voice is spoken by a live person, and performs poorly in scenarios with environmental noise, user distance shifts, and changes in speaking volume, making it unsuitable for complex environments.
A dynamic quality-aware voiceprint speaker recognition method using dual-model fusion is proposed. This method generates a continuous single-frequency ultrasonic detection signal, acquires the air-conducted speech signal and the ultrasonic echo signal, performs coherent demodulation and phase unwrapping, extracts the speech energy envelope and motion envelope, and performs cross-modal cross-correlation calculation for liveness detection and identity verification.
It effectively prevents spoofing attacks, adapts to complex environments, improves identification accuracy, reduces additional hardware and algorithm complexity, and is suitable for scenarios such as fixed security, portable terminals, and high-security authentication.
Figure CN122245325A_ABST
Description
Technical Field
[0001] This invention relates to the fields of voiceprint recognition and biometric authentication, and in particular to a dynamic quality-aware voiceprint speaker recognition method using dual-model fusion.
Background Technology
[0002] Voiceprint recognition, as a contactless biometric authentication technology, offers convenient data collection, high user acceptance, and strong suitability for remote authentication, and has been widely applied in smart devices in recent years. However, existing voiceprint recognition technologies still have several unresolved shortcomings and fail to meet the requirements of high-security scenarios:
Current voiceprint recognition technology has weak resistance to attacks and cannot fundamentally prevent forgery attacks. Voiceprint recognition schemes based on pure speech signals, whether they use classical signal processing or deep learning, can only extract the spectral features of the speech and cannot verify whether the speech is spoken by a living person. They are highly vulnerable to recording playback, speech synthesis, and voice conversion attacks, which is a serious security vulnerability.
In scenarios involving environmental noise, shifts in user distance, and changes in speaking volume, the quality of voice signals fluctuates significantly. Existing solutions cannot dynamically adjust the recognition strategy based on signal quality, so false recognition and false rejection rates rise sharply, making them unsuitable for complex real-world environments.
Some existing technologies use modalities such as face, lip movement, and ultrasound for liveness detection, but only as auxiliary steps independent of voiceprint recognition. They do not establish an essential connection between the speech signal and the auxiliary modality at the physical level, cannot achieve deep fusion of dual-modal features, and add extra hardware and algorithm complexity, resulting in poor portability.
[0003] Therefore, a dynamic quality-aware voiceprint speaker recognition method based on dual-model fusion is proposed to address the aforementioned problems.
Summary of the Invention
[0004] The purpose of this invention is to propose a dynamic quality-aware voiceprint speaker recognition method based on dual-model fusion in order to solve the above-mentioned problems.
[0005] To achieve the above objectives, the present invention adopts the following technical solution: A dynamic quality-aware voiceprint speaker recognition method using dual-model fusion includes:
Step 1: Generate a continuous single-frequency ultrasonic detection signal, transmit the ultrasonic detection signal to the target speaker to be authenticated, and at the same time retain the original ultrasonic detection signal as a reference signal;
Step 2: The acoustic wave acquisition channel acquires the air-conducted speech signal emitted by the target speaker, and the ultrasonic echo acquisition channel acquires the ultrasonic echo signal, with both channels acquiring synchronously;
Step 3: Using the reference signal from Step 1, coherently demodulate the ultrasonic echo signal acquired in Step 2, extract the baseband signal, and perform phase unwrapping on the baseband signal to obtain the motion envelope;
Step 4: Extract the speech energy envelope from the air-conducted speech signal acquired in Step 2, and align the speech energy envelope with the motion envelope in the time domain;
Step 5: Perform cross-modal cross-correlation calculation on the speech energy envelope and motion envelope to determine liveness; match the cross-correlation features with pre-stored user physical feature templates to verify the identity of the target speaker.
[0006] Preferably, the generation process of the continuous single-frequency ultrasonic detection signal includes: The transmitted signal is a continuous single-frequency sine wave $s(t) = A\sin(2\pi f_c t)$, where $f_c$ = 40kHz is the carrier frequency, $A$ is the amplitude of the transmitted signal, and $t$ is time; a DAC is used to generate the sine wave signal, with a high-precision 2.5V reference source selected as the DAC reference voltage; a power amplifier is selected and the transmission power is adjusted according to the interaction distance; a surface acoustic wave filter is connected in series between the power amplifier and the transducer to suppress the amplitude of the second harmonic.
[0007] Preferably, before transmitting the ultrasonic detection signal, a multi-stage alignment calibration is performed on the ultrasonic transmission beam to ensure that the beam completely covers the sound-related area of the target to be detected. The specific calibration process is as follows: Fix the device to the test bench, place a reflector at a preset calibration distance in front of the device, adjust the physical installation angle of the ultrasonic transducer component or the beamforming weight of the MEMS ultrasonic array until the amplitude of the received echo signal reaches the maximum value. At this time, the center of the main lobe of the beam is aligned with the area directly in front of the preset calibration distance. Write the current beam weight or physical angle parameter into the non-volatile memory of the device as the default transmission parameter. When a user uses the device for the first time, the system prompts the user to look directly at the device and issue a preset calibration voice. The system automatically adjusts the beamforming weights based on the echo amplitude distribution collected by the MEMS ultrasound array until the echo amplitude covering the user's facial area reaches the maximum value. After calibration, the current beam weights are stored in the corresponding user's personal configuration file. During operation, the device periodically detects the average amplitude of the echo. If the amplitude drops below a preset fluctuation threshold, it determines that the user's position has shifted and automatically adjusts the beam direction to re-align with the user's face.
[0008] Preferably, in step two, the acoustic wave acquisition channel and the ultrasonic echo acquisition channel acquire data synchronously, which is achieved in the following way: The same high-precision clock source is used as the common clock for the sampling and ultrasonic transmission circuits of the two channels. After frequency division, it provides a matching sampling clock for the acoustic wave channel acquisition unit, the ultrasonic echo channel acquisition unit, and the ultrasonic transmission unit. When sampling is initiated, the control unit outputs the same hardware trigger signal, which is simultaneously connected to the trigger terminals of the two acquisition channels to ensure that the sampling start times of the two channels are aligned. Before the equipment leaves the factory, a synchronous acoustic and vibration signal source is used to complete the synchronization accuracy verification. If the peak delay deviation of the two acquired signals exceeds the preset accuracy threshold, the clock circuit is recalibrated.
[0009] Preferably, the specific implementation and acquisition triggering logic of the two acquisition channels in step two are as follows: The acoustic channel uses a low-noise acoustic-to-electric conversion device. The output signal is amplified by low noise and bandpass filtered to match the frequency range of human speech before being transmitted to the acquisition unit. Before data collection, environmental noise for a preset duration is detected, and the link gain level is adaptively adjusted based on the average amplitude of the environmental noise. The ultrasonic echo channel uses an acoustic-to-electric conversion device that matches the frequency of the ultrasonic transmitter. It suppresses the interference of the transmitted direct wave through a scheme of physical isolation between the transmitter and receiver plus a sound-absorbing structure or a time-domain isolation scheme of time-division multiplexing. The output signal is transmitted to the acquisition unit after low-noise amplification and bandpass filtering. The link gain is adaptively adjusted according to the echo amplitude. Before data acquisition, the echo signal for a preset duration is detected. If the average amplitude of the echo exceeds the preset reasonable range, the acquisition parameters are automatically adjusted. The system detects the short-time energy of the sound wave channel in real time. When the short-time energy of multiple consecutive detection windows exceeds a preset multiple of the average energy of the ambient noise, it determines that the target has started to emit sound and triggers synchronous acquisition of the two channels. The two collected data streams are first stored in a circular buffer. Upon triggering, the valid data segment containing the complete speech start segment is extracted for subsequent processing.
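The trigger logic described above (short-time energy over several consecutive windows, with a circular buffer that keeps the speech onset) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `detect_onset`, the energy ratio of 4, the count of 3 consecutive windows, and the pre-roll of 5 frames are all hypothetical values chosen for the example.

```python
from collections import deque

def detect_onset(frames, noise_energy, ratio=4.0, consecutive=3, pre_roll=5):
    """Return (trigger_index, buffered_frames), or (None, []) if nothing triggers.

    frames       : iterable of sample lists, one list per detection window
    noise_energy : average energy of the ambient noise measured beforehand
    ratio, consecutive, pre_roll : hypothetical tuning values (assumptions)
    """
    ring = deque(maxlen=pre_roll + consecutive)  # circular buffer of recent windows
    run = 0                                      # consecutive windows above threshold
    for index, frame in enumerate(frames):
        ring.append(frame)
        energy = sum(s * s for s in frame) / len(frame)  # short-time energy
        run = run + 1 if energy > ratio * noise_energy else 0
        if run >= consecutive:
            # The buffer still holds the windows before the trigger, so the
            # start of the utterance is not lost.
            return index, list(ring)
    return None, []
```

Because the buffer is longer than the trigger run, the returned segment includes a few windows from before the energy rose, which is what allows the complete speech start segment to be extracted.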
[0010] Preferably, the coherent demodulation of the ultrasonic echo signal acquired in step two to extract the baseband signal specifically includes: Generate two orthogonal reference signals, namely the I-channel reference signal $r_I(t) = \cos(2\pi f_c t)$ and the Q-channel reference signal $r_Q(t) = \sin(2\pi f_c t)$; the sampling rate of the reference signals is consistent with the sampling rate of the ultrasonic echo channel to ensure alignment with the sampling points of the echo signal; the collected echo signal $x(t)$ is multiplied by each of the two reference signals to yield the mixed signals $m_I(t) = x(t)\,r_I(t)$ and $m_Q(t) = x(t)\,r_Q(t)$; each mixed signal contains two components: a high-frequency component at $2f_c + f_d$, corresponding to the sum of the reference and echo frequencies, and a low-frequency component at $f_d$, corresponding to the difference between the reference and echo frequencies, where $f_d$ is the Doppler frequency shift; I/Q imbalance correction is then performed with a correction matrix calculated from $\Delta\varphi$, the phase-difference deviation between the two reference signals, and $\alpha$, the amplitude ratio of the two reference signals, so that after correction $I$ and $Q$ are perfectly orthogonal and have equal amplitudes.
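To make the mixing step concrete, here is a minimal sketch of coherent demodulation at the 40kHz carrier and 256kHz sampling rate used elsewhere in the text. Several things here are assumptions rather than details from the source: cosine and sine are used as the I and Q references, the low-pass filter is a plain 32-sample moving average (exactly 5 carrier periods, so it nulls the component at twice the carrier), and `correct_iq` undoes one common imbalance model, which is not necessarily the correction matrix used by the invention.

```python
import math

FS = 256_000  # ultrasonic echo sampling rate in Hz (from the text)
FC = 40_000   # carrier frequency in Hz (from the text)

def demodulate(echo, fs=FS, fc=FC, win=32):
    """Mix the echo with cos/sin references, then low-pass by moving average.

    win=32 samples spans exactly 5 carrier periods at 256kHz / 40kHz, so the
    average cancels the 2*fc mixing product. The filter choice is illustrative.
    """
    mixed_i, mixed_q = [], []
    for n, x in enumerate(echo):
        theta = 2 * math.pi * ((n * fc) % fs) / fs  # reference phase at sample n
        mixed_i.append(x * math.cos(theta))
        mixed_q.append(x * math.sin(theta))
    count = len(echo) - win + 1
    i_bb = [2 * sum(mixed_i[k:k + win]) / win for k in range(count)]
    # Sign flip (a convention assumed here) so atan2(Q, I) equals the echo phase.
    q_bb = [-2 * sum(mixed_q[k:k + win]) / win for k in range(count)]
    return i_bb, q_bb

def correct_iq(i_val, q_val, alpha, dphi):
    """Undo an assumed imbalance model: I = cos(t), Q = alpha * sin(t + dphi)."""
    q_corr = (q_val / alpha - i_val * math.sin(dphi)) / math.cos(dphi)
    return i_val, q_corr
```

For an echo with a constant phase offset, `math.atan2(q_bb[0], i_bb[0])` recovers that offset, which is the quantity the later phase unwrapping step tracks over time.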
[0011] Preferably, the process of performing phase unwrapping on the baseband signal to obtain the motion envelope is as follows: For the I and Q values at each sampling point, calculate the wrapped phase $\varphi_w[n] = \operatorname{atan2}(Q[n], I[n])$; the resulting phase lies in the range $(-\pi, \pi]$. Phase unwrapping uses adjacent-point differential correction, with the following steps: starting from the second sampling point, calculate the difference $\Delta[n] = \varphi_w[n] - \varphi_u[n-1]$ between the wrapped phase of the current sampling point and the unwrapped phase of the previous sampling point; if $\Delta[n] > \pi$, subtract $2\pi$ from the wrapped phase of the current sampling point; if $\Delta[n] < -\pi$, add $2\pi$ to the wrapped phase of the current sampling point; repeat until the absolute value of the difference is less than $\pi$, which gives the unwrapped phase $\varphi_u[n]$ of the current sampling point. The unwrapped phase and the radial displacement $d$ of the reflecting surface are related by $\varphi_u = 4\pi d / \lambda$, where $\lambda$ is the wavelength of the 40kHz ultrasound, so the displacement is $d = \lambda \varphi_u / (4\pi)$; differentiating the displacement $d$ with respect to time gives the radial velocity $v = \mathrm{d}d / \mathrm{d}t$.
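The adjacent-point correction can be written in a few lines. The sketch below assumes the two-way (round-trip) phase relation $\varphi = 4\pi d / \lambda$ for an echo and a wavelength of 8.5mm; the function names are hypothetical, and the finite-difference velocity is only one simple way to differentiate the displacement.

```python
import math

def unwrap_phase(i_vals, q_vals):
    """Unwrap atan2(Q, I) by adjacent-point differential correction."""
    wrapped = [math.atan2(q, i) for i, q in zip(i_vals, q_vals)]
    unwrapped = wrapped[:1]                      # first point is kept as-is
    for phi in wrapped[1:]:
        while phi - unwrapped[-1] > math.pi:     # jumped up by a full turn
            phi -= 2 * math.pi
        while phi - unwrapped[-1] < -math.pi:    # jumped down by a full turn
            phi += 2 * math.pi
        unwrapped.append(phi)
    return unwrapped

def displacement(unwrapped, wavelength=8.5e-3):
    """Radial displacement in metres, assuming the round-trip relation."""
    return [wavelength * p / (4 * math.pi) for p in unwrapped]

def radial_velocity(disp, sample_rate):
    """First-order finite difference of displacement, in metres per second."""
    return [(b - a) * sample_rate for a, b in zip(disp, disp[1:])]
```

The correction only works when the true phase changes by less than $\pi$ between adjacent samples, which is why the baseband sampling rate must be high relative to the fastest expected motion.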
[0012] Preferably, the process of extracting the speech energy envelope from the air-conducted speech signal specifically includes: Temporal energy correlation features are extracted from the collected air-conducted speech signal to obtain a speech envelope sequence that reflects the variation pattern of speech energy; temporal motion correlation features are extracted from the baseband signal obtained after coherent demodulation of the ultrasonic echo to obtain a motion envelope sequence that reflects the motion pattern of the vocal organs; the validity of the two types of envelope sequence is pre-verified, and invalid sequences without valid characteristic fluctuations are removed.
[0013] Preferably, after completing the validity pre-verification of the two types of envelope sequences, the speech envelope sequence and the motion envelope sequence are aligned with the time domain reference by combining the signal propagation delay parameter in the air and the inherent transmission delay parameter of the system, so that the time domain relationship of the two types of sequences matches the physical causal logic of the sound production process. After alignment, the two types of envelope sequences are normalized to eliminate non-correlated differences in the amplitude dimension, retaining only the temporal fluctuation correlation features for subsequent matching.
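A minimal sketch of the alignment and normalization described above is given below. It assumes that the speech envelope lags the motion envelope by a net delay made up of the acoustic propagation time plus a fixed system delay, that both envelopes share one sampling rate, and that z-score scaling is an acceptable form of normalization; none of these specifics are stated in the source, and all function names are hypothetical.

```python
import math

def delay_in_samples(distance_m, system_delay_s, env_rate_hz, c=340.0):
    """Net delay as a whole number of envelope samples (c is an assumed sound speed)."""
    return round((distance_m / c + system_delay_s) * env_rate_hz)

def align(speech_env, motion_env, delay):
    """Drop the first `delay` speech samples and trim both to a common length."""
    shifted = speech_env[delay:]
    n = min(len(shifted), len(motion_env))
    return shifted[:n], motion_env[:n]

def zscore(seq):
    """Remove mean and scale to unit variance so only the fluctuation shape remains."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / len(seq))
    if std == 0:
        raise ValueError("sequence has no fluctuation and should be rejected")
    return [(v - mean) / std for v in seq]
```

The `ValueError` for a flat sequence mirrors the validity pre-verification: an envelope with no fluctuation carries nothing to correlate.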
[0014] Preferably, step five specifically includes: Perform a cross-correlation operation on the speech envelope sequence and motion envelope sequence that have undergone time-domain alignment and normalization, to obtain a cross-correlation coefficient sequence covering the preset time delay search range; first, based on the peak prominence of the cross-correlation coefficient sequence and the plausibility of the time delay corresponding to the peak, determine whether the signal to be detected comes from live vocalization, thereby rejecting fake audio attacks that lack corresponding vocal organ movement; if the voice is determined to come from a living person, the cross-correlation sequence is used as a unique physiological feature of the user and matched against a pre-stored legitimate user feature template to complete verification of the user's identity.
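The liveness decision can be illustrated with a small normalized cross-correlation over a lag search range. The sketch below is only one plausible reading of the text: "peak prominence" is taken here as the peak value minus the mean of the off-peak values, and the thresholds (`min_peak`, `min_prominence`, `lag_window`) are hypothetical numbers, not values from the source.

```python
def xcorr(a, b, max_lag):
    """Cross-correlation of two equal-length, z-scored sequences.

    Returns a dict mapping lag (in samples) to the correlation coefficient.
    """
    n = len(a)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = sum(a[k] * b[k + lag] for k in range(n) if 0 <= k + lag < n)
        result[lag] = total / n
    return result

def is_live(corr, min_peak=0.6, min_prominence=0.2, lag_window=(-2, 2)):
    """Accept only a strong, prominent peak at a physically plausible lag."""
    peak_lag = max(corr, key=corr.get)
    peak = corr[peak_lag]
    others = [v for lag, v in corr.items() if abs(lag - peak_lag) > 1]
    baseline = sum(others) / len(others) if others else 0.0
    prominence = peak - baseline
    in_window = lag_window[0] <= peak_lag <= lag_window[1]
    return peak >= min_peak and prominence >= min_prominence and in_window
```

A replayed recording produces a speech envelope with no matching motion envelope, so the correlation stays low at every lag and the check fails; identity matching against the stored template would only run after this check passes.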
[0015] In summary, due to the adoption of the above technical solution, the beneficial effects of the present invention are: 1. This invention establishes a natural and deterministic correlation between speech signals and the movement of vocal organs from the perspective of physical acoustics and the physiological mechanism of human vocalization. The generation of speech is necessarily accompanied by the movement of corresponding living organs. Various forgery attacks such as recording playback, speech synthesis, and static masks cannot replicate the living physiological movement characteristics that are strictly synchronized with speech, thus addressing the core security vulnerability of traditional voiceprint recognition that is easily forged and cracked.
[0016] 2. This invention employs a dynamic quality perception mechanism to quantitatively evaluate the quality of both voice and ultrasound signals in real time. It dynamically adjusts the decision weights of the dual-model approach (voiceprint recognition and motion feature matching) based on signal quality, fully leveraging the effective information from both signals. This allows it to adapt to various complex usage scenarios, including environmental noise, user position shifts, and changes in speaking volume, avoiding the decline in recognition performance in complex environments that is common with traditional solutions. The deep integration of liveness detection and identity recognition enables both determinations from a single signal acquisition, eliminating the need for additional authentication processes and significantly improving the user experience. Furthermore, it provides adaptable hardware and algorithm solutions for different application scenarios, making it widely applicable to fixed security, portable terminals, and high-security authentication, with strong portability and practicality.
Attached Figure Description
[0017] Further details, features, and advantages of this application are disclosed in the following description of exemplary embodiments in conjunction with the accompanying drawings, in which: Figure 1 is a flowchart of the method of the present invention.
Detailed Implementation
[0018] Several embodiments of this application will now be described in more detail with reference to the accompanying drawings to enable those skilled in the art to implement this application. This application may be embodied in many different forms and for various purposes and should not be limited to the embodiments set forth herein. These embodiments are provided to make this application thorough and complete, and to fully convey the scope of this application to those skilled in the art. The embodiments described do not limit this application.
[0019] Unless otherwise defined, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It will be further understood that terms such as those defined in commonly used dictionaries shall be interpreted as having a meaning consistent with their meaning in the relevant field and / or the context of this specification, and shall not be interpreted in an idealized or overly formal sense unless expressly defined herein.
[0020] Embodiment 1: The specific implementation is described in detail below with reference to Figure 1.
[0021] Figure 1 is a flowchart of the dual-model fusion dynamic quality-aware voiceprint speaker recognition method provided in this embodiment of the invention. It shows the complete steps from generating the continuous single-frequency ultrasonic detection signal to performing the cross-modal cross-correlation calculation on the speech energy envelope and motion envelope.
[0022] In this embodiment, the method includes:
Step 1: Generate a continuous single-frequency ultrasonic detection signal. The ultrasonic detection signal is emitted towards the target speaker to be authenticated through an ultrasonic transducer, while the original ultrasonic detection signal is retained as a reference signal. The ultrasonic detection signal is used to sense the minute movements of the facial muscles and vocal organs associated with the target speaker's voice, providing a spatial reference for subsequent echo detection and signal processing. The goal of this step is to generate a stable, pure, and directional ultrasonic reference signal for the subsequent capture of the minute movements of the vocal organs. All designs in this step revolve around three core principles: reducing signal distortion, ensuring detection coverage, and meeting human safety and auditory requirements.
Ultrasonic transducer selection and adaptation: The system can select either of two types of ultrasonic transducer according to the application scenario. The parameter requirements and suitable scenarios for the two types are as follows:
Piezoelectric ceramic ultrasonic transducers are suitable for applications such as smart door locks, smart speakers, and access control gates, where size is not a major concern and a long detection distance is required. The core parameters for these transducers are: center frequency tolerance ≤ ±1%, meaning that for a 40kHz carrier frequency the actual center frequency deviation should not exceed 400Hz, to avoid frequency mismatch during demodulation; -3dB bandwidth ≥ ±2kHz, to cover the maximum Doppler frequency shift range caused by vocal organ movement; transmitting sound pressure level ≥ 100dB@10cm, to ensure a sufficient signal-to-noise ratio for the echo signal at a typical interaction distance of 0.1~0.5m; and receiving sensitivity ≥ -60dBV / Pa, to capture weak echo signals at the millivolt level.
Because piezoelectric ceramic transducers exhibit capacitive impedance characteristics, a dedicated LC impedance matching network must be designed to match the transducer's impedance to the output impedance of the transmitting power amplifier (typically 50Ω). Setting the Q value of the matching network to 4 ensures transmission efficiency while suppressing harmonic components and preventing audible low-frequency noise.
[0023] MEMS ultrasonic transducer arrays are suitable for portable devices such as mobile phones, smartwatches, and Bluetooth headsets. The core parameters for these transducers are: single-channel sampling rate ≥ 256kHz, meeting the Nyquist sampling requirement for 40kHz ultrasonic signals with sufficient margin; -3dB bandwidth ≥ 80kHz, simultaneously covering the transmitted carrier and demodulated high-frequency components; overall sensitivity ≥ -38dBFS; and an integrated low-noise amplifier that directly amplifies the echo signal to the range that the ADC can acquire. The array typically consists of 4-8 elements with an element spacing of half a wavelength (4.25mm, half of the 8.5mm wavelength at 40kHz), enabling digital beamforming and keeping the main lobe width of the transmitted beam within ±15 degrees. This ensures coverage of the user's oral cavity, jaw, and lips while avoiding the reception of reflected echoes from surrounding irrelevant objects.
[0024] Criteria for carrier frequency selection: This scheme selects 40kHz as the carrier frequency, which is the optimal result after multi-dimensional trade-offs, and the specific basis is as follows: Hearing safety: The audible frequency range of the human ear is 20Hz~20kHz. 40kHz is completely beyond the upper limit of human hearing, and even prolonged transmission will not cause any auditory discomfort or hearing damage. If a frequency of 20~30kHz is selected, some children or young people with sensitive hearing may perceive low-frequency harmonic components, which will cause a harsh and uncomfortable feeling. If a frequency above 100kHz is selected, although the hearing safety is higher, the air attenuation is too great, making it unsuitable for use at normal interaction distances.
[0025] Propagation attenuation characteristics: The attenuation coefficient of ultrasound in air is approximately proportional to the square of the frequency. The attenuation coefficient of 40kHz ultrasound in standard temperature and humidity conditions is approximately 2dB / m, meaning that at a maximum interaction distance of 0.5m, the single-pass attenuation is only 1dB. Adding the reflection loss from facial skin and clothing (approximately 10-15dB), the echo signal amplitude can still remain within the dynamic range of the ADC. If a 100kHz carrier is selected, the attenuation coefficient will rise to 12dB / m, resulting in a single-pass attenuation of 6dB at 0.5m. After adding reflection losses, the echo amplitude will decrease to less than 1% of its original value, requiring an increase of at least 40dB in receiver gain. This will significantly amplify circuit noise and reduce system stability.
[0026] Spatial resolution: The detection resolution of ultrasound is directly related to its wavelength; the shorter the wavelength, the higher the resolution. The wavelength of a 40kHz ultrasound wave in air is $\lambda = c / f_c \approx 8.5$mm (taking the speed of sound $c \approx 340$m/s), which gives a half-wavelength resolution of 4.25mm and can accurately capture millimeter-level movements of the lips and jaw during human vocalization. If a 20kHz carrier wave is selected, with a wavelength of 17mm and a half-wavelength resolution of 8.5mm, the minute movements of the soft palate and lips cannot be detected, leading to insufficient accuracy in subsequent motion features.
[0027] If the application scenario requires higher resolution (such as medical-grade sound function detection), an 80kHz carrier can be selected. At this time, the wavelength is 4.25mm, and the half-wavelength resolution is 2.125mm. However, the maximum interaction distance needs to be reduced to within 0.3m, and the transmission power needs to be appropriately increased to compensate for the attenuation.
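The wavelength figures quoted for the 20kHz, 40kHz, and 80kHz carriers follow directly from $\lambda = c / f$. The short check below assumes a speed of sound of 340m/s, which is the value implied by the 8.5mm figure; the function name is hypothetical.

```python
C_AIR = 340.0  # assumed speed of sound in air, m/s

def wavelength_mm(freq_hz, c=C_AIR):
    """Wavelength in millimetres for a carrier at freq_hz."""
    return c / freq_hz * 1000.0

for freq in (20_000, 40_000, 80_000):
    lam = wavelength_mm(freq)
    print(f"{freq // 1000} kHz: wavelength {lam:.3f} mm, half-wavelength {lam / 2:.3f} mm")
```

Doubling the carrier frequency halves the resolvable motion scale, which is the trade-off against the higher air attenuation discussed above.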
[0028] Design of the generation and transmission link for continuous single-frequency ultrasonic detection signals: The signal transmitted by this scheme is a continuous single-frequency sine wave $s(t) = A\sin(2\pi f_c t)$, where $f_c$ = 40kHz is the carrier frequency; $A$ is the amplitude of the transmitted signal, which needs to be adjusted to conform to local sound wave radiation standards (typically, the peak amplitude corresponds to a sound pressure level ≤110dB@10cm, complying with FCC Part 18 and the domestic "Guidelines for Environmentally Friendly Design of Electronic and Electrical Products"); and $t$ is the time variable, in seconds (s). The specific implementation details of the signal generation and transmission link are as follows:
Signal Generation: A DAC with a resolution of 12 bits or higher is used to generate the sine wave signal. The DAC sampling rate is set to 256kHz, which is 6.4 times the carrier frequency. This ensures that the total harmonic distortion (THD) of the generated sine wave is ≤-60dB, avoiding the generation of audible low-frequency harmonic components. A high-precision 2.5V reference voltage with a temperature drift of ≤10ppm / ℃ is used for the DAC, to prevent temperature changes from causing amplitude drift in the transmitted signal.
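As an illustration of the signal generation step, the sketch below builds a lookup table of 12-bit DAC codes for a 40kHz sine sampled at 256kHz. Because 256kHz / 40kHz = 6.4 is not an integer, one carrier period does not fit a whole number of samples, but 32 samples contain exactly 5 periods, so a 32-entry table can be replayed in a loop. The mid-scale offset of 2048 and the scaling to ±2047 codes are assumptions for the example, not details from the source.

```python
import math

FS_DAC = 256_000  # DAC sampling rate in Hz (from the text)
FC = 40_000       # carrier frequency in Hz (from the text)

def dac_codes(n_samples, amplitude=1.0, fs=FS_DAC, fc=FC):
    """Return unsigned 12-bit codes for amplitude * sin(2*pi*fc*t).

    amplitude is a fraction of full scale between 0 and 1 (assumed convention).
    """
    codes = []
    for n in range(n_samples):
        # Integer modulo keeps the phase exact, so the table repeats bit-for-bit.
        phase = ((n * fc) % fs) / fs
        codes.append(round(2048 + amplitude * 2047 * math.sin(2 * math.pi * phase)))
    return codes

table = dac_codes(32)  # 32 samples = 5 full carrier periods
```

With a 2.5V reference, each code would map to roughly `code / 4095 * 2.5` volts at the DAC output, before amplification.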
[0029] Power Amplification: A Class AB power amplifier with a bandwidth ≥100kHz and adjustable gain (0~30dB) is selected, and the transmission power can be adjusted according to the interaction distance. A 100nF DC blocking capacitor is connected in series at the amplifier output to prevent DC components from flowing into the transducer and causing aging of the piezoelectric ceramic.
[0030] Harmonic suppression: By connecting a 40kHz surface acoustic wave (SAW) filter in series between the power amplifier and the transducer, the amplitude of the second harmonic (80kHz) can be suppressed to below -50dB, further reducing signal distortion.
[0031] Alignment and coverage calibration of the transmitted beam: To ensure that the detection beam completely covers the oral cavity, jaw, and lip areas related to sound production, beam alignment calibration must be performed both at the factory and during first use. The specific procedure is as follows: Factory calibration: Fix the device on the test bench and place a standard reflector (acrylic plate with a reflectivity ≥90%) 30cm in front of it. Adjust the physical angle of the transducer or the beamforming weights of the MEMS array to maximize the amplitude of the echo signal, at which point the main lobe center of the beam is aligned 30cm directly in front. Write the beam weights or physical angle parameters at this point into the device's non-volatile memory as default parameters.
[0032] First-time calibration: When a user uses the device for the first time, the system will prompt the user to look directly at the device, about 30cm away, and make an "ah" sound for 3 seconds. The system will automatically adjust the beam weights based on the echo amplitude distribution of the MEMS array to maximize the echo amplitude covering the user's face area. After calibration, the weights will be stored in the user's personal configuration and will be automatically recalled during subsequent use.
[0033] Real-time dynamic calibration: During use, the system will check the average amplitude of the echo every 100ms. If the amplitude drops by more than 3dB, it indicates that the user's position has shifted. The system will automatically adjust the beam direction and realign it with the user's face to ensure the stability of the detection.
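The 3dB test used for real-time calibration is a simple amplitude ratio in decibels. The sketch below assumes a stored reference amplitude from the last successful calibration and a current average amplitude measured over the latest 100ms; both function names are hypothetical.

```python
import math

def amplitude_drop_db(reference, current):
    """Drop of the current echo amplitude relative to the reference, in dB."""
    if reference <= 0 or current <= 0:
        raise ValueError("amplitudes must be positive")
    return 20 * math.log10(reference / current)

def needs_realignment(reference, current, threshold_db=3.0):
    """True when the echo has dropped by more than the threshold."""
    return amplitude_drop_db(reference, current) > threshold_db
```

A 3dB amplitude drop corresponds to the echo falling to about 71% of its reference value, so halving the amplitude (about 6dB) would trigger re-steering.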
[0034] Step 2: Two acquisition channels driven by the same source clock are used for strict synchronous acquisition. The first channel is the acoustic acquisition channel, which is used to acquire the air-conducted speech signal emitted by the target speaker. The second channel is the ultrasonic echo acquisition channel, which is used to acquire the ultrasonic echo signal reflected by the target speaker's face and carrying the movement information of the vocal organs. The clocks of the two acquisition channels are strictly aligned to provide a unified time domain basis for subsequent cross-modal signal correlation calculation. The core objective is to achieve strictly synchronized acquisition of the acoustic wave channel and the ultrasonic echo channel, ensuring that the time deviation between the two channels is less than 1 microsecond, thus providing a reliable foundation for subsequent cross-correlation calculations. All designs in this step revolve around three core principles: clock synchronization accuracy, signal acquisition quality, and robustness to abnormal scenarios. Clock synchronization implementation schemes and precision control: Clock synchronization between the two channels is the core foundation of the entire solution. If the clock deviation exceeds 1ms, the cross-correlation peak will shift outside the search range, leading to misjudgments. This solution employs a synchronization scheme using a shared clock source and hardware triggering, which can control the time deviation between the two channels to within 1 microsecond. The specific implementation is as follows: Same-source clock design: Both ADC and DAC channels use the same 10ppm precision temperature-compensated crystal oscillator (TCXO) as the clock source. The output frequency of the crystal oscillator is 25.6MHz. 
After frequency division, it provides a 256kHz sampling clock for the DAC, a 48kHz sampling clock for the acoustic channel ADC, and a 256kHz sampling clock for the ultrasonic channel ADC. The phase deviation of the three clocks is less than 10ns, which avoids time deviation caused by clock drift from the source.
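The divider ratios implied by these clock figures can be checked with exact fractions. The check below only restates the arithmetic; how a non-integer ratio would be realized in hardware (fractional divider, PLL, or sample-rate conversion) is not stated in the text and is left open here.

```python
from fractions import Fraction

TCXO_HZ = 25_600_000  # crystal oscillator frequency from the text

def divider(target_hz, source_hz=TCXO_HZ):
    """Exact ratio between the source clock and a target sampling clock."""
    return Fraction(source_hz, target_hz)

for name, rate in (("DAC", 256_000), ("acoustic ADC", 48_000), ("ultrasonic ADC", 256_000)):
    ratio = divider(rate)
    kind = "integer" if ratio.denominator == 1 else "non-integer"
    print(f"{name}: 25.6 MHz / {rate} Hz = {ratio} ({kind})")
```

The 256kHz clocks come out as an exact divide-by-100, while 48kHz corresponds to a ratio of 1600/3, so the acoustic channel clock cannot be produced by plain integer division of 25.6MHz.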
[0035] Hardware-triggered synchronization: When the system starts data acquisition, the MCU sends a hardware trigger signal, which is simultaneously connected to the trigger pins of both ADC channels, ensuring that sampling on both channels starts at the same time with a start-time deviation of less than 100ns. If the device is implemented with discrete components, the Precision Time Protocol (PTP, IEEE 1588) can be used for synchronization, with a synchronization accuracy better than 1 microsecond, which meets the system requirements.
[0036] Synchronization accuracy verification: The equipment must undergo a synchronization accuracy test at the factory. A synchronized sound and vibration signal source (i.e., the speaker and the vibration table share the same trigger signal, emitting a 2kHz sound and a 40kHz vibration at the same time) is placed 30cm in front of the equipment. The signals from the two channels are collected, and the peak time delay of the two signals is calculated. If the time delay deviation exceeds 1 microsecond, it is judged as unqualified and the clock circuit needs to be readjusted.
[0037] Acquisition link design for acoustic wave channel: The acoustic channel is used to acquire airborne voice signals in the 20Hz-20kHz range. The link design must ensure the fidelity of the voice signal while suppressing environmental noise and power frequency interference. Specific details are as follows: Microphone Selection: Choose a low-noise MEMS microphone or electret microphone with a frequency response range of 20Hz-20kHz, frequency response fluctuation ≤±1dB, sensitivity of -38dBFS, and equivalent input noise ≤28dBA, capable of capturing clear voice signals in noisy environments. The microphone opening should be located away from internal noise sources such as fans and speakers. A dustproof, waterproof, and sound-permeable membrane should be applied to the opening to prevent dust and water droplets from entering the equipment.
[0038] Preamplification and Filtering: The microphone output signal first passes through a low-noise preamplifier with a gain of 20dB and a noise figure ≤2dB. Then it passes through a 4th-order Butterworth bandpass filter with a passband range of 20Hz-20kHz, which can suppress power frequency vibration noise below 20Hz and ultrasonic interference above 20kHz. A 500Ω current-limiting resistor is connected in series at the filter output to prevent electrostatic discharge (ESD) pulses from damaging the ADC.
[0039] Sampling parameter settings: The ADC resolution of the acoustic channel is 16 bits, and the sampling rate is 48kHz, which fully covers the range of human hearing and conforms to the sampling standards of general audio signals. The input dynamic range of the ADC is ±1V, which can accommodate speech signals with a peak amplitude of 1V and avoid clipping distortion.
[0040] Gain adaptive adjustment: The system will detect ambient noise for 100ms before acquisition. If the average amplitude of the ambient noise is lower than -40dBFS, the gain of the preamplifier will be adjusted to 20dB; if the ambient noise is between -40dBFS and -20dBFS, the gain will be adjusted to 10dB; if the ambient noise is higher than -20dBFS, the gain will be adjusted to 0dB, and the user will be prompted to reduce the ambient noise to ensure that the signal-to-noise ratio of the voice signal is ≥20dB.
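The gain rule above can be sketched as a small lookup. This is a minimal illustration, not the device firmware: the function names `mean_level_dbfs` and `select_preamp_gain` are hypothetical, full scale is assumed to be an amplitude of 1.0, and the boundary values of exactly -40dBFS and -20dBFS are assumed to fall in the middle band.

```python
import math
from typing import Sequence, Tuple


def mean_level_dbfs(samples: Sequence[float], full_scale: float = 1.0) -> float:
    """Average absolute amplitude of the samples in dB relative to full scale."""
    mean_abs = sum(abs(x) for x in samples) / len(samples)
    # Clamp to a tiny positive value so that pure digital silence does not hit log10(0).
    return 20.0 * math.log10(max(mean_abs, 1e-12) / full_scale)


def select_preamp_gain(noise_dbfs: float) -> Tuple[int, bool]:
    """Return (preamplifier gain in dB, whether to prompt the user to reduce noise)."""
    if noise_dbfs < -40.0:
        return 20, False          # quiet environment: full gain
    if noise_dbfs <= -20.0:
        return 10, False          # moderate noise: reduced gain
    return 0, True                # loud environment: no gain, prompt the user


# 100ms of ambient noise at 48kHz is 4800 samples.
noise = [0.001] * 4800
print(select_preamp_gain(mean_level_dbfs(noise)))
```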
[0041] Ultrasonic echo channel acquisition link design: The ultrasonic echo channel acquires the 40kHz ultrasonic signal reflected from the face. The link design must amplify the weak echo signal while suppressing direct-wave interference from the transmitting end and high-frequency environmental noise. Specific details are as follows: Receiver transducer selection: Use piezoelectric ceramic or MEMS transducers of the same specifications as the transmitter transducer to ensure center frequency matching. If a separate transmit / receive design is adopted, the distance between the receiver and transmitter transducers must be ≥2cm, and sound-absorbing material should be placed between them to suppress the amplitude of the direct wave to below -40dB, avoiding ADC saturation caused by the direct wave. If an integrated transmit / receive design based on a MEMS array is adopted, a time-domain isolation scheme must be used: after transmitting a 1ms ultrasonic signal, the transmitting circuit is turned off and the receiving circuit is turned on to avoid direct-wave interference.
[0042] Low-noise amplification and filtering: The output signal from the receiving transducer first passes through a low-noise amplifier with a gain of 60dB and a noise figure ≤2dB. Then it passes through a 4th-order Butterworth bandpass filter with a passband range of 38kHz~42kHz, which suppresses ambient noise and emission harmonics outside the passband. The filter output then passes through a programmable amplifier with adjustable gain, ranging from 0-20dB. This amplifier automatically adjusts its gain based on the echo amplitude, ensuring the peak amplitude of the echo signal remains within the dynamic range of the ADC.
[0043] Sampling parameter settings: The ultrasound channel's ADC resolution is 12 bits, and the sampling rate is 256kHz, which is 6.4 times the carrier frequency, enabling complete capture of the echo's phase and amplitude information. The ADC's input dynamic range is ±1V, accommodating amplified echo signals and avoiding clipping distortion.
[0044] Echo quality verification: The system will detect the echo signal for 100ms before acquisition. If the average amplitude of the echo is lower than -60dBFS, it means that the user is too far away from the device or the beam is not aligned. The system will prompt the user to move closer to the device or adjust the position. If the average amplitude of the echo is higher than 0dBFS, it means that the echo is saturated. The system will automatically reduce the receiving gain until the echo amplitude is within a reasonable range.
[0045] Data collection triggering and data caching mechanism: To avoid collecting invalid static data, the system uses a voice-triggered data collection mechanism, the specific process of which is as follows: Trigger threshold setting: The system detects the short-time energy of the sound wave channel in real time. The calculation window for short-time energy is 10ms. If the short-time energy of three consecutive windows exceeds 10 times the average energy of the ambient noise, it is determined that the user has started speaking, and synchronous acquisition is triggered.
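The trigger rule (three consecutive 10ms windows whose short-time energy exceeds 10 times the ambient noise energy) can be sketched as follows. The function name `detect_speech_onset`, the use of non-overlapping windows, and the choice to return the sample index at which the trigger fires are assumptions made for illustration.

```python
from typing import Optional, Sequence


def detect_speech_onset(
    samples: Sequence[float],
    sample_rate: int,
    noise_energy: float,
    window_ms: float = 10.0,
    ratio: float = 10.0,
    consecutive: int = 3,
) -> Optional[int]:
    """Return the sample index at which the trigger fires, or None if it never does."""
    win = int(sample_rate * window_ms / 1000)
    run = 0  # number of consecutive loud windows seen so far
    for start in range(0, len(samples) - win + 1, win):
        frame = samples[start:start + win]
        energy = sum(x * x for x in frame) / win      # mean-square energy of the window
        run = run + 1 if energy > ratio * noise_energy else 0
        if run == consecutive:
            return start + win                         # end of the last loud window
    return None


# 50ms of quiet background followed by 50ms of louder signal, at 48kHz.
signal = [0.01] * 2400 + [0.5] * 2400
print(detect_speech_onset(signal, 48000, noise_energy=1e-4))
```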
[0046] Acquisition duration setting: Each acquisition session lasts 2.5 seconds, ensuring at least 3 complete syllables are included to provide sufficient feature data for subsequent cross-correlation calculations. If the total energy of the acquired speech segment is below the threshold, it indicates that the user's voice is too soft, and the system will prompt the user to increase the volume and re-acquire the data.
[0047] Data caching: The acquired data from the two channels is first stored in a circular buffer with a size of 10 seconds, storing the most recent 10 seconds of acquired data to avoid loss of speech segments due to trigger delay. After acquisition, the system retrieves the data from the buffer corresponding to 0.5 seconds before the trigger to 2 seconds after the trigger as the raw data for subsequent processing, ensuring that the complete speech start segment is included.
[0048] Step 3: Call the reference signal retained in Step 1 and perform coherent demodulation on the ultrasonic echo signal acquired in Step 2; filter out the high-frequency components to extract the baseband signal; then perform phase unwrapping on the baseband signal to eliminate phase wrapping errors. The result is the motion envelope, which directly reflects the radial motion of the target speaker's vocal organs and provides the motion-dimension feature input for subsequent cross-modal feature association. This step is the core of converting high-frequency ultrasonic echoes into the motion trajectory of the vocal organs, and it is designed around preserving phase information, eliminating demodulation errors, and extracting high-precision motion parameters. Implementation and error correction of quadrature heterodyne demodulation: This scheme employs quadrature heterodyne demodulation instead of traditional envelope detection, because quadrature demodulation fully preserves the phase information of the echo. This enables detection of not only the velocity of the reflecting surface but also its direction (approaching or moving away from the transducer), and the demodulation accuracy is significantly higher than that of envelope detection. Specific implementation details are as follows: Quadrature reference signal generation: Two orthogonal reference signals are generated locally, namely the I-channel reference signal $r_I(t)=\cos(2\pi f_c t)$ and the Q-channel reference signal $r_Q(t)=\sin(2\pi f_c t)$, where $f_c$ is the carrier frequency. The amplitudes of the two signals are equal, and the phase difference is strictly 90 degrees.
[0049] The sampling rate of the reference signal is consistent with that of the ultrasonic echo channel, which is 256kHz, to ensure complete alignment with the sampling points of the echo signal.
[0050] Frequency mixing operation: The acquired echo signal $x(t)$ is multiplied by each of the two reference signals to obtain the mixed signals: $s_I(t)=x(t)\cos(2\pi f_c t)$; $s_Q(t)=x(t)\sin(2\pi f_c t)$. According to the product-to-sum formulas, each mixed signal contains two components: a high-frequency component at $2f_c+f_d$, corresponding to the sum frequency of the reference signal and the echo signal, and a low-frequency component at $f_d$, corresponding to their difference frequency. Here $f_d$ is the Doppler frequency shift, which is proportional to the velocity of the vocal organs and satisfies $f_d = 2 v f_c / c$, where $v$ is the speed of movement of the vocal organs and $c$ is the speed of sound.
[0051] Product-to-sum formulas: For I-channel mixing (echo × cosine reference signal), the I-path of quadrature demodulation multiplies the echo by a cosine reference signal of the same frequency and phase as the carrier, using the product-to-sum identity for two cosines: $\cos\alpha\cos\beta=\tfrac{1}{2}[\cos(\alpha+\beta)+\cos(\alpha-\beta)]$. For Q-channel mixing (echo × sine reference signal), the Q-path multiplies the echo by a sine reference signal of the same frequency in quadrature, using the product-to-sum identity for a cosine multiplied by a sine: $\cos\alpha\sin\beta=\tfrac{1}{2}[\sin(\alpha+\beta)-\sin(\alpha-\beta)]$. The echo signal reflected by the moving vocal organs is a cosine signal with a Doppler frequency shift, of the form $x(t)=A\cos(2\pi(f_c+f_d)t+\varphi_0)$, where $A$ is the echo amplitude, $f_c$ is the center frequency of the transmitted carrier, $f_d$ is the Doppler shift caused by the movement of the vocal organs, and $\varphi_0$ is the initial phase of the echo.
[0052] The local I-channel reference signal is a cosine signal with the same phase and frequency as the transmitted carrier. Substituting it and the echo into the first product-to-sum formula above, with $\alpha=2\pi(f_c+f_d)t+\varphi_0$ and $\beta=2\pi f_c t$, gives the mixed signal: $s_I(t)=\tfrac{A}{2}\cos(2\pi(2f_c+f_d)t+\varphi_0)+\tfrac{A}{2}\cos(2\pi f_d t+\varphi_0)$. The two terms are the two components after mixing. The first term, $\tfrac{A}{2}\cos(2\pi(2f_c+f_d)t+\varphi_0)$, is the sum-frequency component, with a frequency near twice the carrier frequency; it is redundant and is removed by the subsequent low-pass filtering. The second term, $\tfrac{A}{2}\cos(2\pi f_d t+\varphi_0)$, is the difference-frequency baseband component, whose frequency equals the Doppler frequency shift; it carries the velocity and phase information of the vocal organs and is the valid signal to be retained. The mixing of the Q-path reference signal follows the same derivation: its sum-frequency component is likewise a high-frequency signal near twice the carrier frequency, while its difference-frequency component is the quadrature baseband Doppler signal used for subsequent phase unwrapping and motion direction identification.
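The Q-path result referred to above can be written out explicitly. The following is a sketch under assumed notation, since the original symbols are not preserved in this text: $A$ is the echo amplitude, $f_c$ the carrier frequency, $f_d$ the Doppler shift, $\varphi_0$ the initial echo phase, and the Q reference is taken to be $\sin(2\pi f_c t)$.

```latex
\begin{aligned}
s_Q(t) &= A\cos\bigl(2\pi(f_c+f_d)t+\varphi_0\bigr)\,\sin(2\pi f_c t) \\
       &= \tfrac{A}{2}\sin\bigl(2\pi(2f_c+f_d)t+\varphi_0\bigr)
        - \tfrac{A}{2}\sin\bigl(2\pi f_d t+\varphi_0\bigr)
\end{aligned}
```

Under this sign convention, the low-pass filtered outputs are $I(t)=\tfrac{A}{2}\cos\theta(t)$ and $Q(t)=-\tfrac{A}{2}\sin\theta(t)$ with $\theta(t)=2\pi f_d t+\varphi_0$, so the baseband phase would be recovered as $\theta = \operatorname{atan2}(-Q, I)$. A Q reference of $-\sin(2\pi f_c t)$ would remove the minus sign.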
[0053] I / Q Imbalance Correction: Due to device manufacturing errors, the amplitudes of the two reference signals may not be equal, and the phase difference may not be exactly 90 degrees, which will lead to I / Q imbalance and introduce image frequency interference. Therefore, I / Q imbalance correction needs to be performed before demodulation.
[0054] The calibration method is as follows: When the equipment leaves the factory, the transducer is aligned with a fixed reflecting surface. Because the reflector is stationary, the frequency of the echo signal equals the carrier frequency, and the low-frequency component after mixing is a DC signal. The parameters of the correction matrix are adjusted so that the DC components of the I and Q paths are equal and the orthogonality error is ≤1 degree. The correction matrix is determined by two parameters: the deviation of the phase difference between the two reference signals from 90 degrees, and the amplitude ratio of the two reference signals. After correction, the I and Q outputs are orthogonal and have equal amplitudes.
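One common form of such a correction is sketched below. It is not taken from the source, whose matrix is not preserved in this text. The sketch assumes the uncorrected outputs follow the model I = cos θ and Q = g·sin(θ + ε), where g (here `gain_ratio`) is the Q-to-I amplitude ratio and ε (here `phase_err_rad`) is the deviation from 90 degrees; the function name `correct_iq` is hypothetical.

```python
import math
from typing import Tuple


def correct_iq(i: float, q: float, gain_ratio: float, phase_err_rad: float) -> Tuple[float, float]:
    """Undo amplitude and phase imbalance under the assumed model
    I = cos(theta), Q = gain_ratio * sin(theta + phase_err_rad).

    Equivalent matrix form: [I'; Q'] = [[1, 0], [-tan(e), 1 / (g * cos(e))]] [I; Q].
    """
    i_corr = i
    q_corr = (q / gain_ratio - i * math.sin(phase_err_rad)) / math.cos(phase_err_rad)
    return i_corr, q_corr


# Example: 10% amplitude mismatch and 5 degrees of phase error.
g, eps = 1.1, math.radians(5.0)
theta = 0.7
i_raw, q_raw = math.cos(theta), g * math.sin(theta + eps)
print(correct_iq(i_raw, q_raw, g, eps))   # close to (cos(0.7), sin(0.7))
```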
[0055] Low-pass filter design and baseband signal extraction: The mixed signal contains a high-frequency component near twice the carrier frequency and the low-frequency Doppler component. The high-frequency component must be removed by a low-pass filter so that only the baseband signal is retained. The design details of the low-pass filter are as follows: Filter type selection: A linear-phase FIR filter is used instead of an IIR filter because the FIR filter has strictly linear phase and will not distort the phase of the baseband signal, thus ensuring the accuracy of subsequent phase unwrapping and motion parameter extraction.
[0056] Filter parameter design: The cutoff frequency of the low-pass filter is calculated from the maximum movement speed of the human vocal organs. According to extensive statistical data in speech physiology, the maximum movement speed of the lips during human vocalization is 1.2 m / s, the maximum movement speed of the mandible is 0.7 m / s, the maximum movement speed of the soft palate is 0.3 m / s, and the maximum radial movement speed of all vocal organs does not exceed 1.5 m / s. Substituting this upper bound into the Doppler frequency shift formula yields the maximum Doppler frequency shift $f_{d,\max}$ = 2×1.5×40000 / 340 ≈ 353Hz. To allow sufficient margin, the cutoff frequency of the low-pass filter is set to 500Hz, which covers all possible Doppler shift components.
[0057] The stopband of the filter is set to 70kHz and above, so that the high-frequency components near twice the carrier frequency (80kHz) are suppressed to below -60dB. A Hamming-window filter of order 32 is used to ensure a stopband attenuation of ≥60dB and a passband ripple of ≤0.1dB, meeting the system requirements.
[0058] Zero-phase filtering implementation: To avoid the group delay introduced by the FIR filter (which is equal to half the filter order, i.e., 16 sampling points, corresponding to 62.5 microseconds), a zero-phase filtering scheme is adopted: First, the signal is passed forward through the FIR filter, then the filtered signal is reversed and passed through the same FIR filter again, and finally the signal is reversed again. The resulting output signal is completely in phase with the input signal, without any phase delay, thus ensuring the accuracy of time-domain alignment.
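The mixing and zero-phase filtering steps can be tried end to end on a synthetic echo. The sketch below uses only the Python standard library; the function names, the windowed-sinc design procedure, and the zero initial filter state are assumptions made for illustration, and no claim is made that this 33-tap design reaches the attenuation figures quoted in the text. The Q reference is taken as +sin, so the baseband Q output carries a minus sign.

```python
import math
from typing import List, Sequence, Tuple

FS = 256_000.0   # ultrasonic ADC sampling rate in Hz (from the text)
FC = 40_000.0    # carrier frequency in Hz (from the text)


def hamming_lowpass(num_taps: int, cutoff_hz: float, fs: float) -> List[float]:
    """Windowed-sinc low-pass FIR taps with a Hamming window, scaled to unit DC gain."""
    mid = (num_taps - 1) / 2.0
    wc = 2.0 * math.pi * cutoff_hz / fs            # cutoff in radians per sample
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = wc / math.pi if x == 0 else math.sin(wc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def fir(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filter with zero initial state; output length equals input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(n + 1, len(taps))):
            acc += taps[k] * signal[n - k]
        out.append(acc)
    return out


def zero_phase(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Forward pass, reverse, second pass, reverse again: the two group delays cancel."""
    forward = fir(signal, taps)
    backward = fir(forward[::-1], taps)
    return backward[::-1]


def demodulate(echo: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Mix the echo with cos/sin references at FC and low-pass to baseband I and Q."""
    i_mix = [x * math.cos(2.0 * math.pi * FC * n / FS) for n, x in enumerate(echo)]
    q_mix = [x * math.sin(2.0 * math.pi * FC * n / FS) for n, x in enumerate(echo)]
    taps = hamming_lowpass(33, 500.0, FS)          # order 32 means 33 taps
    return zero_phase(i_mix, taps), zero_phase(q_mix, taps)


# Synthetic unit-amplitude echo with a 200Hz Doppler shift, 10ms long.
doppler_hz, phi0 = 200.0, 0.3
echo = [math.cos(2.0 * math.pi * (FC + doppler_hz) * n / FS + phi0) for n in range(2560)]
i_bb, q_bb = demodulate(echo)
# With the +sin Q reference used here, the baseband phase is atan2(-Q, I).
print(math.atan2(-q_bb[1000], i_bb[1000]))
```

Samples within about one filter length of either end of the record are affected by the zero initial state and are best discarded.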
[0059] Downsampling: The highest frequency of the filtered baseband signal is 500Hz. According to the Nyquist sampling theorem, a sampling rate of ≥1kHz is sufficient to fully preserve signal information. Therefore, the sampling rate of the baseband signal is downsampled from 256kHz to 2kHz, which preserves all effective information and reduces the computational load of subsequent processing. Cubic spline interpolation is used for downsampling to avoid spectral aliasing.
[0060] Phase unwrapping and motion parameter extraction: The I and Q baseband signals obtained after low-pass filtering contain the phase information of the echo. The displacement and velocity of the vocal organs can be extracted by phase unwrapping. The specific implementation details are as follows: Wrapped phase calculation: For the I and Q values of each sampling point, the wrapped phase is calculated using the four-quadrant arctangent function. The resulting phase values lie in the range $(-\pi, \pi]$.
[0061] Because the output of the four-quadrant arctangent function wraps around whenever the true phase passes beyond $\pm\pi$ (phase wrapping), phase unwrapping is required to obtain the continuous true phase.
[0062] Phase unwrapping method: Phase unwrapping with adjacent-point difference correction is adopted. The specific process is as follows: Starting from the second sampling point, the difference $\Delta$ between the wrapped phase of the current sampling point and the unwrapped phase of the previous sampling point is calculated. If $\Delta > \pi$, $2\pi$ is subtracted from the wrapped phase of the current sampling point; if $\Delta < -\pi$, $2\pi$ is added to the wrapped phase of the current sampling point. This is repeated until the absolute value of the difference is no greater than $\pi$, which gives the unwrapped phase of the current sampling point.
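The adjacent-point correction described above can be sketched in a few lines. The function name `unwrap_phase` is hypothetical; the logic follows the rule in the text of adding or subtracting 2π until the step from the previous unwrapped value lies within ±π.

```python
import math
from typing import List, Sequence


def unwrap_phase(wrapped: Sequence[float]) -> List[float]:
    """Unwrap a phase sequence given in (-pi, pi] by adjacent-point difference correction."""
    if not wrapped:
        return []
    unwrapped = [wrapped[0]]
    for phase in wrapped[1:]:
        diff = phase - unwrapped[-1]
        while diff > math.pi:        # jumped up by a full turn: bring it back down
            phase -= 2.0 * math.pi
            diff = phase - unwrapped[-1]
        while diff < -math.pi:       # jumped down by a full turn: bring it back up
            phase += 2.0 * math.pi
            diff = phase - unwrapped[-1]
        unwrapped.append(phase)
    return unwrapped


# A steadily increasing phase, wrapped by atan2 and then recovered.
true_phase = [0.5 * n for n in range(50)]
wrapped = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
recovered = unwrap_phase(wrapped)
print(recovered[-1])   # close to 24.5
```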
[0063] To avoid erroneous phase unwrapping caused by noise, a moving average filter with a window size of 5 sampling points (corresponding to 2.5ms) is applied to the I and Q signals before phase unwrapping, which effectively suppresses high-frequency noise. If a data segment contains more than 5 phase jumps (i.e., points where the absolute value of the phase difference between adjacent points is greater than $\pi$), unwrapping is considered to have failed because the noise level is too high; the data are deemed invalid and must be collected again.
[0064] Motion parameter calculation: The unwrapped phase $\varphi(t)$ and the radial displacement $d(t)$ of the reflecting surface are related by $\varphi(t) = 4\pi d(t) / \lambda$, where $\lambda$ = 8.5mm is the wavelength of 40kHz ultrasound; therefore the displacement is $d(t) = \lambda\,\varphi(t) / (4\pi)$.
[0065] The radial velocity $v(t)$ is obtained by differentiating the displacement $d(t)$. To avoid the noise amplification caused by direct differencing, the derivative is obtained by third-order polynomial fitting: for each sampling point, the phase data of the five sampling points before and after it are fitted with a third-order polynomial, and the derivative of the fitted polynomial is evaluated to obtain the instantaneous velocity at that point, which effectively suppresses the influence of noise.
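The local cubic fit over five points on each side (an 11-point window) has a closed form for the derivative at the window center, so no explicit fitting loop is needed. The sketch below derives the weights from the least-squares normal equations; the function names are hypothetical, the fit is applied to displacement rather than raw phase (the two differ only by the constant factor λ / 4π), and edge samples without a full window are simply dropped.

```python
import math
from typing import List, Sequence

HALF = 5  # five points before and five after each sample


def cubic_fit_derivative_weights(half: int = HALF) -> List[float]:
    """Weights giving the center derivative (per sample) of a least-squares cubic fit."""
    idx = range(-half, half + 1)
    s2 = sum(i ** 2 for i in idx)
    s4 = sum(i ** 4 for i in idx)
    s6 = sum(i ** 6 for i in idx)
    det = s2 * s6 - s4 * s4
    # Linear coefficient of the fitted cubic, from the odd-power normal equations.
    return [(s6 * i - s4 * i ** 3) / det for i in idx]


def phase_to_displacement(phase: Sequence[float], wavelength_m: float = 0.0085) -> List[float]:
    """Displacement d = lambda * phi / (4 * pi) for a round-trip ultrasonic path."""
    return [wavelength_m * p / (4.0 * math.pi) for p in phase]


def radial_velocity(displacement: Sequence[float], fs: float, half: int = HALF) -> List[float]:
    """Velocity for samples half .. len - half - 1, in displacement units per second."""
    weights = cubic_fit_derivative_weights(half)
    velocity = []
    for n in range(half, len(displacement) - half):
        window = displacement[n - half:n + half + 1]
        slope_per_sample = sum(w * d for w, d in zip(weights, window))
        velocity.append(slope_per_sample * fs)
    return velocity


fs = 2000.0
disp = [(n / fs) ** 3 - 2.0 * (n / fs) for n in range(41)]
print(radial_velocity(disp, fs)[0])   # derivative of t^3 - 2t at t = 5 / fs
```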
[0066] Error compensation: Since the speed of sound changes with ambient temperature, the speed of sound is calculated as $c$ = 331.4 + 0.6 × $T$ (in m / s), where $T$ is the ambient temperature in degrees Celsius. The system therefore has a built-in temperature sensor with an accuracy of ±0.5℃ that collects the ambient temperature in real time and corrects the sound velocity value, ensuring the accuracy of the displacement and velocity calculations. In addition, the system collects a 1-second static echo every time it is powered on, calculates the static phase shift, and subtracts this shift in subsequent processing to eliminate phase errors caused by circuit temperature drift.
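The two compensations above amount to very little arithmetic, sketched here. The function names are hypothetical, and taking the static phase shift as the mean phase of the power-on static echo is an assumption.

```python
from typing import List, Sequence


def speed_of_sound(temp_c: float) -> float:
    """Speed of sound in m/s at the given ambient temperature in degrees Celsius."""
    return 331.4 + 0.6 * temp_c


def wavelength_m(temp_c: float, carrier_hz: float = 40_000.0) -> float:
    """Temperature-corrected ultrasonic wavelength in meters."""
    return speed_of_sound(temp_c) / carrier_hz


def remove_static_phase(phase: Sequence[float], static_phase: Sequence[float]) -> List[float]:
    """Subtract the mean phase of the power-on static echo from a phase sequence."""
    offset = sum(static_phase) / len(static_phase)
    return [p - offset for p in phase]


print(speed_of_sound(20.0), wavelength_m(20.0))
```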
[0067] Step 4: Extract the speech energy envelope from the air-conducted speech signal acquired in Step 2. Simultaneously, normalize the motion envelope obtained in Step 3 to eliminate interference from distance and individual amplitude differences. Based on the inherent physical causal relationship between the movement of the vocal organs and speech production, align the speech energy envelope and the motion envelope in the time domain so that the temporal correspondence between the two features conforms to the physiological laws of human vocalization, providing aligned dual-modal feature input for subsequent matching and authentication. Physically related envelope features are extracted from speech and motion signals respectively, and temporal alignment of the two is achieved to provide a consistent feature sequence for subsequent cross-correlation matching.
[0068] Methods and parameter optimization for extracting speech energy envelope: The speech energy envelope reflects the change of speech signal energy over time and has a natural temporal correlation with the movement of the vocal organs. This solution provides two envelope extraction methods, which can be selected according to the scenario: Short-time energy method: Suitable for scenarios with high environmental noise, this method is more robust. The specific implementation process is as follows: Framing and Windowing: The acquired speech signal is segmented into frames with a frame length of 20ms (corresponding to 960 sampling points at a 48kHz sampling rate) and a frame shift of 10ms (corresponding to 480 sampling points). Adjacent frames have 50% overlap to avoid abrupt changes in frame boundaries. Hamming windows are used to window each frame during framing, reducing spectral leakage and improving the stability of short-time energy calculations.
[0069] Short-time energy calculation: Calculate the sum of squares of the sampling points in each frame, then divide by the frame length to obtain the average short-time energy of that frame, forming a frame-level short-time energy sequence.
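The framing, windowing, and energy steps can be sketched as below. The function name `short_time_energy` is hypothetical, and trailing samples that do not fill a whole frame are dropped, which is an assumption.

```python
import math
from typing import List, Sequence


def short_time_energy(
    speech: Sequence[float],
    fs: int = 48000,
    frame_ms: float = 20.0,
    hop_ms: float = 10.0,
) -> List[float]:
    """Frame-level mean energy of Hamming-windowed frames (20ms frames, 10ms shift)."""
    frame_len = int(fs * frame_ms / 1000)     # 960 samples at 48kHz
    hop = int(fs * hop_ms / 1000)             # 480 samples at 48kHz
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    energies = []
    for start in range(0, len(speech) - frame_len + 1, hop):
        acc = 0.0
        for n in range(frame_len):
            v = speech[start + n] * window[n]
            acc += v * v                       # sum of squares of the windowed frame
        energies.append(acc / frame_len)       # divide by the frame length
    return energies


# 50ms of quiet signal followed by 50ms of louder signal.
speech = [0.1] * 2400 + [1.0] * 2400
envelope = short_time_energy(speech)
print(len(envelope))   # one value every 10ms, i.e. a 100Hz energy sequence
```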
[0070] Upsampling alignment: Since the sampling rate of the short-time energy sequence is 100Hz (one point every 10ms), while the sampling rate of the motion envelope is 2kHz, cubic spline interpolation is used to upsample the short-time energy sequence to 2kHz to keep it consistent with the sampling rate of the motion envelope.
[0071] Smoothing: A moving average filter with a window size of 10ms (corresponding to 20 sampling points) is applied to the upsampled energy sequence to eliminate the influence of high-frequency noise and obtain the final speech energy envelope.
[0072] Hilbert transform: Suitable for scenarios with low environmental noise, this method offers higher time resolution and can more accurately capture speech bursts. The specific implementation process is as follows: Preprocessing: First, the speech signal is subjected to 50Hz / 60Hz power frequency notch filtering to eliminate power frequency interference. Then, noise reduction is performed, using spectral subtraction to suppress environmental noise and improve the signal-to-noise ratio of the speech signal.
[0073] Hilbert Transform: Performing the Hilbert transform on the preprocessed speech signal yields the analytic signal $z(t) = s(t) + j\hat{s}(t)$, where $s(t)$ is the original speech signal and $\hat{s}(t)$ is its Hilbert transform. The magnitude of the analytic signal, $|z(t)| = \sqrt{s^2(t) + \hat{s}^2(t)}$, is the instantaneous amplitude envelope of the speech.
[0074] Boundary effect processing: To avoid boundary distortion of the Hilbert transform, zeros of 10% length are padded at the beginning and end of the speech signal before processing. After processing, the padded zeros are removed to obtain the complete amplitude envelope.
[0075] Normalization and smoothing: The amplitude envelope is smoothed with a moving-average window of 5ms and resampled to 2kHz (at which rate 5ms corresponds to 10 sampling points) to obtain the final speech energy envelope.
[0076] Normalization and validity verification of motion envelope: The motion envelope is the radial motion velocity sequence of the vocal organs extracted in step three. Normalization is required to eliminate the influence of individual differences and environmental factors. The specific process is as follows: DC offset removal: Due to the influence of DC drift and static reflection in the circuit, there will be a fixed DC component in the motion velocity sequence. First, calculate the average value of the velocity sequence in the entire acquisition segment, and then subtract the average value from the velocity of each sampling point to obtain the velocity sequence with zero mean.
[0077] Amplitude normalization: Select different normalization methods according to the application scenario: If used for liveness detection, Z-score normalization is adopted: the mean of the velocity sequence is subtracted and the standard deviation is divided. The resulting sequence has a mean of 0 and a standard deviation of 1, which can eliminate the differences in motion amplitude among different users and retain only the fluctuation characteristics in the time domain.
[0078] If used for identity recognition, percentile normalization is adopted: the 99th percentile value of the velocity sequence is taken as the maximum value, and all values are mapped to the range of [-1,1]. This not only eliminates the amplitude difference caused by distance, but also preserves the user's motion amplitude characteristics, providing more dimensions of information for identity recognition.
[0079] Validity verification: Calculate the variance of the normalized motion envelope. If the variance is less than 0.1, it indicates that the user's face does not have obvious movement, which may be due to static attack or the user not speaking. The feature is deemed invalid and needs to be re-collected. If the maximum absolute value of the motion envelope exceeds 1.5 times the 99th percentile, it indicates the presence of outliers caused by noise. Median filtering is used to replace the outliers with the average value of the adjacent points.
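The DC removal and the two normalization options can be sketched as follows. The function names are hypothetical. For the percentile variant, the sketch assumes the 99th percentile is taken over absolute values using the nearest-rank rule and that values beyond it are clipped to ±1; the text does not specify these details.

```python
import math
import statistics
from typing import List, Sequence


def remove_dc(velocity: Sequence[float]) -> List[float]:
    """Subtract the mean of the whole acquisition segment."""
    mean = statistics.fmean(velocity)
    return [v - mean for v in velocity]


def zscore_normalize(velocity: Sequence[float]) -> List[float]:
    """Zero mean and unit standard deviation (liveness-detection variant)."""
    mean = statistics.fmean(velocity)
    sd = statistics.pstdev(velocity)
    if sd == 0.0:
        raise ValueError("constant sequence: no motion to normalize")
    return [(v - mean) / sd for v in velocity]


def percentile_normalize(velocity: Sequence[float], pct: float = 99.0) -> List[float]:
    """Scale by the pct-th percentile of |v| and clip to [-1, 1] (identity variant)."""
    mags = sorted(abs(v) for v in velocity)
    rank = max(0, math.ceil(pct * len(mags) / 100) - 1)   # nearest-rank percentile
    scale = mags[rank]
    if scale == 0.0:
        raise ValueError("all-zero sequence: no motion to normalize")
    return [max(-1.0, min(1.0, v / scale)) for v in velocity]


velocity = [0.02 * math.sin(0.1 * n) + 0.005 for n in range(200)]
centered = remove_dc(velocity)
print(max(percentile_normalize(centered)), statistics.pstdev(zscore_normalize(centered)))
```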
[0080] Physical basis and delay compensation for time-domain alignment: Temporal alignment of the speech and motion signals is a prerequisite for accurate cross-correlation calculation, and their synchronization has a clear physiological basis: when humans speak, there is a strict causal relationship between the movement of the vocal organs and the production of speech. For example, when producing the plosives / p / and / b / , the lips first close to obstruct airflow and then open quickly, and the burst of airflow forms the energy peak of the speech. The time difference between the maximum speed of lip opening and the peak of speech energy is no more than 1 ms. When producing the vowel / a / , the mouth opening is at its maximum when the jaw descends to its lowest point, and the speech energy reaches its peak. The time difference between the two is no more than 2 ms. Numerous experimental statistics show that in 99% of live samples, the time difference between the peak of the speech envelope and the peak of the motion envelope is within the range of [-2 ms, +2 ms].
[0081] To achieve accurate time-domain alignment, two types of system delay need to be compensated: Propagation delay compensation: In the acoustic channel, the voice signal takes time to travel from the user's mouth to the microphone; in the ultrasonic channel, the ultrasonic signal takes time to travel from the transmitting transducer to the face and back to the receiving transducer. Both propagation delays are proportional to the distance $d$ between the user and the device. The distance $d$ can be calculated from the time of flight of an ultrasonic echo: a 100-microsecond ultrasonic pulse is transmitted and the time difference $t_{\mathrm{tof}}$ between pulse transmission and echo reception is measured, so that $d = c\,t_{\mathrm{tof}} / 2$, where $c$ is the speed of sound. Shifting the speech envelope forward by the corresponding propagation delay compensates for the transmission delay.
[0082] Inherent system delay compensation: The ADC and filter circuits of the two channels introduce a fixed inherent delay, which is calibrated at the factory: using a synchronized acoustic and vibration signal source, the signals of the two channels are acquired and the peak time difference $\tau_0$ between them is calculated. This value is stored in the device's non-volatile memory, and during processing the speech envelope is shifted forward by $\tau_0$ to compensate for the inherent delay.
[0083] After the above two types of delay compensation, the time axes of the speech envelope and the motion envelope are completely aligned, which can be used for subsequent cross-correlation calculations. Its core physical logic is: the burst point of the sound (such as the plosive / p / ) must strictly coincide with the maximum velocity point of the oral cavity opening on the time axis.
[0084] Step 5: Perform cross-modal cross-correlation calculation on the time-domain aligned speech energy envelope and motion envelope output from Step 4. Based on the peak significance and time delay reasonableness of the cross-correlation curve, determine the liveness of the target speaker and exclude spoofed speech attacks without corresponding vocalization motion. Further, match the cross-correlation features with the pre-stored user-specific physical feature templates. If the match is successful, the identity verification of the target speaker is completed. By performing cross-correlation calculations, it is determined whether the voice signal and motion signal are synchronized, thereby achieving liveness detection and identity recognition.
[0085] Calculation and parameter setting of the normalized cross-correlation function: The cross-correlation function measures the similarity between two signals at different time delays. To eliminate the influence of the energy difference between the two signals on the result, this scheme uses the normalized cross-correlation coefficient (i.e., the Pearson correlation coefficient). The specific calculation method is as follows: Discrete cross-correlation calculation: Let the aligned speech envelope sequence be $x[n]$ and the motion envelope sequence be $y[n]$. Both sequences have a length of $N$ and a sampling rate of $f_s$. The normalized cross-correlation coefficient $\rho[k]$ is then computed with the discrete counterpart of the continuous-domain formula: $\rho[k] = \frac{1}{N\sigma_x\sigma_y}\sum_{n}(x[n]-\mu_x)(y[n+k]-\mu_y)$, where the sum runs over all $n$ for which both indices fall within the sequences. Here $k$ is the delay index, and the corresponding actual delay is $\tau = k / f_s$; $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x$ is the standard deviation of $x$, and $\sigma_y$ is the standard deviation of $y$. The value range of $\rho[k]$ is [-1, 1]. The closer the value is to 1, the stronger the correlation between the two signals at that time delay.
[0086] Delay search range setting: According to physiological statistics, the delay between speech and motion will not exceed ±10ms. Therefore, the search range of k is set to [-20,20] (corresponding to a delay of ±10ms, 1ms corresponds to 2 sampling points at a 2kHz sampling rate). This ensures coverage of all possible live latency ranges while avoiding random peak interference caused by an excessively large search range.
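Because only 41 lags are needed, the normalized cross-correlation can also be evaluated directly, which is what the sketch below does (it is the plain summation, not the FFT method). The function name `normalized_xcorr` is hypothetical, and the lag convention, in which a positive k means the motion envelope lags the speech envelope, is an assumption.

```python
import math
import statistics
from typing import List, Sequence


def normalized_xcorr(x: Sequence[float], y: Sequence[float], max_lag: int = 20) -> List[float]:
    """Normalized cross-correlation for lags -max_lag..max_lag.

    Element [k + max_lag] holds the coefficient for lag k, pairing x[n] with y[n + k].
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    if sx == 0.0 or sy == 0.0:
        raise ValueError("constant sequence: correlation undefined")
    rho = []
    for k in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + k
            if 0 <= j < n:                    # skip terms that fall off either end
                acc += (x[i] - mx) * (y[j] - my)
        rho.append(acc / (n * sx * sy))
    return rho


# A smooth bump and a copy delayed by 3 samples (1.5ms at 2kHz).
x = [math.exp(-((n - 100) / 10.0) ** 2) for n in range(200)]
y = [math.exp(-((n - 103) / 10.0) ** 2) for n in range(200)]
rho = normalized_xcorr(x, y)
best = max(range(len(rho)), key=lambda i: rho[i])
print(best - 20)   # lag index of the peak
```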
[0087] To improve computational efficiency and suit real-time processing in embedded devices, the FFT-based fast cross-correlation algorithm is adopted: the two sequences are zero-padded to a length of at least $2N-1$ (so that the circular correlation computed by the FFT equals the linear correlation), the FFT of each sequence is computed, the conjugate of the FFT of the speech envelope is multiplied by the FFT of the motion envelope, and the IFFT of the product gives the cross-correlation sequence. The computational complexity is reduced from $O(N^2)$ to $O(N \log N)$.
[0088] Liveness Authentication Logic and Threshold Settings
[0089] The core logic of liveness detection is as follows: if the speech originates from a live speaker in front of the device, the speech envelope and the motion envelope are very strongly correlated, and the cross-correlation curve shows a significant peak within a very small time delay. If the input is a recording playback or a static mask attack, there is no corresponding movement of the vocal organs, and the cross-correlation curve shows no significant peak. The specific determination process is as follows: Peak extraction: Find the maximum value $\rho_{\max}$ of the cross-correlation coefficient within the search range and the corresponding peak delay $\tau_{\mathrm{peak}}$.
[0090] Peak significance determination: Calculate the mean $\mu_\rho$ and standard deviation $\sigma_\rho$ of all cross-correlation coefficients within the search range. The peak significance is defined as $S = (\rho_{\max} - \mu_\rho) / \sigma_\rho$, i.e., the number of standard deviations by which the peak exceeds the average level.
[0091] Judgment rule: The speaker is judged to be live if all three of the following conditions are met: $\rho_{\max} \ge T_\rho$, where $T_\rho$ is the cross-correlation threshold, with a default value of 0.65; $S \ge T_S$, where $T_S$ is the significance threshold, with a default value of 3; and $|\tau_{\mathrm{peak}}| \le 2$ms, that is, the peak delay is within the physiologically reasonable range of ±2ms.
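The three-condition rule can be sketched as a single function over the 41-point cross-correlation curve. The function name `liveness_decision` and the parameter names are hypothetical. Defining significance as (peak - mean) / standard deviation and using non-strict comparisons are assumptions, since the exact expressions are not preserved in this text.

```python
import statistics
from typing import Sequence, Tuple


def liveness_decision(
    rho: Sequence[float],
    fs: float = 2000.0,
    max_lag: int = 20,
    rho_threshold: float = 0.65,
    significance_threshold: float = 3.0,
    max_delay_ms: float = 2.0,
) -> Tuple[bool, float, float, float]:
    """Return (is_live, peak value, peak significance, peak delay in ms)."""
    peak_idx = max(range(len(rho)), key=lambda i: rho[i])
    peak = rho[peak_idx]
    delay_ms = (peak_idx - max_lag) / fs * 1000.0
    mean = statistics.fmean(rho)
    sd = statistics.pstdev(rho)
    significance = (peak - mean) / sd if sd > 0.0 else 0.0
    is_live = (
        peak >= rho_threshold
        and significance >= significance_threshold
        and abs(delay_ms) <= max_delay_ms
    )
    return is_live, peak, significance, delay_ms


# A sharp peak of 0.9 at lag +2 samples (1ms) over a flat background of 0.05.
curve = [0.05] * 41
curve[22] = 0.9
print(liveness_decision(curve))
```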
[0092] The threshold settings can be adjusted according to the scenario: in a quiet environment, the cross-correlation threshold $T_\rho$ can be increased to 0.7 and the significance threshold $T_S$ to 3.5 to reduce the false positive rate; in a noisy environment, $T_\rho$ can be reduced to 0.6 and $T_S$ to 2.5 to improve the pass rate. The thresholds were determined from large-scale test data: 10,000 live samples from users of different ages, genders, and accents were collected, along with 10,000 attack samples (including audio playback, 3D masks, dynamic simulated faces, etc.). ROC curves were plotted, and the threshold with the lowest equal error rate (EER) was selected. This theoretically eliminates the possibility of audio playback attacks: since recordings have no corresponding real-time ultrasonic frequency shift, they cannot generate the required cross-correlation peak.
[0093] Feature matching and template management for identity recognition: The core logic of identity verification is that different people's vocal habits (glottis opening and sound emission delay, jaw movement amplitude, etc.) constitute unique physical fingerprints that cannot be forged or imitated. These differences are reflected in the shape of the cross-correlation curve, and identity verification can be achieved through feature matching. The specific implementation process is as follows: Feature Template Construction: During user registration, 3-5 valid voice and motion data samples are collected, each lasting 2.5 seconds. For each sample, a cross-correlation sequence within the range of [-10ms, +10ms] is calculated, forming a 41-dimensional feature vector (corresponding to 41 values of k from -20 to +20). The average of these 3-5 feature vectors is then taken to obtain the user's unique feature template, which is stored in an encrypted database. The template is stored using hash encryption, ensuring that even if the database is leaked, the user's biometric information cannot be recovered.
[0094] Feature matching: During authentication, the cross-correlation feature vector calculated in real time is compared with the pre-stored template by computing the Euclidean distance D: $D = \sqrt{\sum_{i=1}^{41} (x_i - t_i)^2}$, where $x_i$ is the i-th value of the real-time feature vector and $t_i$ is the i-th value of the template feature vector.
[0095] Judgment rule: If the distance D is less than the matching threshold, the match is successful and the user's identity is considered valid. The matching threshold is set to 0.15 by default.
[0096] This threshold was also determined from large-scale test data: 100,000 matching data points were collected from 10,000 users, where the average distance for positive matches (the same person) was 0.08 and the average distance for negative matches (different people) was 0.32. With the threshold set to 0.15, the recognition accuracy is 99.89% and the false recognition rate is 0.02%, which meets financial-grade security requirements.
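As a small illustration of the matching step, the sketch below computes the Euclidean distance between a live feature vector and a stored template and compares it with the default threshold of 0.15. The function names are hypothetical, and the strict "less than" comparison is an assumption.

```python
import math

MATCH_THRESHOLD = 0.15  # default matching threshold quoted in the text


def euclidean_distance(live, template):
    """Euclidean distance D between two feature vectors of equal dimension."""
    if len(live) != len(template):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - t) ** 2 for x, t in zip(live, template)))


def is_match(live, template, threshold=MATCH_THRESHOLD):
    """Accept the identity when D is below the threshold (assumed strict)."""
    return euclidean_distance(live, template) < threshold
```

For a 41-dimensional vector, a uniform per-element deviation of 0.01 gives D of about 0.064 (accepted), while a deviation of 0.05 gives D of about 0.32 (rejected), which is in line with the positive and negative averages reported above.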
[0097] Dynamic template updates: Since a user's vocal habits may change over time (due to weight changes, orthodontic treatment, aging, etc.), after each successful authentication the current feature vector is merged into the template with a weight of 0.1, that is, new template = 0.9 × old template + 0.1 × new feature vector. This updates the template dynamically and preserves accuracy over long-term use.
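The update rule is an exponential moving average and can be sketched as below; `update_template` is a hypothetical helper name, and the default weight of 0.1 is the value given in the text.

```python
def update_template(old_template, new_features, weight=0.1):
    """Return (1 - weight) * old + weight * new, element-wise."""
    if len(old_template) != len(new_features):
        raise ValueError("template and feature vector must have the same dimension")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * o + weight * n
            for o, n in zip(old_template, new_features)]
```

With a weight of 0.1, the contribution of the original enrollment template decays as 0.9^n after n successful authentications, so it falls to roughly half after about 7 updates (0.9^7 ≈ 0.478).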
[0098] Attack prevention and anomaly handling: This solution blocks most common identity spoofing attacks at the level of the underlying physical principles. The specific prevention logic is as follows: Recording replay attack: When the attacker plays a pre-recorded audio message, there is no corresponding movement of the vocal organs, so the motion envelope is random noise and the cross-correlation coefficient is low, typically less than 0.3. The liveness criteria therefore cannot be met, and the interception rate can reach 100%.
[0099] 3D mask attack: 3D masks are static objects, and the variance of their motion envelope is close to 0, so they will be judged as invalid features, with an interception rate of up to 100%.
[0100] Dynamic simulated face attack: Even if the attacker uses a movable robotic face to simulate vocalization, the synchronization between its movement and speech, as well as the shape of its cross-correlation curve, are significantly different from human physiological characteristics, making it impossible to match the pre-stored user template.
[0101] Brute-force attack: If authentication fails 5 times in a row, the system will automatically lock the account for 10 minutes and trigger an anomaly alarm to prevent brute-force attacks.
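The lockout policy for repeated failures can be sketched as a small state machine. The class name `LockoutGuard` and its interface are hypothetical; the limits (5 consecutive failures, 10-minute lock) are the values given in the text, and timestamps are passed in explicitly as seconds so the behavior is easy to test. Raising the anomaly alarm is left to the caller, signaled by the return value.

```python
class LockoutGuard:
    """Track consecutive authentication failures for one account."""

    MAX_FAILURES = 5
    LOCK_SECONDS = 10 * 60

    def __init__(self):
        self.failures = 0
        self.locked_until = None

    def is_locked(self, now):
        """True while the lock window that started earlier is still active."""
        return self.locked_until is not None and now < self.locked_until

    def record(self, success, now):
        """Record one attempt; return True if it triggered a lock (alarm)."""
        if self.is_locked(now):
            raise RuntimeError("account is locked")
        if success:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.MAX_FAILURES:
            self.locked_until = now + self.LOCK_SECONDS
            self.failures = 0
            return True
        return False
```

A successful authentication resets the counter, so only failures in a row count toward the limit.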
[0102] Handling of abnormal scenarios: If the collected voice or motion features are invalid, the system gives a clear prompt, such as "Please move closer to the device", "Please increase the volume", or "Please do not cover your face", to guide the user through the authentication; if the ambient noise is too loud, the system prompts the user to move to a quiet environment before authenticating, which improves the authentication success rate.
[0103] The foregoing has only described certain exemplary embodiments of the present invention by way of illustration. Undoubtedly, those skilled in the art can modify the described embodiments in various ways without departing from the spirit and scope of the present invention. Therefore, the foregoing drawings and descriptions are illustrative in nature and should not be construed as limiting the scope of protection of the claims of the present invention.
[0104] It should be noted that, in this document, the use of relational terms such as "first" and "second" is merely for distinguishing one entity or operation from another, and does not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0105] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0106] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0107] Those skilled in the art will understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above are not repeated here; reference may be made to the corresponding processes in the foregoing method embodiments.
[0108] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0109] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0110] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A dynamic quality-perceived voiceprint speaker recognition method using dual-model fusion, characterized in that it comprises: Step 1: generating a continuous single-frequency ultrasonic detection signal, transmitting the ultrasonic detection signal to the target speaker to be authenticated, and at the same time retaining the original ultrasonic detection signal as a reference signal; Step 2: acquiring, through an acoustic wave acquisition channel, the air-conducted speech signal emitted by the target speaker, and acquiring, through an ultrasonic echo acquisition channel, the ultrasonic echo signal, with both acquisition channels acquiring synchronously; Step 3: using the reference signal from Step 1, coherently demodulating the ultrasonic echo signal acquired in Step 2, extracting the baseband signal, and performing phase unwrapping on the baseband signal to obtain the motion envelope; Step 4: extracting the speech energy envelope from the air-conducted speech signal acquired in Step 2, and aligning the speech energy envelope with the motion envelope in the time domain; Step 5: performing cross-modal cross-correlation calculation on the speech energy envelope and the motion envelope to determine liveness, and matching the cross-correlation features with pre-stored user physical feature templates to verify the identity of the target speaker.
2. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 1, characterized in that, the generation process of the continuous single-frequency ultrasonic detection signal includes: the transmitted signal is a continuous single-frequency sine wave $s(t) = A\sin(2\pi f_c t)$, where $f_c$ = 40 kHz is the carrier frequency, $A$ is the amplitude of the transmitted signal, and $t$ is time; a DAC is used to generate the sine wave signal, and a high-precision 2.5 V reference source is selected as the reference voltage of the DAC; a power amplifier is selected and the transmission power is adjusted according to the interaction distance; a surface acoustic wave filter is connected in series between the power amplifier and the transducer to suppress the amplitude of the second harmonic.
3. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 2, characterized in that, Before transmitting the ultrasonic detection signal, a multi-stage alignment calibration is performed on the ultrasonic transmission beam to ensure that the beam completely covers the sound-related area of the target to be detected. The specific calibration process is as follows: Fix the device to the test bench, place a reflector at a preset calibration distance in front of the device, adjust the physical installation angle of the ultrasonic transducer component or the beamforming weight of the MEMS ultrasonic array until the amplitude of the received echo signal reaches the maximum value. At this time, the center of the main lobe of the beam is aligned with the area directly in front of the preset calibration distance. Write the current beam weight or physical angle parameter into the non-volatile memory of the device as the default transmission parameter. When a user uses the device for the first time, the system prompts the user to look directly at the device and issue a preset calibration voice. The system automatically adjusts the beamforming weights based on the echo amplitude distribution collected by the MEMS ultrasound array until the echo amplitude covering the user's facial area reaches the maximum value. After calibration, the current beam weights are stored in the corresponding user's personal configuration file. During operation, the device periodically detects the average amplitude of the echo. If the amplitude drops below a preset fluctuation threshold, it determines that the user's position has shifted and automatically adjusts the beam direction to re-align with the user's face.
4. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 1, characterized in that, In step two, the acoustic wave acquisition channel and the ultrasonic echo acquisition channel acquire data synchronously, which is achieved through the following method: The same high-precision clock source is used as the common clock for the sampling and ultrasonic transmission circuits of the two channels. After frequency division, it provides a matching sampling clock for the acoustic wave channel acquisition unit, the ultrasonic echo channel acquisition unit, and the ultrasonic transmission unit. When sampling is initiated, the control unit outputs the same hardware trigger signal, which is simultaneously connected to the trigger terminals of the two acquisition channels to ensure that the sampling start times of the two channels are aligned. Before the equipment leaves the factory, a synchronous acoustic and vibration signal source is used to complete the synchronization accuracy verification. If the peak delay deviation of the two acquired signals exceeds the preset accuracy threshold, the clock circuit is recalibrated.
5. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 4, characterized in that, The specific implementation and acquisition triggering logic of the two acquisition channels in step two are as follows: The acoustic channel uses a low-noise acoustic-to-electric conversion device. The output signal is amplified by low noise and bandpass filtered to match the frequency range of human speech before being transmitted to the acquisition unit. Before data collection, environmental noise for a preset duration is detected, and the link gain level is adaptively adjusted based on the average amplitude of the environmental noise. The ultrasonic echo channel uses an acoustic-to-electric conversion device that matches the frequency of the ultrasonic transmitter. It suppresses the interference of the transmitted direct wave through a scheme of physical isolation between the transmitter and receiver plus a sound-absorbing structure or a time-domain isolation scheme of time-division multiplexing. The output signal is transmitted to the acquisition unit after low-noise amplification and bandpass filtering. The link gain is adaptively adjusted according to the echo amplitude. Before data acquisition, the echo signal for a preset duration is detected. If the average amplitude of the echo exceeds the preset reasonable range, the acquisition parameters are automatically adjusted. The system detects the short-time energy of the acoustic channel in real time. When the short-time energy of multiple consecutive detection windows exceeds a preset multiple of the average energy of the ambient noise, it determines that the target has started to emit sound and triggers synchronous acquisition of the two channels. The two collected data streams are first stored in a circular buffer. Upon triggering, the valid data segment containing the complete speech start segment is extracted for subsequent processing.
6. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 1, characterized in that, the ultrasonic echo signal acquired in step two is coherently demodulated to extract the baseband signal, specifically including: generating two orthogonal reference signals at the carrier frequency $f_c$, namely the I-channel reference signal $r_I(t)$ and the Q-channel reference signal $r_Q(t)$, which differ in phase by 90°; the sampling rate of the reference signals is consistent with the sampling rate of the ultrasonic echo channel to ensure alignment with the sampling points of the echo signal; multiplying the collected echo signal $x(t)$ by each of the two reference signals to obtain the mixed signals $m_I(t) = x(t)\,r_I(t)$ and $m_Q(t) = x(t)\,r_Q(t)$; each mixed signal contains two components: a high-frequency component at $2f_c + f_d$, corresponding to the sum frequency of the reference signal and the echo signal, and a low-frequency component at $f_d$, corresponding to the difference frequency between the reference signal and the echo signal, where $f_d$ is the Doppler frequency shift; and performing I/Q imbalance correction using a correction matrix calculated from $\varphi$, the phase-difference deviation between the two reference signals, and $g$, the amplitude ratio of the two reference signals, so that after correction the I and Q signals are exactly orthogonal and equal in amplitude.
7. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 6, characterized in that, the process of performing phase unwrapping on the baseband signal to obtain the motion envelope is as follows: for the I and Q values at each sampling point $n$, calculate the wrapped phase $\phi_w[n] = \operatorname{atan2}(Q[n], I[n])$, whose value range is $[-\pi, \pi]$; phase unwrapping uses adjacent-point differential correction, with the following specific steps: starting from the second sampling point, calculate the difference $\Delta\phi = \phi_w[n] - \phi_u[n-1]$ between the wrapped phase of the current sampling point and the unwrapped phase of the previous sampling point; if $\Delta\phi > \pi$, subtract $2\pi$ from the wrapped phase of the current sampling point; if $\Delta\phi < -\pi$, add $2\pi$ to the wrapped phase of the current sampling point; repeat until the absolute value of the difference is less than $\pi$, which gives the unwrapped phase $\phi_u[n]$ of the current sampling point; the unwrapped phase $\phi_u$ and the radial displacement $d$ of the reflecting surface satisfy $\phi_u = 4\pi d / \lambda$, where $\lambda$ is the wavelength of the 40 kHz ultrasound, and therefore the displacement is $d = \lambda \phi_u / (4\pi)$; differentiating the displacement $d$ with respect to time gives the radial velocity $v = \mathrm{d}d/\mathrm{d}t$.
8. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 1, characterized in that, envelope extraction of the air-conducted speech signal yields the speech energy envelope, specifically including: temporal energy correlation features are extracted from the collected air-conducted speech signal to obtain a speech envelope sequence that reflects the variation pattern of speech energy; temporal motion correlation features are extracted from the baseband signal obtained after coherent demodulation of the ultrasonic echo to obtain a motion envelope sequence that reflects the motion pattern of the vocal organs; the validity of the two types of envelope sequences is pre-verified, and invalid sequences without valid characteristic fluctuations are removed.
9. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 8, characterized in that, After completing the validity pre-verification of the two types of envelope sequences, the speech envelope sequence and the motion envelope sequence are aligned in the time domain by combining the signal propagation delay parameter in the air and the inherent transmission delay parameter of the system, so that the time domain relationship of the two types of sequences matches the physical causal logic of the sound production process. After alignment, the two types of envelope sequences are normalized to eliminate non-correlated differences in the amplitude dimension, retaining only the temporal fluctuation correlation features for subsequent matching.
10. The method for dynamic quality-perceived voiceprint speaker recognition based on dual-model fusion according to claim 1, characterized in that, step five specifically includes: performing a cross-correlation operation on the speech envelope sequence and the motion envelope sequence that have undergone time-domain alignment and normalization, to obtain a cross-correlation coefficient sequence covering the preset time delay search range; first, based on the peak prominence of the cross-correlation coefficient sequence and the reasonableness of the time delay corresponding to the peak, determining whether the signal to be detected comes from live vocalization, thereby rejecting fake audio attacks that have no corresponding vocal organ movement; if the voice is determined to come from a live person, using the cross-correlation coefficient sequence as a unique physiological feature of the user and matching it against a pre-stored legitimate user feature template to complete the verification and authentication of the user's identity.
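The signal chain of claims 6 and 7 (quadrature mixing, wrapped phase, adjacent-point unwrapping, displacement) can be illustrated with the following sketch. It rests on several assumptions that are not stated in the source: a sampling rate of 400 kHz, a sound speed of 343 m/s, a one-carrier-period moving average as the low-pass filter, the transmitted sine itself as the I reference with a cosine as the Q reference, and displacement counted positive toward the transducer. The source's own I/Q correction matrix is not legible, so `correct_iq` shows one common textbook form for a Q channel modeled as `gain * sin(theta + phase_err)`. All function names are hypothetical.

```python
import math

FS = 400_000.0           # assumed sampling rate in Hz (not given in the text)
FC = 40_000.0            # carrier frequency from claim 2
WAVELENGTH = 343.0 / FC  # assumes a sound speed of 343 m/s in air


def demodulate(echo, fs=FS, fc=FC):
    """Mix with quadrature references, then low-pass with a moving average."""
    n_avg = round(fs / fc)  # one carrier period; suppresses the 2*fc product
    mixed_i = [x * math.sin(2 * math.pi * fc * n / fs) for n, x in enumerate(echo)]
    mixed_q = [x * math.cos(2 * math.pi * fc * n / fs) for n, x in enumerate(echo)]
    i_bb, q_bb = [], []
    for end in range(n_avg, len(echo) + 1):
        i_bb.append(sum(mixed_i[end - n_avg:end]) / n_avg)
        q_bb.append(sum(mixed_q[end - n_avg:end]) / n_avg)
    return i_bb, q_bb


def correct_iq(i, q, gain, phase_err):
    """Assumed textbook correction for Q = gain * sin(theta + phase_err)."""
    return i, (q / gain - i * math.sin(phase_err)) / math.cos(phase_err)


def unwrap(wrapped):
    """Adjacent-point differential correction of a wrapped phase sequence."""
    out = [wrapped[0]]
    for w in wrapped[1:]:
        diff = w - out[-1]
        while diff > math.pi:
            w -= 2 * math.pi
            diff -= 2 * math.pi
        while diff < -math.pi:
            w += 2 * math.pi
            diff += 2 * math.pi
        out.append(w)
    return out


def displacement(i_bb, q_bb, wavelength=WAVELENGTH):
    """Radial displacement d = wavelength * unwrapped_phase / (4 * pi)."""
    wrapped = [math.atan2(q, i) for i, q in zip(i_bb, q_bb)]
    return [wavelength * p / (4 * math.pi) for p in unwrap(wrapped)]
```

A synthetic echo of the form `sin(2*pi*FC*t + 4*pi*d(t)/WAVELENGTH)` with a 3 mm peak displacement swings the phase beyond ±π, so it exercises the unwrapping branch; each output sample corresponds to one averaging window and therefore lags the input by about half a carrier period.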