A speech recognition method and system based on domain dynamic constraints and physical feature separation measurement
By employing domain-restricted hash lookup and physical feature separation measurement in the speech recognition system, the problems of high computational load and high latency in existing technologies are solved, achieving low-latency, high-precision domain-adaptive speech recognition and outputting structured speaker physical state data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 王秉钦
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-12
AI Technical Summary
Existing automatic speech recognition systems suffer from high computational costs and latency in open vocabulary spaces, and cannot effectively output the speaker's physical acoustic state information, resulting in low recognition accuracy and inflexible vocabulary updates.
The language model inference layer is replaced by a domain-restricted hash lookup table. Combined with a three-level waterfall lookup table and physical feature separation measurement, the structured speaker physical state data is output and adaptively adjusted through semantic closed-loop feedback.
It significantly reduces inference latency, improves recognition accuracy, outputs precise physical feature data, enables hot-swappable expansion of the lexicon, and enhances the system's adaptability.
Figure CN122201295A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of speech recognition, and in particular to a speech recognition method and system that replaces the probabilistic search of the speech recognition inference layer with a domain-restricted hash lookup table, while separating physical feature measurement from semantic judgment.
Background Technology
[0002] Existing automatic speech recognition (ASR) systems typically employ an end-to-end deep learning architecture comprising an acoustic encoder and a language model inference layer. The acoustic encoder converts audio signals into acoustic features and candidate token sequences, while the language model inference layer determines the final recognized text in an open vocabulary space using probabilistic methods such as beam search.
[0003] The existing technology has the following problems: First, the language model inference layer performs probabilistic searches in an open vocabulary space of more than 100,000 words, which involves a large amount of computation and high latency (usually 100-300 milliseconds), becoming the main bottleneck of end-to-end latency.
[0004] Second, the open vocabulary space leads to cross-domain misidentification. For example, in a speech technology discussion scenario, "formant" might be misidentified as "gongzhengfeng," and "mel spectrum" might be misidentified as "meier spectrum." Traditional methods rely on larger language models to improve recognition accuracy, but this further increases computational cost and latency.
[0005] Third, existing ASR systems, while outputting recognized text, typically do not provide information about the speaker's physical acoustic state (such as fundamental frequency, speech rate, vocal cord stability, etc.), or only provide coarse-grained emotion classification labels. Downstream AI systems cannot obtain accurate physical measurement data, limiting their ability to understand the user's state.
[0006] Fourth, existing systems typically require retraining or fine-tuning the model to update the vocabulary, making it impossible to achieve hot-swappable vocabulary expansion.
[0007] Therefore, there is a need for a speech recognition method that can significantly reduce inference latency, improve domain recognition accuracy, and output structured physical feature data while retaining all the capabilities of the acoustic model.
Summary of the Invention
Technical problem to be solved
[0008] The technical problem to be solved by this invention is how to achieve low-latency, high-precision domain-adaptive speech recognition and synchronously output structured speaker physical state data by replacing the inference layer without modifying the acoustic encoder.
Technical solution
[0009] To address the aforementioned technical problem, this invention provides a speech recognition method based on domain dynamic constraints and physical feature separation measurement, comprising the following steps:
S1. Acoustic encoding step: Receive an audio signal, extract acoustic features through a pre-trained acoustic encoder, and generate a candidate token sequence with corresponding confidence scores. The acoustic encoder is the acoustic front end of an existing ASR system, and this invention does not modify it.
S2. Domain label acquisition step: Receive the domain label of the current dialogue from the downstream AI system. Domain labels are selected from a predefined domain classification tree and may not be generated freely, which gives a zero-hallucination label constraint. When no domain label is available, the general full lexicon is used.
S3. Domain-restricted text correction step: Load the restricted lexicon corresponding to the domain label and apply a three-level waterfall lookup correction to the candidate token sequence:
- Level 1 (L1), complete matching: perform a complete hash match between candidate text fragments and the domain lexicon, with a latency of less than 0.5 milliseconds.
- Level 2 (L2), similar-sound matching: for fragments not matched at L1, perform a generalized match in the domain lexicon according to the similar-sound rule (same initial consonant, similar final), with a latency of less than 1 millisecond.
- Level 3 (L3), pinyin fallback: for fragments still not matched at L2, perform a fallback match in the basic character set based on the pinyin of each character, with a latency of less than 1 millisecond.
The three-level waterfall ensures that the lookup never fails silently and always outputs text.
S4. Physical feature extraction step: In parallel, extract multi-dimensional physical acoustic features from the same audio signal, including but not limited to: median, mean, and range of fundamental frequency (F0); fundamental frequency change rate; speech rate (words / second); root mean square (RMS) energy; energy change trend; fundamental frequency jitter; amplitude jitter (shimmer); harmonic-to-noise ratio (HNR); pause duration and pause frequency; spectral centroid and spectral tilt. Physical feature extraction only performs measurement and does not perform any semantic judgment or sentiment classification.
S5. FAME eight-dimensional mapping step: After normalizing the physical features, map them to the measurable dimensions of the FAME eight-dimensional emotional space. The eight dimensions are satisfaction (μ), curiosity (χ), empathy (ε), confidence (κ), anxiety (ν), conflict (δ), fatigue (ρ), and creativity (λ). The dimensions that can be mapped directly from the physical features are curiosity, confidence, anxiety, and fatigue. The remaining dimensions are supplemented by the downstream AI system based on semantic information.
S6. BVC status code packaging step: Pack the corrected recognized text, confidence score, domain label, physical feature vector, and FAME eight-dimensional raw values into a structured BVC (Behavior-Speech Coding) status code and output it to the downstream AI system.
S7. Semantic closed-loop feedback step: The downstream AI system updates its understanding of the dialogue context based on the received BVC status code and dynamically adjusts the domain label sent to step S2, forming Semantic Closed-loop Adaptive Correction (SCAC).
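The three-level waterfall of step S3 can be illustrated with a minimal sketch. It is not the patented implementation: the class `DomainLexicon`, the helper `fuzzy_key`, the `SIMILAR_FINALS` table (which treats front and back nasal finals as "similar"), and the sample entries are all assumptions made for illustration. The sketch also assumes each candidate fragment arrives with its pinyin syllables, since the source does not say how pinyin is obtained.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Pinyin initials, two-letter ones first so "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
# Assumed similarity rule: front and back nasal finals count as "similar".
SIMILAR_FINALS = {"ing": "in", "eng": "en", "ang": "an"}


def fuzzy_key(syllables):
    """Collapse each syllable to (initial, similar-final class)."""
    key = []
    for syllable in syllables:
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        final = syllable[len(initial):]
        key.append((initial, SIMILAR_FINALS.get(final, final)))
    return tuple(key)


@dataclass
class DomainLexicon:
    exact: set[str] = field(default_factory=set)             # L1 hash set
    fuzzy: dict[tuple, str] = field(default_factory=dict)    # L2 sound index
    base_chars: dict[str, str] = field(default_factory=dict) # L3 syllable -> char

    @classmethod
    def build(cls, terms, base_chars):
        """terms maps each domain term to its tuple of pinyin syllables."""
        lexicon = cls(base_chars=dict(base_chars))
        for term, syllables in terms.items():
            lexicon.exact.add(term)
            lexicon.fuzzy[fuzzy_key(syllables)] = term
        return lexicon

    def correct(self, text, syllables):
        """Return (corrected text, level that produced it)."""
        if text in self.exact:                        # L1: complete match
            return text, "L1"
        term = self.fuzzy.get(fuzzy_key(syllables))   # L2: similar sound
        if term is not None:
            return term, "L2"
        # L3: per-syllable fallback; an unknown syllable is passed through,
        # so some text is always produced.
        return "".join(self.base_chars.get(s, s) for s in syllables), "L3"


# Hypothetical speech-technology lexicon with a single term ("formant").
lexicon = DomainLexicon.build(
    terms={"共振峰": ("gong", "zhen", "feng")},
    base_chars={"mei": "梅", "er": "尔"},
)
print(lexicon.correct("公正峰", ("gong", "zheng", "feng")))  # ('共振峰', 'L2')
```

Every level is a dictionary or set lookup, so the cost per fragment does not grow with the size of the open vocabulary; only the L2 key construction depends on fragment length.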
[0010] The present invention also provides a speech recognition system for implementing the above method, comprising:
- an acoustic encoding module, used to perform step S1;
- a domain label management module, used to perform step S2 and to maintain the predefined domain classification tree and the corresponding restricted lexicons;
- a three-level waterfall lookup engine, used to perform step S3, including the L1 hash complete matcher, the L2 similar-sound generalization matcher, and the L3 pinyin fallback matcher;
- a physical feature extraction module, used to perform step S4;
- a FAME mapping module, used to perform step S5;
- a BVC packaging module, used to perform step S6;
- a closed-loop feedback interface, used to perform step S7.
Description of the Drawings
Figure 1 is a system architecture diagram of the speech recognition method and system based on domain dynamic constraints and physical feature separation measurement according to the present invention. It shows the complete data flow from microphone audio input, through the candidate token sequence generated by the acoustic encoder (S1), into the ThinkingEars inference layer. It also shows the connections and data interactions between the modules: domain label management (S2), three-level waterfall lookup correction (S3, including L1 complete matching, L2 similar-sound matching, and L3 pinyin fallback), physical feature extraction (S4), FAME eight-dimensional mapping (S5), BVC status code packaging (S6), and the downstream AI system dynamically updating domain labels through the SCAC semantic closed loop (S7). Performance comparison data against traditional inference methods is also marked.
Figure 2 is a flowchart of the method.
Beneficial effects
[0011] Compared with the prior art, the present invention has the following beneficial effects: 1. Significantly reduced inference latency. Replacing the language model inference layer (100-300 ms) with hash lookup (<2 ms) reduces end-to-end latency from 200-500 ms to 25-55 ms, a 4- to 10-fold improvement.
[0012] 2. Significantly improved domain recognition accuracy. Domain labels supplied by the downstream AI system narrow the search space from hundreds of thousands of words to thousands, greatly reducing homophone ambiguity. The more finely a domain is subdivided, the smaller the search space and the higher the accuracy.
[0013] 3. Acoustic model capabilities are fully preserved. Because the acoustic encoder is not modified, the system inherits everything the existing acoustic model has already learned about accent robustness, noise immunity, and mixed Chinese-English speech, and it automatically benefits from the evolution of upstream acoustic models.
[0014] 4. Decoupling of physical measurement and semantic judgment. ThinkingEars only reports accurate physical measurement values and does not perform semantic judgments such as sentiment classification, thus avoiding the error propagation caused by the "classification before use" approach in traditional sentiment recognition.
[0015] 5. Zero-hallucination labeling mechanism. Domain labels are selected from a predefined classification tree rather than generated freely, eliminating labeling errors caused by AI hallucinations.
[0016] 6. Hot updates to the lexicon without retraining. Adding entries or categories, or adjusting labels, takes effect immediately, so users can expand the professional lexicon themselves. The system is designed to become more accurate with use.
[0017] 7. Reuse of a single source vocabulary. The TTS (VocalDNA) and ASR (ThinkingEars) platforms share the same vocabulary annotation system, allowing for one-time annotation and reuse across both platforms, thus reducing maintenance costs.
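The hot-update behavior of paragraph [0016], together with the predefined label tree of paragraph [0015], can be sketched as an in-memory store. This is only an illustration under assumptions: the class `HotSwapLexicon`, its method names, and the `"root"` node are hypothetical, and persistence and concurrency are left out.

```python
class HotSwapLexicon:
    """In-memory restricted lexicon; every update is visible to the next lookup."""

    def __init__(self):
        self._parent = {"root": None}           # domain classification tree
        self._terms_by_domain = {"root": set()}

    def _require(self, label):
        # Labels must already exist in the tree; free-form labels are rejected.
        if label not in self._parent:
            raise ValueError(f"unknown domain label: {label!r}")

    def add_domain(self, label, parent="root"):
        """Operation 1: add a domain category under an existing node."""
        self._require(parent)
        if label in self._parent:
            raise ValueError(f"domain already exists: {label!r}")
        self._parent[label] = parent
        self._terms_by_domain[label] = set()

    def add_entry(self, term, domain):
        """Operation 2: add a new entry to a domain."""
        self._require(domain)
        self._terms_by_domain[domain].add(term)

    def tag_entry(self, term, domain):
        """Operation 3: append a further domain label to an existing entry."""
        self._require(domain)
        if not any(term in terms for terms in self._terms_by_domain.values()):
            raise KeyError(term)
        self._terms_by_domain[domain].add(term)

    def contains(self, term, domain):
        """Set lookup against the restricted lexicon of one domain."""
        self._require(domain)
        return term in self._terms_by_domain[domain]
```

Because the lexicon is plain data rather than model weights, an update is a set insertion and no retraining step is involved, which matches the property the paragraph describes.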
Claims
1. A speech recognition method based on domain dynamic constraints and physical feature separation measurement, characterized in that it comprises the following steps:
S1. Receive an audio signal, extract acoustic features through a pre-trained acoustic encoder, and generate a candidate token sequence and confidence scores;
S2. Receive the domain label of the current dialogue from the downstream AI system, wherein the domain label is selected from a predefined domain classification tree;
S3. Load the corresponding restricted lexicon according to the domain label, and perform a three-level waterfall lookup correction on the candidate token sequence, including L1 complete matching, L2 similar-sound matching, and L3 pinyin fallback matching;
S4. Synchronously extract multi-dimensional physical acoustic features from the audio signal;
S5. After normalizing the physical features, map them to the measurable dimensions of the FAME eight-dimensional emotional space;
S6. Pack the corrected recognized text, the physical feature vector, and the FAME eight-dimensional values into a structured BVC status code for output;
S7. The downstream AI system dynamically adjusts the domain label based on the BVC status code to form a semantic closed-loop adaptive correction.
2. The method according to claim 1, characterized in that the three-level waterfall lookup correction in step S3 specifically includes: Level L1: perform a hash-based complete match between candidate text fragments and the domain lexicon, with a matching latency of less than 0.5 milliseconds; Level L2: for fragments not matched at Level L1, perform a generalized match in the domain lexicon according to the similar-sound rule of same initial consonant and similar final, with a matching latency of less than 1 millisecond; Level L3: for fragments still not matched at Level L2, perform a fallback match in the basic character set based on the pinyin of each character, with a matching latency of less than 1 millisecond.
3. The method according to claim 1, characterized in that the acquisition of domain labels in step S2 further includes: when there is no domain label in the initial stage of the conversation, loading the general full lexicon; after one to three rounds of dialogue, the downstream AI system locking a domain label from the predefined domain classification tree based on the dialogue content and switching to the restricted lexicon; when a topic switch is detected, dynamically updating the domain label and seamlessly switching the restricted lexicon; and when there is no valid signal for a continuous period of time, automatically falling back to general mode.
4. The method according to claim 1, characterized in that the physical acoustic features extracted in step S4 include: median and range of fundamental frequency, rate of change of fundamental frequency, speech rate, root mean square value of energy, trend of energy change, fundamental frequency perturbation, amplitude perturbation, harmonic-to-noise ratio, pause duration, pause frequency, spectral centroid and spectral tilt; the physical feature extraction only performs measurement operations and does not perform semantic judgment.
5. The method according to claim 1, characterized in that the FAME eight-dimensional emotional space includes eight dimensions: satisfaction, curiosity, empathy, confidence, anxiety, conflict, fatigue, and creativity, wherein curiosity, confidence, anxiety, and fatigue are directly mapped from physical features, while the other dimensions are supplemented by the downstream AI system based on semantic information.
6. The method according to claim 1, characterized in that the BVC status code includes the following fields: corrected recognized text, confidence score, domain label, physical feature vector, and FAME eight-dimensional raw values.
7. The method according to claim 1, characterized in that the restricted lexicon supports hot updates, including three operations: adding entries, adding domain categories, and appending domain labels to existing entries; the updates take effect immediately without the need to retrain the model.
8. The method according to claim 1, characterized in that the restricted lexicon shares the same data source as the word annotation system of the text-to-speech system, so that a single word annotation can be used both for lookup correction on the speech recognition side and for synthesis parameter indexing on the text-to-speech side.
9. A speech recognition system based on domain dynamic constraints and physical feature separation measurement, characterized in that it comprises:
an acoustic encoding module, used to receive audio signals and generate candidate token sequences;
a domain label management module, used to maintain a predefined domain classification tree, receive domain labels issued by the downstream AI system, and load the corresponding restricted lexicon;
a three-level waterfall lookup engine, used to perform L1 complete matching, L2 similar-sound matching, and L3 pinyin fallback matching on the candidate token sequence in sequence;
a physical feature extraction module, used to synchronously extract multi-dimensional physical acoustic features from audio signals;
a FAME mapping module, used to map physical features to the FAME eight-dimensional emotional space;
a BVC packaging module, used to pack the corrected text, physical features, and FAME values into a structured status code for output.
10. The system according to claim 9, characterized in that it further comprises a closed-loop feedback interface, which is used to receive domain labels dynamically updated by the downstream AI system based on the BVC status code, so as to achieve semantic closed-loop adaptive correction.
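As an illustration of the measurement-only extraction in claim 4 and the status code fields in claim 6, the following sketch measures two of the listed features (RMS energy and pauses) and packs them into a record. It is a hedged example, not the claimed implementation: the names `measure_physical`, `pack_bvc`, and `BVCStatusCode`, the 20 ms frame, and the silence threshold are assumptions. The source gives no formula for mapping physical features to FAME values, so the sketch takes the measurable FAME values as input and leaves the semantic dimensions empty for the downstream AI system.

```python
from __future__ import annotations

import json
import math
from dataclasses import asdict, dataclass
from typing import Optional

FAME_DIMENSIONS = ("satisfaction", "curiosity", "empathy", "confidence",
                   "anxiety", "conflict", "fatigue", "creativity")
# Dimensions the source says can be mapped directly from physical features.
MEASURABLE = frozenset({"curiosity", "confidence", "anxiety", "fatigue"})


def measure_physical(samples, sample_rate, frame_ms=20, silence_rms=0.01):
    """Measure RMS energy and pauses; returns numbers only, no labels."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frame_rms = [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    silent = [rms < silence_rms for rms in frame_rms]
    # A pause is a run of consecutive silent frames; count the run starts.
    pause_count = sum(1 for i, s in enumerate(silent)
                      if s and (i == 0 or not silent[i - 1]))
    overall = math.sqrt(sum(x * x for x in samples) / len(samples)) if samples else 0.0
    return {
        "rms": overall,
        "pause_count": pause_count,
        "pause_total_s": sum(silent) * frame_len / sample_rate,
    }


@dataclass
class BVCStatusCode:
    text: str
    confidence: float
    domain_label: Optional[str]
    physical_features: dict
    fame_raw: dict

    def to_json(self):
        return json.dumps(asdict(self), ensure_ascii=False)


def pack_bvc(text, confidence, domain_label, physical, measured_fame):
    """Pack the claim 6 fields; semantic FAME dimensions stay None."""
    unknown = set(measured_fame) - MEASURABLE
    if unknown:
        raise ValueError(f"not directly measurable: {sorted(unknown)}")
    fame_raw = {dim: measured_fame.get(dim) for dim in FAME_DIMENSIONS}
    return BVCStatusCode(text, confidence, domain_label, dict(physical), fame_raw)
```

Keeping `measure_physical` free of any classification logic mirrors the separation the claims describe: the record carries raw values, and interpretation is left to whichever system consumes it.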