Speech recognition method and device, computer device and computer readable storage medium

By recording the timing of voice activity detection and calculating the delay duration, the audio data for voice recognition is supplemented, solving the problem of incomplete voice recognition caused by VAD delay and improving the coherence and reliability of intelligent voice interaction.

CN122245319A · Pending · Publication Date: 2026-06-19 · SHENZHEN TCL NEW-TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHENZHEN TCL NEW-TECH CO LTD
Filing Date: 2026-03-17
Publication Date: 2026-06-19

Abstract

This application provides a speech recognition method, apparatus, computer device, and computer-readable storage medium to ensure that speech recognition can completely recognize speech commands, avoid misinterpretation of user intent, and improve the coherence and reliability of intelligent voice interaction. The method includes: recording the first moment when speech activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a speech active state; determining the delay duration for speech activity detection on the real-time audio data based on the second moment and the first moment; determining retransmitted audio data based on the delay duration; generating audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after speech activity detection; and performing speech recognition on the audio data to be recognized to obtain a recognition result.

Description

Technical Field

[0001] This application relates to the field of intelligent voice interaction, and specifically to a speech recognition method, apparatus, computer device, and computer-readable storage medium.

Background Technology

[0002] With the development of artificial intelligence technology, intelligent voice interaction has become an important method of human-computer interaction. Among them, full-duplex streaming technology has gained widespread application because it supports continuous and natural voice dialogue between users and devices without the need for frequent triggering of wake words, thus improving the fluency of interaction. In the full-duplex streaming technology architecture, the Voice Activity Detection (VAD) module plays a key role. Its main function is to distinguish between speech signals and non-speech signals (such as background noise) in real time to determine the start and end boundaries of the speaker's voice, i.e., the active speech range. The output of the VAD module is used to control whether the subsequent Automatic Speech Recognition (ASR) module starts processing. By sending only the detected valid speech segments to the ASR module, the system's computational load can be effectively reduced, and the interference of background noise on the recognition results can be avoided, thereby ensuring the processing efficiency and recognition accuracy of the ASR system.

[0003] To ensure accuracy, existing VAD technology requires accumulating audio (200-500ms) to determine the presence of speech, which introduces an inherent delay. However, in full-duplex streaming, ASR systems process data in real time, and the first keyword of Chinese commands (such as "play") has a very short duration (120-180ms).

[0004] Therefore, the VAD delay often causes this crucial information to be missed when ASR is initiated, so that the front portion of the speech is lost. This can lead to problems such as incomplete ASR recognition (e.g., recognizing "play music" as "music") and misinterpretation of user intent, seriously affecting the reliability of interaction and the user experience. This is a problem that current technology urgently needs to solve.

Summary of the Invention

[0005] This application provides a speech recognition method, apparatus, computer device, and computer-readable storage medium to ensure that speech recognition can fully recognize speech commands, avoid misjudging user intent, and improve the coherence and reliability of intelligent voice interaction.

[0006] The technical solution adopted by this invention to solve the problem is as follows. In a first aspect, this application provides a speech recognition method, including: recording the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; determining the delay duration for voice activity detection based on the second moment and the first moment; determining the retransmitted audio data based on the delay duration; generating audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection; and performing speech recognition on the audio data to be recognized to obtain a recognition result.

[0007] In some embodiments of this application, determining the retransmitted audio data based on the delay duration includes: obtaining the start timestamp of the first audio data; determining the extraction duration of the retransmitted audio data based on the delay duration; and extracting the retransmitted audio data from the cached audio data based on the extraction duration and the start timestamp. The cached audio data is obtained by caching the real-time audio data, and the start timestamp, the first moment, and the second moment are all measured against a unified time base.

[0008] In some embodiments of this application, determining the extraction duration of the retransmitted audio data based on the delay duration includes: checking whether the delay duration meets the preset duration, which is obtained from statistics on historical delay durations; when the delay duration meets the preset duration, determining the delay duration to be the extraction duration; and when the delay duration does not meet the preset duration, determining the preset duration to be the extraction duration.
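The selection rule above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the text does not define what "meets the preset duration" means, so the sketch assumes it means the measured delay does not exceed the preset duration. The function name and millisecond units are hypothetical.

```python
def choose_extraction_duration_ms(delay_ms: int, preset_ms: int) -> int:
    """Pick the extraction duration for the retransmitted audio.

    Assumption (not stated in the source): the delay "meets" the preset
    duration when delay_ms <= preset_ms.
    """
    if delay_ms < 0 or preset_ms <= 0:
        raise ValueError("delay must be >= 0 and preset must be > 0")
    if delay_ms <= preset_ms:
        # Delay is within the statistically derived bound: use it directly.
        return delay_ms
    # Delay is outside the bound: fall back to the preset duration.
    return preset_ms
```

Under this reading the rule reduces to taking the smaller of the two values, which keeps an abnormally large measured delay from pulling in more cached audio than the historical statistics support.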

[0009] In some embodiments of this application, extracting the retransmitted audio data from the cached audio data based on the extraction duration and the start timestamp includes: determining the extraction point in the cached audio data based on the start timestamp; and extracting the retransmitted audio data from the cached audio data based on the extraction point and the extraction duration. The duration of the retransmitted audio data is the extraction duration, and the end time of the retransmitted audio data is the start timestamp.
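As a rough sketch of this extraction, assume the cache is one contiguous byte buffer of 16 kHz, 16-bit mono PCM whose first byte corresponds to a known timestamp on the same time base as the start timestamp. The buffer layout, the audio format, and all names below are assumptions made for illustration.

```python
BYTES_PER_MS = 16_000 * 2 // 1000  # assumed format: 16 kHz, 16-bit mono PCM

def extract_retransmitted(cache: bytes, cache_start_ms: int,
                          start_ts_ms: int, extraction_ms: int) -> bytes:
    """Return the cached audio that ends at start_ts_ms.

    The extraction point is the byte offset of start_ts_ms in the cache;
    the returned window covers [start_ts_ms - extraction_ms, start_ts_ms).
    """
    end_off = (start_ts_ms - cache_start_ms) * BYTES_PER_MS
    if not 0 <= end_off <= len(cache):
        raise ValueError("start timestamp lies outside the cached range")
    # Clamp at the start of the cache if less audio than requested is held.
    begin_off = max(end_off - extraction_ms * BYTES_PER_MS, 0)
    return cache[begin_off:end_off]
```

If the cache holds less audio than the extraction duration asks for, this sketch returns what is available instead of failing. That is a design choice of the example, not something the text specifies.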

[0010] In some embodiments of this application, generating the audio data to be recognized based on the retransmitted audio data and the first audio data includes: concatenating the first audio data and the retransmitted audio data in timestamp order to obtain the audio data to be recognized.
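A minimal sketch of the timestamp-ordered concatenation is shown below. The `Segment` type, the 16 kHz 16-bit mono format, and the contiguity check are illustrative assumptions. The text only requires that the two pieces be joined according to their timestamps.

```python
from typing import NamedTuple

BYTES_PER_MS = 32  # assumed format: 16 kHz, 16-bit mono PCM

class Segment(NamedTuple):
    start_ms: int  # timestamp of the first sample, on the unified time base
    pcm: bytes

    @property
    def end_ms(self) -> int:
        return self.start_ms + len(self.pcm) // BYTES_PER_MS

def splice(retransmitted: Segment, first: Segment) -> bytes:
    """Join the two segments in timestamp order into one PCM buffer."""
    earlier, later = sorted([retransmitted, first], key=lambda s: s.start_ms)
    if earlier.end_ms != later.start_ms:
        # Extra safety check added for this sketch only.
        raise ValueError("segments are not contiguous in time")
    return earlier.pcm + later.pcm
```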

[0011] In some embodiments of this application, the method further includes: activating the caching mechanism at the first moment, and segmenting and caching the real-time audio data according to preset rules to obtain the cached audio data.

[0012] In some embodiments of this application, segmenting and caching the real-time audio data according to preset rules to obtain the cached audio data includes: segmenting and caching the real-time audio data according to the block size used for speech recognition and the preset cache duration to obtain the cached audio data.
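One way to picture this segmented caching is a bounded queue of ASR-sized blocks. The sketch below is an assumption-laden illustration: the class name, the 32 bytes per millisecond format (16 kHz, 16-bit mono), and the use of a fixed-length `deque` are not taken from the source.

```python
from collections import deque

class SegmentedCache:
    """Re-chunks incoming PCM into ASR-sized blocks and keeps the newest ones."""

    def __init__(self, block_ms: int, max_cache_ms: int, bytes_per_ms: int = 32):
        self.block_ms = block_ms
        self.block_bytes = block_ms * bytes_per_ms
        # The preset cache duration bounds how many blocks are retained.
        self.blocks = deque(maxlen=max_cache_ms // block_ms)
        self._pending = bytearray()
        self._next_start_ms = None

    def push(self, pcm: bytes, start_ms: int) -> None:
        """Append one incoming frame; emit full blocks as they fill up."""
        if self._next_start_ms is None:
            self._next_start_ms = start_ms
        self._pending.extend(pcm)
        while len(self._pending) >= self.block_bytes:
            block = bytes(self._pending[:self.block_bytes])
            del self._pending[:self.block_bytes]
            self.blocks.append((self._next_start_ms, block))
            self._next_start_ms += self.block_ms
```

For example, pushing twenty 20 ms frames into a cache configured with 100 ms blocks and a 300 ms cache duration would leave the three most recent 100 ms blocks in `blocks`.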

[0013] In some embodiments of this application, the preset cache duration is dynamically adjusted based on a latency prediction model, which is trained based on historical latency durations.

[0014] In some embodiments of this application, the method further includes: checking whether the cache duration of the cached audio data exceeds the preset cache duration; and, when the cache duration exceeds the preset cache duration, cleaning up the cached audio data according to a first-cached, first-cleaned mechanism (that is, the earliest cached audio data is cleaned up first).
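The check-and-clean step could look like the sketch below. It assumes the cache is a queue of equal-length blocks stored as `(start_ms, pcm)` tuples and reads the cleanup mechanism as first-in, first-out. Both are assumptions, as are the function and parameter names.

```python
from collections import deque

def evict_oldest(blocks: deque, block_ms: int, max_cache_ms: int) -> int:
    """Drop the oldest blocks until the cache fits the preset cache duration.

    Returns the number of blocks removed.
    """
    removed = 0
    while blocks and len(blocks) * block_ms > max_cache_ms:
        blocks.popleft()  # the earliest cached block is cleaned up first
        removed += 1
    return removed
```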

[0015] In a second aspect, this application provides a speech recognition device, comprising: a recording module configured to record the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; and a processing module configured to determine the delay duration for voice activity detection of the real-time audio data based on the second moment and the first moment, determine the retransmitted audio data based on the delay duration, generate audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection, and perform speech recognition on the audio data to be recognized to obtain a recognition result.

[0016] In a third aspect, this application also provides a computer device, which includes: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the speech recognition method of any embodiment of the first aspect.

[0017] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, the computer program being loaded by a processor to perform the steps of the speech recognition method of any embodiment of the first aspect.

[0018] The beneficial effects of this invention are as follows. The start time of voice activity detection and the trigger time at which active voice is first stably determined during that detection are both recorded, so the difference between the trigger time and the start time gives the decision delay of voice activity detection. The retransmitted audio data for speech recognition is then determined based on this delay, and the audio data to be recognized is completed with the retransmitted audio data. This recovers the front portion of the speech that would otherwise have been lost to the voice activity detection delay and seamlessly prepends it to the start of the speech recognition processing flow. The voice data sent to the speech recognition system is therefore a complete instruction starting from the initial moment of active voice, which ensures that speech recognition can fully recognize voice instructions, avoids misinterpretation of user intent, and improves the coherence and reliability of intelligent voice interaction.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a schematic diagram of the system architecture provided in an embodiment of the present invention; Figure 2 is a schematic flowchart of an embodiment of the speech recognition method provided by the present invention; Figure 3 is a timing diagram illustrating how the delay duration and the retransmitted audio data are determined in an embodiment of the present invention; Figure 4 is a flowchart of a speech recognition method provided in an embodiment of the present invention; Figure 5 is a schematic diagram of a specific embodiment of the speech recognition device provided by the present invention; Figure 6 is a schematic diagram of an embodiment of the computer device provided by the present invention.

Detailed Implementation

[0021] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0022] In the description of this application, the terms "first," "second," "third," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first," "second," "third," etc., may explicitly or implicitly include one or more features.

[0023] In this application, the term "exemplary" is used to mean "used as an example, illustration, or description." Any embodiment described as "exemplary" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use this application. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that this application can be made without using these specific details. In other instances, well-known structures and processes are not described in detail to avoid obscuring the description of this application with unnecessary detail. Therefore, this application is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.

[0024] It should be noted that since the method in this application embodiment is executed in a computer device, the processing objects of each computer device exist in the form of data or information, such as time, which is essentially time information. It is understood that if size, quantity, position, etc. are mentioned in subsequent embodiments, they are all corresponding data that exist so that the computer device can process them. Specific details will not be elaborated here.

[0025] With the development of artificial intelligence technology, intelligent voice interaction has become an important method of human-computer interaction. Among them, full-duplex streaming technology has gained widespread application because it supports continuous and natural voice dialogue between users and devices without the need for frequent triggering of wake words, thus improving the fluency of interaction. In the full-duplex streaming technology architecture, the Voice Activity Detection (VAD) module plays a key role. Its main function is to distinguish between speech signals and non-speech signals (such as background noise) in real time to determine the start and end boundaries of the speaker's voice, i.e., the active speech range. The output of the VAD module is used to control whether the subsequent Automatic Speech Recognition (ASR) module starts processing. By sending only the detected valid speech segments to the ASR module, the computational load of the system can be effectively reduced, and the interference of background noise on the recognition results can be avoided, thereby ensuring the processing efficiency and recognition accuracy of the ASR system.

[0026] However, existing VAD (Voice Activity Detection) solutions generally suffer from an inherent flaw. To accurately distinguish between speech and sudden noise, VAD algorithms typically need to accumulate a specific duration of audio data (e.g., 200 to 500 milliseconds) and make a comprehensive judgment based on the energy, zero-crossing rate, and spectral characteristics of this data in order to stably determine the start of speech activity. This judgment process introduces an unavoidable start delay. Meanwhile, in full-duplex streaming processing, ASR (Automatic Speech Recognition) systems typically process the audio stream in small blocks (e.g., 100 or 200 milliseconds) to achieve low-latency real-time feedback. In particular, in language environments such as Chinese, the core keywords in the user's command (e.g., "turn on" in "turn on the air conditioner," "play" in "play music") are usually located at the beginning of the sentence and have a short duration (e.g., 120 to 180 milliseconds).

[0027] Therefore, when the VAD module triggers late due to its inherent latency, the ASR module's startup time is already later than the actual start of the speech. This results in the initial part of the voice command, especially the first few syllables or words containing core keywords, failing to be captured by the VAD module and sent to the ASR system in time. The direct consequence is that the ASR system receives incomplete audio data, which may lead to incorrect recognition results (e.g., recognizing "play music" as "music"), failing to accurately understand the user's intent, and causing command execution failure or errors. This loss of voice front-end information seriously impairs the reliability and natural fluency of full-duplex voice interaction, and is a problem that urgently needs to be solved by current technology.

[0028] To address this technical problem, this application provides the following technical solution: recording the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; determining the delay duration for voice activity detection of the real-time audio data based on the second moment and the first moment; determining retransmitted audio data based on the delay duration; generating audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection; and performing speech recognition on the audio data to be recognized to obtain a recognition result. In this process, the start time of voice activity detection and the trigger time at which active voice is first stably determined are both recorded, so the difference between the trigger time and the start time gives the decision delay of voice activity detection. The retransmitted audio data for speech recognition is then determined based on this delay, and the audio data to be recognized is completed with the retransmitted audio data. In this way, the front portion of the speech that would otherwise have been lost to the voice activity detection delay is recovered and seamlessly prepended to the start of the speech recognition processing flow. The voice data sent to the speech recognition system is therefore a complete instruction starting from the initial moment of active voice, which ensures that speech recognition can fully recognize voice instructions, avoids misjudgment of user intent, and improves the coherence and reliability of intelligent voice interaction.

[0029] This application provides a speech recognition method, apparatus, computer device, and computer-readable storage medium to ensure that speech recognition can completely recognize speech commands, avoid misinterpretation of user intent, and improve the coherence and reliability of intelligent voice interaction. The electronic device provided in this application can be implemented as various types of user terminals or as a server.

[0030] Electronic devices use the speech recognition method provided in the embodiments of this application to ensure that speech recognition can fully recognize speech commands, avoid misjudging user intent, and improve the coherence and reliability of intelligent voice interaction.

[0031] The above method can be applied to many intelligent voice interaction devices, such as smart TVs, smart speakers, smart air conditioners, etc.

[0032] In one exemplary solution, the speech recognition method can be applied to a voice interaction scenario on a smart TV. For example, after the user issues a voice interaction command (e.g., speaks a wake-up command such as "XX" or "Hello, XX", or presses a voice interaction button), the smart TV initiates voice activity detection. At the same time, it starts the timer of the voice activity detection module, which records the first moment at which voice activity detection is initiated, as well as the timer of the caching module and the timer of speech recognition. When voice activity detection first determines the presence of active voice and triggers speech recognition, the voice activity detection timer records the second moment, and the speech recognition timer records the start timestamp of speech recognition. The smart TV then determines the delay duration of voice activity detection based on the first moment and the second moment, and determines the retransmitted audio data for speech recognition based on the delay duration and the start timestamp (here, the retransmitted audio data is the audio data that was missed during voice activity detection and therefore not sent to the speech recognition engine). Finally, the smart TV concatenates the retransmitted audio data with the first audio data received by the speech recognition engine (here, the first audio data is the first audio data sent to the speech recognition engine after speech recognition is triggered) to obtain the complete audio data to be recognized, calls the speech recognition engine to perform speech recognition on it to obtain the speech recognition result, and provides the corresponding interactive operation (such as responding to the voice interaction command or switching the playback content).

[0033] It should be understood that the above is only an exemplary application scenario of the speech recognition method, and there are many other possible application scenarios, which are not limited here.

[0034] The speech recognition method provided in the embodiments of this application may be applied to, for example, the system architecture shown in Figure 1. As shown in Figure 1, to support the speech recognition method, the terminal device 100 connects to the server 300 via the network 200, and the server 300 connects to the database 400. The network 200 can be a wide area network (WAN), a local area network (LAN), or a combination of both. The client used to implement the speech recognition scheme is deployed on the terminal device 100, or it can run on the terminal device 100 as a standalone application. The specific form of the client is not limited here.

[0035] The server 300 involved in this application can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms.

[0036] Terminal equipment 100, also known as user equipment (UE), mobile station (MS), mobile terminal (MT), customer premises equipment (CPE), etc., can be a device that includes both receiving and transmitting hardware, that is, a device with receiving and transmitting hardware capable of performing bidirectional communication on a bidirectional communication link. Such equipment can include cellular or other communication devices with single-line displays, multi-line displays, or no multi-line displays. Examples include handheld devices with wireless connectivity, vehicle-mounted devices, machine-type communication (MTC) terminals, etc. Currently, terminal devices 100 can include: mobile phones, tablets, laptops, PDAs, mobile internet devices (MIDs), wearable devices, virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, wireless terminals in self-driving vehicles, wireless terminals in remote medical surgery, wireless terminals in smart grids, wireless terminals in transportation safety, wireless terminals in smart cities, or wireless terminals in smart homes, etc. For example, wireless terminals in self-driving vehicles can be drones, helicopters, or airplanes. For example, wireless terminals in vehicle-to-everything (V2X) systems can be in-vehicle equipment, vehicle-mounted equipment, in-vehicle modules, vehicles, or ships, etc. Wireless terminals in industrial control can be cameras, robots, or robotic arms, etc. Wireless terminals in smart homes can be televisions, air conditioners, robot vacuums, speakers, or set-top boxes, etc.

[0037] It should be noted that the terminal device 100 may be a device or apparatus with a chip, or a device or apparatus with integrated circuitry, or a chip, module, or control unit in the device or apparatus shown above; this application does not impose any specific limitations. The solution provided in this application can be implemented by the terminal device 100 and the server 300 working together.

[0038] In short, a database can be viewed as an electronic filing cabinet, that is, a place to store electronic files, where users can perform operations such as adding, querying, updating, and deleting data. A "database" is a collection of data stored together in a certain way, shared by multiple users, with minimal redundancy, and independent of application programs. A Database Management System (DBMS) is a computer software system designed to manage databases, generally possessing basic functions such as storage, retrieval, security, and backup. DBMSs can be classified according to the database model they support, such as relational or Extensible Markup Language (XML); or according to the type of computer they support, such as server clusters or mobile phones; or according to the query language used, such as Structured Query Language (SQL) or XQuery; or according to performance priorities, such as maximum scale or maximum operating speed; or by other classification methods. Regardless of the classification method used, some DBMSs can cross categories, for example, supporting multiple query languages simultaneously. In this application, the database 400 can be used to store the cached audio data, the first moment, the second moment, the start timestamp, and other data.

[0039] Those skilled in the art will understand that the system architecture shown in Figure 1 is only one possible system architecture for this application and does not constitute a limitation on the system architecture of this application. Other system architectures may include more or fewer terminal devices or servers than shown in Figure 1. For example, Figure 1 shows only one server; it is understood that the system architecture may also include one or more other terminal devices or servers, which are not limited here.

[0040] It should be noted that the system architecture shown in Figure 1 is an example. The servers and scenarios described in the embodiments of this application are intended to illustrate the technical solutions of the embodiments more clearly, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. As those skilled in the art will know, with the evolution of servers and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.

[0041] Figure 2 is a schematic flowchart of an embodiment of the speech recognition method in this application. The speech recognition method is described below with a terminal device as the executing entity, and it may include the following steps 201 to 205, as detailed below:

201. Record the first moment when voice activity detection is initiated on the real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state.

[0042] In this embodiment, the voice activity detection module and/or the speech recognition module of the terminal device remain in a standby or listening state, waiting for an external or internal trigger event to initiate a complete voice interaction session. When initiating the voice interaction session in response to the trigger event, the terminal device generates and transmits the first frame of real-time audio data and, before sending that frame to the VAD analysis function for processing, calls the timing module deployed in the VAD processing thread or task (such as the VAD module) or the global timing module of the terminal device to obtain the current timestamp, which it stores as the first moment (T1). The terminal device then continuously runs the VAD analysis function on incoming audio data frames, which computes a quantitative index based on its built-in algorithm (for example, one based on energy, zero-crossing rate, spectral features, or a deep neural network model). This index characterizes the probability that the current frame belongs to speech. The terminal device evaluates each result returned by the VAD analysis function. When a result first meets the predefined "voice active" criterion, the terminal device immediately calls the timing module deployed in the VAD processing thread or task, or the global timing module of the terminal device, to obtain the current timestamp and stores it as the second moment (T2).

[0043] Optionally, the terminal device can also set a flag at this time to prevent T2 from being recorded again for subsequent voice frames.
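The recording of T1 and T2, including the guard against re-recording T2, might be sketched as follows. The class and method names are hypothetical. Python's `time.monotonic_ns()` stands in for the timing module, and the "flag" is represented simply by whether T2 has already been set.

```python
import time
from typing import Callable, Optional

class VadLatencyRecorder:
    """Captures T1 and T2 from one monotonic clock (illustrative sketch)."""

    def __init__(self, clock: Callable[[], int] = time.monotonic_ns):
        self._clock = clock
        self.t1_ns: Optional[int] = None
        self.t2_ns: Optional[int] = None

    def before_first_frame(self) -> None:
        """Call just before the first frame is handed to the VAD function."""
        if self.t1_ns is None:
            self.t1_ns = self._clock()

    def after_vad_result(self, is_speech: bool) -> None:
        """Call right after each VAD result; only the first 'active' counts."""
        if is_speech and self.t1_ns is not None and self.t2_ns is None:
            self.t2_ns = self._clock()

    def delay_ns(self) -> Optional[int]:
        """Return T2 - T1, or None until both moments are recorded."""
        if self.t1_ns is None or self.t2_ns is None:
            return None
        return self.t2_ns - self.t1_ns
```

Injecting the clock as a parameter is only a convenience for testing. On a device, both timestamps would come from the same monotonic source so that their difference is meaningful.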

[0044] Optionally, T1 and T2 can be stored as nanosecond values in 64-bit integers (int64) and packaged together with metadata such as the session identifier, device information, VAD algorithm version, and current network status to form a complete performance analysis log.
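A performance log entry of this kind could be modeled as shown below. The field names, the JSON encoding, and the derived `delay_ns` field are illustrative assumptions. The text only lists the kinds of metadata packaged with T1 and T2.

```python
import json
from dataclasses import asdict, dataclass

INT64_MAX = 2**63 - 1

@dataclass(frozen=True)
class VadLatencyLog:
    session_id: str
    device_info: str
    vad_version: str
    network_status: str
    t1_ns: int  # first moment, nanoseconds, must fit in a signed int64
    t2_ns: int  # second moment, same clock and unit as t1_ns

    def __post_init__(self) -> None:
        for value in (self.t1_ns, self.t2_ns):
            if not 0 <= value <= INT64_MAX:
                raise ValueError("timestamp does not fit in int64")
        if self.t2_ns < self.t1_ns:
            raise ValueError("T2 must not precede T1")

    def to_json(self) -> str:
        record = asdict(self)
        record["delay_ns"] = self.t2_ns - self.t1_ns  # derived for convenience
        return json.dumps(record, sort_keys=True)
```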

[0045] In this embodiment, the recording of the first moment and the second moment are tightly coupled with the corresponding events (data submission, decision output) at the code execution level to minimize measurement latency and jitter, thereby ensuring that the calculated time difference can truly reflect the initial response latency of the VAD algorithm.

[0046] The real-time audio data refers to a continuous sequence of digital signals generated over time by an audio acquisition device (such as a microphone), which objectively represents the changes in sound pressure in the acoustic environment. In this embodiment, the real-time audio data can also specifically refer to the input data processed by the voice activity detection module. It is a stream of raw or pre-processed audio data organized in fixed-duration units (frames) without large-scale delay processing, and is typically represented as Pulse Code Modulation (PCM) data blocks. For example, the real-time audio data may be a sequence of data frames of 16-bit signed integers with a sampling rate of 16 kHz and a frame duration of 20 milliseconds.
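For the example format above, the frame size follows from simple arithmetic: 16,000 samples per second times 0.020 seconds gives 320 samples, and at 2 bytes per sample that is 640 bytes per frame. The helper below restates this calculation. The single-channel default is an assumption, since the text does not state the channel count.

```python
def frame_size(sample_rate_hz: int, frame_ms: int,
               bytes_per_sample: int = 2, channels: int = 1) -> tuple:
    """Return (samples per channel, total bytes) for one PCM frame."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples, samples * bytes_per_sample * channels
```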

[0047] The first moment marks the starting point at which the Voice Activity Detection (VAD) processing unit begins analyzing real-time audio data from a new session. In this embodiment, the first moment refers to the system timestamp recorded by the terminal device at the instant it submits the first valid audio data frame to the VAD analysis engine after receiving an explicit "start voice acquisition" instruction (e.g., "Hello, Xiao X" or "Hello, XX"). It can be represented as a high-precision (e.g., microsecond or nanosecond level) absolute timestamp generated by a monotonic clock. This timestamp is captured and stored before the VAD processing function is called to process the first frame of data.

[0048] The second moment marks the point in time at which the VAD processing unit first successfully determines an audio data frame to be "voice active" during the analysis process. In this embodiment, the second moment can refer to the timestamp recorded by the system when the output of the VAD analysis engine first meets the preset "voice active" determination criteria during the processing of the audio data stream. It can be represented as a timestamp using the same time source and precision as the first moment. When the VAD processing function returns a result that explicitly indicates "voice" (e.g., a Boolean value True, or a confidence score higher than a preset threshold), this timestamp is captured and stored immediately after the function returns.
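The capture points for the two moments can be sketched as follows. This is a minimal illustration, assuming a monotonic nanosecond clock (`time.monotonic_ns`) and a VAD exposed as a callable that returns a Boolean; the class `VadTimingRecorder` and the toy detector `toy_vad` are hypothetical names introduced only for this example.

```python
import time


class VadTimingRecorder:
    """Records T1 just before the first VAD call and T2 right after the first 'voice' result."""

    def __init__(self, clock=time.monotonic_ns):
        self._clock = clock   # monotonic time source shared by both moments
        self.t1_ns = None     # first moment
        self.t2_ns = None     # second moment

    def process(self, vad_is_speech, frame):
        if self.t1_ns is None:
            # Captured before the VAD function is called on the first frame.
            self.t1_ns = self._clock()
        is_speech = vad_is_speech(frame)
        if is_speech and self.t2_ns is None:
            # Captured immediately after the VAD function first reports voice.
            self.t2_ns = self._clock()
        return is_speech


def toy_vad(frame):
    """Stand-in detector (assumption): any non-zero byte counts as voice."""
    return any(frame)


recorder = VadTimingRecorder()
for frame in [bytes(640), bytes(640), b"\x01" * 640]:
    recorder.process(toy_vad, frame)
print(recorder.t2_ns - recorder.t1_ns)  # elapsed nanoseconds between the two moments
```

Passing the clock in as a parameter keeps both moments on one time source and makes the recorder easy to exercise with a fake clock.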

[0049] Optionally, the terminal device initiates a complete voice interaction session based on an external or internal triggering event, and the triggering event may be implemented in the following possible ways: Hardware triggering: The terminal device detects the press event of a physical button (such as a push-to-talk button) through a general-purpose input/output (GPIO) interrupt or a USB event listener. For example, in an in-vehicle system, the driver presses the voice command button on the steering wheel.

[0050] Software triggering: The application's graphical user interface framework captures user clicks or touches on the virtual microphone icon. For example, in an instant messaging scenario, a user long-presses the "Voice Input" button in the chat interface to trigger recording.

[0051] Acoustic triggering: A low-power wake word engine continuously analyzes the audio stream and generates an internal wake-up event when a preset wake word (such as "Hello, assistant") is detected. For example, in a smart speaker scenario, the wake word engine detects "XX classmate" and triggers subsequent steps.

[0052] 202. Determine the delay duration for voice activity detection based on the second moment and the first moment.

[0053] The terminal device reads the first moment (T1) and the second moment (T2) associated with the current voice session from the storage medium or memory; then the terminal device performs a subtraction operation, that is, calculates the difference between the second moment and the first moment: ΔT = T2 - T1. At this time, the terminal device can determine that the difference is the delay duration (ΔT) of voice activity detection.

[0054] Optionally, the terminal device can also associate the delay duration ΔT with other metadata for the voice session. This metadata should include at least: a unique session identifier (Session ID), device ID, VAD algorithm version, and snapshots of environmental characteristics recorded at T1 and T2 (such as estimated background noise levels, signal-to-noise ratio (SNR), acoustic scene classification results, etc.). The terminal device then persistently stores the complete log record containing ΔT and the metadata in the local file system or reports it to a remote data analysis platform over the network.

[0055] Optionally, if the recording time units for the first and second moments are high-precision time units (such as nanoseconds), the terminal device can convert the original time unit into a more readable standard time unit, typically milliseconds (ms), according to application requirements.

[0056] In this embodiment, the delay duration represents the time span from when the voice activity detection module of the terminal device begins analyzing a real-time audio stream to when it successfully identifies the voice signal for the first time. This metric is a core performance parameter for measuring the response speed of the VAD system. It can specifically refer to the value obtained by calculating the time difference between the second moment (T2) and the first moment (T1).

[0057] Optionally, when reporting the delay duration, the terminal device can define a data structure in JSON or Protobuf format based on the delay duration for reporting.
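The delay calculation, the unit conversion, and a JSON report can be sketched together as follows. This is only an illustration: the function `build_delay_report` and all field names are assumptions, since this application does not define a report schema.

```python
import json


def build_delay_report(t1_ns, t2_ns, session_id, device_id, vad_version):
    """Compute ΔT = T2 - T1 and bundle it with session metadata (hypothetical schema)."""
    if t2_ns < t1_ns:
        raise ValueError("T2 must not precede T1 on a monotonic clock")
    delay_ns = t2_ns - t1_ns
    return {
        "session_id": session_id,
        "device_id": device_id,
        "vad_version": vad_version,
        "t1_ns": t1_ns,                    # int64-range nanosecond timestamps
        "t2_ns": t2_ns,
        "delay_ns": delay_ns,
        "delay_ms": delay_ns / 1_000_000,  # nanoseconds converted to milliseconds
    }


report = build_delay_report(
    t1_ns=1_000_000_000,
    t2_ns=1_240_000_000,
    session_id="session-001",   # placeholder values for illustration
    device_id="device-42",
    vad_version="vad-1.0",
)
payload = json.dumps(report, sort_keys=True)
print(report["delay_ms"])  # 240.0
```

Keeping the raw nanosecond values alongside the millisecond value lets a downstream analysis platform recompute the delay without rounding loss.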

[0058] In this embodiment, the terminal device can collect ΔT distributions under different VAD algorithm versions or parameter configurations on a large scale, and the VAD algorithm version with the fastest response can be selected in a data-driven manner. The terminal device can also continuously monitor real-time changes of ΔT in a production environment. If the average value of ΔT changes abruptly, this may indicate that a newly released software version has introduced a performance degradation, or that a specific device model has hardware/driver problems, and an alarm can be triggered accordingly.

[0059] On the other hand, the value of the delay duration (ΔT) is related to the risk of front-end truncation caused by VAD. Analyzing the distribution of ΔT provides a basis for setting the length of an adaptive pre-roll buffer. For example, if 99% of the delay durations are within 300 ms, then setting a 350 ms pre-roll buffer can effectively avoid most front-end truncation problems.
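The sizing rule in the example above can be sketched as a percentile plus a safety margin. This is a minimal illustration that assumes the nearest-rank percentile definition, an integer percentile, and a 50 ms margin; the function names and the sample delay values are hypothetical.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def preroll_buffer_ms(delays_ms, pct=99, margin_ms=50):
    """Buffer length covering pct% of observed VAD delays plus a safety margin."""
    return nearest_rank_percentile(delays_ms, pct) + margin_ms


# Illustrative sample: 99 of 100 delays are at most 300 ms, one outlier is 900 ms.
delays_ms = [120] * 50 + [200] * 48 + [300] + [900]
print(preroll_buffer_ms(delays_ms))  # 350
```

Choosing a percentile rather than the maximum keeps a single outlier from inflating the buffer, at the cost of leaving the rare slowest sessions uncovered.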

[0060] 203. Determine the retransmission audio data based on the delay duration.

[0061] During this voice interaction, after initiating voice activity detection, the terminal device sequentially writes each frame of real-time audio data read from the audio acquisition device into a pre-allocated buffer area, thereby caching the real-time audio data. When the VAD module first determines at the second moment that voice is active, the terminal device can trigger speech recognition and simultaneously trigger the retransmission decision logic.

[0062] In this embodiment, the retransmission decision logic mainly consists of determining the retransmitted audio data. The retransmitted audio data refers to historical audio data generated before VAD activation, which is extracted from the buffer area and sent to compensate for the loss of the initial speech portion caused by the VAD response delay. In this embodiment, the retransmitted audio data can also refer to an audio data segment that is temporally adjacent to, and extends backward from, the VAD activation point (the second moment T2), with a length related to the VAD delay duration (ΔT) or a preset safety duration. This segment is prepended to the audio data acquired by the speech recognition module, forming the complete audio data to be recognized.

[0063] In one exemplary scheme, the terminal device can extract corresponding audio data from cached audio in the buffer region based on the calculated delay duration ΔT. It should be understood that the amount of audio data should at least cover the time span corresponding to the delay duration.

[0064] Based on the above description, in this embodiment, the terminal device may acquire the retransmitted audio data in the following possible ways. In one possible implementation, to enable the terminal device to cope flexibly with various complex acoustic environments during speech recognition, the terminal device can perform the following process. First, the terminal device obtains the start timestamp of the first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection. At the same time, the terminal device determines the truncation duration of the retransmitted audio data based on the delay duration. Finally, based on the truncation duration and the start timestamp, the terminal device extracts the retransmitted audio data from the cached audio data. Here, the cached audio data is obtained by caching the real-time audio data from the moment VAD detection is started, and the start timestamp, the first moment, and the second moment are all measured against a unified time reference.

[0065] Optionally, the cached audio data can be cached in a preset buffer area, which can be configured as a circular buffer. This circular buffer can be understood as a fixed-size storage structure for temporarily storing the latest data, operating like a queue with its ends connected. When the buffer is full, newly written data overwrites the oldest data. In this embodiment, the circular buffer is a memory area that continuously receives and caches the latest real-time audio data frames. Its core function is to provide a historical data storage area, allowing the terminal device to retrieve audio data from before VAD activation when needed. In some embodiments, the circular buffer can be represented as an array or linked list with a fixed capacity (e.g., capable of storing 500 milliseconds of audio data) and equipped with two pointers: a write pointer pointing to the next writable position and a read pointer for data retrieval. The write pointer moves cyclically.
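The circular buffer described above can be sketched as a fixed-size array with a cyclically moving write pointer. This is a minimal single-threaded illustration; the class `CircularFrameBuffer` and its method names are hypothetical, and a real implementation would also need to consider concurrent access by the capture and retransmission paths.

```python
class CircularFrameBuffer:
    """Fixed-capacity ring buffer that keeps only the most recent audio frames."""

    def __init__(self, capacity_frames):
        if capacity_frames <= 0:
            raise ValueError("capacity_frames must be positive")
        self._slots = [None] * capacity_frames
        self._write = 0   # write pointer: next writable position
        self._count = 0   # number of valid frames currently stored

    def write(self, timestamp_ns, frame):
        # When the buffer is full, this overwrites the oldest frame.
        self._slots[self._write] = (timestamp_ns, frame)
        self._write = (self._write + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def snapshot(self):
        """Return the stored (timestamp_ns, frame) pairs from oldest to newest."""
        size = len(self._slots)
        read = (self._write - self._count) % size   # read pointer: oldest valid frame
        return [self._slots[(read + i) % size] for i in range(self._count)]


FRAME_NS = 20_000_000                 # 20 ms frames, as in the earlier example
buffer = CircularFrameBuffer(25)      # 25 frames of 20 ms = 500 ms of history
for i in range(30):
    buffer.write(i * FRAME_NS, bytes(640))
history = buffer.snapshot()
print(len(history), history[0][0] // 1_000_000)  # 25 frames, oldest at 100 ms
```

After 30 writes into 25 slots, the five oldest frames have been overwritten, which is the behavior the paragraph above describes for a full buffer.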

[0066] Optionally, to better meet the needs of the implementation scenario and make the retransmitted audio data more accurate, the terminal device can also check the delay duration. In an exemplary solution, the terminal device compares the delay duration with a preset duration to determine whether the two meet a preset condition (e.g., whether the difference between them is within a threshold). If the preset condition is met, the delay duration is determined to be the truncation duration; if the preset condition is not met, the preset duration is determined to be the truncation duration. For example, if the delay duration is much longer or much shorter than the preset duration, the preset duration can be used directly as the truncation duration. If the difference between the delay duration and the preset duration is small, the delay duration can be used directly as the truncation duration.
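The selection rule above can be sketched in a few lines. This illustration assumes that the preset condition is "the absolute difference is within a tolerance", which is the example condition given in the paragraph; the function name and the numeric values are hypothetical.

```python
def choose_truncation_ms(delay_ms, preset_ms, tolerance_ms):
    """Use the measured delay when it is close to the preset duration, else the preset."""
    if abs(delay_ms - preset_ms) <= tolerance_ms:
        return delay_ms    # measured delay is plausible: use it directly
    return preset_ms       # measured delay is an outlier: fall back to the preset


print(choose_truncation_ms(280, preset_ms=300, tolerance_ms=100))   # 280
print(choose_truncation_ms(1500, preset_ms=300, tolerance_ms=100))  # 300
print(choose_truncation_ms(5, preset_ms=300, tolerance_ms=100))     # 300
```

The fallback bounds how much audio is retransmitted when a measurement is implausibly long or short.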

[0067] Optionally, the preset duration can be obtained based on historical latency statistical analysis; it can also be based on the output of a pre-trained latency prediction model; or it can be set based on the application scenario.

[0068] For example, in noisy environments, VAD increases its activation threshold to avoid accidental activation, resulting in a significant increase in latency. Therefore, in this high-noise scenario, the terminal device can dynamically increase this preset duration, thereby making the acquired retransmitted audio data more complete.

[0069] Alternatively, for users who speak softly / gradually (i.e., whose voice gradually increases in volume), the VAD may need several hundred milliseconds to confirm the start of the speech. Therefore, in this soft / gradual speaking scenario, the terminal device can also dynamically increase the preset duration to make the acquired retransmitted audio data more complete.

[0070] Optionally, when caching the cached audio data, the terminal device can adopt the following scheme: The caching mechanism is activated the moment the terminal device initiates VAD detection, and the real-time audio data is segmented and cached according to preset rules to obtain the cached audio data. This strictly binds the caching activation time to the VAD module's start time. That is, as long as the VAD is running, caching is performed synchronously. Therefore, when the VAD first determines the presence of active voice, the system can ensure that all audio data prior to this is completely stored in the cache. Through this unconditional pre-caching strategy, the 100% availability of the audio data used for compensation (i.e., the audio data within the VAD delay window) is guaranteed from a mechanism perspective.

[0071] Furthermore, to avoid redundant data processing overhead and reduce the latency between VAD triggering and the start of ASR processing of the complete audio, the terminal device can align the data structure at the data generation source (caching stage) with that of the consumption end (ASR engine). This eliminates the need for data reorganization or re-segmentation on the compensation path and thereby improves the operational efficiency of the entire voice front-end processing chain. In one exemplary scheme, the terminal device can segment and cache the real-time audio data according to the segment size of speech recognition and a preset cache duration to obtain the cached audio data.

[0072] Based on the above description, the cache area for the cached audio data can be set as a circular storage area. Therefore, while caching the audio data, the terminal device can also perform the following cache management operations: the terminal device detects whether the cache duration of the cached audio data exceeds the preset cache duration; if it does, the terminal device overwrites and clears the cached audio data according to a first-cached, first-cleared mechanism while continuing to cache. In other words, this circular buffer can be understood as a fixed-size storage structure used to temporarily store the latest data, and it operates like a queue whose head and tail are connected. When the buffer is full, newly written data overwrites the oldest data.

[0073] Optionally, the preset buffer duration can be dynamically adjusted based on a latency prediction model or based on statistical analysis of historical latency durations. The latency prediction model can be trained based on historical latency durations.

[0074] It should be understood that the delay prediction model can also incorporate contextual features from historical voice conversations during training. In one exemplary approach, the training process of this delay prediction model can be as follows: First, construct a high-quality training dataset rich in contextual information based on historical voice conversations. For example, systematically record data from each voice interaction during the actual operation of the smart voice device. For each interaction, collect and associate a data sample, which should include at least: Feature vector: A set of parameters used to describe the system state and environment at the time of the interaction. The feature vector includes, but is not limited to, one or more of the following: Historical delay characteristics: The sequence of VAD delay durations actually measured in the past N interactions (N is a preset positive integer).

[0075] Ambient acoustic characteristics: background noise levels (such as signal-to-noise ratio SNR, dB) in the current or recent period, noise stability indicators, and classification results of specific noise types (such as music, human voices, traffic noise).

[0076] User behavior characteristics: If the system has voiceprint recognition capabilities, it may include user ID; historical average speech rate, volume, etc.

[0077] System and time characteristics: device type, microphone array operating mode, current system time (e.g., time of day, whether it is a weekday), etc.

[0078] Label value: The VAD delay duration that actually occurred in this interaction, corresponding to the feature vector and obtained through precise measurement (e.g., by retrospective analysis).

[0079] Then, to improve the training efficiency and prediction accuracy of the model, the constructed training dataset is preprocessed. This preprocessing includes data cleaning, feature scaling, and feature encoding, among other methods. Details will not be elaborated here.

[0080] Finally, based on the preprocessed training dataset, a regression model is selected and trained to learn the mapping relationship from feature vectors to latency. This training scheme can be based on either a gradient boosting decision tree model or a recurrent neural network model.

[0081] The gradient boosting decision tree model can be XGBoost, LightGBM, or CatBoost. These models exhibit excellent processing capabilities and high prediction accuracy for tabular data.

[0082] This recurrent neural network model can be a long short-term memory network or a gated recurrent unit. Such models are particularly suitable for processing and learning dependencies in time-series data, such as temporal variations in historical delay features and environmental acoustic features.

[0083] The model is trained using any of the above training methods to obtain a trained and validated latency prediction model. This latency prediction model is then embedded and deployed on a smart voice device terminal or a cloud server. During runtime, the model receives real-time feature vectors as input and outputs the predicted latency duration, which is then used directly as the preset buffer duration or used to calculate it.

[0084] Optionally, in this embodiment, the delay prediction model can be designed with an iterative update mechanism. That is, the terminal device continuously collects new data samples during continuous operation and achieves continuous model evolution through online learning or periodic offline retraining.

[0085] Based on the above, the following describes, with reference to the timing diagram shown in Figure 3, how the terminal device determines the delay duration and the retransmitted audio data. In a single voice interaction scenario, the speech recognition module, the voice activity detection module, and the caching module are each deployed with timers based on the same time base; alternatively, these three modules use a global timer deployed on the terminal device. When the voice activity detection module starts, it records the first moment; at the same time, the caching module starts caching the real-time audio data at the first moment to obtain the cached audio. When the voice activity detection module first determines that the real-time audio data contains active voice, it records the second moment and triggers the speech recognition module. The speech recognition module records the start timestamp of the speech recognition process when it receives the first frame of audio data. The delay duration of the voice activity detection is calculated from the second moment and the first moment. To ensure the validity of the retransmitted audio data, the terminal device can use the start timestamp as the end time point of the retransmitted audio data and the delay duration as the truncation duration of the retransmitted audio data, and determine the start time point of the retransmitted audio data based on the end time point and the truncation duration. Then, based on the start time point and the end time point, the retransmitted audio data is extracted from the cached audio.
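The timing relationship above can be sketched as a window lookup over the cached frames. This is a minimal illustration, assuming that each cached frame is stored with the timestamp of its first sample on the shared time base and that the window is half-open, so the frame stamped exactly at the start timestamp is left to the real-time stream. The function name and the numeric values are hypothetical.

```python
def extract_retransmission(cached, start_timestamp_ns, truncation_ns):
    """Select cached frames in [start_timestamp - truncation, start_timestamp).

    cached: list of (timestamp_ns, frame) pairs in ascending timestamp order.
    """
    begin_ns = start_timestamp_ns - truncation_ns   # start time point of the window
    return [
        (ts, frame)
        for ts, frame in cached
        if begin_ns <= ts < start_timestamp_ns      # end time point is exclusive
    ]


FRAME_NS = 20_000_000                                      # 20 ms per frame
cached = [(i * FRAME_NS, bytes(640)) for i in range(25)]   # frames at 0..480 ms

t1_ns = 0                        # VAD and caching start
t2_ns = 240_000_000              # VAD first reports active voice
start_timestamp_ns = 260_000_000 # ASR receives its first real-time frame
truncation_ns = t2_ns - t1_ns    # delay duration used as truncation duration

retransmitted = extract_retransmission(cached, start_timestamp_ns, truncation_ns)
print(len(retransmitted))  # 12 frames, stamped 20 ms through 240 ms
```

Because the end of the window is tied to the start timestamp rather than to T2, audio captured between the VAD decision and the moment the speech recognition module receives its first frame is also covered.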

[0086] 204. Generate audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after speech activity detection.

[0087] After obtaining the retransmitted audio data, the terminal device can concatenate the retransmitted audio data with the first audio data to obtain the audio data to be recognized.

[0088] In one exemplary solution, the terminal device can concatenate the retransmitted audio data and the first audio data in timestamp order to obtain the audio data to be recognized.

[0089] In this embodiment, since the cached audio data can be cached according to the block size of speech recognition, the retransmitted audio data obtained by the terminal device may also be segmented audio data. The retransmitted audio data can be understood as a set consisting of one or more audio data segments (or audio frames), whose overall timestamp range extends backward from the start timestamp until the data volume covers the truncation duration. Each audio data segment in this set carries its original acquisition timestamp.

[0090] Meanwhile, the audio acquisition module of the terminal device continues to operate, generating a real-time audio stream starting from the start timestamp of the speech recognition module. The first audio data is the initial part of this real-time audio stream, and it also consists of one or more audio data segments, with timestamps in the range [start timestamp, ...).

[0091] To ensure the orderliness and efficiency of data processing, the terminal device can activate or designate a dedicated data processing unit. Specifically, the terminal device can instantiate a data integration module or allocate a splicing buffer in memory. This module or buffer functions as a convergence point, receiving and temporarily storing audio data segments from the two different sources mentioned above (i.e., historical buffer and real-time acquisition).

[0092] Then, to ensure the continuity of the final generated data stream on the timeline, the data integration module of the terminal device places all audio data segments from the received retransmitted audio data, as well as all audio data segments from the first audio data, into the same logical set or list. The data integration module then strictly sorts all audio data segments in this set in ascending order according to their respective acquisition timestamps. This sorting operation generates a completely new sequence of audio data segments with monotonically increasing timestamps. Through this operation, the last data segment of the retransmitted audio data will be precisely positioned before the first data segment of the first audio data in time sequence. Finally, the data integration module concatenates the sorted audio data segment sequence end-to-end according to the new order, aggregating it into a single, continuous audio data stream. This aggregation operation can be a physical memory copy or a logical pointer link to form a data structure that can be sequentially read by downstream modules.
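The sort-then-concatenate procedure above can be sketched as follows. This is a minimal illustration that performs a physical copy of the payloads; the function `splice_segments` and the two-byte payloads are hypothetical, and the sketch assumes the two sources contain no duplicate timestamps.

```python
def splice_segments(retransmitted, first_audio):
    """Merge segments from both sources by acquisition timestamp and join the payloads.

    Each segment is a (timestamp_ns, payload_bytes) pair.
    Returns the ordered segment list and the single continuous byte stream.
    """
    # Place all segments in one list, then sort ascending by acquisition timestamp.
    ordered = sorted(list(retransmitted) + list(first_audio), key=lambda seg: seg[0])
    # Concatenate end-to-end in the new order (a physical copy of the payloads).
    stream = b"".join(payload for _, payload in ordered)
    return ordered, stream


retransmitted = [(40, b"cc"), (20, b"bb")]   # history segments, possibly out of order
first_audio = [(60, b"dd"), (80, b"ee")]     # real-time segments from the start timestamp
ordered, stream = splice_segments(retransmitted, first_audio)
print(stream)  # b'bbccddee'
```

Sorting on timestamps rather than relying on arrival order means the result stays the same even if the history segments reach the integration module after the first real-time segments.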

[0093] The continuous audio data stream generated after this sequence reconstruction can be defined as the audio data to be recognized. The audio data to be recognized starts from the time point obtained by tracing back from the start timestamp by the truncation duration, and achieves a seamless transition of acoustic features at the splicing point located at the start timestamp.

[0094] Finally, the terminal device can send the complete and time-correct audio data to be recognized into the input queue of its internal Automatic Speech Recognition (ASR) module for subsequent acoustic feature extraction, decoding, and text recognition processing. Through this method, the ASR module obtains complete speech information at the beginning of processing, thereby significantly improving recognition accuracy.

[0095] 205. Perform speech recognition on the audio data to be recognized to obtain the recognition result.

[0096] After acquiring the audio data to be recognized, the speech recognition module of the terminal device can represent the audio data as a continuous digital audio stream (e.g., pulse code modulation PCM data). Then, the speech recognition module of the terminal device uses the current speech recognition algorithm to perform speech recognition on the audio data to be recognized in order to obtain the recognition result.

[0097] It should be understood that the recognition result is typically represented as a text string. In some embodiments, the recognition result may further include: an overall confidence score, the confidence score of each word, and the timestamp information corresponding to each word in the audio data to be recognized. This recognition result may then be transmitted to a subsequent natural language understanding module for intent analysis, or presented directly to the user.

[0098] Optionally, in order for subsequent speech recognition to be performed normally, after obtaining the speech recognition result during this speech recognition process, the terminal device can also clear the cached audio data in the cache area.

[0099] The following describes the speech recognition method in this application with reference to the flowchart shown in Figure 4.

[0100] 1. The pre-buffering module is activated when audio input occurs, and the pre-buffering duration of the buffer area is dynamically adjusted based on the contextual features or feature vectors of the current voice session.

[0101] 2. Start the VAD module and perform detection. It should be understood that the VAD module starts at the same time as the pre-caching module starts caching.

[0102] 3. Record the first moment T1 when the VAD module is started.

[0103] 4. Determine whether the voice is active. If so, record the second moment T2 at which voice activity is first confirmed, trigger the retransmission signal, and transmit the real-time audio stream to the speech recognition module. If not, continue executing VAD.

[0104] 5. Calculate the time interval ΔT = T2 - T1. During this process, the terminal device can also determine whether the time interval and the preset duration meet the preset condition. If so, the time interval is carried in the retransmission trigger signal; if not, the preset duration is carried in the retransmission trigger signal.

[0105] 6. Trigger the retransmission signal.

[0106] 7. Perform retransmission and send the retransmitted audio data.

[0107] 8. The speech recognition module receives and splices the retransmitted audio data and the current audio to obtain the audio data to be recognized.

[0108] 9. Perform speech recognition on the audio data to be recognized to obtain the recognition result.

[0109] 10. Output the recognition result.

[0110] 11. Clear the cached audio data in the pre-caching module.

[0111] To better implement the speech recognition method in the embodiments of this application, a speech recognition device is also provided in the embodiments of this application. As shown in Figure 5, the speech recognition device 500 includes: a recording module 501, configured to record the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; and a processing module 502, configured to determine the delay duration for voice activity detection of the real-time audio data based on the second moment and the first moment; determine the retransmitted audio data based on the delay duration; generate audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection; and perform speech recognition on the audio data to be recognized to obtain a recognition result.

[0112] In this embodiment, the start time of voice activity detection and the trigger time at which active voice is stably determined to exist during the voice activity detection process are recorded. The difference between the trigger time and the start time can then be determined as the VAD judgment delay. Next, the retransmitted audio data for speech recognition is determined based on the delay duration, and the audio data to be recognized is supplemented with the retransmitted audio data. In this way, the front portion of the speech that would otherwise have been lost due to the VAD delay is successfully retransmitted and seamlessly prepended to the beginning of the speech recognition processing stream. This ensures that the voice data sent to the speech recognition system is a complete instruction starting from the initial moment of active voice, so that speech recognition can recognize voice instructions completely, misjudgment of user intent is avoided, and the coherence and reliability of intelligent voice interaction are improved.

[0113] In some embodiments of this application, the processing module 502 is specifically configured to: obtain the start timestamp of the first audio data; determine the truncation duration of the retransmitted audio data based on the delay duration; and extract the retransmitted audio data from the cached audio data based on the truncation duration and the start timestamp, wherein the cached audio data is obtained by caching the real-time audio data, and the start timestamp, the first moment, and the second moment are calculated based on a unified time base.

[0114] In this embodiment, by using the delay duration and the trigger timestamp for speech recognition, the speech processing system can flexibly adapt to various complex acoustic environments. Whether the input is a user's whisper in a quiet environment (which may result in a longer VAD delay) or a loud command against a noisy background (which may result in a shorter VAD delay), the system can compensate accurately by dynamically acquiring the delay duration. This adaptive capability enhances the system's stability and reliability, i.e., its robustness, in different scenarios. Furthermore, because the delay duration is determined dynamically, the retransmitted audio data in this embodiment is precisely trimmed, effective information, in contrast to retransmitting fixed-length buffered data that may contain a large amount of invalid silence. This prevents the speech recognition engine from wasting computational resources on meaningless silence data in the initial stage and allows it to focus on speech signals with acoustic features from the outset, thereby improving the computational efficiency of the entire recognition task to a certain extent.

[0115] In some embodiments of this application, the processing module 502 is specifically configured to: check whether the delay duration and the preset duration meet the preset condition, wherein the preset duration is obtained based on statistics of historical delay durations; when the preset condition is met, determine the delay duration to be the truncation duration; and when the preset condition is not met, determine the preset duration to be the truncation duration.

[0116] In this embodiment, a preset duration based on historical data is introduced, which brings the truncation duration more in line with the actual needs of the scenario and thereby makes the retransmitted audio data more accurate. At the same time, it balances compensation accuracy against system overhead and provides an efficient, adaptive, and easy-to-implement scheme for compensating voice front-end information, which has practical value for improving the user experience and market competitiveness of full-duplex voice interaction products.

[0117] In some embodiments of this application, the processing module 502 is specifically configured to: determine the truncation point in the cached audio data based on the start timestamp; and extract the retransmitted audio data from the cached audio data based on the truncation point and the truncation duration, wherein the duration of the retransmitted audio data is the truncation duration and the end time of the retransmitted audio data is the start timestamp.

[0118] In this embodiment, the actual physical start timestamp of the speech is calculated by introducing the moment when the speech recognition engine actually begins processing the audio stream and the time difference between the actual start of the speech and the triggering of speech recognition. By using an absolute timestamp as the sole basis for data operations, a deterministic and reliable mechanism unaffected by concurrency is provided for acquiring compensation data. This ensures perfect timing alignment between retransmitted audio data and subsequent real-time audio streams, preventing data loss due to truncation errors and avoiding the introduction of duplicate data due to excessive truncation, thus guaranteeing data consistency and high reliability throughout the entire speech processing chain.

[0119] In some embodiments of this application, the processing module 502 is specifically configured to: concatenate the first audio data and the retransmitted audio data according to their timestamps to obtain the audio data to be recognized.

[0120] In this embodiment, a deterministic splicing method based on timestamps can be used to construct an audio stream that is acoustically and temporally continuous. This effectively avoids acoustic discontinuities at splicing points that may result from simple data block appending, ensuring that the audio stream fed into the speech recognition engine is a high-fidelity complete audio stream identical to the original single recording, thereby improving the accuracy of speech recognition.

[0121] In some embodiments of this application, the processing module 502 is further configured to: activate the caching mechanism at the first moment, and segment and cache the real-time audio data according to preset rules to obtain the cached audio data.

[0122] In this embodiment, the start time of the cache is strictly bound to the start time of the VAD module. That is, as long as the VAD is running, the cache is running synchronously. Therefore, when the VAD first determines that there is active speech, the system can ensure that all audio data prior to this moment is completely stored in the cache. Through this unconditional pre-caching strategy, the availability of the audio data used for compensation (i.e., the audio data within the VAD delay window) is guaranteed by design.

[0123] In some embodiments of this application, the processing module 502 is further configured to: segment and cache the real-time audio data according to the block size of speech recognition and the preset cache duration to obtain the cached audio data.

[0124] In this embodiment, by aligning the data structure with the consumption end (ASR engine) at the data generation source (caching stage), the necessity of data reorganization or re-segmentation on the compensation path is eliminated. This avoids redundant data processing overhead and reduces the latency between VAD triggering and the start of ASR processing of complete audio, thereby significantly improving the operational efficiency of the entire voice front-end processing chain. Simultaneously, limiting the caching duration of the cached audio data fundamentally eliminates memory leaks or uncontrolled growth caused by improper cache management. This is crucial for embedded devices with very limited memory and computing resources (such as smart speakers, in-vehicle systems, wearable devices, etc.), ensuring the long-term stable operation of applications and preventing system crashes due to resource exhaustion.
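One way to picture a cache that starts together with the VAD, stores audio in ASR-sized blocks, and is bounded by the preset cache duration is the sketch below. The class name, its methods, and the use of a fixed-length `collections.deque` are assumptions for illustration only; this application does not prescribe a particular data structure.

```python
from collections import deque


class ChunkedAudioCache:
    """Caches audio in ASR-sized blocks, keeping at most max_cache_ms of audio."""

    def __init__(self, block_bytes: int, block_ms: int, max_cache_ms: int):
        self.block_bytes = block_bytes
        self.block_ms = block_ms
        # A bounded deque discards the oldest block once the limit is reached.
        self._blocks = deque(maxlen=max(1, max_cache_ms // block_ms))
        self._pending = bytearray()
        self._next_ts_ms = None

    def start(self, first_moment_ms: int) -> None:
        """Start caching at the first moment, i.e. when the VAD is started."""
        self._next_ts_ms = first_moment_ms

    def write(self, pcm: bytes) -> None:
        """Append real-time audio, emitting one block per block_bytes received."""
        if self._next_ts_ms is None:
            raise RuntimeError("call start() before write()")
        self._pending.extend(pcm)
        while len(self._pending) >= self.block_bytes:
            block = bytes(self._pending[:self.block_bytes])
            del self._pending[:self.block_bytes]
            self._blocks.append((self._next_ts_ms, block))
            self._next_ts_ms += self.block_ms

    def blocks(self):
        """Return the cached (timestamp_ms, block) pairs, oldest first."""
        return list(self._blocks)
```

Because each cached block already has the size the recognizer consumes, blocks taken from the cache can be passed on without being re-segmented.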

[0125] In some embodiments of this application, the preset cache duration is dynamically adjusted based on a latency prediction model, which is trained based on historical latency durations.

[0126] In this embodiment, a latency prediction model is trained that learns the deep correlation between historical latency duration and various relevant contextual features. During runtime, the model predicts the next possible VAD latency based on the current real-time contextual features and dynamically sets an optimal preset buffer duration based on this prediction. This transforms the setting of the buffer duration from a passive, static configuration into an active, intelligent, and forward-looking decision, ensuring sufficient capacity to capture potentially long-latency speech, thereby achieving an optimal dynamic balance between data capture integrity and system resource consumption.
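This application does not specify the form of the latency prediction model. Purely as a stand-in, the sketch below predicts the next delay from historical delay durations using an exponentially smoothed mean plus a deviation margin, and clamps the result to a floor and ceiling. It ignores the contextual features mentioned above, and the class name, smoothing constants, and bounds are all hypothetical.

```python
class DelayPredictor:
    """Stand-in predictor: smoothed mean of past VAD delays plus a margin."""

    def __init__(self, alpha: float = 0.125, beta: float = 0.25, k: float = 4.0):
        self.alpha, self.beta, self.k = alpha, beta, k
        self.mean_ms = None   # smoothed historical delay
        self.dev_ms = 0.0     # smoothed absolute deviation

    def observe(self, delay_ms: float) -> None:
        """Update the estimate with one measured delay duration."""
        if self.mean_ms is None:
            self.mean_ms = delay_ms
            self.dev_ms = delay_ms / 2
            return
        err = delay_ms - self.mean_ms
        self.dev_ms = (1 - self.beta) * self.dev_ms + self.beta * abs(err)
        self.mean_ms = self.mean_ms + self.alpha * err

    def cache_duration_ms(self, floor_ms: int = 200, ceiling_ms: int = 3000) -> int:
        """Return a preset cache duration sized to the predicted delay."""
        if self.mean_ms is None:
            return ceiling_ms  # no history yet: be conservative
        predicted = self.mean_ms + self.k * self.dev_ms
        return int(min(ceiling_ms, max(floor_ms, predicted)))
```

With stable delays the margin shrinks and the cache duration settles near the typical delay; an unusually long delay widens the margin so that the next utterance is more likely to be fully captured.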

[0127] In some embodiments of this application, the processing module 502 is further configured to: detect whether the cache duration of the cached audio data exceeds the preset cache duration; and when the cache duration exceeds the preset cache duration, clean up the cached audio data on a first-cached, first-cleared (first-in, first-out) basis.

[0128] In this embodiment, limiting the caching duration of the cached audio data fundamentally eliminates memory leaks or uncontrolled growth caused by improper cache management. This is crucial for embedded devices with limited memory and computing resources (such as smart speakers, in-vehicle systems, wearable devices, etc.), ensuring the long-term stable operation of applications and preventing system crashes due to resource exhaustion.
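The duration check and cleanup can be sketched as follows, assuming the cache is a deque of `(timestamp_ms, block)` pairs in arrival order and that the oldest blocks are the first to be removed. The function name and the block representation are illustrative assumptions.

```python
from collections import deque


def evict_expired(blocks: deque, block_ms: int, max_cache_ms: int) -> int:
    """Drop the oldest blocks until the cached span fits max_cache_ms.

    Each item in blocks is (timestamp_ms, pcm_bytes). Returns the number
    of blocks removed.
    """
    removed = 0
    # Cached span = end of the newest block minus start of the oldest block.
    while blocks and (blocks[-1][0] + block_ms - blocks[0][0]) > max_cache_ms:
        blocks.popleft()   # oldest block is cleared first
        removed += 1
    return removed
```

Calling this after every write keeps memory use bounded by the preset cache duration, even if that duration is changed while the device is running.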

[0129] This application also provides a computer device that integrates any of the speech recognition devices provided in this application. The computer device includes: one or more processors; a memory; and one or more applications, wherein the applications are stored in the memory and configured to be executed by the processor to perform the steps of the speech recognition method in any of the embodiments described above.

[0130] This application also provides a computer device that integrates any of the speech recognition devices provided in this application. For example, Figure 6 shows a structural schematic diagram of the computer device involved in the embodiments of this application. Specifically, the computer device may include components such as a processor 601 with one or more processing cores, a memory 602 with one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art will understand that the computer device structure shown in Figure 6 does not constitute a limitation on the computer device, which may include more or fewer components than shown, combine certain components, or have a different component arrangement. Wherein: The processor 601 is the control center of the computer device. It connects the various parts of the computer device via various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 602 and by calling data stored in the memory 602, thereby providing overall monitoring of the computer device. Optionally, the processor 601 may include one or more processing cores; preferably, the processor 601 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 601.

[0131] The memory 602 can be used to store software programs and modules. The processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, application programs required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.

[0132] The computer device also includes a power supply 603 that supplies power to the various components. Preferably, the power supply 603 can be logically connected to the processor 601 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 603 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components.

[0133] The computer device may also include an input unit 604, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

[0134] Although not shown, the computer device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 601 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602 to realize various functions, as follows: Record the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; determine the delay duration for voice activity detection on the real-time audio data based on the second moment and the first moment; determine the retransmitted audio data based on the delay duration; generate audio data to be recognized based on the retransmitted audio data and the first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection; perform speech recognition on the audio data to be recognized to obtain the recognition result.
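The sequence of steps listed above can be sketched end to end as follows. The sketch assumes audio arrives as `(timestamp_ms, pcm)` frames on one time base, takes the VAD and the recognizer as caller-supplied functions, and simply uses the delay duration itself as the length of the retransmitted window. The function name, the frame representation, and these simplifications are assumptions for illustration, not the implementation of this application.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

Frame = Tuple[int, bytes]   # (timestamp_ms on the shared time base, PCM bytes)
Result = TypeVar("Result")


def recognize_with_compensation(
    frames: Iterable[Frame],
    vad_triggers: Callable[[bytes], bool],
    recognize: Callable[[bytes], Result],
) -> Optional[Tuple[int, Result]]:
    """Return (delay_ms, recognition result), or None if no speech was detected."""
    cache: List[Frame] = []        # cached audio data
    first_audio: List[Frame] = []  # audio normally sent to speech recognition
    first_moment: Optional[int] = None
    second_moment: Optional[int] = None

    for ts, pcm in frames:
        if first_moment is None:
            first_moment = ts      # VAD and caching start together
        cache.append((ts, pcm))
        if second_moment is None:
            if vad_triggers(pcm):
                second_moment = ts  # first "voice active" decision
        else:
            first_audio.append((ts, pcm))

    if first_moment is None or second_moment is None or not first_audio:
        return None

    delay_ms = second_moment - first_moment
    start_ts = first_audio[0][0]   # start timestamp of the first audio data
    # Retransmitted audio: cached frames in [start_ts - delay, start_ts).
    retransmitted = [f for f in cache if start_ts - delay_ms <= f[0] < start_ts]
    to_recognize = b"".join(pcm for _, pcm in retransmitted + first_audio)
    return delay_ms, recognize(to_recognize)
```

In this sketch the audio handed to the recognizer begins one delay duration before the first audio data, so speech that arrived while the VAD was still deciding is included in the recognized audio.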

[0135] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by instructions, or by instructions controlling related hardware. These instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.

[0136] Therefore, embodiments of this application provide a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), a disk, or an optical disk, etc. A computer program is stored thereon, and the computer program is loaded by a processor to execute the steps in any of the speech recognition methods provided in embodiments of this application. For example, the computer program loaded by the processor can execute the following steps: Record the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; The delay duration for voice activity detection is determined based on the second moment and the first moment; The audio data to be retransmitted is determined based on the delay duration; Based on the retransmitted audio data and the first audio data, audio data to be recognized is generated. The first audio data is the initial audio data for speech recognition after speech activity detection. Speech recognition is performed on the audio data to be recognized to obtain the recognition result.

[0137] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the detailed descriptions of other embodiments above, which will not be repeated here.

[0138] In practice, each of the above units or structures can be implemented as an independent entity or can be arbitrarily combined to be implemented as the same or several entities. For the specific implementation of each of the above units or structures, please refer to the previous method embodiments, which will not be repeated here.

[0139] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0140] The above provides a detailed description of a speech recognition method, apparatus, computer device, and computer-readable storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A speech recognition method, characterized in that it comprises: recording the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; determining the delay duration for voice activity detection on the real-time audio data based on the second moment and the first moment; determining retransmitted audio data based on the delay duration; generating audio data to be recognized based on the retransmitted audio data and first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection; and performing speech recognition on the audio data to be recognized to obtain a recognition result.

2. The method according to claim 1, characterized in that determining the retransmitted audio data based on the delay duration comprises: obtaining the start timestamp of the first audio data; determining an extraction duration for the retransmitted audio data based on the delay duration; and extracting the retransmitted audio data from cached audio data based on the extraction duration and the start timestamp, wherein the cached audio data is obtained by caching the real-time audio data, and the start timestamp, the first moment, and the second moment are determined on a unified time base.

3. The method according to claim 2, characterized in that determining the extraction duration for the retransmitted audio data based on the delay duration comprises: detecting whether the delay duration meets a preset duration, the preset duration being obtained from statistics of historical delay durations; when the delay duration meets the preset duration, determining the delay duration to be the extraction duration; and when the delay duration does not meet the preset duration, determining the preset duration to be the extraction duration.

4. The method according to claim 2 or 3, characterized in that, The step of extracting the retransmitted audio data from the cached audio data based on the extraction duration and the start timestamp includes: The cutoff point is determined from the cached audio data based on the start timestamp; The retransmitted audio data is extracted from the cached audio data based on the cutoff point and the extraction duration, wherein the duration of the retransmitted audio data is the extraction duration, and the end time of the retransmitted audio data is the start timestamp.

5. The method according to any one of claims 1 to 3, characterized in that generating the audio data to be recognized based on the retransmitted audio data and the first audio data comprises: concatenating the first audio data and the retransmitted audio data in timestamp order to obtain the audio data to be recognized.

6. The method according to any one of claims 1 to 3, characterized in that, The method further includes: The caching mechanism is activated at the first moment, and the real-time audio data is segmented and cached according to preset rules to obtain the cached audio data.

7. The method according to claim 6, characterized in that, The step of segmenting and caching the real-time audio data according to a preset rule to obtain the cached audio data includes: The real-time audio data is segmented and cached according to the block size of speech recognition and the preset cache duration to obtain the cached audio data.

8. The method according to claim 6, characterized in that the method further comprises: detecting whether the cache duration of the cached audio data exceeds a preset cache duration; and when the cache duration exceeds the preset cache duration, cleaning up the cached audio data on a first-cached, first-cleared basis.

9. The method according to claim 7 or 8, characterized in that, The preset cache duration is dynamically adjusted based on a latency prediction model, which is trained based on historical latency durations.

10. A speech recognition device, characterized in that it comprises: a recording module, configured to record the first moment when voice activity detection is initiated on real-time audio data and the second moment when the real-time audio data is first determined to be in a voice active state; and a processing module, configured to: determine the delay duration for voice activity detection on the real-time audio data based on the second moment and the first moment; determine retransmitted audio data based on the delay duration; generate audio data to be recognized based on the retransmitted audio data and first audio data, wherein the first audio data is the initial audio data for speech recognition after voice activity detection; and perform speech recognition on the audio data to be recognized to obtain a recognition result.

11. A computer device, characterized in that, The computer device includes: One or more processors; Memory; and One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1 to 9.

12. A computer-readable storage medium, characterized in that, It stores a computer program, which is loaded by a processor to perform the steps of the method according to any one of claims 1 to 9.