Method and apparatus for detecting speech termination

By dynamically adjusting the length of the tail silence interval based on morphological analysis and part-of-speech information of user speech, the problem of speech recognition rate fluctuation in traditional speech recognition systems is solved, achieving higher recognition accuracy and a better user experience.

CN122245296A · Pending · Publication Date: 2026-06-19 · HYUNDAI MOTOR CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HYUNDAI MOTOR CO LTD
Filing Date
2025-10-09
Publication Date
2026-06-19

Abstract

This disclosure relates to a method and apparatus for detecting speech termination, the method comprising the steps of: receiving a user's speech; dynamically adjusting the length of the user's tail silence interval based on the type of the last word unit in the user's speech; and determining whether the user's speech has terminated based on whether another speech is received within the adjusted tail silence interval.

Description

[0001] This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0190299, filed with the Korean Intellectual Property Office on December 18, 2024, the entire disclosure of which is incorporated herein by reference.

Technical Field

[0002] This disclosure relates to a method and apparatus for detecting the termination of speech.

Background

[0003] The content described in this section provides background information relevant to this disclosure only and does not constitute prior art.

[0004] The operating logic of a traditional speech recognition system is as follows: when it is determined that there is no user input within a certain period of time, it terminates the speech input. Typically, if the speech signal remains below a certain threshold for more than a specified period of time, the traditional speech recognition system will identify the speech signal as a tail silence interval and consider the speech input to have ended.
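The fixed-interval behavior described above can be sketched as follows. This is a minimal illustration only, assuming the speech signal has already been reduced to one energy value per frame; the function name `find_fixed_endpoint` and its parameters are hypothetical and do not come from the disclosure.

```python
from typing import Optional, Sequence


def find_fixed_endpoint(
    frame_energies: Sequence[float],
    energy_threshold: float,
    tail_silence_frames: int,
) -> Optional[int]:
    """Return the frame index at which speech input is considered ended.

    Input is treated as ended once `tail_silence_frames` consecutive frames
    stay below `energy_threshold`. Returns None if that never happens.
    """
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= tail_silence_frames:
                return index
        else:
            silent_run = 0  # speech resumed, so restart the silence count
    return None


# A speaker who pauses for two frames in mid-sentence (frames 2 and 3).
energies = [0.9, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1]
short_interval = find_fixed_endpoint(energies, 0.2, 2)  # cuts off at the pause
long_interval = find_fixed_endpoint(energies, 0.2, 3)   # waits past the pause
```

With a two-frame interval the mid-sentence pause is mistaken for the end of input, while a three-frame interval survives the pause but makes every user wait longer. Whichever fixed value is chosen, it applies to all speakers alike.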

[0005] Traditional methods have limitations because they use a fixed-length tail silence interval as a criterion, thus failing to consider the differences in users' speech characteristics. For example, speech recognition rates may fluctuate depending on the user's speaking speed and the length of pauses between word units. For users who speak slowly, traditional speech recognition systems may misinterpret and terminate speech input even if the user has not finished speaking. Conversely, for users who speak quickly, the system may still incur unnecessary waiting time after the user has actually finished speaking.

[0006] Because traditional methods use a fixed-length tail silence interval as the basis for determination, they do not fully consider the fluency of speech or the speaker's intent. This leads to a technical defect: even if the user plans to continue speaking to complete a sentence, the voice input may be terminated prematurely.

Summary of the Invention

[0007] The disclosed embodiments provide a technical solution that improves the accuracy of speech recognition by dynamically adjusting the length of the tail silence interval by taking into account the user's voice pattern and intent.

[0008] Specifically, the disclosed embodiments dynamically adjust the length of the appropriate tail silence interval based on the morphological analysis results of user speech, thereby preventing the speech recognition mode from terminating prematurely and reducing unnecessary waiting time.

[0009] The objectives to be achieved by this disclosure are not limited to those described above. Other objectives not explicitly mentioned will become apparent to those skilled in the art from the following description.

[0010] Embodiments of this disclosure provide a method for determining whether a speech utterance has terminated. The method includes the following steps: receiving a user's speech; dynamically adjusting the length of the user's trailing silence interval based on the type of the last word unit in the user's speech; and determining whether the user's speech has terminated based on whether another speech is received within the adjusted trailing silence interval.

[0011] Another embodiment of this disclosure provides an apparatus for determining whether a speech utterance has terminated. The apparatus includes: at least one memory; and at least one processor configured to execute instructions to receive a user's speech; dynamically adjust the length of a user's tail silence interval based on the type of the last word unit in the user's speech; and determine whether the user's speech has terminated based on whether another speech is received within the dynamically adjusted tail silence interval.

[0012] Unlike traditional speech recognition systems that use fixed or threshold-based tail silence intervals, this disclosure employs syntactic context analysis based on real-time part-of-speech information. By dynamically adjusting the tail silence interval based on the grammatical attributes of the last word unit in the user's utterance and referencing a predefined list of representative commands, this disclosure provides a technical solution and achieves superior speech recognition performance. This dynamic adjustment differs significantly from current fixed-interval or static-threshold methods and represents an improvement upon them.

Brief Description of the Drawings

[0013] Figure 1 is a block diagram schematically illustrating a speech termination detection device according to an embodiment of the present disclosure.

[0014] Figure 2 is a block diagram schematically illustrating a controller according to an embodiment of the present disclosure.

[0015] Figure 3 is a flowchart illustrating the operation of a speech termination detection device according to an embodiment of the present disclosure.

[0016] Figure 4 is a schematic block diagram of a computing device that can be used to implement the method or apparatus according to this disclosure.

Detailed Description

[0017] The various embodiments of this disclosure will now be described in detail with reference to the accompanying drawings. In the following drawings, the same reference numerals will be used to refer to the same or equivalent elements, even if the elements appear in different drawings. Furthermore, in the following description of the various embodiments, detailed descriptions of well-known functions and structures included therein have been omitted for clarity and brevity.

[0018] Furthermore, terms such as first, second, A, B, (a), (b), etc., are used only to distinguish one component from another and are not intended to imply or indicate the type, order, or sequence of components. Throughout this specification, when a part "comprises" or "includes" a component, it should be understood that this part may also include other components unless explicitly stated otherwise. When a component, device, element, part, unit, module, etc., of this disclosure is described as having a certain purpose or performing a certain operation or function, that component, device, or element should be regarded herein as "configured" to perform that purpose or perform that operation or function. Each "part," "unit," "module," "component," "device," "element," etc., may be embodied independently or may be included together with a processor and memory (e.g., a non-transitory computer-readable medium) as part of that device.

[0019] The following detailed description, taken in conjunction with the accompanying drawings, is intended to describe various embodiments of the present disclosure, but is not intended to limit the scope of the disclosure to the embodiments described herein.

[0020] Figure 1 is a block diagram schematically showing a speech termination detection device 10 according to an embodiment of the present disclosure.

[0021] The speech termination detection device 10 according to the embodiments of the present disclosure may include all or some of the following components: the speech recognition input device 100, the controller 200, and the speech recognition output device 300. The components shown in Figure 1 represent functionally distinct elements, but in a real physical environment one or more of them can be integrated together.

[0022] The speech recognition input device 100 is a device for acquiring spoken words from a speaker. The speech recognition input device 100 may include a microphone or other speech acquisition sensors. The speech recognition input device 100 can convert the acquired speech data into digital signals to generate speech data for subsequent processing. Furthermore, the speech recognition input device 100 may also include filtering functions to remove noise from the surrounding environment or optimize the speech signal.

[0023] Controller 200 can analyze speech data and determine the termination (i.e., end) of a speaker's utterance. Controller 200 can perform morphological analysis and part-of-speech tagging on the speech data. Controller 200 can dynamically adjust the length of a trailing silence interval based on the information obtained from the analysis. The term "length of a trailing silence" refers to the length of time the speech signal remains below a certain threshold after the speaker's utterance ends. The length of the trailing silence interval can be adjusted according to the semantic features of the utterance (e.g., the type of the last word unit, the likelihood of subsequent utterances, etc.). Controller 200 can also learn and apply the speaker's personalized characteristics (e.g., the speaker's utterance rate, pattern, etc.) as needed.

[0024] The voice recognition output device 300 may include a display, a speaker, an LED prompt device, etc. The voice recognition output device 300 can interact with the user by visually displaying the recognized speech text or outputting feedback speech. Furthermore, it can use LED prompt devices to intuitively convey the status of the voice recognition mode (e.g., active, standby, terminated). As needed, the voice recognition output device 300 can provide status prompts and additional information related to the voice recognition results to enhance the user experience; it can also collaborate with other devices to provide comprehensive feedback.

[0025] Figure 2 is a block diagram schematically showing a controller 200 according to an embodiment of the present disclosure.

[0026] The controller 200 according to embodiments of this disclosure may include all or some of the following components: a speech recognition engine 210, a natural language processing engine 220, a data management unit 230, and a tail silence adjustment unit 240. Those skilled in the art will understand that one or more of the engines or units of the controller described herein (the speech recognition engine 210, the natural language processing engine 220, the data management unit 230, and the tail silence adjustment unit 240) may be implemented by hardware, or by a specially configured processor (e.g., one or more processors 420, described in detail below with reference to Figure 4) executing computer-executable instructions (e.g., executable software code) stored on a tangible computer-readable medium or non-transitory memory. It should be understood that the disclosed embodiments may be implemented as different or separate modules of the speech termination detection device 10, or as a separate computer system combined with the speech termination detection device 10.

[0027] The speech recognition engine 210 acquires the speaker's speech received by the microphone inside the vehicle and converts the speech into text using an STT (Speech to Text) engine. The STT engine can convert the speech signal into text by applying a speech recognition algorithm or a deep learning model to the speech signal. The speaker's speech is a speech signal, and the speech recognition engine 210 receives the speech signal corresponding to the speaker's speech.

[0028] The Natural Language Processing Engine 220 can understand and recognize speaker speech by classifying the speaker's intended meaning and slots. The speaker's intended meaning can be categorized into several types, such as making a phone call, searching for a destination, playing a radio broadcast, providing route navigation, or playing a song. The speaker's intended meaning can also be categorized into different domains, such as changing a destination, adding a waypoint, modifying a waypoint, making a phone call, and out-of-domain (OOD) commands.

[0029] A slot refers to the object required to provide information based on the speaker's intended meaning, and can be predefined for each type of intention. For example, the slot corresponding to the intention of setting a route could be "destination" or "waypoint," and the keywords corresponding to that slot could include "home" or "company."

[0030] The natural language processing engine 220, for example, can extract information such as domain, entity names, and speech acts from the input statement through a natural language understanding (NLU) engine, and extract intent and slots based on the extraction results.

[0031] A domain refers to information used to identify the subject matter of a speaker's speech. For example, a domain representing various topics such as vehicle control, information provision, text transmission, and navigation functions can be determined based on the input utterance.

[0032] Entity names refer to proper nouns such as people's names, place names, organization names, time, date, and currency. Named Entity Recognition (NER) is the task of identifying entity names in a statement and determining the type of the identified entity name. NER can be used to extract important keywords from a statement and understand the meaning of the statement.

[0033] Speech act analysis is the task of analyzing the intent of speech, used to determine the intent of a user's speech, such as whether the user is asking a question, making a request, providing a response, or expressing emotion.

[0034] Information such as domain, entity names, and speech acts can be used for at least one of the following operations: classifying the speaker's intent, determining slots, or generating responses to the speaker's utterances. Specifically, the NLU engine can segment the input sentence into morphemes using a morphological analyzer, map these morphemes to a vector space, group the mapped vectors to classify the intent of the input sentence, and extract the components of the input sentence that correspond to the intent's slots as entities.

[0035] For example, if the input statement is "Call Kim Cheol-soo," the NLU engine will segment the statement into chunks such as "Kim Cheol-soo," "Give," "Phone," "Call," and "Please." Based on these chunks, the NLU engine determines that the speaker's intent for the input statement is "Make a phone call"; the slot corresponding to the speaker's intent is "Call recipient." In this case, the NLU engine can extract "Kim Cheol-soo" as a keyword.

[0036] For example, if the input statement is "turn on the air conditioner", the speaker's intention is "turn on the air conditioner", and the slots corresponding to the speaker's intention are "temperature" and "fan speed".
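The relationship between intents and their predefined slots in the examples above can be pictured as a simple lookup. The sketch below is illustrative only: the identifiers `INTENT_SLOTS` and `NluResult` and the snake_case intent and slot names are hypothetical, chosen to mirror the examples in the text rather than any actual NLU engine interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical intent-to-slot schema, built from the examples in the text.
INTENT_SLOTS: Dict[str, List[str]] = {
    "make_phone_call": ["call_recipient"],
    "set_route": ["destination", "waypoint"],
    "turn_on_air_conditioner": ["temperature", "fan_speed"],
}


@dataclass
class NluResult:
    """An intent plus the slot values extracted from one utterance."""

    intent: str
    slots: Dict[str, str] = field(default_factory=dict)

    def missing_slots(self) -> List[str]:
        """Slots predefined for the intent that the utterance did not fill."""
        return [name for name in INTENT_SLOTS[self.intent] if name not in self.slots]


# "Call Kim Cheol-soo": the single slot of the intent is filled by a keyword.
call = NluResult("make_phone_call", {"call_recipient": "Kim Cheol-soo"})

# "Turn on the air conditioner": both slots of the intent remain unfilled.
aircon = NluResult("turn_on_air_conditioner")
```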

[0037] The data management unit 230 stores data such as speaker speech patterns, speech rate, pronunciation characteristics, and average tail silence length during speech. The data management unit 230 can optimize speech recognition performance based on each speaker's personalized data. The data management unit 230 also saves the length of the default tail silence interval learned for each speaker and uses it for speech termination determination. The data management unit 230 can provide learning data for updating the speech recognition model and natural language processing model based on data from new speakers or changes in the environment. The data management unit 230 sends the morphological analysis results to the tail silence adjustment unit 240.

[0038] The tail silence adjustment unit 240 determines the endpoint of the input speech and dynamically adjusts the length of the tail silence interval based on the type of the last word unit. The type of the last word unit can refer to its part-of-speech type; that is, the length of the user's tail silence interval is adjusted according to the analyzed part-of-speech type. The analyzed part-of-speech type can include at least one of the following: common noun, proper noun, dependent noun, pronoun, numeral, determiner, attributive adjective, adverb, particle, conjunctive suffix, terminal suffix, and suffix. The tail silence adjustment unit 240 determines whether the part-of-speech type of the last word unit in the user speech received from the data management unit 230 is a terminal suffix. If so, the tail silence adjustment unit 240 recognizes that the user intends to terminate the speech and shortens the length of the user's tail silence interval by an offset, thereby reducing the user's waiting time. If the part-of-speech type of the last word unit in the received user speech is determined to be any one of numeral, conjunctive suffix, attributive adjective, adverb, particle, or suffix, the tail silence adjustment unit 240 extends the length of the user's tail silence interval by an offset, thereby increasing the waiting time.
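The shorten-or-extend rule in this paragraph can be sketched as a small function. This is a non-authoritative sketch: the enum `PartOfSpeech`, the function `adjust_tail_silence`, the millisecond units, and the 200 ms default offset are assumptions made for illustration. The disclosure does not give concrete interval or offset values, and it treats nouns separately, so every remaining type is simply left unchanged here.

```python
from enum import Enum, auto


class PartOfSpeech(Enum):
    COMMON_NOUN = auto()
    PROPER_NOUN = auto()
    DEPENDENT_NOUN = auto()
    PRONOUN = auto()
    NUMERAL = auto()
    DETERMINER = auto()
    ATTRIBUTIVE_ADJECTIVE = auto()
    ADVERB = auto()
    PARTICLE = auto()
    CONJUNCTIVE_SUFFIX = auto()
    TERMINAL_SUFFIX = auto()
    SUFFIX = auto()


# Types after which the text says more speech is expected.
EXTEND_TYPES = frozenset({
    PartOfSpeech.NUMERAL,
    PartOfSpeech.CONJUNCTIVE_SUFFIX,
    PartOfSpeech.ATTRIBUTIVE_ADJECTIVE,
    PartOfSpeech.ADVERB,
    PartOfSpeech.PARTICLE,
    PartOfSpeech.SUFFIX,
})


def adjust_tail_silence(base_ms: int, last_pos: PartOfSpeech, offset_ms: int = 200) -> int:
    """Return the tail silence interval adjusted for the last word unit's type."""
    if last_pos is PartOfSpeech.TERMINAL_SUFFIX:
        # The user appears to have finished: wait less (never below zero).
        return max(0, base_ms - offset_ms)
    if last_pos in EXTEND_TYPES:
        # More speech is likely to follow: wait longer.
        return base_ms + offset_ms
    # Assumption: all other types keep the base interval in this sketch.
    return base_ms
```

For instance, with a hypothetical 700 ms base interval, a terminal suffix yields 500 ms and a particle yields 900 ms.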

[0039] When the part of speech of the last word in a utterance is a common noun or a proper noun, the tail silence adjustment unit 240 can compare the word with a pre-set representative instruction list, and then determine whether the probability of another word following the last word is higher than the critical probability. The representative instruction list refers to a predefined set of instructions and vocabulary.

[0040] Specifically, the representative instruction list is a reference list for determining whether a specific utterance can be recognized as an instruction. If the probability that another word unit will follow the last word unit is higher than the critical probability, the tail silence adjustment unit 240 can extend the length of the user's tail silence interval by an offset. Conversely, if that probability is lower than the critical probability, the length of the user's tail silence interval remains in its initial state. In other words, depending on the situation, the tail silence adjustment unit 240 can shorten, extend, or maintain the length of the tail silence interval. Subsequently, the tail silence adjustment unit 240 can determine whether another utterance is received within the adjusted tail silence interval. When another utterance is received, the tail silence adjustment unit 240 determines that the user's utterance has not terminated. Conversely, when no other utterance is received, the tail silence adjustment unit 240 determines that the user's utterance has terminated. The tail silence adjustment unit 240 can apply a subdivided offset according to the part-of-speech type of the last word unit in the utterance. For example, when the possessive particle "的" is recognized, the probability that a proper noun or a common noun will follow is relatively high, so a longer offset can be applied than when the adverbial particle "到" is recognized. In addition, when the connecting suffix "和" is recognized, the length of the tail silence interval can be further extended to increase the possibility of receiving another utterance.
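The noun check and the subdivided offsets described above might look like the following sketch. All numbers are invented for illustration. The disclosure does not say how the follow-on probability is obtained, so a hypothetical lookup table keyed by words from the representative instruction list stands in for it, and words missing from the table are assumed to have probability 0.0.

```python
from typing import Mapping

# Hypothetical follow-on probabilities for words in a representative
# instruction list. The values are made up for this sketch.
FOLLOW_PROBABILITY: Mapping[str, float] = {
    "temperature": 0.9,  # usually followed by a value ("... to 72 degrees")
    "radio": 0.3,
}

# Hypothetical subdivided offsets (ms) keyed by the last morpheme. The text
# only states that the possessive particle warrants a longer offset than the
# adverbial particle, and that the connecting suffix is extended further.
MORPHEME_OFFSET_MS: Mapping[str, int] = {
    "的": 400,  # possessive particle
    "到": 200,  # adverbial particle
    "和": 600,  # connecting suffix
}


def adjust_for_noun(
    base_ms: int,
    last_word: str,
    critical_probability: float = 0.5,
    offset_ms: int = 200,
) -> int:
    """Extend the interval only if another word unit is likely to follow."""
    probability = FOLLOW_PROBABILITY.get(last_word, 0.0)  # assumed default
    if probability > critical_probability:
        return base_ms + offset_ms
    return base_ms  # interval stays in its initial state


def offset_for_morpheme(morpheme: str, default_ms: int = 200) -> int:
    """Look up a morpheme-specific offset, falling back to a default."""
    return MORPHEME_OFFSET_MS.get(morpheme, default_ms)
```

Keeping the probabilities and offsets in tables rather than in code would let them be tuned per language or per speaker, although the disclosure itself does not prescribe any particular storage.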

[0041] Table 1 below is a morpheme table when the user says "Set the temperature to 72 degrees".

[0042] 【Table 1】

[0043]

[0044]

[0045] Referring to Table 1, the first word unit "(temperature)" is a common noun. Since the probability that another word unit will follow it is higher than the critical probability, the tail silence adjustment unit 240 can extend the length of the user's tail silence interval by an offset. The last morpheme of the second word unit "22 (℃) (to 72 degrees)" is "(to)", which is an adverbial particle. Since the last part-of-speech type of the second word unit is an adverbial particle, the tail silence adjustment unit 240 can extend the length of the tail silence interval by an offset. The last morpheme of the third word unit "(set)" is a conjunctive suffix. Since the last part-of-speech type of the third word unit is a conjunctive suffix, the tail silence adjustment unit 240 can extend the length of the tail silence interval by an offset. The last morpheme of the fourth word unit is a terminal suffix. Since the last part-of-speech type of the fourth word unit is a terminal suffix, the tail silence adjustment unit 240 can determine that the speech has ended and shorten the length of the tail silence interval by an offset.

[0046] Figure 3 is a flowchart illustrating the operation of the speech termination detection device 10 according to an embodiment of the present disclosure.

[0047] In step S302, the speech recognition input device 100 receives the user's speech. The speech recognition engine 210 receives the speech signal corresponding to the speaker's speech. The natural language processing engine 220 can process the speech signal, for example, by extracting information such as domain, entity names, and speech acts from the input statement using an NLU engine, and extracting intent and slots based on the extraction results. Specifically, the NLU engine can segment the input statement into morphological units using a morphological analyzer, map these morphemes to a vector space, group the mapped vectors to classify intents according to the input statement, and extract other components in the input statement corresponding to the intent slots as entities. In other words, the natural language processing engine 220 can analyze the part-of-speech of the last word unit in the user's speech.

[0048] In step S304, the data management unit 230 sends the type of the last word unit in the user's speech to the tail silence adjustment unit 240. Here, the type of the last word unit may refer to the part-of-speech type. The analyzed part-of-speech type may include at least one of the following: common noun, proper noun, dependent noun, pronoun, numeral, determiner, attributive adjective, adverb, particle, conjunctive suffix, terminal suffix, and suffix.

[0049] In step S306, the tail silence adjustment unit 240 determines whether the part-of-speech type of the last word unit in the user speech received from the data management unit 230 is a terminal suffix.

[0050] If the part-of-speech type of the last word unit in the utterance is determined to be a terminal suffix ("Yes" in S306), then in step S308 the tail silence adjustment unit 240 recognizes that the user intends to terminate the utterance and shortens the length of the user's tail silence interval by an offset, thereby reducing the user's waiting time. If it is determined not to be a terminal suffix ("No" in S306), then in step S310 the tail silence adjustment unit 240 determines whether the part-of-speech type of the last word unit in the received user utterance is any one of the following: numeral, conjunctive suffix, attributive adjective, adverb, particle, or suffix. If it is ("Yes" in S310), then in step S312 the tail silence adjustment unit 240 extends the length of the user's tail silence interval by an offset, thereby increasing the waiting time. If the part-of-speech type of the last word unit is a common noun or a proper noun ("No" in S310), then in step S314 the tail silence adjustment unit 240 compares the last word unit with a preset representative instruction list and determines whether the probability that another word unit will follow it is higher than a critical probability. The representative instruction list is a predefined set of instructions and vocabulary that serves as a reference for determining whether a specific utterance can be identified as an instruction. If the probability that another word unit will follow is higher than the critical probability ("Yes" in S314), then in step S312 the tail silence adjustment unit 240 can extend the length of the user's tail silence interval by an offset. Conversely, if that probability is lower than the critical probability ("No" in S314), then in step S316 the length of the user's tail silence interval is maintained in its original state. In other words, depending on the situation, the tail silence adjustment unit 240 can shorten, lengthen, or maintain the length of the tail silence interval. Subsequently, in step S318, the tail silence adjustment unit 240 can determine whether another speech is received within the adjusted tail silence interval. When another speech is received ("Yes" in S318), the tail silence adjustment unit 240 determines that the user's speech has not terminated, and the speech termination detection device 10 continues to receive the other speech. On the other hand, when no other speech is received ("No" in S318), the tail silence adjustment unit 240 determines that the user's speech has terminated, and the speech termination detection device 10 prompts the processor 420 to recognize the termination of the speech. Once the termination of the statement is detected, the processor 420 generates and processes the statement and sends the processed statement to, for example, the vehicle control system (i.e., for controlling the vehicle), or to other hardware components / devices that can be controlled by the user's voice (e.g., navigation, messaging, and audio hardware systems in the vehicle).
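The final check in step S318 can be illustrated offline on a list of already-timestamped speech segments. This is a simplified sketch, not the device's real-time logic: `Segment`, `collect_utterance`, and the `interval_after` callback are hypothetical names, and the callback stands in for whatever adjusted tail silence interval the earlier steps produced for each segment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    """One stretch of detected speech, with times in milliseconds."""

    start_ms: int
    end_ms: int
    text: str


def collect_utterance(
    segments: Sequence[Segment],
    interval_after: Callable[[Segment], int],
) -> List[Segment]:
    """Group the leading segments that belong to a single utterance.

    After each segment, the next one is accepted only if it starts within
    the adjusted tail silence interval; otherwise the utterance is treated
    as terminated and the remaining segments are left out.
    """
    if not segments:
        return []
    utterance = [segments[0]]
    for following in segments[1:]:
        gap_ms = following.start_ms - utterance[-1].end_ms
        if gap_ms <= interval_after(utterance[-1]):
            utterance.append(following)  # "Yes" in S318: speech continues
        else:
            break  # "No" in S318: utterance has terminated
    return utterance


segments = [
    Segment(0, 1000, "set the temperature"),
    Segment(1800, 2600, "to 72 degrees"),
    Segment(4000, 4300, "please"),
]
# With an extended 900 ms interval, the 800 ms pause does not end the utterance.
extended = collect_utterance(segments, lambda segment: 900)
# With a 700 ms interval, the same pause cuts the utterance after one segment.
shortened = collect_utterance(segments, lambda segment: 700)
```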

[0051] Figure 4 is a block diagram schematically illustrating a computing device that can be used to implement the methods or apparatus according to embodiments of this disclosure.

[0052] The computing device 40 may include some or all of the following components: non-transitory memory 400, processor 420, storage device 440, input / output interface 460, and communication interface 480. The computing device 40 may be a fixed computing device such as a desktop computer or server, or a mobile computing device such as a laptop or smartphone. The computing device 40 may include dedicated hardware accelerators capable of efficiently processing computations related to artificial intelligence models; for example, the computing device 40 may include a graphics processing unit (GPU), a tensor processing unit (TPU), or a neural processing unit (NPU).

[0053] Memory 400 may store a program that causes processor 420 to perform the methods or operations described in the embodiments of this disclosure. For example, the program may include a plurality of instructions / computer-executable instructions executable by processor 420. The methods or operations described above can be implemented by processor 420 executing these instructions / computer-executable instructions. Memory 400 may be a single memory or multiple memories. In this case, the information required for the methods or operations described in the embodiments of this disclosure may be stored in a single memory or split and stored in multiple memories. When memory 400 is composed of multiple memories, these memories may be physically separated from each other. Memory 400 may include at least one of volatile memory and non-volatile memory. Volatile memory includes static random access memory (SRAM) or dynamic random access memory (DRAM), and non-volatile memory includes flash memory.

[0054] Processor 420 may include at least one chip capable of executing at least one instruction. Processor 420 may execute instructions / computer-executable instructions stored in memory 400. Processor 420 may be a single processor or multiple processors.

[0055] Even when the power supply to the computing device 40 is interrupted, the storage device 440 can still retain the stored data. For example, the storage device 440 may include non-volatile memory, or storage media such as magnetic tape, optical disc, or magnetic disk. Programs stored in the storage device 440 can be loaded into the memory 400 before being executed by the processor 420. The storage device 440 may store files written in a programming language, and programs generated from these files by tools such as compilers can be loaded into the memory 400. Furthermore, the storage device 440 may also store data to be processed by the processor 420 and / or data that has already been processed by the processor 420.

[0056] Input / output interface 460 provides a connection interface to input devices such as a keyboard and mouse, and / or output devices such as a monitor and printer. Users can trigger processor 420 to execute programs via input devices and / or check the processing results of processor 420 via output devices.

[0057] The communication interface 480 provides access to external networks. The computing device 40 can communicate with other devices through the communication interface 480.

[0058] The embodiments disclosed herein improve the speech recognition rate by dynamically adjusting the tail silence value based on the last part of speech of the word unit, thereby preventing the speech recognition mode from terminating prematurely before the sentence is completed.

[0059] The technical effects of this disclosure are not limited to those described above. Those skilled in the art can clearly understand other effects not explicitly mentioned through the foregoing description.

[0060] This disclosure provides a technical solution for improving a speech recognition system. By dynamically adjusting the end-of-speech silence interval through real-time analysis of the part-of-speech information of the user's speech, the system prevents premature termination of speech recognition and improves recognition accuracy. End-of-speech silence adjustment is not simply an abstract concept or thought process, but is achieved through specific hardware and software interactions within the speech recognition system. The technical effects of this disclosure include improved termination detection accuracy, reduced unnecessary waiting time, and enhanced support for multi-intent utterances.

[0061] The various elements of the apparatus or method according to this disclosure can be implemented by hardware, software, or a combination of hardware and software. The function of each element can be implemented by software, and a microprocessor can be configured to execute the corresponding software function of each element.

[0062] Various embodiments of the systems and techniques described herein can be implemented using digital electronic circuits, integrated circuits, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), computer hardware, firmware, software, and / or combinations thereof. These embodiments may include implementations of one or more computer programs that execute on a programmable system comprising at least one programmable processor (which may be a dedicated processor or a specially configured processor) connected to a storage system, at least one input device, and at least one output device to receive and transmit data and instructions. A computer program (also referred to as a program, software, software application, or code) contains computer-executable instructions that are executed by the programmable processor and stored in a computer-readable recording medium.

[0063] Computer-readable recording media can include all types of storage devices capable of storing computer-readable data. These media can be non-volatile or non-transitory media such as read-only memory (ROM), random access memory (RAM), compact disc read-only memory (CD-ROM), magnetic tape, floppy disk, or optical data storage devices. They can also include transitory media such as data transmission media. Furthermore, computer-readable recording media can be distributed across multiple computer systems connected via a network, allowing computer-readable program code to be stored and executed in a distributed manner.

[0064] Although the flowcharts / timing diagrams in this specification show the operations as being performed sequentially, this merely illustrates the technical concept of the embodiments of this disclosure. Those skilled in the art should understand that various modifications and changes can be made without departing from the core features of the embodiments of this disclosure. For example, the execution order shown in the flowcharts / timing diagrams can be changed, and some operations can be executed in parallel. Therefore, the flowcharts / timing diagrams are not limited to the chronological order shown.

[0065] Although various embodiments of this disclosure have been described for illustrative purposes, those skilled in the art will understand that various modifications, additions, and substitutions can be made to this disclosure without departing from the claimed concept and scope. The various embodiments of this disclosure have been described for the sake of brevity and clarity only, and the scope of the technical concept of these embodiments is not limited to the illustrated content. Consequently, those skilled in the art should understand that the scope of the claimed disclosure is not limited by the above embodiments but is defined by the claims and their equivalents.

Claims

1. A computer-implemented method for determining whether a user's speech has terminated, the method comprising: receiving a user's speech; dynamically adjusting a length of a tail silence interval based on a type of a last word unit in the user's speech; and determining whether the user's speech has terminated based on whether another speech is received within the adjusted tail silence interval.

2. The method according to claim 1, wherein dynamically adjusting the length of the tail silence interval comprises: shortening the length of the tail silence interval based on the last word unit being a terminator suffix.

3. The method according to claim 1, wherein dynamically adjusting the length of the tail silence interval comprises: extending the length of the tail silence interval based on the last word unit being any one of a numeral, a conjunctive suffix, an attributive adjective, an adverb, an auxiliary word, and a suffix.

4. The method according to claim 1, wherein dynamically adjusting the length of the tail silence interval comprises: extending the length of the tail silence interval based on a probability that the last word unit is followed by another word unit being greater than a critical probability.

5. The method according to claim 1, wherein determining whether the user's speech has terminated comprises: determining that the user's speech has not terminated based on another speech being received within the adjusted tail silence interval.

6. The method according to claim 1, wherein determining whether the user's speech has terminated comprises: determining that the user's speech has terminated based on no other speech being received within the adjusted tail silence interval.

7. An apparatus comprising: at least one memory configured to store computer-executable instructions; and at least one processor configured to execute the computer-executable instructions to: receive a user's speech; dynamically adjust a length of a tail silence interval based on a type of a last word unit in the user's speech; and determine whether the user's speech has terminated based on whether another speech is received within the adjusted tail silence interval.

8. The apparatus according to claim 7, wherein, to dynamically adjust the length of the tail silence interval, the at least one processor is further configured to: shorten the length of the tail silence interval based on the last word unit being a terminator suffix.

9. The apparatus according to claim 7, wherein, to dynamically adjust the length of the tail silence interval, the at least one processor is further configured to: extend the length of the tail silence interval based on the last word unit being any one of a numeral, a conjunctive suffix, an attributive adjective, an adverb, an auxiliary word, and a suffix.

10. The apparatus according to claim 7, wherein, to dynamically adjust the length of the tail silence interval, the at least one processor is further configured to: extend the length of the tail silence interval based on a probability that the last word unit is followed by another word unit being greater than a critical probability.

11. The apparatus according to claim 7, wherein, to determine whether the user's speech has terminated, the at least one processor is further configured to: determine that the user's speech has not terminated based on another speech being received within the adjusted tail silence interval.

12. The apparatus according to claim 7, wherein, to determine whether the user's speech has terminated, the at least one processor is further configured to: determine that the user's speech has terminated based on no other speech being received within the adjusted tail silence interval.
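
For illustration only, and not as part of the claims, the probability-based extension and the termination decision recited above might be sketched in Python as follows. The function names, the default critical probability of 0.5, the 700 ms extension, and the millisecond timestamps are all assumptions made for this sketch; the disclosure does not specify them. The continuation probability is assumed to come from some external language model.

```python
from typing import Optional


def extend_if_likely_to_continue(
    base_ms: int,
    p_continue: float,
    critical_probability: float = 0.5,  # assumed threshold, not from the source
    extension_ms: int = 700,            # assumed extension, not from the source
) -> int:
    """Extend the tail silence interval when another word unit is likely to follow."""
    if not 0.0 <= p_continue <= 1.0:
        raise ValueError("p_continue must be a probability in [0, 1]")
    if p_continue > critical_probability:
        return base_ms + extension_ms
    return base_ms


def speech_has_terminated(
    silence_start_ms: int,
    next_speech_start_ms: Optional[int],
    tail_silence_ms: int,
) -> bool:
    """Decide termination from whether more speech arrived within the interval.

    next_speech_start_ms is None when no further speech was received at all.
    Speech arriving exactly when the interval expires is treated as too late,
    which is a design choice of this sketch.
    """
    if next_speech_start_ms is None:
        return True
    elapsed_ms = next_speech_start_ms - silence_start_ms
    return elapsed_ms >= tail_silence_ms
```

With these assumed values, a base interval of 800 ms becomes 1500 ms when the continuation probability is 0.9, so speech resuming 1000 ms after the silence began would count as a continuation rather than a new utterance.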