Speech processing method, apparatus, device, medium, and product
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING BAIDU NETCOM SCI & TECH CO LTD
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-16
AI Technical Summary
During voice interaction, existing technologies cannot respond quickly and accurately to users' incomplete sentences, especially when the speech contains pauses, which degrades the user experience.
When a pause in speech is detected, the silence duration following the intermediate recognized text is obtained. If it is greater than or equal to a preset threshold, semantic matching is performed. If a target control text that is not a complete sentence is matched, the waiting duration is set to the fast-out duration. If no new speech packet is obtained within that duration, the target semantics are determined from the target control text and the corresponding operation is performed, improving response speed and accuracy.
It enables fast and accurate responses to incomplete sentences during speech pauses, improving the user experience.
Figure CN122224162A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the fields of artificial intelligence and intelligent vehicle technology, and in particular to the fields of intelligent voice interaction and intelligent cockpit, specifically to a voice processing method, device, system, equipment, medium and product.
Background Technology
[0002] During voice interaction, users may pause in their speech. In scenarios where pauses occur, how to quickly and accurately respond to the user's voice data is a problem that needs to be solved.
Summary of the Invention
[0003] This disclosure provides a speech processing method, apparatus, system, device, medium, and product.
[0004] According to one aspect of this disclosure, a speech processing method is provided, comprising: in response to the speech recognition text being intermediate recognized text, obtaining a silence duration corresponding to the intermediate recognized text; in response to the silence duration being greater than or equal to a preset threshold and there being a semantic matching result that matches the intermediate recognized text, the semantic matching result including target control text that is not a complete sentence, determining that the waiting duration is a fast-out duration; in response to the fast-out duration being reached and no new speech packet being obtained within the fast-out duration, determining the target semantics of the intermediate recognized text based on the semantic matching result; and performing a corresponding operation based on the target semantics of the intermediate recognized text; wherein the fast-out duration is less than a slow-out duration, the slow-out duration being the waiting duration used when the silence duration is greater than or equal to the preset threshold, the intermediate recognized text is not a complete sentence, and no target control text exists.
[0005] According to another aspect of this disclosure, a speech processing apparatus is provided, comprising: an acquisition module, configured to acquire a silence duration corresponding to the intermediate recognized text in response to the speech recognition text being intermediate recognized text; a matching module, configured to determine that the waiting duration is a fast-out duration in response to the silence duration being greater than or equal to a preset threshold and there being a semantic matching result that matches the intermediate recognized text, the semantic matching result including target control text that is not a complete sentence; a determination module, configured to, in response to the fast-out duration being reached and no new speech packet being obtained within the fast-out duration, determine the target semantics of the intermediate recognized text based on the semantic matching result; and an execution module, configured to perform a corresponding operation based on the target semantics of the intermediate recognized text; wherein the fast-out duration is less than a slow-out duration, the slow-out duration being the waiting duration used when the silence duration is greater than or equal to the preset threshold, the intermediate recognized text is not a complete sentence, and no target control text exists.
[0006] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to said at least one processor; wherein the memory stores instructions executable by said at least one processor, said instructions being executed by said at least one processor to enable said at least one processor to perform the method as described in any of the foregoing aspects.
[0007] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are configured to cause the computer to perform the method according to any of the preceding aspects.
[0008] According to another aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the method according to any of the preceding aspects.
[0009] According to embodiments of this disclosure, a fast and accurate response can be provided for incomplete sentences.
[0010] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0011] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0012] Figure 1 is a schematic diagram according to the first embodiment of the present disclosure;
[0013] Figure 2 is a schematic diagram according to the second embodiment of the present disclosure;
[0014] Figure 3 is a schematic diagram illustrating the process of obtaining the target semantics of intermediate recognized text according to embodiments of this disclosure;
[0015] Figure 4 is a schematic diagram illustrating the processing procedure for the final recognized text according to embodiments of this disclosure;
[0016] Figure 5 is a schematic diagram according to the third embodiment of the present disclosure;
[0017] Figure 6 is a schematic diagram of an electronic device used to implement the speech processing method of the embodiments of this disclosure.
Detailed Implementation
[0018] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0019] Figure 1 is a schematic diagram according to the first embodiment of the present disclosure, which provides a speech processing method. As shown in Figure 1, the method includes:
[0020] 101. In response to the fact that the speech recognition text is intermediate recognition text, obtain the silence duration corresponding to the intermediate recognition text.
[0021] 102. In response to the silence duration being greater than or equal to a preset threshold and there being a semantic matching result that matches the intermediate recognized text, the semantic matching result including target control text that is not a complete sentence, determine that the waiting duration is the fast-out duration.
[0022] 103. In response to reaching the fast-out duration and not obtaining a new speech packet within the fast-out duration, determine the target semantics of the intermediate recognized text based on the semantic matching result.
[0023] 104. Based on the target semantics of the intermediate identified text, perform the corresponding operation.
[0024] Here, the fast-out duration is less than the slow-out duration, and the slow-out duration is the waiting duration used when the silence duration is greater than or equal to the preset threshold, the intermediate recognized text is an incomplete sentence, and no target control text exists.
[0025] After obtaining the speech data, speech recognition can be performed to obtain the recognized text. For example, Automatic Speech Recognition (ASR) can be performed using speech packets as the basic unit to obtain the recognized text for each speech packet. Each speech packet can contain one or more characters.
[0026] During voice interaction, a Voice Activity Detection (VAD) module can be used to identify whether the speech recognition text is intermediate (partial) recognized text or final recognized text.
[0027] Specifically, during the voice interaction process, the VAD module can detect the voice end point (the start point of silence). If no new voice packets are received within the set truncation duration after the voice end point, all the consecutive recognized texts before the voice end point are taken as the final recognized text. On the contrary, if new voice packets are received within the above truncation duration, all the consecutive recognized texts before the voice end point are taken as the intermediate recognized text.
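The intermediate/final distinction above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the use of None to mean "no further packet arrived", and the 2000 ms truncation value in the usage lines are assumptions, since the text does not give a truncation duration.

```python
def classify_recognized_text(gap_ms, truncation_ms):
    """Classify the text recognized before a voice end point.

    gap_ms: time from the voice end point to the next voice packet, or
            None if no further packet arrived (hypothetical encoding).
    truncation_ms: the set truncation duration (value not given in the text).
    """
    # No new packet within the truncation duration: the text is final.
    if gap_ms is None or gap_ms > truncation_ms:
        return "final"
    # A new packet arrived within the truncation duration: intermediate.
    return "intermediate"


# Usage with an assumed truncation duration of 2000 ms.
print(classify_recognized_text(500, 2000))
print(classify_recognized_text(None, 2000))
```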
[0028] The silence duration corresponding to the intermediate recognized text refers to the silence duration after the intermediate recognized text, that is, the silence duration between the last character of the intermediate recognized text and the first character of the next voice packet.
[0029] After the above silence duration is greater than or equal to a preset threshold (such as 160 milliseconds), it can be considered that a voice pause occurs; otherwise, it is continuous speech.
[0030] For example, suppose the voice data spoken by the user is "How to turn on the air conditioner". If the silence duration between "me" and "da" (the last syllable of "how" and the first syllable of "turn on" in the original utterance) is greater than or equal to the preset threshold, a voice pause is considered to occur between "How" and "turn on the air conditioner".
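The pause test described above reduces to a single comparison. The sketch below assumes timestamps in milliseconds for the end of the last recognized character and the start of the next voice packet; the function and parameter names are hypothetical, and 160 ms is the example threshold from the text.

```python
PAUSE_THRESHOLD_MS = 160  # example preset threshold from the text


def is_voice_pause(last_char_end_ms, next_packet_start_ms,
                   threshold_ms=PAUSE_THRESHOLD_MS):
    """Return True when the silence after the intermediate text is a pause."""
    silence_ms = next_packet_start_ms - last_char_end_ms
    # Silence at or above the threshold is a pause; below it is continuous speech.
    return silence_ms >= threshold_ms
```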
[0031] When the above silence duration is greater than or equal to the preset threshold (voice pause occurs), semantic matching is performed on the intermediate recognized text to determine whether there is a target semantics that matches the intermediate recognized text.
[0032] In the related art, when performing semantic matching on the intermediate recognized text, it can first be detected whether the intermediate recognized text is a complete sentence, for example by using a preset completeness detection model. When the intermediate recognized text is a complete sentence, natural language understanding (NLU) matching is then performed on this complete sentence. For example, during NLU matching, it is determined whether there is a preset instruction text that is semantically consistent with the intermediate recognized text. If so, this semantically consistent preset instruction text is taken as the target instruction text, which is also the target semantics. Then, when the preset fast-out duration (such as 300 milliseconds) is reached and no new voice packets are obtained, the corresponding operation is performed according to this target instruction text.
[0033] When the intermediate recognized text is not a complete sentence, it is necessary to wait for the preset slow-out duration (such as 1500 milliseconds). If no new voice packets are received within this slow-out duration, NLU and other processing are performed on this intermediate recognized text once the slow-out duration is reached. If new voice packets are received within this slow-out duration, the intermediate recognized text is spliced with the new recognized text corresponding to the new voice packets, and NLU and other processing are performed on the spliced text once the slow-out duration is reached.
[0034] As can be seen from the above, if the intermediate identified text is not a complete sentence, semantic matching and subsequent processing will only be performed after the slow-out time has elapsed.
[0035] In Visible To Speech (VTS) scenarios, VTS controls can be displayed to the user, who can then operate these controls by voice. For example, in an in-vehicle scenario, controllable VTS controls are displayed on the vehicle's infotainment screen, and the user can speak the text of the VTS control they want to operate. If a VTS control on the screen is "radio," the system can turn on the radio after the user says "radio." If multiple songs are displayed on the screen and the user speaks the name of a specific song, the corresponding song is played.
[0036] In this scenario, if the intermediate recognized text is the VTS control text mentioned above, such as "radio", sentence completeness detection determines that the intermediate recognized text is not a complete sentence. Therefore, it is necessary to wait for the slow-out duration (such as 1500 milliseconds) before subsequent processing can be carried out, which results in a slow response.
[0037] In this scenario, although "radio" is not a complete sentence, it clearly expresses the user's intention, namely to turn on the radio. If the system has to wait for the slow-out duration before responding, the response is slow, which affects the user experience.
[0038] Therefore, when a speech pause occurs, the intermediate recognized text can be matched against preset control text. Specifically, the preset control text can be VTS control text. If a preset control text that is semantically consistent with the intermediate recognized text exists, it is used as the target control text, and the waiting duration is determined to be the fast-out duration rather than the slow-out duration. After the fast-out duration is reached, if no new speech packet has been obtained, the target semantics can be obtained from the target control text, and the corresponding operation can be performed based on the target semantics. For example, if the target control text is "radio," then after the fast-out duration is reached, the operation to turn on the radio is executed.
[0039] In this embodiment, after a pause in speech, if there is a target control text that is an incomplete sentence and matches the intermediate recognized text, the waiting duration is determined to be the fast-out duration. A response can therefore be initiated as soon as the fast-out duration is reached, even though the intermediate recognized text is not a complete sentence, which improves response speed. Furthermore, by determining the target semantics based on the matched target control text, accurate target semantics can be obtained, and executing the corresponding operation based on these target semantics improves accuracy. Therefore, even for incomplete sentences, a response can still be initiated accurately and quickly.
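The effect on waiting time for an incomplete sentence can be illustrated with a short sketch. It uses the example durations from the text (300 ms and 1500 ms); the function name is hypothetical, and semantic consistency is approximated by exact string membership, which is an assumption made only to keep the example self-contained.

```python
FAST_OUT_MS = 300    # example fast-out duration from the text
SLOW_OUT_MS = 1500   # example slow-out duration from the text


def wait_for_incomplete_text(text, control_texts, control_matching_enabled):
    """Waiting duration for intermediate text that is NOT a complete sentence.

    control_matching_enabled=False models the related art, which always waits
    for the slow-out duration; True models matching against control text.
    """
    # Exact membership stands in for semantic matching (assumption).
    if control_matching_enabled and text in control_texts:
        return FAST_OUT_MS
    return SLOW_OUT_MS


controls = {"radio"}
print(wait_for_incomplete_text("radio", controls, False))  # related art
print(wait_for_incomplete_text("radio", controls, True))   # with control matching
```

Text with no matching control, such as "how", still falls back to the slow-out duration in both cases.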
[0040] In some embodiments, the semantic matching result further includes: target instruction text that matches the intermediate identification text; determining the target semantics based on the semantic matching result includes: selecting one of the target instruction text and the target control text as the target semantics.
[0041] When the silence duration is greater than or equal to the preset threshold, it can be detected whether the intermediate recognized text is a complete sentence. If it is a complete sentence, the target instruction text that matches the intermediate recognized text can be obtained.
[0042] For example, when a pause occurs in speech, a sentence completeness detection model is used to check whether the intermediate recognized text is a complete sentence. If the detection result is a complete sentence, the intermediate recognized text of that complete sentence is matched with a preset instruction text, and the semantically consistent preset instruction text is taken as the target instruction text. For example, if the intermediate recognized text is "turn on the air conditioner", and the preset instruction text contains "turn on the air conditioner", then the target instruction text is "turn on the air conditioner".
[0043] After obtaining the target instruction text, the waiting time is also the fast-out time. At this time, when the fast-out time is reached, the target semantics can be obtained based on the target control text and the target instruction text.
[0044] Then, one of the target instruction text and the target control text can be selected as the target semantics.
[0045] Specifically, the target semantics can be obtained according to the preset priority. For example, if the priority of the control text is set to be higher, the target control text will be used as the target semantics, and then the operation corresponding to the target control text will be executed.
[0046] In this embodiment, by selecting one of the target instruction text and the target control text as the target semantics, the uniqueness of the semantic matching result can be guaranteed, and the accuracy of the response can be improved.
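Selecting one result by preset priority can be sketched as below. The function name, the use of None for "no match", and the tuple-based priority parameter are illustrative assumptions; the default order follows the example in the text, where control text has the higher priority.

```python
def select_target_semantics(instruction_text, control_text,
                            priority=("control", "instruction")):
    """Pick exactly one matching result as the target semantics.

    Either argument may be None when that kind of match does not exist.
    Returns (kind, text), or None when there is no match at all.
    """
    candidates = {"instruction": instruction_text, "control": control_text}
    for kind in priority:
        if candidates[kind] is not None:
            return kind, candidates[kind]
    return None
```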
[0047] Figure 2 is a schematic diagram according to the second embodiment of the present disclosure, which provides a speech processing method. As shown in Figure 2, the method includes:
[0048] 201. Multiple recognition engines are used to perform speech recognition on speech data from multiple sound zones to obtain multiple groups of speech recognition text.
[0049] Taking the in-vehicle scenario as an example, it can be divided into four sound zones, corresponding to the driver's seat, passenger seat, left rear, and right rear areas respectively.
[0050] The recognition engines correspond one-to-one with the sound zones. As shown in Figure 2, the recognition engines are denoted the first to fourth recognition engines. The first recognition engine performs speech recognition on the speech data of the first sound zone to obtain the first group of speech recognition text. The second recognition engine performs speech recognition on the speech data of the second sound zone to obtain the second group of speech recognition text. The remaining sound zones are handled similarly. The recognition engines are independent of each other, so speech recognition can be performed in parallel on the speech data of each sound zone.
[0051] In this embodiment, multiple recognition engines are used to perform speech recognition on speech data from multiple sound zones. Because the engines run independently of each other, recognition can proceed in parallel, which improves processing efficiency.
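One way to run independent per-zone engines in parallel is a thread pool, sketched below with Python's standard library. The recognize function is a hypothetical stand-in for a real recognition engine (here the "audio" is already a transcript string), and the zone names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize(zone, audio):
    # Hypothetical stand-in for a per-zone recognition engine: the input is
    # already a transcript string and is simply returned trimmed.
    return audio.strip()


def recognize_all_zones(audio_by_zone):
    """Run one independent recognition task per sound zone in parallel."""
    workers = max(1, len(audio_by_zone))  # ThreadPoolExecutor needs at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {zone: pool.submit(recognize, zone, audio)
                   for zone, audio in audio_by_zone.items()}
        # One group of recognition text per sound zone.
        return {zone: future.result() for zone, future in futures.items()}
```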
[0052] 202. For each group of speech recognition text, determine whether the group of speech recognition text is intermediate recognition text or final recognition text. If it is intermediate recognition text, execute 203 and subsequent steps; otherwise, execute the processing procedure for final recognition text.
[0053] As shown in Figure 2, a Voice Activity Detection (VAD) module can be used to identify whether each group of speech recognition text is intermediate (partial) recognized text or final recognized text.
[0054] Specifically, during voice interaction, the VAD module can detect the end point of the voice (the starting point of silence). If no new voice packet is received within a set truncation time after the end point of the voice, all continuous recognized text before the end point of the voice is taken as the final recognized text. Conversely, if a new voice packet is received within the aforementioned truncation time, all continuous recognized text before the end point of the voice is taken as the intermediate recognized text.
[0055] 203. Obtain the silence duration corresponding to the intermediate recognized text, and determine whether the silence duration is greater than or equal to a preset threshold. If so, execute steps 204 and 206; otherwise, do not process. "Do not process" means not executing subsequent operations under the corresponding conditions. If a new voice packet is obtained, reprocessing begins from the speech recognition stage for the new voice packet.
[0056] 204. Determine whether the intermediate identified text is a complete sentence. If yes, proceed to step 205; otherwise, proceed to step 208.
[0057] 205. Determine whether there is a target instruction text that matches the intermediate recognized text (denoted NLU matching). If yes, execute step 207; otherwise, do not process.
[0058] 206. Determine whether there is a target control text that matches the intermediate recognized text (denoted VTS matching). If yes, execute step 207; otherwise, do not process.
[0059] 207. Determine that the waiting duration is the fast-out duration.
[0060] 208. Determine that the waiting duration is the slow-out duration.
[0061] Both the fast-out duration and the slow-out duration are set values, and the fast-out duration is less than the slow-out duration. For example, the fast-out duration is 300 milliseconds and the slow-out duration is 1500 milliseconds.
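Steps 203 to 208 can be summarized in a short decision function. This is a sketch under stated assumptions: the matching results are passed in as already-computed values (None meaning no match), the function and key names are hypothetical, and the fast-out duration takes precedence when an incomplete sentence has a matching control text, as the definition of the slow-out duration implies.

```python
def plan_wait(silence_ms, is_complete, nlu_match, vts_match,
              threshold_ms=160, fast_ms=300, slow_ms=1500):
    """Return (waiting duration in ms, matches), or None for "do not process"."""
    if silence_ms < threshold_ms:
        return None                      # 203: continuous speech
    matches = {}
    if vts_match is not None:
        matches["vts"] = vts_match       # 206: VTS matching succeeded
    if is_complete and nlu_match is not None:
        matches["nlu"] = nlu_match       # 204-205: NLU matching succeeded
    if matches:
        return fast_ms, matches          # 207: fast-out duration
    if not is_complete:
        return slow_ms, {}               # 208: slow-out duration
    return None                          # complete sentence, nothing matched
```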
[0062] 209. In response to the arrival of the waiting time and the absence of a new speech packet within the waiting time, the target semantics of the intermediate recognized text is obtained.
[0063] For example, if the waiting time is the fast-out duration, then when the semantic matching result (target instruction text and / or target control text) is obtained, a timer can be started. When the timer's duration reaches the fast-out duration (e.g., 300 milliseconds) and no new voice packet is obtained, the target semantics can be obtained based on the target instruction text and / or target control text.
[0064] If the semantic matching result is of only one type, that is, only the target instruction text or only the target control text is obtained, then that semantic matching result is used as the target semantics. If the semantic matching result includes multiple types, one of them needs to be selected as the target semantics. Specifically, the selection can be based on priority. For example, the priority of the control text can be set higher than that of the instruction text. If both the target instruction text and the target control text are obtained, the target control text is used as the target semantics, and the operation corresponding to the target control text is then executed.
[0065] That is, if the target control text is VTS control text, the VTS control text is used as the target semantics.
[0066] In this embodiment, prioritizing the execution of operations corresponding to VTS controls can better meet the actual needs of users and improve the user experience.
[0067] If the waiting time is a slow-out duration, NLU matching can be performed on the intermediate recognized text or the concatenated text when the slow-out duration is reached. If a matching target instruction text exists, the corresponding operation is executed based on the target instruction text. For example, if the intermediate recognized text is "how", which is an incomplete sentence and does not have a matching VTS control, then wait for the slow-out duration. If new recognized text, such as "turn on the air conditioner", is obtained within the slow-out duration, it is concatenated to obtain the concatenated text "how to turn on the air conditioner". After the slow-out duration (e.g., 1500 milliseconds) is reached, semantic understanding and processing are performed on "how to turn on the air conditioner".
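The wait in step 209 can be sketched with a standard-library event that is set whenever a new voice packet arrives. The function name and the event-based signalling are assumptions for illustration; the text only states that a timer is started and checked against the waiting duration.

```python
import threading


def silence_elapsed(new_packet_event, wait_ms):
    """Block for up to wait_ms; return True if no new voice packet arrived.

    new_packet_event is expected to be set by the audio pipeline when a new
    packet is received (hypothetical wiring).
    """
    # Event.wait returns True if the event was set before the timeout.
    return not new_packet_event.wait(timeout=wait_ms / 1000.0)
```

With the example durations from the text, a caller would pass 300 for the fast-out case and 1500 for the slow-out case, and proceed to determine the target semantics only when the function returns True.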
[0068] 210. Based on the target semantics of the intermediate identified text, perform the corresponding operation.
[0069] For example, if the target semantics is "radio", then when the fast-out duration is reached, the operation of turning on the radio is executed.
[0070] Speech recognition can be divided into online speech recognition and offline speech recognition. The semantic matching process can also be divided into online matching and offline matching. The fast-out durations for online matching and offline matching can be different. For example, the fast-out duration for offline matching is longer than that for online matching.
[0071] Based on this, the semantic matching results include: online matching results and offline matching results;
[0072] The fast-out duration includes: a first duration and a second duration, wherein the first duration is less than the second duration;
[0073] Determining the target semantics based on the semantic matching result, in response to the waiting duration being reached and no new voice packet being obtained within the waiting duration, includes:
[0074] In response to the first duration being reached and no new voice packet being obtained within the first duration, the online matching result is output to the dispatcher;
[0075] In response to the second duration being reached and no new voice packet being obtained within the second duration, the offline matching result is output to the dispatcher;
[0076] The dispatcher is used to select one of the online matching result and the offline matching result as the target semantics.
[0077] For example, refer to Figure 3, which is a schematic diagram illustrating the process of obtaining the target semantics of intermediate recognized text according to embodiments of this disclosure.
[0078] As shown in Figure 3, when the silence duration of the intermediate recognized text is greater than or equal to a preset threshold, NLU matching and VTS matching can be performed. Before performing NLU matching, the sentence completeness of the intermediate recognized text can be detected. If the intermediate recognized text is a complete sentence, NLU matching is performed. NLU matching includes online NLU matching and offline NLU matching. The fast-out duration corresponding to online NLU matching is the first duration, such as 300 milliseconds (300 ms), and the fast-out duration corresponding to offline NLU matching is the second duration, such as 500 milliseconds (500 ms).
[0079] VTS matching includes online VTS matching and offline VTS matching. The fast-out duration corresponding to online VTS matching is the first duration, such as 300 milliseconds (300 ms), and the fast-out duration corresponding to offline VTS matching is the second duration, such as 500 milliseconds (500 ms).
[0080] Based on this, if an online NLU matching result exists that matches the intermediate recognized text, i.e., a preset online instruction text with consistent semantics exists, a timer is started when the online NLU matching result is obtained, and the online NLU matching result is output to the dispatcher when the timer reaches 300 ms. Similarly, the offline NLU matching result is output to the dispatcher when the timer reaches 500 ms, the online VTS matching result when the timer reaches 300 ms, and the offline VTS matching result when the timer reaches 500 ms. The dispatcher selects one of these as the target semantics of the intermediate recognized text.
[0081] Specifically, online matching results and VTS matching results are given priority. For example, if all four types of matching results exist, the online VTS matching result is used as the target semantics.
[0082] In this embodiment, the target semantics are obtained based on online matching results and offline matching results, and the corresponding fast-out durations are different, so the target semantics can be obtained accurately and in a timely manner.
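The staged release of results to the dispatcher can be sketched as follows, using the example durations of 300 ms and 500 ms. The names are hypothetical. The text states only that online results and VTS results are prioritized and that online VTS wins when all four exist; the relative order of online NLU and offline VTS below is an assumption.

```python
# Fast-out deadline per result type (example values from the text).
DEADLINE_MS = {"online_vts": 300, "online_nlu": 300,
               "offline_vts": 500, "offline_nlu": 500}

# Assumed full ordering; only "online_vts first" is given by the text.
PRIORITY = ["online_vts", "online_nlu", "offline_vts", "offline_nlu"]


def released(results, elapsed_ms):
    """Results whose fast-out deadline has passed and that reach the dispatcher."""
    return {kind: text for kind, text in results.items()
            if elapsed_ms >= DEADLINE_MS[kind]}


def dispatch(candidates):
    """Select one released result as the target semantics, by priority."""
    for kind in PRIORITY:
        if kind in candidates:
            return kind, candidates[kind]
    return None
```

This sketch assumes no new voice packet arrives during the wait; in the described method a new packet within the relevant duration prevents the result from being output.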
[0083] For the final recognized text, a corresponding processing procedure can also be performed.
[0084] That is, the method may also include:
[0085] In response to the fact that the speech recognition text is the final recognized text, NLU matching and VTS matching are performed on the final recognized text to obtain the matching result of the final recognized text;
[0086] Based on the matching result of the final recognized text, the target semantics of the final recognized text are determined;
[0087] Based on the target semantics of the final recognized text, the corresponding operation is performed.
[0088] For example, refer to Figure 4, which is a schematic diagram illustrating the processing procedure for the final recognized text according to embodiments of this disclosure. As shown in Figure 4, NLU matching and VTS matching are performed on the final recognized text. When corresponding matching results exist, the dispatcher selects one of them as the target semantics of the final recognized text, and the corresponding operation is performed.
[0089] NLU matching refers to matching the final recognized text with preset instruction text, while VTS matching refers to matching the final recognized text with VTS control text. As with the intermediate recognized text, the matching can also be divided into online matching and offline matching.
[0090] The selection can be made according to priority, for example by prioritizing the VTS matching result as the target semantics.
[0091] In this embodiment, by performing NLU matching and VTS matching on the final identified text, a timely response can be made when the final identified text is VTS control text, or a valid instruction can be identified and executed when it matches a preset instruction text, thereby improving response speed and accuracy.
[0092] Figure 5 is a schematic diagram according to the third embodiment of the present disclosure, which provides a voice processing device. As shown in Figure 5, the device 500 includes: an acquisition module 501, a matching module 502, a determination module 503, and an execution module 504.
[0093] The acquisition module 501 is used to acquire the silence duration corresponding to the intermediate recognized text in response to the speech recognition text being intermediate recognized text; the matching module 502 is used to determine that the waiting duration is a fast-out duration in response to the silence duration being greater than or equal to a preset threshold and there being a semantic matching result that matches the intermediate recognized text, the semantic matching result including target control text that is not a complete sentence; the determination module 503 is used to determine the target semantics of the intermediate recognized text based on the semantic matching result in response to the fast-out duration being reached and no new speech packet being obtained within the fast-out duration; the execution module 504 is used to execute the corresponding operation based on the target semantics of the intermediate recognized text.
[0094] Here, the fast-out duration is less than the slow-out duration, and the slow-out duration is the waiting duration used when the silence duration is greater than or equal to the preset threshold, the intermediate recognized text is an incomplete sentence, and no target control text exists.
[0095] In this embodiment, after a pause in speech, if there is a target control text that is an incomplete sentence and matches the intermediate recognized text, the waiting duration is determined to be the fast-out duration. A response can therefore be initiated as soon as the fast-out duration is reached, even though the intermediate recognized text is not a complete sentence, which improves response speed. Furthermore, by determining the target semantics based on the matched target control text, accurate target semantics can be obtained, and executing the corresponding operation based on these target semantics improves accuracy. Therefore, even for incomplete sentences, a response can still be initiated accurately and quickly.
[0096] In some embodiments, the semantic matching result further includes: target instruction text matching the intermediate recognition text; the determination module 503 is further configured to:
[0097] Select one of the target instruction text and the target control text as the target semantics.
[0098] In this embodiment, target instruction text can be obtained when the intermediate recognition text is a complete sentence, so instruction matching is also performed, which makes the response more comprehensive.
[0099] In some embodiments, the determination module 503 is further configured to:
[0100] If the target control text is VTS control text, then the VTS control text is used as the target semantics.
[0101] In this embodiment, prioritizing the execution of operations corresponding to VTS controls can better meet the actual needs of users and improve the user experience.
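As a non-limiting illustration, the selection between the target instruction text and the target control text can be sketched as follows. The `Candidate` type and the `is_vts` flag are hypothetical names introduced for this sketch. The fallback used when the control text is not VTS control text is an assumption, since the disclosure only requires that one of the two be selected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    is_vts: bool = False  # hypothetical flag: True for VTS control text


def select_target_semantics(instruction: Optional[Candidate],
                            control: Optional[Candidate]) -> Optional[Candidate]:
    """Select one candidate as the target semantics."""
    # Stated in the disclosure: VTS control text is used as the target semantics.
    if control is not None and control.is_vts:
        return control
    # Assumption for this sketch: otherwise prefer the instruction text if present.
    return instruction if instruction is not None else control
```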
[0102] In some embodiments, the semantic matching result includes: an online matching result and an offline matching result; the fast-out duration includes: a first duration and a second duration, wherein the first duration is less than the second duration; the determination module 503 is further configured to:
[0103] In response to the first duration being reached and no new speech packet being obtained within the first duration, output the online matching result to the dispatcher;
[0104] In response to the second duration being reached and no new speech packet being obtained within the second duration, output the offline matching result to the dispatcher;
[0105] The dispatcher is used to select one of the online matching result and the offline matching result as the target semantics.
[0106] In this embodiment, the target semantics are obtained based on the online matching result and the offline matching result, each with its own fast-out duration, so the target semantics can be obtained accurately and in a timely manner.
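As a non-limiting illustration, the two-stage output to the dispatcher can be sketched as a timeline check rather than with real timers. All names are hypothetical, and the dispatcher's preference for the online matching result is an assumption of this sketch; the disclosure states only that the dispatcher selects one of the two results.

```python
from typing import List, Optional, Tuple


def results_sent_to_dispatcher(first_ms: int, second_ms: int,
                               new_packet_at_ms: Optional[int],
                               online: Optional[str],
                               offline: Optional[str]) -> List[Tuple[str, str]]:
    """Return the (source, result) pairs that reach the dispatcher."""
    if not first_ms < second_ms:
        raise ValueError("the first duration must be less than the second duration")

    def quiet_until(deadline_ms: int) -> bool:
        # True when no new speech packet arrived within the given duration.
        return new_packet_at_ms is None or new_packet_at_ms > deadline_ms

    sent = []
    if online is not None and quiet_until(first_ms):
        sent.append(("online", online))
    if offline is not None and quiet_until(second_ms):
        sent.append(("offline", offline))
    return sent


def dispatch(sent: List[Tuple[str, str]]) -> Optional[str]:
    """Select one result as the target semantics (online first: an assumption)."""
    for preferred in ("online", "offline"):
        for source, result in sent:
            if source == preferred:
                return result
    return None
```

With these assumptions, a speech packet arriving between the first and second durations lets the online matching result through while withholding the offline matching result.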
[0107] In some embodiments, the device 500 further includes:
[0108] The response module is configured to, in response to the speech recognition text being final recognition text, perform NLU matching and VTS matching on the final recognition text to obtain the matching result of the final recognition text; determine the target semantics of the final recognition text based on the matching result of the final recognition text; and perform the corresponding operation based on the target semantics of the final recognition text.
[0109] In this embodiment, by performing VTS matching on the final recognition text, a timely response can be made when the final recognition text is VTS control text, thereby improving response speed and accuracy.
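As a non-limiting illustration, handling of the final recognition text can be sketched with two lookup tables standing in for NLU matching and VTS matching. The tables, the `execute` callback, and the rule that a VTS match takes priority over an NLU match are all assumptions of this sketch and are not specified for the final recognition text in the disclosure.

```python
from typing import Callable, Dict, Optional


def handle_final_text(text: str,
                      nlu_intents: Dict[str, str],
                      vts_controls: Dict[str, str],
                      execute: Callable[[str], None]) -> Optional[str]:
    """Match final recognition text, pick target semantics, and execute them."""
    vts_hit = vts_controls.get(text)  # stand-in for VTS matching
    nlu_hit = nlu_intents.get(text)   # stand-in for NLU matching
    # Assumption: a VTS match takes priority over an NLU match.
    target = vts_hit if vts_hit is not None else nlu_hit
    if target is not None:
        execute(target)
    return target
```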
[0110] In some embodiments, the device 500 further includes:
[0111] The recognition module is used to employ multiple recognition engines to perform speech recognition on speech data from multiple audio regions to obtain multiple sets of the speech recognition text.
[0112] In this embodiment, multiple recognition engines are used to perform speech recognition on speech data from multiple audio regions. The recognition engines operate independently of one another, which increases parallelism and thus improves processing efficiency.
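As a non-limiting illustration, running one recognition engine per audio region in parallel can be sketched with a thread pool. The engine here is a toy stand-in that merely decodes bytes; `make_toy_engine` and `recognize_all` are hypothetical names, and a real system would call an actual speech recognizer instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Engine = Callable[[bytes], str]


def make_toy_engine(region: str) -> Engine:
    """Create an independent stand-in engine for one audio region."""
    def recognize(audio: bytes) -> str:
        return audio.decode("utf-8")  # placeholder for real speech recognition
    return recognize


def recognize_all(audio_by_region: Dict[str, bytes],
                  engine_factory: Callable[[str], Engine]) -> Dict[str, str]:
    """Recognize each region's audio with its own engine, in parallel."""
    workers = max(1, len(audio_by_region))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            region: pool.submit(engine_factory(region), audio)
            for region, audio in audio_by_region.items()
        }
        # One set of speech recognition text per audio region.
        return {region: future.result() for region, future in futures.items()}
```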
[0113] It is understood that the same or similar content in different embodiments of this disclosure can be referred to each other.
[0114] It is understood that the terms "first" and "second" in the embodiments of this disclosure are only used for distinction and do not indicate the degree of importance or the order of events.
[0115] It is understood that, unless otherwise specified, the order in which steps are presented in a process does not limit the temporal relationship between those steps.
[0116] In the technical solutions of this disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of information such as user personal information all comply with relevant laws and regulations and do not violate public order and good morals.
[0117] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0118] Figure 6 shows a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. The electronic device 600 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0119] As shown in Figure 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the electronic device 600. The computing unit 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
[0120] Multiple components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, such as a keyboard or a mouse; an output unit 607, such as various types of displays and speakers; a storage unit 608, such as a magnetic disk or an optical disk; and a communication unit 609, such as a network card, a modem, or a wireless transceiver. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0121] The computing unit 601 can be any of a variety of general-purpose and/or dedicated processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the speech processing method. For example, in some embodiments, the speech processing method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the speech processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the speech processing method by any other suitable means (e.g., by means of firmware).
[0122] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0123] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable task processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0124] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0125] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0126] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0127] Computer systems can include clients and servers. Clients and servers are generally geographically separated and typically interact via communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. A server can be a cloud server, also known as a cloud computing server or cloud host, a hosting product within the cloud computing service system that addresses the shortcomings of traditional physical hosts and VPS (Virtual Private Server) services, such as high management difficulty and weak business scalability. Servers can also be servers for distributed systems or servers incorporating blockchain technology.
[0128] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed herein.
[0129] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A speech processing method, comprising: in response to speech recognition text being intermediate recognition text, obtaining a silence duration corresponding to the intermediate recognition text; in response to the silence duration being greater than or equal to a preset threshold and a semantic matching result matching the intermediate recognition text existing, the semantic matching result including target control text that is not a complete sentence, determining that a waiting duration is a fast-out duration; in response to the fast-out duration being reached and no new speech packet being obtained within the fast-out duration, determining target semantics of the intermediate recognition text based on the semantic matching result; and performing a corresponding operation based on the target semantics of the intermediate recognition text; wherein the fast-out duration is less than a slow-out duration, and the slow-out duration is the waiting duration used when the silence duration is greater than or equal to the preset threshold, the intermediate recognition text is not a complete sentence, and the target control text does not exist.
2. The method according to claim 1, wherein the semantic matching result further includes target instruction text matching the intermediate recognition text; and determining the target semantics of the intermediate recognition text based on the semantic matching result comprises: selecting one of the target instruction text and the target control text as the target semantics.
3. The method according to claim 2, wherein selecting one of the target instruction text and the target control text as the target semantics comprises: if the target control text is VTS control text, using the VTS control text as the target semantics.
4. The method according to claim 1, wherein the semantic matching result includes an online matching result and an offline matching result; the fast-out duration includes a first duration and a second duration, the first duration being less than the second duration; and determining, in response to the waiting duration being reached and no new speech packet being obtained within the waiting duration, the target semantics of the intermediate recognition text based on the semantic matching result comprises: in response to the first duration being reached and no new speech packet being obtained within the first duration, outputting the online matching result to a dispatcher; and in response to the second duration being reached and no new speech packet being obtained within the second duration, outputting the offline matching result to the dispatcher; wherein the dispatcher is configured to select one of the online matching result and the offline matching result as the target semantics.
5. The method according to claim 1, further comprising: in response to the speech recognition text being final recognition text, performing NLU matching and VTS matching on the final recognition text to obtain a matching result of the final recognition text; determining target semantics of the final recognition text based on the matching result of the final recognition text; and performing a corresponding operation based on the target semantics of the final recognition text.
6. The method according to any one of claims 1-5, further comprising: performing speech recognition on speech data from multiple audio regions using multiple recognition engines, to obtain multiple sets of the speech recognition text.
7. A speech processing device, comprising: an acquisition module configured to obtain, in response to speech recognition text being intermediate recognition text, a silence duration corresponding to the intermediate recognition text; a matching module configured to determine that a waiting duration is a fast-out duration, in response to the silence duration being greater than or equal to a preset threshold and a semantic matching result matching the intermediate recognition text existing, the semantic matching result including target control text that is not a complete sentence; a determination module configured to determine target semantics of the intermediate recognition text based on the semantic matching result, in response to the fast-out duration being reached and no new speech packet being obtained within the fast-out duration; and an execution module configured to perform a corresponding operation based on the target semantics of the intermediate recognition text; wherein the fast-out duration is less than a slow-out duration, and the slow-out duration is the waiting duration used when the silence duration is greater than or equal to the preset threshold, the intermediate recognition text is not a complete sentence, and the target control text does not exist.
8. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-6.
9. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-6.
10. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-6.