Method and apparatus for combining transcription and punctuation

By using a pre-trained sentence type classification model for parallel processing, the shortcomings of traditional speech recognition devices in punctuation recognition and sentence type processing are addressed, achieving more efficient and accurate speech recognition and response generation, thus improving user experience and system performance.

CN122224176A · Pending · Publication Date: 2026-06-16 · HYUNDAI MOTOR CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HYUNDAI MOTOR CO LTD
Filing Date: 2025-06-05
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Traditional speech recognition devices cannot accurately recognize punctuation marks, leading to misinterpretation of speech intent, which affects the response quality of generative AI and user safety. Furthermore, they cannot accurately insert punctuation marks when processing interrogative or exclamatory sentences, and their sequential post-processing results in computational latency and low processor resource utilization efficiency.

Method used

By employing a pre-trained sentence type classification model and processing the speech recognizer and the sentence type classification model in parallel, the system accurately identifies sentence types and automatically inserts appropriate punctuation marks, thereby expanding the recognition capability of special symbols and improving processing speed and resource utilization efficiency.

Benefits of technology

It improves the accuracy and processing speed of speech recognition devices, ensures the accurate transmission of user intent, reduces system development costs, and enhances the response quality and user convenience of generative AI.

✦ Generated by Eureka AI based on patent content.


Abstract

A method and apparatus for combining transcribed utterances and punctuation is provided. Such a method of combining transcribed utterances and punctuation using a pre-trained sentence type classification model includes receiving, from a microphone in a vehicle, a speech signal representing an input utterance captured by the microphone. The method also includes converting, using at least one speech recognizer, the speech signal or a spectrogram generated based on the speech signal to a sentence. The method further includes classifying, using a pre-trained sentence type classification model, the sentence to a sentence type corresponding to the speech signal or the spectrogram. The method also includes inserting punctuation in the sentence based on the sentence type. The method further includes generating a text-based combination result based on a combination of the sentence and the inserted punctuation.

Description

[0001] Cross-references to related applications

[0002] This application claims the benefit and priority of Korean Patent Application No. 10-2024-0186775, filed on December 16, 2024, with the Korean Intellectual Property Office, the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This invention relates to methods and apparatus for combining transcribed utterances and punctuation. More specifically, this invention relates to methods and apparatus for combining transcribed utterances and punctuation using a pre-trained sentence type classification model.

Background Technology

[0004] The content described in this section is only to provide background information related to the embodiments of the present invention and does not constitute prior art.

[0005] With the development of speech recognition programs and devices, conversational applications and devices based on generative artificial intelligence (AI) can interact with users across various services. These applications and devices convert user utterances into text-based sentences and analyze them to generate appropriate responses. Speech recognition devices are primarily used to process command statements. However, with the further development of generative AI, the number of cases involving interrogative sentences is increasing. As the number of interrogative sentences increases, a technical problem arises where speech recognition devices cannot distinguish whether generative AI should respond to a command or an interrogative sentence based on the user's utterance. This technical problem arises because traditional speech recognition devices cannot insert appropriate punctuation suitable for the type of sentence converted from the utterance. In other words, there is a technical problem where traditional speech recognition devices simply classify user utterances as declarative sentences.

[0006] To address the challenge of recognizing punctuation, traditional speech recognition devices offer two methods. One method involves the user directly speaking the punctuation mark, such as a question mark or exclamation mark. However, in the case of declarative sentences, the automatic insertion of a period at the end of the sentence is inconvenient for the user. The other method utilizes post-processing to insert punctuation, that is, using natural language processing (NLP) to identify the context of the sentence being recognized as a declarative sentence.

[0007] In the first method, the natural flow of the conversation may be disrupted because the user needs to directly pronounce punctuation marks. Furthermore, there is a technical problem that traditional speech recognition devices cannot perform accurate sentence processing when punctuation marks are omitted. In the second method, the accuracy of context recognition decreases in complex sentence structures or irregular utterances. Inaccurate context recognition by traditional speech recognition devices can lead to the insertion of incorrect punctuation marks or the omission of punctuation marks.

[0008] In particular, traditional speech recognition devices have several technical problems. First, traditional speech recognition devices often fail to accurately identify user intent. In cases of questions or exclamations, the intent of the utterance is not accurately identified, leading to misidentification. There are instances where the utterance is simply recognized as a statement and a response result different from the user's intent is generated. This leads to technical problems because the user's request is not accurately conveyed or properly processed. In particular, inaccurate responses to utterances related to vehicle status reduce user safety and satisfaction. For example, traditional speech recognition devices cannot provide accurate responses to utterances related to vehicle status (such as refueling, tire pressure, washer fluid, urea solution, and car wash).

[0009] Second, there is a technical problem with traditional speech recognition devices failing to accurately recognize punctuation. Because traditional speech recognition devices treat sentences as declarative statements when converting utterances into text-based sentences, they often fail to insert appropriate punctuation. For example, punctuation marks such as question marks or exclamation marks are omitted. This omission reduces the quality and reliability of responses using generative AI.

[0010] Third, traditional speech recognition devices cannot provide accurate data as input to the subsequent engine that receives the speech recognition results. If the speech recognition result does not match the user's actual speech, the incorrectly recognized result is transmitted to the natural language understanding (NLU) engine, which leads to errors in conveying the meaning of the speech. As a result, the naturalness and accuracy of text-to-speech (TTS) output also decrease.

[0011] Fourth, traditional speech recognition devices cannot recognize special symbols other than punctuation marks, such as ellipses (...) or tildes (~).

[0012] Fifth, speech recognition models face technical challenges in terms of scalability and cost. Adding features like punctuation recognition or improving performance requires modifying the speech-to-text (STT) engine or building massive amounts of training data. This leads to problems such as excessively high development costs and long development times.

[0013] Sixth, traditional speech recognition devices suffer from latency issues during inference. Traditional speech recognition devices perform subsequent processing based on the speech recognition results. This leads to computational latency and inefficient use of processor resources. These problems become even more severe when using large-scale models.

Summary of the Invention

[0014] The purpose of this invention is to provide a speech recognition device and method for classifying the types of speech received from a user and automatically combining punctuation marks suitable for the types of speech received from the user.

[0015] Embodiments of the present invention address a problem specific to the field of speech recognition technology by providing a speech recognition device and method that automatically incorporates punctuation marks (such as exclamation marks or question marks) into sentences when the utterance type is an exclamation or interrogative sentence. In particular, embodiments of the present invention provide a method and device for combining transcribed utterances and punctuation marks, which accurately recognizes sentences derived from user utterances and accurately processes user requests based on sentence type.

[0016] Embodiments of the present invention also provide a method and apparatus for classifying sentences derived from user utterances into declarative, interrogative, or exclamatory sentences using a pre-trained sentence type classification model, and combining punctuation marks appropriate to the sentence type with the sentences.

[0017] Embodiments of the present invention also provide a method and apparatus for combining transcribed utterances and punctuation marks, which provides accurate data to an NLU or TTS engine that receives sentences transcribed from user utterances, by combining each sentence with punctuation marks suited to its sentence type.

[0018] Embodiments of the present invention also provide a method and apparatus for combining transcribed utterances and punctuation marks, which, by classifying sentence types and combining appropriate punctuation marks, extends recognition to special symbols other than punctuation marks that suit the utterance context.

[0019] Embodiments of the present invention also provide a method and apparatus for combining transcribed speech and punctuation, which adds new functionality by using a speech recognizer and a pre-trained sentence type classification model in parallel without the cost of changing the speech recognizer or building large-scale training data.

[0020] Furthermore, embodiments of the present invention provide a method and apparatus for combining transcribed utterances and punctuation marks, which improves processing speed during inference by using a speech recognizer and a pre-trained sentence type classification model in parallel, solves the technical problem of computational latency, and effectively utilizes processor resources.

[0021] Therefore, the embodiments of the present invention provide specific and practical applications that improve existing speech recognition devices and provide additional functions that were not previously available.

[0022] The objectives of this invention are not limited to those described above, and other objectives not explicitly mentioned should be apparent to those skilled in the art from the following description.

[0023] According to one aspect of the invention, a method is provided for combining transcribed utterances and punctuation using a pre-trained sentence type classification model. The method includes receiving a speech signal from a microphone in a vehicle, the speech signal representing input utterance captured by the microphone. The method further includes converting the speech signal or a spectrogram generated based on the speech signal into a sentence using at least one speech recognizer. The sentence represents a transcription of the input utterance into text. The method further includes classifying the sentence into a sentence type corresponding to the speech signal or spectrogram using the pre-trained sentence type classification model. The method further includes inserting punctuation marks into the sentence based on the sentence type. The method further includes generating a text-based combined result based on the combination of the sentence and the inserted punctuation marks.
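The last two steps of this method (inserting punctuation based on the sentence type, then generating the combined result) can be illustrated with a minimal Python sketch. The names `SentenceType`, `PUNCTUATION`, and `combine` are hypothetical and chosen only for illustration; the description does not specify an implementation, and the sentence type is assumed to be supplied by a separate classifier.

```python
from enum import Enum


class SentenceType(Enum):
    """Sentence types named in the description (hypothetical enum)."""
    DECLARATIVE = "declarative"
    INTERROGATIVE = "interrogative"
    EXCLAMATORY = "exclamatory"


# Assumed mapping from sentence type to the terminal punctuation mark.
PUNCTUATION = {
    SentenceType.DECLARATIVE: ".",
    SentenceType.INTERROGATIVE: "?",
    SentenceType.EXCLAMATORY: "!",
}


def combine(sentence: str, sentence_type: SentenceType) -> str:
    """Return the transcribed sentence with the mark for its type appended."""
    # Drop surrounding whitespace and any terminal mark the recognizer may
    # have emitted, so the classifier's decision determines the final mark.
    stripped = sentence.strip().rstrip(".?!")
    return stripped + PUNCTUATION[sentence_type]
```

For example, `combine("When should I change the engine oil", SentenceType.INTERROGATIVE)` yields the sentence ending in a question mark, which is the text-based combined result passed downstream.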

[0024] According to another aspect of the invention, an apparatus is provided for combining transcribed utterances and punctuation using a pre-trained sentence type classification model. The apparatus includes a memory configured to store computer-executable instructions. The apparatus also includes at least one processor configured to execute the computer-executable instructions to receive a speech signal from a microphone in a vehicle. The speech signal represents input utterance captured by the microphone. The at least one processor is further configured to convert the speech signal or a spectrogram generated based on the speech signal into a sentence using at least one speech recognizer. The sentence represents the transcription of the input utterance into text. The at least one processor is further configured to classify the sentence into a sentence type corresponding to the speech signal or spectrogram using the pre-trained sentence type classification model, and to insert punctuation marks into the sentence based on the sentence type. The at least one processor is further configured to generate a text-based combined result based on the combination of the sentence and the inserted punctuation marks.

[0025] The embodiments of the present invention improve the accuracy of natural sentence generation and processing based on user utterances by accurately identifying and processing user intent based on sentence type.

[0026] The embodiments of the present invention use a pre-trained sentence type classification model to classify sentences into declarative sentences, interrogative sentences, or exclamatory sentences, and automatically apply appropriate punctuation marks to the sentences.

[0027] The embodiments of the present invention transmit speech recognition data that matches the user's speech intent as input to a subsequent processing engine and output a response suitable for the user's request.

[0028] Embodiments of the present invention provide a highly scalable speech recognition device and method capable of recognizing special symbols other than punctuation marks based on context and combining them with sentences.

[0029] The embodiments of the present invention reduce system development costs and improve efficiency by using a speech recognizer and a pre-trained sentence type classification model in parallel without replacing the speech recognizer or building new training data.

[0030] The embodiments of the present invention improve processing speed and effectively allocate processor resources by using a speech recognizer and a pre-trained sentence type classification model in parallel.

[0031] The effects of this invention are not limited to those described above. Other effects not mentioned will be clearly understood by those skilled in the art based on the following description.

Attached Figure Description

[0032] Figure 1 is a functional block diagram of a speech recognition device according to an embodiment of the present invention.

[0033] Figure 2 is a diagram schematically illustrating the relationship between a vehicle and a speech recognition device according to an embodiment of the present invention.

[0034] Figure 3 is a diagram illustrating a speech recognition module according to an embodiment of the present invention.

[0035] Figure 4A is a diagram illustrating traditional methods for recognizing user utterances.

[0036] Figure 4B is a diagram illustrating a method for combining transcribed speech and punctuation marks according to an embodiment of the present invention.

[0037] Figure 5 is a flowchart illustrating a method for combining transcribed speech and punctuation marks according to an embodiment of the present invention.

[0038] Figure 6 is a configuration diagram of a speech recognition device according to an embodiment of the present invention.

Detailed Implementation

[0039] The various embodiments of the present invention will now be described in detail with reference to the accompanying drawings. Throughout the drawings, the same reference numerals are used to denote the same or equivalent elements, even if these elements are shown in different drawings. Furthermore, in the following description of the various embodiments, detailed descriptions of well-known functions and configurations incorporated herein are omitted for clarity and brevity.

[0040] Furthermore, terms such as first, second, A, B, (a), (b), etc., are used only to distinguish one component from another and do not imply or suggest the type, order, or sequence of the components. Throughout the specification, when a part is said to “comprise” or “include” a component, this should be understood to mean that, unless expressly stated otherwise, the part may also include other components.

[0041] When the components, devices, elements, parts, units, modules, etc. of the present invention are described as having a purpose or performing an operation or function, the components, devices, or elements herein shall be regarded as being "configured" to satisfy that purpose or perform that operation or function. Each "part," "unit," "module," "component," "device," "element," etc. may be embodied individually or included in a processor and memory (such as a non-transitory computer-readable medium) as part of a device.

[0042] The following detailed description, together with the accompanying drawings, is intended to describe various embodiments of the invention and is not intended to limit the scope of the invention to the embodiments described herein.

[0043] In the following text, a speech recognition device refers to a device that combines transcribed utterances and punctuation.

[0044] In the following text, "user" refers to the driver or passenger of the vehicle.

[0045] Figure 1 is a functional block diagram illustrating a speech recognition device according to an embodiment of the present invention.

[0046] As shown in Figure 1, the speech recognition device 100 according to an embodiment of the present invention may include some or all of the speech recognition module 110, the natural language understanding module 120, and the response generation module 130.

[0047] Speech recognition device 100 recognizes and understands user speech and provides responses corresponding to the user's speech. Speech recognition device 100 includes a speech recognition module 110, which transcribes input speech (e.g., user speech) and voice commands into transcribed text and classifies the type of the transcribed text. Speech recognition device 100 also includes a natural language understanding module 120 for determining the user's speech intent. Speech recognition device 100 also includes a response generation module 130, which generates a response corresponding to the user's speech intent, i.e., a text-based response output. Speech recognition module 110 may be referred to as an Automatic Speech Recognition (ASR) engine. Speech recognition module 110 may include at least one speech recognizer and a pre-trained sentence type classification model. The speech recognizer may be referred to as a speech-to-text (STT) engine. Speech recognition module 110 can use the speech recognizer and the pre-trained sentence type classification model to perform speech recognition and sentence type classification simultaneously, thereby improving processing speed. Since these two operations are performed independently, processor resources such as CPUs and GPUs can be effectively allocated to improve system performance.
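The parallel arrangement described above, in which speech recognition and sentence type classification both read the same input and neither waits for the other, can be sketched as follows. This is a minimal illustration under stated assumptions: `transcribe` and `classify_sentence_type` are hypothetical stand-ins that return fixed values, not the recognizer or classifier of the invention.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe(signal):
    """Stand-in for the STT engine (hypothetical): returns a fixed transcript."""
    return "when should I change the engine oil"


def classify_sentence_type(signal):
    """Stand-in for the pre-trained classifier (hypothetical)."""
    return "interrogative"


def recognize(signal):
    """Run transcription and sentence type classification concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both tasks read the same input signal and do not depend on each
        # other's output, so they can be submitted at the same time.
        text_future = pool.submit(transcribe, signal)
        type_future = pool.submit(classify_sentence_type, signal)
        return text_future.result(), type_future.result()
```

Threads are used here only for brevity. Whether they give a real speedup depends on how the two models are executed; separate processes or separate devices (for example, a CPU and a GPU) are equally plausible choices.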

[0048] To enable sentence type classification, the speech signal (i.e., waveform) received from the microphone is segmented and either used directly or converted into a spectrogram for input into the sentence type classification model. The model can consist of neural networks, such as convolutional neural networks (CNNs) or transformer-based encoders, which extract sentence-level features from the waveform or spectrogram input. A classification layer then assigns sentence type labels, such as statement, interrogative, or exclamation.
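The segmentation and spectrogram conversion mentioned above can be sketched with the standard library alone. This is an illustrative sketch, not the conversion program of the invention: the function names, the frame length, and the hop size are assumptions, the DFT is computed naively, and no window function is applied. A practical front end would typically use an FFT with windowing.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a waveform into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitude of a naive DFT for one frame (non-negative frequencies)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len=8, hop=4):
    """Return one magnitude spectrum per frame (time x frequency)."""
    return [magnitude_spectrum(f)
            for f in frame_signal(samples, frame_len, hop)]
```

Each row of the result is one time frame and each column one frequency bin, which is the two-dimensional form a CNN or transformer encoder would take as input.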

[0049] Sentence type classification models can be pre-trained using a labeled speech-text corpus, where each utterance is annotated with its corresponding sentence type. During training, each utterance can be converted into Mel-frequency cepstral coefficients (MFCCs), which are widely used in speech recognition and can be used as feature vectors for training the model. The model learns to associate MFCC patterns with sentence types and can be fine-tuned to improve performance under various speaking styles or background noise conditions. In real-time applications, the trained model can process waveform or spectrogram inputs, making it more suitable for real-time processing environments.

[0050] After deployment, the pre-trained model uses the incoming speech signal to classify sentence types in real time. The sentence type labels are then used to guide downstream processes, such as inserting appropriate punctuation in the transcribed text or generating output commands for controlling devices or vehicles. This flow from signal processing to classification to actual response demonstrates that this invention represents a technological improvement to speech understanding and control systems, rather than merely an abstract concept.

[0051] A microphone installed in the vehicle captures input utterances, such as a user's speech, and converts them into a speech signal. The speech recognition module 110 receives the speech signal representing the input utterances from the microphone and uses at least one STT engine to transcribe it into an input sentence, i.e., a sequence of words. The speech recognition module 110 can use a conversion program to convert the speech signal into a spectrogram. The conversion program can be stored in the memory and/or storage device of the speech recognition device 100. The STT engine can use a speech recognition algorithm or a deep learning model to transcribe the speech signal, or the spectrogram obtained from it, into a sentence. For example, the speech recognition module 110 can extract feature vectors from the input utterances by using feature extraction methods such as cepstrum, linear prediction coefficients (LPC), Mel-frequency cepstral coefficients (MFCC), or filter bank energy.

[0052] The speech recognition module 110 obtains recognition results by comparing the extracted feature vectors with trained reference patterns. For this purpose, an acoustic model that models and compares the signal characteristics of speech, or a language model that models the linguistic order relationships of words or syllables corresponding to the recognized words, can be used.

[0053] The speech recognition module 110 can also convert the user's speech into text-based sentences based on a model trained using machine learning or deep learning.

[0054] The speech recognition module 110 can receive speech signals or spectrograms and use a pre-trained sentence type classification model to output the sentence type corresponding to the speech signal or spectrogram.

[0055] The speech recognition module 110 can combine sentences converted from user speech by the speech recognizer with punctuation marks representing sentence types classified based on a sentence type classification model. Before speech recognition, the speech recognition module 110 can preprocess the speech signal corresponding to the user's utterance. For example, the speech recognition module 110 can perform preprocessing to reduce noise in the speech signal.

[0056] Natural Language Understanding Module 120 uses at least one Natural Language Understanding (NLU) engine to classify the user's utterance intent from the input sentence and extract slots representing semantic information related to the utterance intent.

[0057] A slot is a semantic slot required to provide a response based on the utterance intent. Slots can be predefined for each utterance intent. The function of a slot is determined by the utterance intent. For example, in the input statement "Take me to Yanghua Bridge", "Yanghua Bridge" can represent a point of interest, while in the input statement "Play Yanghua Bridge", "Yanghua Bridge" can represent a song title.

[0058] In implementation, the NLU engine can compare the input sentence with a preset syntax to determine the user's utterance intent and the slot of the input sentence. For example, when the preset syntax is "call <someone>" and the input sentence is "call Hong Gil-dong", the NLU engine can determine that the utterance intent is "call" and the slot value is "Hong Gil-dong".
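The grammar-matching approach in the example above can be sketched with regular expressions. The `GRAMMARS` table and `match_grammar` function are hypothetical; the "call" pattern mirrors the "call <someone>" grammar in the text, while the "navigate" pattern is an added assumption based on the earlier "Take me to Yanghua Bridge" example.

```python
import re

# Hypothetical grammar table: each intent maps to a pattern with named slots.
GRAMMARS = {
    "call": re.compile(r"^call (?P<someone>.+)$", re.IGNORECASE),
    "navigate": re.compile(r"^take me to (?P<destination>.+)$", re.IGNORECASE),
}


def match_grammar(sentence):
    """Return (intent, slots) for the first matching grammar, else (None, {})."""
    # Strip whitespace and terminal punctuation before matching.
    text = sentence.strip().rstrip(".?!")
    for intent, pattern in GRAMMARS.items():
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    return None, {}
```

Calling `match_grammar("call Hong Gil-dong")` returns the intent "call" with the slot `someone` set to "Hong Gil-dong", matching the example in the text.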

[0059] In another implementation, the NLU engine can use tokenization, deep learning models, etc., to determine the user's utterance intent and the slots of the input sentence.

[0060] Specifically, the NLU engine segments the input sentence into morphemes at the speech rate level. A morpheme is the smallest unit that carries meaning and cannot be segmented further. The NLU engine can tag each morpheme with a part-of-speech tag.

[0061] The NLU engine maps tokens to a vector space. Each token or combination of tokens is transformed into an embedding vector. To improve performance, sequence embedding, positional embedding, etc., can be performed simultaneously.

[0062] The NLU engine determines the utterance intent and slots of an input sentence by grouping embedding vectors or by applying a first deep learning model and a second deep learning model to the embedding vectors. The first deep learning model can be a pre-trained recurrent neural network (RNN) used to classify the utterance intent in response to the input embedding vectors. The second deep learning model can be a pre-trained recurrent neural network used to determine the slots in response to the input embedding vectors.

[0063] The Natural Language Understanding module 120 can use an NLU engine to extract domains, named entities, or speech behaviors from sentences.

[0064] A domain is information used to identify the topics of a user's discourse. For example, a domain can be determined based on sentences to represent various topics such as vehicle control, information provision, text transmission, and navigation functions.

[0065] Named entities are proper nouns, such as people's names, place names, organization names, times, dates, currencies, etc. Named Entity Recognition (NER) is the operation of identifying named entities from a sentence and classifying the identified named entities into types. NLU engines can use named entity recognition to extract important keywords from sentences to understand the meaning of the sentences.

[0066] Speech behavior analysis is a task used to identify the intent behind utterances, such as whether a user is asking a question, making a request, giving a response, or simply expressing emotion.

[0067] At least one of the domain, named entities, or speech behaviors can be used to classify the user's utterance intent, determine slots, or generate responses to user utterances.

[0068] The response generation module 130 performs processing to provide a response corresponding to the user's verbal intent. The response generation module 130 can provide the response in various forms. The response generation module 130 can use visual, auditory, or tactile interfaces to provide a response to the user's verbal intent.

[0069] The response generation module 130 can generate user-understandable response information using a generative model. The response generation module 130 can use the generative model to generate complete sentences from information such as utterance intent, slots, domains, named entities, and speech behavior. For example, if the user's utterance intent is "vehicle-related control," the response generation module 130 can transmit a result processing signal to the vehicle for performing vehicle-related control.

[0070] In another example, if the user's utterance intent is "to provide specific information," the response generation module 130 can use a slot to search for specific information and provide the searched information to the user terminal. Alternatively, an external server can be used to perform the information search.

[0071] In another example, if the user's utterance intent is "to provide specific content", the response generation module 130 can request the transmission of the target content from an external server that provides the content.

[0072] In another example, if the user's utterance intent is "to engage in casual conversation," the response generation module 130 can generate response content in response to the user's utterance and output the response visually or aurally. The operation of outputting the response visually or aurally can be performed through the input/output interface shown in Figure 6.
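The four response cases described above (vehicle control, information search, content request, casual conversation) amount to dispatching on the utterance intent. The sketch below is one possible way to express that, under stated assumptions: the handler names, the intent keys, and the returned dictionaries are all hypothetical, and falling back to casual conversation for an unknown intent is an illustrative choice, not something the description specifies.

```python
def handle_vehicle_control(slots):
    """Send a result processing signal to the vehicle (placeholder)."""
    return {"target": "vehicle", "action": "send_control_signal", "slots": slots}


def handle_information(slots):
    """Search for information using the slots (placeholder)."""
    return {"target": "user_terminal", "action": "search", "slots": slots}


def handle_content(slots):
    """Request content from an external content server (placeholder)."""
    return {"target": "external_server", "action": "request_content", "slots": slots}


def handle_chat(slots):
    """Generate a conversational reply for visual or audio output (placeholder)."""
    return {"target": "input_output_interface", "action": "generate_reply", "slots": slots}


# Hypothetical intent keys mapped to their handlers.
HANDLERS = {
    "vehicle_control": handle_vehicle_control,
    "provide_information": handle_information,
    "provide_content": handle_content,
    "casual_conversation": handle_chat,
}


def generate_response(intent, slots):
    """Dispatch to the handler for the intent; unknown intents fall back to chat."""
    handler = HANDLERS.get(intent, handle_chat)
    return handler(slots)
```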

[0073] The following describes an example of the operation of the speech recognition device 100.

[0074] For example, when the input sentence is "When should I change the engine oil?", the NLU engine tokenizes the input sentence into words such as "engine", "oil", "when", "change", and "do", and converts each word into a vector. The NLU engine classifies the utterance intent corresponding to the vectors based on the similarity between vectors and their positions in the vector space. In the example above, the classified utterance intent is "check consumable replacement". The NLU engine extracts the slot values "engine" and "oil" based on the intent "check consumable replacement". Subsequently, the response generation module 130 can provide the sentence "The engine oil change interval is 15,000 km" based on that intent and the slot values "engine" and "oil".
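The similarity-based intent classification in this example can be illustrated with cosine similarity against one reference vector per intent. This is a sketch under assumptions: the description does not name a similarity measure, the reference vectors in `INTENT_VECTORS` are made-up toy values, and a real system would obtain sentence vectors from a trained embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy reference vectors, one per intent (values are made up for illustration).
INTENT_VECTORS = {
    "check consumable replacement": [0.9, 0.1, 0.0],
    "route setting": [0.0, 0.2, 0.9],
}


def classify_intent(sentence_vector):
    """Return the intent whose reference vector is most similar."""
    return max(INTENT_VECTORS,
               key=lambda name: cosine_similarity(sentence_vector,
                                                  INTENT_VECTORS[name]))
```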

[0075] In another example, if the input sentence is "Let's go home", the domain is "navigation", the utterance intent is "route setting", and the slots required for control, based on the utterance intent, are "start point" and "end point".

[0076] In another example, if the input statement is "turn on the air conditioner", the domain is "vehicle control", the utterance intent is "turn on the air conditioner", and the slot required for control is "air conditioner" based on the utterance intent. Depending on the specific control required, additional slots may include "temperature" and "airflow".
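The domain, utterance intent, and required slots in the two examples above can be represented as a small data structure. The `Interpretation` class and the `REQUIRED_SLOTS` table are hypothetical names used only to illustrate how an intent determines which slots must be filled; the slot names are taken from the examples.

```python
from dataclasses import dataclass, field

# Hypothetical table of the slots each utterance intent requires.
REQUIRED_SLOTS = {
    "route setting": ("start point", "end point"),
    "turn on the air conditioner": ("air conditioner",),
}


@dataclass
class Interpretation:
    """Result of NLU for one sentence: domain, intent, and filled slots."""
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)

    def missing_slots(self):
        """Required slots that have not been filled yet."""
        required = REQUIRED_SLOTS.get(self.intent, ())
        return [name for name in required if name not in self.slots]
```

For "Let's go home", an `Interpretation` with only "end point" filled would report "start point" as missing, which a system could resolve from the vehicle's current position or by asking the user.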

[0077] The speech recognition device 100 includes at least one processor and a memory for storing at least one computer-executable instruction. The speech recognition device 100 can perform the functions of the speech recognition module 110, the natural language understanding module 120, and the response generation module 130 when the at least one processor executes the computer-executable instructions. The speech recognition device 100 may also include a communication module for communicating with external devices.

[0078] Not all of the modules shown in Figure 1 are necessary components; in other embodiments, some modules included in the voice recognition device 100 may be added, changed, or removed. The components shown in Figure 1 represent functionally distinct elements, and one or more of them can be integrated with each other in an actual physical environment.

[0079] Those skilled in the art will understand that one or more modules described herein (e.g., the speech recognition module 110, the natural language understanding module 120, the response generation module 130, the speech recognizers, and the pre-trained sentence type classification model) can be implemented using a tangible computer-readable medium or non-transitory memory that includes computer-executable instructions (e.g., executable software code) executed by specially configured hardware or processors (e.g., the one or more processors 620 described in detail with reference to Figure 6). It should be understood that embodiments of the present invention can be implemented as different or separate modules of the speech recognition device 100, or as a separate computer system coupled to the speech recognition device 100.

[0080] The sentence type classification model described above, which processes waveform or spectrogram inputs to output sentence type labels, can be implemented using executable software code stored in a non-transitory computer-readable medium. Further details regarding the pre-training of the model, the classification results, acoustic signal processing, and downstream applications are provided elsewhere in this description.

[0081] Figure 2 is a schematic diagram illustrating the relationship between a vehicle and a voice recognition device according to an embodiment of the present invention. Referring to Figure 2, the voice recognition device 100 can be implemented using at least one of the vehicle 210 or the server 220. In one embodiment, the voice recognition device 100 can be implemented in the vehicle 210. In other words, the vehicle 210 may include the voice recognition device 100.

[0082] The voice recognition device 100 in vehicle 210 can acquire user speech received through the microphone in the vehicle, recognize and understand the user's speech, and provide a response to vehicle 210 corresponding to the user's speech. Vehicle 210 can use the speakers in vehicle 210 to provide a response to the user.

[0083] In another embodiment, the voice recognition device 100 can be implemented in the server 220. The communication interface in the vehicle 210 can transmit the user's speech or voice commands to the voice recognition device 100 in the server 220. The voice recognition device 100 processes the speech or voice commands to generate the information or control commands required by the user and transmits the information or control commands to the communication interface in the vehicle 210. The vehicle 210 can then use its speakers to provide a response to the user.

[0084] The speech recognition module 110, natural language understanding module 120, and response generation module 130 in the speech recognition device 100 can be distributed in the vehicle 210 and the server 220.

[0085] Figure 3 is a diagram illustrating a speech recognition module 110 according to an embodiment of the present invention.

[0086] The speech recognition module 110 may include at least one speech recognizer 301, 302, and 303 and a pre-trained sentence type classification model 310. When the multiple speech recognizers 301, 302, and 303 each convert the user's utterance into a sentence, the speech recognition module 110 may select one of their outputs, for example based on the accuracy of speech recognition. However, the selection criterion is not limited to accuracy. Specific selection methods will be apparent to those skilled in the art, and a detailed description is therefore omitted.
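
Since the specification leaves the selection method open, the following is only a minimal sketch of one possible criterion: each recognizer is assumed to return a confidence score alongside its sentence, and the highest-scoring sentence is kept. The `Hypothesis` type, its fields, and the scores are hypothetical.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    """Hypothetical output of one speech recognizer."""
    recognizer_id: int   # e.g., 301, 302, or 303
    sentence: str
    confidence: float    # assumed score in [0, 1]; not defined by the specification


def select_hypothesis(hypotheses):
    """Return the hypothesis with the highest confidence."""
    if not hypotheses:
        raise ValueError("at least one recognizer output is required")
    return max(hypotheses, key=lambda h: h.confidence)


candidates = [
    Hypothesis(301, "Seoul is a livable place", 0.91),
    Hypothesis(302, "Soul is a livable place", 0.62),
    Hypothesis(303, "Seoul is a liveable plays", 0.47),
]
best = select_hypothesis(candidates)
```

Other criteria, such as latency or agreement between recognizers, could replace the `key` function without changing the rest of the flow.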

[0087] Based on the sentence type classified by the sentence type classification model 310, the speech recognition module 110 inserts punctuation marks into the sentence converted from the user's speech by the speech recognizer 301, 302, or 303. The combined result (i.e., the text-based combined result) is the sentence selected from the multiple sentences converted by the speech recognizers, with punctuation marks inserted according to the sentence type. In other words, the speech recognition module 110 generates a text-based combined result based on the combination of the sentence and the inserted punctuation marks. The speech recognition module 110 may insert a punctuation mark only when the sentence is an interrogative or exclamatory sentence, because inserting a period in a declarative sentence would reduce user convenience. There are three possible results of inserting punctuation marks into the sentence. First, if the speech recognition result selected from the processing results of the speech recognizers 301, 302, and 303 is "Seoul is a livable place" and the sentence type classification model 310 classifies the sentence as a declarative sentence, no period is inserted, and the combined result is "Seoul is a livable place". Second, if the selected speech recognition result is "Seoul is a livable place" and the sentence type classification model 310 classifies the sentence as an interrogative sentence, a question mark is inserted, and the combined result is "Seoul is a livable place?". Third, if the selected speech recognition result is "Seoul is a livable place" and the sentence type classification model 310 classifies the sentence as an exclamatory sentence, an exclamation mark is inserted, and the combined result (i.e., the text-based combined result) is "Seoul is a livable place!".
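
The three outcomes above can be expressed as a short lookup. This is a minimal sketch rather than the claimed implementation: the `SentenceType` enum and the `combine` function are hypothetical names, and stripping any terminal punctuation already present in the recognizer output is an added assumption.

```python
from enum import Enum


class SentenceType(Enum):
    DECLARATIVE = 0
    INTERROGATIVE = 1
    EXCLAMATORY = 2


# Declarative sentences intentionally have no entry: no period is inserted.
_MARKS = {
    SentenceType.INTERROGATIVE: "?",
    SentenceType.EXCLAMATORY: "!",
}


def combine(sentence, sentence_type):
    """Return the text-based combined result for a sentence and its type."""
    # Assumption: drop trailing whitespace and any terminal mark from the recognizer.
    base = sentence.rstrip().rstrip(".?!")
    return base + _MARKS.get(sentence_type, "")


results = [combine("Seoul is a livable place", t) for t in SentenceType]
```

Keeping the mapping in a table makes it straightforward to add further sentence types later without touching the combining logic.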

[0088] Speech recognizers 301, 302, and 303 may be speech-to-text (STT) engines. Speech recognizers 301, 302, and 303 can receive speech signals, or spectrograms that the speech recognition module 110 generates from the speech signals using a spectrogram conversion program. Speech recognizers 301, 302, and 303 can use speech recognition algorithms or deep learning models to convert speech signals representing user utterances, or spectrograms generated based on the speech signals, into sentences.
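
The specification does not describe the spectrogram conversion program. As an illustration only, the sketch below assumes a short-time Fourier transform with a Hann window, written in pure Python for self-containment; the function name, `frame_size`, and `hop` are hypothetical, and a production system would more likely use an optimized FFT library.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return one magnitude spectrum (bins 0..frame_size//2) per frame."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
        for n in range(frame_size)
    ]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            # Direct discrete Fourier transform of the windowed frame at bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# A synthetic tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spectrogram = magnitude_spectrogram(tone)
```

Each row of the result is one time frame and each column one frequency bin, which is the two-dimensional layout a convolutional model can consume.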

[0089] Sentence type classification model 310 is a model that receives a speech signal or spectrogram and classifies sentence types using a neural network such as a convolutional neural network (CNN). Sentence types can include declarative sentences, interrogative sentences, exclamatory sentences, and the like. Sentence type classification model 310 can receive, as input, a speech signal captured by an onboard microphone or a spectrogram converted from that speech signal. Sentence type classification model 310 is trained to output the sentence type corresponding to the speech signal or spectrogram. Sentence type classification model 310 can include layers that receive the speech signal or spectrogram, process the waveform of the speech signal or the spectrogram, and extract features including at least one of prosody or pitch. Sentence type classification model 310 can use a softmax function to output the sentence type with the highest probability among declarative, interrogative, and exclamatory sentences. The softmax function outputs the probability of each class in a multi-class classification. The output of sentence type classification model 310 can be one of 0, 1, and 2, where 0 can represent a declarative sentence, 1 can represent an interrogative sentence, and 2 can represent an exclamatory sentence. In this way, sentence type classification model 310 classifies the sentence type of a speech signal or spectrogram.
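
The final classification step can be sketched as follows. The softmax function itself is standard; the raw scores (logits) passed in are hypothetical stand-ins for the output of the network's last layer, which the specification does not detail.

```python
import math

# Label encoding taken from the description: 0, 1, and 2.
SENTENCE_TYPES = {0: "declarative", 1: "interrogative", 2: "exclamatory"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_sentence_type(logits):
    """Return the most probable label and the full probability vector."""
    probabilities = softmax(logits)
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label, probabilities


# Hypothetical logits for an utterance with rising final pitch.
label, probabilities = predict_sentence_type([0.2, 2.5, -1.0])
sentence_type = SENTENCE_TYPES[label]
```

Returning the full probability vector alongside the label would let a caller apply a confidence threshold before inserting punctuation, although the specification does not require one.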

[0090] Speech recognizers 301, 302, and 303, along with sentence type classification model 310, can perform speech recognition and sentence type classification operations in parallel.
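
A minimal sketch of this parallel arrangement, using Python's standard `concurrent.futures` module, is shown below. The stub functions stand in for the real recognizers and the classification model and are purely hypothetical; an actual device might instead use separate processes, cores, or accelerators.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_stub(speech_signal):
    """Placeholder for a speech recognizer such as 301, 302, or 303."""
    return "Seoul is a livable place"


def classify_stub(speech_signal):
    """Placeholder for sentence type classification model 310 (1 = interrogative)."""
    return 1


def run_in_parallel(speech_signal, recognizers, classifier):
    """Submit every recognizer and the classifier at once, then collect results."""
    with ThreadPoolExecutor(max_workers=len(recognizers) + 1) as pool:
        sentence_futures = [pool.submit(r, speech_signal) for r in recognizers]
        type_future = pool.submit(classifier, speech_signal)
        sentences = [f.result() for f in sentence_futures]
        sentence_type = type_future.result()
    return sentences, sentence_type


sentences, sentence_type = run_in_parallel(
    [0.0, 0.1, -0.1], [recognize_stub, recognize_stub], classify_stub
)
```

Because the classifier takes the speech signal rather than the recognized text as input, it does not have to wait for transcription to finish, which is what makes this arrangement possible.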

[0091] Figure 4A is a diagram illustrating a traditional method for recognizing user utterances.

[0092] The traditional speech recognition model 410 is a model that recognizes a user's utterance using conventional speech recognition equipment. When the sentence converted from the user's speech is an interrogative or exclamatory sentence, the traditional speech recognition model 410 cannot recognize punctuation marks. Because the traditional speech recognition model 410 inputs sentences into the natural language understanding (NLU) module in the form of declarative sentences regardless of the sentence type, the response generation module may not be able to accurately process the user's request. For example, in one case, when the user's speech is "I need to refuel", the sentence converted by the speech recognition (ASR) module is "I need to refuel", and the response output by the response generation module using the intent and slots output by the natural language understanding module might be "To find a gas station near your current location, please say 'Add gas station as a waypoint'". In another case, if the user's speech is "I need to refuel?", the sentence converted by the speech recognition (ASR) module is still "I need to refuel", and the response output by the response generation module using the intent and slots output by the natural language understanding module might be "Please say 'Add gas station as a waypoint'".

[0093] In contrast, Figure 4B is a diagram illustrating a method for combining transcribed speech and punctuation marks according to an embodiment of the present invention.

[0094] Referring to Figure 4B, the speech recognition model 420 according to an embodiment of the present invention is a model that recognizes a user's utterance using the speech recognition device according to the present invention. Even when the sentence converted from the user's speech is an interrogative or exclamatory sentence, the speech recognition model 420 can combine the sentence with punctuation marks suitable for the sentence type. Because the speech recognition model 420 inputs a sentence with punctuation marks corresponding to the sentence type into the natural language understanding module, the response generation module can accurately process the user's request. The speech recognition model 420 thereby improves user convenience by accurately recognizing the user's intent. For example, in one case, if the user's speech is "I need to refuel", the speech recognition module converts the speech into the sentence "I need to refuel". The user's intent is to find a gas station because the user needs to refuel. The response generated by the response generation module using the intent and slots output by the natural language understanding module could be "To find a gas station near your current location, please say 'Add gas station as a waypoint'". In another case, if the user's speech is "I need to refuel?", the speech recognition module converts the speech into the sentence "I need to refuel?". The user's intent is to check whether refueling is currently needed. The response generated by the response generation module using the intent and slots output by the natural language understanding module could be "Current driving distance is x km. Following the currently provided route, you will not run out of fuel".

[0095] Figure 5 is a flowchart illustrating a method for combining transcribed speech and punctuation marks according to an embodiment of the present invention.

[0096] Referring to Figure 5, the speech recognition device that combines transcribed speech and punctuation marks receives an input utterance as a speech signal using a microphone in the vehicle (step S501). The device can use the communication interface shown in Figure 6 to receive the speech signal.

[0097] The speech recognition module can use at least one speech recognizer to convert a speech signal or a spectrogram generated from a speech signal into a sentence (step S502). During the process of step S502, the speech recognition module can use multiple speech recognizers to convert a speech signal or a spectrogram generated based on a speech signal into a sentence. The speech recognition module can select any one of the results processed by the multiple speech recognizers.

[0098] The sentence type classification model can receive speech signals or spectrograms generated based on speech signals and classify sentences into sentence types (step S503). Sentence types can include declarative sentences, interrogative sentences, and exclamatory sentences. The sentence type classification model can receive speech signals or spectrograms generated based on speech signals and extract features including at least one of prosody or pitch. The sentence type classification model can classify sentences into sentence types based on the extracted features. The sentence type classification model can use a softmax function to classify sentence types.

[0099] The speech recognition module can insert punctuation marks into sentences selected from the processing results of at least one speech recognizer based on sentence type (step S504). The speech recognition module can insert punctuation marks suitable for the sentence type into the selected sentence to combine the sentence and punctuation marks. If the sentence type is an interrogative sentence (e.g., a question) or an exclamatory sentence, the speech recognition module can combine the selected sentence with punctuation marks suitable for the sentence type. If the sentence type is an interrogative sentence, a question mark can be set as a punctuation mark; if the sentence type is an exclamatory sentence, an exclamation mark can be set as a punctuation mark.
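
Steps S502 to S504 can be tied together in a single function, sketched below under stated assumptions. The callables passed in (`recognizers`, `select`, `classify`) are hypothetical placeholders for the components described above, the sentence types are represented as plain strings for readability, and step S501 (capturing the speech signal) is assumed to have already produced `speech_signal`.

```python
PUNCTUATION_BY_TYPE = {"interrogative": "?", "exclamatory": "!"}


def combine_transcription_and_punctuation(speech_signal, recognizers, select, classify):
    """Run steps S502 to S504 on an already captured speech signal."""
    # S502: convert the signal into candidate sentences and select one.
    candidates = [recognize(speech_signal) for recognize in recognizers]
    sentence = select(candidates)

    # S503: classify the sentence type from the signal itself.
    sentence_type = classify(speech_signal)

    # S504: insert a mark only for interrogative or exclamatory sentences.
    return sentence + PUNCTUATION_BY_TYPE.get(sentence_type, "")


# Usage with stand-in components.
result = combine_transcription_and_punctuation(
    speech_signal=[0.0, 0.2, -0.1],
    recognizers=[lambda s: "I need to refuel"],
    select=lambda candidates: candidates[0],
    classify=lambda s: "interrogative",
)
```

The steps are written sequentially here for clarity; S502 and S503 do not depend on each other and could equally be executed in parallel.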

[0100] Figure 6 is a configuration diagram of a speech recognition device according to an embodiment of the present invention.

[0101] Referring to Figure 6, the voice recognition device 600 may include some or all of the following: a non-transitory memory 610, a processor 620, a storage device 630, an input / output interface 640, and a communication interface 650.

[0102] The voice recognition device 600 can be a fixed computing device such as a desktop computer, server, or AI accelerator, or a mobile computing device such as a laptop or smartphone.

[0103] The memory 610 may store a program that causes the processor 620 to execute a method for combining words and punctuation marks according to an embodiment of the present invention. For example, the program may include a plurality of instructions executable by the processor 620, and the method for combining words and punctuation marks may be executed by the processor 620 executing the plurality of instructions.

[0104] The memory 610 can be a single memory or multiple memories. The information required to combine words and punctuation marks can be stored in a single memory or divided and stored across multiple memories. When the memory 610 consists of multiple memories, the multiple memories can be physically separated.

[0105] The memory 610 may include at least one of volatile memory or non-volatile memory. The volatile memory may include static random access memory (SRAM) or dynamic random access memory (DRAM), and the non-volatile memory may include flash memory.

[0106] Processor 620 may include at least one core capable of executing at least one instruction. Processor 620 may execute instructions stored in memory 610. Processor 620 may be a single processor or multiple processors.

[0107] The speech recognition module 110, the natural language understanding module 120, and the response generation module 130 can be implemented using the processor 620.

[0108] Even when the power to the voice recognition device 600 is cut off, the storage device 630 can retain the stored data. For example, the storage device 630 may include non-volatile memory, or storage media such as magnetic tape, optical disc, or magnetic disk.

[0109] The program stored in storage device 630 can be loaded into memory 610 before being executed by processor 620. Storage device 630 can store files written in a programming language, and can load programs generated from files by compilers or the like into memory 610.

[0110] Storage device 630 can store data to be processed by processor 620 and data that has already been processed by processor 620.

[0111] The input / output interface 640 may include input devices such as a keyboard or mouse and output devices such as a display device or printer. A user may also use the input / output interface 640 to trigger the processor 620 to execute a program.

[0112] The communication interface 650 provides access to external communication networks. For example, the voice recognition device 600 can use the communication interface 650 to communicate with other devices.

[0113] Each element of the device or method according to the invention can be implemented in hardware, software, or a combination of both. The function of each element can be implemented in software. A microprocessor can be implemented to execute the software functions corresponding to each element.

[0114] Various implementations of the systems and techniques described herein can be implemented using digital electronic circuits, integrated circuits, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), computer hardware, firmware, software, and / or combinations thereof. Various implementations may include implementations of one or more computer programs executable on a programmable system. The programmable system includes: at least one specially configured programmable processor, which may be a dedicated or general-purpose processor, coupled to receive and transfer data and instructions to a storage system; at least one input device; and at least one output device. The computer program (also referred to as a program, software, software application, or code) includes instructions for the programmable processor and is stored in a computer-readable recording medium.

[0115] Computer-readable recording media can include all types of storage devices on which computer-readable data can be stored. Computer-readable recording media can be non-volatile or non-transitory media, such as read-only memory (ROM), random access memory (RAM), compact disc ROM (CD-ROM), magnetic tape, floppy disk, or optical data storage devices. Furthermore, computer-readable recording media can also include transient media such as data transmission media. Moreover, computer-readable recording media can be distributed across computer systems connected via a network, and computer-readable program code can be stored and executed in a distributed manner.

[0116] Although the flowcharts / timing diagrams in this specification illustrate sequentially executed operations, this is merely a description of the technical concept of one embodiment of the invention. Those skilled in the art to which the various embodiments of the invention pertain will understand that various modifications and changes can be made without departing from the essential characteristics of the embodiments of the invention. For example, the order shown in the flowcharts / timing diagrams can be changed, and one or more operations can be executed in parallel. Therefore, the flowcharts / timing diagrams are not limited to a temporal order.

[0117] Although various embodiments of the invention have been described for illustrative purposes, those skilled in the art will understand that various modifications, additions, and substitutions are possible without departing from the concept and scope of the claimed invention. The embodiments have been described briefly for the sake of brevity and clarity, and the scope of the technical concept of these embodiments is not limited by the illustrations. Therefore, those skilled in the art will understand that the scope of the claimed invention is not limited to the embodiments explicitly described above, but is defined by the claims and their equivalents.

Claims

1. A method for combining transcribed utterances and punctuation using a pre-trained sentence type classification model, the method comprising the following steps: receiving a speech signal from a microphone in a vehicle, the speech signal representing an input utterance captured by the microphone; converting, using at least one speech recognizer, the speech signal or a spectrogram generated based on the speech signal into a sentence, wherein the sentence represents a transcription of the input utterance into text; classifying, using the pre-trained sentence type classification model, the sentence into a sentence type corresponding to the speech signal or the spectrogram; inserting punctuation marks into the sentence based on the sentence type; and generating a text-based combination result based on the combination of the sentence and the inserted punctuation marks.

2. The method according to claim 1, wherein converting the speech signal or the spectrogram includes the following steps: using multiple speech recognizers to convert the speech signal or the spectrogram into multiple sentences; and selecting one sentence from the multiple sentences converted by the multiple speech recognizers.

3. The method according to claim 1, wherein classifying the sentence using the pre-trained sentence type classification model includes the following steps: receiving the speech signal or the spectrogram, processing the waveform of the speech signal or the spectrogram, and extracting features including at least one of prosody and pitch; and classifying the sentence based on the extracted features.

4. The method according to claim 3, further comprising the following steps: receiving the speech signal or the spectrogram through layers of the pre-trained sentence type classification model; processing the waveform of the speech signal or the spectrogram through the layers of the pre-trained sentence type classification model; and extracting the features through the layers of the pre-trained sentence type classification model.

5. The method according to claim 1, wherein the sentence types include declarative sentences, interrogative sentences, and exclamatory sentences.

6. The method according to claim 1, wherein inserting punctuation marks into the sentence includes the following step: when the sentence type is an interrogative sentence or an exclamatory sentence, inserting a punctuation mark into the sentence, wherein when the sentence type is an interrogative sentence, the punctuation mark is set to a question mark, and when the sentence type is an exclamatory sentence, the punctuation mark is set to an exclamation mark.

7. The method of claim 1, further comprising generating the spectrogram by converting the speech signal received from the microphone into a spectrogram.

8. The method of claim 1, further comprising generating a text-based response output based on the text-based combination result.

9. The method of claim 8, further comprising displaying the text-based response output on a display device of the vehicle.

10. The method of claim 8, further comprising generating and transmitting a result processing signal based on the text-based response output to control the vehicle.

11. An apparatus for combining transcribed utterances and punctuation using a pre-trained sentence type classification model, the apparatus comprising: a memory configured to store computer-executable instructions; and at least one processor configured to execute the computer-executable instructions to perform the following operations: receiving a speech signal from a microphone in a vehicle, the speech signal representing an input utterance captured by the microphone; converting, using at least one speech recognizer, the speech signal or a spectrogram generated based on the speech signal into a sentence, wherein the sentence represents a transcription of the input utterance into text; classifying, using the pre-trained sentence type classification model, the sentence into a sentence type corresponding to the speech signal or the spectrogram; inserting punctuation marks into the sentence based on the sentence type; and generating a text-based combination result based on the combination of the sentence and the inserted punctuation marks.

12. The apparatus according to claim 11, wherein the processor is further configured to: implement multiple speech recognizers configured to convert the speech signal or the spectrogram into multiple sentences; and select one sentence from the multiple sentences converted by the multiple speech recognizers.

13. The apparatus according to claim 11, wherein the processor is further configured to use the pre-trained sentence type classification model to perform the following operations: receiving the speech signal or the spectrogram and extracting features including at least one of prosody and pitch; and classifying the sentence based on the extracted features.

14. The apparatus according to claim 13, wherein the pre-trained sentence type classification model includes layers configured to receive the speech signal or the spectrogram, process the waveform of the speech signal or the spectrogram, and extract the features.

15. The apparatus according to claim 11, wherein the sentence types include declarative sentences, interrogative sentences, and exclamatory sentences.

16. The apparatus according to claim 11, wherein the processor is further configured to insert a punctuation mark into the sentence when the sentence type is an interrogative sentence or an exclamatory sentence, wherein when the sentence type is an interrogative sentence, the punctuation mark is set to a question mark, and when the sentence type is an exclamatory sentence, the punctuation mark is set to an exclamation mark.

17. The apparatus according to claim 11, wherein the processor is further configured to generate the spectrogram by converting the speech signal received from the microphone into a spectrogram.

18. The apparatus according to claim 11, wherein the processor is further configured to generate a text-based response output based on the text-based combination result.

19. The apparatus according to claim 18, wherein the processor is further configured to display the text-based response output on a display device of the vehicle.

20. The apparatus according to claim 18, wherein the processor is further configured to generate and transmit a result processing signal based on the text-based response output to control the vehicle.