Speech recognition methods, systems, devices, media, and products for text matching
By receiving audio data for feature extraction and probability generation, the shortcomings of manual control and cloud recognition in existing teleprompter technologies are overcome. This enables accurate recognition of professional words and hot words in offline or weak network environments, thus improving the automatic teleprompter effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU LINGBAN TECH CO LTD
- Filing Date
- 2025-07-08
- Publication Date
- 2026-06-30
Figure CN120636406B_ABST
Description
Technical Field
[0001] Embodiments of this disclosure relate to the field of computer technology, and more specifically to speech recognition methods, systems, devices, media, and products for text matching.
Background Technology
[0002] In recent years, teleprompter technology, combined with automatic speech recognition (ASR) technology, has been increasingly applied to scenarios such as speeches, live broadcasts, and video recordings, providing speakers with automatically scrolling prompts. Currently, commonly used prompting methods include manually controlling the scrolling speed and position via a Bluetooth remote control or foot switch, or real-time audio capture and transmission to the cloud or general offline automatic speech recognition for transcription, with the recognized text then used as prompts.
[0003] However, the inventors discovered that the following technical problems often exist when using the above methods: manual control requires manual intervention, which affects the fluency of expression, and requires looking at a fixed teleprompter position, affecting the presentation effect; cloud-based solutions are susceptible to delays or interruptions due to network fluctuations, or in some situations, cloud recognition solutions cannot be used due to privacy protection requirements; general offline automatic speech recognition is difficult to cover key words such as professional terms, industry terms, names of people and places, and brand names in the speech, resulting in a large number of missed or misrecognized words.
[0004] The information disclosed in this background section is only intended to enhance the understanding of the background of the inventive concept, and therefore may contain information that does not constitute prior art known to those skilled in the art.
Summary of the Invention
[0005] The summary portion of this disclosure is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description portion. This summary portion is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
[0006] Some embodiments of this disclosure provide speech recognition methods, systems, electronic devices, computer-readable media, and computer program products for text matching to address one or more of the technical problems mentioned in the background section above.
[0007] In a first aspect, some embodiments of this disclosure provide a speech recognition method for text matching. The method includes: receiving audio data corresponding to target text data sent by a display device connected in a communication link; extracting features from the audio data to obtain audio feature information; generating acoustic probability information corresponding to the audio data based on the domain type corresponding to the target text data and the audio feature information; generating domain probability information corresponding to the audio data based on the domain type and the acoustic probability information; generating a candidate sentence list based on bias information corresponding to the target text data, the acoustic probability information, and the domain probability information; generating recognized text information based on the candidate sentence list; and sending the recognized text information to the display device, causing the display device to perform text matching between the recognized text information and the target text data and to highlight the text matching result.
[0008] Optionally, the above method further includes: identifying keywords corresponding to the target text data to obtain a keyword set; and identifying the domain type corresponding to the target text data based on the keyword set.
[0009] Optionally, the above method further includes: generating a word search tree based on the keyword set; training an initial language model based on the keyword set to obtain a trained language model; generating a bias vector based on the keyword set; and determining the word search tree, the trained language model, and the bias vector as bias information.
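As an illustration of the word search tree mentioned above, the following is a minimal sketch of a character-level prefix tree built from a keyword set. The names `TrieNode`, `build_trie`, and `allowed_next_chars` are hypothetical and do not come from this disclosure; the disclosure does not specify how the tree is stored or queried.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a character-level word search tree (hypothetical layout)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_keyword = False  # True when a keyword ends at this node


def build_trie(keywords: Iterable[str]) -> TrieNode:
    """Insert every keyword into the tree, one character per level."""
    root = TrieNode()
    for word in keywords:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_keyword = True
    return root


def allowed_next_chars(root: TrieNode, prefix: str) -> List[str]:
    """Return the characters that extend `prefix` toward at least one keyword."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []  # the prefix does not lead to any keyword
        node = node.children[ch]
    return sorted(node.children)
```

A decoder could consult such a tree to decide which continuations of a partial hypothesis deserve a bias bonus; that use is an assumption about one possible design, not a statement of the claimed method.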
[0010] Optionally, the above-mentioned feature extraction of the audio data to obtain audio feature information includes: performing frame segmentation processing on the audio data to obtain an audio frame sequence; for each audio frame in the audio frame sequence, performing feature extraction on the audio frame to obtain audio frame feature information; and superimposing the obtained audio frame feature information based on a preset window to obtain each audio frame feature tensor as audio feature information.
[0011] Optionally, generating acoustic probability information corresponding to the audio data based on the domain type corresponding to the target text data and the audio feature information includes: inputting the audio feature information into a pre-loaded acoustic model corresponding to the domain type to obtain the probability of each sub-word unit as acoustic probability information, wherein each sub-word unit probability corresponds to an audio frame and a sub-word unit.
[0012] Optionally, generating domain probability information corresponding to the audio data based on the domain type and the acoustic probability information includes: determining each target word unit based on the acoustic probability information; combining the historical prefix and each target word unit into the current prefix; inputting the current prefix into a pre-loaded language model corresponding to the domain type to obtain the probability of each word unit as domain probability information, wherein each word unit probability corresponds to an audio frame and a word unit.
[0013] Optionally, generating recognition text information based on the candidate sentence list includes: optimizing the candidate sentence list to obtain an optimized candidate sentence list; and generating recognition text information based on the optimized candidate sentence list.
[0014] Optionally, the above optimization process for the candidate sentence list to obtain an optimized candidate sentence list includes: determining the recognition location information corresponding to the target text data at the current time; extracting context information from the target text data based on the recognition location information; for each candidate sentence in the candidate sentence list, performing the following steps: determining the matching length of the candidate sentence in the context information; determining the ratio of the matching length to the string length of the context information as the matching value; generating a comprehensive value for the candidate sentence based on the matching value and the candidate value corresponding to the candidate sentence; and reordering each candidate sentence based on the comprehensive value of each candidate sentence in the candidate sentence list to optimize the candidate sentence list.
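The reordering described above can be sketched as follows. This is only an illustration under stated assumptions: the matching length is taken to be the longest common substring between the candidate and the context, and the comprehensive value is a weighted sum with a hypothetical `weight` parameter. The disclosure does not fix either choice.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def match_value(candidate: str, context: str) -> float:
    """Matching length divided by the string length of the context."""
    if not context:
        return 0.0
    matcher = SequenceMatcher(None, candidate, context, autojunk=False)
    # Assumption: "matching length" = length of the longest common substring.
    longest = matcher.find_longest_match(0, len(candidate), 0, len(context))
    return longest.size / len(context)


def rerank(candidates: List[Tuple[str, float]], context: str,
           weight: float = 0.5) -> List[Tuple[str, float]]:
    """Reorder (sentence, candidate value) pairs by a comprehensive value."""
    scored = [
        (text, weight * match_value(text, context) + (1.0 - weight) * value)
        for text, value in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With the context "welcome to the launch event", a candidate "welcome to the launch" with candidate value 0.5 would outrank "welcome to the lunch" with candidate value 0.6, because the first candidate matches a longer span of the script.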
[0015] Optionally, the above optimization process for the candidate sentence list to obtain an optimized candidate sentence list includes: selecting the candidate sentence with the highest comprehensive value from the optimized candidate sentence list; performing word segmentation on the candidate sentence to obtain a word segmentation set; for each word segment in the word segmentation set, performing the following steps: determining the pinyin corresponding to the word segment; determining whether there is a word pinyin corresponding to the pinyin in the pre-constructed word pinyin mapping information corresponding to the target text data, wherein the word pinyin mapping information includes words in the target text data and corresponding word pinyin; in response to determining that there is, determining the word corresponding to the word pinyin in the word pinyin mapping information; determining whether the word segment is the same as the word; in response to determining that there is a difference, determining whether there is the word in the adjacent text of the candidate sentence in the target text data; in response to determining that there is the word in the adjacent text, replacing the word segment in the candidate sentence with the word to optimize the candidate sentence.
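The homophone correction described above can be sketched as follows. The function name `correct_homophones` and the `to_pinyin` callable are hypothetical; a real system would obtain pinyin from a pronunciation dictionary or library, which this disclosure does not name. The sketch assumes the candidate sentence has already been segmented.

```python
from typing import Callable, Dict, List


def correct_homophones(segments: List[str],
                       script_pinyin_to_word: Dict[str, str],
                       adjacent_text: str,
                       to_pinyin: Callable[[str], str]) -> List[str]:
    """Replace a word segment with the script word that shares its pinyin.

    A replacement happens only when (1) the pinyin of the segment exists in
    the mapping built from the target text, (2) the mapped word differs from
    the segment, and (3) the mapped word occurs in the adjacent script text.
    """
    corrected = []
    for segment in segments:
        word = script_pinyin_to_word.get(to_pinyin(segment))
        if word is not None and word != segment and word in adjacent_text:
            corrected.append(word)
        else:
            corrected.append(segment)
    return corrected
```

For example, if the script contains "灵伴" and the recognizer outputs the homophone "领班" (both "ling ban" when tones are ignored), the segment would be replaced only when "灵伴" also appears in the text adjacent to the current position.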
[0016] Optionally, the above optimization process for the candidate sentence list to obtain an optimized candidate sentence list includes: for each candidate sentence in the candidate sentence list, performing the following steps: performing word segmentation on the candidate sentence to obtain a word segmentation set; for each word segment in the word segmentation set, performing the following steps: determining the phoneme corresponding to the word segment as a word segmentation phoneme; determining the similarity between the word segmentation phoneme and each word segmentation phoneme in the pre-constructed word segmentation phoneme mapping information, wherein the word segmentation phoneme mapping information includes words and word segmentation phonemes corresponding to the words; determining the word segmentation phonemes whose similarity to the word segmentation phonemes among the word segmentation phonemes meets a preset similarity condition as target word segmentation phonemes; in response to determining that the similarity between the word segmentation phoneme and the target word segmentation phoneme meets a preset threshold condition, determining the word corresponding to the target word segmentation phoneme as a target word; in response to determining that the target word is different from the word segmentation, replacing the word segmentation in the candidate sentence with the target word to optimize the candidate sentence.
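A minimal sketch of the phoneme-similarity lookup described above is given here. It assumes that similarity is measured with the standard-library `difflib.SequenceMatcher` ratio over phoneme sequences and that the threshold is 0.8; both are illustrative choices, and the names `phoneme_similarity` and `best_phonetic_match` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Sequence


def phoneme_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1] between two phoneme sequences."""
    return SequenceMatcher(None, list(a), list(b), autojunk=False).ratio()


def best_phonetic_match(segment_phonemes: Sequence[str],
                        lexicon: Dict[str, Sequence[str]],
                        threshold: float = 0.8) -> Optional[str]:
    """Return the lexicon word closest in sound, or None if below threshold."""
    best_word, best_sim = None, 0.0
    for word, phonemes in lexicon.items():
        sim = phoneme_similarity(segment_phonemes, phonemes)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word if best_sim >= threshold else None
```

A caller would replace the word segment only when the returned target word differs from it, mirroring the final condition in the paragraph above.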
[0017] Optionally, the above optimization process for the candidate sentence list to obtain an optimized candidate sentence list includes: for each candidate sentence in the candidate sentence list, performing the following steps: performing word segmentation on the candidate sentence to obtain a word segmentation set; for each word segment in the word segmentation set, performing the following steps: matching the word segment with a pre-built user dictionary to obtain a word matching result; and replacing the word segment in the candidate sentence with the word matching result.
[0018] Optionally, the process of identifying keywords corresponding to the target text data to obtain a keyword set includes: extracting plain text from the target text data to obtain target text; cleaning the target text to obtain cleaned text; segmenting the cleaned text into sentences to obtain a sentence list; for each sentence in the sentence list, performing the following steps: segmenting the sentence into words to obtain a word list corresponding to the sentence; performing entity recognition on the sentence to obtain entity recognition results; removing words that meet preset conditions from the word list to update the word list; for each word in the updated word list, performing the following steps: determining the word frequency of the word corresponding to the target text data; generating a word frequency score for the word corresponding to the target text data based on the word frequency; determining the connectivity score of the word; generating a word score based on the word frequency score and the connectivity score; selecting a preset number of words whose word scores meet preset score conditions from each obtained word list as keywords, and using each obtained entity recognition result as supplementary keywords to obtain a keyword set.
[0019] Optionally, the process of identifying the domain type corresponding to the target text data based on the keyword set includes: for each domain dictionary in a preset domain dictionary set, determining the hit information of the keyword set corresponding to the domain dictionary; selecting the domain dictionary with the highest hit information from the domain dictionary set as a candidate domain dictionary; in response to determining that the hit information corresponding to the candidate domain dictionary is greater than or equal to a preset threshold, determining the domain corresponding to the candidate domain dictionary as the domain type corresponding to the target text data, wherein the confidence level of the domain type corresponds to the hit information; and in response to determining that the hit information corresponding to the candidate domain dictionary is less than the preset threshold, generating the domain type and confidence level corresponding to the target text data based on the target text data and a pre-trained domain recognition model.
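The dictionary-first, model-fallback logic described above can be sketched as follows. The sketch assumes that the hit information is the fraction of keywords found in a domain dictionary, and it represents the pre-trained domain recognition model by a hypothetical `fallback` callable; neither detail is specified in this disclosure.

```python
from typing import Callable, Dict, Set, Tuple


def detect_domain(keywords: Set[str],
                  domain_dicts: Dict[str, Set[str]],
                  threshold: float,
                  fallback: Callable[[], Tuple[str, float]]) -> Tuple[str, float]:
    """Return (domain type, confidence) for a keyword set."""
    if not keywords or not domain_dicts:
        return fallback()
    # Assumption: hit information = share of keywords present in the dictionary.
    hit_rates = {
        domain: len(keywords & vocabulary) / len(keywords)
        for domain, vocabulary in domain_dicts.items()
    }
    best = max(hit_rates, key=lambda domain: hit_rates[domain])
    if hit_rates[best] >= threshold:
        return best, hit_rates[best]  # confidence follows the hit information
    return fallback()  # below threshold: defer to the recognition model
```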
[0020] In a second aspect, some embodiments of this disclosure provide a speech recognition system for text matching, the system comprising: a data processing device configured to perform the method described in any implementation of the first aspect; and a display device configured to perform the following steps: sending collected audio data to the data processing device; in response to receiving recognized text information sent by the data processing device, performing text matching between the recognized text information and the target text data to obtain matching position information and a text matching result; determining screen position information corresponding to the matching position information; and highlighting the text matching result according to the screen position information.
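The display-side steps above can be illustrated with a short sketch. It assumes that text matching means finding the longest span shared by the recognized text and the script from the current cursor onward, and that the screen lays text out in fixed-width lines; the names `match_position` and `screen_position` and the `min_len` parameter are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def match_position(recognized: str, target: str, cursor: int = 0,
                   min_len: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets in `target` of the best matching span."""
    matcher = SequenceMatcher(None, recognized, target, autojunk=False)
    # Search only from the current reading position onward.
    m = matcher.find_longest_match(0, len(recognized), cursor, len(target))
    if m.size < min_len:
        return None  # too short to be a reliable match
    return m.b, m.b + m.size


def screen_position(offset: int, chars_per_line: int) -> Tuple[int, int]:
    """Map a character offset to (line, column), assuming fixed-width lines."""
    return divmod(offset, chars_per_line)
```

The returned offsets identify the span to highlight, and `screen_position` gives the line to scroll to under the fixed-width assumption.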
[0021] In a third aspect, some embodiments of this disclosure provide an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first aspect above.
[0022] In a fourth aspect, some embodiments of this disclosure provide a computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method described in any implementation of the first aspect above.
[0023] In a fifth aspect, some embodiments of this disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the method described in any implementation of the first aspect above.
[0024] The above-described embodiments of this disclosure have the following beneficial effects: the speech recognition method for text matching in some embodiments of this disclosure can improve the recognition accuracy of professional terms and hot words in a text in offline or weak-network environments, thereby improving the automatic prompting effect and making the method suitable for professional prompting scenarios. Specifically, the prompting effect is poor, offline or weak-network environments are poorly supported, and recognition accuracy is low for the following reasons: manual control requires manual intervention, which disrupts the fluency of delivery, and requires the speaker to look at a fixed teleprompter position, which detracts from the presentation; cloud-based solutions are susceptible to delays or interruptions caused by network fluctuations, and in some cases cannot be used at all because of privacy protection requirements; general offline automatic speech recognition struggles to cover key hot words such as professional terms, industry terms, names of people and places, and brand names in a speech, resulting in many omissions or misrecognitions.
Based on this, the speech recognition method for text matching in some embodiments of this disclosure first receives audio data corresponding to the target text data sent by a communicatively connected display device. Thus, the audio data corresponding to the current text can be collected in real time through the display device. Then, feature extraction is performed on the audio data to obtain audio feature information. This facilitates audio recognition. Next, based on the domain type corresponding to the target text data and the audio feature information, acoustic probability information corresponding to the audio data is generated. Thus, acoustic probability inference can be performed using the extracted audio features and the domain type corresponding to the text. Next, based on the domain type and the acoustic probability information, domain probability information corresponding to the audio data is generated. Thus, domain probability inference can be performed using the acoustic probability information and the domain type corresponding to the text. Next, based on the bias information corresponding to the target text data, the acoustic probability information, and the domain probability information, a candidate sentence list is generated. Thus, candidate sentences corresponding to the audio data can be selected by combining pre-constructed prior knowledge, acoustic probability inference results, and domain probability inference results. Next, based on the candidate sentence list, recognized text information is generated. Thus, the final recognized text can be determined. Finally, the recognized text information is sent to the display device, enabling the display device to perform text matching between the recognized text information and the target text data and to highlight the text matching result. Thus, the display device can automatically highlight the recognized text without manual control. Because audio data is not recognized in the cloud, the method is suitable for offline or weak-network environments. Furthermore, when the audio data is recognized, both the acoustic probability inference results and the domain probability inference results are determined based on the domain type corresponding to the text. This aligns the inference results more closely with the characteristics of the domain, thereby improving the accuracy of recognizing professional terms and hot words in texts in offline or weak-network environments, improving the automatic prompting effect, and making the method suitable for professional prompting scenarios.
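One common way to combine an acoustic score, a language-model score, and a hot-word bias is log-linear fusion, sketched below. This is an assumption offered for illustration: the disclosure does not state the combination formula, and the names `fuse_scores`, `lm_weight`, and `bias_bonus` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def fuse_scores(acoustic: Dict[str, float],
                language: Dict[str, float],
                hot_words: Set[str],
                lm_weight: float = 0.3,
                bias_bonus: float = 1.0) -> List[Tuple[str, float]]:
    """Rank units by log p_acoustic + lm_weight * log p_lm (+ hot-word bonus)."""
    floor = 1e-8  # avoids log(0) for units missing from either distribution
    scored = []
    for unit, p_acoustic in acoustic.items():
        p_lm = language.get(unit, floor)
        score = math.log(max(p_acoustic, floor)) + lm_weight * math.log(max(p_lm, floor))
        if unit in hot_words:
            score += bias_bonus  # reward units drawn from the script's keywords
        scored.append((unit, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In this sketch, a hot word such as "stent" can overtake an acoustically likelier "spent" once the bonus is applied, which is the kind of behavior the bias information is meant to produce.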
Attached Figure Description
[0025] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and elements are not necessarily drawn to scale.
[0026] Figure 1 is an architecture diagram of an exemplary system to which some embodiments of this disclosure can be applied;
[0027] Figure 2 is a flowchart of some embodiments of the speech recognition method for text matching according to the present disclosure;
[0028] Figure 3 is a flowchart of other embodiments of the speech recognition method for text matching according to the present disclosure;
[0029] Figure 4 is a schematic diagram of the structure of some embodiments of a speech recognition system for text matching according to the present disclosure;
[0030] Figure 5 is a schematic diagram of the structure of an electronic device suitable for implementing some embodiments of the present disclosure.
Detailed Implementation
[0031] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0032] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.
[0033] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0034] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0035] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0036] Before performing any of the operations involving the collection, storage, or use of user personal information (such as target text data and audio data) disclosed in this disclosure, the relevant organizations or individuals shall fulfill their obligations, including conducting personal information security impact assessments, informing personal information subjects, and obtaining prior authorization and consent from personal information subjects.
[0037] This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.
[0038] Figure 1 shows an exemplary system architecture 100 to which the speech recognition method or system for text matching of some embodiments of the present disclosure can be applied.
[0039] As shown in Figure 1, system architecture 100 may include terminal device 101, network 102, display device 103, network 104, and server 105. Network 104 serves as a medium that provides a communication link between terminal device 101 and server 105. Network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables. Network 102 serves as a medium that provides a communication link between terminal device 101 and display device 103. Network 102 may include various connection types, such as wired or wireless communication links (e.g., Bluetooth, Wi-Fi Direct, LAN Socket, etc.).
[0040] Users can use terminal device 101 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal device 101, such as web browser applications, word generator applications, search applications, instant messaging tools, email clients, social media platform software, etc.
[0041] Terminal device 101 can be either hardware or software. When terminal device 101 is hardware, it can be various electronic devices with computing capabilities, including but not limited to smartphones, tablets, portable computers, e-book readers, laptops, and desktop computers. When terminal device 101 is software, it can be installed in the electronic devices listed above. It can be implemented as, for example, multiple software programs or software modules used to provide distributed services, or as a single software program or software module. No specific limitations are made here.
[0042] Display device 103 can be various electronic devices with a display screen, microphone and support for caption display and audio data acquisition, including but not limited to head-mounted display devices, smart display screens, watches, bracelets, etc.
[0043] Server 105 can be a server that provides various services, such as a backend server that sends information (e.g., acoustic models or language models corresponding to a specific domain) to terminal device 101. The backend server can analyze and process the received requests and other data, and then feed the processing results back to the terminal device.
[0044] It should be noted that the speech recognition method for text matching provided in the embodiments of this disclosure can be executed by the terminal device 101.
[0045] It should be noted that a server can be either hardware or software. When the server is hardware, it can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When the server is software, it can be implemented as multiple software programs or software modules used to provide distributed services, or as a single software program or software module. No specific limitations are made here.
[0046] It should be understood that the numbers of terminal devices, display devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, display devices, networks, and servers can be included.
[0047] With continued reference to Figure 2, a flow 200 of some embodiments of the speech recognition method for text matching according to the present disclosure is shown. The speech recognition method for text matching includes the following steps:
[0048] Step 201: Receive audio data corresponding to the target text data sent by the display device connected in the communication connection.
[0049] In some embodiments, the execution entity of the speech recognition method for text matching (e.g., the terminal device shown in Figure 1) can receive, via a wired or wireless connection, audio data corresponding to the target text data sent by the communicatively connected display device. The display device can be a device capable of displaying captions. Such display devices may include, but are not limited to: head-mounted displays, smart display screens, watches, and wristbands. Head-mounted displays may include, but are not limited to: AR glasses and MR glasses. The target text data can be the underlying text data used for caption display. Target text data may include, but is not limited to: speech drafts, product introductions, and tour guide scripts. The audio data corresponding to the target text data can be audio associated with the speech or narration. It should be noted that the wireless connection can be a local wireless connection. Local wireless connections may include, but are not limited to, Bluetooth, Wi-Fi Direct, and LAN Socket.
[0050] Step 202: Extract features from the audio data to obtain audio feature information.
[0051] In some embodiments, the execution entity may perform feature extraction on the audio data to obtain audio feature information. In practice, an audio feature extraction algorithm can be used to extract audio feature information from the audio data. For example, the audio feature extraction algorithm can be Mel-Filterbank.
[0052] In some optional implementations of certain embodiments, the aforementioned execution entity may extract features from the aforementioned audio data through the following steps to obtain audio feature information:
[0053] The first step is to perform frame segmentation on the audio data to obtain an audio frame sequence. In practice, the audio data can be segmented into frames according to a preset duration to obtain the audio frame sequence. For example, the preset duration can be 10ms.
[0054] The second step involves extracting features from each audio frame in the aforementioned audio frame sequence to obtain audio frame feature information. In practice, audio feature extraction algorithms can be used to extract audio feature information from the audio data. For example, the Mel-Filterbank algorithm can be used. The dimension of the audio frame feature information can be 80.
[0055] The third step involves superimposing the feature information of each audio frame within a preset window to obtain the feature tensor of each audio frame as the audio feature information. For example, the preset window can be a preset number of frames before and after the current audio frame. The preset number could be 4. In practice, for each preset context window, the feature information of each audio frame within the preset context window can be superimposed to obtain the audio frame feature tensor. Thus, the feature tensors of each audio frame can be used as the audio feature information.
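The framing and context-window steps above can be sketched in plain Python as follows. The sketch makes several assumptions that are not stated in this disclosure: a 16 kHz sample rate, non-overlapping frames, "superimposing" interpreted as concatenation, and edge frames padded by repetition. The per-frame feature extractor itself (for example, an 80-dimensional Mel filterbank) is left out, and the function names are hypothetical.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[float], sample_rate: int = 16000,
                 frame_ms: int = 10) -> List[Sequence[float]]:
    """Cut the signal into consecutive frames of `frame_ms` milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def stack_context(features: List[List[float]], context: int = 4) -> List[List[float]]:
    """Concatenate each frame's features with `context` frames on each side."""
    stacked = []
    last = len(features) - 1
    for t in range(len(features)):
        window: List[float] = []
        for k in range(t - context, t + context + 1):
            # Assumption: repeat the first/last frame beyond the edges.
            window.extend(features[min(max(k, 0), last)])
        stacked.append(window)
    return stacked
```

With 80-dimensional frame features and a window of 4 frames before and after, each stacked vector would have 9 × 80 = 720 values under these assumptions.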
[0056] Optionally, the aforementioned execution entity can identify keywords corresponding to the target text data to obtain a keyword set. In practice, the target text data can first be cleaned to obtain cleaned text data. Text cleaning can include, but is not limited to, at least one of the following: removing control characters, special punctuation, and HTML tags; and unifying consecutive spaces and line breaks into periods or spaces. Then, a keyword extraction algorithm can be used to extract keywords from the cleaned text data to obtain a keyword set. For example, the keyword extraction algorithm can be a TF-IDF-based keyword extraction algorithm.
[0057] Next, based on the above set of keywords, the domain type corresponding to the target text data can be identified.
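The text cleaning mentioned above can be sketched with the standard `re` module. The set of "special punctuation" characters below is a hypothetical example, and collapsing whitespace runs to a single space is only one of the two options the text allows (periods or spaces).

```python
import re

# Hypothetical examples of decorative symbols treated as special punctuation.
SPECIAL_PUNCTUATION = "★●◆■□▲※"


def clean_text(raw: str) -> str:
    """Strip HTML tags, control characters, and decorative symbols."""
    text = re.sub(r"<[^>]+>", " ", raw)  # HTML tags become spaces
    # Control characters other than tab, line feed, and carriage return.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    text = re.sub("[" + re.escape(SPECIAL_PUNCTUATION) + "]", "", text)
    # Unify runs of spaces and line breaks into a single space.
    return re.sub(r"\s+", " ", text).strip()
```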
[0058] In some optional implementations of certain embodiments, the execution entity may identify keywords corresponding to the target text data through the following steps to obtain a keyword set:
[0059] The first step is to extract plain text from the target text data to obtain the target text. In practice, third-party libraries can be used to extract plain text from the target text data. For example, third-party libraries could be python-docx or pdfminer.
[0060] The second step is to perform text cleaning on the target text to obtain cleaned text. Text cleaning may include, but is not limited to, at least one of the following: removing control characters, special punctuation, and HTML tags; and replacing consecutive spaces and line breaks with periods or spaces.
[0061] The third step is to segment the cleaned text into sentences, obtaining a list of sentences. In practice, the cleaned text can be segmented according to various preset punctuation marks to obtain the sentence list. These preset punctuation marks may include, but are not limited to: colon, period, exclamation mark, question mark, and semicolon.
[0062] The fourth step is to perform the following sub-steps for each sentence in the sentence list:
[0063] The first sub-step is to segment the sentence into words to obtain a word list corresponding to the sentence. In practice, the jieba word segmentation tool can be used to segment the sentence and obtain the corresponding word list.
[0064] The second sub-step is to perform entity recognition on the sentence to obtain entity recognition results. In practice, a named entity recognition algorithm can be used to perform entity recognition on the sentence. The entity recognition results may include the words that represent each entity. Each entity may include, but is not limited to, at least one of the following: a person's name, a place name, an organization name, and a professional term.
[0065] The third sub-step is to remove words that meet a preset condition from the word list to update the word list. The preset condition can be that the word is a stop word.
[0066] The fourth sub-step involves performing the following steps for each word in the updated word segmentation list:
[0067] First, determine the word frequency of the word segment in the target text data. The word frequency can be the number of times the word segment appears in the target text data.
[0068] Second, based on the word frequency, generate a word frequency score of the word segment for the target text data. The word frequency score can be the TF-IDF value of the word segment in the target text data. In practice, this can be achieved by first...
[0069] Third, determine the connectivity score of the aforementioned word segments. The connectivity score can be the PageRank score of the aforementioned word segment in the word graph corresponding to the aforementioned target text data. The aforementioned word graph can be a word graph constructed using the aforementioned target text data through the TextRank algorithm. Each word in the aforementioned target text data can be treated as a node, and edges are established between words co-occurring within a sliding window (e.g., 5 words). Then, PageRank iteration scoring is performed on the graph to obtain the "connectivity importance" of each word. The aforementioned word graph can be pre-constructed for the aforementioned target text data.
[0070] Fourth, based on the aforementioned word frequency scores and connectivity scores, a word segmentation score is generated. In practice, the normalized word frequency scores and connectivity scores can be weighted to obtain the word segmentation score. The weighting coefficients can be preset.
[0071] The fifth step involves selecting a predetermined number of segmented words from the obtained word segmentation lists that meet a preset score requirement as keywords, along with the obtained entity recognition results as supplementary keywords, to obtain a keyword set. The preset score requirement can be that the segmented word's score is among the top predetermined number of segmented words. For example, the preset number could be 200. Therefore, the top predetermined number of segmented words can be selected as keywords based on their segmentation scores, from highest to lowest.
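As an illustration only, the scoring and selection described in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the input is assumed to be already segmented (for example by jieba), raw term frequency stands in for the TF-IDF value, the co-occurrence graph is built per sentence, and the names `pagerank`, `normalize`, `score_keywords`, and `tf_weight` are hypothetical.

```python
from collections import Counter


def pagerank(graph, damping=0.85, iterations=30):
    """Plain power-iteration PageRank on an undirected graph {node: set_of_neighbours}."""
    if not graph:
        return {}
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nb] / len(graph[nb]) for nb in neighbours)
            for node, neighbours in graph.items()
        }
    return rank


def normalize(scores):
    """Min-max normalise a {word: score} mapping into [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {word: 1.0 for word in scores}
    return {word: (s - low) / (high - low) for word, s in scores.items()}


def score_keywords(sentences, stop_words=frozenset(), window=5, tf_weight=0.5, top_n=200):
    """sentences: list of token lists. Returns the top_n words by combined score."""
    kept = [[w for w in sent if w not in stop_words] for sent in sentences]
    freq = Counter(w for sent in kept for w in sent)
    # Word graph: connect words that co-occur inside the sliding window.
    graph = {w: set() for w in freq}
    for sent in kept:
        for i, w in enumerate(sent):
            for other in sent[i + 1:i + window]:
                if other != w:
                    graph[w].add(other)
                    graph[other].add(w)
    tf = normalize(dict(freq))          # assumption: raw frequency instead of TF-IDF
    conn = normalize(pagerank(graph))   # connectivity score
    combined = {w: tf_weight * tf[w] + (1 - tf_weight) * conn[w] for w in freq}
    return sorted(combined, key=combined.get, reverse=True)[:top_n]
```

The entity recognition results would then be appended to the returned list as supplementary keywords; that step is omitted here because it depends on the named entity recognizer in use.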
[0072] In some optional implementations of certain embodiments, the execution entity may identify the domain type corresponding to the target text data based on the aforementioned keyword set through the following steps:
[0073] The first step is to determine the hit information for each domain dictionary in the predefined domain dictionary set, corresponding to the keyword set. The domain dictionary set can be domain dictionaries for various domains. Each domain dictionary corresponds to a domain. Domain dictionaries can include common terms from the corresponding domain. In practice, the number of keywords in the keyword set that are identical to terms included in the domain dictionaries can be determined. This number can then be defined as the hit count. The hit information can include the hit count and also the hit rate. The hit rate can be the ratio of the hit count to the number of keywords included in the keyword set.
[0074] The second step is to select the domain dictionary with the highest hit rate from the aforementioned domain dictionary set as the candidate domain dictionary. In practice, the domain dictionary with the highest number of hits can be selected from the aforementioned domain dictionary set as the candidate domain dictionary.
[0075] The third step is, in response to determining that the hit information corresponding to the candidate domain dictionary is greater than or equal to a preset threshold, to identify the domain corresponding to the candidate domain dictionary as the domain type corresponding to the target text data. In practice, when the hit rate included in the hit information corresponding to the candidate domain dictionary is greater than or equal to the preset threshold, the domain corresponding to the candidate domain dictionary can be identified as the domain type corresponding to the target text data. The confidence level of the domain type corresponds to the hit information; for example, it can be the hit rate included in the hit information.
[0076] Fourth, in response to the determination that the hit information corresponding to the above candidate domain dictionary is less than the above preset threshold, the domain type and confidence score corresponding to the above target text data are generated based on the above target text data and the pre-trained domain recognition model. The domain recognition model can be a classification model that takes text data as input and outputs the domain type and corresponding confidence score. For example, the domain recognition model can be a general Chinese text classification model trained based on FastText or TextCNN.
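For illustration, the dictionary-based domain identification with a model fallback can be sketched as below. This is a hedged sketch: the function name `identify_domain`, the default threshold of 0.3, and the `classifier` callable (standing in for the pre-trained domain recognition model) are assumptions and do not appear in the disclosure.

```python
def identify_domain(keywords, domain_dicts, text, classifier, threshold=0.3):
    """Return (domain_type, confidence).

    keywords:     iterable of keywords extracted from the target text
    domain_dicts: {domain_name: set_of_terms}
    classifier:   hypothetical callable text -> (domain_type, confidence),
                  used only when no dictionary reaches the threshold
    """
    keyword_set = set(keywords)
    if keyword_set and domain_dicts:
        # Hit rate = number of keywords found in the dictionary / number of keywords.
        hit_rates = {
            domain: len(keyword_set & set(terms)) / len(keyword_set)
            for domain, terms in domain_dicts.items()
        }
        best = max(hit_rates, key=hit_rates.get)
        if hit_rates[best] >= threshold:
            return best, hit_rates[best]
    # Below the threshold: defer to the classification model.
    return classifier(text)
```

Because every dictionary is compared against the same keyword set, ranking by hit rate and ranking by hit count select the same candidate dictionary.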
[0077] Step 203: Generate acoustic probability information corresponding to the audio data based on the domain type corresponding to the target text data and the audio feature information.
[0078] In some embodiments, the execution entity can generate acoustic probability information corresponding to the audio data based on the domain type corresponding to the target text data and the audio feature information. In practice, a pre-stored acoustic model corresponding to the domain type can be determined first. Different domain types may correspond to acoustic models pre-trained with training data specific to that domain type. Then, the audio feature information can be input into the determined acoustic model to obtain the acoustic probability information corresponding to the audio data. The acoustic model can be a lightweight acoustic model. For example, the acoustic model can be a Conformer-Lite or an RNN-Transducer. The acoustic probability information may include the probability distribution of the tokens corresponding to each frame.
[0079] Optionally, the aforementioned audio feature information can be input into a pre-loaded acoustic model corresponding to the aforementioned domain type to obtain the probability of each sub-word unit as acoustic probability information. Each sub-word unit probability corresponds to an audio frame and a sub-word unit.
[0080] Step 204: Generate domain probability information corresponding to the audio data based on the domain type and the acoustic probability information.
[0081] In some embodiments, the execution entity can generate domain probability information corresponding to the audio data based on the domain type and the acoustic probability information. In practice, a pre-stored language model corresponding to the domain type can be determined first. Different domain types may correspond to language models pre-trained with training data specific to that domain type. Then, the acoustic probability information can be input into the determined language model to obtain the domain probability information corresponding to the audio data. For example, the language model can be an n-gram language model. The domain probability information may include the probability distribution of the next token for each frame, that is, the probability of the next token given the current prefix.
[0082] Optionally, each target sub-word unit can be determined based on the acoustic probability information. In practice, when the probability distribution for a frame in the acoustic probability information covers multiple sub-word units, the sub-word unit with the highest probability for that frame can be determined as the target sub-word unit. When the probability distribution for a frame covers only one sub-word unit, that sub-word unit can be determined as the target sub-word unit. Then, the historical prefix and the target sub-word units can be combined to form the current prefix. The historical prefix can be a previously determined prefix; initially, it can be empty. The determined current prefix can be used as the historical prefix in the next iteration. Next, the current prefix can be input into a pre-loaded language model corresponding to the domain type to obtain the probabilities of each sub-word unit as domain probability information. Each sub-word unit probability corresponds to an audio frame and a sub-word unit. Here, a sub-word unit probability can be understood as the probability of the next sub-word unit under the current prefix.
[0083] Optionally, the aforementioned execution entity may perform the following steps:
[0084] The first step is to generate a word lookup tree based on the keyword set mentioned above. The word lookup tree can be a Trie tree. In practice, each keyword in the keyword set can be inserted character by character into an empty Trie tree to obtain the constructed word lookup tree.
[0085] The second step is to train the initial language model based on the aforementioned keyword set, resulting in a trained language model. The initial language model can be the language model to be trained. For example, the initial language model can be a 2-gram or 3-gram model. In practice, the keyword set can be used as training data to train the initial language model, resulting in a trained language model.
[0086] The third step is to generate a bias vector based on the aforementioned keyword set. In practice, for each keyword in the keyword set, a preset empirical value can be determined as the corresponding bias value. Then, each determined bias value can be used to define a bias vector.
[0087] The fourth step involves defining the word lookup tree, the trained language model, and the bias vector as bias information. The word lookup tree can be stored in a first format, the language model in a second format, and the bias vector in a third format. The first format can be JSON. The second format can be ARPA or FST. The third format can be npy format.
[0088] Alternatively, the bias information can be received from the server.
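As a rough illustration of the word lookup tree and bias vector, the sketch below builds a character-level Trie as nested dictionaries (which serializes directly to JSON) and assigns one preset bias value per keyword. The names `build_trie`, `lookup`, `END`, and the bias value 2.0 are assumptions; the bias vector is kept as a plain list here rather than an npy file, and language model training is omitted.

```python
import json

END = "<end>"  # multi-character marker, so it cannot collide with a single character key


def build_trie(keywords):
    """Insert each keyword character by character into an initially empty Trie.

    The end marker stores the keyword's index into the bias vector.
    """
    root = {}
    for index, word in enumerate(keywords):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = index
    return root


def lookup(trie, word):
    """Return the bias-vector index of `word`, or None if it is not a keyword."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(END)


def build_bias_vector(keywords, bias_value=2.0):
    """One preset empirical value per keyword (the value itself is hypothetical)."""
    return [bias_value for _ in keywords]
```

With this layout, `json.dumps(build_trie(keywords))` gives the first-format (JSON) representation mentioned above, and the integer stored at the end marker ties each Trie match to its entry in the bias vector.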
[0089] Step 205: Generate a candidate sentence list based on the bias information corresponding to the target text data, the acoustic probability information, and the domain probability information.
[0090] In some embodiments, the execution entity can generate a candidate sentence list based on the bias information corresponding to the target text data, the acoustic probability information, and the domain probability information. The bias information can be pre-constructed prior information. The bias information may include, but is not limited to, a keyword set. The keyword set can be various keywords extracted from the target text data. When the prefix corresponding to a sub-word unit matches a keyword in the keyword set, a weight can be added to that sub-word unit. For example, the added weight can be a preset value.
[0091] In practice, the aforementioned execution entity can generate a candidate sentence list based on the bias information, acoustic probability information, and domain probability information corresponding to the aforementioned target text data through the following steps:
[0092] The first step is to perform the following steps for each audio frame in each of the above audio frame sequences:
[0093] The first sub-step involves determining the candidate path list. This list includes paths, each containing a prefix text and candidate values. Here, the path can be a search path. The prefix text can be pre-determined text. The candidate values can be scores determined for the prefix text. On the first execution, the candidate path list may include an initialized path. For example, this path could be (prefix=[<sos>], score=0.0). Here, prefix can represent the prefix text, <sos> can represent the start of a sentence, and score can represent the candidate value corresponding to the prefix.
[0094] The second sub-step involves extracting the probabilities of each sub-word unit corresponding to the audio frame from the acoustic probability information, forming a sub-word unit probability set. This set characterizes the token probability distribution of the audio frame. In practice, the probabilities of the sub-word units corresponding to the audio frame can be sorted in descending order and only the Top-K retained as the sub-word unit probability set. This reduces the computational load.
[0095] The third sub-step involves performing the following steps for each path in the candidate path list:
[0096] For each sub-word unit corresponding to the probability of a sub-word unit in the above sub-word unit probability set, perform the following steps:
[0097] First, based on the above sub-word unit probability, the acoustic model score is generated. In practice, the logarithm of the sub-word unit probability can be used as the acoustic model score. Here, the logarithm can be the natural logarithm (base e).
[0098] Second, the language model score is generated based on the sub-word unit probabilities corresponding to the aforementioned sub-word units in the domain probability information. In practice, the logarithm of the sub-word unit probabilities corresponding to the aforementioned sub-word units in the domain probability information can be used as the language model score. Here, the logarithm can be based on the natural constant e.
[0099] Third, based on the aforementioned sub-word units and the bias vector and word lookup tree included in the aforementioned bias information, a bias score is generated. In practice, it can be determined whether the current prefix text corresponding to the aforementioned sub-word unit matches the keyword in the aforementioned word lookup tree. Then, in response to the confirmation of a match, the bias value in the aforementioned bias vector corresponding to the matched keyword can be determined as the bias score.
[0100] Fourth, based on the candidate value included in the above path, the acoustic model score, the language model score, and the bias score, an updated candidate value is generated. In practice, the updated candidate value can be determined as the sum of the candidate value included in the path, the acoustic model score, the product of the language model score and a preset interpolation coefficient, and the product of the bias score and a preset bias coefficient. The preset interpolation coefficient can be a value between 0 and 1 and controls the weight of the language model in the score. The preset bias coefficient can be a non-negative number and controls the degree to which the bias enhances the decoding path.
[0101] Fifth, concatenate the prefix text included in the above path and the above sub-word units to form the updated prefix text. Here, the concatenation can be string concatenation.
[0102] Sixth, add the updated prefix text and the updated candidate value as paths to the candidate path list to update the candidate path list.
[0103] The fourth sub-step involves sorting the paths included in the updated candidate path list, and using the first preset number of paths as the updated candidate path list, and then using them as the candidate path list for the next audio frame.
[0104] The second step involves selecting prefix texts whose candidate values satisfy a preset candidate condition from each updated candidate path list, thus obtaining a candidate sentence list. The preset candidate condition can be that the candidate value ranks within the top preset number of candidate values. For example, the preset number could be 8. Paths whose candidate values rank below the preset number can be pruned to control time and memory overhead. Each candidate sentence in the above candidate sentence list corresponds to a candidate value. Therefore, by utilizing the probability distributions output by the acoustic model and the language model of the corresponding domain, as well as pre-constructed bias information, candidate sentences can be comprehensively screened. This allows for the fusion of multi-dimensional information for final decoding, improving the accuracy of the determined candidate sentence list.
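The frame-by-frame search described above can be sketched as a small beam search. This is an illustrative sketch under several assumptions: the score update is read as a sum of log-domain terms (candidate value + acoustic score + interpolation coefficient × language model score + bias coefficient × bias score); the keyword match is done with a simple suffix check on a `{keyword: bias}` dictionary instead of a Trie; one sub-word unit is emitted per frame, with no blank or repeat handling; and `lm_prob` is a hypothetical callable standing in for the domain language model.

```python
import math


def beam_search(frames, lm_prob, keyword_bias, beam_size=8, top_k=5,
                lm_weight=0.3, bias_weight=1.0):
    """frames:       list of {sub_word_unit: acoustic probability}, one dict per audio frame
    lm_prob:      callable (prefix_tuple, unit) -> probability of unit given the prefix
    keyword_bias: {keyword: bias value}
    Returns a list of (sentence, candidate value), best first.
    """
    beams = [(("<sos>",), 0.0)]  # initial path: start-of-sentence prefix, score 0.0
    for frame in frames:
        # Keep only the Top-K units of this frame to limit computation.
        top_units = sorted(frame.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        candidates = []
        for prefix, score in beams:
            for unit, p_acoustic in top_units:
                am_score = math.log(max(p_acoustic, 1e-12))
                lm_score = math.log(max(lm_prob(prefix, unit), 1e-12))
                text = "".join(prefix[1:]) + unit
                # Bias score: largest bias among keywords that the new text ends with.
                bias = max((b for kw, b in keyword_bias.items() if text.endswith(kw)),
                           default=0.0)
                new_score = score + am_score + lm_weight * lm_score + bias_weight * bias
                candidates.append((prefix + (unit,), new_score))
        # Sort and prune to the first beam_size paths for the next frame.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return [("".join(prefix[1:]), score) for prefix, score in beams]
```

In the small example used to check this sketch, a keyword with a positive bias value overtakes an acoustically more likely path, which is the intended effect of the bias term.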
[0105] Step 206: Generate recognition text information based on the candidate sentence list.
[0106] In some embodiments, the executing entity can generate recognition text information based on the candidate sentence list. In practice, the executing entity can determine the candidate sentence with the highest candidate value in the candidate sentence list as the recognition text information. Alternatively, it can determine the candidate sentence with the highest candidate value in the candidate sentence list, along with the start and end timestamps of the audio data, as the recognition text information. For example, the candidate sentence with the highest candidate value could be "Today we will talk about the basic principles of artificial intelligence," the start timestamp of the audio data could be "start_time": 0.0, and the end timestamp could be "end_time": 2.3.
[0107] In some optional implementations of certain embodiments, the aforementioned execution entity may generate recognition text information based on the aforementioned candidate sentence list through the following steps:
[0108] The first step is to optimize the above candidate sentence list to obtain an optimized candidate sentence list.
[0109] The second step is to generate the recognized text information based on the optimized candidate sentence list. In practice, the executing entity can determine the candidate sentence with the highest candidate value in the optimized candidate sentence list as the recognized text information.
[0110] In some optional implementations of certain embodiments, the aforementioned execution entity may optimize the candidate sentence list by performing the following steps to obtain an optimized candidate sentence list:
[0111] The first step is to determine the recognition location information corresponding to the target text data at the current moment. The recognition location information can be the index of the paragraph that was successfully matched last time.
[0112] The second step is to extract context information from the target text data based on the recognition location information. In practice, a context window can be determined based on the recognition location information. For example, the difference between the recognition location information and a preset number of paragraphs, or the sum of the recognition location information and a preset number of paragraphs, can be used to determine the boundary of the context window. Then, the text within the context window can be extracted from the target text data as context information.
[0113] The third step is to perform the following steps for each candidate sentence in the candidate sentence list:
[0114] The first sub-step is to determine the matching length of the candidate sentence in the context information. Here, the matching length can be the length of the longest common subsequence (LCS) between the candidate sentence and the context information.
[0115] The second sub-step is to determine the matching value as the ratio of the above matching length to the string length of the above context information.
[0116] The third sub-step involves generating a composite value for the candidate sentences based on the matching values and their corresponding candidate values. In practice, the composite value can be determined by a weighted sum of the matching values and their corresponding candidate values. The weighting coefficients can be pre-set. For example, to emphasize contextual consistency, the weighting coefficient for the matching value can be set larger. As an example, the weighting coefficient for the matching value could be 0.7. Similarly, to place greater trust in the candidate values, the weighting coefficient for the candidate values can be set larger. As an example, the weighting coefficient for the candidate values could be 0.8.
[0117] The fourth step is to reorder the candidate sentences based on their composite values to obtain the optimized candidate sentence list. In practice, the candidate sentences can be reordered in descending order of their composite values. Each candidate sentence in the optimized candidate sentence list can then use its composite value as its candidate value. Therefore, the reordered candidate sentence list can be used to select the candidate sentence that best fits the context.
[0118] Optionally, the aforementioned execution entity may also perform the following steps:
[0119] The first step is to select the candidate sentence with the highest composite value from the optimized candidate sentence list.
[0120] The second step is to perform word segmentation on the selected candidate sentence to obtain a word segmentation set.
[0121] Third, for each word in the above word segmentation set, perform the following steps:
[0122] The first sub-step is to determine the pinyin corresponding to the above word segmentation. In practice, a pinyin library can be called to obtain the pinyin corresponding to the above word segmentation. For example, the pinyin library could be pypinyin.
[0123] The second sub-step involves determining whether a word with a corresponding pinyin exists in the pre-constructed word pinyin mapping information corresponding to the aforementioned target text data. This word pinyin mapping information includes the words in the target text data and their corresponding pinyin.
[0124] The third sub-step, in response to the determination of existence, is to determine the word corresponding to the pinyin of the word in the above word pinyin mapping information.
[0125] The fourth sub-step is to determine whether the above word segmentation is the same as the above word.
[0126] The fifth sub-step, in response to determining that the word segment and the word are different, is to determine whether the word exists in the text adjacent to the candidate sentence in the target text data.
[0127] The sixth sub-step, in response to determining that the aforementioned word exists in the adjacent text, replaces the aforementioned word segment in the aforementioned candidate sentence with the aforementioned word to optimize the aforementioned candidate sentence. Thus, it is possible to strictly match candidate words with homophones that have completely identical pinyin in the word pinyin mapping information, and replace them after contextual verification.
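The homophone replacement above can be illustrated with the sketch below. It assumes a `to_pinyin` callable that maps a word to a pinyin string (in practice this might wrap a pinyin library such as the pypinyin library mentioned above; the sketch does not call it), a mapping from pinyin to script words built in advance, and the adjacent script text passed in as a plain string. All names are hypothetical.

```python
def fix_homophones(tokens, to_pinyin, script_pinyin_map, nearby_text):
    """Replace a token with the script word that has exactly the same pinyin,
    but only if that script word actually appears in the adjacent script text.

    tokens:            word segments of the selected candidate sentence
    to_pinyin:         callable word -> pinyin string
    script_pinyin_map: {pinyin string: word from the target text}
    nearby_text:       script text adjacent to the current position
    """
    fixed = []
    for token in tokens:
        script_word = script_pinyin_map.get(to_pinyin(token))
        if script_word is not None and script_word != token and script_word in nearby_text:
            fixed.append(script_word)   # identical pinyin, verified by context
        else:
            fixed.append(token)         # no homophone in the script, or not nearby
    return fixed
```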
[0128] Optionally, for each candidate sentence in the candidate sentence list, the aforementioned execution entity may also perform the following steps:
[0129] The first step is to perform word segmentation on the candidate sentences to obtain a word segmentation set. Here, the candidate sentence list can be the initial list of candidate sentences obtained from the initial screening, or it can be the list of candidate sentences after either of the above optimization processes.
[0130] The second step is to perform the following steps for each word in the above word segmentation set:
[0131] The first sub-step involves identifying the phonemes corresponding to the word segments as word segmentation phonemes. Here, phonemes can be pinyin phonemes.
[0132] The second sub-step involves determining the similarity between the segmented phonemes and each word phoneme in the pre-constructed word phoneme mapping information. This word phoneme mapping information includes words and their corresponding word phonemes. The word phoneme mapping information can be a pre-constructed phoneme library for common words. In practice, for each word phoneme, the edit distance between the segmented phonemes and the word phoneme can be determined. Then, the maximum of the phoneme sequence length of the word phoneme and the phoneme sequence length of the segmented phonemes can be determined. Next, the ratio of the edit distance to this maximum can be determined as the normalized edit distance. Finally, the difference between 1 and the normalized edit distance can be determined as the similarity between the segmented phonemes and the word phoneme.
[0133] The third sub-step involves identifying the target word phonemes among the aforementioned word phonemes whose similarity to the corresponding segmented phonemes meets a preset similarity condition. The preset similarity condition can be that the similarity to the segmented phonemes is the highest.
[0134] Optionally, the phoneme with the smallest edit distance to the segmented phonemes among the aforementioned phonemes can first be identified as the target phoneme. Then, the maximum value between the phoneme sequence length of the target phoneme and the phoneme sequence length of the segmented phonemes can be determined. Next, the ratio of the edit distance between the segmented phoneme and the target phoneme to the maximum value can be determined as the normalized edit distance. Finally, the difference between 1 and the normalized edit distance can be determined as the similarity between the segmented phoneme and the target phoneme.
[0135] The fourth sub-step involves determining the word entry corresponding to the target word phoneme as the target word entry in response to the similarity between the segmented phoneme and the target word phoneme meeting a preset threshold condition. The preset threshold condition can be a similarity greater than or equal to a preset threshold.
[0136] The fifth sub-step involves, in response to determining that the target word entry is different from the word segment, replacing the word segment in the candidate sentence with the target word entry, thereby optimizing the candidate sentence. This allows misrecognized words with similar pronunciations to be corrected.
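The similarity computation in these sub-steps (1 minus the edit distance normalized by the longer phoneme sequence) can be sketched as below. The function names, the lexicon layout `{word: phoneme list}`, and the threshold of 0.8 are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or phoneme lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_similarity(a, b):
    """1 - edit_distance / max(len(a), len(b)); two empty sequences count as identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def best_entry(segment_phonemes, lexicon, threshold=0.8):
    """Return the lexicon word whose phonemes are most similar to the segment,
    or None when even the best similarity is below the threshold."""
    best_word, best_sim = None, -1.0
    for word, phonemes in lexicon.items():
        sim = phoneme_similarity(segment_phonemes, phonemes)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word if best_sim >= threshold else None
```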
[0137] Optionally, for each candidate sentence in the candidate sentence list, the aforementioned execution entity may also perform the following steps:
[0138] The first step is to perform word segmentation on the candidate sentences to obtain a word segmentation set. Here, the candidate sentence list can be the initial list of candidate sentences obtained from the initial screening, or it can be the list of candidate sentences after either of the above optimization processes.
[0139] The second step is to perform the following steps for each word in the above word segmentation set:
[0140] The first sub-step involves matching the segmented words with a pre-built user dictionary to obtain word matching results. The user dictionary can be a user-defined dictionary, which may include entries and their corresponding pinyin or phonemes. In practice, the segmented words can be matched with each pinyin in the user dictionary using fuzzy pinyin matching to obtain the matched pinyin. Then, the entries corresponding to the matched pinyin can be identified as the word matching results.
[0141] The second sub-step involves replacing the aforementioned word segmentation in the candidate sentence with the word matching results. This allows for the use of a user-defined dictionary to correct the word segmentation in the candidate sentence, thereby improving recognition accuracy.
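One possible reading of fuzzy pinyin matching is to collapse commonly confused sounds before comparing, as sketched below. The confusion rules (zh/z, ch/c, sh/s, ng/n), the assumption that pinyin syllables are space-separated, and the names `FUZZY_RULES`, `fuzzy_key`, and `apply_user_dictionary` are illustrative choices, not taken from the disclosure.

```python
# Hypothetical confusion rules: retroflex vs. flat initials, back vs. front nasal finals.
FUZZY_RULES = (("zh", "z"), ("ch", "c"), ("sh", "s"), ("ng", "n"))


def fuzzy_key(pinyin):
    """Collapse confusable sounds so that, e.g., 'ling ban' and 'lin ban' compare equal.
    Assumes syllables are separated by spaces."""
    for source, target in FUZZY_RULES:
        pinyin = pinyin.replace(source, target)
    return pinyin


def apply_user_dictionary(tokens, to_pinyin, user_dict):
    """Replace each token whose fuzzy pinyin matches a user-dictionary entry.

    tokens:    word segments of a candidate sentence
    to_pinyin: callable word -> pinyin string (hypothetical helper)
    user_dict: {entry: pinyin string}
    """
    index = {fuzzy_key(pinyin): entry for entry, pinyin in user_dict.items()}
    return [index.get(fuzzy_key(to_pinyin(token)), token) for token in tokens]
```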
[0142] Step 207: The recognized text information is sent to the display device, so that the display device performs text matching between the recognized text information and the target text data and highlights the text matching result.
[0143] In some embodiments, the executing entity may send the recognized text information to the display device, causing the display device to perform text matching between the recognized text information and the target text data, and to highlight the text matching result. The text matching result may include the matched text in the target text data and its position. The highlighting method may include, but is not limited to, highlighting, magnification, and color-changing display.
[0144] Optionally, the target text data can be sent to the aforementioned display device. Here, the target text data can be the target text data after text cleaning. In practice, the target text data can be pre-divided into segments according to a preset number of bytes. Each segment can be encapsulated as a structured message, and the accompanying metadata may include, but is not limited to: segment number, total number of segments, text content, and checksum (e.g., MD5 or SHA-256). Correspondingly, the display device can receive the segments according to the segment number, temporarily cache them in memory, and perform integrity checks based on the checksum after receiving them. After successful verification, the text content can be merged and written to the local storage space of the display device, and the file name can include the unique ID of the target text data. When the user starts the teleprompter, the corresponding text data can be read from the aforementioned local storage space for display.
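The segmented transfer with checksum verification can be sketched as follows. The message field names, the JSON encoding, the hex encoding of each segment (used so that a multi-byte character split across two segments survives transport), and the 1024-byte default are assumptions for illustration; SHA-256 is one of the checksum options named above.

```python
import hashlib
import json


def split_into_messages(text, text_id, chunk_bytes=1024):
    """Split UTF-8 text into fixed-size byte segments, each wrapped as a JSON message."""
    data = text.encode("utf-8")
    chunks = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)] or [b""]
    return [
        json.dumps({
            "id": text_id,
            "index": index,
            "total": len(chunks),
            "content": chunk.hex(),
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
        for index, chunk in enumerate(chunks)
    ]


def reassemble(messages):
    """Verify every segment's checksum, then merge the segments by segment number."""
    parts, total = {}, None
    for raw in messages:
        message = json.loads(raw)
        chunk = bytes.fromhex(message["content"])
        if hashlib.sha256(chunk).hexdigest() != message["sha256"]:
            raise ValueError(f"checksum mismatch in segment {message['index']}")
        parts[message["index"]] = chunk
        total = message["total"]
    if total is None or len(parts) != total:
        raise ValueError("missing segments")
    return b"".join(parts[i] for i in range(total)).decode("utf-8")
```

Segments may arrive in any order; `reassemble` orders them by segment number before decoding, and the receiving side would then write the merged text to local storage.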
[0145] The above-described embodiments of this disclosure have the following beneficial effects: The speech recognition method for text matching in some embodiments of this disclosure can improve the recognition accuracy of professional terms and hot words in a text in offline or weak network environments, thus improving the automatic prompting effect and making it suitable for professional prompting scenarios. Specifically, the reasons for poor prompting effect, unsuitability for offline or weak network environments, and poor recognition accuracy are as follows: manual control requires manual intervention, which affects the fluency of expression and requires looking at a fixed teleprompter position, affecting the presentation effect; cloud-based solutions are susceptible to delays or interruptions due to network fluctuations, or in some cases, cloud-based recognition solutions cannot be used due to privacy protection requirements; general offline automatic speech recognition is difficult to cover key hot words such as professional terms, industry terms, names of people and places, and brand names in a speech, resulting in many omissions or misrecognitions. Based on this, the speech recognition method for text matching in some embodiments of this disclosure first receives audio data corresponding to the target text data sent by a display device with a communication connection. Thus, the audio data corresponding to the current text can be collected in real time through the display device. Then, feature extraction is performed on the audio data to obtain audio feature information. This facilitates audio recognition. Next, based on the domain type corresponding to the target text data and the aforementioned audio feature information, acoustic probability information corresponding to the aforementioned audio data is generated. Thus, acoustic probability inference can be performed using the extracted audio features and the domain type corresponding to the text. 
Next, based on the aforementioned domain type and the aforementioned acoustic probability information, domain probability information corresponding to the aforementioned audio data is generated. Thus, domain probability inference can be performed using the acoustic probability information and the domain type corresponding to the text. Next, based on the bias information corresponding to the aforementioned target text data, the aforementioned acoustic probability information, and the aforementioned domain probability information, a candidate sentence list is generated. Thus, a comprehensive selection of candidate sentences corresponding to the audio data can be achieved by combining pre-constructed prior knowledge, acoustic probability inference results, and domain probability inference results. Next, based on the aforementioned candidate sentence list, recognition text information is generated. Thus, the final recognition text can be determined. Finally, the aforementioned recognition text information is sent to the aforementioned display device, enabling the display device to perform text matching between the aforementioned recognition text information and the aforementioned target text data and to highlight the text matching results. Thus, the display device can automatically highlight the recognized text without manual intervention. Because cloud-based audio data recognition is not used, the method is suitable for offline or weak network environments. Furthermore, when recognizing audio data, both the acoustic probability inference results and the domain probability inference results are determined based on the domain type corresponding to the text. This makes the inference results more closely aligned with domain characteristics, thereby improving the accuracy of recognizing professional terms and hot words in texts in offline or weak network environments, enhancing the automatic prompting effect, and thus making the method suitable for professional prompting scenarios.
[0146] With further reference to Figure 3, a flow 300 of another embodiment of the speech recognition method for text matching is shown. Flow 300 of the speech recognition method for text matching includes the following steps:
[0147] Step 301: Generate the language type based on the target text data.
[0148] In some embodiments, the execution entity of the speech recognition method for text matching (e.g., the terminal device shown in Figure 1) can generate a language type based on the target text data described above. In practice, a language recognition algorithm can be used to determine the main language of the target text data and thereby obtain the language type. For example, the language recognition algorithm can be an algorithm based on Unicode character statistics and fastText classification. The language type can include, but is not limited to: Chinese, English, and Japanese.
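The Unicode character statistics mentioned in step 301 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `detect_language`, the `KANA_SHARE` threshold, the returned language codes, and the decision rules are all hypothetical, and the fastText classification stage named above is omitted.

```python
from collections import Counter

# Hypothetical threshold: share of kana characters above which the text
# is treated as Japanese. Not taken from the disclosure.
KANA_SHARE = 0.1


def detect_language(text: str) -> str:
    """Guess 'zh', 'ja', 'en', or 'und' from Unicode character counts."""
    counts = Counter()
    for ch in text:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:       # Hiragana and Katakana blocks
            counts["kana"] += 1
        elif 0x4E00 <= code <= 0x9FFF:     # CJK Unified Ideographs block
            counts["han"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    total = sum(counts.values())
    if total == 0:
        return "und"                       # no letters to classify
    # Kana is specific to Japanese, so a modest share is treated as decisive.
    if counts["kana"] / total >= KANA_SHARE:
        return "ja"
    # Otherwise pick whichever of Han or Latin letters dominates.
    return "zh" if counts["han"] >= counts["latin"] else "en"
```

A character-count heuristic like this cannot separate languages that share a script (for example English and French), which is one plausible reason to pair it with a trained classifier.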
[0149] Step 302: Determine the model domain type based on the confidence level corresponding to the domain type.
[0150] In some embodiments, the executing entity can determine the model domain type based on the confidence level corresponding to the domain type. In practice, in response to determining that the confidence level is greater than or equal to a preset confidence level, the domain identifier corresponding to the domain type can be determined as the model domain type. In response to determining that the confidence level is less than a preset confidence level, the identifier representing a general domain type (e.g., general) can be determined as the model domain type.
[0151] Step 303: Determine the model performance type based on the device performance parameter information.
[0152] In some embodiments, the execution entity can determine the model performance type based on device performance parameter information. The device performance parameter information can be device performance-related parameters of the execution entity. The device performance parameter information may include, but is not limited to, available memory. In practice, in response to determining that the available memory is greater than or equal to a preset memory, an identifier representing a complete model (e.g., full) can be determined as the model performance type. In response to determining that the available memory is less than a preset memory, an identifier representing a lightweight model (e.g., lite) can be determined as the model performance type.
[0153] Step 304: Generate model identification information based on language type, model domain type, and model performance type.
[0154] In some embodiments, the executing entity can generate model identification information based on the language type, model domain type, and model performance type. In practice, the language type, model domain type, and model performance type can be combined to form model identification information. For example, the model identification information could be "zh-medical-full". "zh" represents Chinese. "medical" represents the medical field. "full" represents a complete model.
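Steps 302 to 304 can be sketched together as a single selection function. This is an illustrative sketch only: the function name `build_model_id`, the default thresholds `min_confidence` and `min_memory_mb`, and the use of a hyphen as the separator (inferred from the example "zh-medical-full") are assumptions, since the disclosure only refers to a preset confidence level and a preset memory.

```python
def build_model_id(
    language: str,
    domain: str,
    confidence: float,
    available_memory_mb: int,
    min_confidence: float = 0.6,   # hypothetical preset confidence level
    min_memory_mb: int = 2048,     # hypothetical preset memory
) -> str:
    """Combine language, model domain type, and model performance type."""
    # Step 302: fall back to the general domain when confidence is too low.
    model_domain = domain if confidence >= min_confidence else "general"
    # Step 303: use the complete model only when enough memory is available.
    performance = "full" if available_memory_mb >= min_memory_mb else "lite"
    # Step 304: join the three parts into the model identification information.
    return f"{language}-{model_domain}-{performance}"
```

For instance, a Chinese medical script recognized with high confidence on a device with ample memory would map to "zh-medical-full", while a low-confidence domain on a constrained device would map to "zh-general-lite".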
[0155] Step 305: Load the acoustic model and language model corresponding to the model identification information based on the model identification information.
[0156] In some embodiments, the execution entity may load an acoustic model and a language model corresponding to the model identification information based on the model identification information. In practice, the acoustic model and language model corresponding to the model identification information can be obtained from the local machine or a server.
[0157] Step 306: Receive audio data corresponding to the target text data sent by a communicatively connected display device.
[0158] Step 307: Extract features from the audio data to obtain audio feature information.
[0159] Step 308: Input the audio feature information into the pre-loaded acoustic model corresponding to the domain type to obtain the probability of each sub-word unit as the acoustic probability information.
[0160] Step 309: Determine each target sub-word unit based on the acoustic probability information.
[0161] Step 310: Combine the historical prefix and each target sub-word unit into the current prefix.
[0162] Step 311: Input the current prefix into the pre-loaded language model corresponding to the domain type to obtain the probability of each sub-word unit as the domain probability information.
[0163] Step 312: Generate a candidate sentence list based on the bias information corresponding to the target text data, the acoustic probability information, and the domain probability information.
[0164] Step 313: Generate recognized text information based on the candidate sentence list.
[0165] Step 314: Send the recognized text information to the display device, so that the display device performs text matching between the recognized text information and the target text data and highlights the text matching result.
[0166] In some embodiments, for the specific implementation of steps 306-314 and the resulting technical effects, reference may be made to steps 201-207 in the embodiments corresponding to Figure 2; the details are not repeated here.
[0167] As can be seen from Figure 3, compared with the description of some embodiments corresponding to Figure 2, flow 300 of the speech recognition method for text matching in some embodiments corresponding to Figure 3 highlights the steps of loading the corresponding acoustic model and language model according to the domain type. Thus, the schemes described in these embodiments can select appropriate acoustic and language models by jointly considering the language type and domain of the target text data and the device performance, improving recognition accuracy while keeping resource utilization efficient.
[0168] With further reference to Figure 4, as an implementation of the methods shown in the above figures, this disclosure provides some embodiments of a speech recognition system for text matching. These system embodiments correspond to the method embodiments shown in Figure 2.
[0169] As shown in Figure 4, a speech recognition system 400 for text matching in some embodiments includes: a data processing device 401 configured to perform the methods of the embodiments corresponding to Figures 2-3; and a display device 402 configured to perform the following steps: sending the collected audio data to the data processing device; in response to receiving the recognized text information sent by the data processing device, performing text matching between the recognized text information and the target text data to obtain matching position information and a text matching result; determining the screen position information corresponding to the matching position information; and highlighting the text matching result according to the screen position information.
[0170] In some embodiments, in response to determining that the recognized text information completely matches a text in the target text data, the display device may determine that text as the text matching result and determine the paragraph index corresponding to that text as the matching position information. In response to determining that the recognized text information does not completely match any text in the target text data, the display device may, for each paragraph, determine the length of the longest common subsequence between the paragraph and the recognized text information and determine the ratio of that length to the character length of the paragraph as the matching score. The display device may then determine the index of the paragraph with the highest matching score as the matching position information and determine the longest common subsequence corresponding to that paragraph as the text matching result. The matching position information may be the position of the text matching result in the target text data and may include, but is not limited to, the paragraph index. The screen position information may be the on-screen coordinates of the paragraph corresponding to the matching position information; for example, it may include the screen coordinates of the first character of the paragraph. In practice, the current view may first be scrolled to the position corresponding to the screen position information, so that this position is displayed in the center of the current view. For example, a UI animation engine (WebGL or the underlying OpenGL ES) can be used to complete the view movement within a preset duration. The preset duration can be 50-200 ms. Next, the text matching result can be highlighted.
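The paragraph matching described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the function names `longest_common_subsequence` and `match_paragraph` are hypothetical, a complete match is assumed to mean that the recognized text appears verbatim inside a paragraph, and returning `(-1, "")` when nothing matches is an added convention.

```python
from typing import List, Tuple


def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (dynamic programming)."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    chars = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


def match_paragraph(paragraphs: List[str], recognized: str) -> Tuple[int, str]:
    """Return (paragraph index, matched text), or (-1, "") when nothing matches."""
    if not recognized:
        return -1, ""
    # Complete match (assumed here to mean verbatim containment).
    for index, paragraph in enumerate(paragraphs):
        if recognized in paragraph:
            return index, recognized
    # Otherwise score each paragraph by LCS length / paragraph length.
    best_index, best_score, best_text = -1, 0.0, ""
    for index, paragraph in enumerate(paragraphs):
        if not paragraph:
            continue
        common = longest_common_subsequence(paragraph, recognized)
        score = len(common) / len(paragraph)
        if score > best_score:
            best_index, best_score, best_text = index, score, common
    return best_index, best_text
```

The returned index plays the role of the matching position information, and the returned string plays the role of the text matching result that the display device would then highlight.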
[0171] It can be understood that the units described in system 400 correspond to the steps of the method described with reference to Figure 2. Therefore, the operations, features, and beneficial effects described above for the method also apply to system 400 and the units contained therein, and will not be repeated here.
[0172] Referring now to Figure 5, a schematic structural diagram of an electronic device 500 (e.g., the terminal device in Figure 1) suitable for implementing some embodiments of the present disclosure is shown. The electronic device shown in Figure 5 is merely an example and should not be construed as limiting the functionality and scope of the embodiments of this disclosure.
[0173] As shown in Figure 5, the electronic device 500 may include a processing device 501 (e.g., a central processing unit, a graphics processor, etc.), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
[0174] Typically, the following devices can be connected to the I/O interface 505: input devices 506 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; and communication devices 509. The communication device 509 allows the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. Although Figure 5 shows an electronic device 500 with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present. Each block shown in Figure 5 may represent one device or multiple devices as needed.
[0175] In particular, according to some embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, some embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 509, or installed from storage device 508, or installed from ROM 502. When the computer program is executed by processing device 501, it performs the functions defined in the methods of some embodiments of this disclosure.
[0176] It should be noted that, in some embodiments of this disclosure, the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In some embodiments of this disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In some embodiments of this disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0177] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected with digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0178] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device. The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: receive audio data corresponding to target text data sent by a display device with a communication connection; extract features from the audio data to obtain audio feature information; generate acoustic probability information corresponding to the audio data based on the domain type corresponding to the target text data and the audio feature information; generate domain probability information corresponding to the audio data based on the domain type and the acoustic probability information; generate a candidate sentence list based on the bias information corresponding to the target text data, the acoustic probability information, and the domain probability information; generate recognized text information based on the candidate sentence list; and send the recognized text information to the display device, causing the display device to perform text matching between the recognized text information and the target text data and highlight the text matching result.
[0179] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0180] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0181] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.
[0182] Some embodiments of this disclosure also provide a computer program product, including a computer program that, when executed by a processor, implements any of the above-described speech recognition methods for text matching.
[0183] The above description is merely a selection of preferred embodiments of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. Examples include technical solutions formed by replacing the above-described features with technical features that have similar functions and are disclosed in (but not limited to) the embodiments of this disclosure.
Claims
1. A speech recognition method for text matching, comprising: receiving audio data corresponding to target text data sent by a communicatively connected display device; performing feature extraction on the audio data to obtain audio feature information; identifying keywords corresponding to the target text data to obtain a keyword set; identifying, based on the keyword set, the domain type corresponding to the target text data; generating, based on the domain type corresponding to the target text data and the audio feature information, acoustic probability information corresponding to the audio data, wherein the acoustic probability information includes the probability distribution of sub-word units corresponding to each frame; generating, based on the domain type and the acoustic probability information, domain probability information corresponding to the audio data, including: determining each target sub-word unit based on the acoustic probability information; combining the historical prefix and each target sub-word unit into the current prefix; and inputting the current prefix into a pre-loaded language model corresponding to the domain type to obtain the probability of each sub-word unit as the domain probability information, wherein each sub-word unit probability corresponds to an audio frame and a sub-word unit, and the domain probability information includes the probability distribution of the next sub-word unit corresponding to each frame;
generating a word search tree based on the keyword set; training an initial language model based on the keyword set to obtain a trained language model; generating a bias vector based on the keyword set; determining the word search tree, the trained language model, and the bias vector as bias information; generating a candidate sentence list based on the bias information corresponding to the target text data, the acoustic probability information, and the domain probability information, wherein the bias information is pre-constructed prior information; generating recognized text information based on the candidate sentence list; and sending the recognized text information to the display device, so that the display device performs text matching between the recognized text information and the target text data and highlights the text matching result.
2. The method according to claim 1, wherein performing feature extraction on the audio data to obtain audio feature information includes: segmenting the audio data into frames to obtain an audio frame sequence; for each audio frame in the audio frame sequence, performing feature extraction on the audio frame to obtain audio frame feature information; and superimposing, based on a preset window, the audio frame feature information of the audio frames to obtain a feature tensor for each audio frame as the audio feature information.
3. The method according to claim 1, wherein generating acoustic probability information corresponding to the audio data based on the domain type corresponding to the target text data and the audio feature information includes: inputting the audio feature information into a pre-loaded acoustic model corresponding to the domain type to obtain the probability of each sub-word unit as the acoustic probability information, wherein each sub-word unit probability corresponds to an audio frame and a sub-word unit.
4. The method according to claim 1, wherein generating recognized text information based on the candidate sentence list includes: optimizing the candidate sentence list to obtain an optimized candidate sentence list; and generating the recognized text information based on the optimized candidate sentence list.
5. The method according to claim 4, wherein optimizing the candidate sentence list to obtain an optimized candidate sentence list includes: determining the recognition position information corresponding to the target text data at the current moment; extracting context information from the target text data based on the recognition position information; for each candidate sentence in the candidate sentence list, performing the following steps: determining the matching length of the candidate sentence in the context information; determining the ratio of the matching length to the string length of the context information as the matching value; and generating a comprehensive value of the candidate sentence based on the matching value and the candidate value corresponding to the candidate sentence; and reordering the candidate sentences based on the comprehensive value of each candidate sentence in the candidate sentence list to optimize the candidate sentence list.
6. The method according to claim 5, wherein optimizing the candidate sentence list to obtain an optimized candidate sentence list further includes: selecting the candidate sentence with the highest comprehensive value from the optimized candidate sentence list; segmenting the candidate sentence to obtain a word segment set; for each word segment in the word segment set, performing the following steps: determining the pinyin corresponding to the word segment; determining whether the pre-constructed word pinyin mapping information corresponding to the target text data contains a word pinyin that matches the pinyin, wherein the word pinyin mapping information includes the words in the target text data and their corresponding pinyin; in response to determining that such a word pinyin exists, determining the word corresponding to that word pinyin in the word pinyin mapping information; determining whether the word segment is the same as the word; in response to determining that they differ, determining whether the word exists in the text adjacent to the candidate sentence in the target text data; and in response to determining that the word exists in the adjacent text, replacing the word segment in the candidate sentence with the word to optimize the candidate sentence.
7. The method according to claim 4, wherein optimizing the candidate sentence list to obtain an optimized candidate sentence list includes: for each candidate sentence in the candidate sentence list, performing the following steps: segmenting the candidate sentence to obtain a word segment set; for each word segment in the word segment set, performing the following steps: determining the phonemes corresponding to the word segment as word segment phonemes; determining the similarity between the word segment phonemes and each term phoneme in pre-constructed term phoneme mapping information, wherein the term phoneme mapping information includes terms and the term phonemes of the corresponding terms; determining, among the term phonemes, the term phonemes whose similarity to the word segment phonemes meets a preset similarity condition as the target term phonemes; in response to determining that the similarity between the word segment phonemes and the target term phonemes meets a preset threshold condition, determining the term corresponding to the target term phonemes as the target term; and in response to determining that the target term differs from the word segment, replacing the word segment in the candidate sentence with the target term to optimize the candidate sentence.
8. The method according to claim 4, wherein optimizing the candidate sentence list to obtain an optimized candidate sentence list includes: for each candidate sentence in the candidate sentence list, performing the following steps: segmenting the candidate sentence to obtain a word segment set; and for each word segment in the word segment set, performing the following steps: matching the word segment against a pre-built user dictionary to obtain a word matching result; and replacing the word segment in the candidate sentence with the word matching result.
9. The method according to claim 1, wherein identifying keywords corresponding to the target text data to obtain a keyword set includes: extracting plain text from the target text data to obtain target text; cleaning the target text to obtain cleaned text; segmenting the cleaned text into clauses to obtain a clause list; for each clause in the clause list, performing the following steps: segmenting the clause into words to obtain a word segment list corresponding to the clause; performing entity recognition on the clause to obtain an entity recognition result; removing word segments that meet a preset condition from the word segment list to update the word segment list; and for each word segment in the updated word segment list, performing the following steps: determining the word frequency of the word segment in the target text data; generating, based on the word frequency, the word frequency score of the word segment in the target text data; determining the connectivity score of the word segment; and generating a word segment score based on the word frequency score and the connectivity score; and selecting, from the obtained word segment lists, a preset number of word segments whose word segment scores meet a preset score condition as keywords, and using the obtained entity recognition results as supplementary keywords, to obtain the keyword set.
10. The method according to claim 1, wherein identifying the domain type corresponding to the target text data based on the keyword set includes: for each domain dictionary in a preset domain dictionary set, determining the hit information of the keyword set with respect to the domain dictionary; selecting the domain dictionary with the highest hit information from the domain dictionary set as the candidate domain dictionary; in response to determining that the hit information corresponding to the candidate domain dictionary is greater than or equal to a preset threshold, determining the domain corresponding to the candidate domain dictionary as the domain type corresponding to the target text data, wherein the confidence level of the domain type corresponds to the hit information; and in response to determining that the hit information corresponding to the candidate domain dictionary is less than the preset threshold, generating the domain type and confidence level corresponding to the target text data based on the target text data and a pre-trained domain recognition model.
11. A speech recognition system for text matching, comprising: a data processing device configured to perform the method according to any one of claims 1-10; and a display device configured to perform the following steps: sending the collected audio data to the data processing device; in response to receiving the recognized text information sent by the data processing device, performing text matching between the recognized text information and the target text data to obtain matching position information and a text matching result; determining the screen position information corresponding to the matching position information; and highlighting the text matching result based on the screen position information.
12. The speech recognition system for text matching according to claim 11, wherein the data processing device is a mobile phone and the display device is a head-mounted display device.
13. The speech recognition system for text matching according to claim 11, wherein the speech recognition system for text matching further includes a server configured to send at least one of the following to the data processing device: an updated acoustic model, an updated language model, and updated bias information.
14. An electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-10.
15. A computer-readable medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements the method according to any one of claims 1-10.
16. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-10.