Text-based word segmentation method, device, server, storage medium and product

By combining multiple word segmentation tools and language models, and utilizing WFST graph merging and fusion techniques, high-probability word segmentation results for the target text are determined, solving the problem of low accuracy of single word segmentation tools and improving the accuracy of word segmentation and language models.

CN116127967B · Active · Publication Date: 2026-06-12 · SOUNDAI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SOUNDAI TECH CO LTD
Filing Date: 2022-12-30
Publication Date: 2026-06-12

Abstract

The application provides a text-based word segmentation method and device, a server, a storage medium and a product, and belongs to the technical field of speech recognition. The method comprises the following steps: performing word segmentation on a target text by using M word segmentation tools to obtain M first word segmentation sequences, wherein M is an integer greater than 1; determining N second word segmentation sequences and the weights of second words in the N second word segmentation sequences based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, wherein N is an integer greater than M; determining the first probabilities of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and a basic language model; and selecting a target word segmentation sequence corresponding to the target text from the N second word segmentation sequences based on the first probabilities of the N second word segmentation sequences. The application can improve the word segmentation accuracy.

Description

Technical Field

[0001] This application relates to the field of speech recognition technology, and in particular to a text-based word segmentation method, apparatus, server, storage medium, and product.

Background Technology

[0002] In speech recognition systems, the language model is a crucial component, and its performance directly determines the overall performance of the speech recognition system. Training a language model requires first segmenting the text corpus into words, and then training the language model based on the segmentation results. The accuracy of the word segmentation directly impacts the performance of the language model.

[0003] In related technologies, a word segmentation tool is used to segment the text corpus. However, a single word segmentation tool contains relatively few word segmentation algorithms, so segmenting the text corpus with that tool alone yields low accuracy.

Summary of the Invention

[0004] This application provides a text-based word segmentation method, apparatus, server, storage medium, and product, which can improve word segmentation accuracy. The technical solution is as follows:

[0005] In one aspect, a text-based word segmentation method is provided, the method comprising:

[0006] The target text is segmented using M word segmentation tools to obtain M first word segmentation sequences, where M is an integer greater than 1;

[0007] Based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences are determined, where N is an integer greater than M;

[0008] Based on the weights of the second words in the N second word segmentation sequences and the basic language model, the first probability of each of the N second word segmentation sequences is determined;

[0009] Based on the first probability of the N second word segmentation sequences, the target word segmentation sequence corresponding to the target text is selected from the N second word segmentation sequences.

[0010] In one possible implementation, determining the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences, based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, includes:

[0011] Based on the weights corresponding to the M word segmentation tools, the M first word segmentation sequences are converted into M first weighted finite-state transducer (WFST) graphs. The first WFST graph corresponding to a first word segmentation sequence includes that sequence and the weights of the first words in it.

[0012] The M first WFST graphs are weighted and merged to obtain a second WFST graph, which includes the segmentation paths corresponding to the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences.

[0013] In another possible implementation, converting the M first word segmentation sequences into M first weighted finite-state transducer (WFST) graphs based on the weights corresponding to the M word segmentation tools includes:

[0014] For the first word segmentation sequence obtained by any word segmentation tool, a first node path corresponding to its P first words is generated. The first node of the first node path is the start node, and the last node is the end node. P is the number of first words included in the first word segmentation sequence.

[0015] For the i-th first word in the first word segmentation sequence, based on the weight corresponding to the word segmentation tool, the i-th first word and its weight are marked on the connection between the i-th first node and the (i+1)-th first node in the first node path to obtain the first WFST graph, where i is an integer greater than 0 and not greater than P.

[0016] In another possible implementation, the weighted merging of the M first WFST graphs to obtain the second WFST graph includes:

[0017] Based on the M first word segmentation sequences, a second node path corresponding to Q second words is generated, where the Q second words are the union of the first words in the M first word segmentation sequences, the first node of the second node path is the start node, and the last node is the end node.

[0018] Based on the path relationships between the first words in the M first WFST graphs, the Q second words are marked on the connections between the second nodes in the second node path;

[0019] For any second word, the weights of that word in the first word segmentation sequences are summed to obtain the weight of the second word, and this weight is marked on the connection corresponding to the second word to obtain the second WFST graph.
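The weighted merging described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that nodes are identified by character offsets in the target text (the text does not say how nodes from different graphs are aligned), and the toy string "abcd", the three segmentations, the tool weights, and the names `Edge` and `merge_graphs` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (start offset, end offset, word): offsets as node identifiers are an assumption
Edge = Tuple[int, int, str]


def merge_graphs(sequences: List[Tuple[List[str], float]]) -> Dict[Edge, float]:
    """Merge linear segmentation graphs into one lattice.

    Each input item is (words, tool_weight). An edge that occurs in several
    input sequences receives the sum of the weights it carries in each of them.
    """
    merged: Dict[Edge, float] = defaultdict(float)
    for words, tool_weight in sequences:
        pos = 0
        for word in words:
            merged[(pos, pos + len(word), word)] += tool_weight
            pos += len(word)
    return dict(merged)


# Hypothetical toy input: three tools segmenting the string "abcd"
sequences = [
    (["ab", "cd"], 0.5),
    (["a", "b", "cd"], 0.3),
    (["ab", "c", "d"], 0.2),
]
merged = merge_graphs(sequences)
for edge, weight in sorted(merged.items()):
    print(edge, round(weight, 3))
```

In this toy lattice the edge for "ab" is proposed by two tools and so carries the sum of both tool weights, and the path a, b, c, d exists even though no single tool produced it.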

[0020] In another possible implementation, determining the first probability of each of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and the basic language model includes:

[0021] The second WFST graph and the third WFST graph corresponding to the language model are fused to obtain the fourth WFST graph. The second WFST graph is marked with the weights of the second words in the N second word segmentation sequences, the third WFST graph is marked with the probabilities of the words corresponding to the text corpus in the language model, and the fourth WFST graph is marked with the probabilities of the second words included in the N second word segmentation sequences.

[0022] For any second word segmentation sequence, the probabilities that the fourth WFST graph assigns to the second words in that sequence are summed to obtain the first probability of the second word segmentation sequence.
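As a hedged illustration of scoring paths in a fused graph, the sketch below takes the per-arc scores of the fourth graph as given, since the text does not spell out how fusion computes them; the numbers are hypothetical and are treated as log-probability-like scores that add along a path. Instead of scoring each of the N paths separately, it uses a dynamic program over the nodes, a swapped-in technique that returns the same highest-scoring path when arcs always run from a lower-numbered node to a higher-numbered one. The names `Arc` and `best_path` are illustrative.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (source node, destination node, word)


def best_path(fused: Dict[Arc, float], start: int, end: int) -> Tuple[float, List[str]]:
    """Return (score, words) of the path with the largest summed arc score.

    Assumes every arc goes from a lower node number to a higher one, so
    visiting arcs in sorted order finalizes a node before its outgoing arcs.
    """
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    for (src, dst, word), score in sorted(fused.items()):
        if src not in best:
            continue  # src cannot be reached from the start node
        candidate = (best[src][0] + score, best[src][1] + [word])
        if dst not in best or candidate[0] > best[dst][0]:
            best[dst] = candidate
    return best[end]


# Hypothetical fused arc scores for a toy lattice over the string "abcd"
fused = {
    (0, 1, "a"): -0.5,
    (0, 2, "ab"): -1.0,
    (1, 2, "b"): -0.75,
    (2, 3, "c"): -1.0,
    (2, 4, "cd"): -1.5,
    (3, 4, "d"): -1.0,
}
print(best_path(fused, 0, 4))
```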

[0023] In another possible implementation, determining the first probability of each of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and the basic language model includes:

[0024] For any second word segmentation sequence, based on the language model, the second probability of each second word in the second word segmentation sequence is determined;

[0025] Based on the weights of the second words in the second word segmentation sequence, the second probabilities of those second words are weighted and summed to obtain the first probability of the second word segmentation sequence.
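The weighted summation can be written compactly. The sketch below assumes, purely for illustration, that the language model supplies one context-free probability per word (a real n-gram model would condition on preceding words); the word weights, the probabilities, and the name `first_probability` are hypothetical.

```python
from typing import Dict, List


def first_probability(words: List[str],
                      weights: Dict[str, float],
                      lm_prob: Dict[str, float]) -> float:
    """Sum, over the words of one candidate sequence, weight * LM probability."""
    return sum(weights[w] * lm_prob[w] for w in words)


# Hypothetical merged weights and language-model probabilities
weights = {"ab": 0.75, "cd": 0.5, "a": 0.25, "b": 0.25}
lm_prob = {"ab": 0.5, "cd": 0.25, "a": 0.125, "b": 0.125}

print(first_probability(["ab", "cd"], weights, lm_prob))      # 0.5
print(first_probability(["a", "b", "cd"], weights, lm_prob))  # 0.1875
```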

[0026] In another aspect, a text-based word segmentation device is provided, the device comprising:

[0027] The word segmentation module is used to segment the target text using M word segmentation tools to obtain M first word segmentation sequences, where M is an integer greater than 1;

[0028] The first determining module is used to determine N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, where N is an integer greater than M;

[0029] The second determining module is used to determine the first probability of each of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and the basic language model;

[0030] The selection module is used to select the target word segmentation sequence corresponding to the target text from the N second word segmentation sequences based on the first probability of the N second word segmentation sequences.

[0031] In one possible implementation, the first determining module is configured to convert the M first word segmentation sequences into M first weighted finite-state transducer (WFST) graphs based on the weights corresponding to the M word segmentation tools, wherein the first WFST graph corresponding to a first word segmentation sequence includes that sequence and the weights of the first words in it; and to perform weighted merging of the M first WFST graphs to obtain a second WFST graph, wherein the second WFST graph includes the word segmentation paths corresponding to the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences.

[0032] In another possible implementation, the first determining module is used to generate, for the first word segmentation sequence obtained by any word segmentation tool, a first node path corresponding to its P first words, wherein the first node of the first node path is the start node and the last node is the end node, and P is the number of first words included in the first word segmentation sequence; and, for the i-th first word in the first word segmentation sequence, based on the weight corresponding to the word segmentation tool, to mark the i-th first word and its weight on the line between the i-th first node and the (i+1)-th first node in the first node path to obtain the first WFST graph, wherein i is an integer greater than 0 and not greater than P.

[0033] In another possible implementation, the first determining module generates second node paths corresponding to Q second words based on the M first word segmentation sequences, where the Q second words are the union of the M first word segmentation sequences, the first node of the second node path is the start node, and the last node is the end node; based on the path relationships between the first words in the M first WFST graphs, the Q second words are marked on the lines connecting the second nodes in the second node paths; for any second word, the weights of the second word in the first word segmentation sequences are summed to obtain the weight of the second word, and the weights of the second word are marked on the lines corresponding to the second word to obtain the second WFST graph.

[0034] In another possible implementation, the second determining module is used to fuse the second WFST graph and the third WFST graph corresponding to the language model to obtain a fourth WFST graph. The second WFST graph is labeled with the weights of the second segments in the N second segmentation sequences, the third WFST graph is labeled with the probabilities of the segments corresponding to the text corpus in the language model, and the fourth WFST graph is labeled with the probabilities of the second segments included in the N second segmentation sequences. For any second segmentation sequence, the sum of the probabilities of the second segments in the second segmentation sequence in the fourth WFST graph is determined to obtain the first probability of the second segmentation sequence.

[0035] In another possible implementation, the second determining module is used to determine, for any second word segmentation sequence, a second probability of each second word in the second word segmentation sequence based on the language model; and to perform a weighted summation of those second probabilities based on the weights of the second words in the second word segmentation sequence to obtain the first probability of the second word segmentation sequence.

[0036] In another aspect, a server is provided, the server including one or more processors and one or more memories, the one or more memories storing at least one piece of program code, the at least one piece of program code being loaded and executed by the one or more processors to implement the text-based word segmentation method described in any of the above implementations.

[0037] In another aspect, a computer-readable storage medium is provided, wherein at least one piece of program code is stored in the computer-readable storage medium, the at least one piece of program code being loaded and executed by a processor to implement the text-based word segmentation method described in any of the above implementations.

[0038] In another aspect, a computer program product is provided, the computer program product including computer program code stored in a computer-readable storage medium; a processor of a server reads the computer program code from the computer-readable storage medium and executes it, causing the server to perform the text-based word segmentation method described in any of the above implementations.

[0039] In this embodiment of the application, the target text is segmented using multiple word segmentation tools, and then the probability of each segmentation result is determined by a basic language model. The segmentation result with the highest probability is selected as the final segmentation result of the target text, which can improve the accuracy of word segmentation.

Description of the Drawings

[0040] To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The accompanying drawings described below illustrate only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort.

[0041] Figure 1 is a schematic diagram of the implementation environment of a text-based word segmentation method provided in an embodiment of this application;

[0042] Figure 2 is a flowchart of a text-based word segmentation method provided in an embodiment of this application;

[0043] Figure 3 is a flowchart of another text-based word segmentation method provided in an embodiment of this application;

[0044] Figure 4 is a schematic diagram of a first WFST graph provided in an embodiment of this application;

[0045] Figure 5 is a schematic diagram of a second WFST graph provided in an embodiment of this application;

[0046] Figure 6 is a flowchart of another text-based word segmentation method provided in an embodiment of this application;

[0047] Figure 7 is a block diagram of a text-based word segmentation device provided in an embodiment of this application;

[0048] Figure 8 is a structural block diagram of a server provided in an embodiment of this application.

Detailed Implementation

[0049] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0050] It should be noted that all information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application have been authorized by the user or fully authorized by all parties, and the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the target text involved in this application was obtained with full authorization.

[0051] The terms "first," "second," "third," and "fourth," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.

[0052] Figure 1 is a schematic diagram illustrating the implementation environment of a text-based word segmentation method provided in an embodiment of this application. Referring to Figure 1, the implementation environment includes a terminal 101 and a server 102. The server 102 segments the text corpus into words in advance and then trains a language model based on the segmentation results. After training the language model, the server 102 can provide speech recognition services to the terminal 101. The terminal 101 has a target application installed, which is controlled by the server 102, and the terminal 101 can perform functions such as data transmission and information interaction with the server 102 through this target application. For example, the terminal 101 sends a voice signal to the server 102; the server 102 uses the language model to determine the corresponding control command and sends the control command back to the terminal 101, which then executes the operation corresponding to the control command.

[0053] In some embodiments, terminal 101 can be a smartphone, intelligent voice interaction device, tablet computer, laptop computer, in-vehicle terminal, etc., but is not limited to these. The intelligent voice interaction device can be a smart speaker or a smart microphone, etc. Server 102 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. It can also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Server 102 is used to provide backend services for terminal 101.

[0054] Figure 2 is a flowchart of a text-based word segmentation method provided in an embodiment of this application, in which the method is executed by a server. Referring to Figure 2, the method includes:

[0055] Step 201: Use M word segmentation tools to segment the target text to obtain M first word segmentation sequences.

[0056] Here, M is an integer greater than 1. The target text can be any text to be segmented; for example, the target text is the text corpus used to train a language model. The M word segmentation tools can be at least two of Jieba, HanLP, SnowNLP, THULAC, pyltp, etc.

[0057] It should be noted that a word segmentation tool includes multiple word segmentation algorithms; therefore, segmenting the target text using multiple word segmentation tools is equivalent to using a large number of word segmentation algorithms, which can improve the accuracy of word segmentation.

[0058] Step 202: Based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, determine N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences.

[0059] Here, N is an integer greater than M. Because N is greater than M, the server, by performing weighted recombination on the M first word segmentation sequences, obtains segmentation sequences that none of the M word segmentation tools provided, thereby increasing the number of segmentation sequences obtained. In other words, this application can mine more segmentation sequences.

[0060] Step 203: Based on the weights of the second word segments in the N second word segmentation sequences and the basic language model, determine the first probability of each of the N second word segmentation sequences.

[0061] The basic language model is a small language model; therefore, step 203 is computationally fast, which improves word segmentation efficiency.

[0062] Step 204: Based on the first probability of the N second word segmentation sequences, select the target word segmentation sequence corresponding to the target text from the N second word segmentation sequences.

[0063] Based on the first probabilities of the N second word segmentation sequences, the server selects the second word segmentation sequence with the highest first probability as the target word segmentation sequence corresponding to the target text. If several second word segmentation sequences share the highest first probability, the server can randomly select one of them as the target word segmentation sequence.
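The selection rule, including the random choice among ties, can be sketched as follows. The candidate sequences, their probabilities, and the name `select_target` are hypothetical; passing a seeded `random.Random` is only a convenience for reproducible runs.

```python
import random
from typing import Dict, Optional, Tuple

Sequence = Tuple[str, ...]


def select_target(probs: Dict[Sequence, float],
                  rng: Optional[random.Random] = None) -> Sequence:
    """Pick the sequence with the highest first probability.

    If several sequences share the highest value, one of them is chosen
    at random.
    """
    rng = rng or random.Random()
    top = max(probs.values())
    candidates = [seq for seq, p in probs.items() if p == top]
    return rng.choice(candidates)


# Hypothetical first probabilities for two candidate sequences
probs = {("ab", "cd"): 0.5, ("a", "b", "cd"): 0.1875}
print(select_target(probs))
```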

[0064] In this embodiment of the application, the target text is segmented using multiple word segmentation tools, and then the probability of each segmentation result is determined by a basic language model. The segmentation result with the highest probability is selected as the final segmentation result of the target text, which can improve the accuracy of word segmentation.

[0065] Figure 3 is a flowchart of a text-based word segmentation method provided in an embodiment of this application, in which the method is executed by a server. Referring to Figure 3, the method includes:

[0066] Step 301: The server uses M word segmentation tools to segment the target text, obtaining M first word segmentation sequences.

[0067] M is an integer greater than 1; the target text can be any text to be segmented; for example, the target text is the text corpus used to train a language model. The M word segmentation tools can be at least two of Jieba, HanLP, SnowNLP, THULAC, pyltp, etc.

[0068] For example, the target text is "considering the effects of multiple word segmentation tools" and M is 4, the tools being word segmentation tool 1, word segmentation tool 2, word segmentation tool 3, and word segmentation tool 4. Word segmentation tool 1 segments the target text into the first word segmentation sequence {considering the effects of multiple word segmentation tools}; word segmentation tool 2 segments it into {considering the effects of multiple word segmentation tools}; word segmentation tool 3 segments it into {considering the effects of multiple word segmentation tools}; and word segmentation tool 4 segments it into {considering the effects of multiple word segmentation tools}.

[0069] It should be noted that if the target text is the text corpus used to train the language model, then there can be multiple target texts.

[0070] Step 302: Based on the weights corresponding to the M word segmentation tools, the server converts the M first word segmentation sequences into M first weighted finite-state transducer (WFST) graphs.

[0071] The first WFST graph corresponding to the first word segmentation sequence includes the first word segmentation sequence and the weights of the first words in the first word segmentation sequence. This step can be achieved through the following steps (1) and (2), including:

[0072] (1) For the first word segmentation sequence obtained by any word segmentation tool, the server generates a first node path corresponding to its P first words.

[0073] Here, P is the number of first words included in the first word segmentation sequence. The first node path includes P+1 first nodes, with the first node being the start node and the last node being the end node. A first node in the first node path can be represented by a circle, and the end node by two circles. The server can also label each of the P+1 first nodes with its sequence number. For example, referring to Figure 4, for the first word segmentation sequence obtained by word segmentation tool 1 (considering the effects of multiple word segmentation tools), the server generates a first node path including 7 first nodes; for the first word segmentation sequence obtained by word segmentation tool 2 (considering the effects of multiple word segmentation tools), the server generates a first node path including 8 first nodes; for the first word segmentation sequence obtained by word segmentation tool 3 (considering the effects of multiple word segmentation tools), the server generates a first node path including 8 first nodes; and for the first word segmentation sequence obtained by word segmentation tool 4 (considering the effects of multiple word segmentation tools), the server generates a first node path including 8 first nodes.

[0074] (2) For the i-th first word in the first word segmentation sequence, based on the weight corresponding to the word segmentation tool, the server marks the i-th first word and its weight on the connection between the i-th first node and the (i+1)-th first node in the first node path to obtain the first WFST graph.

[0075] Here, i is an integer greater than 0 and not greater than P. The weight of each first word in a first word segmentation sequence can be the same as the weight corresponding to the word segmentation tool that produced the sequence. For example, the weight corresponding to word segmentation tool 1 is k1, and the weight of each first word in the first word segmentation sequence obtained by word segmentation tool 1 is k1; the weight corresponding to word segmentation tool 2 is k2, and the weight of each first word in the first word segmentation sequence obtained by word segmentation tool 2 is k2; the weight corresponding to word segmentation tool 3 is k3, and the weight of each first word in the first word segmentation sequence obtained by word segmentation tool 3 is k3; the weight corresponding to word segmentation tool 4 is k4, and the weight of each first word in the first word segmentation sequence obtained by word segmentation tool 4 is k4.

[0076] The weight of the first word in the first word segmentation sequence can also be different from the weight corresponding to the word segmentation tool that obtained the first word segmentation sequence; correspondingly, for the first word in the first word segmentation sequence obtained by any word segmentation tool, the process by which the server determines the weight of the first word includes:

[0077] The server determines the usage frequency of the first word segment, obtains the weight that matches the usage frequency, and uses the product of the weight that matches the usage frequency and the weight corresponding to the word segmentation tool as the weight of the first word segment; wherein, the usage frequency of the first word segment is positively correlated with the corresponding weight.
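One way to picture this product of weights is the sketch below. The text only requires that the frequency-matched weight grow with usage frequency; the lookup by threshold bands, the band boundaries, the band weights, and the name `word_weight` are all hypothetical choices made for illustration.

```python
from bisect import bisect_right

# Hypothetical usage-count boundaries and the weight assigned to each band;
# the band weights increase with frequency, as the text requires.
FREQ_THRESHOLDS = [10, 100, 1000]
FREQ_WEIGHTS = [0.5, 1.0, 1.5, 2.0]


def word_weight(usage_count: int, tool_weight: float) -> float:
    """Weight of a first word: frequency-matched weight times the tool weight."""
    freq_weight = FREQ_WEIGHTS[bisect_right(FREQ_THRESHOLDS, usage_count)]
    return freq_weight * tool_weight


print(word_weight(150, 0.5))  # band 100..999 gives 1.5, so 1.5 * 0.5 = 0.75
```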

[0078] For example, continuing with Figure 4, for the first word segmentation sequence obtained by word segmentation tool 1 (considering the effects of multiple word segmentation tools), the server marks "comprehensive" and its weight k1 on the connection between node 0 and node 1, "consider" and its weight k1 on the connection between node 1 and node 2, "multiple" and its weight k1 on the connection between node 2 and node 3, "segmentation tool" and its weight k1 on the connection between node 3 and node 4, "of" and its weight k1 on the connection between node 4 and node 5, and "effect" and its weight k1 on the connection between node 5 and node 6. In the same way, the server marks the first word segmentation sequence obtained by word segmentation tool 2, together with the weights of its first words, on the first node path corresponding to word segmentation tool 2, and does likewise for word segmentation tool 3 and word segmentation tool 4 on their respective first node paths.
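The construction for word segmentation tool 1 can be sketched as a linear chain of arcs. The six words are the English glosses used in the example above; the numeric value standing in for k1 and the names `Arc` and `build_first_graph` are hypothetical. Nodes are numbered from 0, matching the node numbering of the example.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    src: int       # node the connection leaves
    dst: int       # node the connection enters
    word: str      # first word marked on the connection
    weight: float  # weight marked on the connection


def build_first_graph(words: List[str], tool_weight: float) -> List[Arc]:
    """Return the P arcs of a linear chain over P+1 nodes (0 = start, P = end)."""
    return [Arc(i, i + 1, word, tool_weight) for i, word in enumerate(words)]


tool1_words = ["comprehensive", "consider", "multiple",
               "segmentation tool", "of", "effect"]
K1 = 0.4  # hypothetical value for the weight k1 of word segmentation tool 1
graph1 = build_first_graph(tool1_words, K1)
for arc in graph1:
    print(arc)
```

Six words give six arcs over seven nodes, which agrees with the node count stated for word segmentation tool 1.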

[0079] It should be noted that the weights corresponding to the M word segmentation tools can be the same or different. In this embodiment, the example of different weights for the M word segmentation tools is used for illustration, and the weight of each word segmentation tool can be determined by its accuracy. Accordingly, for any word segmentation tool, the process of the server determining the weight of that word segmentation tool includes: the server determining the word segmentation accuracy of the word segmentation tool, obtaining the weight that matches the word segmentation accuracy, and thus obtaining the weight of the word segmentation tool. The weight of the word segmentation tool is positively correlated with the word segmentation accuracy of the word segmentation tool.

[0080] In this embodiment, the weight of a word segmentation tool is determined by its segmentation accuracy. That is, a word segmentation tool with a high segmentation accuracy is given a high weight, so that the word segmentation tool with a high segmentation accuracy plays a greater role in the final word segmentation result, while a word segmentation tool with a low segmentation accuracy is given a low weight, so that the word segmentation tool with a low segmentation accuracy plays a smaller role in the final word segmentation result, thereby improving the accuracy of word segmentation.
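A simple mapping that satisfies the positive correlation is to make each weight proportional to the tool's accuracy. This is only one possible choice, not one the text prescribes; the accuracy figures and the name `tool_weights` are hypothetical.

```python
from typing import Dict


def tool_weights(accuracy: Dict[str, float]) -> Dict[str, float]:
    """Map each tool's segmentation accuracy to a weight proportional to it."""
    total = sum(accuracy.values())
    if total <= 0:
        raise ValueError("at least one tool must have a positive accuracy")
    return {name: acc / total for name, acc in accuracy.items()}


# Hypothetical accuracies, e.g. measured on a manually segmented sample
accuracy = {"tool1": 0.95, "tool2": 0.90, "tool3": 0.85, "tool4": 0.80}
weights = tool_weights(accuracy)
print(weights)
```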

[0081] Step 303: The server performs a weighted merging of the M first WFST graphs to obtain the second WFST graph.

[0082] The second WFST graph includes the segmentation paths corresponding to N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences. N is an integer greater than M. Therefore, in this embodiment, by performing weighted recombination on the M first word segmentation sequences, the server obtains segmentation sequences that none of the M word segmentation tools provided, thereby increasing the number of segmentation sequences obtained; that is, this application can mine more segmentation sequences. For example, the server can mine 4 segmentation sequences not provided by the word segmentation tools, such as {considering the effects of multiple segmentation tools} and {considering the effects of multiple segmentation tools}.
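The count of mined sequences can be checked with a small path enumeration. The sketch uses only the node pairs that the Figure 5 example lists for the second WFST graph (labels and weights are omitted); the function name `enumerate_paths` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def enumerate_paths(edges: List[Tuple[int, int]], start: int, end: int) -> List[List[int]]:
    """List every node path from start to end in an acyclic graph."""
    successors: Dict[int, List[int]] = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    paths: List[List[int]] = []

    def walk(node: int, prefix: List[int]) -> None:
        if node == end:
            paths.append(prefix)
            return
        for nxt in successors[node]:
            walk(nxt, prefix + [nxt])

    walk(start, [start])
    return paths


# Node pairs of the connections described for the Figure 5 example
FIGURE5_EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5),
                 (4, 5), (5, 6), (6, 7), (7, 8), (6, 8)]
paths = enumerate_paths(FIGURE5_EDGES, 0, 8)
print(len(paths))  # 8
```

The graph has three independent two-way choices (between nodes 0 and 2, 3 and 5, and 6 and 8), so it contains 2 × 2 × 2 = 8 paths. With M = 4 tool outputs among them, 4 paths are sequences that no tool produced, which matches the example.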

[0083] This step can be achieved through steps (1) to (3), including:

[0084] (1) The server generates a second node path corresponding to Q second words based on the M first word segmentation sequences.

[0085] The Q second words are the union of the first words in the M first word segmentation sequences. The second node path includes Q+1 second nodes. The first node of the second node path is the start node, and the last node is the end node. In addition, the start node of the second node path can be represented by a single circle, and the end node by a double circle.

[0086] For example, referring to Figure 5, the server generates a second node path containing nine second nodes based on the four first word segmentation sequences obtained from four word segmentation tools. In addition, the server can label each second node with its sequence number.

[0087] (2) Based on the path relationships between the first words in the M first WFST graphs, the server marks the Q second words on the connections between the second nodes in the second node path.

[0088] For any two first words, if the two first words are connected in a first WFST graph, then the second nodes corresponding to the two first words are also connected in the second node path. For example, referring to Figure 5, the server labels the connection between node 0 and node 1 as "comprehensive", the connection between node 0 and node 2 as "comprehensive consideration", the connection between node 1 and node 2 as "consideration", the connection between node 2 and node 3 as "multiple", the connection between node 3 and node 4 as "word segmentation", the connection between node 3 and node 5 as "word segmentation tool", the connection between node 4 and node 5 as "tool", the connection between node 5 and node 6 as "of", the connection between node 6 and node 7 as "effect", the connection between node 7 and node 8 as "result", and the connection between node 6 and node 8 as "effect".

[0089] (3) For any second word, the server sums the weights of the second word in the first word segmentation sequences that contain it to obtain the weight of the second word, and marks the weight of the second word on the connection corresponding to the second word to obtain the second WFST graph.

[0090] For example, continuing to refer to Figure 5, "comprehensive" appears in the first word segmentation sequences obtained by word segmentation tool 1, word segmentation tool 2, and word segmentation tool 3. The weight of "comprehensive" in the first word segmentation sequence obtained by word segmentation tool 1 is k1, its weight in the first word segmentation sequence obtained by word segmentation tool 2 is k2, and its weight in the first word segmentation sequence obtained by word segmentation tool 3 is k3, so the weight of "comprehensive" is k1 + k2 + k3. In the same way, the weight of "comprehensive consideration" is determined to be k4, the weight of "consideration" is k1 + k2 + k3, the weight of "multiple" is k1 + k2 + k3 + k4, the weight of "word segmentation" is k2 + k3 + k4, the weight of "tool" is k2 + k3 + k4, the weight of "word segmentation tool" is k1, the weight of "of" is k1 + k2 + k3 + k4, the weight of "effect" on the connection between node 6 and node 7 is k4, the weight of "result" is k4, and the weight of "effect" on the connection between node 6 and node 8 is k1 + k2 + k3.
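Steps (1) to (3) can be sketched as follows, as a minimal illustration rather than a definitive implementation. The sketch assumes that nodes are the character offsets at which at least one tool places a word boundary, and that each edge is identified by its start node, end node, and word. The ASCII tokens are hypothetical stand-ins for the words of Figure 5 ("CO" for "comprehensive", "NS" for "consideration", "CONS" for "comprehensive consideration", "MU" for "multiple", "WS" for "word segmentation", "TL" for "tool", "WSTL" for "word segmentation tool", "OF" for "of", "E" and "R" for the single-character "effect" and "result", and "ER" for the two-character "effect"), and the values of k1 to k4 are invented.

```python
from collections import defaultdict


def merge_segmentations(sequences, tool_weights):
    """Merge M segmentations of the same text into one weighted word graph.

    Returns {(start_node, end_node, word): weight}, where the weight is the
    sum of the weights of all tools that produced that word at that position.
    """
    edges = defaultdict(float)
    for words, weight in zip(sequences, tool_weights):
        offset = 0
        for word in words:
            edges[(offset, offset + len(word), word)] += weight
            offset += len(word)
    # Number the nodes 0, 1, 2, ... in order of their character offset.
    boundaries = sorted({0} | {end for _, end, _ in edges})
    node_id = {offset: i for i, offset in enumerate(boundaries)}
    return {(node_id[s], node_id[e], w): wt for (s, e, w), wt in edges.items()}


# Hypothetical stand-ins for the four first word segmentation sequences.
sequences = [
    ["CO", "NS", "MU", "WSTL", "OF", "ER"],      # tool 1
    ["CO", "NS", "MU", "WS", "TL", "OF", "ER"],  # tool 2
    ["CO", "NS", "MU", "WS", "TL", "OF", "ER"],  # tool 3
    ["CONS", "MU", "WS", "TL", "OF", "E", "R"],  # tool 4
]
k = [0.4, 0.3, 0.2, 0.1]  # invented values for k1..k4
second_graph = merge_segmentations(sequences, k)
nodes = {n for s, e, _ in second_graph for n in (s, e)}
```

With these inputs the sketch yields nine nodes numbered 0 to 8, and the edge for "CO" between node 0 and node 1 carries k1 + k2 + k3, in line with the example above.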

[0091] Step 304: The server fuses the second WFST graph with the third WFST graph corresponding to the language model to obtain a fourth WFST graph.

[0092] The second WFST graph is marked with the weights of the second words in the N second word segmentation sequences, the third WFST graph is marked with the probabilities, in the language model, of the words of the text corpus, and the fourth WFST graph is marked with the probabilities of the second words included in the N second word segmentation sequences. For any second word in a second word segmentation sequence, the server takes the product of the weight of the second word and the probability of that word in the third WFST graph as the probability of the second word in the fourth WFST graph.
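The per-word product described above can be sketched as follows. This is not a full WFST composition: a plain word-to-probability table (`lm_prob`, a hypothetical unigram stand-in for the third WFST graph) is assumed, and the edge keys, tokens, and numbers are invented for the example.

```python
def fuse(second_graph, lm_prob):
    """Scale each edge weight by the language-model probability of its word."""
    return {
        edge: weight * lm_prob[edge[2]]  # edge = (start_node, end_node, word)
        for edge, weight in second_graph.items()
    }


# Hypothetical fragment of a second WFST graph and of a language model.
second_graph = {(0, 1, "CO"): 0.9, (1, 2, "NS"): 0.9, (0, 2, "CONS"): 0.1}
lm_prob = {"CO": 0.05, "NS": 0.04, "CONS": 0.02}
fourth_graph = fuse(second_graph, lm_prob)
```

The fused graph keeps the same nodes and edges as the second WFST graph; only the value marked on each edge changes.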

[0093] Step 305: For any second word segmentation sequence, the server determines the sum of the probabilities, in the fourth WFST graph, of the second words in the second word segmentation sequence to obtain the first probability of the second word segmentation sequence.

[0094] For example, if the second word segmentation sequence is {comprehensive consideration of the effects of multiple word segmentation tools}, the server takes the sum of the probabilities of "comprehensive consideration", "multiple", "word segmentation tool", "of", and "effect" in the fourth WFST graph as the first probability of the second word segmentation sequence {comprehensive consideration of the effects of multiple word segmentation tools}.
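As a non-limiting sketch of step 305, the following enumerates every path from the start node to the end node of a small fourth WFST graph and sums the edge probabilities along each path. The graph representation, the function names, and the numbers are assumptions made for the example.

```python
def paths(graph, start, end):
    """Yield every path from start to end in an acyclic word graph.

    Each path is a list of (word, probability) pairs in reading order.
    """
    if start == end:
        yield []
        return
    for (src, dst, word), prob in sorted(graph.items()):
        if src == start:
            for rest in paths(graph, dst, end):
                yield [(word, prob)] + rest


def first_probability(path):
    """First probability of a sequence: the sum of its word probabilities."""
    return sum(prob for _, prob in path)


# Hypothetical fragment of a fourth WFST graph: (start, end, word) -> probability.
fourth_graph = {(0, 1, "CO"): 0.2, (1, 2, "NS"): 0.3, (0, 2, "CONS"): 0.4}
scores = {
    tuple(word for word, _ in path): first_probability(path)
    for path in paths(fourth_graph, 0, 2)
}
```

Here `scores` maps each candidate second word segmentation sequence to its first probability, ready for the selection in step 306.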

[0095] Step 306: The server selects the target word segmentation sequence corresponding to the target text from the N second word segmentation sequences based on the first probabilities of the N second word segmentation sequences.

[0096] Based on the first probabilities of the N second word segmentation sequences, the server selects the second word segmentation sequence with the highest first probability as the target word segmentation sequence corresponding to the target text. If several second word segmentation sequences share the highest first probability, the server can randomly select one of them as the target word segmentation sequence corresponding to the target text.
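The selection rule, including the random choice among tied sequences, can be sketched as follows. The function name `select_target` and the candidate scores are hypothetical; exact floating-point equality is used to detect ties, which is an assumption of this sketch.

```python
import random


def select_target(scored_sequences, rng=random):
    """Return the sequence with the highest first probability.

    scored_sequences is a list of (sequence, first_probability) pairs.
    When several sequences share the maximum, one is chosen at random.
    """
    best = max(score for _, score in scored_sequences)
    candidates = [seq for seq, score in scored_sequences if score == best]
    return rng.choice(candidates)


# Hypothetical candidates and first probabilities.
target = select_target([(("CO", "NS"), 0.5), (("CONS",), 0.7)])
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible, which can be convenient when the selected sequences are later used as training data.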

[0097] In this embodiment of the application, the target text is segmented using multiple word segmentation tools, and then the probability of each segmentation result is determined by a basic language model. The segmentation result with the highest probability is selected as the final segmentation result of the target text, which can improve the accuracy of word segmentation.

[0098] It should be noted that when the target text is a text corpus, after the server determines the target word segmentation sequence corresponding to the target text, it performs model training based on the target text and the target word segmentation sequence to obtain a new language model, and then provides a voice control function for the terminal through the new language model.

[0099] In this embodiment of the application, because the method can improve the accuracy of word segmentation of the text corpus, training a language model on the word segmentation sequences obtained from the text corpus can improve the performance of the language model, thereby improving the accuracy of the language model in voice control.

[0100] Figure 6 is a flowchart of a text-based word segmentation method provided in an embodiment of this application, wherein the execution entity of the method is a server. Referring to Figure 6, the method includes:

[0101] Step 601: The server uses M word segmentation tools to segment the target text, obtaining M first word segmentation sequences, where M is an integer greater than 1.

[0102] It should be noted that this step is the same as step 301, and will not be repeated here.

[0103] Step 602: Based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, the server determines N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences, where N is an integer greater than M.

[0104] Based on the target text, the server recombines the first words of the M first word segmentation sequences to obtain the N second word segmentation sequences. For any second word in a second word segmentation sequence, the server sums the weights of that word in the first word segmentation sequences to obtain the weight of the second word in the second word segmentation sequence.
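The recombination in step 602 can be sketched without building a graph, under the assumption that a first word may be reused only at the character position where some tool produced it. The function name `recombine` and the ASCII tokens, which stand in for the words of the target text, are hypothetical.

```python
def recombine(sequences):
    """Enumerate every segmentation of the text that can be assembled from
    words which at least one tool produced at the same position."""
    text = "".join(sequences[0])
    if any("".join(seq) != text for seq in sequences):
        raise ValueError("all tools must segment the same target text")
    words_at = {}  # character offset -> words that some tool starts there
    for seq in sequences:
        offset = 0
        for word in seq:
            words_at.setdefault(offset, set()).add(word)
            offset += len(word)

    def expand(offset):
        if offset == len(text):
            yield ()
            return
        for word in sorted(words_at.get(offset, ())):
            for rest in expand(offset + len(word)):
                yield (word,) + rest

    return list(expand(0))


# Hypothetical stand-ins for M = 4 first word segmentation sequences.
sequences = [
    ["CO", "NS", "MU", "WSTL", "OF", "ER"],
    ["CO", "NS", "MU", "WS", "TL", "OF", "ER"],
    ["CO", "NS", "MU", "WS", "TL", "OF", "ER"],
    ["CONS", "MU", "WS", "TL", "OF", "E", "R"],
]
second_sequences = recombine(sequences)
```

For these inputs the sketch returns more sequences than there are tools, including every sequence that a tool produced.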

[0105] Step 603: For any second word segmentation sequence, the server determines, based on the language model, the second probability of each second word in the second word segmentation sequence.

[0106] The server inputs each second word of the second word segmentation sequence into the language model, and the language model outputs the second probability of that second word.

[0107] Step 604: Based on the weights and the second probabilities of the second words in the second word segmentation sequence, the server performs a weighted summation of the second probabilities to obtain the first probability of the second word segmentation sequence.

[0108] For example, if the second word segmentation sequence is {considering the effects of multiple word segmentation tools}, the server sums the second probabilities of "considering", "multiple", "word segmentation tools", "of", and "effect", each multiplied by the weight of the corresponding word, to obtain the first probability of the second word segmentation sequence {considering the effects of multiple word segmentation tools}.
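The weighted summation of step 604 can be sketched as follows. The tables `word_weight` and `lm_prob`, the ASCII tokens, and all numbers are hypothetical. For brevity, weights are keyed by word only, which assumes that no word occurs at two different positions of the target text.

```python
def first_probability(sequence, word_weight, lm_prob):
    """Weighted sum of the second probabilities of the words in a sequence."""
    return sum(word_weight[word] * lm_prob[word] for word in sequence)


# Hypothetical second word segmentation sequence with its weights and
# language-model (second) probabilities.
sequence = ["CONS", "MU", "WSTL", "OF", "ER"]
word_weight = {"CONS": 0.1, "MU": 1.0, "WSTL": 0.4, "OF": 1.0, "ER": 0.9}
lm_prob = {"CONS": 0.02, "MU": 0.05, "WSTL": 0.01, "OF": 0.2, "ER": 0.03}
score = first_probability(sequence, word_weight, lm_prob)
```

The result is the same quantity that the graph-based embodiment obtains by multiplying weights and probabilities on the edges and then summing along a path.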

[0109] Step 605: Based on the first probabilities of the N second word segmentation sequences, the server selects the target word segmentation sequence corresponding to the target text from the N second word segmentation sequences.

[0110] It should be noted that this step is the same as step 306, and will not be repeated here.

[0111] In this embodiment of the application, the target text is segmented using multiple word segmentation tools, and then the probability of each segmentation result is determined by a basic language model. The segmentation result with the highest probability is selected as the final segmentation result of the target text, which can improve the accuracy of word segmentation.

[0112] Figure 7 is a block diagram of a text-based word segmentation device provided in an embodiment of this application. Referring to Figure 7, the device includes:

[0113] The word segmentation module 701 is used to segment the target text using M word segmentation tools to obtain M first word segmentation sequences, where M is an integer greater than 1;

[0114] The first determining module 702 is used to determine N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, where N is an integer greater than M;

[0115] The second determining module 703 is used to determine the first probabilities of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and the basic language model;

[0116] The selection module 704 is used to select the target word segmentation sequence corresponding to the target text from the N second word segmentation sequences based on the first probabilities of the N second word segmentation sequences.

[0117] In one possible implementation, the first determining module 702 is used to convert the M first word segmentation sequences into M first weighted finite-state transducer (WFST) graphs based on the weights corresponding to the M word segmentation tools, where the first WFST graph corresponding to a first word segmentation sequence includes the first word segmentation sequence and the weights of the first words in the first word segmentation sequence; and to perform weighted merging of the M first WFST graphs to obtain a second WFST graph, which includes the segmentation paths corresponding to the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences.

[0118] In another possible implementation, the first determining module 702 is used to generate, for the first word segmentation sequence obtained by any word segmentation tool, a first node path corresponding to P first words, where the first node of the first node path is the start node, the last node is the end node, and P is the number of first words included in the first word segmentation sequence; and, for the i-th first word in the first word segmentation sequence, to mark, based on the weight corresponding to the word segmentation tool, the i-th first word and its corresponding weight on the connection between the i-th first node and the (i+1)-th first node in the first node path to obtain the first WFST graph, where i is an integer greater than 0 and not greater than P.
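The construction of a first WFST graph can be sketched as a linear chain of arcs. This is a simplified illustration: arcs are represented as plain tuples rather than transducer transitions, every first word is assumed to take the weight of the tool that produced it, nodes are numbered from 0 so that the i-th word lies between node i-1 and node i, and the tokens and the weight are hypothetical.

```python
def first_wfst(words, tool_weight):
    """Build a linear chain with P + 1 nodes for a sequence of P words.

    Each arc is (source_node, target_node, word, weight); node 0 is the
    start node and node P is the end node.
    """
    return [(i, i + 1, word, tool_weight) for i, word in enumerate(words)]


# Hypothetical first word segmentation sequence from a tool with weight 0.4.
arcs = first_wfst(["CO", "NS", "MU", "WSTL", "OF", "ER"], 0.4)
```

A chain built this way has exactly one path, which spells out the first word segmentation sequence of the tool.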

[0119] In another possible implementation, the first determining module 702 is used to generate a second node path corresponding to Q second words based on the M first word segmentation sequences, where the Q second words are the union of the first words in the M first word segmentation sequences, the first node of the second node path is the start node, and the last node is the end node; to mark the Q second words on the connections between the second nodes in the second node path based on the path relationships between the first words in the M first WFST graphs; and, for any second word, to sum the weights of the second word in the first word segmentation sequences to obtain the weight of the second word, and to mark the weight of the second word on the connection corresponding to the second word to obtain the second WFST graph.

[0120] In another possible implementation, the second determining module 703 is used to fuse the second WFST graph and the third WFST graph corresponding to the language model to obtain a fourth WFST graph. The weights of the second words in the N second word segmentation sequences are marked in the second WFST graph, the probabilities of the words corresponding to the text corpus in the language model are marked in the third WFST graph, and the probabilities of the second words included in the N second word segmentation sequences are marked in the fourth WFST graph. For any second word segmentation sequence, the sum of the probabilities of the second words in the second word segmentation sequence in the fourth WFST graph is determined to obtain the first probability of the second word segmentation sequence.

[0121] In another possible implementation, the second determining module 703 is used to determine, for any second word segmentation sequence and based on the language model, the second probability of each second word in the second word segmentation sequence; and to perform a weighted summation of the second probabilities of the second words based on the weights of the second words in the second word segmentation sequence to obtain the first probability of the second word segmentation sequence.

[0122] In this embodiment of the application, the target text is segmented using multiple word segmentation tools, and then the probability of each segmentation result is determined by a basic language model. The segmentation result with the highest probability is selected as the final segmentation result of the target text, which can improve the accuracy of word segmentation.

[0123] It should be noted that the text-based word segmentation device provided in the above embodiments is only illustrated by the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the server can be divided into different functional modules to complete all or part of the functions described above. In addition, the text-based word segmentation device and the text-based word segmentation method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.

[0124] Figure 8 is a structural block diagram of a server provided in an embodiment of this application. The server 800 can vary considerably depending on its configuration or performance. It may include a processor (central processing unit, CPU) 801 and a memory 802. The memory 802 stores at least one piece of program code, which is loaded and executed by the processor 801 to implement the methods provided in the above-described method embodiments. The server 800 may also have wired or wireless network interfaces, a keyboard, and input/output interfaces for input and output. The server 800 may also include other components for implementing device functions, which will not be elaborated here.

[0125] This application also provides a computer-readable storage medium storing at least one piece of program code, which is loaded and executed by a processor to implement the text-based word segmentation method described in any of the above implementations. Optionally, the storage medium may be a non-transitory computer-readable storage medium, such as ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, and optical data storage device.

[0126] This application also provides a computer program product, which includes computer program code stored in a computer-readable storage medium. The server's processor reads the computer program code from the computer-readable storage medium and executes the computer program code, causing the server to execute the text-based word segmentation method of any of the above implementations.

[0127] In some embodiments, the computer program product involved in the present application can be deployed and executed on a server, or on multiple servers located in one location, or on multiple servers distributed in multiple locations and interconnected through a communication network. Multiple servers distributed in multiple locations and interconnected through a communication network can form a blockchain system.

[0128] The above are merely optional embodiments of this application and are not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A text-based word segmentation method, characterized in that the method includes: segmenting a target text using M word segmentation tools to obtain M first word segmentation sequences, where M is an integer greater than 1; determining, based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences, where N is an integer greater than M; determining the first probabilities of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and a basic language model; and selecting, based on the first probabilities of the N second word segmentation sequences, a target word segmentation sequence corresponding to the target text from the N second word segmentation sequences; wherein the step of determining the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences includes: converting the M first word segmentation sequences into M first WFST graphs respectively based on the weights corresponding to the M word segmentation tools, where the first WFST graph corresponding to a first word segmentation sequence includes the first word segmentation sequence and the weights of the first words in the first word segmentation sequence; and performing weighted merging of the M first WFST graphs to obtain a second WFST graph, which includes the segmentation paths corresponding to the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences.

2. The method according to claim 1, characterized in that the step of converting the M first word segmentation sequences into M first WFST graphs based on the weights corresponding to the M word segmentation tools includes: for the first word segmentation sequence obtained by any word segmentation tool, generating a first node path corresponding to P first words, where the first node of the first node path is the start node, the last node is the end node, and P is the number of first words included in the first word segmentation sequence; and, for the i-th first word in the first word segmentation sequence, marking, based on the weight corresponding to the word segmentation tool, the i-th first word and its corresponding weight on the connection between the i-th first node and the (i+1)-th first node in the first node path to obtain the first WFST graph, where i is an integer greater than 0 and not greater than P.

3. The method according to claim 1, characterized in that the step of weighted merging of the M first WFST graphs to obtain the second WFST graph includes: generating, based on the M first word segmentation sequences, a second node path corresponding to Q second words, where the Q second words are the union of the first words in the M first word segmentation sequences, the first node of the second node path is the start node, and the last node is the end node; marking the Q second words on the connections between the second nodes in the second node path based on the path relationships between the first words in the M first WFST graphs; and, for any second word, summing the weights of the second word in the first word segmentation sequences to obtain the weight of the second word, and marking the weight of the second word on the connection corresponding to the second word to obtain the second WFST graph.

4. The method according to claim 1, characterized in that the step of determining the first probabilities of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and the basic language model includes: fusing the second WFST graph with a third WFST graph corresponding to the language model to obtain a fourth WFST graph, where the weights of the second words in the N second word segmentation sequences are marked in the second WFST graph, the probabilities, in the language model, of the words of the text corpus are marked in the third WFST graph, and the probabilities of the second words included in the N second word segmentation sequences are marked in the fourth WFST graph; and, for any second word segmentation sequence, determining the sum of the probabilities, in the fourth WFST graph, of the second words in the second word segmentation sequence to obtain the first probability of the second word segmentation sequence.

5. The method according to claim 1, characterized in that the step of determining the first probabilities of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and the basic language model includes: for any second word segmentation sequence, determining, based on the language model, the second probability of each second word in the second word segmentation sequence; and performing a weighted summation of the second probabilities of the second words based on the weights of the second words in the second word segmentation sequence to obtain the first probability of the second word segmentation sequence.

6. A text-based word segmentation device, characterized in that the device includes: a word segmentation module, used to segment a target text using M word segmentation tools to obtain M first word segmentation sequences, where M is an integer greater than 1; a first determining module, used to determine N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences based on the weights corresponding to the M word segmentation tools and the first words in the M first word segmentation sequences, where N is an integer greater than M; a second determining module, used to determine the first probabilities of the N second word segmentation sequences based on the weights of the second words in the N second word segmentation sequences and a basic language model; and a selection module, used to select a target word segmentation sequence corresponding to the target text from the N second word segmentation sequences based on the first probabilities of the N second word segmentation sequences; wherein the first determining module is used to convert the M first word segmentation sequences into M first WFST graphs based on the weights corresponding to the M word segmentation tools, where the first WFST graph corresponding to a first word segmentation sequence includes the first word segmentation sequence and the weights of the first words in the first word segmentation sequence; and to perform weighted merging of the M first WFST graphs to obtain a second WFST graph, where the second WFST graph includes the segmentation paths corresponding to the N second word segmentation sequences and the weights of the second words in the N second word segmentation sequences.

7. A server, characterized in that the server includes one or more processors and one or more memories, wherein at least one piece of program code is stored in the one or more memories, and the at least one piece of program code is loaded and executed by the one or more processors to implement the text-based word segmentation method as described in any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the storage medium stores at least one piece of program code, which is loaded and executed by a processor to implement the text-based word segmentation method as described in any one of claims 1 to 5.

9. A computer program product, characterized in that the computer program product includes computer program code stored in a computer-readable storage medium, and a processor of a server reads the computer program code from the computer-readable storage medium and executes the computer program code, causing the server to perform the text-based word segmentation method as described in any one of claims 1 to 5.