New word discovery methods and apparatus, computer equipment, storage media

By combining a tree-structured storage method with part-of-speech filtering, the problem of slow data retrieval in new word discovery methods is solved, achieving efficient new word discovery and improving query speed and accuracy.

CN115374768B · Active · Publication Date: 2026-06-30 · HEBEI XUNFEI ARTIFICIAL INTELLIGENCE RES INST +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HEBEI XUNFEI ARTIFICIAL INTELLIGENCE RES INST
Filing Date
2022-08-24
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing new word discovery methods are insufficient to meet practical needs in terms of data search speed, especially when processing large-scale data. Existing unsupervised methods require multiple database traversals, resulting in low efficiency.

Method used

A tree-structured storage method is used to store multiple n-gram words. Each n-gram word carries word frequency and contextual information. The node relationships of the tree-structured storage method are used to directly locate candidate words. Combined with part-of-speech filtering, new words are obtained.

Benefits of technology

By applying a tree-structured storage method, the query speed for new word discovery was improved, the number of database traversals was reduced, data retrieval efficiency was increased, and the accuracy of new word discovery was improved through part-of-speech filtering.

✦ Generated by Eureka AI based on patent content.

Figure CN115374768B_ABST

Abstract

This application provides a new word discovery method, apparatus, computer device, and storage medium, solving the problem of slow data retrieval speed in the existing new word discovery process. The new word discovery method includes: storing multiple n-gram words using a tree-structured storage method. These n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence, where n is a series of consecutive natural numbers starting from 1. Each n-gram word carries word frequency and contextual information, including at least one adjacent unary word and the positional relationship between each adjacent unary word and the n-gram word. The n-level nodes of the tree-structured storage method store n-gram words, and the n+1-gram words stored in the n+1-level nodes of the same path depend on the contextual information of the n-gram words stored in the n-level nodes. Candidate words are determined from the multiple n-gram words based on the tree-structured storage method. New words are obtained by filtering the candidate words based on their parts of speech.

Description

Technical Field

[0001] This application relates to the field of natural language processing technology, specifically to a new word discovery method and apparatus, computer equipment, and storage medium.

Background Technology

[0002] New word discovery is one of the fundamental tasks in the field of natural language processing. It involves deeply processing a large amount of existing corpus, storing the processed data, and then searching for new words from this stored content according to a predetermined strategy. However, the process of searching for new words from the stored content requires repeatedly traversing the database to find the required data. Due to the massive storage volume, the search speed often fails to meet practical needs.

Summary of the Invention

[0003] In view of this, embodiments of this application provide a new word discovery method and apparatus, computer equipment, and storage medium to solve the problem of slow data search speed in the new word discovery process in the prior art.

[0004] The first aspect of this application provides a new word discovery method, comprising: storing the obtained multiple n-gram words using a tree storage structure, wherein the multiple n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence, where n is a series of consecutive natural numbers starting from 1, and each n-gram word carries word frequency and context information, the context information including at least one adjacent unary word and the positional relationship between each of the at least one adjacent unary word and the n-gram word, the n-level nodes of the tree storage structure storing n-gram words, and the n+1-gram words stored in the n+1-level nodes in the same path depending on the context information of the n-gram words stored in the n-level nodes; determining candidate words among the multiple n-gram words based on the tree storage structure; and filtering the candidate words based on part of speech to obtain new words.

[0005] In one embodiment, the multiple n-gram words include multiple unary words and multiple binary words; storing the obtained multiple n-gram words using a tree storage structure includes: storing the multiple unary words into multiple first-level nodes respectively; for each first-level node, determining at least one binary word based on the unary words stored in the first-level node and the context information they carry; and storing at least one binary word into at least one second-level node corresponding to the first-level node respectively.

[0006] In one embodiment, after storing multiple unary words into multiple first-level nodes, the method further includes: for each first-level node, storing at least one adjacent unary word of the unary word stored in the first-level node into at least one second-level node. Storing at least one binary word into at least one second-level node corresponding to the first-level node includes: storing at least one binary word into the second-level node containing its respective adjacent unary words.

[0007] In one embodiment, within the same path, the n+1-gram words stored in the n+1-level nodes consist of the n-gram words stored in the n-level nodes and their adjacent unary words. Determining candidate words among multiple n-gram words based on the tree-structured storage structure includes: for each path in the tree-structured storage structure, determining the cohesion of the n+1-gram words stored in the n+1-level nodes based on the n-level nodes and n+1-level nodes; determining the degrees of freedom of the n+1-gram words based on the n+1-level nodes; and determining the n+1-gram words as candidate words when the cohesion is greater than a first threshold and the degrees of freedom are greater than a second threshold.

[0008] In one embodiment, determining the cohesion of n+1-gram words stored in n+1-level nodes based on n-level nodes and n+1-level nodes includes: determining the cohesion based on the word frequency of n-gram words stored in n-level nodes, the word frequency of adjacent unary words, and the word frequency of n+1-gram words stored in n+1-level nodes.

[0009] In one embodiment, determining the degrees of freedom of n+1-gram words based on n+1-level nodes includes: determining at least one left neighbor word and at least one right neighbor word based on the context information carried by the n+1-gram words stored in the n+1-level nodes; determining a first information entropy based on the word frequency of each of the at least one left neighbor word; determining a second information entropy based on the word frequency of each of the at least one right neighbor word; and determining the larger of the first information entropy and the second information entropy as the degree of freedom.

[0010] In one embodiment, filtering candidate words to obtain new words includes: filtering candidate words based on the part of speech of at least one n-gram word in the candidate words to obtain new words.

[0011] In one embodiment, filtering candidate words based on the part of speech of at least one n-gram word in the candidate words to obtain new words includes: when the part of speech of the first and last unary words of a candidate word is a noun, the candidate word is determined to be a new word.

[0012] In one embodiment, before storing the acquired n-gram words using a tree storage structure, the method further includes: replacing the word frequency of n-gram words whose word frequency is greater than a word frequency threshold with the word frequency threshold.

[0013] The second aspect of this application provides a new word discovery device, comprising: a storage module configured to store the obtained multiple n-gram words using a tree storage structure, wherein the multiple n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence, where n is a series of consecutive natural numbers starting from 1, and each n-gram word carries word frequency, part-of-speech, and context information, the context information including at least one adjacent unary word and the positional relationship between each adjacent unary word and the n-gram word, the n-level nodes of the tree storage structure storing n-gram words, and the n+1-gram words stored in the n+1-level nodes of the same path depending on the context information of the n-gram words stored in the n-level nodes; a determination module configured to determine candidate words among the multiple n-gram words based on the tree storage structure; and a filtering module configured to filter the candidate words to obtain new words.

[0014] A third aspect of this application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executed by the processor, characterized in that the processor executes the computer program to implement the steps of the new word discovery method provided in any of the above embodiments.

[0015] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the new word discovery method provided in any of the above embodiments.

[0016] According to the new word discovery method, apparatus, computer equipment, and storage medium provided in the embodiments of this application, a tree-structured storage method is used to store multiple n-gram words. These n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence, where n is a series of consecutive natural numbers starting from 1. Each n-gram word carries word frequency and contextual information. The contextual information includes at least one adjacent unary word and the positional relationship between each adjacent unary word and the n-gram word. The n-level nodes of the tree-structured storage method store n-gram words, and the n+1-gram words stored in the n+1-level nodes of the same path depend on the contextual information of the n-gram words stored in the n-level nodes. Candidate words are determined based on the tree-structured storage method. The candidate words are then filtered based on part-of-speech tags to obtain new words. In this way, the n+1-gram words can be located directly based on the n-gram words, without having to traverse the entire database to find the n+1-gram words, thus improving the data query speed.

Attached Figure Description

[0017] Figure 1 is a flowchart illustrating the new word discovery method provided in the first embodiment of this application.

[0018] Figure 2 is a partial schematic diagram of a tree storage structure provided in an embodiment of this application.

[0019] Figure 3 is a flowchart illustrating the new word discovery method provided in the second embodiment of this application.

[0020] Figure 4 is a flowchart illustrating the new word discovery method provided in the third embodiment of this application.

[0021] Figure 5 is a flowchart illustrating the new word discovery method provided in the fourth embodiment of this application.

[0022] Figure 6 is a flowchart illustrating the new word discovery method provided in the fifth embodiment of this application.

[0023] Figure 7 is a structural block diagram of the new word discovery device provided in the first embodiment of this application.

[0024] Figure 8 is a structural block diagram of the new word discovery device provided in the second embodiment of this application.

[0025] Figure 9 is a structural block diagram of a computer device provided in an embodiment of this application.

Detailed Implementation

[0026] New word discovery is one of the fundamental tasks in the field of natural language processing. It involves mining existing corpora to identify new words. New word discovery can also be called out-of-vocabulary word identification. Strictly speaking, new words are words that have emerged with the development of the times, or old words that have acquired new usages.

[0027] Common methods for discovering new words include the following:

[0028] I. Rule-based methods. These methods establish rule bases, specialized lexicons, or pattern libraries based on the word formation or appearance characteristics of new words, and then discover new words through rule matching. However, rule-based methods are too deeply coupled with domain and article format, resulting in a narrow scope of application for the established rule bases, specialized lexicons, or pattern libraries. Furthermore, both the construction and subsequent maintenance require significant manual labor costs.

[0029] II. Statistical-based supervised methods. Supervised methods utilize labeled corpora. One implementation approach is to regard new word discovery as a classification problem. Based on certain statistics of the labeled corpora, these are used as features to train a binary classification model, and then new words are discovered through the binary classification model. Another implementation approach is to regard new word discovery as a sequence labeling problem. Based on sequence labeling information, sequence labeling is directly performed to obtain new words, or the obtained new words are further judged. Statistical-based supervised methods require obtaining a large amount of labeled corpora. Obtaining a large amount of labeled corpora itself is a very difficult task. And if there is already a labeled corpus, then directly adding the newly labeled words to the new word dictionary is sufficient, without the need to build a model again.

[0030] III. Statistical-based unsupervised methods. Without relying on any existing word dictionaries and labeled corpora, solely based on the common features of words, statistical strategies are used to extract all text segments that may form words in a large-scale corpus, and then linguistic knowledge is used to exclude the "useless segments" that are not new words to find the words and combinations of words with the highest relevance. Finally, by comparing all the extracted words with the existing word dictionary, new words can be obtained.

[0031] Conventional unsupervised methods proceed as follows. First, the text sequence is segmented to obtain multiple words and their respective context information. The words here are common words, and the context information includes at least one adjacent word and the positional relationship between each adjacent word and the current word. Taking the text sequence "Eat grapes without spitting out grape skins" as an example, the multiple words include: eat, grapes, without spitting out, grape skins. The context information of "grapes" includes {(eat; left), (without spitting out; right)}. Second, the multiple words and their context information are stored in a database in list form. Third, it is determined whether the combined word formed by every two words can be used as a new word. Determining whether a combined word is a new word requires judging its cohesion and freedom. Cohesion measures the occurrence probability of the combined word, that is, whether the two words often appear together; if so, the cohesion is considered relatively high and the combined word can be used as a new word. Freedom measures whether the combined word can be flexibly applied in different scenarios. Taking "artificial intelligence" as an example, its context can be paired with many verbs and nouns, such as "learning artificial intelligence knowledge" and "engaging in the artificial intelligence industry". By contrast, for an incomplete fragment of "artificial intelligence", although many words can precede it, essentially only the remainder of the phrase can follow it. The freedom of such a fragment is therefore relatively low, and it cannot be used as a new word.

[0032] Statistical unsupervised methods typically store the context information of each word in a list format for calculating cohesion and degrees of freedom. Each calculation of cohesion and degrees of freedom requires multiple database lookups, each of which involves traversing the database. When dealing with large datasets, the processing speed becomes unsatisfactory. Furthermore, existing unsupervised methods can only determine whether two words can form a new word, but not whether three or more words can form one.

[0033] In view of this, embodiments of this application provide a new word discovery method and apparatus, computer equipment, and storage medium. The new word discovery method is an unsupervised method, comprising: storing multiple n-gram words using a tree-structured storage structure; the multiple n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence, where n is a series of consecutive natural numbers starting from 1; each n-gram word carries word frequency and context information, the context information including at least one adjacent unary word and the positional relationship between each adjacent unary word and the n-gram word; storing n-gram words in n-level nodes of the tree-structured storage structure; the n+1-gram words stored in n+1-level nodes within the same path depend on the context information of the n-gram words stored in the n-level nodes; determining candidate words among the multiple n-gram words based on the tree-structured storage structure; and filtering the candidate words based on part-of-speech tags to obtain new words. According to the new word discovery method, apparatus, computer equipment, and storage medium provided in this application, a tree-structured storage method is used to store multiple n-gram words. The n-level nodes of the tree structure store n-gram words, and the n+1-gram words stored in the n+1-level nodes within the same path depend on the context information of the n-gram words stored in the n-level nodes. On one hand, in the tree structure, each node connects to the next, and n-gram words are connected to (n+1)-gram words through context information. Therefore, both the data and the storage results exhibit chain-like relationships, and they are well-matched. Thus, the tree structure facilitates the storage of contextual relationships. 
On the other hand, since n-gram words and n+1-gram words are connected through the context information of the n-gram words, the n+1-gram words can be directly located based on the n-gram words without traversing the entire database to find them, thereby improving data query speed. Furthermore, a step of filtering candidate words using part-of-speech tags is added to obtain new words that better conform to the word formation structure.

[0034] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0035] Figure 1 is a flowchart illustrating the new word discovery method provided in the first embodiment of this application. As shown in Figure 1, the new word discovery method 100 includes:

[0036] Step S110: Use a tree storage structure to store the obtained n-gram words.

[0037] Multiple n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence. The value of n takes multiple consecutive natural numbers starting from 1. For example, if the value of n is 1 and 2, then multiple n-gram words are obtained by performing 1-gram and 2-gram word frequency statistics on the predetermined text sequence. Multiple n-gram words include multiple unary words and multiple binary words. Taking "Eat grapes without spitting out the grape skins, don't eat grapes but spit out the grape skins" as an example, its corresponding multiple unary words include "eat, grapes, don't spit out, grape skins, don't eat, spit out," and its multiple binary words include "eat grapes, grapes don't spit out, don't spit out grape skins, don't eat grapes, grapes spit out, spit out grape skins."

[0038] Each n-gram carries word frequency and contextual information. The contextual information includes at least one adjacent unary word and the positional relationship between each adjacent unary word and the n-gram. Taking "grape" as an example, its word frequency is 2 / 14, where "2" indicates that "grape" appears twice, and "14" indicates the total number of occurrences of all unary words and all binary words. The contextual information for "grape" is {(eat; left side), (don't eat; left side); (don't spit; right side), (spit out; right side)}. In one embodiment, the contextual information also includes the word frequency and / or part-of-speech of each of the at least one adjacent unary word.
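The counting described above can be illustrated with a minimal Python sketch. It assumes the example sentence has already been segmented into two clauses and that n-grams do not cross the clause boundary (which matches the six binary words listed for the example); the function name `count_ngrams` and the tuple representation of n-grams are hypothetical choices made for illustration, not part of the application.

```python
from collections import Counter, defaultdict

# Pre-segmented clauses of the example sentence (segmentation itself is assumed
# to have been done beforehand, for example by a word segmentation tool).
clauses = [
    ["eat", "grapes", "don't spit", "grape skins"],
    ["don't eat", "grapes", "spit out", "grape skins"],
]

def count_ngrams(clauses, max_n=2):
    """Count n-grams for n = 1..max_n and record each n-gram's neighbours."""
    counts = Counter()
    context = defaultdict(set)  # n-gram -> {(adjacent unary word, side)}
    for words in clauses:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                counts[gram] += 1
                if i > 0:
                    context[gram].add((words[i - 1], "left"))
                if i + n < len(words):
                    context[gram].add((words[i + n], "right"))
    return counts, context

counts, context = count_ngrams(clauses)
total = sum(counts.values())  # occurrences of all unary and binary words

print(counts[("grapes",)], "/", total)
print(sorted(context[("grapes",)]))
```

Under these assumptions the total is 14 and "grapes" occurs twice, which corresponds to the word frequency 2 / 14 given in the example.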

[0039] In one embodiment, before step S110, a step of preprocessing the predetermined text sequence to obtain multiple n-gram words may be included. For example, a word segmentation tool, such as jieba, is first used to segment the predetermined text sequence, the frequency of each word is counted based on the jieba dictionary, and the context information of each word is recorded, thus saving the frequency and context information of each word.

[0040] The tree storage structure is, for example, a trie. In the tree storage structure, n-level nodes store n-gram words. The n+1-gram words stored in the n+1-level nodes of the same path depend on the n-gram words stored in the n-level nodes and their context information. For example, an n+1-gram word is composed of the n-gram word stored in the n-level node and an adjacent unary word of that n-gram word. Figure 2 is a partial schematic diagram of a tree storage structure provided in an embodiment of this application. As shown in Figure 2, the tree storage structure includes multiple first-level nodes and multiple second-level nodes. The multiple first-level nodes store the multiple unary words, and the multiple second-level nodes store the multiple binary words. Each unary word and each binary word carries its own contextual information. Within the same path, a binary word depends on the unary word and its contextual information; that is, a binary word is determined by the unary word, one of its adjacent unary words, and their positional relationship. Taking "grape" as an example, the first-level node stores "grape" and the contextual information of "grape". The contextual information of "grape" is {(eat; left side), (don't eat; left side); (don't spit; right side), (spit out; right side)}. Correspondingly, the binary words stored in the second-level nodes can include "eat grapes", "don't eat grapes", "grapes don't spit", and "grapes spit out". These four binary words can be stored in four different second-level nodes, or in the same second-level node. Alternatively, the binary words formed with left-neighboring words, i.e., "eat grapes" and "don't eat grapes", can be stored in one second-level node, and the binary words formed with right-neighboring words, i.e., "grapes don't spit" and "grapes spit out", can be stored in another second-level node.

[0041] Step S120: Based on the tree storage structure, candidate words are determined from multiple n-gram words. Generally speaking, unigram words are commonly used words, and there is no need to discover new words from unigram words; candidate words can be found from multi-gram words.

[0042] Step S130: Filter candidate words based on part of speech to obtain new words.

[0043] Once candidate words are identified, their part of speech is used to determine whether they are new words. For example, if at least one of the first and last unary words of a candidate word is a noun, then the candidate word is considered a new word.

[0044] The part of speech of at least one unary word among the candidate words can be obtained through the following two methods.

[0045] The first implementation method includes: after identifying candidate words, performing part-of-speech tagging on the candidate words to obtain the part-of-speech tags for multiple unary words. For example, the hanlp package can be used to perform part-of-speech tagging on the candidate words.

[0046] The second implementation includes: the context information is marked with the part-of-speech tag of at least one unary word. In this case, the part-of-speech tag of at least one unary word constituting the candidate word is output along with the candidate word.

[0047] For example, for "front MacPherson independent suspension", the result after part-of-speech tagging is [('front', 'f'), ('MacPherson style', 'n'), ('independent', 'v'), ('suspension', 'n')], forming a modifier-head structure. Therefore, it can be considered to have complete semantic information and thus can be regarded as a new word. Another example is "front MacPherson independent suspension drive", whose part-of-speech tagging result is [('front', 'f'), ('MacPherson style', 'n'), ('independent', 'ad'), ('suspension', 'v'), ('drive', 'v')]. Since the last word "drive" is a verb, this candidate does not have complete semantics. Therefore, "front MacPherson independent suspension drive" should not be regarded as a new word. It can be seen that by filtering candidate words according to their part-of-speech tags, new words that better conform to the word-formation structure can be determined.
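The filtering rule can be sketched in Python as follows. This is only an illustration of the rule stated for step S130 (a candidate is kept when at least one of its first and last unary words is a noun). The function name `is_new_word` is hypothetical, and treating every tag that begins with "n" as a noun tag is an assumption about the tag set, not something the application specifies.

```python
def is_new_word(tagged_words):
    """Keep a candidate if its first or last unary word is tagged as a noun.

    tagged_words is a list of (unary word, part-of-speech tag) pairs.
    Assumption: noun tags are exactly those starting with "n".
    """
    if not tagged_words:
        return False
    first_tag = tagged_words[0][1]
    last_tag = tagged_words[-1][1]
    return first_tag.startswith("n") or last_tag.startswith("n")

# The two tagged candidates from the example above.
suspension = [("front", "f"), ("MacPherson style", "n"),
              ("independent", "v"), ("suspension", "n")]
suspension_drive = [("front", "f"), ("MacPherson style", "n"),
                    ("independent", "ad"), ("suspension", "v"), ("drive", "v")]

print(is_new_word(suspension))        # last word is a noun
print(is_new_word(suspension_drive))  # neither first nor last word is a noun
```

With these inputs the first candidate is kept and the second is rejected, in line with the two examples in the text.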

[0048] According to the new word discovery method provided by the embodiments of the present application, a tree storage structure is used to store multiple n-gram words. The n-level nodes of the tree storage structure store n-gram words, and the n+1-gram words stored in the n+1-level nodes in the same path depend on the context information of the n-gram words stored in the n-level nodes. Based on the tree storage structure, candidate words among the multiple n-gram words are determined, and the candidate words are filtered based on part of speech to obtain new words. First, each node in the tree storage structure connects to the next node, and the n-gram words are connected to the n+1-gram words through context information. Both the data and the storage structure therefore contain a chain relationship, and the two are well matched, so the tree storage structure facilitates the storage of context relationships. Second, because the n-gram words and the n+1-gram words are connected through the context information of the n-gram words, the n+1-gram words can be directly located based on the n-gram words without traversing the entire database to find them, thus improving the data query speed. Third, the added step of filtering candidate words based on part of speech yields new words that better conform to the word-formation structure.

[0049] Figure 3 is a flowchart illustrating the new word discovery method provided in the second embodiment of this application. In this embodiment, step S110 specifically includes:

[0050] Step S310: Store multiple unary words into multiple first-level nodes respectively.

[0051] Referring to Figure 2, the multiple unary words, namely "eat," "grape," "don't spit," "grape skin," "don't eat," and "spit out," are stored into multiple first-level nodes respectively. Each unary word carries word frequency and contextual information. Taking "grape" as an example, the storage content of its first-level node can be a dictionary structure, such as {(grape, word frequency); (eat, left side), (don't eat, left side), (don't spit, right side), (spit out, right side)}, where "grape" is the key of this first-level node, and all other content is the value of this first-level node. In one embodiment, the contextual information also includes the word frequency and / or part-of-speech of at least one adjacent unary word.

[0052] Step S320: For each first-level node, determine at least one binary word based on the unary word and context information stored in that first-level node.

[0053] Specifically, taking the first-level node containing "grapes" as an example, the stored content of this first-level node includes {(grapes, word frequency); (eat, left side), (don't eat, left side), (don't spit, right side), (spit out, right side)}, from which four binary words can be determined, namely "eat grapes", "don't eat grapes", "grapes don't spit", and "grapes spit out".

[0054] Step S330: Store at least one binary word into at least one second-level node corresponding to the first-level node.

[0055] Figure 4 is a flowchart illustrating the new word discovery method provided in the third embodiment of this application. In this embodiment, step S110 further includes, after step S310:

[0056] Step S410: For each first-level node, at least one adjacent unary word of the unary word stored in the first-level node is stored in at least one second-level node. In this case, at least one adjacent unary word can serve as the index of the at least one second-level node.

[0057] In this case, step S330 is specifically executed as follows:

[0058] Step S420: Store at least one binary word into the second-level node containing its adjacent unary word. That is, the binary word stored in each second-level node includes the adjacent unary word stored in the same second-level node. For example, the binary word stored in the second-level node containing "eat" is "eat grapes".
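Steps S310, S410 and S420 can be sketched as a small two-level tree in Python. This is a minimal illustration under stated assumptions, not the implementation of the application: the `Node` class, the `build_tree` function, the use of a `(adjacent word, side)` pair as the index of a second-level node, and the hand-written counts for the grape example are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    words: tuple                                   # the n-gram stored in this node
    freq: int = 0                                  # raw occurrence count
    context: set = field(default_factory=set)      # {(adjacent unary word, side)}
    children: dict = field(default_factory=dict)   # (adjacent word, side) -> Node

def build_tree(counts, context):
    """Build first-level nodes for unary words and second-level nodes for binary words."""
    roots = {}
    # S310: one first-level node per unary word.
    for gram, freq in counts.items():
        if len(gram) == 1:
            roots[gram[0]] = Node(gram, freq, set(context.get(gram, ())))
    # S410 / S420: each adjacent unary word indexes a second-level node,
    # which stores the binary word formed with that neighbour.
    for word, node in roots.items():
        for neighbour, side in node.context:
            bigram = (neighbour, word) if side == "left" else (word, neighbour)
            node.children[(neighbour, side)] = Node(bigram, counts.get(bigram, 0))
    return roots

# Hand-written counts and contexts for the grape example (assumed input).
counts = {
    ("eat",): 1, ("grapes",): 2, ("don't spit",): 1, ("grape skins",): 2,
    ("don't eat",): 1, ("spit out",): 1,
    ("eat", "grapes"): 1, ("grapes", "don't spit"): 1, ("don't spit", "grape skins"): 1,
    ("don't eat", "grapes"): 1, ("grapes", "spit out"): 1, ("spit out", "grape skins"): 1,
}
context = {
    ("eat",): {("grapes", "right")},
    ("grapes",): {("eat", "left"), ("don't eat", "left"),
                  ("don't spit", "right"), ("spit out", "right")},
    ("don't spit",): {("grapes", "left"), ("grape skins", "right")},
    ("grape skins",): {("don't spit", "left"), ("spit out", "left")},
    ("don't eat",): {("grapes", "right")},
    ("spit out",): {("grapes", "left"), ("grape skins", "right")},
}

roots = build_tree(counts, context)
# Direct location: follow the context entry of "grapes" to its second-level node.
child = roots["grapes"].children[("don't spit", "right")]
print(child.words, child.freq)
```

The lookup at the end is a pair of dictionary accesses along one path, which is the point of the structure: the binary word is reached from the unary word without scanning all stored entries.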

[0059] Figure 5 is a flowchart illustrating the new word discovery method provided in the fourth embodiment of this application. In this embodiment, step S120 specifically includes:

[0060] Step S510: For each path, determine the cohesion of the n+1-gram words stored in the n+1-level node based on the n-level node and the n+1-level node.

[0061] The cohesion is calculated from the following quantities: p(x, y) represents the probability of the word formed by combining x and y, p(x) represents the probability of x occurring, and p(y) represents the probability of y occurring. The higher the cohesion value, the more the compound word resembles a meaningful collocation; the lower the cohesion value, the more the compound word resembles a random combination of x and y.

[0062] When the context information includes at least one adjacent unary word, its respective word frequency, and its positional relationship with the n-gram word, cohesion can be determined based on the word frequencies of the n-gram word stored in the n-level nodes, the word frequencies of adjacent unary words, and the word frequencies of the n+1-gram word stored in the n+1-level nodes. When the context information does not include the word frequencies of at least one adjacent unary word, it is necessary to search for that at least one adjacent unary word in the first-level nodes of the tree storage structure to obtain the word frequencies of each adjacent unary word. In comparison, the former saves one search step, which can further improve the search speed.

[0063] Referring to Figure 2 and taking the path "grape - grape not spitting out" as an example: when "grape" is looked up, its word frequency is recorded as 2/14. Based on the context information (not spitting out; right side) carried by "grape", "grape not spitting out" is located directly, and its word frequency is recorded as 1/14. "Not spitting out" is then looked up in the first-level nodes of the tree storage structure, and its word frequency is recorded as 1/14. The cohesion is calculated from the word frequencies of "grape", "not spitting out", and "grape not spitting out". Compared with the conventional cohesion calculation process described at the beginning of the specification, the process provided in this embodiment locates "grape not spitting out" directly from the context information carried by "grape", without traversing the database, thereby improving the search speed.
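The cohesion for this path can be worked through numerically. The sketch below assumes the usual pointwise mutual information form, log(p(x, y) / (p(x) p(y))), and a base-10 logarithm; the specification does not state the base, and the `cohesion` function name is hypothetical.

```python
import math


def cohesion(p_xy, p_x, p_y, base=10):
    """Pointwise mutual information: log(p(x, y) / (p(x) * p(y)))."""
    return math.log(p_xy / (p_x * p_y), base)


p_grape = 2 / 14             # frequency of "grape"
p_not_spitting_out = 1 / 14  # frequency of "not spitting out"
p_combined = 1 / 14          # frequency of "grape not spitting out"

pmi = cohesion(p_combined, p_grape, p_not_spitting_out)
print(round(pmi, 3))  # 0.845
```

The ratio inside the logarithm is 7, so under the base-10 assumption the cohesion is log10(7), roughly 0.845. A different base rescales the value and would call for a correspondingly rescaled first threshold.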

[0064] Step S520: Determine the degrees of freedom of the n+1-gram words based on the n+1-level nodes.

[0065] Degrees of freedom can be measured using information entropy. In this embodiment, the smaller of the information entropy of all left-neighbor words and the information entropy of all right-neighbor words is used as the degree of freedom. That is, a word that is not free on one of its sides cannot be considered a separate word. In the information entropy calculation, for the information entropy of left-neighbor words, y represents the probability of a certain left-neighbor word appearing, and x represents the sum of the probabilities of all left-neighbor words appearing. For the information entropy of right-neighbor words, y represents the probability of a certain right-neighbor word appearing, and x represents the sum of the probabilities of all right-neighbor words appearing.

[0066] When the context information includes at least one adjacent unary word, its respective word frequency, and its positional relationship with the n-gram word, at least one left neighbor word and at least one right neighbor word can be determined based on the context information carried by the n+1-gram word stored in the n+1-level nodes; the first information entropy is determined based on the word frequency of each of the at least one left neighbor word; the second information entropy is determined based on the word frequency of each of the at least one right neighbor word; and the larger of the first and second information entropies is determined as the degree of freedom. When the context information does not include the word frequency of each of the at least one adjacent unary word, it is also necessary to search for the word frequencies of each of the at least one left neighbor word and at least one right neighbor word from the first-level nodes of the tree storage structure.

[0067] Continuing the previous example, after "grape not spitting out" is located, a left neighbor word and a right neighbor word, namely "eat" and "grape skin", are determined from the adjacent unary words and positional relationships in its context information, i.e., {(eat; left side); (grape skin; right side)}. "Eat" and "grape skin" are looked up in the first-level nodes of the tree storage structure, and their word frequencies are recorded as 1/14 and 2/14, respectively. The first information entropy, of the left neighbor word, is $H_1 = 0.081$, and the second information entropy, of the right neighbor word, is $H_2 = 0.121$. Since $H_1 < H_2$, the first information entropy $H_1$ is taken as the degree of freedom.
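The entropy values in this example can be approximated with a short Python sketch. It assumes the form H = -Σ p·log10(p) over the neighbor word frequencies, which is not spelled out in the text above but gives values within about 0.001 of those stated. It takes the smaller entropy as the degree of freedom, as this worked example does. The function names are hypothetical.

```python
import math


def entropy(freqs, base=10):
    """Assumed form: H = -sum(p * log(p)) over the neighbor word frequencies."""
    return -sum(p * math.log(p, base) for p in freqs if p > 0)


def degree_of_freedom(left_freqs, right_freqs):
    """Return the smaller of the left and right entropies, as in the example."""
    return min(entropy(left_freqs), entropy(right_freqs))


h1 = entropy([1 / 14])  # left neighbor "eat"
h2 = entropy([2 / 14])  # right neighbor "grape skin"
dof = degree_of_freedom([1 / 14], [2 / 14])
print(round(h1, 4), round(h2, 4), dof == h1)
```

With several neighbor words on one side, their frequencies would simply be passed together in the same list.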

[0068] Step S530: When the cohesion (PMI) is greater than a first threshold and the degree of freedom is greater than a second threshold, the n+1-gram word is determined as a candidate word. In this embodiment, when the cohesion (PMI) is greater than the first threshold and the degree of freedom, i.e., the first information entropy $H_1$, is greater than the second threshold, "grape not spitting out" is determined as a candidate word. The first and second thresholds can be set reasonably according to actual needs.
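The decision in step S530 reduces to two comparisons, sketched below. The threshold values are hypothetical and chosen only to show both outcomes. The cohesion score 0.845 is the base-10 PMI of the frequencies given in [0063], which assumes a logarithm base the specification does not state. The degree of freedom 0.081 is the value given in [0067].

```python
def is_candidate(cohesion, freedom, first_threshold, second_threshold):
    """Step S530: both scores must exceed their respective thresholds."""
    return cohesion > first_threshold and freedom > second_threshold


# Scores for the running example; the thresholds are hypothetical.
accepted = is_candidate(0.845, 0.081, first_threshold=0.5, second_threshold=0.05)
rejected = is_candidate(0.845, 0.081, first_threshold=0.5, second_threshold=0.10)
print(accepted, rejected)  # True False
```

Raising either threshold makes the selection stricter, which is why both are left to be tuned for the corpus at hand.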

[0069] Figure 6 is a flowchart illustrating the new word discovery method provided in the fifth embodiment of this application. In this embodiment, the new word discovery method 600, based on the new word discovery method provided in any of the above embodiments, further includes:

[0070] Step S610: Among the multiple n-gram words obtained, replace the word frequency of each n-gram word whose word frequency is greater than a word frequency threshold with the word frequency threshold.

[0071] For example, if the preset word frequency threshold is 1/14 and the word frequencies of "grape" and "grape skin" are both 2/14, then the word frequencies of "grape" and "grape skin" are replaced with 1/14. It should be understood that this example is given only for consistency with the embodiment in the specification. In practice, this step reduces the weight of common words, such as "today," "China," and "of," thereby reducing the impact of extreme values on cohesion and degrees of freedom and improving the accuracy of new word discovery.
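The replacement in step S610 amounts to capping each frequency at the threshold. A minimal sketch follows, using exact fractions so the example values print as written; the `clip_frequencies` name is hypothetical.

```python
from fractions import Fraction


def clip_frequencies(freqs, threshold):
    """Step S610: cap every word frequency at the word frequency threshold."""
    return {word: min(freq, threshold) for word, freq in freqs.items()}


freqs = {
    "grape": Fraction(2, 14),
    "grape skin": Fraction(2, 14),
    "eat": Fraction(1, 14),
}
clipped = clip_frequencies(freqs, Fraction(1, 14))
print(clipped["grape"], clipped["eat"])  # 1/14 1/14
```

Words at or below the threshold are left unchanged, so only unusually frequent words lose weight.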

[0072] This application also provides a new word discovery device. Figure 7 is a structural block diagram of the new word discovery device provided in the first embodiment of this application. As shown in Figure 7, the new word discovery device 70 includes a storage module 71, a determination module 72, and a filtering module 73. The storage module 71 is configured to store multiple n-gram words using a tree-structured storage method. These n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence, where n is a series of consecutive natural numbers starting from 1. Each n-gram word carries word frequency and context information, the context information including at least one adjacent unary word and the positional relationship between each adjacent unary word and the n-gram word. The n-level nodes of the tree-structured storage method store the n-gram words, and the n+1-gram words stored in the n+1-level nodes of the same path depend on the context information of the n-gram words stored in the n-level nodes. The determination module 72 is configured to determine candidate words from the multiple n-gram words based on the tree-structured storage method. The filtering module 73 is configured to filter the candidate words based on their parts of speech to obtain new words.

[0073] In one embodiment, the multiple n-gram words include multiple unary words and multiple binary words. The storage module 71 is specifically configured to store the multiple unary words into multiple first-level nodes; for each first-level node, determine at least one binary word based on the unary words stored in the first-level node and their accompanying context information; and store the at least one binary word into at least one second-level node corresponding to the first-level node.

[0074] In one embodiment, the storage module 71 is specifically used to store multiple unary words into multiple first-level nodes; for each first-level node, at least one adjacent unary word of the unary word stored in the first-level node is stored into at least one second-level node, and at least one binary word is determined based on the unary word stored in the first-level node and its carried context information; and at least one binary word is stored into the second-level node where its respective adjacent unary word is located.

[0075] In one embodiment, within the same path, the n+1-gram words stored in the n+1-level nodes consist of the n-gram words stored in the n-level nodes and their adjacent unary words. In this case, the determination module 72 is specifically configured to: for each path in the tree storage structure, determine the cohesion of the n+1-gram words stored in the n+1-level nodes based on the n-level nodes and the n+1-level nodes; determine the degrees of freedom of the n+1-gram words based on the n+1-level nodes; and determine the n+1-gram words as candidate words when the cohesion is greater than a first threshold and the degrees of freedom are greater than a second threshold.

[0076] Here, determining the cohesion of the n+1-gram words stored in the n+1-level nodes based on the n-level nodes and the n+1-level nodes includes: determining the cohesion based on the word frequency of the n-gram words stored in the n-level nodes, the word frequency of the adjacent unary words, and the word frequency of the n+1-gram words stored in the n+1-level nodes.

[0077] Determining the degrees of freedom of the n+1-gram words based on the n+1-level nodes includes: determining at least one left neighbor word and at least one right neighbor word based on the context information carried by the n+1-gram words stored in the n+1-level nodes; determining the first information entropy based on the word frequency of each of the at least one left neighbor word; determining the second information entropy based on the word frequency of each of the at least one right neighbor word; and determining the larger of the first information entropy and the second information entropy as the degree of freedom.

[0078] In one embodiment, the filtering module 73 is specifically configured to filter the candidate words based on the part of speech of at least one n-gram word in the candidate words to obtain new words. For example, when the first and last n-gram words of a candidate word are both nouns, the candidate word is determined to be a new word. Filtering candidate words by part of speech ensures that the new words obtained better conform to the word formation structure.
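The part-of-speech rule in the example above can be sketched as follows. The `filter_by_pos` function, the part-of-speech lookup table, and the candidate words are all hypothetical, for illustration only; a real system would obtain tags from a part-of-speech tagger.

```python
def filter_by_pos(candidates, pos_of):
    """Keep candidates whose first and last unary words are both nouns."""
    new_words = []
    for units in candidates:
        if pos_of.get(units[0]) == "noun" and pos_of.get(units[-1]) == "noun":
            new_words.append(units)
    return new_words


# Hypothetical part-of-speech lookup and candidates, for illustration only.
pos_of = {"cloud": "noun", "server": "noun", "eat": "verb", "grape": "noun"}
candidates = [("cloud", "server"), ("eat", "grape")]
new_words = filter_by_pos(candidates, pos_of)
print(new_words)  # [('cloud', 'server')]
```

Each candidate is represented as a tuple of its unary words so that the first and last elements can be checked directly.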

[0079] According to the new word discovery device provided in this embodiment, a tree-structured storage method is used to store multiple n-gram words. The n-level nodes of the tree structure store n-gram words, and the n+1-gram words stored in the n+1-level nodes within the same path depend on the context information of the n-gram words stored in the n-level nodes. Candidate words are determined from the multiple n-gram words based on the tree structure, and these candidate words are filtered based on part-of-speech tags to obtain new words. On one hand, in the tree structure, each node connects to the next, and n-gram words are connected to (n+1)-gram words through context information. Therefore, both the data and the storage results exhibit chain-like relationships, and they are well-matched. Thus, the tree structure facilitates the storage of contextual relationships. On the other hand, since n-gram words and n+1-gram words are connected through the context information of the n-gram words, the n+1-gram words can be directly located based on the n-gram words without traversing the entire database to find them, thereby improving data query speed. Furthermore, the addition of a step to filter candidate words using part-of-speech tags results in new words that better conform to the word formation structure.

[0080] Figure 8 is a structural block diagram of the new word discovery device provided in the second embodiment of this application. As shown in Figure 8, the new word discovery device 80, based on the new word discovery device 70, further includes a replacement module 81, configured to replace, among the acquired n-gram words, the word frequency of each n-gram word whose word frequency is greater than a word frequency threshold with the word frequency threshold.

[0081] According to the new word discovery device provided in this embodiment, by setting the replacement module 81, the word frequency of n-gram words with a word frequency greater than the word frequency threshold can be replaced with the word frequency threshold, thereby reducing the impact of extreme values on cohesion and degrees of freedom, and thus improving the accuracy of new word discovery.

[0082] The new word discovery apparatus provided in any embodiment of this application belongs to the same concept as the new word discovery method provided in the embodiments of this application. It can execute the new word discovery method provided in any embodiment of this application and has the corresponding functional modules and beneficial effects of executing the new word discovery method. Technical details not described in detail in this embodiment can be found in the new word discovery method provided in the embodiments of this application, and will not be repeated here.

[0083] This application also provides a computer device. Figure 9 is a structural block diagram of a computer device provided in one embodiment of this application. As shown in Figure 9, the computer device 90 includes a memory 91, a processor 92, and a computer program stored on the memory 91 and executed by the processor 92. When the processor 92 executes the computer program, it implements the steps of the new word discovery method as provided in any of the above embodiments.

[0084] Memory 91 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory includes random access memory (RAM) and / or cache memory, etc. Non-volatile memory includes read-only memory (ROM), hard disk, flash memory, etc. Memory 91 may also store a tree-like storage structure of multiple n-gram words.

[0085] Processor 92 may be a processing unit with data processing and / or instruction execution capabilities, such as a central processing unit (CPU).

[0086] Computer programs can be written in one or more programming languages. Programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C. A computer program can execute entirely on a computer device 90, partially on a computer device 90 and partially on a server, or as a standalone software package.

[0087] In one embodiment, the computer device 90 further includes an input device 93 and an output device 94, which are respectively connected to the processor 92. The input device 93 may be a microphone or microphone array for capturing sound signals. The input device 93 may also be a keyboard, mouse, etc. The output device 94 can output various information to the outside, including newly identified words. The output device 94 may be a monitor, speaker, printer, etc.

[0088] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the new word discovery method provided in any of the above embodiments.

[0089] Computer-readable storage media can take the form of any combination of one or more readable media. A readable storage medium can be any of the following forms: electrical, magnetic, optical, electromagnetic, infrared, semiconductor, or a combination thereof. Examples of readable storage media include: hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, etc.

[0090] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this application to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations thereof.

Claims

1. A method for discovering new words, characterized in that it comprises: storing multiple n-gram words using a tree-structured storage method, wherein the n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence; the value of n is a series of consecutive natural numbers starting from 1; each n-gram word carries word frequency and context information; the context information includes at least one adjacent unary word and the positional relationship between the at least one adjacent unary word and the n-gram word; the n-level nodes of the tree-structured storage method store n-gram words; the n+1-gram words stored in the n+1-level nodes of the same path depend on the context information of the n-gram words stored in the n-level nodes; and the n-gram words and the n+1-gram words are connected through the context information of the n-gram words; determining, based on the tree-structured storage, candidate words from the multi-gram words among the multiple n-gram words; and filtering the candidate words based on their parts of speech to obtain new words.

2. The new word discovery method according to claim 1, characterized in that the multiple n-gram words include multiple unary words and multiple binary words; and storing the multiple n-gram words using the tree-structured storage method includes: storing the multiple unary words in multiple first-level nodes respectively; for each first-level node, determining at least one binary word based on the unary word stored in the first-level node and the context information it carries; and storing the at least one binary word in at least one second-level node corresponding to the first-level node.

3. The new word discovery method according to claim 2, characterized in that, after storing the multiple unary words into multiple first-level nodes, the method further includes: for each first-level node, storing at least one adjacent unary word of the unary word stored in the first-level node in at least one second-level node; and the step of storing the at least one binary word into at least one second-level node corresponding to the first-level node includes: storing the at least one binary word in the second-level node containing its adjacent unary word.

4. The new word discovery method according to any one of claims 1-3, characterized in that, in the same path, the n+1-gram words stored in the n+1-level nodes are composed of the n-gram words stored in the n-level nodes and the adjacent unary words of the n-gram words; and determining candidate words from the multiple n-gram words based on the tree-structured storage includes: for each path in the tree-structured storage, determining the cohesion of the n+1-gram words stored in the n+1-level node based on the n-level node and the n+1-level node; determining the degrees of freedom of the n+1-gram words based on the n+1-level nodes; and when the cohesion is greater than a first threshold and the degree of freedom is greater than a second threshold, determining the n+1-gram words as candidate words.

5. The new word discovery method according to claim 4, characterized in that determining the cohesion of the n+1-gram words stored in the n+1-level node based on the n-level node and the n+1-level node includes: determining the cohesion based on the word frequency of the n-gram words stored in the n-level nodes, the word frequency of the adjacent unary words, and the word frequency of the n+1-gram words stored in the n+1-level nodes.

6. The new word discovery method according to claim 4, characterized in that determining the degrees of freedom of the n+1-gram words based on the n+1-level nodes includes: determining at least one left neighbor word and at least one right neighbor word based on the context information carried by the n+1-gram words stored in the n+1-level nodes; determining the first information entropy based on the word frequency of each of the at least one left neighbor word; determining the second information entropy based on the word frequency of each of the at least one right neighbor word; and determining the larger of the first information entropy and the second information entropy to be the degree of freedom.

7. The new word discovery method according to any one of claims 1-3, characterized in that filtering the candidate words to obtain new words includes: filtering the candidate words based on the part of speech of at least one n-gram word in the candidate words to obtain the new words.

8. The new word discovery method according to claim 7, characterized in that filtering the candidate words based on the part of speech of at least one n-gram word in the candidate words to obtain the new words includes: when at least one of the first and last unary words of a candidate word is a noun, determining the candidate word to be a new word.

9. The new word discovery method according to any one of claims 1-3, characterized in that, before storing the acquired n-gram words using a tree-structured storage method, the method further includes: among the obtained n-gram words, replacing the word frequency of each n-gram word whose word frequency is greater than a word frequency threshold with the word frequency threshold.

10. A new word discovery device, characterized in that it comprises: a storage module configured to store multiple n-gram words using a tree-structured storage method, wherein the n-gram words are obtained by performing n-gram word frequency statistics on a predetermined text sequence; the value of n is a series of consecutive natural numbers starting from 1; each n-gram word carries word frequency, part-of-speech tag, and context information; the context information includes at least one adjacent unary word and the positional relationship between the at least one adjacent unary word and the n-gram word; the n-level nodes of the tree-structured storage method store n-gram words; the n+1-gram words stored in the n+1-level nodes of the same path depend on the context information of the n-gram words stored in the n-level nodes; and the n-gram words and the n+1-gram words are connected through the context information of the n-gram words; a determination module configured to determine, based on the tree storage structure, candidate words from the multi-gram words among the multiple n-gram words; and a filtering module configured to filter the candidate words to obtain new words.

11. A computer device comprising a memory, a processor, and a computer program stored in the memory and executed by the processor, characterized in that, when the processor executes the computer program, it implements the steps of the new word discovery method as described in any one of claims 1 to 9.

12. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the new word discovery method as described in any one of claims 1 to 9.