[0061] Hereinafter, the present invention will be described in detail with reference to the drawings.
[0062] Because the present invention builds on the double array Trie, the double array Trie is introduced first, before the fuzzy thesaurus query of the present invention is described in detail.
[0063] To query a double array Trie, the double array Trie must first be constructed, that is, the base value array and the corresponding check value array must be determined.
[0064] Suppose the thesaurus contains only the words "啊" (ah), "阿根廷" (Argentina), "阿胶" (ejiao, donkey-hide gelatin), "阿拉伯" (Arab), "阿拉伯人" (Arab people) and "埃及" (Egypt).
[0065] First, encode all 10 Chinese characters that appear in the thesaurus: 啊-1, 阿-2, 埃-3, 根-4, 胶-5, 拉-6, 及-7, 廷-8, 伯-9, 人-10. This coding scheme is not the only possibility; it is only required that every character in the thesaurus corresponds to a unique code. The codes can be assigned sequentially, or the code that each Chinese character already has in the computer can be used. In the former case, a coding mapping unit must be created to store the one-to-one correspondence between characters and codes. In the latter case, the coding mapping unit can be omitted to save storage space.
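The two coding options described above can be sketched as follows. This is a minimal illustration only: the function names `sequential_codes` and `builtin_code` are hypothetical, and the scanning order (position by position across the thesaurus) is an assumption chosen because it happens to reproduce the codes listed above; the source does not prescribe how sequential codes are assigned.

```python
def sequential_codes(words):
    """Assign 1-based codes in order of first appearance.

    Assumption: characters are scanned position by position (all first
    characters, then all second characters, and so on).
    """
    codes = {}
    longest = max(len(w) for w in words)
    for pos in range(longest):
        for w in words:
            if pos < len(w) and w[pos] not in codes:
                codes[w[pos]] = len(codes) + 1
    return codes


def builtin_code(ch):
    """Alternative: reuse the code the character already has (its code point).

    No mapping table has to be stored in this case.
    """
    return ord(ch)


lexicon = ["啊", "阿根廷", "阿胶", "阿拉伯", "阿拉伯人", "埃及"]
codes = sequential_codes(lexicon)
print(codes)
```

With sequential coding the mapping `codes` has to be stored alongside the index, whereas `builtin_code` needs no table but produces much larger code values.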
[0066] Then, the thesaurus is represented by a Trie structure, as shown in Figure 3.
[0067] Subsequently, a double array Trie is constructed to determine the base value array base[] and the corresponding check value array check[].
[0068] For each Chinese character, a base value must be determined so that all words beginning with that character can be placed in the double array. For example, to determine the base value of the character "阿", suppose the codes of the second characters of the words beginning with "阿" are a1, a2, a3, ..., an. A value i must be found such that base[i+a1], check[i+a1], base[i+a2], check[i+a2], ..., base[i+an], check[i+an] are all 0. Once such an i is found, the base value of "阿" is set to i. The double array Trie is constructed in this way: after several traversals, all the words have been placed in the double array, and the thesaurus is then traversed once more to modify the base values. Suppose a negative base value indicates that the state at that position is a word. If state i corresponds to a word and base[i]=0, then let base[i]=(-1)*i; if base[i] is not 0, then let base[i]=(-1)*base[i]. The resulting double array is shown in Table 1. It should be noted that Table 1 is only an example of a double array.
[0069] Table 1
[0070]
Subscript  Base  Check  Affix
1            -1      0  啊
2             4      0  阿
3             4      0  埃
4             0      0
5             0      0
6             0      0
7             0      0
8             4      2  阿根
9            -9      2  阿胶 (ejiao)
10            4      2  阿拉
11          -11      3  埃及 (Egypt)
12          -12      8  阿根廷 (Argentina)
13           -4     10  阿拉伯 (Arab)
14          -14     13  阿拉伯人 (Arab people)
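The construction rule described above (find a value i whose target cells are all 0, then negate the base values of word states) can be sketched as follows. This is a minimal first-fit sketch under stated assumptions, not the construction of the present invention: the names `build_double_array` and `contains` are hypothetical, and subscripts 1..n are assumed to be reserved for single-character states so that child states never occupy them. Because the search order for i is not specified in the source, the arrays produced here are valid but differ from Table 1.

```python
from collections import deque

CODE = {"啊": 1, "阿": 2, "埃": 3, "根": 4, "胶": 5,
        "拉": 6, "及": 7, "廷": 8, "伯": 9, "人": 10}
WORDS = ["啊", "阿根廷", "阿胶", "阿拉伯", "阿拉伯人", "埃及"]


def build_double_array(words, code):
    # children[affix] = characters that can follow that affix
    children = {}
    for w in words:
        for k in range(len(w)):
            children.setdefault(w[:k], set()).add(w[k])

    n = len(code)
    base = [0] * (n + 1)   # subscript 0 is unused
    check = [0] * (n + 1)
    # Single-character states sit at the subscript equal to their code.
    index = {ch: code[ch] for ch in children.get("", ())}

    def is_free(pos):
        # Assumption: subscripts 1..n stay reserved for single characters.
        if pos <= n:
            return False
        return pos >= len(base) or (base[pos] == 0 and check[pos] == 0)

    queue = deque(sorted(index, key=index.get))
    while queue:
        affix = queue.popleft()
        kids = sorted(children.get(affix, ()), key=code.get)
        if not kids:
            continue
        i = 1
        while not all(is_free(i + code[c]) for c in kids):
            i += 1
        grow = i + code[kids[-1]] + 1 - len(base)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([0] * grow)
        base[index[affix]] = i
        for c in kids:
            pos = i + code[c]
            check[pos] = index[affix]   # remember the parent state
            index[affix + c] = pos
            queue.append(affix + c)

    # Final pass: a negative base value marks a state that is a word.
    for w in words:
        s = index[w]
        base[s] = -s if base[s] == 0 else -base[s]
    return base, check


def contains(word, base, check, code):
    if not word or word[0] not in code:
        return False
    s = code[word[0]]
    for ch in word[1:]:
        if ch not in code:
            return False
        t = abs(base[s]) + code[ch]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0


base, check = build_double_array(WORDS, CODE)
print(all(contains(w, base, check, CODE) for w in WORDS))
```

Reserving the low subscripts keeps the lookup of the first character unambiguous at the cost of a slightly longer array; a denser layout is possible if the first-character states are validated separately.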
[0071] The double array generated by the above method treats the affixes "啊", "阿", "埃", "阿根", "阿拉", "阿胶", "埃及", "阿拉伯", "阿拉伯人" and "阿根廷" as states. An affix here differs from the traditional notion of an affix. It can be a single character such as "啊", "阿" or "埃", a complete word such as "阿胶", "埃及", "阿拉伯", "阿拉伯人" or "阿根廷", or merely a prefix or suffix, such as "阿根" or "阿拉". Each state corresponds to one subscript of the arrays. For example, the subscript of "阿根" is i=8; check[i]=2 is then the subscript of the preceding state "阿", and base[i]=4 is the base value used to compute the subscript of "阿根廷". The code of "廷" is x=8, so the subscript of "阿根廷" is base[i]+x=base[8]+8=12. In other words, each affix corresponds one-to-one to an array element of the base value array and of the check value array.
[0072] Finally, the specific query process is:
[0073] The query process of a double array Trie is in fact the state-transition process of a DFA (deterministic finite automaton), and it is relatively simple to implement: state transitions only need to be performed according to the state subscripts. For example, to query "阿根廷", first use the code b=2 of "阿" to find the subscript 2 of the state "阿". Then use the code d=4 of "根" to find the subscript base[b]+d=8 of "阿根"; because check[base[b]+d]=b, "阿根" is part of some word and the query can continue. Next, find the state "阿根廷": its subscript is y=base[8]+8=12, where 8 is the code of "廷". At this point base[y]<0 and check[y]=base[b]+d=8, indicating that "阿根廷" is in the thesaurus, and the query is complete.
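The walk-through above can be checked directly against the values of Table 1. The following is a minimal sketch: the helper names `walk` and `is_word` are hypothetical, and a dummy element is placed at subscript 0 so that list positions match the subscripts of the table.

```python
# Character codes from the worked example.
CODE = {"啊": 1, "阿": 2, "埃": 3, "根": 4, "胶": 5,
        "拉": 6, "及": 7, "廷": 8, "伯": 9, "人": 10}

# Table 1, with a dummy element at subscript 0.
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
CHECK = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]


def walk(word):
    """Return the state subscripts visited, or None if a transition fails.

    Note: the first character is not validated here, because Table 1 stores
    check=0 both for single-character states and for unused cells.
    """
    if not word or word[0] not in CODE:
        return None
    state = CODE[word[0]]
    path = [state]
    for ch in word[1:]:
        if ch not in CODE:
            return None
        nxt = abs(BASE[state]) + CODE[ch]      # base of current state + code
        if nxt >= len(CHECK) or CHECK[nxt] != state:
            return None                        # check must point back to the parent
        state = nxt
        path.append(state)
    return path


def is_word(word):
    path = walk(word)
    return path is not None and BASE[path[-1]] < 0   # negative base marks a word


print(walk("阿根廷"), is_word("阿根廷"))
```

The absolute value of the base is used for transitions because word states such as "阿拉伯" (base -4) can still have successors such as "阿拉伯人".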
[0074] As the query process shows, the query time for a word depends only on the length of the word and not on the number of entries in the thesaurus. In other words, the time complexity is O(1) with respect to the size of the thesaurus, so queries are extremely fast.
[0075] The applicant has found that the fast query speed of the double array Trie can be exploited to implement a fuzzy query function.
[0076] See Figure 4, which is a flowchart of the fuzzy query method for a thesaurus of the present invention. The method includes:
[0077] S110: Establish the data structure of the entry:
[0078] S11: Store all the entries in the thesaurus in the entry storage unit of the entry data structure in order;
[0079] S12: Construct the forward entry index structure of the entry data structure: first assign a unique code to every character of every entry; then construct the double array Trie and determine the base value array and the corresponding check value array; then, for each affix, store the storage address information, in the entry storage unit, of all words beginning with that affix. Each affix corresponds one-to-one to an array unit of the base value array and of the check value array.
[0080] See Figure 5, which is a schematic diagram of the entry data structure. It consists of the thesaurus header 11 and the thesaurus content 12. The thesaurus header holds the index information of the entries, and the thesaurus content stores their detailed information. The thesaurus content is the entry storage unit, and the entries in it can be sorted by keyword in ascending order. With this ordering, a forward query only needs the storage address information of the first and of the last matching entry to obtain all matching entries, which saves index space. For example, because the entries are sorted by keyword, the entries whose keywords start with "apple", such as "apple", "apple juice" and "apple tree", are adjacent in the thesaurus. Giving only the storage address information of "apple" and "apple tree" is therefore enough to obtain the detailed information of "apple", "apple juice" and "apple tree".
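The reason two addresses suffice can be illustrated with a short sketch. It assumes, for illustration only, that an entry's offset is simply its position in a sorted Python list (the source uses offset addresses relative to the first address of the entry storage unit); the function name `prefix_range` and the extra entries "apricot" and "banana" are hypothetical.

```python
def prefix_range(entries, prefix):
    """Return (start, end) positions of the entries starting with prefix, or None."""
    hits = [i for i, e in enumerate(entries) if e.startswith(prefix)]
    if not hits:
        return None
    start, end = hits[0], hits[-1]
    # Because the entries are sorted, the hits form one unbroken run.
    assert hits == list(range(start, end + 1))
    return start, end


entries = sorted(["banana", "apple tree", "apple", "apricot", "apple juice"])
start, end = prefix_range(entries, "apple")
print(entries[start:end + 1])
```

Storing only `start` and `end` per affix keeps the index size independent of how many entries share the prefix.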
[0081] The thesaurus header 11 includes at least a forward term index structure 111. See Figure 6, which is an example diagram of the forward term index structure of the present invention.
[0082] The forward term index structure includes an encoding mapping unit, which assigns a code to every character that appears in the thesaurus. This coding scheme is not the only possibility; it is only required that every character in the thesaurus corresponds to a unique code. The codes can be assigned sequentially, or the code that each Chinese character already has in the computer can be used. In the former case, a coding mapping unit must be created to store the one-to-one correspondence between characters and codes. In the latter case, the coding mapping unit can be omitted to save storage space.
[0083] It also includes a slot array. Each array unit of the slot array represents an affix and stores the storage address information, in the entry storage unit, of the words beginning with that affix, together with the base value array value and the check value array value corresponding to the affix.
[0084] Storing, for each affix, the storage address information of the words beginning with that affix further comprises: storing, for each affix, the offset address information startoffset of the first word beginning with the affix and the offset address information endoffset of the last word beginning with the affix in the entry storage unit, where the offset address information is the offset of the word relative to the first address of the entry storage unit. Still taking Figure 3 as an example, the generated forward term index structure can be as shown in Table 2.
[0085] Table 2
[0086]
Subscript   1   2   3   4   5   6   7   8   9  10   11   12  13   14
Base       -1   4   4   0   0   0   0   4  -9   4  -11  -12  -4  -14
Check       0   0   0   0   0   0   0   2   2   2    3    8  10   13
[0087] startoffset
[0088] The forward entry index structure of the entry data structure is constructed as follows: first assign a unique code to every character of every entry; then construct the double array Trie and determine the base value array and the corresponding check value array; then, for each affix, store the storage address information, in the entry storage unit, of all the words beginning with that affix. Each affix corresponds to one array unit of the base value array and of the check value array. Table 2 above is only an example of a forward term index structure.
[0089] The specific steps for querying the forward term index structure can be as follows:
[0090] First, encode the keyword of the query term: each character (Chinese or English) corresponds to a unique code. Let the codes be a0, a1, ..., aN-1.
[0091] Then, process each code in turn:
[0092] If it is the first character and the check value of slots[a0] is -1, the character exists. Save the base value of slots[a0] and the value a0 as preBase and preIdx respectively and continue; otherwise exit. (Note: slots[i].check=-2 means that position i is unused; slots[i].check=-1 means that position i holds a first character; slots[i].check>=0 means that the state preceding the one at position i is at the position given by slots[i].check.)
[0093] If it is not the first character, let its code be aI and compute the next position pos=aI+abs(preBase). If slots[pos].check equals preIdx, a string starting with a0 a1 ... aI has been found; save the base value of slots[pos] and the value pos as preBase and preIdx respectively and continue; otherwise exit.
[0094] Subsequently, the offset range of all entries starting with the keyword represented by the codes a0, a1, ..., aN-1 is obtained.
[0095] Once the last character has been processed, read slots[preIdx].StartOffset and slots[preIdx].endOffset. These two values give the offset range of all entries that start with the keyword represented by a0, a1, ..., aN-1. The present invention can then obtain the detailed information of an entry through the algorithm that locates the position of an entry's detailed information from an offset idx. At this point, the value of idx is slots[preIdx].StartOffset.
[0096] If the procedure exits before the last character has been processed, there are no entries starting with the keyword represented by a0, a1, ..., aN-1.
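The steps above can be sketched as follows. This is an illustrative sketch under several assumptions: the base values are taken from Table 2, the offset of an entry is assumed to be its position in the sorted entry list (the actual startoffset and endoffset values are not reproduced in the source), the check values follow the -2/-1/>=0 convention of the steps above rather than the 0 used in Table 2, and the names `Slot`, `build_slots` and `offset_range` are hypothetical.

```python
from dataclasses import dataclass

CODE = {"啊": 1, "阿": 2, "埃": 3, "根": 4, "胶": 5,
        "拉": 6, "及": 7, "廷": 8, "伯": 9, "人": 10}
# Entries sorted by their code sequences; the list position plays the role of the offset.
ENTRIES = ["啊", "阿根廷", "阿胶", "阿拉伯", "阿拉伯人", "埃及"]
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]


@dataclass
class Slot:
    base: int = 0
    check: int = -2          # -2 unused, -1 first character, >=0 previous position
    start_offset: int = -1
    end_offset: int = -1


def build_slots():
    slots = [Slot(base=b) for b in BASE]
    for offset, entry in enumerate(ENTRIES):
        pos = CODE[entry[0]]
        slots[pos].check = -1
        path = [pos]
        for ch in entry[1:]:
            nxt = abs(slots[pos].base) + CODE[ch]
            slots[nxt].check = pos
            pos = nxt
            path.append(pos)
        # Every affix on the path covers this entry.
        for p in path:
            if slots[p].start_offset < 0:
                slots[p].start_offset = offset
            slots[p].end_offset = offset
    return slots


def offset_range(slots, keyword):
    """Return (startOffset, endOffset) for the keyword, or None if no entry starts with it."""
    pre_base = pre_idx = None
    for i, ch in enumerate(keyword):
        a = CODE.get(ch)
        if a is None:
            return None
        if i == 0:
            if slots[a].check != -1:
                return None
            pos = a
        else:
            pos = a + abs(pre_base)
            if pos >= len(slots) or slots[pos].check != pre_idx:
                return None
        pre_base, pre_idx = slots[pos].base, pos
    if pre_idx is None:
        return None
    return slots[pre_idx].start_offset, slots[pre_idx].end_offset


slots = build_slots()
print(offset_range(slots, "阿拉"))
```

For the keyword "阿拉" the returned range covers the entries "阿拉伯" and "阿拉伯人", which are adjacent because the entry list is sorted.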
[0097] S120: When a query term is received, first obtain the codes of all characters in the query term, then use the double array Trie to find the base value array unit at which the query term is located, then find the storage address information, in the entry storage unit, of all words beginning with the affix corresponding to that base value array unit, and retrieve all the corresponding words from the entry storage unit.
[0098] Take Table 2 as an example. To query "阿根", first use the code b=2 of "阿" to find the subscript 2 of the state "阿", and then use the code d=4 of "根" to find the subscript base[b]+d=8 of "阿根". Read the startoffset and endoffset stored in slot[8] and query the entry storage unit to obtain the corresponding words. These words are the fuzzy query results for "阿根".
[0099] In addition to providing forward query, the present invention also provides reverse query.
[0100] The index structure of the reverse term is also included in the thesaurus header.
[0101] The reverse term index structure includes:
[0102] A second slot array. Each array unit of the second slot array represents a reversed affix and stores the storage address information, in the term storage unit, of all words ending with the affix, together with the base value array value and the check value array value corresponding to the reversed affix. See Figure 7, which is an example diagram of the reverse term index structure.
[0103] The reverse entry index structure of the entry data structure is constructed as follows: first assign a unique code to every character of every entry (the encoding method of the forward entry index structure can be reused); then reverse the words in the thesaurus and represent them by a Trie tree structure; then construct the double array Trie and determine the base value array and the corresponding check value array; then, for each reversed affix, store the storage address information, in the entry storage unit, of all words ending with that affix. Each reversed affix corresponds to one array element of the base value array and of the check value array.
[0104] When a query term is received, the query term is reversed and the code of each character in the reversed query term is obtained. The reverse term index structure is searched to obtain the corresponding base value array unit; the storage address information, in the term storage unit, of all words ending with the affix corresponding to that base value array unit is then found, and all those words are retrieved from the term storage unit.
[0105] Reverse lookup is in fact similar to forward lookup. The only difference is that the offsets, in the thesaurus storage unit, of keywords sharing a common ending are not contiguous, so startOffset and endOffset cannot be used to represent a range. Instead, the offset of every word ending with the affix must be listed exhaustively, so the structure is changed slightly.
[0106] This structure is also an improvement on the Double array trie index structure: in addition to base and check, each slot has an Offset field and the offset list that Offset points to. Construction and query are similar to those of the standard Double array trie. The only differences are that, during construction, Offset and the offset list it points to are also assigned, and that, during a query, after the standard Double array trie query has finished, the offset list (arranged in ascending order) pointed to by the Offset of the current slot is taken as the final query result. For example, suppose the keywords ending with "手机" (mobile phone) are "手机" (mobile phone), "mp3手机" (mp3 mobile phone), "苹果手机" (Apple mobile phone) and "诺基亚手机" (Nokia mobile phone). Because keywords with the same ending are not stored contiguously in puredata, their offsets in puredata are not contiguous; assume they are 2, 6, 11 and 78. In the reverse entry index structure, the offset list pointed to by the Offset of the slot in which the character "手" of "手机" is located then contains N=4 offsets, namely 2, 6, 11 and 78.
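The offset list of the reverse index can be sketched as follows. This is a simplified illustration, not the structure of the present invention: a plain dictionary keyed by the reversed affix stands in for the double array slots, the offset of an entry is assumed to be its position in the sorted entry list, English strings are used for the entries, and the names `build_reverse_index` and `reverse_query` as well as the filler entries are hypothetical.

```python
def build_reverse_index(entries):
    """Map every reversed affix to the ascending list of offsets of entries ending with it."""
    index = {}
    for offset, entry in enumerate(entries):
        rev = entry[::-1]
        for k in range(1, len(rev) + 1):
            # Offsets are appended in increasing order, so each list stays sorted.
            index.setdefault(rev[:k], []).append(offset)
    return index


def reverse_query(index, ending):
    """Reverse the query term, then look up its offset list."""
    return index.get(ending[::-1], [])


entries = sorted([
    "mobile phone", "mp3 mobile phone", "Apple mobile phone",
    "Nokia mobile phone", "apple juice", "banana", "monitor",
])
index = build_reverse_index(entries)
offsets = reverse_query(index, "mobile phone")
print(offsets, [entries[i] for i in offsets])
```

The offsets returned for "mobile phone" are scattered across the sorted list, which is why a list rather than a start/end pair is stored for each reversed affix.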
[0107] It should also be noted that, when constructing the double array Trie and determining the base value array and the corresponding check value array, the words are reversed first: "机手", "机手3pm", "机手果苹" and "机手亚基诺" are represented by the Trie tree structure, and the base value array and the check value array are then determined. Accordingly, at query time the query term is first reversed, the code corresponding to each character in the reversed query term is determined, and the reverse term index structure is searched to obtain the corresponding base value array unit. The storage address information, in the entry storage unit, of all words ending with the affix corresponding to that base value array unit is then found, and all those words are retrieved from the entry storage unit.
[0108] With the above method, a query can be a forward query or a reverse query, which makes the query more comprehensive and improves the query results. Supporting both forward and reverse queries is called two-way query. In a two-way query, a forward query can first be used to obtain the offset range, a reverse query can then be used to obtain the corresponding offsets, and each corresponding word is then found in the entry storage unit according to the offset information; these words are the fuzzy query results for the query term.
[0109] It should be noted that, because the code of each character (affix) in the present invention is unique, the search result of the present invention is unique. A Hash algorithm, by contrast, has a certain collision rate, so uniqueness in a Hash table is usually not guaranteed and additional measures (such as open chaining or closed chaining) are needed to make the search result unique. The search method of the present invention is therefore faster. For example, with the present invention, if the code of "apple" is 12223, the code 12223 corresponds only to the keyword "apple". With a Hash algorithm, the code of "apple" may be 12223 and the code of "chestnut" may also be 12223, so additional measures are needed to determine whether the query refers to "apple" or "chestnut".
[0110] In addition to supporting only forward query or two-way query, the present invention may also support only reverse query of the thesaurus.
[0111] A fuzzy query method for thesaurus, including:
[0112] (1) Establish the entry data structure:
[0113] (1-1) Store all the entries in the thesaurus in the entry storage unit of the entry data structure in order;
[0114] (1-2) Construct the reverse entry index structure of the entry data structure: first assign a unique code to every character of every entry; then reverse the words, construct the double array Trie and determine the base value array and the corresponding check value array; then, for each reversed affix, store the storage address information, in the entry storage unit, of all words ending with that affix. Each reversed affix corresponds one-to-one to an array unit of the base value array and of the check value array;
[0115] (2) When a query term is received, reverse the query term, obtain the code of each character in the reversed query term, and search the reverse term index structure to obtain the corresponding base value array unit; then find the storage address information, in the term storage unit, of all words ending with the affix corresponding to that base value array unit, and retrieve all those words from the term storage unit.
[0116] The reverse query has been described above and is not repeated here.
[0117] The above disclosures are only a few specific embodiments of the present invention, but the present invention is not limited thereto, and any changes that can be thought of by those skilled in the art should fall within the protection scope of the present invention.