Entity recognition method and apparatus, terminal, and storage medium
By combining a dictionary database and mutual information entropy, the problem of low entity recognition accuracy is solved, providing an efficient and accurate entity recognition method that is applicable to intelligent customer service and other fields, improving recognition efficiency and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING XIAOMI MOBILE SOFTWARE CO LTD
- Filing Date
- 2022-02-11
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, entity recognition methods have low accuracy, especially when entity recognition is performed on the output of speech recognition, where results are strongly affected by factors such as accent, age, and environmental noise, and effective optimization methods are lacking.
Entity word recognition is performed by setting up a dictionary database and combining mutual information and left and right information entropy. This avoids deep learning and training with a large amount of data, and allows for a direct cold start, which improves recognition efficiency and accuracy.
It enables entity word recognition at low cost and high efficiency, and is applicable to intelligent customer service and other fields, improving recognition accuracy and user experience.
Figure CN114462410B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of terminal technology, and in particular to an entity recognition method, device, terminal and storage medium.
Background Technology
[0002] In human-computer interaction applications, it's necessary to first convert human speech into text, then process it to allow the machine to understand the human's intent, and finally select appropriate follow-up strategies. Speech recognition and entity recognition (i.e., entity word recognition) are both crucial, with entity recognition occurring after speech recognition. Complex factors such as accent, age, speaking habits, education level, and environmental noise significantly impact not only the accuracy of speech recognition but also the accuracy of entity word recognition within sentences. While many methods exist to optimize speech recognition, entity recognition methods are fewer and their accuracy is relatively poor.
Summary of the Invention
[0003] To overcome the problems existing in related technologies, this disclosure provides an entity recognition method, apparatus, terminal and storage medium.
[0004] According to a first aspect of the present disclosure, an entity recognition method is provided, applied to a terminal, the method comprising:
[0005] Obtaining a sentence to be recognized;
[0006] Recognizing the sentence to be recognized based on a set dictionary, and determining a first recognition result;
[0007] If it is determined that the first recognition result does not include all the entity words of the sentence to be recognized, recognizing the sentence to be recognized based on mutual information and left and right information entropy, and determining a pending recognition result;
[0008] Determining a target recognition result based on the first recognition result and the pending recognition result.
[0009] Optionally, determining the target recognition result based on the first recognition result and the pending recognition result includes:
[0010] Determining words in the pending recognition result that differ from the entity words in the first recognition result as pending entity words;
[0011] Determining a second recognition result based on the pending entity words that satisfy a first set condition;
[0012] Determining the target recognition result based on the first recognition result and the second recognition result.
[0013] Optionally, determining the second recognition result based on the pending entity words that satisfy the first set condition includes:
[0014] If it is determined that the first model value of a pending entity word is greater than or equal to a first threshold, and that the second model value of the pending entity word is greater than or equal to a second threshold, determining the pending entity word as a second entity word;
[0015] Determining the recognition result consisting of all the second entity words as the second recognition result.
[0016] Optionally, the method further includes:
[0017] If it is determined that the first recognition result includes all entity words of the sentence to be recognized, determining the first recognition result as the target recognition result.
[0018] Optionally, the set dictionary is obtained in the following way:
[0019] Determining a sentence library based on sentences in a set domain;
[0020] Performing word segmentation on the sentences in the sentence library to determine a first word library;
[0021] Recognizing the sentences in the sentence library based on mutual information and left and right information entropy to determine a pending word library;
[0022] Determining the set dictionary based on the first word library and the pending word library.
[0023] Optionally, determining the set dictionary based on the first word library and the pending word library includes:
[0024] Determining words in the pending word library that differ from the set words in the first word library as pending set words;
[0025] Determining a second word library based on the pending set words that satisfy a second set condition;
[0026] Determining the set dictionary based on the first word library and the second word library.
[0027] Optionally, determining the second word library based on the pending set words that satisfy the second set condition includes:
[0028] If it is determined that the first model value of a pending set word is greater than or equal to a third threshold, and that the second model value of the pending set word is greater than or equal to a fourth threshold, determining the pending set word as a second set word;
[0029] Determining the word library consisting of all the second set words as the second word library.
[0030] According to a second aspect of the present disclosure, an entity recognition device is provided, applied to a terminal, the device comprising:
[0031] An acquisition module, configured to obtain a sentence to be recognized;
[0032] A determination module, configured to recognize the sentence to be recognized based on a set dictionary and determine a first recognition result;
[0033] The determination module is further configured to, if it is determined that the first recognition result does not include all the entity words of the sentence to be recognized, recognize the sentence to be recognized based on mutual information and left and right information entropy and determine a pending recognition result;
[0034] The determination module is further configured to determine a target recognition result based on the first recognition result and the pending recognition result.
[0035] Optionally, the determination module is configured to:
[0036] Determine words in the pending recognition result that differ from the entity words in the first recognition result as pending entity words;
[0037] Determine a second recognition result based on the pending entity words that satisfy a first set condition;
[0038] Determine the target recognition result based on the first recognition result and the second recognition result.
[0039] Optionally, the determination module is configured to:
[0040] If it is determined that the first model value of a pending entity word is greater than or equal to a first threshold, and that the second model value of the pending entity word is greater than or equal to a second threshold, determine the pending entity word as a second entity word;
[0041] Determine the recognition result consisting of all the second entity words as the second recognition result.
[0042] Optionally, the determination module is configured to:
[0043] If it is determined that the first recognition result includes all entity words of the sentence to be recognized, determine the first recognition result as the target recognition result.
[0044] According to a third aspect of the present disclosure, a terminal is provided, the terminal comprising:
[0045] A processor;
[0046] A memory for storing instructions executable by the processor;
[0047] Wherein the processor is configured to perform the method as described in the first aspect.
[0048] According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, wherein when instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform the method as described in the first aspect.
[0049] The technical solutions provided by the embodiments of this disclosure can include the following beneficial effects: in this method, entity words are recognized based on a set dictionary, mutual information, and left and right information entropy. No large data set has to be prepared for model training, so the method can be deployed with a cold start, which reduces difficulty and improves efficiency and accuracy. Furthermore, this method can be applied not only to the field of intelligent customer service but also to other fields involving entity word recognition, demonstrating broad applicability.
[0050] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0051] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0052] Figure 1 is a flowchart illustrating an entity recognition method according to an exemplary embodiment.
[0053] Figure 2 is a block diagram illustrating an entity recognition device according to an exemplary embodiment.
[0054] Figure 3 is a block diagram of a terminal according to an exemplary embodiment.
Detailed Implementation
[0055] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0056] In related technologies, entity word recognition (referred to as entity recognition) is mainly achieved through deep learning methods, pinyin correction methods, or a combination of both. Deep learning methods require a large amount of data for model training, which is time-consuming, and each model must be trained separately. Pinyin correction methods require specialized dictionaries to be built for particular vertical domains, and their accuracy is relatively low.
[0057] This disclosure provides an entity recognition method applied to a terminal. The method recognizes entity words based on a set dictionary, mutual information, and left and right information entropy. No large data set has to be prepared for model training, so the method can be deployed with a cold start, resulting in lower difficulty and better efficiency and accuracy. Furthermore, this method is applicable not only to intelligent customer service but also to other fields involving entity word recognition, demonstrating broad applicability.
[0058] In one exemplary embodiment, an entity recognition method is provided, applied to a terminal. As shown in Figure 1, the method includes:
[0059] S110. Obtain the sentence to be recognized;
[0060] S120. Based on the set dictionary, recognize the sentence to be recognized and determine the first recognition result;
[0061] S130. Determine whether the first recognition result includes all entity words of the sentence to be recognized; if it does, execute step S140; if it does not, execute steps S150 and S160.
[0062] S140. Determine the first recognition result as the target recognition result;
[0063] S150. Based on mutual information and left and right information entropy, recognize the sentence to be recognized and determine the pending recognition result;
[0064] S160. Based on the first recognition result and the pending recognition result, determine the target recognition result.
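The control flow of steps S110 to S160 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `recognize_entities` and the four callables passed into it are hypothetical placeholders for the dictionary matcher, the completeness check, the statistical recognizer, and the merge step described in the text.

```python
from typing import Callable, List


def recognize_entities(
    sentence: str,
    dictionary_match: Callable[[str], List[str]],
    is_complete: Callable[[str, List[str]], bool],
    statistical_match: Callable[[str], List[str]],
    merge: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Hypothetical driver for S110-S160; every callable is an assumption."""
    # S120: recognize the sentence with the set dictionary.
    first_result = dictionary_match(sentence)
    # S130: does the first result already cover all entity words?
    if is_complete(sentence, first_result):
        # S140: the dictionary result is the target result.
        return first_result
    # S150: fall back to mutual information and left/right entropy.
    pending_result = statistical_match(sentence)
    # S160: combine both results into the target result.
    return merge(first_result, pending_result)
```

Passing the individual steps in as callables keeps the sketch independent of how each step is realized, which the embodiments below describe separately.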
[0065] In step S110, the sentence to be recognized may be typed in by the user, spoken by the user, obtained by the terminal from speech that does not come from the user, or received by the terminal from another terminal; there is no limitation on this. It should be noted that when the terminal receives voice information, it must first convert the voice information into text through speech recognition so that the terminal can obtain the sentence to be recognized.
[0066] In step S120, the set dictionary can be configured before or after the terminal leaves the factory, and it can be modified afterward. The set dictionary can include multiple set word pairs, each consisting of a set word and its pinyin. Each set word is an entity word. Modifications to the set dictionary can include, without limitation, deleting set words, adding new set words, or modifying existing set words.
[0067] The set dictionary can be configured according to actual needs and is not limited in this regard. For example, the set dictionary can be composed of words from one or more vertical domains. That is, the set dictionary can be used to recognize entity words in a specific vertical domain, or it can be used to recognize entity words in a broader domain (i.e., multiple vertical domains). Vertical domains may include intelligent customer service, healthcare, entertainment, education, and sports, etc.
[0068] In this step, entity words in the sentence to be recognized can be recognized using the set dictionary, and the recognized entity words constitute the first recognition result. Each recognized entity word has a corresponding set word pair in the set dictionary. For example, entity recognition of the sentence to be recognized can be performed using regular expression matching (also known as rule matching) based on the set dictionary.
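One way to realize the regular expression matching mentioned above is to compile the set words into a single alternation. The sketch below assumes the set dictionary is a mapping from set word to pinyin; the sample entries, the longest-first ordering, and the names `SET_DICTIONARY` and `dictionary_match` are illustrative assumptions rather than details from the disclosure.

```python
import re
from typing import Dict, List

# Hypothetical set word pairs (set word -> pinyin), for illustration only.
SET_DICTIONARY: Dict[str, str] = {
    "小米": "xiao mi",
    "小米手机": "xiao mi shou ji",
    "耳机": "er ji",
}


def dictionary_match(sentence: str, dictionary: Dict[str, str]) -> List[str]:
    """Return the set words found in the sentence, in order of appearance."""
    if not dictionary:
        return []  # an empty alternation would match the empty string everywhere
    # Assumption: try longer set words first so "小米手机" wins over "小米".
    words = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in words))
    return [match.group(0) for match in pattern.finditer(sentence)]


first_result = dictionary_match("我想买小米手机和耳机", SET_DICTIONARY)
```

Escaping each word with `re.escape` keeps set words that contain regex metacharacters from changing the pattern.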
[0069] In step S130, the entity word in the first recognition result that appears last in the sentence to be recognized is determined, along with its position in the sentence. The number of characters that follow this entity word in the sentence is then determined and recorded as the number of remaining characters. In addition, the number of characters in each entity word of the first recognition result is determined and recorded as that word's number of entity word characters.
[0070] If the number of entity word characters of every entity word is greater than the number of remaining characters, the remaining characters cannot constitute an entity word, and the first recognition result can be determined to include all entity words of the sentence to be recognized. If the number of entity word characters of at least one entity word is less than or equal to the number of remaining characters, the remaining characters may constitute an entity word, and the first recognition result can be determined not to include all entity words of the sentence to be recognized.
[0071] Example 1,
[0072] The set dictionary and regular expression matching are used to recognize entity words in the sentence to be recognized, yielding a first recognition result consisting of entity words A, B, and C. Entity word C is then identified as the last entity word in the sentence, and the number of characters remaining after entity word C is determined, denoted as the number of remaining characters *i*. Additionally, the numbers of characters of entity words A, B, and C are determined, denoted as the numbers of entity word characters *a*, *b*, and *c*, respectively. The number of remaining characters *i* is compared with each of *a*, *b*, and *c*. If *i* is greater than or equal to any one of *a*, *b*, and *c*, the first recognition result is determined not to include all entity words in the sentence to be recognized; if *i* is less than all three of *a*, *b*, and *c*, the first recognition result is determined to include all entity words in the sentence to be recognized.
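The remaining-character check of step S130 can be sketched as below. The function name is hypothetical, and two details are assumptions that the text does not specify: the last occurrence of each entity word is used to locate it, and an empty first recognition result is treated as incomplete.

```python
from typing import List


def first_result_is_complete(sentence: str, entity_words: List[str]) -> bool:
    """Compare the remaining character count i with each entity word length."""
    # End positions of the entity words that actually occur in the sentence.
    ends = [sentence.rfind(w) + len(w) for w in entity_words if w in sentence]
    if not ends:
        return False  # assumption: nothing matched, so fall back to S150
    # i: characters left after the entity word that ends last in the sentence.
    remaining = len(sentence) - max(ends)
    # Complete only if i is less than every entity word length (a, b, c, ...).
    return all(len(w) > remaining for w in entity_words)
```

For instance, with entity words of two characters each, one trailing character yields a complete result, while two trailing characters trigger the fallback to mutual information and left and right information entropy.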
[0073] In step S140, since it has been determined that the first recognition result includes all entity words of the statement to be recognized, the first recognition result can be directly determined as the target recognition result for entity recognition. That is, the recognition result obtained based on the set dictionary is determined as the target recognition result.
[0074] In step S150, since it has been determined that the first recognition result does not include all the entity words of the sentence to be recognized, that is, the sentence may contain other entity words, the sentence can be recognized based on mutual information and left and right information entropy. The recognized words are determined as pending words, and the recognition result composed of all pending words is determined as the pending recognition result. Here, left and right information entropy can be simply referred to as left and right entropy.
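The disclosure does not state the formulas behind mutual information and left and right entropy. The sketch below uses common textbook definitions as an assumption: pointwise mutual information between the two parts of a candidate word, and the Shannon entropy of the characters adjacent to the candidate, both estimated from a small corpus. The function names and the corpus are hypothetical, and the raw values are not normalized to any particular range.

```python
import math
from collections import Counter
from typing import Sequence


def occurrences(corpus: Sequence[str], s: str) -> int:
    """Count (possibly overlapping) occurrences of s in the corpus."""
    return sum(
        1
        for text in corpus
        for i in range(len(text) - len(s) + 1)
        if text.startswith(s, i)
    )


def mutual_information(corpus: Sequence[str], left: str, right: str) -> float:
    """Pointwise mutual information, log2 of p(left+right) / (p(left) p(right))."""
    total = sum(len(text) for text in corpus)
    joint = occurrences(corpus, left + right) / total
    if joint == 0:
        return float("-inf")  # the candidate never occurs
    p_left = occurrences(corpus, left) / total
    p_right = occurrences(corpus, right) / total
    return math.log2(joint / (p_left * p_right))


def side_entropy(corpus: Sequence[str], word: str, side: str) -> float:
    """Entropy (bits) of the characters adjacent to word; side is 'left' or 'right'."""
    neighbors: Counter = Counter()
    for text in corpus:
        for i in range(len(text) - len(word) + 1):
            if text.startswith(word, i):
                j = i - 1 if side == "left" else i + len(word)
                if 0 <= j < len(text):
                    neighbors[text[j]] += 1
    n = sum(neighbors.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in neighbors.values())
```

Under these definitions, a high mutual information value suggests the parts of a candidate tend to occur together, and high left and right entropy suggests the candidate appears in varied contexts, both of which are commonly read as evidence that the candidate is a standalone word.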
[0075] In step S160, the first identification result can be corrected using the pending identification result, and the corrected identification result can be determined as the target identification result. Since the entity identification results of mutual information and left and right information entropy are relatively accurate, the reliability of entity identification can be improved by correcting the first identification result using the pending identification result.
[0076] This method recognizes entity words based on a set dictionary, mutual information, and left and right information entropy. No large data set has to be prepared for model training, so the method can be deployed with a cold start, resulting in lower difficulty and better efficiency and accuracy. Furthermore, this method is applicable not only to intelligent customer service but also to other fields involving entity word recognition, demonstrating broad applicability and further enhancing the user experience.
[0077] In one exemplary embodiment, an entity recognition method is provided, applied to a terminal. In this method, determining the target recognition result based on the first recognition result and the pending recognition result may include:
[0078] S210. In the pending identification results, words that are different from the entity words in the first identification result are identified as pending entity words;
[0079] S230. Determine the second recognition result based on the undetermined entity words that meet the first set conditions;
[0080] S240. Determine the target identification result based on the first identification result and the second identification result.
[0081] In step S210, the pending identification results include multiple pending words. The pending words in the pending identification results can be compared with the entity words in the first identification results, and then the pending words that are different from all entity words can be identified as pending entity words.
[0082] In step S220, it should be noted that when mutual information and left and right information entropy are used to recognize the sentence to be recognized, a first model value and a second model value can be determined for each pending word. The first model value is calculated by the model corresponding to mutual information, and the second model value is calculated by the model corresponding to left and right information entropy.
[0083] The larger the first model value of a pending entity word, the greater the probability that the pending entity word is an entity word; likewise, the larger its second model value, the greater the probability that it is an entity word.
[0084] In this step, if it is determined that the first model value of a pending entity word is greater than or equal to the first threshold, and that its second model value is greater than or equal to the second threshold, the pending entity word meets the first set condition and can be determined as a second entity word. The recognition result composed of all the second entity words is then determined as the second recognition result.
[0085] The first threshold can be set before or after the terminal leaves the factory, and it can be modified afterward. The specific value of the first threshold can be set according to actual needs and is not limited thereto. For example, the first threshold can be greater than or equal to 0.75 and less than or equal to 1.
[0086] The method for setting the second threshold can refer to that of the first threshold. The specific value of the second threshold can be set according to actual needs and is not limited thereto. For example, the second threshold can be greater than or equal to 0.75 and less than or equal to 1.
[0087] It should be noted that the first threshold and the second threshold can be the same or different; there is no limitation on this. For example, the first threshold can be 0.75, and the second threshold can be 0.75. Another example is that the first threshold can be 0.85, and the second threshold can be 0.80. Furthermore, according to statistics, the entity recognition results of this method are better when the first threshold is 0.85 and the second threshold is 0.80.
[0088] Example 1,
[0089] The first threshold is 0.85, and the second threshold is 0.80.
[0090] In the pending recognition result, the words that differ from the entity words in the first recognition result may include pending entity word D', pending entity word E', and pending entity word F'. The first and second model values of pending entity word D' are denoted $m_{D'1}$ and $m_{D'2}$; those of pending entity word E' are denoted $m_{E'1}$ and $m_{E'2}$; and those of pending entity word F' are denoted $m_{F'1}$ and $m_{F'2}$.
[0091] Here, the first model value $m_{D'1}$ is less than 0.85 and the second model value $m_{D'2}$ is less than 0.80, which indicates that pending entity word D' is not an entity word. The first model value $m_{E'1}$ is greater than or equal to 0.85 and the second model value $m_{E'2}$ is greater than or equal to 0.80, which indicates that pending entity word E' is an entity word, so it can be determined as the second entity word E. The first model value $m_{F'1}$ is greater than or equal to 0.85 but the second model value $m_{F'2}$ is less than 0.80, which indicates that pending entity word F' is not an entity word.
[0092] Therefore, in this example, the undetermined entity word E' can be identified as the second entity word E, and the second recognition result includes the second entity word E.
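The two-threshold test in this example can be written as a short filter. The thresholds 0.85 and 0.80 come from the example; the function name and the concrete model values for D', E', and F' are made up for illustration and only respect the inequalities stated above.

```python
from typing import Dict, List, Tuple


def select_second_entity_words(
    model_values: Dict[str, Tuple[float, float]],
    first_threshold: float = 0.85,
    second_threshold: float = 0.80,
) -> List[str]:
    """Keep pending entity words whose two model values both reach their thresholds."""
    return [
        word
        for word, (first_value, second_value) in model_values.items()
        if first_value >= first_threshold and second_value >= second_threshold
    ]


# Hypothetical (first model value, second model value) pairs.
pending_values = {
    "D'": (0.60, 0.50),  # both below threshold
    "E'": (0.90, 0.85),  # both at or above threshold
    "F'": (0.90, 0.70),  # second model value below threshold
}
second_result = select_second_entity_words(pending_values)
```

Both conditions must hold, so F' is rejected even though its first model value passes.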
[0093] In step S240, the entity words included in the first recognition result can be recorded as the first entity words, and the entity words included in the second recognition result can be recorded as the second entity words.
[0094] After determining the second recognition result, the recognition result can be composed of the first entity word in the first recognition result and the second entity word in the second recognition result, and this recognition result can be determined as the target recognition result, thereby ensuring the reliability of entity recognition.
[0095] Example 2,
[0096] The first recognition result includes first entity word A, first entity word B, and first entity word C, wherein the last entity word in the sentence to be recognized is determined to be first entity word C. It should be noted that the last entity word here refers to the entity word that appears last in the sentence in the first recognition result.
[0097] Then, determine the number of characters remaining after the first entity word C in the sentence to be identified, denoted as the number of remaining characters i. Additionally, determine the number of characters for the first entity word A, denoted as the number of entity word characters a; determine the number of characters for the first entity word B, denoted as the number of entity word characters b; and determine the number of characters for the first entity word C, denoted as the number of entity word characters c. Compare the number of remaining characters i with the number of entity word characters a, b, and c, respectively.
[0098] If the number of remaining characters i is greater than or equal to any one of the numbers of entity word characters a, b, and c, the first recognition result is determined not to include all entity words of the sentence to be recognized. The sentence to be recognized is then recognized based on mutual information and left and right information entropy to determine the pending recognition result.
[0099] The pending identification results include pending entity words A', B', C', D', E', and F'. Among them, pending entity word A' is the same as the first entity word A, pending entity word B' is the same as the first entity word B, pending entity word C' is the same as the first entity word C, and pending entity words D', E', and F' are all different from the three first entity words (i.e., first entity words A, B, and C).
[0100] In Example 2, the first threshold is 0.85 and the second threshold is 0.80. The first and second model values of pending entity word D' are denoted $m_{D'1}$ and $m_{D'2}$; those of pending entity word E' are denoted $m_{E'1}$ and $m_{E'2}$; and those of pending entity word F' are denoted $m_{F'1}$ and $m_{F'2}$.
[0101] Here, the first model value $m_{D'1}$ is less than 0.85 and the second model value $m_{D'2}$ is less than 0.80, which indicates that pending entity word D' is not an entity word. The first model value $m_{E'1}$ is greater than or equal to 0.85 and the second model value $m_{E'2}$ is greater than or equal to 0.80, which indicates that pending entity word E' is an entity word, so it can be determined as the second entity word E. The first model value $m_{F'1}$ is greater than or equal to 0.85 but the second model value $m_{F'2}$ is less than 0.80, which indicates that pending entity word F' is not an entity word.
[0102] Therefore, in this example, the undetermined entity word E' can be identified as the second entity word E, and the recognition result formed by the second entity word E can be identified as the second recognition result. That is, the second recognition result includes the second entity word E.
[0103] Then, the first entity words A, B, and C included in the first recognition result and the second entity word E included in the second recognition result are combined to form a new recognition result, which is determined as the target recognition result. The target recognition result therefore includes the first entity words A, B, and C and the second entity word E.
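The comparison in step S210 and the combination in step S240 amount to a set difference followed by a concatenation. The sketch below uses hypothetical function names and plain letters for the words of Example 2; the threshold filtering between the two steps is taken as already done.

```python
from typing import List


def pending_entity_words(first_result: List[str], pending_result: List[str]) -> List[str]:
    """S210: pending words that differ from every first entity word."""
    known = set(first_result)
    return [word for word in pending_result if word not in known]


def target_result(first_result: List[str], second_result: List[str]) -> List[str]:
    """S240: first entity words followed by the accepted second entity words."""
    known = set(first_result)
    return list(first_result) + [w for w in second_result if w not in known]


first = ["A", "B", "C"]
pending = ["A", "B", "C", "D", "E", "F"]
candidates = pending_entity_words(first, pending)   # D, E, F remain to be scored
target = target_result(first, ["E"])                # E passed both thresholds
```

Keeping the first entity words in their original order and appending the second entity words is a design choice of this sketch; the text only requires that the target result contain both groups.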
[0104] In this method, the first identification result can be revised by the identification results of mutual information and left and right information entropy, thereby obtaining a more accurate target identification result and improving the user experience.
[0105] In one exemplary embodiment, an entity recognition method is provided, applied to a terminal. In this method, the dictionary database can be obtained through the following methods:
[0106] S310. Determine the sentence library based on the sentences in the set domain;
[0107] S320. Perform word segmentation on the sentences in the sentence library to determine the first word library;
[0108] S330. Based on mutual information and left and right information entropy, recognize the sentences in the sentence library to determine the pending word library;
[0109] S340. Determine the set dictionary based on the first word library and the pending word library.
[0110] In step S310, the set domain may include at least one vertical domain, or it may be a wide domain, meaning the set domain is not limited to vertical domains. It should be noted that wide domain and vertical domain are contrasting concepts. In this method, a wide domain refers to a domain that is not limited to one or more vertical domains; a wide domain can be understood as the entire domain.
[0111] Specifically, when the defined domain includes at least one vertical domain, the defined dictionary can be used to identify entity words within that vertical domain. When the defined domain is broad, the defined dictionary can be used to identify entity words across the entire domain, meaning it can be used to identify entity words from any domain. Vertical domains may include intelligent customer service, healthcare, entertainment, education, and sports, among others.
[0112] In this step, a large number of statements within the defined domain can be collected, forming a statement library. The more statements in the statement library, the richer the defined word pairs in the final defined dictionary, and the higher the reliability of the entity recognition results of this method.
[0113] In step S320, the sentences in the sentence library can be segmented using a word segmentation tool to identify multiple entity words. These identified entity words are then designated as set words, and the first word library is composed of all the set words. That is, the first word library includes multiple set words. The word segmentation tool can include HanLP (Han Language Processing), Jieba (also known as Jieba word segmentation), or the open-source CRF++ (where CRF stands for Conditional Random Field), etc., and is not limited thereto.
[0114] In step S330, the statements in the statement library can be identified based on mutual information and left and right information entropy, and the identified words can be used to form a pending word library.
[0115] For the process of identifying the statements in the statement library based on mutual information and left and right information entropy, refer to the process of identifying the statement to be recognized based on mutual information and left and right information entropy; it is not repeated here.
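A minimal sketch of step S330 is given below, under stated assumptions. It uses the textbook definitions: the mutual information of a candidate is taken as the lowest pointwise mutual information over all of its two-part splits, and the left and right information entropy is the Shannon entropy of the characters adjacent to the candidate on each side. The probability estimate (count divided by total characters), the boundary markers, the threshold values, and all function names are assumptions of this sketch and are not taken from the disclosure.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

BOS, EOS = "<s>", "</s>"  # assumed markers for the start and end of a statement


def entropy(neighbors: Counter) -> float:
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbors.values())
    return sum(-(c / total) * math.log(c / total) for c in neighbors.values())


def score_candidates(statements: List[str], max_len: int = 4) -> Dict[str, Tuple[float, float, float]]:
    """Return {candidate: (mutual information, left entropy, right entropy)}."""
    counts: Counter = Counter()
    left: Dict[str, Counter] = defaultdict(Counter)
    right: Dict[str, Counter] = defaultdict(Counter)
    for s in statements:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                w = s[i:i + n]
                counts[w] += 1
                left[w][s[i - 1] if i > 0 else BOS] += 1
                right[w][s[i + n] if i + n < len(s) else EOS] += 1
    total = sum(len(s) for s in statements)  # number of characters in the library
    scores = {}
    for w, c in counts.items():
        if len(w) < 2:
            continue  # a single character has no internal split to score
        # Weakest split: lowest pointwise mutual information over all cut points.
        pmi = min(
            math.log((c / total) / ((counts[w[:k]] / total) * (counts[w[k:]] / total)))
            for k in range(1, len(w))
        )
        scores[w] = (pmi, entropy(left[w]), entropy(right[w]))
    return scores


def pending_word_library(statements: List[str], min_pmi: float,
                         min_entropy: float, max_len: int = 4) -> Set[str]:
    """Keep candidates that are cohesive inside and free on both sides."""
    return {
        w for w, (pmi, l_ent, r_ent) in score_candidates(statements, max_len).items()
        if pmi >= min_pmi and min(l_ent, r_ent) >= min_entropy
    }


# Hypothetical statement library for illustration.
STATEMENTS = ["我用小米手机", "买小米平板", "小米很好", "他的小米"]
pending = pending_word_library(STATEMENTS, min_pmi=1.0, min_entropy=1.0)
```

In this toy library only "小米" appears with varied neighbors on both sides, so it is the only candidate that passes both illustrative thresholds; fragments such as "米手" always follow the same character and are rejected by the left entropy test.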
[0116] In step S340, the first word library can be corrected using the pending word library. The set words in the corrected word library are then converted to pinyin, for example with a pinyin conversion tool, to obtain the pinyin corresponding to each set word. Each set word and its corresponding pinyin are determined as a set word pair, and all the set word pairs constitute the set dictionary database.
[0117] Since the entity recognition results obtained from mutual information and left and right information entropy are relatively accurate, correcting the first word library with the pending word library improves the reliability of the set dictionary database, and thereby the reliability of the entity recognition results of this method.
[0118] This method uses word segmentation tools, mutual information, and left and right information entropy to identify the statements in the statement library and automatically generate the set dictionary database. This requires little manual intervention, which saves cost and improves both the efficiency and the reliability of building the set dictionary database. Furthermore, the method can be used to build set dictionary databases not only for intelligent customer service but also for other fields involving entity word recognition, which broadens the applicability of this entity recognition method.
[0119] In one exemplary embodiment, an entity recognition method is provided, applied to a terminal. In this method, determining the set dictionary database based on the first word library and the pending word library may include:
[0120] S410. Determine the words in the pending word library that differ from the set words in the first word library as pending set words;
[0121] S420. Determine the second word library based on the pending set words that meet the second set condition;
[0122] S430. Determine the set dictionary database based on the first word library and the second word library.
[0123] For step S410, refer to step S210 in the other embodiments; for step S420, refer to step S220 in the other embodiments.
[0124] In step S410, the pending word library may include multiple pending words. Each pending word in the pending word library can be compared with the set words in the first word library, and the pending words that differ from every set word are determined as pending set words.
[0125] In step S420, if the first model value of a pending set word is greater than or equal to the third threshold, and the second model value of that pending set word is greater than or equal to the fourth threshold, the pending set word meets the second set condition and is determined as a second set word. The word library composed of all the second set words is then determined as the second word library.
[0126] The third threshold can be the same as or different from the first threshold; there is no limitation in this regard. Similarly, the fourth threshold can be the same as or different from the second threshold; there is no limitation in this regard. For example, the first, second, third, and fourth thresholds can all be greater than or equal to 0.75 and less than or equal to 1.
[0127] For example, the first and third thresholds are both 0.85, and the second and fourth thresholds are both 0.80.
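Steps S410 and S420 can be sketched as a set difference followed by a two-threshold filter. This passage does not restate how the first and second model values are computed, so the sketch takes them as caller-supplied functions; the function names, the sample words, and the sample scores are hypothetical. The default thresholds are the example values of 0.85 and 0.80 given above.

```python
from typing import Callable, Iterable, Set


def second_word_library(
    pending_words: Iterable[str],
    first_word_library: Set[str],
    first_model_value: Callable[[str], float],
    second_model_value: Callable[[str], float],
    third_threshold: float = 0.85,
    fourth_threshold: float = 0.80,
) -> Set[str]:
    # S410: pending words that differ from every set word become pending set words.
    pending_set_words = {w for w in pending_words if w not in first_word_library}
    # S420: keep the pending set words that meet the second set condition.
    return {
        w for w in pending_set_words
        if first_model_value(w) >= third_threshold
        and second_model_value(w) >= fourth_threshold
    }


# Made-up scores for illustration: word -> (first model value, second model value).
SCORES = {"红米": (0.90, 0.85), "米手": (0.90, 0.30)}

second_library = second_word_library(
    pending_words=["小米", "红米", "米手"],
    first_word_library={"小米", "手机"},
    first_model_value=lambda w: SCORES[w][0],
    second_model_value=lambda w: SCORES[w][1],
)
```

Here "小米" is dropped because it is already a set word, "米手" fails the fourth threshold, and only "红米" enters the second word library.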
[0128] In step S430, the set words included in the first word library can be recorded as the first set words, and the set words included in the second word library can be recorded as the second set words.
[0129] After the second word library is determined, a new word library, denoted the target word library, can be constructed from the first set words in the first word library and the second set words in the second word library. A pinyin conversion tool can then be used to convert the set words in the target word library to pinyin, determining the pinyin corresponding to each set word. Each set word and its corresponding pinyin are determined as a set word pair, and all the set word pairs constitute the set dictionary database. The set dictionary database thus includes multiple set word pairs drawn from both word libraries, which helps ensure its reliability and thereby improves the reliability of entity recognition in this method.
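The construction in step S430 can be sketched as follows. The pinyin conversion tool is represented by a caller-supplied function; `TOY_PINYIN` and `toy_to_pinyin` are hypothetical stand-ins covering only the sample characters, and they ignore tones and polyphonic characters, which a real conversion tool would have to handle.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical per-character table standing in for a pinyin conversion tool.
TOY_PINYIN = {"小": "xiao", "米": "mi", "红": "hong", "手": "shou", "机": "ji"}


def toy_to_pinyin(word: str) -> str:
    """Convert a word to space-separated pinyin using the toy table."""
    return " ".join(TOY_PINYIN[ch] for ch in word)


def build_set_dictionary(
    first_set_words: Iterable[str],
    second_set_words: Iterable[str],
    to_pinyin: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Merge both word libraries into the target word library, then pair each word with its pinyin."""
    target_word_library = sorted(set(first_set_words) | set(second_set_words))
    return [(word, to_pinyin(word)) for word in target_word_library]


set_dictionary = build_set_dictionary(["小米", "手机"], ["红米"], toy_to_pinyin)
```

Sorting the target word library only makes the output deterministic; the disclosure does not specify any ordering of the set word pairs.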
[0130] In this method, the first word library can be corrected using the pending word library identified through mutual information and left and right information entropy, yielding a more reliable set dictionary database, which in turn improves the reliability of entity recognition and enhances the user experience.
[0131] In one exemplary embodiment, an entity recognition device is provided, applied to a terminal. The device is used to implement the above-described method. Referring to Figure 2, the device may include an acquisition module 101 and a determining module 102. When the device implements the above method:
[0132] The acquisition module 101 is used to acquire the statement to be recognized;
[0133] The determining module 102 is used to recognize the statement to be recognized based on the set dictionary database and determine the first recognition result;
[0134] It is also used to, if it is determined that the first recognition result does not include all the entity words of the statement to be recognized, recognize the statement to be recognized based on mutual information and left and right information entropy and determine the pending recognition result;
[0135] It is also used to determine the target recognition result based on the first recognition result and the pending recognition result.
[0136] In one exemplary embodiment, an entity recognition device is provided, applied to a terminal. Referring to Figure 2, in this device, the determining module 102 is used to:
[0137] Determine the words in the pending recognition result that differ from the entity words in the first recognition result as pending entity words;
[0138] Determine the second recognition result based on the pending entity words that meet the first set condition;
[0139] Determine the target recognition result based on the first recognition result and the second recognition result.
[0140] In one exemplary embodiment, an entity recognition device is provided, applied to a terminal. Referring to Figure 2, in this device, the determining module 102 is used to:
[0141] If the first model value of a pending entity word is greater than or equal to the first threshold, and the second model value of that pending entity word is greater than or equal to the second threshold, determine the pending entity word as a second entity word;
[0142] Determine the recognition result consisting of all the second entity words as the second recognition result.
[0143] In one exemplary embodiment, an entity recognition device is provided, applied to a terminal. Referring to Figure 2, in this device, the determining module 102 is used to:
[0144] If it is determined that the first recognition result includes all the entity words of the statement to be recognized, determine the first recognition result as the target recognition result.
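The control flow of the determining module 102 described in these embodiments can be sketched as below. The two recognizers and the two model-value functions are injected, since their internals are not restated here. The completeness check follows the one recited in claim 1, read as comparing each entity word's length in characters with the number of characters remaining after the last entity word; that reading is an assumption. The `Match` type and its `end` field, the treatment of an empty first result as incomplete, and the concatenation used to combine the first and second recognition results are likewise assumptions of this sketch.

```python
from typing import Callable, List, NamedTuple, Sequence


class Match(NamedTuple):
    word: str  # recognized entity word
    end: int   # index just past the word in the statement (hypothetical field)


def includes_all_entity_words(statement: str, first_result: Sequence[Match]) -> bool:
    """True if every entity word is longer than the text left after the last one."""
    if not first_result:
        return False  # assumption: an empty first result is treated as incomplete
    remaining = len(statement) - max(m.end for m in first_result)
    return all(len(m.word) > remaining for m in first_result)


def recognize(
    statement: str,
    dictionary_recognize: Callable[[str], List[Match]],
    entropy_recognize: Callable[[str], List[str]],
    first_model_value: Callable[[str], float],
    second_model_value: Callable[[str], float],
    first_threshold: float = 0.85,
    second_threshold: float = 0.80,
) -> List[str]:
    first_result = dictionary_recognize(statement)
    first_words = [m.word for m in first_result]
    if includes_all_entity_words(statement, first_result):
        return first_words  # the first recognition result is the target result
    pending_result = entropy_recognize(statement)
    pending_entity_words = [w for w in pending_result if w not in first_words]
    second_result = [
        w for w in pending_entity_words
        if first_model_value(w) >= first_threshold
        and second_model_value(w) >= second_threshold
    ]
    return first_words + second_result  # assumed way of combining both results


def toy_dictionary_recognize(statement: str) -> List[Match]:
    # Hypothetical stand-in: literal lookup of two set words.
    return [Match(w, statement.find(w) + len(w)) for w in ("小米", "手机") if w in statement]


SCORES = {"屏幕": (0.9, 0.9), "碎了": (0.9, 0.2)}  # made-up model values
target = recognize(
    "小米手机屏幕碎了",
    toy_dictionary_recognize,
    lambda s: ["手机", "屏幕", "碎了"],  # made-up pending recognition result
    lambda w: SCORES[w][0],
    lambda w: SCORES[w][1],
)
```

In the example, four characters remain after "手机", so the first recognition result is judged incomplete and the fallback adds "屏幕". For a statement such as "打开小米手机", nothing remains after the last entity word and the fallback is never invoked.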
[0145] In one exemplary embodiment, a terminal is provided, such as a mobile phone, a laptop computer, a tablet computer, or a wearable device.
[0146] Referring to Figure 3, terminal 400 may include one or more of the following components: processing component 402, memory 404, power supply component 406, multimedia component 408, audio component 410, input/output (I/O) interface 412, sensor component 414, and communication component 416.
[0147] Processing component 402 typically controls the overall operation of terminal 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording. Processing component 402 may include one or more processors 420 to execute instructions to complete all or part of the steps of the methods described above. Furthermore, processing component 402 may include one or more modules to facilitate interaction between processing component 402 and other components. For example, processing component 402 may include a multimedia module to facilitate interaction between multimedia component 408 and processing component 402.
[0148] Memory 404 is configured to store various types of data to support operation on terminal 400. Examples of this data include instructions for any application or method operating on terminal 400, contact data, phonebook data, messages, pictures, videos, etc. Memory 404 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0149] Power supply component 406 provides power to various components of terminal 400. Power supply component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to terminal 400.
[0150] Multimedia component 408 includes a screen that provides an output interface between terminal 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 408 includes a front-facing camera and/or a rear-facing camera. When terminal 400 is in an operating mode, such as shooting mode or video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
[0151] Audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a microphone (MIC) configured to receive external audio signals when terminal 400 is in an operating mode, such as call mode, recording mode, or voice recognition mode. The received audio signals may be further stored in memory 404 or transmitted via communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
[0152] I/O interface 412 provides an interface between processing component 402 and peripheral interface modules, such as a keyboard, a click wheel, or buttons. These buttons may include, but are not limited to, a home button, volume buttons, a power button, and a lock button.
[0153] Sensor component 414 includes one or more sensors for providing state assessments of various aspects of terminal 400. For example, sensor component 414 may detect the on/off state of terminal 400, the relative positioning of components such as the display and keypad of terminal 400, changes in the position of terminal 400 or of a component of terminal 400, the presence or absence of user contact with terminal 400, the orientation or acceleration/deceleration of terminal 400, and temperature changes of terminal 400. Sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor component 414 may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.
[0154] Communication component 416 is configured to facilitate wired or wireless communication between terminal 400 and other terminals. Terminal 400 can access wireless networks based on communication standards, such as WiFi, 2G, 3G, 4G, 5G, or combinations thereof. In one exemplary embodiment, communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0155] In an exemplary embodiment, terminal 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.
[0156] In one exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as memory 404 including instructions, which can be executed by processor 420 of terminal 400 to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device. When the instructions in the storage medium are executed by the terminal's processor, the terminal is able to perform the method shown in the above embodiments.
[0157] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the claims.
[0159] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. An entity recognition method, applied to a terminal, characterized in that the method includes: acquiring a statement to be recognized; recognizing the statement to be recognized based on a set dictionary database, and determining a first recognition result; determining whether the first recognition result includes all entity words of the statement to be recognized; if it is determined that the first recognition result does not include all the entity words of the statement to be recognized, recognizing the statement to be recognized based on mutual information and left and right information entropy to determine a pending recognition result; and determining a target recognition result based on the first recognition result and the pending recognition result; wherein determining whether the first recognition result includes all entity words of the statement to be recognized includes: determining, based on the number of remaining characters after the last entity word in the first recognition result and the number of characters of each entity word in the first recognition result, whether the first recognition result includes all entity words of the statement to be recognized; if the number of characters of each entity word is greater than the number of remaining characters, determining that the first recognition result includes all entity words of the statement to be recognized; and if the number of characters of at least one entity word is less than or equal to the number of remaining characters, determining that the first recognition result does not include all entity words of the statement to be recognized.
2. The method according to claim 1, characterized in that determining the target recognition result based on the first recognition result and the pending recognition result includes: determining words in the pending recognition result that differ from the entity words in the first recognition result as pending entity words; determining a second recognition result based on the pending entity words that meet a first set condition; and determining the target recognition result based on the first recognition result and the second recognition result.
3. The method according to claim 2, characterized in that determining the second recognition result based on the pending entity words that meet the first set condition includes: if it is determined that the first model value of a pending entity word is greater than or equal to a first threshold, and that the second model value of the pending entity word is greater than or equal to a second threshold, determining the pending entity word as a second entity word; and determining the recognition result consisting of all the second entity words as the second recognition result.
4. The method according to claim 1, characterized in that the method further includes: if it is determined that the first recognition result includes all entity words of the statement to be recognized, determining the first recognition result as the target recognition result.
5. The method according to any one of claims 1-4, characterized in that the set dictionary database is obtained as follows: determining a statement library based on statements in a defined domain; segmenting the statements in the statement library into words to determine a first word library; identifying the statements in the statement library based on mutual information and left and right information entropy to determine a pending word library; and determining the set dictionary database based on the first word library and the pending word library.
6. The method according to claim 5, characterized in that determining the set dictionary database based on the first word library and the pending word library includes: determining words in the pending word library that differ from the set words in the first word library as pending set words; determining a second word library based on the pending set words that meet a second set condition; and determining the set dictionary database based on the first word library and the second word library.
7. The method according to claim 6, characterized in that determining the second word library based on the pending set words that meet the second set condition includes: if it is determined that the first model value of a pending set word is greater than or equal to a third threshold, and that the second model value of the pending set word is greater than or equal to a fourth threshold, determining the pending set word as a second set word; and determining the word library consisting of all the second set words as the second word library.
8. An entity recognition device, applied to a terminal, characterized in that the device includes: an acquisition module, used to acquire a statement to be recognized; and a determining module, used to recognize the statement to be recognized based on a set dictionary database and determine a first recognition result; further used to determine whether the first recognition result includes all entity words of the statement to be recognized; further used to, if it is determined that the first recognition result does not include all the entity words of the statement to be recognized, recognize the statement to be recognized based on mutual information and left and right information entropy to determine a pending recognition result; and further used to determine a target recognition result based on the first recognition result and the pending recognition result; wherein determining whether the first recognition result includes all entity words of the statement to be recognized includes: determining, based on the number of remaining characters after the last entity word in the first recognition result and the number of characters of each entity word in the first recognition result, whether the first recognition result includes all entity words of the statement to be recognized; if the number of characters of each entity word is greater than the number of remaining characters, determining that the first recognition result includes all entity words of the statement to be recognized; and if the number of characters of at least one entity word is less than or equal to the number of remaining characters, determining that the first recognition result does not include all entity words of the statement to be recognized.
9. The device according to claim 8, characterized in that the determining module is used to: determine words in the pending recognition result that differ from the entity words in the first recognition result as pending entity words; determine a second recognition result based on the pending entity words that meet a first set condition; and determine the target recognition result based on the first recognition result and the second recognition result.
10. The device according to claim 9, characterized in that the determining module is used to: if it is determined that the first model value of a pending entity word is greater than or equal to a first threshold, and that the second model value of the pending entity word is greater than or equal to a second threshold, determine the pending entity word as a second entity word; and determine the recognition result consisting of all the second entity words as the second recognition result.
11. The device according to claim 8, characterized in that the determining module is used to: if it is determined that the first recognition result includes all entity words of the statement to be recognized, determine the first recognition result as the target recognition result.
12. A terminal, characterized in that the terminal includes: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method according to any one of claims 1-7.
13. A non-transitory computer-readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of a terminal, the terminal is able to perform the method according to any one of claims 1-7.