Method and device for extracting entity related information, electronic device and storage medium
A technology related to information and entities, applied in the field of information processing, which can solve problems such as the inability to meet the performance requirements of extracting entity-related information and poor user experience.
Active Publication Date: 2019-05-21
BEIJING BAIDU NETCOM SCI & TECH CO LTD
Cites: 9 · Cited by: 11
AI-Extracted Technical Summary
Problems solved by technology
[0003] However, there are still various problems and deficiencies in these traditional solutions for extracting entity-related information. In many cases, the pe...
Abstract
The embodiment of the invention provides a method and a device for extracting entity related information, an electronic device and a computer readable storage medium. In the method, a computing device obtains a plurality of candidate texts associated with a predetermined entity and a predetermined attribute. Further, the computing device determines at least one target text from the plurality of candidate texts based on semantics of an entity attribute pair formed by the predetermined entity and the predetermined attribute. Further, the computing device determines an attribute value of the predetermined attribute of the predetermined entity based on the at least one target text. According to the embodiment of the invention, the timeliness can be improved and the labor cost can be reduced when the entity related information is extracted.
Application Domain
Text database querying; text database clustering/classification (+1)
Technology Topic
Information retrieval; target text (+3)
Examples
- Experimental program(1)
Example Embodiment
[0017] The principle and spirit of the present disclosure will be described below with reference to several exemplary embodiments shown in the accompanying drawings. It should be understood that these specific embodiments are described only to enable those skilled in the art to better understand and implement the present disclosure, but not to limit the scope of the present disclosure in any way.
[0018] As mentioned above, traditional entity relationship extraction methods mainly include purely open extraction methods and structured extraction methods. However, both traditional extraction methods have problems and shortcomings. For example, the purely open extraction method is mainly used for batch extraction of knowledge, but its extraction and update cycles for new entities and new knowledge are relatively long, so it cannot solve the problem of time-sensitive knowledge updates. On the other hand, the main disadvantage of the structured extraction method is its relatively high labor cost: extraction templates need to be manually configured according to the web page structure, and only a certain degree of directional extraction can be achieved. By configuring templates for a target category, direction at the category granularity can be achieved, but direction at the "entity + attribute" granularity cannot yet be achieved.
[0019] In view of the above-mentioned problems and other potential problems in the traditional solutions, the embodiments of the present disclosure propose a method, apparatus, electronic device, and computer-readable storage medium for extracting entity-related information, so as to improve timeliness and reduce labor costs when extracting entity-related information. Specifically, the embodiments of the present disclosure propose a directional knowledge extraction technology, which is mainly used to extract, in a targeted manner, the attribute value corresponding to a given "entity-attribute" two-tuple. The proposed directional extraction technology aims to directionally extract high-confidence entity relationship data from text libraries (for example, massive Internet texts) through information extraction technology.
[0020] From the perspective of knowledge graph construction, the proposed directional extraction technology can extract the missing relationship attribute values of entities, which can be used to improve the connectivity of the knowledge graph and efficiently improve the richness and completeness of its knowledge. From the perspective of product application, the supplemented entity relationship data can directly meet users' needs for entity association, and can also effectively improve the efficiency of searching for and browsing entities, improving the user experience. Typical applications include entity question answering, entity recommendation, and the like.
[0021] Compared with traditional entity information extraction solutions, the embodiments of the present disclosure solve, on the one hand, the problem of timeliness. If a new entity appears, or an entity becomes highly popular within a short period of time, each embodiment can quickly extract the missing attribute values of the new or highly popular entity owing to the short update time, supplementing the entity's attributes and improving the knowledge graph's timely coverage of "entity-attribute-attribute value" triples. On the other hand, the embodiments of the present disclosure reduce labor costs. For example, a deep learning model is used to uniformly model all "entity-attribute-attribute value" relationships, so there is no need for in-depth domain knowledge or for designing complicated handcrafted features, which makes the system easy to maintain and extend. Several embodiments of the present disclosure are described below with reference to the accompanying drawings.
[0022] FIG. 1 shows a schematic diagram of an example environment (or system) 100 in which some embodiments of the present disclosure can be implemented. As shown in FIG. 1, in the example environment 100, a predetermined entity 105 and a predetermined attribute 110 can be input into a computing device 120, so that the computing device 120 obtains, from the texts of a text library (not shown), an attribute value 160 of the predetermined attribute 110 of the predetermined entity 105. In some embodiments, the text library may include a collection of texts obtained from the Internet. In other embodiments, the text library may include any appropriate text collection describing any attribute of any entity, including but not limited to text collections of various uses and sources.
[0023] In the context of the present disclosure, the term "entity" refers to something that is distinguishable and exists independently, such as a certain person, a certain city, a certain kind of plant, a certain kind of commodity, and so on. Everything in the world is made up of concrete things, all of which can be called entities. For example, "China", "United States", "Japan", etc. The term "attribute" refers to a certain property of an entity or the relationship between an entity and another entity. For example, attributes can refer to a person's height, gender, birthplace, and so on. In addition, attributes can also refer to the relationship between an entity and another entity. For example, husband, father, friend, etc. The term "attribute value" refers to the specific content of an entity's attributes or another entity that has a certain relationship with the entity. For example, the attribute value of the attribute "gender" of a certain person may be "male". For another example, the attribute value that has a certain relationship attribute (for example, wife) with a certain entity (for example, Yao Ming) may be another entity (for example, Ye Li). It should be understood that the above definitions of various terms are only exemplary to help understand the present disclosure, and are not intended to limit the scope of the present disclosure in any way. In other embodiments, various terms used herein will conform to the technical meaning generally understood by those skilled in the art.
[0024] Continuing to refer to FIG. 1, the computing device 120 may obtain a plurality of candidate texts 140-1 to 140-N associated with the predetermined entity 105 and the predetermined attribute 110 (hereinafter collectively referred to as the plurality of candidate texts 140). Because the plurality of candidate texts 140 are related to the predetermined entity 105 and the predetermined attribute 110, it is possible for the computing device 120 to extract the attribute value 160 from them. In addition, in order to improve the performance and robustness of the system 100, the computing device 120 may filter the plurality of candidate texts 140. To this end, the computing device 120 may determine, based on the semantics of the entity-attribute pair composed of the predetermined entity 105 and the predetermined attribute 110, at least one target text 150-1 to 150-M (hereinafter collectively referred to as the at least one target text 150) from the plurality of candidate texts 140 for extracting the attribute value 160, where M and N are both positive integers and M is less than or equal to N. Then, the computing device 120 may determine the attribute value 160 of the predetermined attribute 110 of the predetermined entity 105 based on the determined at least one target text 150.
[0025] It will be understood that the computing device 120 may be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, sites, units, devices, multimedia computers, multimedia tablets, Internet nodes, communicators, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDA), audio/video players, digital cameras/camcorders, positioning devices, TV receivers, radio broadcast receivers, e-book devices, game devices, or any combination thereof, including the accessories and peripherals of these devices or any combination thereof. It is also foreseeable that the computing device 120 can support any type of user-oriented interface (such as "wearable" circuitry). An example operation for extracting entity-related information according to an embodiment of the present disclosure is described below in conjunction with FIG. 2.
[0026] FIG. 2 shows a schematic flowchart of a method 200 for extracting entity-related information according to an embodiment of the present disclosure. In some embodiments, the method 200 may be implemented by the computing device 120 of FIG. 1, for example, by the processor or processing unit of the computing device 120. In other embodiments, all or part of the method 200 may also be implemented by a computing device independent of the computing device 120, or may be implemented by other units in the example environment 100. To facilitate the discussion, the method 200 is described in conjunction with FIG. 1.
[0027] At 210, the computing device 120 obtains a plurality of candidate texts 140 associated with the predetermined entity 105 and the predetermined attribute 110. It should be understood that the computing device 120 may use any appropriate method to obtain the plurality of candidate texts 140, as long as they are associated with the predetermined entity 105 and the predetermined attribute 110; the embodiments of the present disclosure are not limited in this respect. For example, for specific attributes of some specific entities, there may already exist a text collection that introduces or explains those attributes. In this case, the computing device 120 can obtain the plurality of candidate texts 140 by importing the text collection.
[0028] More generally, in some embodiments, the computing device 120 may obtain multiple candidate texts 140 by searching in a text library. For example, the computing device 120 may determine an entity search term corresponding to the predetermined entity 105 and an attribute search term corresponding to the predetermined attribute 110. Then, the computing device 120 can use the determined entity search terms and attribute search terms to retrieve multiple candidate texts 140 from the text database. In this way, the computing device 120 can find the text related to the predetermined entity 105 and the predetermined attribute 110 in the text library. As noted above, the text library used for retrieval may include a collection of texts obtained from the Internet. Additionally or alternatively, the text library used for retrieval may include any appropriate text collection describing any attribute of any entity, including but not limited to text collections of various purposes and sources.
[0029] In some embodiments, the entity search terms used by the computing device 120 may include the name of the predetermined entity 105, its aliases, other keywords that may refer to the predetermined entity 105, and any combination thereof. Similarly, the attribute search terms used by the computing device 120 may include the name of the predetermined attribute 110, its aliases, leading words, other keywords related to the predetermined attribute 110, and any combination thereof. As used herein, a leading word of an attribute can be used to introduce that attribute of an entity. For example, the leading word "marriage" can be used to introduce the entity attribute "spouse". In this way, the computing device 120 can avoid omitting texts related to the predetermined entity 105 and the predetermined attribute 110 in the search.
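The retrieval step described above can be sketched as follows. This is a minimal illustration assuming a simple in-memory corpus, a hypothetical alias table (`ENTITY_ALIASES`), and a hypothetical attribute-term table (`ATTRIBUTE_TERMS`) bundling names, aliases, and leading words; none of these names come from the disclosure.

```python
# Hypothetical lookup tables: entity names/aliases and attribute names/aliases/leading words.
ENTITY_ALIASES = {"Yao Ming": ["Yao Ming"]}
ATTRIBUTE_TERMS = {"spouse": ["spouse", "wife", "marriage"]}

def build_search_terms(entity, attribute):
    """Combine every entity search term with every attribute search term."""
    entity_terms = ENTITY_ALIASES.get(entity, [entity])
    attr_terms = ATTRIBUTE_TERMS.get(attribute, [attribute])
    return [(e, a) for e in entity_terms for a in attr_terms]

def retrieve_candidates(corpus, entity, attribute):
    """Return texts matching at least one (entity term, attribute term) pair."""
    candidates = []
    for text in corpus:
        for e, a in build_search_terms(entity, attribute):
            if e in text and a in text:
                candidates.append(text)
                break
    return candidates

corpus = [
    "Yao Ming announced his marriage in Shanghai.",
    "Yao Ming played basketball in Houston.",
    "The spouse of another athlete was interviewed.",
]
# Only the first text contains both an entity term and an attribute term
# (here, the leading word "marriage").
print(retrieve_candidates(corpus, "Yao Ming", "spouse"))
```

Including the leading word "marriage" among the attribute terms is what keeps the first text from being missed, illustrating why search terms go beyond the bare attribute name.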
[0030] In some embodiments, in order to extract relevant information of popular entities or new entities in a targeted manner, the computing device 120 may determine a newly-appearing entity, or an entity whose search frequency is higher than a threshold, as the predetermined entity 105. As an example of a popular entity, suppose there is currently a person with high social attention (such as a certain celebrity) who has a high search frequency on a search platform, which reflects that the person is an entity with high popularity within a short period of time. In this case, the computing device 120 may use the person as the predetermined entity 105. To this end, the computing device 120 may determine whether an entity has a high search frequency by comparing its search frequency with a predetermined threshold. It will be understood that the threshold here can be reasonably selected according to the specific system environment and design requirements. In addition, as an example of a new entity, if a newly built amusement park is about to be opened to the public, the amusement park is a newly-appearing entity. In this case, the computing device 120 may use the amusement park as the predetermined entity 105.
[0031] After the predetermined entity 105 is determined, the computing device 120 may determine the predetermined attribute 110 based on the predetermined entity 105. For example, in a case where a certain celebrity is determined as the predetermined entity 105, the computing device 120 may correspondingly determine the predetermined attribute 110 as an attribute related to the celebrity, such as height, weight, birthplace, graduation school, boyfriend or girlfriend, and so on. For another example, in a case where a new amusement park is determined as the predetermined entity 105, the computing device 120 may correspondingly determine the predetermined attribute 110 as an attribute related to the amusement park, such as specific address, floor area, business hours, amusement facilities, and so on.
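The selection of a popular or new entity as the predetermined entity can be sketched in a few lines; the threshold value and the `select_hot_entities` helper are illustrative assumptions, since the disclosure leaves the concrete threshold to the system environment and design requirements.

```python
# Assumed threshold: an entity searched more than this often counts as "popular".
SEARCH_FREQUENCY_THRESHOLD = 10000

def select_hot_entities(search_counts, known_entities):
    """Pick entities that are new (not yet in the knowledge graph) or whose
    search frequency exceeds the threshold."""
    selected = []
    for entity, count in search_counts.items():
        is_new = entity not in known_entities       # newly-appearing entity
        is_hot = count > SEARCH_FREQUENCY_THRESHOLD  # highly popular entity
        if is_new or is_hot:
            selected.append(entity)
    return selected

counts = {"Celebrity A": 50000, "Old Park": 200, "New Amusement Park": 300}
known = {"Celebrity A", "Old Park"}
# "Celebrity A" qualifies as popular, "New Amusement Park" as new,
# "Old Park" as neither.
print(select_hot_entities(counts, known))
```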
[0032] At 220, the computing device 120 determines at least one target text 150 from the plurality of candidate texts 140 based on the semantics of the entity-attribute pair formed by the predetermined entity 105 and the predetermined attribute 110. It will be understood that although the plurality of candidate texts 140 are associated with the predetermined entity 105 and the predetermined attribute 110, this does not mean that every candidate text 140 is semantically related to the entity-attribute pair consisting of the predetermined entity 105 and the predetermined attribute 110. For example, a certain text may include the entity "Yao Ming" and the attribute "height", but its semantics may not be related to "Yao Ming's height"; it may merely mention Yao Ming while describing the height of another person. Therefore, by selecting the at least one target text 150 based on the semantics of the entity-attribute pair, the computing device 120 can filter all obtained candidate texts 140, retaining only the texts that are semantically related to the entity-attribute pair and from which the attribute value 160 can be extracted, thereby reducing the amount of text used to extract the attribute value 160 and improving the performance and robustness of the system 100.
[0033] In some embodiments, for a given candidate text 140-1 among the plurality of candidate texts 140, the computing device 120 may process the candidate text 140-1 to determine its semantics. For example, the computing device 120 may obtain the word segmentation and part-of-speech recognition results of the candidate text 140-1 through a part-of-speech recognition tool, obtain the dependency recognition results of the sentences of the candidate text 140-1 through a dependency analysis tool, and obtain the entity recognition and upper-concept recognition results of the candidate text 140-1 through a subgraph association tool. It should be understood that the computing device 120 may also determine the semantics of the candidate text 140-1 through any other semantic analysis method.
[0034] Then, the computing device 120 may determine the similarity between the semantics of the candidate text 140-1 and the semantics of the entity-attribute pair of the predetermined entity 105 and the predetermined attribute 110. For example, the computing device 120 may call a semantically-related text validity classification model (also called an operator) to calculate the semantic relevance, and call a classification algorithm to determine whether the semantics of the candidate text 140-1 are related to the semantics of the entity-attribute pair composed of the predetermined entity 105 and the predetermined attribute 110, and then filter semantically irrelevant texts out of the candidate texts 140. It should be understood that the computing device 120 may also determine the above-mentioned semantic relevance through any other method of determining semantic similarity. Then, if the determined semantic similarity is higher than a threshold, the computing device 120 may select the candidate text 140-1 as one of the at least one target text 150. It will be understood that the threshold here can be reasonably selected according to the specific system environment and design requirements.
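The target-text selection step can be illustrated as follows. A real system would call the trained text validity classification model described above; here a bag-of-words cosine similarity stands in for the semantic relevance score, and the threshold of 0.3 is an assumed value.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two strings (a toy stand-in
    for a learned semantic relevance model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_target_texts(candidates, entity, attribute, threshold=0.3):
    """Keep candidate texts whose similarity to the entity-attribute pair
    exceeds the threshold."""
    query = f"{entity} {attribute}"
    return [t for t in candidates if cosine_similarity(query, t) > threshold]

candidates = [
    "Yao Ming height is 226 cm",
    "The weather in Beijing is sunny",
]
# Only the first candidate shares terms with the query "Yao Ming height".
print(select_target_texts(candidates, "Yao Ming", "height"))
```

In practice the scoring function is replaced by the classification model, but the filter-by-threshold structure stays the same.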
[0035] In addition, in some embodiments, before determining the semantics of the plurality of candidate texts 140, the computing device 120 may also perform preliminary filtering on the plurality of candidate texts 140 to filter out candidate texts 140 that are irrelevant to the semantics of the entity-attribute pair of the predetermined entity 105 and the predetermined attribute 110. For example, the computing device 120 can perform preliminary filtering of the plurality of candidate texts 140 based on features such as whether a candidate text 140 contains the name of the predetermined entity 105 (including the entity's name, aliases, and the like), whether it contains the name of the predetermined attribute 110 (including the attribute's name, aliases, leading words, and the like), whether the text length is within a predefined length interval, and the proportion of Chinese characters in the text, so as to exclude candidate texts 140 that are obviously irrelevant to the semantics of the entity-attribute pair of the predetermined entity 105 and the predetermined attribute 110.
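The preliminary filtering features listed above can be sketched as a simple predicate; the specific length interval and Chinese-character ratio cutoff are illustrative assumptions, not values from the disclosure.

```python
def chinese_ratio(text):
    """Fraction of characters in the CJK Unified Ideographs range (U+4E00-U+9FFF)."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)

def passes_preliminary_filter(text, entity_names, attribute_names,
                              min_len=5, max_len=500, min_cjk=0.5):
    """Cheap pre-check before the semantic model runs: the text must mention
    the entity (name/alias) and the attribute (name/alias/leading word),
    have a reasonable length, and be mostly Chinese text.
    The numeric defaults are assumed values."""
    has_entity = any(name in text for name in entity_names)
    has_attribute = any(name in text for name in attribute_names)
    return (has_entity and has_attribute
            and min_len <= len(text) <= max_len
            and chinese_ratio(text) >= min_cjk)

# A Chinese sentence about Yao Ming's height passes; an English one fails
# the Chinese-character ratio check.
print(passes_preliminary_filter("姚明的身高是226厘米", ["姚明"], ["身高"]))
print(passes_preliminary_filter("Yao Ming is very tall", ["Yao Ming"], ["tall"]))
```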
[0036] At 230, the computing device 120 determines the attribute value 160 of the predetermined attribute 110 of the predetermined entity 105 based on the at least one target text 150. It should be understood that the computing device 120 may use any existing extraction method or extraction method developed in the future to extract the attribute value 160 from the at least one target text 150, and the embodiment of the present disclosure is not limited in this respect. For example, the computing device 120 may extract the attribute value 160 from the at least one target text 150 using an extraction model based on deep learning. Additionally or alternatively, the computing device 120 may also use other types of extraction models to extract the attribute value 160 from the at least one target text 150.
[0037] In some embodiments, in order to improve the extraction accuracy of the attribute value 160, the computing device 120 may use a plurality of different extraction models with different model structures to extract a plurality of candidate attribute values from the at least one target text 150 based on the predetermined entity 105 and the predetermined attribute 110. It will be understood that the plurality of different extraction models may include any models capable of extracting attribute values from a given text according to a predetermined entity and a predetermined attribute, for example, multiple neural-network-based extraction models with different neural network structures.
[0038] By way of example, the computing device 120 may use three different extraction models. The first extraction model can be a slot filling model, which is a deep learning model built on a deep learning computing framework (for example, the PaddlePaddle platform) for the slot filling task, that is, extracting an attribute value given a known entity and attribute. The other two extraction models can be reading comprehension models with two different structures, which are attribute value extraction models based on reading comprehension tasks. The two reading comprehension models can convert the entity and the attribute into a query, and use the query and the text as model inputs to mark the starting and ending positions of the attribute value in the text. It will be understood that the specific models and the number of models given here are only exemplary, and are not intended to limit the scope of the present disclosure in any way. In other embodiments, the computing device 120 may use any number of different models to extract the attribute value 160.
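The reading-comprehension formulation can be illustrated with a toy sketch: the entity-attribute pair becomes a query, and the model's job is to mark the start and end positions of the attribute value in the text. The query template and the trivial string-matching stand-in below are assumptions; a real implementation would use a trained neural model.

```python
def build_query(entity, attribute):
    """Convert an (entity, attribute) pair into a natural-language query
    (illustrative template, not the disclosed one)."""
    return f"What is the {attribute} of {entity}?"

def mark_answer_span(text, answer):
    """Stand-in for a reading comprehension model: return (start, end)
    character positions of the answer span in the text, or None if absent.
    A trained model would predict these positions from the query and text."""
    start = text.find(answer)
    if start == -1:
        return None
    return (start, start + len(answer))

query = build_query("Yao Ming", "birthplace")
text = "Yao Ming was born in Shanghai in 1980."
print(query)                               # the query fed to the model
print(mark_answer_span(text, "Shanghai"))  # (21, 29): the answer span
```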
[0039] After extracting the multiple candidate attribute values using the extraction models with different model structures, the computing device 120 may determine the respective confidence levels of the multiple candidate attribute values. As an example, assuming that the predetermined entity 105 is "Yao Ming" and the predetermined attribute 110 is "birthplace", the multiple candidate attribute values extracted from the at least one target text 150 by the multiple different models may be China, the United States, Beijing, and Shanghai. In this case, the computing device 120 can determine the respective confidence levels of the four candidate attribute values, that is, the probability that each of them is Yao Ming's correct birthplace.
[0040] It will be understood that the computing device 120 may determine the confidence level of a candidate attribute value in any suitable manner, including but not limited to obtaining it through an attribute value extraction model, verifying it against other databases, determining its relevance to other attributes of the predetermined entity, and so on. For example, in the above example regarding Yao Ming's birthplace, the computing device 120 may determine that the respective confidence levels of China, the United States, Beijing, and Shanghai are 0.7, 0.3, 0.5, and 0.8.
[0041] Then, the computing device 120 may select an attribute value with a confidence level higher than the threshold value from a plurality of candidate attribute values. As an example, the threshold here can be set to 0.7, so the computing device 120 can select "Shanghai" as the attribute value of the predetermined attribute "Birthplace" of the predetermined entity "Yao Ming". It should be understood that the specific numerical values and place names given here are only examples, and are not intended to limit the scope of the present disclosure in any way. In addition, the threshold here can be selected reasonably according to the specific system environment and design requirements. As an alternative way of selecting an attribute value from multiple candidate attribute values, the computing device 120 may also select the attribute value with the highest confidence from the multiple candidate attribute values.
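The selection step can be sketched with the example confidences from the text. The rule of returning the highest-scoring candidate when several exceed the threshold is an assumption that covers both selection strategies mentioned (threshold-based and highest-confidence).

```python
def select_attribute_value(candidates, threshold=0.7):
    """Return the candidate attribute value with the highest confidence among
    those strictly above the threshold, or None if none qualifies.
    The threshold default of 0.7 follows the example in the text."""
    above = {value: conf for value, conf in candidates.items() if conf > threshold}
    if not above:
        return None
    return max(above, key=above.get)

# Example confidences for the predetermined entity "Yao Ming" and the
# predetermined attribute "birthplace": only "Shanghai" exceeds 0.7.
confidences = {"China": 0.7, "United States": 0.3, "Beijing": 0.5, "Shanghai": 0.8}
print(select_attribute_value(confidences))
```

Note that with a strict `>` comparison, "China" at exactly 0.7 does not qualify, matching the text's choice of "Shanghai".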
[0042] In some embodiments, the at least one target text 150 may include a plurality of target texts 150-1 to 150-M. In this case, different extraction models may extract the same candidate attribute value from different target texts. Therefore, in order to determine the corresponding confidence level of each of the multiple candidate attribute values, for a given candidate attribute value, the computing device 120 may determine the multiple pairs of extraction model and target text from which the given candidate attribute value was extracted.
[0043] Continuing the example used above, without loss of generality, assume that the candidate attribute value "Shanghai" is extracted from the first target text 150-1 by the first extraction model, from the second target text 150-2 by the first extraction model, from the second target text 150-2 by the second extraction model, from the fourth target text 150-4 by the second extraction model, and from the third target text 150-3 by the third extraction model. In this case, for the candidate attribute value "Shanghai", the computing device 120 may determine the following multiple pairs from which the attribute value "Shanghai" was extracted: the first extraction model and the first target text 150-1, the first extraction model and the second target text 150-2, the second extraction model and the second target text 150-2, the second extraction model and the fourth target text 150-4, and the third extraction model and the third target text 150-3.
[0044] Then, the computing device 120 may obtain multiple confidence scores of the candidate attribute value, the multiple confidence scores being respectively associated with the multiple pairs. For example, continuing the above example, for the candidate attribute value "Shanghai", the first extraction model gives a confidence score of 0.6 for the first target text 150-1, the first extraction model gives a confidence score of 0.5 for the second target text 150-2, the second extraction model gives a confidence score of 0.8 for the second target text 150-2, the second extraction model gives a confidence score of 0.7 for the fourth target text 150-4, and the third extraction model gives a confidence score of 0.6 for the third target text 150-3. In this case, the computing device 120 obtains multiple confidence scores of 0.6, 0.5, 0.8, 0.7, and 0.6 for the candidate attribute value "Shanghai".
[0045] Then, the computing device 120 may add the multiple confidence scores of the candidate attribute value to obtain the confidence of the candidate attribute value. In the above example, the computing device 120 may add multiple confidence scores of 0.6, 0.5, 0.8, 0.7, and 0.6 for the attribute value "Shanghai" to determine that the confidence level of the candidate attribute value "Shanghai" is 3.2. In this way, the computing device 120 can comprehensively evaluate the confidence level of a certain candidate attribute value in a quantitative manner. Similarly, the computing device 120 may calculate the confidence of other candidate attribute values (such as China, the United States, and Beijing), and finally select the attribute value with the confidence higher than the threshold.
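The fusion computation described above can be reproduced in a few lines, using the worked example of five (model, text) pairs that each score the candidate "Shanghai". The tuple layout of the (value, model, text, score) records is an illustrative assumption.

```python
def aggregate_confidence(pair_scores):
    """Sum the confidence scores from all (extraction model, target text)
    pairs for each candidate attribute value."""
    totals = {}
    for value, _model, _text, score in pair_scores:
        totals[value] = totals.get(value, 0.0) + score
    return totals

# The five pairs and scores from the example in the text.
pairs = [
    ("Shanghai", "model_1", "text_1", 0.6),
    ("Shanghai", "model_1", "text_2", 0.5),
    ("Shanghai", "model_2", "text_2", 0.8),
    ("Shanghai", "model_2", "text_4", 0.7),
    ("Shanghai", "model_3", "text_3", 0.6),
]
totals = aggregate_confidence(pairs)
# The five scores sum to 3.2 for "Shanghai" (rounded to absorb float error).
print({v: round(c, 2) for v, c in totals.items()})
```

The same function would accumulate scores for the other candidates (China, the United States, Beijing) when their pairs are appended to the list, after which the thresholding step selects the final attribute value.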
[0046] FIG. 3 shows a schematic block diagram of an apparatus 300 for extracting entity-related information according to an embodiment of the present disclosure. In some embodiments, the apparatus 300 may be included in the computing device 120 of FIG. 1 or implemented as the computing device 120.
[0047] As shown in FIG. 3, the apparatus 300 includes a candidate text obtaining module 310, a target text determination module 320, and an attribute value determination module 330. The candidate text obtaining module 310 is configured to obtain a plurality of candidate texts associated with a predetermined entity and a predetermined attribute. The target text determination module 320 is configured to determine at least one target text from the plurality of candidate texts based on the semantics of an entity-attribute pair formed by the predetermined entity and the predetermined attribute. The attribute value determination module 330 is configured to determine an attribute value of the predetermined attribute of the predetermined entity based on the at least one target text.
[0048] In some embodiments, the candidate text obtaining module 310 includes: a search term determination module configured to determine entity search terms corresponding to the predetermined entity and attribute search terms corresponding to the predetermined attribute; and a search module configured to retrieve the plurality of candidate texts from a text library using the entity search terms and the attribute search terms.
[0049] In some embodiments, the entity search terms include at least one of the name and aliases of the predetermined entity, and the attribute search terms include at least one of the name, aliases, and leading words of the predetermined attribute, where a leading word is used to introduce an attribute of the predetermined entity.
[0050] In some embodiments, the apparatus 300 further includes: a predetermined entity determination module configured to determine a newly-appearing entity, or an entity whose search frequency is higher than a threshold, as the predetermined entity; and a predetermined attribute determination module configured to determine the predetermined attribute based on the predetermined entity.
[0051] In some embodiments, for a given candidate text among the plurality of candidate texts, the target text determination module 320 includes: a processing module configured to process the given candidate text to determine the semantics of the given candidate text; a similarity determination module configured to determine the similarity between the semantics of the given candidate text and the semantics of the entity-attribute pair; and a target text selection module configured to select the given candidate text as one of the at least one target text in response to the similarity being higher than a threshold.
[0052] In some embodiments, the attribute value determination module 330 includes: an attribute value extraction module configured to extract a plurality of candidate attribute values from the at least one target text based on the predetermined entity and the predetermined attribute, using a plurality of different extraction models with different model structures; a confidence determination module configured to determine the respective confidence levels of the multiple candidate attribute values; and an attribute value selection module configured to select an attribute value with a confidence level higher than a threshold from the multiple candidate attribute values.
[0053] In some embodiments, the at least one target text includes multiple target texts, and for a given candidate attribute value among the multiple candidate attribute values, the confidence determination module includes: a pairing determination module configured to determine the multiple pairs of extraction model and target text from which the given candidate attribute value was extracted; a score obtaining module configured to obtain multiple confidence scores of the given candidate attribute value respectively associated with the multiple pairs; and an addition module configured to add the multiple confidence scores to obtain the confidence of the given candidate attribute value.
[0054] Figure 4 shows a schematic block diagram of a general technical framework 400 for extracting attribute values of entity attributes according to an embodiment of the present disclosure. As shown in Figure 4, the general technical framework 400 may include an attribute value extraction tool 401 and an external tool 403. In some embodiments, the attribute value extraction tool 401 may utilize the external tool 403 to implement the embodiments of the present disclosure, such as the method 200 described with reference to Figure 2. For example, given an input predetermined entity-attribute pair 405, the attribute value extraction tool 401 may extract from a text library the attribute value 407 corresponding to the predetermined entity and the predetermined attribute.
[0055] The attribute value extraction tool 401 includes a text retrieval module 410, a text validity classification module 420, an attribute value extraction model 430, and a multi-source fusion module 440. Each module of the attribute value extraction tool 401 may use the retrieval interface 450, the library scanning tool 460, the dependency analysis and part-of-speech recognition module 470, the subgraph association module 480, and the deep learning framework 490 of the external tool 403 to extract the attribute value 407, as described below.
[0056] The main function of the text retrieval module 410 may include obtaining, through the retrieval interface 450 and the library scanning tool 460 (such as the seeksign library scanning tool), the corpus texts for attribute value extraction according to the input predetermined entity-attribute pair 405. The text retrieval module 410 supports obtaining text information related to the predetermined entity-attribute pair from multiple text retrieval model sources, and other models can easily be added as extensions.
[0057] In addition, considering that entities often share the same name, the text retrieval module 410 may combine two text acquisition methods: entity granularity and text granularity. Entity granularity refers to extracting only the text information corresponding to the input entity, without considering other entities with the same name. Text granularity refers to considering, at the same time, all text information corresponding to all entities with the same name. In some embodiments, the text retrieval models of the text retrieval module 410 may include four categories: encyclopedia texts, entity pages, a question-and-answer text database, and relevant web page results obtained from Dasou search. The first two categories may be entity-granular; the latter two may be text-granular.
[0058] The main function of the text validity classification module 420 may include filtering and classifying all the texts obtained by the text retrieval module 410 to reduce the amount of text sent to subsequent modules, retaining only texts that are semantically related to the predetermined entity-attribute pair and from which attribute values can be extracted, thereby improving the performance and robustness of the system. In some embodiments, the text validity classification module 420 may implement, for example, a semantic-independent initial filtering function, a semantic information acquisition function, and a semantic-related classification function.
[0059] The semantic-independent initial filtering function may perform initial filtering based on, for example, whether the text contains the entity's name (including the entity's name and aliases), whether it contains the attribute's name (including the attribute's name, aliases, and leading words), the length of the text, and the proportion of Chinese characters in the text. The semantic information acquisition function may, for example, obtain word segmentation and part-of-speech recognition results through a part-of-speech recognition tool, obtain dependency recognition results for sentences through a dependency analysis tool, and obtain entity recognition and hypernym concept recognition results through a subgraph association tool. The semantic-related classification function may, for example, call a semantic-related text validity classification model to compute semantic features, and call a classification algorithm to determine whether the text is semantically related to the predetermined entity and the predetermined attribute, thereby filtering out semantically irrelevant texts.
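The semantic-independent initial filtering function can be sketched as a cheap pre-filter applied before any semantic model runs. The thresholds, parameter names, and the exact combination of checks below are illustrative assumptions; the disclosure only names the signals (entity name/aliases, attribute name/aliases, text length, Chinese character proportion).

```python
def chinese_ratio(text):
    """Fraction of characters in the CJK Unified Ideographs range."""
    if not text:
        return 0.0
    return sum('\u4e00' <= ch <= '\u9fff' for ch in text) / len(text)

def initial_filter(text, entity_names, attribute_names,
                   min_len=5, max_len=500, min_chinese=0.5):
    """Semantic-independent pre-filter: keep a text only if it mentions the
    entity (any name or alias), mentions the attribute (name, alias, or
    leading word), and has a plausible length and Chinese-character ratio.
    All thresholds here are illustrative, not taken from the patent."""
    if not (min_len <= len(text) <= max_len):
        return False
    if not any(name in text for name in entity_names):
        return False
    if not any(name in text for name in attribute_names):
        return False
    return chinese_ratio(text) >= min_chinese
```

Because these checks are pure string operations, they can discard the bulk of retrieved texts before the more expensive dependency analysis and classification steps.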
[0060] The main function of the attribute value extraction model 430 may include extracting, given a predetermined entity, a predetermined attribute, and a text for attribute value extraction, the attribute value corresponding to the entity-attribute pair from the text. The attribute value extraction model 430 supports adding multiple extraction models, that is, results are obtained separately through multiple extraction models, and the models are easy to extend.
[0061] The input of the multi-source fusion module 440 may be entity-attribute-text-attribute value tuples, and the output may be entity-attribute-attribute value tuples. Its main function may include calling a knowledge fusion model to perform multi-source fusion over the multiple attribute values extracted by the extraction models for each entity-attribute pair, so as to select the best attribute value produced from the multiple target texts, and finally to output the attribute value 407. In the multi-source fusion module 440, the extraction results of the multiple extraction models in the attribute value extraction model 430 can easily be extended into the set of candidate attribute values participating in the selection.
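The input/output contract of the multi-source fusion step can be illustrated with a small sketch: entity-attribute-text-attribute value records go in, one best value per entity-attribute pair comes out. The scoring rule below (pick the value with the highest total score across sources) is a simplifying assumption standing in for the knowledge fusion model; the tuple layout and names are likewise illustrative.

```python
from collections import defaultdict

def multi_source_fusion(records):
    """records: iterable of (entity, attribute, text_id, value, score) tuples,
    one per extraction result from some (model, text) source.
    Output: {(entity, attribute): best_value}, where 'best' means the value
    with the highest total score across all sources — a placeholder for the
    knowledge fusion model described in the disclosure."""
    totals = defaultdict(float)
    for entity, attribute, _text, value, score in records:
        totals[(entity, attribute, value)] += score
    best = {}
    for (entity, attribute, value), total in totals.items():
        key = (entity, attribute)
        if key not in best or total > best[key][1]:
            best[key] = (value, total)
    return {key: value for key, (value, _total) in best.items()}
```

A richer fusion model could replace the score sum with any learned aggregation while keeping the same tuple-in, pair-out interface.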
[0062] Figure 5 schematically shows a block diagram of a device 500 that can be used to implement embodiments of the present disclosure. As shown in Figure 5, the device 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processing according to computer program instructions stored in a read-only memory (ROM) 502 or computer program instructions loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The CPU 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
[0063] Multiple components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays or speakers; a storage unit 508, such as a magnetic disk or an optical disc; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
[0064] The various processes and processing described above, for example the method 200, may be executed by the processing unit 501. For example, in some embodiments, the method 200 may be implemented as a computer software program that is tangibly contained in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the CPU 501, one or more steps of the method 200 described above may be performed.
[0065] As used herein, the term "including" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and so on may refer to different or the same objects. Other explicit and implicit definitions may also be included herein.
[0066] As used herein, the term "determine" encompasses various actions. For example, "determining" may include computing, calculating, processing, deriving, investigating, looking up (for example, looking up in a table, database, or another data structure), ascertaining, and so on. In addition, "determining" may include receiving (for example, receiving information), accessing (for example, accessing data in a memory), and the like. In addition, "determining" may include resolving, selecting, choosing, establishing, and so on.
[0067] It should be noted that the embodiments of the present disclosure may be implemented by hardware, software, or a combination of software and hardware. The hardware part may be implemented using dedicated logic; the software part may be stored in a memory and executed by an appropriate instruction execution system, such as a microprocessor or dedicated design hardware. Those skilled in the art will understand that the above devices and methods may be implemented using computer-executable instructions and/or processor control code; for example, such code may be provided on a programmable memory or on a data carrier such as an optical or electronic signal carrier.
[0068] In addition, although the operations of the method of the present disclosure are described in a specific order in the drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the operations shown must be performed to achieve the desired result. Rather, the steps depicted in the flowchart may change their order of execution. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution. It should also be noted that the features and functions of two or more devices according to the present disclosure may be embodied in one device. Conversely, the features and functions of one device described above may be further divided among multiple devices.
[0069] Although the present disclosure has been described with reference to several specific embodiments, it should be understood that the present disclosure is not limited to the specific embodiments disclosed. The present disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.