Entity attribute extraction method and system oriented to open web page

An entity attribute extraction technology applied in network data retrieval, unstructured text data retrieval, instruments, etc. It addresses problems such as variable text length, low precision and recall, and lack of support for nested rule matching, with the effect of improving machine learning, improving precision and recall, and improving extraction efficiency and accuracy.

Active Publication Date: 2015-05-20
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0005] 3. The length of the text is not fixed, and there is a large gap in the amount of contextual information;
[0006] 4. The data source is not fixed, and the language phenomena are complex.
[0009] First, the text structure of open web pages is not fixed; entities and their descriptions follow no fixed rules and mostly appear as free text, which is difficult to extract and analyze;
[0010] Second, in traditional rule-based attribute extraction methods, the rule definitions are rigid and overly dependent on context grammar, and matching efficiency is low;
[0011] Third, the data sources of open web pages are not fixed, the language phenomena are complex, and common rules can hardly cover them. Traditional rule-based attribute extraction also does not support nested matching of rules;
[0012] Fourth, in traditional statistics-based entity attribute extraction methods, preparing training data relies heavily on manual work and is inefficient, and precision and recall are low;
[0013] Fifth, traditional attribute extraction is mostly limited to a particular field or discipline, and such systems cannot be ported directly to other fields or disciplines. They lack universally applicable features and are not easy to port or extend.



Detailed Description of the Embodiments

[0070] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0071] According to an embodiment of the present invention, a method for extracting entity attributes oriented to open web pages is provided.

[0072] In brief, the method includes: extracting the text of the open web page to obtain a candidate text set for the target entity; and, according to the frequency with which the target entity attribute appears in the training text set, selecting either a rule-based method or a statistics-based method to extract the value of the target entity attribute from the candidate text set.
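The selection step in [0072] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. The threshold `FREQ_THRESHOLD`, the direction of the choice (frequent attributes go to the statistical extractor, rare ones to rules), the toy regular-expression rule, and the voting stand-in for a trained model are all assumptions introduced here; the text above only states that the choice depends on how often the attribute appears in the training text set.

```python
import re
from collections import Counter
from typing import List, Optional

# Hypothetical cut-off; the source does not give a value.
FREQ_THRESHOLD = 3


def attribute_frequencies(training_texts: List[str], attributes: List[str]) -> Counter:
    """Count how often each attribute name occurs in the training text set."""
    counts: Counter = Counter()
    for text in training_texts:
        for attr in attributes:
            counts[attr] += text.count(attr)
    return counts


def rule_based_extract(candidates: List[str], attribute: str) -> Optional[str]:
    """Toy rule (assumption): '<attribute> is|was|: <value>' up to the next punctuation."""
    pattern = re.compile(rf"{re.escape(attribute)}\s*(?:is|was|:)\s*([^.;,]+)")
    for sentence in candidates:
        match = pattern.search(sentence)
        if match:
            return match.group(1).strip()
    return None


def statistical_extract(candidates: List[str], attribute: str) -> Optional[str]:
    """Stand-in for a trained model (assumption): vote over capitalized words
    and four-digit numbers in sentences that mention the attribute."""
    votes: Counter = Counter()
    for sentence in candidates:
        if attribute in sentence:
            for token in re.findall(r"\b(?:[A-Z][a-z]+|\d{4})\b", sentence):
                votes[token] += 1
    return votes.most_common(1)[0][0] if votes else None


def extract_attribute(candidates: List[str], attribute: str, frequencies: Counter) -> Optional[str]:
    """Pick an extractor by training-set frequency (direction is an assumption)."""
    if frequencies.get(attribute, 0) >= FREQ_THRESHOLD:
        return statistical_extract(candidates, attribute)
    return rule_based_extract(candidates, attribute)
```

A caller would first compute `attribute_frequencies` once over the training text set and then call `extract_attribute` for each target attribute against the candidate text set of the entity.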

[0073] Before describing the entity attribute extrac...


Abstract

The invention provides an entity attribute extraction method and system oriented to open web pages. The method includes the steps of: extracting the text of an open web page to obtain a candidate text set for a target entity; and extracting the value of a target entity attribute from the candidate text set in a rule-based or statistics-based manner, according to the frequency of the target entity attribute in a training text set. The method can improve the precision and recall of entity attribute extraction for open web pages, is independent of web page structure, and can adapt to changes in open web page types.
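As a rough illustration of the first step, obtaining a candidate text set without relying on page structure, the sketch below strips all markup and keeps the sentences that mention the target entity. The function names `page_text` and `candidate_text_set`, the naive sentence splitter, and the plain substring match are hypothetical choices made for this example; the source does not say how candidates are selected.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the tag layout and script/style content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html: str) -> str:
    """Return the page's visible text as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def candidate_text_set(html: str, entity: str) -> List[str]:
    """Keep sentences that mention the entity (substring match is an assumption)."""
    sentences = re.split(r"(?<=[.!?])\s+", page_text(html))
    return [s for s in sentences if entity in s]
```

Because only the text content is used, the same function applies unchanged to a blog post, a forum thread, or a news page, which is the sense in which such a step does not depend on the page layout.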

Description

Technical field

[0001] The invention relates to the technical field of data mining, in particular to an entity attribute extraction method and system for open web pages.

Background

[0002] Open web pages are unstructured Internet web pages whose data sources are not fixed and which contain a variety of network data, such as blogs, forums, news, chat records, and emails, all of which are unpredictable. With the development of network technology, especially the rapid development of Internet and intranet technology, open web pages are flexible in structure, and while their number is increasing rapidly, this also makes the text difficult to understand:

[0003] 1. The text structure is not fixed, and there is no specific context grammar;

[0004] 2. The scope of keywords is not fixed, and the subject areas involved are diverse;

[0005] 3. The length of the text is not fixed, and there is a large gap in the amount of contextual information...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/30, G06F16/95
Inventors: 程学旗, 贾岩涛, 赵泽亚, 王元卓, 熊锦华, 李曼玲, 林海伦, 许洪波
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI