Webpage information extraction method and system

A webpage information extraction technology, applied in the field of information extraction, that addresses the low versatility and low robustness of existing extraction methods and their lack of global awareness.

Active Publication Date: 2014-06-18
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

This method is sensitive to the structure of the web page and generalizes poorly. Ensuring recall requires a large number of rules and manual intervention, and a large number of rules makes conflicts between rules more likely: for example, a rule that matches data nodes in one web page may match noise nodes in another, slightly different web page.
Existing methods often trade off between accuracy, recall, and manual cost.
[0012] 2. Single feature rule
On some web pages, data and noise differ greatly in the features that an existing method relies on, and the method achieves good results. On other web pages, the difference between data and noise in those features is not obvious, and the method cannot extract well.
The generality of such methods is therefore low.
[0013] 3. No support for complex data schemas (semantic structures)
Existing methods often support only simple flat data schemas and cannot adequately express more complex ones.
[0014] 4. Extraction methods are not globally aware
After part of a web page has been matched successfully, existing methods usually do not consider whether the matched position is the optimal one, or how that match affects the subsequent matching of other rules. A partial error or failed match may cause a series of side effects in subsequent extraction, so the extraction method is not robust.
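
The rule-conflict problem described above can be illustrated with a minimal sketch. The positional rule `PRICE_RULE`, the two sample pages, and the helper `apply_rule` are hypothetical and are not taken from the patent; the sketch only shows how one structural rule can select a data node on one page and a noise node on a slightly different page.

```python
from typing import Optional
import xml.etree.ElementTree as ET

# Hypothetical positional rule: "the price is the second <div> under <body>".
PRICE_RULE = "./body/div[2]"

PAGE_A = "<html><body><div>Widget</div><div>9.99</div></body></html>"
# Same template, but a promotional block now sits before the price.
PAGE_B = ("<html><body><div>Widget</div><div>Buy now!</div>"
          "<div>9.99</div></body></html>")


def apply_rule(page: str, rule: str) -> Optional[str]:
    """Return the text of the first node selected by the rule, if any."""
    root = ET.fromstring(page)
    node = root.find(rule)
    return node.text if node is not None else None


print(apply_rule(PAGE_A, PRICE_RULE))  # selects the data node
print(apply_rule(PAGE_B, PRICE_RULE))  # selects the noise node instead
```

Fixing the second page by adding another rule is exactly the kind of manual intervention, and potential source of rule conflicts, that the text describes.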

Embodiment Construction

[0072] The technical solutions of the present invention will be described in detail below in conjunction with the embodiments and the accompanying drawings.

[0073] First, the application scenarios and concepts used in the present invention are described.

[0074] The content of a web page is composed of semantic units, and each semantic unit corresponds to a semantic attribute. Semantic attributes can be combined to form a new semantic attribute, which is called the parent semantic attribute. The semantic attributes directly contained in a parent semantic attribute are its sub-semantic attributes, and the sub-semantic attributes under the same parent semantic attribute are sibling semantic attributes. Each specific value of a semantic attribute is a subtree forest in the DOM tree of the web page, and the subtrees in the subtree forest are continuous and non-overlapping, that is, there are no adjacent subtrees in the subtree forest. If there are other...
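
The concepts in this paragraph can be sketched as a small data structure. This is an illustration under stated assumptions, not the patent's implementation: the class `SemanticAttribute`, the function `is_contiguous_forest`, and the sample attribute names are hypothetical, and the sketch assumes that "continuous and non-overlapping" means the subtree roots are consecutive children of one DOM parent.

```python
from __future__ import annotations
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass(eq=False)
class SemanticAttribute:
    """A node in the semantic attribute hierarchy (hypothetical model)."""
    name: str
    parent: SemanticAttribute | None = field(default=None, repr=False)
    children: list[SemanticAttribute] = field(default_factory=list)

    def add_child(self, name: str) -> SemanticAttribute:
        """Create a sub-semantic attribute directly under this one."""
        child = SemanticAttribute(name, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> list[SemanticAttribute]:
        """Other sub-semantic attributes under the same parent."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]


def is_contiguous_forest(parent: ET.Element, subtrees: list[ET.Element]) -> bool:
    """Check that the subtree roots are distinct, consecutive children of
    `parent`, given in document order (an assumed reading of the text)."""
    children = list(parent)
    positions = [i for s in subtrees for i, c in enumerate(children) if c is s]
    if not positions or len(positions) != len(subtrees):
        return False
    return positions == list(range(positions[0], positions[0] + len(positions)))


product = SemanticAttribute("product")   # parent semantic attribute
title = product.add_child("title")       # sub-semantic attributes
price = product.add_child("price")

dom = ET.fromstring("<ul><li>a</li><li>b</li><li>c</li></ul>")
items = list(dom)
print([s.name for s in title.siblings()])
print(is_contiguous_forest(dom, items[0:2]))
print(is_contiguous_forest(dom, [items[0], items[2]]))
```

Here `title` and `price` are sibling semantic attributes under `product`, and a value made of the first two `<li>` subtrees is accepted while one that skips the middle subtree is rejected.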

Abstract

The invention discloses a webpage information extraction method and system. The method includes the following steps: acquiring a marked webpage, generating a semantic structure tree, building an information mode pattern, generating the semantic attribute node information of each semantic attribute node in the information mode pattern, generating a wrapper, and exporting the wrapper to a wrapper document; building an extractor for extracting webpages similar to the marked webpage; acquiring the webpages to be extracted and, in the DOM (Document Object Model) tree of each webpage to be extracted, having the extractor recursively extract, layer by layer and starting from the root semantic attribute node of the information mode pattern, the data extraction area or iterative data extraction area corresponding to each semantic attribute node; and exporting the data in the data extraction area or iterative data extraction area corresponding to each semantic attribute node as the extraction results. The method offers high universality, generalization capability, fault tolerance, and expandability with a low degree of manual involvement, and it preserves online extraction efficiency, so it is highly practical.
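
The layer-by-layer recursion from the root semantic attribute node can be sketched as follows. This is a simplified illustration, not the patented wrapper: the class `SchemaNode`, its `path` locator, and the function `extract` are hypothetical, and a fixed ElementPath expression stands in for the semantic attribute node information that the wrapper would actually hold.

```python
from __future__ import annotations
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class SchemaNode:
    """One semantic attribute node of the pattern (hypothetical model)."""
    name: str
    path: str                 # assumed locator, relative to the parent's area
    iterative: bool = False   # True for an iterative data extraction area
    children: list[SchemaNode] = field(default_factory=list)


def extract(node: SchemaNode, area: ET.Element):
    """Locate this node's area inside the parent's area, then recurse."""
    matches = area.findall(node.path)
    if not matches:
        return [] if node.iterative else None

    def extract_region(region: ET.Element):
        if not node.children:                      # leaf: export the data
            return (region.text or "").strip()
        return {c.name: extract(c, region) for c in node.children}

    if node.iterative:
        return [extract_region(m) for m in matches]
    return extract_region(matches[0])


PAGE = """<html><body>
<h1>Catalog</h1>
<ul><li><b>Widget</b><i>9.99</i></li><li><b>Gadget</b><i>19.50</i></li></ul>
</body></html>"""

schema = SchemaNode("page", ".", children=[
    SchemaNode("heading", "./body/h1"),
    SchemaNode("items", "./body/ul/li", iterative=True, children=[
        SchemaNode("name", "./b"),
        SchemaNode("price", "./i"),
    ]),
])

result = extract(schema, ET.fromstring(PAGE))
print(result)
```

Because each child is searched only inside the area already located for its parent, a nested pattern such as a repeated item with its own fields is expressed directly, which a flat data schema cannot do.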

Description

Technical field

[0001] The invention belongs to the field of information extraction, and in particular relates to wrapper generation based on the webpage DOM tree and to webpage information extraction technology.

Background technology

[0002] Since the 1990s, the World Wide Web (WWW) has developed rapidly, and the amount of information it contains has exploded. As the Internet has become a widely used tool, it has also become a huge treasure house of knowledge that contains a large amount of valuable information. How to make full use of the massive information on the Internet to provide better services has long been a focus of attention. Web pages are an important information carrier on the Internet and the main way to obtain information from it. How to extract the needed information from web pages, that is, web page information extraction, has become an important research topic. Webpage information extrac...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/835, G06F16/951
Inventor: 程学旗, 万圣贤, 余钧, 郭岩, 刘悦, 张瑾, 余智华
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI