
A Web Data Record Extraction Method Based on Incomplete Subtree Matching

A web data record extraction technology based on incomplete subtree matching, applied in electronic digital data processing, special data processing applications, instruments, etc. It addresses the problems of existing approaches, which are impractical across multiple websites, labor-intensive, and dependent on large training data sets. Its effects are to eliminate structural differences between records, improve extraction accuracy, and achieve high generality.

Active Publication Date: 2016-03-16
XIAMEN MEIYA PICO INFORMATION

AI Technical Summary

Problems solved by technology

[0013] Statistics-based methods are no longer applicable to this type of page, because they generally rely on the statistical properties of long text, which data-record web pages do not have.
Rule-based methods require a large training data set. Manually labeling web pages is labor-intensive, and the resulting rules generally apply to a single website; it is unrealistic to obtain a general, highly accurate rule for extracting data from multiple websites.
Currently, the most widely used approach is manual programming. It is highly accurate, but its main disadvantages are that it consumes a lot of manpower and is difficult to maintain.
Extraction code must be written for each website. When a target website is redesigned, the resulting program failure is not easy to detect, and the code must still be changed once the failure is detected.



Embodiment Construction

[0032] As shown in Figure 1, the web data record extraction method based on incomplete subtree matching of the present invention comprises the following steps:

[0033] a. Download the HTML source code of the web page over HTTP, and convert the downloaded characters to a unified Unicode encoding;

[0034] b. Filter out noise tag information;

[0035] c. Use components such as NEKO or HTMLParser to parse the HTML source code and construct the Document tree of the web page;

[0036] d. Candidate subtree set extraction;

[0037] e. Incomplete subtree matching;

[0038] f. Data record set determination;
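
As a rough illustration of steps a and c above, the following Python sketch decodes downloaded bytes to Unicode and builds a simple document tree. It is only a sketch under stated assumptions: the download itself is omitted (the raw bytes and their declared charset are assumed to be available already), Python's standard `html.parser` stands in for NEKO or HTMLParser, and the names `Node`, `TreeBuilder`, `decode_to_unicode`, and `build_tree` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the document tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; a stand-in for NEKO/HTMLParser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag;
        # stray end tags that match nothing are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def decode_to_unicode(raw, declared_charset="utf-8"):
    """Step a (second half): convert downloaded bytes to a Unicode string."""
    try:
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Assumption: fall back to UTF-8 with replacement characters.
        return raw.decode("utf-8", errors="replace")


def build_tree(html):
    """Step c: parse the HTML source and return the root of the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Usage: a GBK-encoded page fragment with two list items.
raw_page = "<ul><li>甲</li><li>乙</li></ul>".encode("gbk")
root = build_tree(decode_to_unicode(raw_page, "gbk"))
ul = root.children[0]
print(ul.tag, [child.tag for child in ul.children])
```

A hand-written tree keeps the sketch dependency-free; a production parser would also handle implied end tags and other malformed markup more carefully.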

[0039] Noise tag information includes JavaScript scripts, CSS style sheets, comments, some useless tags, and tags with empty content. Filtering out this noise prevents noise tags from affecting the analysis and speeds up processing.
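
The noise filtering described in [0039] might look roughly like the Python sketch below. It is a minimal illustration, not the patent's implementation: it works on the HTML string with regular expressions (which is fragile on malformed markup), the function name `filter_noise` is hypothetical, and the list of "useless" tags is an assumption, since the source does not say which tags it treats as useless.

```python
import re

# <script>...</script> and <style>...</style> blocks, including their content.
NOISE_BLOCKS = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                          re.IGNORECASE | re.DOTALL)
# HTML comments.
COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumed list of purely presentational tags; only the tags are dropped,
# the text they wrap is kept.
USELESS_TAGS = re.compile(r"</?(?:font|b|i|u|strong|em)\b[^>]*>",
                          re.IGNORECASE)
# Elements whose content is empty or whitespace only.
EMPTY_ELEMENTS = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>\s*</\1\s*>",
                            re.IGNORECASE)


def filter_noise(html):
    """Remove scripts, style sheets, comments, useless tags, and empty tags."""
    html = NOISE_BLOCKS.sub("", html)
    html = COMMENTS.sub("", html)
    html = USELESS_TAGS.sub("", html)
    # Repeat until stable: removing an empty element can empty its parent.
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_ELEMENTS.sub("", html)
    return html


page = ("<div><script>var x = 1;</script><!-- ad -->"
        "<p><b>Title</b></p><span> </span><style>p{}</style></div>")
print(filter_noise(page))  # prints <div><p>Title</p></div>
```

Filtering before parsing keeps the resulting tree smaller, which matches the stated goal of speeding up later steps.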

[0040] The subtrees of the candidate subtree set have a common parent node, but not ne...


Abstract

The invention discloses a Web data record extraction method based on incomplete subtree matching. The method comprises the following steps: downloading the Hypertext Markup Language (HTML) source code over the Hypertext Transfer Protocol (HTTP) and converting the downloaded characters to Unicode; filtering out noise tag information; parsing the HTML source code with components such as NEKO or HTMLParser and constructing the Document tree of the web page; extracting the candidate subtree set; matching incomplete subtrees; and determining the data record set. Because the method is based on subtree matching and does not depend on the template structure of the web page, it is highly general. Tag filtering and the determination of candidate subtrees effectively improve the performance of the data extraction process. By matching truncated (incomplete) subtrees, the method judges the similarity between subtree structures, effectively eliminates the structural differences caused by a template being filled with data, and improves the accuracy of data record extraction.
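
To make the idea of comparing truncated subtrees concrete, here is a small Python sketch. The source text available here does not give the similarity measure, so the sketch substitutes a different, simple technique: each subtree is cut off at an assumed depth, flattened into root-to-node tag paths, and two subtrees are compared by the multiset Jaccard overlap of those paths. The tuple-based tree format, the names `tag_paths` and `structural_similarity`, and the `max_depth` parameter are all hypothetical.

```python
from collections import Counter


def tag_paths(tree, max_depth, prefix=""):
    """Yield root-to-node tag paths, ignoring everything below max_depth.

    A tree is a (tag, children) tuple, where children is a list of trees.
    """
    tag, children = tree
    path = f"{prefix}/{tag}"
    yield path
    if max_depth > 1:
        for child in children:
            yield from tag_paths(child, max_depth - 1, path)


def structural_similarity(a, b, max_depth=3):
    """Multiset Jaccard overlap of the truncated subtrees' tag paths (0..1)."""
    paths_a = Counter(tag_paths(a, max_depth))
    paths_b = Counter(tag_paths(b, max_depth))
    shared = sum((paths_a & paths_b).values())
    total = sum((paths_a | paths_b).values())
    return shared / total if total else 1.0


# Two records produced from the same template; the second lacks an
# optional <em> that only appears when the data provides it.
record_1 = ("li", [("a", [("img", [])]), ("span", []), ("p", [("em", [])])])
record_2 = ("li", [("a", [("img", [])]), ("span", []), ("p", [])])
# A sibling with a different structure, such as a nested menu.
nav_item = ("li", [("ul", [("li", [])])])

print(structural_similarity(record_1, record_2, max_depth=2))  # 1.0
print(structural_similarity(record_1, nav_item, max_depth=2))  # 0.2
```

In this sketch, truncating at depth 2 hides the optional `<em>` that differs between the two records, so they compare as identical, while the structurally different sibling still scores low. Comparing at depth 3 instead would lower the score for the two records, which illustrates why cutting subtrees off can absorb data-dependent differences.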

Description

Technical field

[0001] The invention relates to a method for extracting Web data records based on incomplete subtree matching.

Background

[0002] With the rapid development of the Internet and the continuous improvement of Web technology, more and more organizations and individuals publish information on the Internet. Every day, thousands of web pages are generated, and the Internet has become a huge "library" for information sharing. How to find and extract useful data from massive amounts of Web information has become an important topic.

[0003] The HTML web page is one of the most important data formats on the Internet. HTML is a markup language that the browser displays after combining it with scripts and styles. In essence, HTML is a semi-structured language: once rendered it is well suited to human browsing, but it does not lend itself to data identification and extraction by computer programs. In the definition of HTML ta...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30, G06F9/44
Inventor: 胡海斌, 王慧昌
Owner: XIAMEN MEIYA PICO INFORMATION