A Web Data Record Extraction Method Based on Incomplete Subtree Matching

A web data record extraction technique based on incomplete subtree matching, applied in electronic digital data processing and special data processing applications. It addresses the problems that existing approaches are labor-intensive, need large training data sets, or are unrealistic to generalize. It eliminates structural differences between records, improves extraction accuracy, and is highly versatile.

Active Publication Date: 2016-03-16

XIAMEN MEIYA PICO INFORMATION

Problems solved by technology

[0013] Statistics-based methods are not applicable to this type of page, because they generally rely on statistical information from long text, and data-record pages do not have this characteristic.

Rule-based methods require a large training data set. Manually labeling web pages is labor-intensive, and the resulting rules generally apply to only one website. It is unrealistic to obtain a general, highly accurate rule for extracting data from multiple websites.

Currently, the most widely used approach is manual programming. It is highly accurate, but its main disadvantages are that it consumes a lot of manpower and is difficult to maintain.

For each website, dedicated extraction code must be written. When the target website is redesigned, the resulting program failure is not easy to detect, and the code must still be changed once the failure is detected.

Embodiment Construction

[0032] As shown in Figure 1, the Web data record extraction method based on incomplete subtree matching of the present invention comprises the following steps:

[0033] a. Download the HTML source code of the web page over HTTP, and convert the downloaded characters to a unified Unicode encoding;

[0034] b. Filter out noise tag information;

[0035] c. Use a component such as NEKO or HTMLParser to parse the HTML source code and construct the Document tree of the web page;

[0036] d. Extract the candidate subtree set;

[0037] e. Perform incomplete subtree matching;

[0038] f. Determine the data record set.
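Step a can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the helper names `fetch` and `decode_html` and the fallback order (HTTP header charset, then a `<meta>` charset, then UTF-8) are assumptions, since the text only states that the downloaded characters are converted to Unicode.

```python
import re
import urllib.request
from typing import Optional

# Looks for a charset declaration such as <meta charset="gbk"> near the top of the page.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode downloaded HTML bytes into a Unicode string.

    The candidate order is an assumption made for this sketch.
    """
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    found = _META_CHARSET.search(raw[:2048])
    if found:
        candidates.append(found.group(1).decode("ascii"))
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec or wrong guess: try the next candidate
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode("utf-8", errors="replace")

def fetch(url: str) -> str:
    """Download a page over HTTP and return its source as Unicode text."""
    with urllib.request.urlopen(url) as resp:
        return decode_html(resp.read(), resp.headers.get_content_charset())
```

Normalizing every page to one in-memory encoding first means the later filtering and parsing steps never have to deal with site-specific encodings.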

[0039] Noise tag information includes JavaScript scripts, CSS style sheets, comments, some useless tags, and tags with empty content. Filtering out this noise prevents the noise tags from affecting the analysis and speeds up the method.
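The filtering and tree-building steps (b and c) might look like the sketch below. It uses Python's standard `html.parser` rather than NEKO or the Java HTMLParser named in the text, and the `Node` class, the `NOISE_TAGS` set, and the pruning rule for empty tags are assumptions chosen for illustration. The patent does not list which tags count as "useless".

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style"}  # assumed noise set; extend as needed
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class Node:
    """A minimal element node of the document tree (hypothetical structure)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root
        self.skip = False  # True while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if self.skip:
            return
        if tag in NOISE_TAGS:
            self.skip = True
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip = False
            return
        if self.skip:
            return
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip:
            self.current.text += data.strip()
    # Comments are dropped because handle_comment is not overridden.

def prune_empty(node):
    """Drop elements that have neither text nor surviving children."""
    node.children = [c for c in node.children if prune_empty(c)]
    return bool(node.text or node.children)

def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune_empty(builder.root)
    return builder.root
```

Note that this pruning rule also removes childless elements such as `img`. A real extractor would probably keep tags whose attributes carry data.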

[0040] The subtrees of the candidate subtree set have a common parent node, but not ne...

Abstract

The invention discloses a Web data record extraction method based on incomplete subtree matching. The method comprises the following steps: downloading the Hypertext Markup Language (HTML) source code according to the Hypertext Transfer Protocol (HTTP) and converting the downloaded characters to a unified Unicode encoding; filtering out noise tag information; parsing the HTML source code with a component such as NEKO or HTMLParser and constructing the Document tree of the web page; extracting the candidate subtree set; matching incomplete subtrees; and determining the data record set. Because the method is based on subtree matching and does not depend on the template structure of the web page, it is highly general. Tag filtering and the determination of candidate subtrees effectively improve the performance of the data extraction process. By matching truncated, incomplete subtrees, the method judges the similarity between subtree structures, effectively eliminates the structural differences introduced when a template is filled with data, and improves the accuracy of data record extraction.
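To make the idea of comparing truncated subtrees concrete, here is a hedged sketch. The excerpt does not give the patent's matching rule, so this example substitutes the well-known Simple Tree Matching algorithm applied to depth-limited copies of two subtrees. The `(tag, children)` tuple representation, the `depth` parameter, and the normalized score are all assumptions.

```python
def truncate(tree, depth):
    """Keep only the top `depth` levels of a (tag, children) tree."""
    tag, children = tree
    if depth <= 1:
        return (tag, [])
    return (tag, [truncate(c, depth - 1) for c in children])

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in tree[1])

def match(a, b):
    """Simple Tree Matching: node count of the largest top-down ordered mapping."""
    if a[0] != b[0]:
        return 0
    ca, cb = a[1], b[1]
    # Dynamic programming over the two ordered child sequences.
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(m[i - 1][j],
                          m[i][j - 1],
                          m[i - 1][j - 1] + match(ca[i - 1], cb[j - 1]))
    return 1 + m[len(ca)][len(cb)]

def similarity(a, b, depth=3):
    """Structural similarity in [0, 1] of two subtrees truncated to `depth` levels."""
    ta, tb = truncate(a, depth), truncate(b, depth)
    return 2 * match(ta, tb) / (size(ta) + size(tb))

# Two records from the same template; the first has extra nested markup
# deep inside its <span> because of the data that filled it.
rec1 = ("li", [("a", [("img", [])]), ("span", [("b", [("i", [])])])])
rec2 = ("li", [("a", [("img", [])]), ("span", [])])
print(similarity(rec1, rec2, depth=2))  # differences below level 2 are ignored
print(similarity(rec1, rec2, depth=3))
```

In this toy case the two records are identical at depth 2 and only diverge lower down, which is the kind of data-induced difference that comparing incomplete subtrees is meant to tolerate.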

Description

Technical field

[0001] The invention relates to a method for extracting Web data records based on incomplete subtree matching.

Background

[0002] With the rapid development of the Internet and the continuous improvement of Web technology, more and more organizations and individuals publish information on the Internet. Every day, thousands of web pages are generated, and the Internet has become a huge "library" for information sharing. How to find and extract useful data from massive amounts of Web information has become an important topic.

[0003] The HTML web page is one of the most important data formats on the Internet. HTML is a markup language that the browser renders after combining it with scripts and styles. In essence, HTML is a semi-structured language: once rendered, it is suitable for human browsing, but it is not conducive to the identification and extraction of data by computer programs. In the definition of HTML ta...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F17/30, G06F9/44

Inventor: 胡海斌, 王慧昌

Owner: XIAMEN MEIYA PICO INFORMATION
