A Web Data Record Extraction Method Based on Incomplete Subtree Matching
A technique for web data record extraction based on incomplete subtree matching, applied in electronic digital data processing, special-purpose data processing applications, instruments, and related fields. It addresses the problem that existing approaches are impractical and labor-intensive on large data sets, and it has the effects of eliminating structural differences, improving extraction precision, and being highly versatile.
Detailed Description of the Embodiments
[0032] As shown in Figure 1, the web data record extraction method based on incomplete subtree matching of the present invention comprises the following steps:
[0033] a. Download the HTML source code of the web page over the HTTP protocol, and convert the downloaded characters to a uniform Unicode encoding;
[0034] b. Filter out noise tag information;
[0035] c. Use a component such as NEKO or HTMLParser to parse the HTML source code and construct the document tree of the web page;
[0036] d. Extract the candidate subtree set;
[0037] e. Perform incomplete subtree matching;
[0038] f. Determine the data record set.
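Steps a and c can be illustrated with a minimal sketch. It is not the implementation of the invention: it assumes the page bytes have already been downloaded, it uses the Python standard-library `html.parser` module in place of NEKO or HTMLParser, and the names `Node`, `TreeBuilder`, `decode_to_unicode`, and `build_tree` are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the document tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def decode_to_unicode(raw, declared_charset=None):
    """Step a (after download): turn raw bytes into a Unicode string."""
    for charset in (declared_charset, "utf-8"):
        if charset:
            try:
                return raw.decode(charset)
            except (LookupError, UnicodeDecodeError):
                continue
    # Assumption: latin-1 as a last resort, because it never fails to decode.
    return raw.decode("latin-1")


def build_tree(html_text):
    """Step c: parse HTML text into a document tree."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root
```

In this sketch the declared charset would typically come from the HTTP response header or a meta tag; how the invention determines it is not stated in the text above.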
[0039] Noise tag information includes JavaScript scripts, CSS style sheets, comments, certain useless tags, and tags with empty content. Filtering out this noise prevents noise tags from interfering with the analysis and speeds up processing.
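The noise filtering of step b can be sketched as follows. This is an illustration under stated assumptions, not the filter used by the invention: it works on the HTML text with regular expressions, which is fragile on malformed markup, and it covers only scripts, style sheets, comments, and empty-content tags, because the text does not list which other tags count as useless. The function name `filter_noise` is hypothetical.

```python
import re

# <script>...</script> and <style>...</style> blocks, including their content.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# HTML comments.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# An element whose content is empty or whitespace only, e.g. <b></b>.
EMPTY_ELEMENT = re.compile(
    r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>\s*</\1\s*>", re.IGNORECASE
)


def filter_noise(html_text):
    """Remove scripts, style sheets, comments, and empty-content tags."""
    text = SCRIPT_OR_STYLE.sub("", html_text)
    text = COMMENT.sub("", text)
    # Removing an empty element can leave its parent empty, so repeat
    # until a pass changes nothing.
    while True:
        stripped = EMPTY_ELEMENT.sub("", text)
        if stripped == text:
            return text
        text = stripped
```

Filtering before parsing keeps the resulting tree smaller, which is consistent with the stated goal of speeding up processing; a tree-based filter applied after parsing would be an equally plausible design.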
[0040] The subtrees of the candidate subtree set have a common parent node, but not ne...