A Method of Automatically Extracting List Pages
A technology for automatically extracting list pages, applied in the network field. It addresses problems such as high maintenance costs, low efficiency, and extraction rules that no longer apply, and thereby saves time and labor costs.
Embodiment Construction
[0035] As shown in Figure 1, a method for automatically extracting list pages includes the following steps:
[0036] (1) Generation of the DOM tree:
[0037] (1.1) Obtain the web page source code of the website to be collected;
[0038] (1.2) Parse the source code of the web page into a DOM tree;
[0039] (1.3) Perform a preorder traversal of the DOM tree, and record the node path of each leaf element in the DOM tree;
(1.4) Extract and save the node paths of the elements that contain text.
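Steps (1.2) to (1.4) can be illustrated with a minimal sketch, given here only as an assumption-laden example and not as the implementation claimed by the method. It uses Python's standard html.parser module to build a simplified tree, then walks it in preorder and keeps the paths of leaf elements that carry text. The names Node, TreeBuilder, leaf_text_paths, and extract_text_leaf_paths are hypothetical, as are the XPath-like path format and the choice to skip script and style elements. Fetching the page source in step (1.1) is omitted, and the input is assumed to be reasonably well-formed HTML.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have a closing tag (assumption: common HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible content (illustrative choice).
SKIP_TAGS = {"script", "style"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML source (step 1.2)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def leaf_text_paths(node, path=""):
    """Preorder walk returning (path, text) for leaf elements with text (steps 1.3, 1.4)."""
    results = []
    counts = {}  # per-tag sibling counter used to index path segments
    for child in node.children:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
        if child.children:
            results.extend(leaf_text_paths(child, child_path))
        elif child.text and child.tag not in SKIP_TAGS:
            results.append((child_path, child.text))
    return results


def extract_text_leaf_paths(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return leaf_text_paths(builder.root)


if __name__ == "__main__":
    sample = ("<html><body><ul>"
              "<li><a href='/a'>First</a></li>"
              "<li><a href='/b'>Second</a></li>"
              "</ul><p></p></body></html>")
    for path, text in extract_text_leaf_paths(sample):
        print(path, text)
```

In this sketch, the two link elements share the same path shape and differ only in the index of their li ancestor, which is the kind of repetition a list page tends to show. The empty p element is a leaf without text, so it is dropped in step (1.4).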
[0040] (2) Obtain the position information of the text-bearing element nodes extracted in step (1), score each node according to its position information, and filter out the element nodes that are visually unlikely to belong to a list page. This specifically includes:
[0041] (2.1) Collect the CSS and JS files of the HTML web page to obtain the position information of the nodes;
[0042] (2.2) Calculate the pixel position of each element node of the DOM tree after each web page is parsed;
[0...