
A Method of Automatically Extracting List Pages

An automatic list page extraction technology, applied in the network field. It addresses problems of rule-based extraction such as high maintenance costs, low efficiency, and rules that stop applying after a website changes, and it saves time and labor costs.

Active Publication Date: 2022-02-11
上海嘉道信息技术有限公司

AI Technical Summary

Problems solved by technology

[0003] For a single web page, methods such as regular expressions can accurately collect the desired information. In essence, regular expressions and CSS selectors rely on a person summarizing the patterns in the page's source code and then extracting with those rules. The same set of rules is difficult to apply to pages with different structures, because different pages need different rules. When users need to collect a large number of web pages, the rules must be written by hand. This is inefficient, and across thousands or tens of thousands of websites, relying solely on manual labor becomes completely impractical.
Moreover, rule-based extraction is tied to the web page itself. When a website is redesigned, the original rules no longer apply and must be rewritten by hand, which makes maintenance costs very high for projects that rely on open-source information collection.
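
The brittleness described above can be illustrated with a small sketch. The rule `RULE` and the two page snippets are hypothetical and were made up for this illustration; they are not taken from the patent. The same hand-written regular expression extracts the list from one markup and returns nothing once the markup changes.

```python
import re

# Hypothetical hand-written rule, tuned to one site's markup.
RULE = re.compile(r'<li class="item"><a href="([^"]+)">([^<]+)</a></li>')

# Hypothetical page before a redesign.
page_v1 = (
    '<ul><li class="item"><a href="/a/1">First</a></li>'
    '<li class="item"><a href="/a/2">Second</a></li></ul>'
)

# Same content after a redesign: different tag and class names.
page_v2 = (
    '<div><p class="entry"><a href="/a/1">First</a></p>'
    '<p class="entry"><a href="/a/2">Second</a></p></div>'
)


def extract(html):
    """Apply the fixed rule and return (link, title) pairs."""
    return RULE.findall(html)


print(extract(page_v1))  # the rule matches both entries
print(extract(page_v2))  # the rule matches nothing after the redesign
```

Every structural change of this kind requires someone to notice the failure and rewrite the rule, which is the maintenance cost the text refers to.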


Examples

Embodiment Construction

[0035] As shown in Figure 1, a method for automatically extracting list pages includes the following steps:

[0036] (1) Generate the DOM tree:

[0037] (1.1) Obtain the web page source code of the website to be collected;

[0038] (1.2) Parse the source code of the web page into a DOM tree;

[0039] (1.3) Traverse the DOM tree in preorder and record the node path of each leaf element;

(1.4) Extract and save the node paths of the elements that contain text.

[0040] (2) Obtain the position information of the text-bearing element nodes extracted in step (1), score each node according to its position information, and filter out the nodes that are visually unlikely to belong to a list page. Specifically:

[0041] (2.1) Collect the CSS and JS files of the HTML web page to obtain the position information of the nodes;

[0042] (2.2) Calculate the pixel position of each element node of the DOM tree for every parsed web page;

[0...
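
Step (1) above (parse the source into a DOM tree, traverse it in preorder, and keep the paths of leaf elements that contain text) can be sketched roughly as follows. This is a minimal illustration using Python's standard `html.parser`, not the patent's implementation. The names `Node`, `DomBuilder`, `parse_dom`, and `leaf_paths_with_text`, and the path format with per-tag sibling indices, are assumptions introduced here.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: tag, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simple element tree from HTML source (step 1.2)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaf_paths_with_text(root):
    """Preorder traversal (1.3) keeping only leaf paths with text (1.4)."""
    results = []

    def visit(node, path):
        if not node.children:
            if node.text:
                results.append((path, node.text))
            return
        seen = {}  # per-tag sibling counter, used to index the path
        for child in node.children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    visit(root, "")
    return results


html = "<ul><li><a href='/a/1'>First</a></li><li><a href='/a/2'>Second</a></li><li></li></ul>"
for path, text in leaf_paths_with_text(parse_dom(html)):
    print(path, text)
```

The sketch does not cover step (2): pixel positions depend on CSS and JS, so they would have to come from a rendering engine, which is outside what a source-only parser can provide.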


Abstract

The invention relates to a method for automatically extracting list pages, comprising the following steps: parsing the source code of a web page into a DOM tree; extracting the paths of element nodes with text in the DOM tree; scoring and filtering element nodes by their position information; extracting the similarity fingerprint of each node; extracting the deep fingerprint of each node block; extracting the similarity fingerprints of titles and address links; and extracting the list page and returning the encapsulated result. The method is applicable to extracting list pages from a large number of Internet websites and generalizes across them. Because extraction is based on web page structure, it remains effective even after a website is redesigned, saving the time and labor costs of rewriting and maintaining extraction rules. The structure-based extraction algorithm also uses the pixel position of elements on the page as a feature, which better matches how people visually recognize list pages and brings the extraction results closer to the target.
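
The excerpt does not define how the similarity fingerprint is computed, so the following is only one plausible reading, offered as an assumption: treat a node path with its sibling indices removed as the node's structural fingerprint, and look for fingerprints shared by several nodes. The functions `structural_fingerprint` and `repeated_groups`, the path format, and the `min_repeat` threshold are hypothetical.

```python
import re
from collections import defaultdict


def structural_fingerprint(path):
    """Drop sibling indices so that repeated items share one fingerprint.

    Assumed path format: "/html[1]/body[1]/ul[1]/li[2]/a[1]".
    """
    return re.sub(r"\[\d+\]", "", path)


def repeated_groups(paths, min_repeat=3):
    """Group paths by fingerprint and keep groups that repeat often enough."""
    groups = defaultdict(list)
    for path in paths:
        groups[structural_fingerprint(path)].append(path)
    return {fp: members for fp, members in groups.items()
            if len(members) >= min_repeat}


paths = [
    "/html[1]/body[1]/ul[1]/li[1]/a[1]",
    "/html[1]/body[1]/ul[1]/li[2]/a[1]",
    "/html[1]/body[1]/ul[1]/li[3]/a[1]",
    "/html[1]/body[1]/p[1]",
]
print(repeated_groups(paths))
```

In this reading, the three link paths collapse to the same fingerprint and form a candidate list block, while the single paragraph path is discarded. A rule of this kind depends on repetition rather than on specific tag or class names, which is why a structure-based approach can survive a redesign.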

Description

Technical field

[0001] The invention relates to the field of network technology, in particular to a method for automatically extracting list pages.

Background

[0002] Traditional list page extraction mainly relies on rules, such as regular expressions, XPath, and CSS selectors, or even on collecting the information on the page manually.

[0003] For a single web page, methods such as regular expressions can accurately collect the desired information. In essence, regular expressions and CSS selectors rely on a person summarizing the patterns in the page's source code and then extracting with those rules. The same set of rules is difficult to apply to pages with different structures, because different pages need different rules. When users need to collect a large number of web pages, the rules must be written by hand. This kind of efficiency is not only ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F8/40
CPC: G06F8/40
Inventor: 庞一文
Owner: 上海嘉道信息技术有限公司