Method and system for accurately extracting webpage content

A technology for precisely extracting web page content, applied in the Internet field. It addresses the time-consuming adjustment and difficult system maintenance of existing approaches, reducing maintenance costs and improving development efficiency.

Active Publication Date: 2013-07-31

翁杰

5 Cites, 13 Cited by


AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Problems solved by technology

Regular expressions perform relatively well for locating and extracting data, but they make the entire system very difficult to maintain.

To locate the data of a page accurately, the page content usually has to be extracted in segments, so the regular expression code becomes very large and hard to maintain.

If the tags of the page being collected change even slightly, the corresponding regular expressions must be readjusted, and the whole adjustment is very time-consuming.

[0004] In a web page's JavaScript, the tag selectors provided by the jQuery framework can filter the tags in the page very conveniently, but they can only be used in the browser's client-side script.


Examples


Embodiment 1

[0068] When the user inputs the command "INPUT, SELECT, TEXTAREA, IMG", it means: find all INPUT, SELECT, TEXTAREA, and IMG tags in the page.

[0069] Through lexical analysis, the system forms the complete, unoptimized expression tree shown in Figure 3.

[0070] In the grammar table, the block is defined as follows:

[0071] ::= ','

[0072] |

[0073] That is, the block is composed of one or more items separated by commas ",".

[0074] Before optimization, the nodes presented in the syntax tree are nested, which can be expressed as follows:

[0075] INPUT,SELECT,TEXTAREA,IMG=(INPUT,SELECT,TEXTAREA)+(IMG)

[0076] INPUT,SELECT,TEXTAREA=(INPUT,SELECT)+(TEXTAREA)

[0077] INPUT,SELECT=(INPUT)+(SELECT)

[0078] SELECT=(SELECT)

[0079] The nested format above is not conducive to grammatical analysis, so the nodes need to be optimized into the following expression tree:

[0080] INPUT,SE...
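The nesting in [0075] to [0078] and the idea of optimizing it can be sketched roughly as follows. This is only an illustration: the optimized tree in [0080] is truncated in the source, so the flat list produced by `flatten` is an assumption about its shape, and the names `Group`, `parse_nested`, and `flatten` are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Group:
    """Binary node: a comma-separated group written as (left) + (right)."""
    left: "Node"
    right: "Node"


Node = Union[Group, str]  # a leaf is simply a tag name


def parse_nested(expr: str) -> Node:
    """Build the left-nested tree, e.g. a,b,c -> ((a)+(b))+(c)."""
    names = [part.strip() for part in expr.split(",") if part.strip()]
    if not names:
        raise ValueError("empty expression")
    tree: Node = names[0]
    for name in names[1:]:
        tree = Group(tree, name)
    return tree


def flatten(node: Node) -> List[str]:
    """Collapse the nested groups into one flat list (assumed optimized form)."""
    if isinstance(node, str):
        return [node]
    return flatten(node.left) + flatten(node.right)


tree = parse_nested("INPUT, SELECT, TEXTAREA, IMG")
flat = flatten(tree)
print(flat)
```

A flat list lets a later stage handle each tag name in a single loop rather than recursing through nested pairs, which is one plausible reason such an optimization step would help.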

Embodiment 2

[0097] When the user enters the expression "DIV.a>IMG.b[alt]", it means: first find the DIV tags with style name a in the page, then select from their child elements the IMG tags that have style name b and contain the alt attribute.
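To make the meaning of this expression concrete, here is a minimal hand-written sketch of the same filter using Python's standard `xml.etree.ElementTree`. It is not the patent's parser: the sample page, the lowercase tag names, and the helper names `has_class` and `select_div_a_img_b_alt` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """
<html><body>
  <div class="a">
    <img class="b" alt="logo" src="1.png"/>
    <img class="b" src="2.png"/>
    <img class="c" alt="x" src="3.png"/>
    <p><img class="b" alt="nested" src="4.png"/></p>
  </div>
  <div class="z"><img class="b" alt="other" src="5.png"/></div>
</body></html>
"""


def has_class(elem, name):
    """True if the element's class attribute lists the given style name."""
    return name in elem.get("class", "").split()


def select_div_a_img_b_alt(root):
    """Hand-coded equivalent of DIV.a>IMG.b[alt]."""
    results = []
    for div in root.iter("div"):
        if not has_class(div, "a"):
            continue
        for child in div:  # ">" means direct children only
            if child.tag == "img" and has_class(child, "b") and "alt" in child.attrib:
                results.append(child)
    return results


root = ET.fromstring(PAGE.strip())
matches = select_div_a_img_b_alt(root)
print([img.get("src") for img in matches])  # ['1.png']
```

Only the first image qualifies: the second lacks `alt`, the third has a different style name, the fourth is not a direct child of the DIV, and the fifth sits under a DIV without style name a.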

[0098] Through lexical analysis, the system constructs the complete, unoptimized expression tree shown in Figure 5.

[0099] In the grammar table, the block is defined as follows:

[0100] ::=

[0101] |

[0102] ::='>'|'+'|'~'|

[0103] That is, the block is composed of one or more layers separated by layer separators. There are three explicit layer separators: ">", "+", and "~". In jQuery syntax, a run of one or more blank characters is also a separator. When parsing, it is therefore necessary to judge whether a node contains blank characters: if it does, they are kept; otherwise they are removed.
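The separator handling described in [0103] can be sketched as a small tokenizer. This is a simplified illustration under stated assumptions: the function name `split_layers` is hypothetical, and the sketch ignores separators or blanks that appear inside attribute brackets or quoted values, which a full parser would have to handle.

```python
import re
from typing import List

# Either an explicit separator (with optional blanks around it),
# or a run of blanks that acts as a separator on its own.
_SEPARATOR = re.compile(r"\s*([>+~])\s*|\s+")


def split_layers(expr: str) -> List[str]:
    """Split a selector into alternating layers and separators.

    Blanks around ">", "+", "~" are dropped; a blank run standing alone
    between two layers is kept as the separator " ".
    """
    expr = expr.strip()
    tokens: List[str] = []
    pos = 0
    for match in _SEPARATOR.finditer(expr):
        tokens.append(expr[pos:match.start()])
        tokens.append(match.group(1) or " ")
        pos = match.end()
    tokens.append(expr[pos:])
    return tokens


print(split_layers("DIV.a>IMG.b[alt]"))   # ['DIV.a', '>', 'IMG.b[alt]']
print(split_layers("DIV.a  >  IMG.b"))    # ['DIV.a', '>', 'IMG.b']
print(split_layers("DIV.a IMG.b"))        # ['DIV.a', ' ', 'IMG.b']
```

The regular expression tries the explicit-separator branch first, so blanks next to ">" are treated as padding, and only a blank run with no explicit separator is reported as a separator.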

[0104] The above nodes can be optimized as shown in the schematic diagram of Figure 6, the op...


Abstract

The embodiment of the invention discloses a method for accurately extracting webpage content. The method comprises the following steps: the webpage content corresponding to a URL (uniform resource locator) is obtained, and its source code is parsed into a DOM (document object model) structure tree; a screening expression input by a user is read; a syntactic analyzer loads a grammar table and parses the screening expression into an expression tree composed of multiple words; the words are analyzed by semantic analysis to obtain a screening condition set. The screening condition set is composed of multiple screening condition objects, and each screening condition object is composed of one tag extracting method and multiple tag screening methods. The embodiment of the invention further discloses a system for accurately extracting webpage content. With the method and the system, the screening expression is recombined into an optimized expression tree through grammatical analysis and turned into a set of screening condition objects through semantic analysis, so that nodes of the DOM document tree can be located and screened quickly. This improves development efficiency and reduces maintenance cost.
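A screening condition object, as the abstract describes it, pairs one tag extracting method with several tag screening methods. The sketch below shows one way such an object might look. All names (`Tag`, `ScreeningCondition`, `by_name`, `has_attr`, `has_class`) and the tiny in-memory tree are assumptions for illustration, not the patent's actual data model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tag:
    """Tiny stand-in for a DOM node (hypothetical)."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Tag"] = field(default_factory=list)

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


Extractor = Callable[[Tag], List[Tag]]
Screen = Callable[[Tag], bool]


@dataclass
class ScreeningCondition:
    """One tag extracting method plus any number of tag screening methods."""
    extract: Extractor
    screens: List[Screen] = field(default_factory=list)

    def apply(self, root: Tag) -> List[Tag]:
        return [t for t in self.extract(root) if all(s(t) for s in self.screens)]


def by_name(name: str) -> Extractor:
    return lambda root: [t for t in root.walk() if t.name == name]


def has_attr(attr: str) -> Screen:
    return lambda tag: attr in tag.attrs


def has_class(cls: str) -> Screen:
    return lambda tag: cls in tag.attrs.get("class", "").split()


dom = Tag("body", children=[
    Tag("img", {"class": "b", "alt": "logo"}),
    Tag("img", {"class": "b"}),
    Tag("input", {"name": "q"}),
])
condition = ScreeningCondition(by_name("img"), [has_class("b"), has_attr("alt")])
selected = condition.apply(dom)
print([t.attrs for t in selected])  # [{'class': 'b', 'alt': 'logo'}]
```

A screening condition set would then simply be a list of such objects, each applied to the DOM tree in turn.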

Description

Technical field

[0001] The present invention relates to the technical field of the Internet, in particular to a method and system for accurately extracting web page content.

Background

[0002] Web page data collection is a technology similar to a search engine ROBOT: it collects articles and data on the Internet and stores them in a database to fill a website's content. Data collection is of great help in enriching a website's content and increasing its traffic.

[0003] However, most data collection methods in the prior art use regular expressions to locate and extract data. This approach performs relatively well, but it makes the entire system hard to maintain. To locate the data of a page accurately, the page content usually has to be extracted in segments, so the regular expression code becomes very large and difficult to maintain. If the label of the pag...


Application Information


IPC(8): G06F17/30

Inventor: 翁杰

Owner: 翁杰
