Method and system for accurately extracting webpage content

A technique for precisely extracting web page content, applied in the Internet field. It addresses the time-consuming adjustment and difficult system maintenance of existing approaches, reducing maintenance costs and improving development efficiency.

Active Publication Date: 2013-07-31
翁杰
Cites: 5. Cited by: 13.

AI Technical Summary

Problems solved by technology

Extraction based on regular expressions performs well, but it makes the whole system very hard to maintain.
To accurately locate the data on a page, the page content usually has to be extracted in segments, so the regular expression code becomes very large and difficult to maintain.
If the label of the page to be

Examples

Embodiment 1

[0068] When the user enters the command "INPUT, SELECT, TEXTAREA, IMG", it means: find all INPUT, SELECT, TEXTAREA, and IMG tags in the page.
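The effect of this command can be illustrated with a minimal sketch. It uses Python's standard `html.parser` module rather than the patent's own parser, and the names `TagCollector` and `find_tags` are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects every start tag whose name is in a wanted set (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__()
        # HTMLParser reports tag names in lower case, so normalize the command.
        self.wanted = {name.lower() for name in wanted}
        self.found = []  # (tag, attribute dict) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.found.append((tag, dict(attrs)))


def find_tags(html, command):
    """Interpret a comma-separated tag list such as "INPUT, SELECT, TEXTAREA, IMG"."""
    wanted = [part.strip() for part in command.split(",") if part.strip()]
    collector = TagCollector(wanted)
    collector.feed(html)
    collector.close()
    return collector.found


page = ('<form><input name="q"><select><option>1</option></select>'
        '<textarea></textarea><img src="a.png"><p>x</p></form>')
for tag, attrs in find_tags(page, "INPUT, SELECT, TEXTAREA, IMG"):
    print(tag, attrs)
```

Tags not named in the command, such as `form`, `option`, and `p` here, are ignored.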

[0069] Through lexical analysis, the system builds the complete, unoptimized expression tree shown in Figure 3.

[0070] In the grammar table, the block is defined as follows:

[0071] ::= ','

[0072] |

[0073] This means the block is composed of one or more elements separated by commas ",".

[0074] Before optimization, the nodes presented in the syntax tree are nested, which can be expressed as follows:

[0075] INPUT,SELECT,TEXTAREA,IMG=(INPUT,SELECT,TEXTAREA)+(IMG)

[0076] INPUT,SELECT,TEXTAREA=(INPUT,SELECT)+(TEXTAREA)

[0077] INPUT,SELECT=(INPUT)+(SELECT)

[0078] SELECT=(SELECT)

[0079] This nested form is not conducive to syntax analysis, so the nodes need to be optimized into the following expression tree, which is easier to analyze:

[0080] INPUT,SE...
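The nesting in [0075] to [0078] and one plausible form of the optimization can be sketched as follows. The optimized tree is truncated in the text above, so flattening the nested nodes into a single ordered list is an assumption, not the patent's confirmed result. The function names `parse_nested` and `flatten` are hypothetical.

```python
def parse_nested(command):
    """Build the left-nested form, e.g. (((INPUT, SELECT), TEXTAREA), IMG)."""
    names = [part.strip() for part in command.split(",")]
    tree = names[0]
    for name in names[1:]:
        # Each comma wraps everything parsed so far as the left child.
        tree = (tree, name)
    return tree


def flatten(tree):
    """Assumed optimization: collapse the nested pairs into one flat list."""
    out = []
    while isinstance(tree, tuple):
        left, right = tree
        out.append(right)  # the right child is always a single tag name
        tree = left
    out.append(tree)       # the innermost left child
    out.reverse()          # restore the original left-to-right order
    return out


nested = parse_nested("INPUT,SELECT,TEXTAREA,IMG")
print(nested)
print(flatten(nested))
```

A flat list lets a later stage iterate over the tag names directly instead of recursing through one level of nesting per comma.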

Embodiment 2

[0097] When the user enters the expression "DIV.a>IMG.b[alt]", it means: first find the DIV tags with the style (class) name a in the page, then select from their child elements the IMG tags that have the style name b and contain the alt attribute.

[0098] Through lexical analysis, the system builds the complete, unoptimized expression tree shown in Figure 5.

[0099] In the grammar table, the block is defined as follows:
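The meaning of "DIV.a>IMG.b[alt]" can be made concrete with a small hard-coded matcher. This is a sketch built on Python's standard `html.parser`, not the patent's implementation. It assumes well-formed HTML, and the names `ChildMatcher` and `select_div_a_child_img_b_alt` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ChildMatcher(HTMLParser):
    """Hard-coded matcher for DIV.a>IMG.b[alt] (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one flag per open element: is it a DIV with class "a"?
        self.matches = []  # attribute dicts of the matching IMG tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # ">" means direct child, so only the innermost open element counts.
        parent_ok = bool(self.stack) and self.stack[-1]
        if parent_ok and tag == "img" and "b" in classes and "alt" in attrs:
            self.matches.append(attrs)
        if tag not in VOID:
            self.stack.append(tag == "div" and "a" in classes)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


def select_div_a_child_img_b_alt(html):
    matcher = ChildMatcher()
    matcher.feed(html)
    matcher.close()
    return matcher.matches


page = ('<div class="a"><img class="b" alt="x" src="1.png">'
        '<img class="b" src="2.png">'
        '<p><img class="b" alt="deep" src="3.png"></p></div>'
        '<div class="c"><img class="b" alt="y" src="4.png"></div>')
print([m["src"] for m in select_div_a_child_img_b_alt(page)])
```

Only the first image satisfies every condition: the second lacks alt, the third is a grandchild rather than a child, and the fourth sits under a DIV without the style name a.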

[0100] ::=

[0101] |

[0102] ::='>'|'+'|'~'|

[0103] This means the block is composed of one or more layers separated by layer separators. There are three explicit layer separators: ">", "+", and "~". When parsing the jQuery syntax, another separator is one or more blank characters. In this case, it is necessary to judge whether the node contains blank characters: they are kept if they exist, and removed otherwise.

[0104] The above nodes can be optimized as shown in the schematic diagram of Figure 6, the op...
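Splitting an expression into layers can be sketched with a regular expression. The reading of the blank-character rule used here is an assumption: blanks around an explicit separator are dropped, while a run of blanks standing alone is kept as a separator of its own. The name `split_layers` is hypothetical, and the sketch does not handle separators that appear inside attribute brackets or quoted values.

```python
import re

# Either an explicit separator with optional surrounding blanks,
# or a run of blanks acting as a separator by itself.
_SEPARATOR = re.compile(r"\s*([>+~])\s*|\s+")


def split_layers(expression):
    """Return (layers, separators) for a screening expression."""
    expression = expression.strip()
    layers, separators = [], []
    position = 0
    for match in _SEPARATOR.finditer(expression):
        layers.append(expression[position:match.start()])
        # group(1) is None when the separator consisted of blanks only.
        separators.append(match.group(1) or " ")
        position = match.end()
    layers.append(expression[position:])
    return layers, separators


print(split_layers("DIV.a>IMG.b[alt]"))
print(split_layers("UL > LI ~ A"))
print(split_layers("DIV.a  IMG.b"))
```

Returning the separators alongside the layers keeps the relationship between adjacent layers available to later stages.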

Abstract

The embodiment of the invention discloses a method for accurately extracting webpage content. The method comprises the following steps: the webpage content corresponding to a URL (uniform resource locator) is obtained, and the source code of the webpage content is parsed into a DOM (document object model) structure tree; a screening expression input by a user is read; a syntax analyzer loads a grammar table and parses the screening expression into an expression tree composed of multiple words; the words are then processed by semantic analysis to obtain a screening condition set. The screening condition set is composed of multiple screening condition objects, and each screening condition object is composed of one tag extracting method and multiple tag screening methods. The embodiment of the invention further discloses a system for accurately extracting webpage content. With the method and the system, the screening expression is recombined into an optimized expression tree through syntax analysis and turned into a set of screening condition objects through semantic analysis, so that DOM document tree nodes can be located and screened quickly. This helps improve development efficiency and reduce maintenance costs.
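A screening condition object, as described above, pairs one tag extracting method with several tag screening methods. The following is a minimal sketch of that structure under stated assumptions: the dictionary-based node layout and every name in it (`ScreeningCondition`, `node`, `children`, `descendants`, `has_tag`, `has_class`, `has_attr`, `run`) are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def node(tag, attrs=None, kids=None):
    """Hypothetical DOM node: a plain dict with tag, attrs, and children."""
    return {"tag": tag, "attrs": attrs or {}, "children": kids or []}


@dataclass
class ScreeningCondition:
    extract: Callable                # tag extracting method: nodes -> candidate nodes
    filters: List[Callable] = field(default_factory=list)  # tag screening methods

    def apply(self, nodes):
        return [n for n in self.extract(nodes) if all(f(n) for f in self.filters)]


def children(nodes):
    return [c for n in nodes for c in n["children"]]


def descendants(nodes):
    out = []
    for n in nodes:
        for c in n["children"]:
            out.append(c)
            out.extend(descendants([c]))
    return out


def has_tag(name):
    return lambda n: n["tag"] == name


def has_class(name):
    return lambda n: name in n["attrs"].get("class", "").split()


def has_attr(name):
    return lambda n: name in n["attrs"]


def run(root, conditions):
    """Apply each screening condition to the result of the previous one."""
    nodes = [root]
    for condition in conditions:
        nodes = condition.apply(nodes)
    return nodes


# Condition set assumed for DIV.a>IMG.b[alt]
conditions = [
    ScreeningCondition(descendants, [has_tag("div"), has_class("a")]),
    ScreeningCondition(children, [has_tag("img"), has_class("b"), has_attr("alt")]),
]
root = node("html", kids=[
    node("div", {"class": "a"}, [
        node("img", {"class": "b", "alt": "x", "src": "1.png"}),
        node("img", {"class": "b", "src": "2.png"}),
    ]),
])
print([n["attrs"]["src"] for n in run(root, conditions)])
```

Keeping extraction and screening as separate callables means a new separator or filter type only adds a function, which fits the maintainability goal stated in the abstract.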

Description

Technical field

[0001] The present invention relates to the technical field of the Internet, in particular to a method and system for accurately extracting web page content.

Background

[0002] Web page data collection is a technology similar to a search engine robot: it collects articles and data from the Internet and stores them in a database to fill a website's content. Data collection technology is of great help in enriching a website's content and increasing its traffic.

[0003] However, most of the data collection methods in the prior art use regular expressions to locate and extract data. This approach performs well, but it makes the whole system hard to maintain. To accurately locate the data on a page, the page content usually has to be extracted in segments, so the regular expression code becomes very large and difficult to maintain. If the label of the pag...


Application Information

IPC(8): G06F17/30
Inventor: 翁杰
Owner: 翁杰