Web page content extraction method and web page content extraction system

A web page content extraction method and system, applied in the Internet field, that addresses the difficulty, rule complexity, and low efficiency of existing processing methods and achieves high extraction accuracy.

Inactive Publication Date: 2016-10-05
BEIJING QIHOO TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0005] Both of the above processing methods are inefficient and generalize poorly. Web page layouts on the Internet now vary widely, and the number of web page feature dimensions keeps growing, reaching hundreds in some cases. For complex web pages it is very difficult to derive adequate empirical formulas or to create labeling templates.
In addition, for web page features with many dimensions, the hand-written rules must be very complicated and are hard to maintain; and when a website is redesigned, the previous template may become invalid, causing deviations in the extracted content.


Examples

Embodiment Construction

[0027] The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with the teachings herein. The structure required to construct such a system is apparent from the above description. Furthermore, the present invention is not tied to any particular programming language. It should be understood that various programming languages can be used to implement the content of the present invention described herein, and that the above description of specific languages is given to disclose the best mode of the present invention.

[0028] The technical solution of the embodiment of the present invention is based on dividing the page into blocks, that is, dividing the page into different types of "blocks" according to their content. Figure 1 and Figure 2 show two common page structures, respectively, wherein, in Figure 1, the page of the forum website shown i...

Abstract

The invention provides a web page content extraction method and a web page content extraction system. The web page content extraction method includes: extracting the DOM tree structure corresponding to the web page; traversing the DOM tree to obtain the dimensional features of each node in the DOM tree; inputting the dimensional features of each node into a decision tree according to predetermined rules, classifying each node, and determining the structural blocks of the web page according to the classification results of the decision tree; and selectively extracting the corresponding web page content according to the structural blocks. With the technical solution of the present invention, the web page is divided into structural blocks according to its DOM tree structure, the content of irrelevant blocks is filtered out, and the web page content of the required blocks is extracted. No manual rules are needed during blocking and extraction, which solves the problems of low efficiency and complicated maintenance associated with manual rules.
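
The four steps named in the abstract (build the DOM tree, traverse it to obtain per-node features, classify each node with a decision tree, extract by block) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the four features, the labels "navigation" and "main_text", the thresholds 0.5 and 40, and the helper names DomBuilder, features, classify and extract. The decision tree here is written by hand as nested conditions; the patent does not disclose its tree in this excerpt, and a practical one would more likely be learned from labeled pages.

```python
from html.parser import HTMLParser


class Node:
    """A DOM node; text nodes use the tag '#text' and carry their text."""
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        self.depth = 0 if parent is None else parent.depth + 1


class DomBuilder(HTMLParser):
    """Step 1: build a simple DOM tree from HTML."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:          # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current               # find the matching open element
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent    # stray close tags are ignored

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data))


def all_text(node):
    return node.text + "".join(all_text(c) for c in node.children)


def link_text(node):
    if node.tag == "a":
        return all_text(node)
    return "".join(link_text(c) for c in node.children)


def elements(node):
    """Step 2a: traverse the tree, yielding element nodes only."""
    if node.tag != "#text":
        yield node
        for child in node.children:
            yield from elements(child)


def features(node):
    """Step 2b: per-node features (an assumed, tiny feature set)."""
    text_len, link_len = len(all_text(node)), len(link_text(node))
    return {
        "depth": node.depth,
        "text_len": text_len,
        "link_ratio": link_len / text_len if text_len else 0.0,
        "n_children": sum(1 for c in node.children if c.tag != "#text"),
    }


def classify(f):
    """Step 3: a hand-written toy decision tree (thresholds are assumptions)."""
    if f["text_len"] == 0:
        return "empty"
    if f["link_ratio"] > 0.5:
        return "navigation"
    return "main_text" if f["text_len"] >= 40 else "other"


def extract(html, wanted="main_text"):
    """Step 4: return the text of the innermost nodes labeled `wanted`."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(elements(builder.root))
    labels = {id(n): classify(features(n)) for n in nodes}
    blocks = []
    for n in nodes:
        inner = [d for c in n.children for d in elements(c)]
        if labels[id(n)] == wanted and not any(labels[id(d)] == wanted for d in inner):
            blocks.append(all_text(n))
    return blocks
```

Calling extract(page) on an HTML string returns a list of text blocks, and passing wanted="navigation" selects link-heavy nodes instead. Keeping only the innermost matching node is a design choice of this sketch to avoid returning the same text once per enclosing wrapper element.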

Description

Technical field

[0001] The present invention relates to the Internet field, in particular to a web page content extraction method and a web page content extraction system.

Background technique

[0002] Generally speaking, web pages contain rich and complex information, including navigation, titles, body text, timestamps, and even advertisements. To extract the useful content from a web page, the page must be analyzed in detail. In the prior art, there are two processing methods for extracting web page content.

[0003] The first is to use manually set rules to extract the content of a fixed area in the page.

[0004] The second is to manually annotate the language in which the page is written to form a web page construction template. For most simple web pages, a template form can be summarized from the positions of the web page information. In this way, when extracting information from a web page, it is only necessary to follow the template corresponding to the web page to extract the corre...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 王志刚
Owner: BEIJING QIHOO TECH CO LTD