
Web page element extraction method and web page element extraction system

A web page element extraction technology, applied in the Internet field, that addresses problems such as complex maintenance, poor versatility, and complicated manual rules, and achieves high block accuracy

Active Publication Date: 2016-10-05
BEIJING QIHOO TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] Both of the above processing methods are inefficient and generalize poorly. Web pages on the Internet now vary widely in form, and the dimensionality of web page features keeps growing, reaching hundreds of dimensions in some cases. For complex web pages it is very difficult to summarize adequate empirical rules or to create labeling templates.
In addition, for web page features with many dimensions, the hand-written rules must be very complicated and are hard to maintain; and when a website is redesigned, the previous template may become invalid, resulting in deviations in the extracted elements.




Embodiment Construction

[0028] The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with teaching based on this. The structure required to construct such a system is apparent from the above description. Furthermore, the present invention is not specific to any particular programming language. It is to be understood that various programming languages may be used to implement the inventions described herein, and that the descriptions of specific languages above are intended to disclose the best mode for carrying out the invention.

[0029] The technical solution of the embodiment of the present invention is based on dividing the page into blocks, that is, dividing the page into different types of "blocks" according to their content. Figure 1 and Figure 2 show two common page structures, respectively, wherein the page of the forum website shown in Figure 1 is ...
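The idea of dividing a page into blocks can be illustrated with a minimal sketch. It is not the patent's implementation: the `BLOCK_TAGS` set, the `Node` class, the "innermost block-level element" rule, and the three features (class, text length, link count) are all assumptions chosen for illustration. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Assumption: these tags count as block candidates; the source gives no such list.
BLOCK_TAGS = {"div", "p", "ul", "table", "h1", "h2", "nav", "header", "footer"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A hypothetical, minimal DOM node."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = []

    def full_text(self):
        # Own text first, then descendants' text (interleaving order is not preserved).
        parts = list(self.text) + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects while parsing."""

    def __init__(self):
        super().__init__()
        self.root = Node("document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.text.append(data)


def count_tag(node, tag):
    """Count descendants of `node` with the given tag."""
    return sum((c.tag == tag) + count_tag(c, tag) for c in node.children)


def leaf_blocks(node):
    """Yield innermost block-tag nodes in document order."""
    for child in node.children:
        nested = list(leaf_blocks(child))
        if nested:
            yield from nested
        elif child.tag in BLOCK_TAGS:
            yield child


def page_blocks(html):
    """Return one feature dict per block candidate (features are illustrative)."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return [
        {
            "tag": n.tag,
            "class": n.attrs.get("class") or "",
            "text_len": len(n.full_text()),
            "links": count_tag(n, "a"),
        }
        for n in leaf_blocks(builder.root)
    ]
```

Calling `page_blocks` on a forum-style page yields an ordered list of block candidates. A classifier could then assign each one a type such as navigation, title, or body text.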


Abstract

The invention provides a page element extraction method and a page element extraction system. The page element extraction method comprises: building a DOM (Document Object Model) tree structure corresponding to a web page; classifying the nodes of the DOM tree structure using a decision tree and building a first block sequence of the web page according to the classification result; inputting the first block sequence into a conditional random field, performing an optimization computation, and obtaining a second block sequence; and selecting sequence elements of preset types in the second block sequence and extracting the page elements corresponding to those sequence elements. According to this technical scheme, the block sequences of the web page are built from its DOM tree structure, irrelevant block content is filtered out, and the needed page elements are extracted. No manual rules are needed in the extraction process, which solves the problems that manual rules are inefficient and complex to maintain.
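The refinement and selection steps in the abstract can be sketched as follows, under stated assumptions. The per-block scores stand in for the output of a node classifier, and the transition weights stand in for what a conditional random field would learn from labeled pages. Both are hand-set here, and the label names are hypothetical. The decoding uses the Viterbi algorithm, the standard way to find the best label sequence in a linear-chain model; the source does not say which CRF variant or decoder the invention uses.

```python
# Hypothetical block types; the source names navigation, title, text, time, and ads.
LABELS = ["nav", "title", "content", "ad"]

# Assumed per-block scores (log domain), one dict per block in page order.
EMISSIONS = [
    {"nav": 2.0},
    {"title": 2.0},
    {"content": 2.0},
    {"ad": 1.0, "content": 0.8},   # ambiguous block inside the body
    {"content": 2.0},
]

# Assumed transition scores; pairs not listed default to 0.
TRANSITIONS = {
    ("content", "content"): 1.0,
    ("content", "ad"): -1.0,
    ("ad", "content"): -1.0,
}


def first_sequence(emissions):
    """Label each block independently by its best score (first block sequence)."""
    return [max(LABELS, key=lambda lab: s.get(lab, 0.0)) for s in emissions]


def viterbi(emissions, transitions):
    """Return the jointly best label sequence (second block sequence)."""
    if not emissions:
        return []
    best = [{lab: emissions[0].get(lab, 0.0) for lab in LABELS}]
    back = []
    for scores in emissions[1:]:
        row, ptr = {}, {}
        for cur in LABELS:
            # Best previous label for reaching `cur` at this position.
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            row[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + scores.get(cur, 0.0)
            )
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def select_blocks(labels, wanted):
    """Indices of blocks whose label is one of the preset types."""
    return [i for i, lab in enumerate(labels) if lab in wanted]
```

With these assumed weights, the independent labeling marks the fourth block as an ad, while the sequence-level decoding relabels it as content because its neighbors are content. `select_blocks(viterbi(EMISSIONS, TRANSITIONS), {"title", "content"})` then keeps only the blocks to extract.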

Description

Technical field

[0001] The present invention relates to the field of the Internet, in particular to a web page element extraction method and a web page element extraction system.

Background

[0002] Generally speaking, web pages contain rich and complex information, including navigation, title, text, time, and even advertisements. In order to extract useful elements from a web page, it is necessary to analyze the web page in detail. In the prior art, there are two processing methods for extracting web page elements.

[0003] The first is to use manually defined rules to extract elements that are fixed in a certain area of the page.

[0004] The second is to manually annotate the page markup to form a web page construction template. For most simple web pages, a template form can be derived from the position of the web page information. In this way, when extracting information from a web page, one only needs to follow this template corr...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 王志刚
Owner: BEIJING QIHOO TECH CO LTD