Webpage information extracting method and webpage information extracting equipment

A web page information extraction method in the field of Internet applications. It addresses the high cost, error rate, and inability to update extraction rules in real time that come with manual extraction, thereby improving accuracy and reducing cost.

Inactive Publication Date: 2014-04-09
BEIJING QIHOO TECH CO LTD +1
5 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

[0004] However, manual extraction in the prior art has at least two deficiencies. First, rules must be written by hand for each site, so when a large number of sites need to be crawled, manually deriving the rules and writing the programs carries a certain error rate, and the cost is too high.
Second, when the page structure of a site changes, the original rules stop working and must be derived and coded by hand again. If the change in page structure is not noticed in time, the extraction rules on which web page information extraction is based cannot be updated in real time, which reduces the accuracy of web page information extraction.
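
The second deficiency can be illustrated with a minimal sketch. The markup, the `TITLE_RULE` pattern, and the `extract_title` helper are hypothetical and are not taken from the patent; they only show how a rule written by hand for one page layout silently stops matching after a redesign.

```python
import re

# Hypothetical hand-written rule for one site: the title is assumed to sit
# in <h1 class="title">...</h1>. Both the markup and the rule are illustrative.
TITLE_RULE = re.compile(r'<h1 class="title">(.*?)</h1>', re.S)


def extract_title(html):
    """Apply the hand-written rule; return None when it no longer matches."""
    match = TITLE_RULE.search(html)
    return match.group(1).strip() if match else None


old_page = '<html><body><h1 class="title">Daily news</h1></body></html>'
# The site is redesigned: same content, different markup.
new_page = '<html><body><div class="headline"><span>Daily news</span></div></body></html>'

print(extract_title(old_page))  # Daily news
print(extract_title(new_page))  # None
```

Until someone notices the `None` results and rewrites the rule, every page fetched from the redesigned site yields no title.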


Examples

Detailed Description of Embodiments

[0029] Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided for more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

[0030] As mentioned in the related art, manual extraction of web page information has a certain error rate. Moreover, when the page structure changes, the change may not be noticed in time, so the extraction rules on which web page information extraction is based cannot be updated in real time. Furthermore, it may lead to a problem that the accuracy of webpage info...

Abstract

The invention provides a web page information extraction method and web page information extraction equipment. The method comprises the following steps: acquiring an extraction rule that is automatically generated according to web page contents; and extracting web page information by using the extraction rule. According to an embodiment of the invention, the error rate caused by manual extraction of web page information in the prior art is addressed and the cost of web page information extraction is reduced. Moreover, the prior-art problem that the extraction rule on which extraction is based cannot be updated in real time is solved, and the accuracy of web page information extraction is improved.
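
The two steps named in the abstract can be sketched as follows. This listing does not say how the rule is generated, so the heuristic here is purely an assumption: the rule is taken to be the `(tag, class)` pair that holds the most text on the page. The names `generate_rule`, `extract`, `collect`, and the sample page are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class _TextCollector(HTMLParser):
    """Records each text chunk together with its enclosing (tag, class)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, class) pairs
        self.chunks = []   # ((tag, class), text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.chunks.append((self.stack[-1], text))


def collect(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def generate_rule(html):
    """Step 1 (assumed heuristic): pick the (tag, class) holding the most text."""
    weight = Counter()
    for key, text in collect(html):
        weight[key] += len(text)
    return weight.most_common(1)[0][0]


def extract(html, rule):
    """Step 2: apply the rule and return the matching text chunks."""
    return [text for key, text in collect(html) if key == rule]


page = (
    '<html><body><div class="nav">Home</div>'
    '<p class="body">First paragraph of the article.</p>'
    '<p class="body">Second paragraph of the article.</p>'
    '</body></html>'
)
rule = generate_rule(page)
print(rule)
print(extract(page, rule))
```

Because the rule is derived from the page itself, `generate_rule` can simply be run again when the markup changes, rather than waiting for someone to rewrite a rule by hand. A real system would need a more robust heuristic than text length, and this sketch raises `IndexError` on a page with no text.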

Description

Technical field

[0001] The invention relates to the field of Internet applications, in particular to a method and device for extracting web page information.

Background art

[0002] Web page information extraction is a technology for extracting target information from web pages, that is, for extracting valuable information from the natural language text and structured data of web pages.

[0003] Web page information extraction in the prior art is done manually: by observing a web page and its source code, programmers identify some rules and then write programs according to these rules to extract valuable information. To make the extraction process easier, programmers have constructed several schema specification languages and their user interfaces.

[0004] However, there are at least two deficiencies in this method of manual extraction in the prior art: First, each site in the web page needs to man...

Application Information

IPC(8): G06F17/30
CPC: G06F16/95
Inventor: 徐锐波付赟
Owner: BEIJING QIHOO TECH CO LTD