Method and device for extracting webpage information

A web page information extraction technology, applied in the network field. It addresses the long setup period of extraction rules, the limited reduction in the amount of stored data, and the inability to extract web page information, and it aims to reduce the impact of web page structure changes so that extraction is stable, universally applicable, reliable, and accurate.

Inactive Publication Date: 2013-10-09
人民搜索网络股份公司
5 Cites, 17 Cited by
AI Technical Summary

Problems solved by technology

[0008] Because the web pages of different sites must be analyzed one by one before the corresponding extraction rules can be set according to the analysis results, setting the extraction rules takes a long time.
[0009] In addition, to reduce the amount of data saved in the database, the rules common to the same entry across different sites are extracted; however, each site may have many special rules, which greatly limits how much the saved data can be reduced. At the same time, due to various sites ma

Examples

Example Embodiment

[0057] To help those skilled in the art better understand the solution of the present invention, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0058] The following first introduces the application scenarios of the present invention and the preparation work required before information extraction.

[0059] Webpage information extraction is an important part of search engine page analysis. The webpage content that users are interested in is extracted and organized into structured data, which helps search engines index and search webpages more effectively. The invention provides an automatic and reliable webpage information extraction scheme.

[0060] For a website, its HTML (HyperText Markup Language, a markup language used to describe web documents) pages are not all built by manual editing; most are generated by website creation tools and template code. T...

Abstract

The invention discloses a method and a device for extracting webpage information. The method comprises the following steps: determining the identity tag of a to-be-extracted webpage according to the page information of that webpage; looking up, in a sample database, the sample set corresponding to the identity tag, wherein the sample set comprises at least one DOM (Document Object Model) sample; selecting one of the DOM samples as the current DOM sample, and matching it against the DOM structure parsed from the to-be-extracted webpage; if the matching succeeds, locating the nodes that hold the to-be-extracted information in the DOM structure according to the position of that information in the current DOM sample, and obtaining the information from those nodes; if the matching fails, selecting the next DOM sample as the current DOM sample and repeating the matching, and returning an extraction-failure message once matching has failed for every DOM sample. The disclosed method and device minimize the influence of webpage structure changes on the information extraction process, so as to achieve reliable and accurate extraction of webpage information.
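The steps in the abstract can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the names `DomSample`, `identity_tag`, `skeleton_of`, `node_at`, and `extract` are hypothetical, the identity tag is assumed to be the URL host name, a "match" is assumed to mean that the tag skeletons are identical, and the page is assumed to be well-formed markup so that the standard-library XML parser can stand in for an HTML parser. The source does not specify any of these details.

```python
from dataclasses import dataclass
from urllib.parse import urlparse
import xml.etree.ElementTree as ET


@dataclass
class DomSample:
    """Hypothetical DOM sample: a tag skeleton plus where the wanted data sits."""
    skeleton: tuple   # nested (tag, children) tuples
    positions: dict   # field name -> path of child indices from the root


def skeleton_of(node):
    """Reduce a parsed node to its tag structure, ignoring text and attributes."""
    return (node.tag, tuple(skeleton_of(child) for child in node))


def identity_tag(url):
    """Assumption: pages from the same host share one identity tag."""
    return urlparse(url).netloc


def node_at(root, path):
    """Follow a path of child indices down from the root node."""
    node = root
    for index in path:
        node = node[index]
    return node


def extract(url, markup, sample_db):
    """Try each DOM sample for the page's identity tag; None means extraction failed."""
    root = ET.fromstring(markup)
    page_skeleton = skeleton_of(root)
    for sample in sample_db.get(identity_tag(url), []):
        # Assumed matching criterion: identical tag skeletons.
        if sample.skeleton == page_skeleton:
            return {
                name: (node_at(root, path).text or "").strip()
                for name, path in sample.positions.items()
            }
    return None  # no sample set, or every sample failed to match


# Two samples for one site: an older layout and the current one.
old_page = "<html><body><div>Old title</div></body></html>"
new_page = "<html><body><h1>Sample title</h1><p>Sample text</p></body></html>"
sample_db = {
    "news.example.com": [
        DomSample(skeleton_of(ET.fromstring(old_page)), {"title": (0, 0)}),
        DomSample(skeleton_of(ET.fromstring(new_page)),
                  {"title": (0, 0), "content": (0, 1)}),
    ]
}

page = "<html><body><h1>Title A</h1><p>Body text</p></body></html>"
result = extract("http://news.example.com/a/1.html", page, sample_db)
print(result)
```

In this sketch the first sample fails to match and the second succeeds, which mirrors how keeping several DOM samples per identity tag lets extraction continue after a site changes its layout.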

Description

Technical field

[0001] The invention relates to the field of network technology, in particular to a method and device for extracting web page information.

Background technique

[0002] With the continuous development of Internet technology, the Internet has become an important platform for publishing information. How to quickly and accurately obtain the information users need from the Internet has become an urgent problem. Webpage information extraction uses the Internet as an information source to obtain the webpages that users are interested in from different information sources. After extraction, the extracted information is stored in a database, so that users can use the information in the database for information query, search, data mining, or data analysis. The purpose of webpage information extraction is to extract the semi-structured information presented in the textual form of the webpage and represent it as structured data, so as to convert the...

Application Information

IPC(8): G06F17/30
Inventor 李杨瑞崔世起杨青
Owner 人民搜索网络股份公司