Method and device for extracting information from webpage

A technique for extracting information attributes from webpages, applicable to fields such as instruments, computing, and electrical digital data processing, that addresses the problem of being unable to obtain accurate information from webpages in unstructured formats.

Inactive Publication Date: 2012-08-08
PEKING UNIV +2

AI Technical Summary

Problems solved by technology

[0011] Embodiments of the present invention provide a method and device for extracting information from webpages, to solve the prior-art problem that accurate information cannot be obtained from webpages in unstructured formats.

Examples

Embodiment 1

[0048] Embodiment 1. Here, the information may specifically be news information. Candidate news attributes are obtained from the webpage and placed into the corresponding attribute candidate sets. Referring to Figure 2, the procedure includes:

[0049] Step 201: Read a webpage containing news, and convert it into a Document Object Model (DOM) tree structure.

[0050] In the embodiment of the present invention, the WebBrowser component for the C# language in Microsoft Visual Studio 2005 can be used to convert the read webpage into a DOM tree structure.
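As a rough illustration of Step 201, the following Python sketch builds a simple DOM-like tree from an HTML string. It is not the implementation described in the embodiment, which uses a C# browser component; it substitutes Python's standard `html.parser` module, and the names `Node`, `DomBuilder`, `parse_dom`, and `walk`, as well as the sample page, are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, Iterator, List, Optional, Tuple

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    """Hypothetical DOM node: tag name, HTML attributes, own text, children."""
    tag: str
    attrs: Dict[str, Optional[str]] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


class DomBuilder(HTMLParser):
    """Builds a Node tree by tracking the stack of currently open elements."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def parse_dom(html: str) -> Node:
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# Made-up sample page, used only to exercise the sketch.
page = ("<html><body><h1 class='title'>Sample headline</h1>"
        "<p>Body text.</p></body></html>")
tree = parse_dom(page)
for depth, node in walk(tree):
    print("  " * depth + node.tag, node.attrs, repr(node.text))
```

Each node here carries only a tag, its HTML attributes, and its own text. A rendering component such as the one named in the embodiment could additionally expose font and position information, which this sketch does not model.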

[0051] Each node of the DOM tree structure is an attribute of news information, that is, it may be title, release time, author, comment link, source, text, subject, related news or pictures. The attributes corresponding to each node include sub-attributes. For example, if a node is a title, the node contains font information, label information, position information, text information, etc. of the t...

Embodiment 2

[0099] Embodiment 2 is a preferred embodiment of the present invention, which can extract more comprehensive information according to the positional relationship between every two types of information attributes in the webpage, so as to obtain more accurate information.

[0100] Of course, the embodiment of the present invention can also rely on the positional relationship between only a certain two types of information attributes: it extracts the information attribute combinations that meet that positional relationship, and outputs the extracted combinations as the information.

[0101] Based on the above method for extracting information from webpages, a device for extracting information from webpages can be constructed. Referring to Figure 6, the device includes an acquisition unit 100, a determination unit 200, and an extraction unit 300.

[0102] The acquisition unit 100 is configured to search for each type of information attribute in the webpage, and obtain a candida...

Abstract

The invention discloses a method and a device for extracting information from a webpage, which solve the prior-art problem that accurate information cannot be acquired from a webpage in an unstructured format. The method comprises the following steps: searching for each information attribute in the webpage and acquiring an information attribute candidate set corresponding to each information attribute; looking up at least one maximum layout relationship probability between at least two information attributes according to a stored correspondence between position relationships and layout relationship probabilities among the information attributes, and determining the position relationship corresponding to the maximum layout relationship probability found; and extracting an information attribute combination that meets the position relationship from the information attribute candidate sets corresponding to the at least two information attributes.
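The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the relation labels `above` and `below`, the `y` coordinate, the `LAYOUT_PROBS` table and its probability values, and all function names are hypothetical and chosen only to make the idea concrete for a single pair of attributes.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    text: str
    y: int  # hypothetical vertical offset on the page (smaller = higher up)


# Hypothetical stored correspondence between position relationships and
# layout relationship probabilities for an ordered pair of attributes.
LAYOUT_PROBS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("title", "release_time"): {"above": 0.9, "below": 0.1},
}


def relation(a: Candidate, b: Candidate) -> str:
    """Position of a relative to b; equal offsets count as 'below' here."""
    return "above" if a.y < b.y else "below"


def most_probable_relation(attr_a: str, attr_b: str) -> str:
    """Return the position relationship with the maximum stored probability."""
    probs = LAYOUT_PROBS[(attr_a, attr_b)]
    return max(probs, key=probs.get)


def extract_combinations(
    candidates: Dict[str, List[Candidate]], attr_a: str, attr_b: str
) -> List[Tuple[Candidate, Candidate]]:
    """Keep candidate pairs whose layout matches the most probable relation."""
    target = most_probable_relation(attr_a, attr_b)
    return [
        (a, b)
        for a, b in product(candidates[attr_a], candidates[attr_b])
        if relation(a, b) == target
    ]


# Step 1 (assumed already done): one candidate set per information attribute.
candidates = {
    "title": [Candidate("Sample headline", 40), Candidate("Related story", 900)],
    "release_time": [Candidate("2012-08-08 10:00", 80)],
}

# Steps 2 and 3: find the most probable relation, then filter combinations.
combos = extract_combinations(candidates, "title", "release_time")
for title, time in combos:
    print(title.text, "|", time.text)
```

With these made-up values, only the title candidate positioned above the release time survives the filter, which mirrors how a stored layout probability could discard a noisy candidate such as a related-story link.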

Description

Technical Field

[0001] The invention relates to the technical field of information retrieval and data integration, and in particular to a method and device for extracting information from webpages.

Background

[0002] Since the birth of the Web in the early 1990s, it has developed at an astonishing speed. The Web has become the largest information repository in the world, covering all fields of the real world, and has become the main way for people to obtain information in their work and life. Web information is published mainly in the form of web pages. According to the latest estimates, the number of web pages on the Web has exceeded 550 billion.

[0003] Although web pages are a very important source of information, the Web contains a large number of websites, and the pages carrying this information usually contain a lot of useless noise, which seriously affec...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 刘伟, 万小军, 杨建武, 肖建国
Owner: PEKING UNIV