Method and device for extracting information from webpage

A technology relating to web pages and information attributes, applied in fields such as instruments, computing, and electrical digital data processing, which addresses the problem that accurate information cannot be obtained from web pages in unstructured formats.

Status: Inactive; Publication Date: 2012-08-08

PEKING UNIV +2



Problems solved by technology

[0011] Embodiments of the present invention provide a method and device for extracting information from webpages, to solve the prior-art problem that accurate information cannot be obtained from webpages in unstructured formats.



Examples


Embodiment 1

[0048] Embodiment 1. Here, the information may specifically be news information. Candidate news attributes are obtained from the webpage and placed into the corresponding attribute candidate sets. Referring to Figure 2, the process includes:

[0049] Step 201: Read a webpage containing news, and convert the webpage into a Document Object Model (DOM) tree structure.

[0050] In the embodiment of the present invention, the WebBrowser component available to the C# language in Microsoft Visual Studio 2005 can be used to convert the webpage that was read into a DOM tree structure.

[0051] Each node of the DOM tree structure may correspond to an attribute of the news information: the title, release time, author, comment link, source, text, subject, related news, or pictures. The attribute corresponding to each node includes sub-attributes. For example, if a node is a title, the node contains font information, label information, position information, text information, etc. of the t...
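As a rough illustration of Step 201, the sketch below builds a simplified DOM-like tree in Python using the standard library's `html.parser`. This is a stand-in for the WebBrowser component named in the patent, not a reproduction of it. The `Node` class, `parse_dom`, and `find_all` are hypothetical names introduced here, and the tree carries only tag, HTML attributes, and text; font and position sub-attributes would require a rendering engine.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One simplified DOM node; `tag` and `text` stand in for label and text sub-attributes."""
    tag: str
    attrs: dict
    text: str = ""
    children: list = field(default_factory=list)


class DomBuilder(HTMLParser):
    """Builds a simplified tree from HTML; void tags such as <br> become leaf nodes."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document", {})
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self._stack[-1].children.append(node)
        if tag not in self.VOID:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if one is on the stack.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if data.strip():
            self._stack[-1].text += data.strip()


def parse_dom(html: str) -> Node:
    """Parse an HTML string and return the root of the simplified tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node: Node, tag: str):
    """Yield every descendant of `node` with the given tag, in document order."""
    for child in node.children:
        if child.tag == tag:
            yield child
        yield from find_all(child, tag)
```

Calling `parse_dom(page)` and then `find_all(root, "h1")` would, for instance, give the nodes that a later step could treat as title candidates.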

Embodiment 2

[0099] Embodiment 2 is a preferred embodiment of the present invention, which can extract more comprehensive information according to the positional relationship between every two types of information attributes in the webpage, so as to obtain more accurate information.

[0100] Of course, the embodiment of the present invention can also extract information attribute combinations that meet the positional relationship only according to the positional relationship between certain two types of information attributes, and output the extracted information attribute combinations as information.
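The single-pair case in [0100] can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Candidate` class, the bounding-box fields, and the relationship labels (`"above"`, `"below"`, `"left_of"`, `"right_of"`, `"overlapping"`) are hypothetical, since the excerpt does not say how positional relationships are encoded.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    """A candidate attribute value with its rendered box (pixels, y grows downward)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def relation(a: Candidate, b: Candidate) -> str:
    """Coarse position of `a` relative to `b`; the labels are illustrative only."""
    if a.top + a.height <= b.top:
        return "above"
    if b.top + b.height <= a.top:
        return "below"
    if a.left + a.width <= b.left:
        return "left_of"
    if b.left + b.width <= a.left:
        return "right_of"
    return "overlapping"


def extract_pairs(cands_a, cands_b, wanted: str):
    """Return every (a, b) combination whose positional relationship equals `wanted`."""
    return [(a, b) for a, b in product(cands_a, cands_b) if relation(a, b) == wanted]
```

With a title candidate set and a release-time candidate set, `extract_pairs(titles, times, "above")` keeps only the combinations in which the title is rendered above the time.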

[0101] Based on the above method for extracting information from webpages, a device for extracting information from webpages can be constructed. As shown in Figure 6, the device includes an acquisition unit 100, a determination unit 200, and an extraction unit 300.

[0102] The acquisition unit 100 is configured to search for each type of information attribute in the webpage, and obtain a candida...


Abstract

The invention discloses a method and a device for extracting information from a webpage, which solve the prior-art problem that accurate information cannot be acquired from a webpage in an unstructured format. The method comprises the following steps: searching for each information attribute in the webpage and acquiring an information attribute candidate set corresponding to each information attribute; looking up, according to a stored correspondence between position relationships and layout relationship probabilities among the information attributes, at least one maximum layout relationship probability between at least two information attributes, and determining the position relationship corresponding to that maximum layout relationship probability; and extracting, from the information attribute candidate sets corresponding to the at least two information attributes, an information attribute combination that satisfies the position relationship.
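The lookup-and-extract steps of the abstract can be sketched in Python as below. Everything concrete here is an assumption made for illustration: the `LAYOUT_PROBS` table and its probability values are invented, candidates are reduced to `(text, y)` tuples, and `vertical_relation` compares vertical offsets only. The patent excerpt does not specify how the stored correspondence is built or represented.

```python
from itertools import product

# Hypothetical stored correspondence: for each attribute pair, the probability
# of each positional relationship (values invented for illustration).
LAYOUT_PROBS = {
    ("title", "time"): {"above": 0.85, "below": 0.05, "same_row": 0.10},
    ("title", "body"): {"above": 0.95, "below": 0.05},
}


def most_likely_relation(pair, table=LAYOUT_PROBS):
    """Return (relation, probability) with the maximum probability for `pair`."""
    probs = table[pair]
    best = max(probs, key=probs.get)
    return best, probs[best]


def vertical_relation(a, b):
    """Toy relation on (text, y) candidates: compares vertical offsets only."""
    if a[1] < b[1]:
        return "above"
    if a[1] > b[1]:
        return "below"
    return "same_row"


def extract(candidate_sets, pair, relation_fn=vertical_relation, table=LAYOUT_PROBS):
    """Keep the combinations from the two candidate sets that satisfy the best relation."""
    wanted, _ = most_likely_relation(pair, table)
    first, second = pair
    return [
        (a, b)
        for a, b in product(candidate_sets[first], candidate_sets[second])
        if relation_fn(a, b) == wanted
    ]
```

Passing the relation function and the table as parameters keeps the sketch independent of any particular geometry or training procedure, so either can be swapped for a richer model.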

Description

Technical field

[0001] The invention relates to the technical field of information retrieval and data integration, and in particular to a method and device for extracting information from webpages.

Background technique

[0002] Since its birth in the early 1990s, the Web has developed at an astonishing speed. It has become the largest information warehouse in the world, covering all fields of the real world, and is now the main way people obtain information in their work and daily life. Web information is published mainly in the form of web pages. According to the latest estimates, the number of web pages on the Web has exceeded 550 billion.

[0003] It can be seen that although web pages are very important data sources of information, due to the large number of websites in the web, and the web pages where these information are located usually contain a lot of useless noise information, which seriously affec...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F17/30

Inventor: 刘伟, 万小军, 杨建武, 肖建国

Owner: PEKING UNIV
