Webpage extraction method based on attribute reproduction and labeled path

A technique based on labeled paths and attribute reproduction, applied in fields such as special data processing applications, instruments, and electrical digital data processing, that addresses problems such as insufficient entity reproduction and ineffective template abstraction.

Inactive Publication Date: 2012-10-31

NAT UNIV OF DEFENSE TECH

Cites: 1. Cited by: 18.

Problems solved by technology

[0007] The problem to be solved by the present invention is to propose, on the basis of existing web page extraction technology, a more effective and more general web page information extraction method. Existing techniques run into problems such as insufficient entity reproduction and ineffective template abstraction; the proposed method instead extracts web pages based on attribute reproduction and labeled paths.

Embodiment Construction

[0022] Figure 1 shows the flowchart of the open source software acquisition and search system and method, which is implemented in the following steps:

[0023] Step 1. Build a seed set. An attribute value seed set, containing some of the values of the target attribute, is constructed by extracting the attribute value list pages of the target website or of other websites in the same field.

[0024] More and more websites support exploratory search, which combines querying and browsing during web search. Compared with typical keyword search, exploratory search gives users hierarchical, multi-dimensional browsing options; for users without a clear search goal in particular, it offers a way to refine their needs while searching. On an exploratory search page, the attributes of the searched entity are usually displayed as a list of hyperlinks, as shown in Figure 1, the attrib...
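Step 1 can be illustrated with a minimal sketch, assuming the attribute values on a list page appear as the visible text of hyperlinks, as described above. The class name `LinkTextParser`, the function `build_seed_set`, and the sample markup are hypothetical and are not taken from the patent; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the visible text of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.buffer = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_link:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            text = "".join(self.buffer).strip()
            if text:  # skip links that carry no visible text
                self.link_texts.append(text)
            self.in_link = False


def build_seed_set(list_page_html):
    """Return the set of hyperlink texts, used here as attribute value seeds."""
    parser = LinkTextParser()
    parser.feed(list_page_html)
    parser.close()
    return set(parser.link_texts)


# Hypothetical attribute value list page for a "programming language" attribute.
SAMPLE_LIST_PAGE = (
    '<ul><li><a href="/lang/java">Java</a></li>'
    '<li><a href="/lang/python"> Python </a></li>'
    '<li><a href="/lang/empty"></a></li></ul>'
)

seeds = build_seed_set(SAMPLE_LIST_PAGE)
```

A real list page would also contain navigation links, so in practice the link set would need to be narrowed to the region of the page that holds the attribute list; that narrowing is omitted here.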

Abstract

The invention discloses a web page extraction method based on attribute reproduction and labeled paths. The method comprises the following steps: constructing an attribute value seed set, which contains some of the values of a target attribute, by extracting attribute value list pages of the target website or of other websites in the same field; acquiring some sample pages and determining, for each attribute, the relative labeled path between the attribute name and the attribute value; downloading some pages to construct a training sample base and storing the acquired code in a local database; querying and labeling all reproductions of each seed attribute value in the training web pages and recording the labeled path corresponding to each reproduction; for each attribute, taking the labeled path with the highest support across web pages as the extraction rule for web pages other than the training samples; using the acquired labeled path to access the HTML (Hypertext Markup Language) trees of other web pages in the target website, locating the label that holds the attribute value, and extracting its text string; and deleting attribute values that have no attribute name or an incorrect attribute name and storing the correct attribute values in the local database, thereby completing the extraction of page attribute values.
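The rule induction and extraction steps in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: a labeled path is approximated by the slash-joined chain of tag names from the root to a text node, support is counted as the number of training pages on which a path holds a seed value, and the attribute name check assumes the name is the text node immediately preceding the value. All function names and sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathTextParser(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void elements never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:  # pop back to the matching open tag
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def induce_rule(training_pages, seed_values):
    """Return the path that holds a seed value on the most training pages."""
    support = Counter()
    for html in training_pages:
        # Count each path at most once per page.
        support.update({p for p, t in text_nodes(html) if t in seed_values})
    return support.most_common(1)[0][0] if support else None


def extract(html, rule_path, attribute_name):
    """Extract texts at rule_path whose preceding text node is the attribute name."""
    nodes = text_nodes(html)
    values = []
    for i in range(1, len(nodes)):
        path, text = nodes[i]
        label = nodes[i - 1][1].rstrip(":").strip().lower()
        if path == rule_path and label == attribute_name.lower():
            values.append(text)
    return values


# Hypothetical training pages; the <p>GPL</p> on the first page is a noisy hit.
TRAIN_PAGES = [
    "<html><body><h1>ProjA</h1><p>GPL</p><table>"
    "<tr><th>License</th><td>GPL</td></tr>"
    "<tr><th>Language</th><td>Java</td></tr></table></body></html>",
    "<html><body><h1>ProjB</h1><table>"
    "<tr><th>License</th><td>MIT</td></tr>"
    "<tr><th>Language</th><td>C</td></tr></table></body></html>",
]
NEW_PAGE = (
    "<html><body><h1>ProjC</h1><table>"
    "<tr><th>License:</th><td>Apache-2.0</td></tr>"
    "<tr><th>Language</th><td>C</td></tr></table></body></html>"
)

rule = induce_rule(TRAIN_PAGES, {"GPL", "MIT"})
license_values = extract(NEW_PAGE, rule, "License")
```

Voting by page-level support means a stray occurrence of a seed value elsewhere on one page does not displace the path shared by most pages, and the attribute name check then discards other values that happen to share the winning path.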

Description

Technical field

[0001] The invention relates to a web page extraction method based on attribute recurrence and label paths. In particular, it is a template detection and extraction method, different from traditional web page extraction methods based on entity recurrence, for websites such as open source communities where entities recur rarely and attributes recur often.

Background

[0002] One of the key roles of the Internet is data presentation. It contains information composed of entities in various fields. Here, an entity is an object instance in a website's data model and often corresponds to a web page, such as an electronic product or an open source project. Extracting this type of information is valuable for building web applications such as comparison shopping and vertical search engines.

[0003] Different websites in the same domain often have the same data. For example, a user can find information about an iPod on apple.com that also appears on ...

Application Information

IPC(8): G06F17/30

Inventor: 尹刚, 王怀民, 李翔, 朱沿旭, 史殿习, 王涛, 袁霖, 余跃

Owner: NAT UNIV OF DEFENSE TECH
