Webpage extraction method based on attribute reproduction and labeled path

The technology, based on label paths and attribute reproduction, is applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as insufficient entity reproduction and ineffective template abstraction.

Inactive Publication Date: 2012-10-31
NAT UNIV OF DEFENSE TECH
Cites: 1; Cited by: 18

AI Technical Summary

Problems solved by technology

[0007] Existing web page extraction techniques suffer from problems such as insufficient entity recurrence and ineffective template abstraction. The problem to be solved by the present invention is to provide a more effective and more general web page information extraction method that builds on these techniques, namely web page extraction based on attribute reproduction and label paths.

Examples

Embodiment Construction

[0022] Figure 1 is a flowchart of the implementation of the open source software acquisition and search system and method. The specific steps are as follows:

[0023] Step 1. Build a seed set. An attribute value seed set, which contains some of the values of the target attribute, is constructed by extracting the attribute value list pages of the target website or of other websites in the same field.
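
A minimal sketch of Step 1, assuming that an attribute value list page presents each value as the text of a hyperlink. The class `LinkTextCollector`, the function `build_seed_set`, and the sample page are hypothetical names and data introduced for illustration; they do not come from the patent.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._buffer).strip()
            if text:
                self.texts.append(text)
            self._in_link = False


def build_seed_set(list_page_html):
    """Return the set of link texts found on an attribute value list page."""
    parser = LinkTextCollector()
    parser.feed(list_page_html)
    parser.close()
    return set(parser.texts)


# Hypothetical list page for a "programming language" attribute.
SAMPLE_LIST_PAGE = """
<ul>
  <li><a href="/lang/java">Java</a></li>
  <li><a href="/lang/python">Python</a></li>
  <li><a href="/lang/c">C</a></li>
</ul>
"""

seeds = build_seed_set(SAMPLE_LIST_PAGE)
```

A real list page would also contain navigation links, so in practice the collector would likely need to be restricted to the region of the page that holds the attribute list.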

[0024] More and more websites support exploratory search, which combines querying and browsing during web search. Compared with typical keyword search, exploratory search offers users hierarchical, multi-dimensional browsing options. For users who do not have a clear search goal in particular, it provides a way to search while clarifying their needs. On an exploratory search page, the attributes of the searched entity are usually displayed as lists of hyperlinks; as shown in Figure 1, the attrib...

Abstract

The invention discloses a web page extraction method based on attribute reproduction and labeled paths. The method comprises the following steps: constructing an attribute value seed set, which contains some of the values of a target attribute, by extracting attribute value list pages of the target website or of other websites in the same field; acquiring some sample pages and determining, for each attribute, the relative labeled path between the attribute name and the attribute value; downloading some pages to construct a training sample base and storing the acquired code in a local database; finding and labeling all reproductions of each seed attribute value in the training pages and recording the labeled path corresponding to each reproduction; taking the labeled path with the highest support across pages for the same attribute as the extraction rule for pages other than the training samples; using the acquired labeled path to traverse the HTML (Hypertext Markup Language) trees of other pages in the target website, locating the label (tag) that holds the attribute value, and extracting its text string; and deleting attribute values that have no attribute name or an incorrect attribute name and storing the correct attribute values in the local database, thereby completing the extraction of page attribute values.
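
The labeling, voting, and extraction steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: it uses absolute tag paths with sibling indices (for example `tr[1]/td[2]`) in place of the relative path between attribute name and value, it matches seed values by exact string equality, and it assumes well-formed HTML. All names (`PathTextParser`, `learn_path`, `extract`, `extract_checked`, `page`) and the sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Record (label path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []                  # path steps such as "td[2]"
        self._child_counts = [Counter()]  # sibling tag counts per open element
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self._child_counts[-1]
        counts[tag] += 1
        self._stack.append(f"{tag}[{counts[tag]}]")
        self._child_counts.append(Counter())

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1].startswith(tag + "["):
            self._stack.pop()
            self._child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self._stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_path(training_pages, seeds):
    """Return the label path on which seed values recur in the most pages."""
    support = Counter()
    for html in training_pages:
        # Count each path at most once per page.
        support.update({p for p, text in text_nodes(html) if text in seeds})
    return support.most_common(1)[0][0] if support else None


def extract(html, path):
    """Return the text found at the given label path."""
    return [text for p, text in text_nodes(html) if p == path]


def extract_checked(html, value_path, name_path, attr_name):
    """Discard values when the expected attribute name is not at name_path."""
    if attr_name not in extract(html, name_path):
        return []
    return extract(html, value_path)


def page(name, value, note, label="Language"):
    # Hypothetical project page layout shared by all pages of one site.
    return (f"<html><body><h1>{name}</h1><p>{note}</p><table>"
            f"<tr><td>{label}</td><td>{value}</td></tr>"
            "</table></body></html>")


seeds = {"Java", "Python", "C"}
training = [page("ProjA", "Java", "C"),  # the note is a noisy seed match
            page("ProjB", "Python", "A parser"),
            page("ProjC", "C", "A fast tool")]
rule = learn_path(training, seeds)
name_path = rule.rsplit("/", 1)[0] + "/td[1]"  # assumed position of the name
values = extract_checked(page("ProjD", "Rust", "New"), rule, name_path, "Language")
```

In this sketch the noisy match in the `<p>` element gets support from one page only, while the table cell gets support from all three, so the table cell path is chosen as the extraction rule. The new page yields a value, `Rust`, that was never in the seed set.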

Description

Technical field

[0001] The invention relates to a web page extraction method based on attribute recurrence and label paths. It is a template detection and extraction method that differs from traditional web page extraction based on entity recurrence, and it is aimed in particular at websites such as open source communities, where entities seldom recur but attributes often do.

Background

[0002] One of the key roles of the Internet is data presentation. The Internet contains information about entities in many fields. Here, an entity is an object instance in a website's data model and often corresponds to one web page, such as an electronic product or an open source project. Extracting this type of information is valuable for building web applications such as comparison shopping and vertical search engines.

[0003] Different websites in the same domain often have the same data. For example, a user can find information about an iPod on apple.com that also appears on ...

Application Information

IPC(8): G06F17/30
Inventors: 尹刚, 王怀民, 李翔, 朱沿旭, 史殿习, 王涛, 袁霖, 余跃
Owner: NAT UNIV OF DEFENSE TECH