Method for automatically building classification tree from semi-structured data of Wikipedia

A technique based on Wikipedia semi-structured data, applied in the field of knowledge acquisition, that addresses problems such as unrecognizable relationships.

Inactive Publication Date: 2014-05-07
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

The method described in this patent relies on the domain knowledge base, and can only process numerical tables


Examples


Embodiment Construction

[0062] The present invention is further described below with reference to the accompanying drawings and examples.

[0063] As shown in Figure 1, the method of the present invention for automatically constructing a classification tree from Wikipedia semi-structured data is divided into the following 3 processes:

[0064] Step 1: Semi-structured data extraction, including 2 steps.

[0065] Step 1.1: Starting from the homepage of the Wikipedia website, www.wikipedia.org, crawl all pages layer by layer by analyzing the hyperlinks on each page. Identify entry pages by the URL prefix "http://en.wikipedia.org/wiki/" and catalog pages by the URL prefix "http://en.wikipedia.org/wiki/Category:". Each page corresponds to an entity, and the page title is the name of the entity;
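Step 1.1 can be illustrated with a minimal sketch, assuming a breadth-first ("layer by layer") traversal and a caller-supplied link extractor. The names `classify_page`, `crawl`, `get_links`, and `max_pages` are hypothetical and do not come from the patent; only the two URL prefixes are taken from the text.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Optional

# URL prefixes quoted in Step 1.1.
ENTRY_PREFIX = "http://en.wikipedia.org/wiki/"
CATALOG_PREFIX = "http://en.wikipedia.org/wiki/Category:"


def classify_page(url: str) -> Optional[str]:
    """Return "catalog", "entry", or None for pages of neither kind."""
    # The catalog prefix extends the entry prefix, so it is tested first.
    if url.startswith(CATALOG_PREFIX):
        return "catalog"
    if url.startswith(ENTRY_PREFIX):
        return "entry"
    return None


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> Dict[str, str]:
    """Breadth-first crawl that records the kind of each recognized page.

    get_links is a hypothetical callable returning the hyperlinks found on
    a page; a real crawler would download and parse the page there.
    """
    seen = {start_url}
    queue = deque([start_url])
    kinds: Dict[str, str] = {}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        kind = classify_page(url)
        if kind is not None:
            kinds[url] = kind
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kinds
```

Passing the link extractor as a parameter keeps the traversal logic separate from network access, so the sketch can be exercised on an in-memory link map.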

[0066] Step 1.2: According to whether the entry page contains the HTML tag , filter out the entry pages containing the navigation table.
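The filtering in Step 1.2 might be sketched as follows. The specific HTML tag is missing from the text above, so the tag name is left as a parameter here; the value `"table"` in the usage note is only a placeholder, not the tag the patent names. `contains_tag` and `pages_with_tag` are hypothetical helper names.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _TagFinder(HTMLParser):
    """Sets self.found when a start tag with the given name is seen."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        # HTMLParser reports tag names in lower case.
        self.tag = tag.lower()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found = True


def contains_tag(html: str, tag: str) -> bool:
    """Return True if the HTML source contains at least one <tag> element."""
    finder = _TagFinder(tag)
    finder.feed(html)
    return finder.found


def pages_with_tag(pages: Dict[str, str], tag: str) -> List[str]:
    """Keep the URLs of entry pages whose HTML contains the marker tag."""
    return [url for url, html in pages.items() if contains_tag(html, tag)]
```

For example, `pages_with_tag({"a": "<table></table>", "b": "<p>text</p>"}, "table")` keeps only page `"a"`.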

[0067] The flow of these steps is as fol...


Abstract

The invention discloses a method for automatically building a classification tree from the semi-structured data of Wikipedia. The method comprises the steps: (1) extracting the semi-structured data, to be specific, acquiring the HTML of a page by analysis, and identifying the page containing the semi-structured data; (2) extracting a hyponymy relationship among the semi-structured data, to be specific, acquiring the hyponymy relationship contained in a Wikipedia catalog page according to the layout characteristics of the Wikipedia catalog page, analyzing an HTML element, and acquiring the hyponymy relationship contained in a navigation table according to the structure of the navigation table; (3) integrating the hyponymy relationships from different semi-structured data, to be specific, building a simple directed and unweighted graph according to the extracted hyponymy relationship set, and then generating a classification tree based on the depth-first traversal algorithm of the simple directed and unweighted graph. The method can automatically extract the hyponymy relationship in Wikipedia pages, and build the classification tree, thereby reducing the building cost by experts in the domain, and fully reusing the hyponymy relationship manually built by volunteers.
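Step (3) of the abstract can be illustrated with a rough sketch. The abstract does not give the details of the traversal, so the following assumes that each relation is a (hypernym, hyponym) pair, that self-loops and duplicate edges are dropped to keep the graph simple, and that each node is attached to the first parent that reaches it during the depth-first traversal. The names `build_graph` and `dfs_tree` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Graph = Dict[str, List[str]]


def build_graph(relations: Iterable[Tuple[str, str]]) -> Graph:
    """Build a simple directed, unweighted graph from (hypernym, hyponym) pairs."""
    graph: Graph = {}
    for hypernym, hyponym in relations:
        graph.setdefault(hypernym, [])
        graph.setdefault(hyponym, [])
        # A simple graph has no self-loops and no parallel edges.
        if hypernym != hyponym and hyponym not in graph[hypernym]:
            graph[hypernym].append(hyponym)
    return graph


def dfs_tree(graph: Graph, root: str) -> Graph:
    """Return the depth-first spanning tree reachable from root.

    Each node keeps only the first parent that discovers it, so nodes with
    several hypernyms and cycles in the graph cannot break the tree shape.
    """
    tree: Graph = {root: []}
    # Each stack frame holds a node and an iterator over its unvisited edges.
    stack = [(root, iter(graph.get(root, [])))]
    while stack:
        node, children = stack[-1]
        for child in children:
            if child not in tree:
                tree[node].append(child)
                tree[child] = []
                stack.append((child, iter(graph.get(child, []))))
                break
        else:
            # All outgoing edges of this node have been examined.
            stack.pop()
    return tree
```

The traversal is written iteratively rather than recursively so that deep hierarchies do not hit Python's recursion limit. How the actual method chooses among multiple hypernyms for one node is not stated in the abstract; first-discovery is only one possible policy.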

Description

Technical field

[0001] The invention relates to the technical field of knowledge acquisition, in particular to a method for automatically constructing a classification tree using Wikipedia semi-structured data.

Background technique

[0002] The Internet accelerates the process of information digitization, and the information on it increases exponentially. At present, digital information has shown the development trend of huge quantity, various types, and rapid update. The number of web pages indexed by the famous web search engine Google has reached 50 billion. The information age has brought massive amounts of digital texts, and the increasing accumulation of data has made it increasingly difficult to obtain information.

[0003] A huge number of pages contain human-edited semi-structured data, which are scattered on different pages, making it impossible for people to quickly and accurately find these useful semi-structured information from a large number of pages.

[00...


Application Information

IPC(8): G06F17/30
CPC: G06F16/8373
Inventor: 刘均, 魏笔凡, 冯博琴, 郑庆华, 马健, 王晨晨, 吴蓓
Owner XI AN JIAOTONG UNIV