Method for automatically building classification tree from semi-structured data of Wikipedia

A technique based on semi-structured data and Wikipedia, applied in the field of knowledge acquisition, that addresses problems such as unrecognizable relationships between entities.

Inactive Publication Date: 2014-05-07

XI AN JIAOTONG UNIV

7 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

The method described in this patent relies on a domain knowledge base, can only process numerical tables, and cannot recognize the entities represented by strings in the tables or the relationships between those entities.

Embodiment Construction

[0062] The present invention is further described below with reference to the accompanying drawings and embodiments.

[0063] As shown in Figure 1, the method of the present invention for automatically constructing a classification tree from Wikipedia semi-structured data is divided into the following 3 processes:

[0064] Step 1: Semi-structured data extraction, including 2 steps.

[0065] Step 1.1: Starting from the homepage of the Wikipedia website, www.wikipedia.org, crawl all pages layer by layer by analyzing the hyperlinks on each page. Identify entry pages by the URL prefix "http://en.wikipedia.org/wiki/" and catalog pages by the URL prefix "http://en.wikipedia.org/wiki/Category:". Each page corresponds to an entity, and the page title is the name of the entity;

[0066] Step 1.2: According to whether the entry page contains the HTML tag , filter out the entry pages containing the navigation table.
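Steps 1.1 and 1.2 can be illustrated with a minimal Python sketch. It only shows the page classification by URL prefix and the navigation-table check; the crawler itself is omitted. The names `classify_url`, `NavTableDetector` and `has_navigation_table` are hypothetical. The HTML tag used in Step 1.2 is not legible in the text above, so the sketch assumes, purely for illustration, that a navigation table is a `<table>` element whose class list contains `navbox`.

```python
from html.parser import HTMLParser
from typing import Optional

# URL prefixes quoted in Step 1.1.
ENTRY_PREFIX = "http://en.wikipedia.org/wiki/"
CATEGORY_PREFIX = "http://en.wikipedia.org/wiki/Category:"


def classify_url(url: str) -> Optional[str]:
    """Return 'category', 'entry', or None, based only on the URL prefix."""
    # The catalog prefix is tested first because it extends the entry prefix.
    if url.startswith(CATEGORY_PREFIX):
        return "category"
    if url.startswith(ENTRY_PREFIX):
        return "entry"
    return None


class NavTableDetector(HTMLParser):
    """Sets `found` when a <table> carrying the assumed marker class is seen."""

    def __init__(self, marker_class: str = "navbox") -> None:
        super().__init__()
        # Assumption: the marker class is not taken from the patent text.
        self.marker_class = marker_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.marker_class in classes:
            self.found = True


def has_navigation_table(html: str) -> bool:
    """Step 1.2 check: does the entry page contain a navigation table?"""
    detector = NavTableDetector()
    detector.feed(html)
    return detector.found
```

A prefix test alone also labels pages in other namespaces (for example talk pages) as entry pages; a real crawler would likely need extra filtering, which the text above does not describe.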

[0067] The flow of these steps is as fol...

Abstract

The invention discloses a method for automatically building a classification tree from the semi-structured data of Wikipedia. The method comprises the steps: (1) extracting the semi-structured data, to be specific, acquiring the HTML of a page by analysis, and identifying the page containing the semi-structured data; (2) extracting a hyponymy relationship among the semi-structured data, to be specific, acquiring the hyponymy relationship contained in a Wikipedia catalog page according to the layout characteristics of the Wikipedia catalog page, analyzing an HTML element, and acquiring the hyponymy relationship contained in a navigation table according to the structure of the navigation table; (3) integrating the hyponymy relationships from different semi-structured data, to be specific, building a simple directed and unweighted graph according to the extracted hyponymy relationship set, and then generating a classification tree based on the depth-first traversal algorithm of the simple directed and unweighted graph. The method can automatically extract the hyponymy relationship in Wikipedia pages, and build the classification tree, thereby reducing the building cost by experts in the domain, and fully reusing the hyponymy relationship manually built by volunteers.
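Process (3) of the abstract can be sketched roughly as follows. The abstract does not say how the root is chosen or how cycles and nodes with several hypernyms are resolved, so this sketch makes assumptions: the root is supplied by the caller, each relation is a `(hypernym, hyponym)` pair, and a node stays under the first parent that reaches it during the depth-first traversal. The names `build_graph` and `build_tree` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Graph = Dict[str, List[str]]


def build_graph(relations: Iterable[Tuple[str, str]]) -> Graph:
    """Build a simple directed, unweighted graph with edges hypernym -> hyponym."""
    graph: Graph = defaultdict(list)
    for hypernym, hyponym in relations:
        # "Simple" graph: no self-loops and no duplicate edges.
        if hypernym != hyponym and hyponym not in graph[hypernym]:
            graph[hypernym].append(hyponym)
    return graph


def build_tree(graph: Graph, root: str) -> Graph:
    """Depth-first traversal from `root`, returning a parent -> children map.

    Assumption: a node already placed in the tree is skipped, which breaks
    cycles and keeps each node under the first parent that reaches it.
    """
    tree: Graph = {root: []}
    stack = [(root, iter(graph.get(root, [])))]
    while stack:
        node, children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()  # all children of `node` have been explored
        elif child not in tree:
            tree[child] = []
            tree[node].append(child)
            stack.append((child, iter(graph.get(child, []))))
    return tree
```

Calling `build_tree(build_graph(pairs), root)` on the extracted relation set yields a map from each node to its children. Only nodes reachable from the chosen root appear in the result.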

Description

Technical field

[0001] The invention relates to the technical field of knowledge acquisition, in particular to a method for automatically constructing a classification tree using Wikipedia semi-structured data.

Background art

[0002] The Internet accelerates the digitization of information, and the information on it increases exponentially. At present, digital information shows a trend toward huge quantity, varied types, and rapid updates. The number of web pages indexed by the well-known web search engine Google has reached 50 billion. The information age has brought massive amounts of digital text, and the growing accumulation of data has made it increasingly difficult to obtain information.

[0003] A huge number of pages contain human-edited semi-structured data scattered across different pages, making it impossible for people to quickly and accurately find this useful semi-structured information among a large number of pages.

[00...

Application Information

IPC(8): G06F17/30

CPC: G06F16/8373

Inventor: 刘均, 魏笔凡, 冯博琴, 郑庆华, 马健, 王晨晨, 吴蓓

Owner: XI AN JIAOTONG UNIV
