
Semi-supervised mass data hierarchy classification method

A mass-data classification technology, applied to electrical digital data processing, special data processing applications, instruments, and similar fields. It addresses the high time and cost of manual labeling and the inability to build classification models over massive text data, and achieves the effect of expanding the training-set scale.

Status: Inactive
Publication Date: 2010-10-27
Inventor/Owner: 罗彤
Cites: 0 · Cited by: 21

AI Technical Summary

Problems solved by technology

[0005] 1. When the hierarchical structure is huge, a large amount of manual labeling is required to provide the training set that lets the classifiers reach the required accuracy, and the time and cost of this manual labeling are very high.
[0006] 2. Training high-precision text classifiers (regularized linear classifiers, including support vector machines) requires a great deal of running time, making it impossible to build a classification model over massive text data.




Embodiment Construction

[0036] A semi-supervised hierarchical classification method for massive data. It uses semi-supervised learning to reduce the workload of manually labeling the training set, and proposes a stochastic incremental method for training regularized linear classifiers, so that the classifiers can be trained on massive text data and produce high-precision classification models.

[0037] The basic idea of the present invention is to set up a classifier at each non-root node of the hierarchy, which assigns the webpages flowing through its parent node to that node's child nodes; semi-supervised expansion of the training set is used to improve the classification effect; during training, the stochastic gradient descent method traverses the massive training set multiple times, reducing the computational complexity to O(N) and thus making training on large-scale data sets feasible (a training sketch follows the steps below). The classification steps of this hierarchical classifier are as follows:

[0...
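The per-node training and top-down routing described in [0037] can be made concrete with a short sketch. This is a minimal illustration rather than the patent's reference implementation: it assumes an L2-regularized hinge loss (a linear SVM) trained with a Pegasos-style stochastic gradient descent step, and a simple positive-score rule for routing a page from a node down to its children; the names sgd_train_linear, Node and route are hypothetical.

```python
import numpy as np

def sgd_train_linear(X, y, lam=1e-4, epochs=3, seed=0):
    """Train an L2-regularized linear classifier (hinge loss) with SGD.

    X: (n_samples, n_features) feature matrix; y: labels in {-1, +1}.
    Cost is O(epochs * n_samples * n_features), i.e. linear in the data size.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size (assumption)
            margin = y[i] * (X[i] @ w)
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            w -= eta * grad
    return w

class Node:
    """A node of the topic hierarchy; non-root nodes carry a weight vector."""
    def __init__(self, name, w=None):
        self.name, self.w, self.children = name, w, []

def route(page_vec, node):
    """Route a page downward, following children whose classifier scores it
    positively (hypothetical decision rule); return the labels reached."""
    hits = [c for c in node.children if c.w is not None and page_vec @ c.w > 0]
    if not hits:
        return [node.name]                        # stop here if no child accepts the page
    labels = []
    for child in hits:
        labels.extend(route(page_vec, child))
    return labels
```

Each epoch touches every training example exactly once, so the overall cost stays linear in the number of pages, which is what makes training on a massive webpage base tractable.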



Abstract

Aiming at the long manual-labeling time and high cost of a mass-data hierarchical classifier, the invention provides a semi-supervised mass data hierarchy classification method comprising the following steps:
A. carrying out feature extraction on the webpages in a webpage base;
B. automatically generating the training set of each leaf node of the existing hierarchical classification ontology by using a rule set and expansion rules;
C. clustering the webpages of each existing leaf node, classifying the unlabelled set on the basis of the clustering, and adding examples from the unlabelled set that are similar to the training set to the training set of the corresponding leaf node, thereby expanding the scale of the training set;
D. training the regularized linear classifier of each node with stochastic gradient descent;
E. stopping if the result of the classifier meets the stopping conditions; otherwise entering step F;
F. classifying the unlabelled set with the hierarchical classifier established in steps C and D, adding the webpages classified with high confidence to the training set, and returning to step C to repeat steps C to F.
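As a reading aid, one round of steps C through F can be sketched as below. This is a hedged illustration rather than the claimed method: the centroid cosine-similarity expansion rule, the one-vs-rest training, and the similarity_threshold / confidence_threshold parameters are assumptions standing in for the patent's clustering and confidence criteria.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def self_training_round(labeled, unlabeled, train_fn,
                        similarity_threshold=0.8, confidence_threshold=1.0):
    """One round of steps C-F over a flat set of leaf nodes (illustrative).

    labeled:   dict leaf_name -> list of feature vectors (the training set)
    unlabeled: list of feature vectors (the unlabelled set)
    train_fn:  callable(X, y) -> weight vector, e.g. an SGD-trained linear model
    """
    # Step C: expand each leaf's training set with nearby unlabeled pages
    centroids = {leaf: np.mean(vecs, axis=0) for leaf, vecs in labeled.items()}
    remaining = []
    for x in unlabeled:
        best = max(centroids, key=lambda leaf: cosine(x, centroids[leaf]))
        if cosine(x, centroids[best]) >= similarity_threshold:
            labeled[best].append(x)
        else:
            remaining.append(x)

    # Step D: train a one-vs-rest regularized linear classifier per leaf
    leaves = sorted(labeled)
    X = np.array([v for leaf in leaves for v in labeled[leaf]])
    models = {}
    for leaf in leaves:
        y = np.array([1 if l == leaf else -1 for l in leaves for _ in labeled[l]])
        models[leaf] = train_fn(X, y)

    # Step F: label the remaining pages; keep only high-confidence assignments
    still_unlabeled = []
    for x in remaining:
        scores = {leaf: x @ models[leaf] for leaf in leaves}
        best = max(scores, key=scores.get)
        if scores[best] >= confidence_threshold:
            labeled[best].append(x)
        else:
            still_unlabeled.append(x)
    return labeled, still_unlabeled, models
```

Plugging a stochastic-gradient-trained regularized linear model in as train_fn keeps every round linear in the number of pages, and the round is repeated until the stopping condition of step E is met.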

Description

Technical field
[0001] The invention relates to the fields of data mining, machine learning and natural language processing, and provides a semi-supervised hierarchical classification method for massive text data, that is, a semi-supervised mass data hierarchy classification method.
Background technique
[0002] As we enter the era of the information explosion, the Internet provides people with a great deal of knowledge and content, and this knowledge offers great help in people's daily lives. Websites such as Google, Baidu, Sogou and Youdao provide Chinese search services, and people can find relevant webpages by searching keywords. However, keyword-based retrieval often fails to return the webpages people actually need, and users must browse a large number of search results themselves before finally finding them. Therefore, semantic-based search engines have recently aroused great interest and become a hot spot in the i...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F17/30; G06N1/00
Inventor: 罗彤
Owner: 罗彤