Spark-based parallel random label subset multi-label text classification method

A multi-label text classification technique, applied to parallel random label subset classification algorithms, that addresses problems such as memory overflow, long running times, and failures that stop the job from running, with the effect of improving classification accuracy and reducing learning time.

Inactive Publication Date: 2017-06-20
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0008] The present invention addresses the shortcomings of existing multi-label classification techniques, such as memory overflow, long running times, and failures or downtime that stop the job from running when class...


Examples

Embodiment Construction

[0037] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0038] The technical solution by which the present invention solves the problems described above is as follows:

[0039] The parallelized multi-label text classification method provided by the present invention, the random label subset method on the Spark big data platform, comprises the following three processes:

[0040] 1. Construct a multi-label data set according to the characteristics of the random label subset algorithm;

[0041] To demonstrate both the efficiency of the parallel algorithm and the classification performance of the random label subset method, the text data set EUR-Lex (directory codes) was selected from the official Mulan website, from the two perspectives of the number of labels and...

Abstract

The invention provides a parallel random label subset multi-label text classification method for the Spark big data platform. First, a large-scale text data set and the configuration file are read, a distributed data set (RDD) is created, and the training and prediction data sets are cached in memory to complete initialization. Second, the required number of label subsets is randomly generated in parallel, and a new training set is derived from the original training set for each label subset. Third, the label power set transformation converts the multiple labels of each new training set into a single label, turning each data set into a single-label multi-class data set, and a base classifier is trained in parallel on each of these data sets. The single-label multi-class predictions are then converted back into multi-label results. Finally, all predictions are collected and put to a vote to obtain the final multi-label predictions for the test set. The method improves classification accuracy and substantially reduces the learning time for large-scale multi-label data.
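The steps in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the patented Spark implementation: the helper names, the 1-nearest-neighbour base classifier, and the 0.5 voting threshold are all assumptions chosen for illustration, and the per-subset loop that the abstract runs in parallel is executed sequentially here.

```python
import math
import random


def random_label_subsets(num_labels, k, m, seed=0):
    """Draw m distinct random label subsets of size k (hypothetical helper)."""
    if m > math.comb(num_labels, k):
        raise ValueError("m exceeds the number of distinct subsets")
    rng = random.Random(seed)
    subsets = set()
    while len(subsets) < m:
        subsets.add(tuple(sorted(rng.sample(range(num_labels), k))))
    return sorted(subsets)


def train_1nn(X, classes):
    """Assumed stand-in base classifier: 1-nearest neighbour just stores the data."""
    return list(zip(X, classes))


def predict_1nn(model, x):
    """Return the class of the stored sample closest to x (squared Euclidean)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda pair: dist(pair[0], x))[1]


def train_ensemble(X, Y, num_labels, k, m, seed=0):
    """Train one base classifier per random label subset.

    Label power set step: each sample's label set, restricted to the
    subset, becomes one single-label class (a sorted tuple of labels).
    """
    models = []
    for subset in random_label_subsets(num_labels, k, m, seed):
        classes = [tuple(label for label in subset if label in y) for y in Y]
        models.append((subset, train_1nn(X, classes)))
    return models


def predict(models, x, num_labels, threshold=0.5):
    """Convert each single-label prediction back to labels, then vote."""
    votes = [0] * num_labels    # models that predicted the label
    counts = [0] * num_labels   # models whose subset contains the label
    for subset, model in models:
        predicted = set(predict_1nn(model, x))
        for label in subset:
            counts[label] += 1
            votes[label] += int(label in predicted)
    return {
        label
        for label in range(num_labels)
        if counts[label] and votes[label] / counts[label] > threshold
    }


# Toy data: two documents as 2-d feature vectors, four possible labels.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [{0, 1}, {2, 3}]
models = train_ensemble(X, Y, num_labels=4, k=2, m=6)
print(sorted(predict(models, [0.9, 0.1], num_labels=4)))
```

Each model only ever sees the labels in its own subset, so a label's final score is the fraction of models covering that label that voted for it. Because the models are independent of one another, the training loop is the natural unit to distribute across a cluster.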

Description

Technical field

[0001] The invention relates to information technology, cloud computing, data mining, and text classification, and provides a parallelized random label subset multi-label text classification algorithm based on the Spark big data platform.

Background

[0002] With the development of information technology, the volume of Internet data has grown enormously, and its forms of expression have become steadily richer. Text is an important information carrier. Automatic text classification improves the efficiency of processing massive amounts of information, saves processing time, and makes the information easier for users to work with; it has received wide attention and developed rapidly in recent years. Traditional supervised learning assumes that each sample has exactly one label, which cannot accurately express the complex semantics of real-world objects. In practice, a sample may correspond to multiple related labels....

Application Information

IPC(8): G06F17/30
CPC: G06F16/35, G06F2216/03
Inventor: 王进, 王鸿, 夏翠萍, 范磊, 欧阳卫华, 陈乔松, 雷大江, 李智星, 胡峰, 邓欣
Owner: CHONGQING UNIV OF POSTS & TELECOMM