CLR multi-label data classification method based on the Spark in-memory computing big data platform

A big data platform and in-memory computing technology, applied in computing, data mining, electrical digital data processing and related fields. It addresses the problems that data cannot be processed quickly, that large amounts of historical data cannot be used in a timely and effective manner, and that model building takes a lot of time, with the effect of reducing the risk of downtime, reducing storage space, and improving time efficiency.

Active Publication Date: 2017-03-22
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0012] The present invention addresses the following defects in the prior art: after data is acquired, large amounts of historical data cannot be used effectively and in a timely manner to quickly mine useful information from it, data cannot be processed quickly, and model building takes a lot of time.



Examples


Embodiment Construction

[0032] Figure 1 is a flowchart of the CLR multi-label learning algorithm carried out on Spark according to the present invention, comprising the following steps:

[0033] (1) Data preprocessing stage

[0034] This stage includes the following steps: data acquisition, transformation of non-nominal data, missing-value compensation, and data normalization.
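The missing-value compensation and normalization steps listed above can be sketched as follows. The source does not state which compensation or normalization scheme is used, so this minimal sketch assumes column-mean imputation and min-max scaling to [0, 1]. The function names `impute_mean` and `min_max_normalize` are hypothetical, and the code is plain Python on an in-memory list rather than a Spark RDD.

```python
def impute_mean(rows):
    """Replace missing values (None) with the mean of the observed values
    in the same column. Assumption: mean imputation; the source does not
    name the compensation method."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def min_max_normalize(rows):
    """Scale every column to [0, 1]. Assumption: min-max scaling.
    Constant columns are mapped to 0.0 to avoid division by zero."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(r, lows, highs)]
            for r in rows]


if __name__ == "__main__":
    raw = [[1.0, None], [3.0, 4.0], [5.0, 8.0]]
    filled = impute_mean(raw)
    print(min_max_normalize(filled))
```

On a Spark cluster the same per-column statistics would be computed over the distributed data set first and then applied to each record; that part is not shown here.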

[0035] Data acquisition specifically includes: creating a SparkContext object. SparkContext is Spark's external interface and entry point; it provides Spark's functions to the caller, acts as a container, and is responsible for connecting to the Spark cluster. The dataset is then read with Spark's textFile(URL), a function that reads a text file into an RDD, where the URL can be the address of a local data file (for example: C:/dataset.txt) or an address on HDFS (Hadoop Distributed File System) (for example: hdfs://n1:8090/user/hdfs/dataset.txt), conver...


Abstract

The invention provides a CLR multi-label classification method based on the Spark big data platform and relates to data mining technology. Each data set is assigned to a partition according to the relationship between features and labels. Part of the data in the training set is randomly extracted as a test set, and base classifiers are built from the remaining data as the training set. The test set is used to test each classifier and adjust the parameters of its base classifiers, so that the best of the trained base classifiers is selected as the base classifier for that data set. The prediction set is then predicted using the final base classifiers. The method combines the CLR multi-label learning algorithm with Spark's efficient in-memory computing: it exploits the fact that the data sets produced by label transformation in the CLR algorithm are not correlated with one another, which reduces interference among the base classifiers, and it makes full use of the speed of the Spark computing framework so that data can be mined effectively.
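The label transformation mentioned in the abstract can be illustrated with a small sketch of calibrated label ranking as it is generally described: one binary data set per pair of labels, plus one data set per label against an artificial calibration label. This is an illustration under stated assumptions, not the patent's implementation. The names `clr_pairwise_datasets`, `clr_predict` and `VIRTUAL` are hypothetical, the tie-breaking rule against the calibration label is an assumption, and the code is plain Python rather than Spark.

```python
from itertools import combinations

VIRTUAL = "__virtual__"  # hypothetical name for the calibration label


def clr_pairwise_datasets(samples, labels):
    """Build one binary data set per label pair.

    samples: list of (features, set_of_relevant_labels).
    For a real pair (a, b), only samples where exactly one of the two
    labels is relevant are kept; the target is 1 if a is the relevant one.
    For (a, VIRTUAL), every sample is kept; the target is 1 if a is relevant.
    """
    datasets = {}
    for a, b in combinations(labels, 2):
        datasets[(a, b)] = [(x, 1 if a in y else 0)
                            for x, y in samples
                            if (a in y) != (b in y)]
    for a in labels:
        datasets[(a, VIRTUAL)] = [(x, 1 if a in y else 0)
                                  for x, y in samples]
    return datasets


def clr_predict(pair_predictions, labels):
    """Combine pairwise outputs by voting.

    pair_predictions: dict mapping (a, b) -> 1 if a is preferred, else 0.
    A label is predicted relevant when it gets strictly more votes than
    the calibration label (tie handling is an assumption).
    """
    votes = {label: 0 for label in list(labels) + [VIRTUAL]}
    for (a, b), preferred in pair_predictions.items():
        votes[a if preferred == 1 else b] += 1
    return {label for label in labels if votes[label] > votes[VIRTUAL]}


if __name__ == "__main__":
    data = [([0.1], {"a"}), ([0.2], {"a", "b"}), ([0.3], set())]
    sets = clr_pairwise_datasets(data, ["a", "b", "c"])
    print(len(sets), sets[("a", "b")])
```

Because each pairwise data set is built and trained on independently of the others, the base classifiers can be fitted in parallel, which is the property the abstract says the method exploits on Spark.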

Description

Technical field

[0001] The invention relates to the technical fields of computer information processing and data mining, and provides a multi-label data mining method based on the calibrated label ranking (CLR) algorithm on the Spark big data platform.

Background art

[0002] With the development of information technology, Internet data and resources have become massive in scale. To manage and use this information effectively, content-based information retrieval and data mining have gradually become areas of concern. As the amount of data grows, the complexity of the data labeling structure grows as well. Traditional single-label data mining can no longer meet the needs of technological development, and multi-label data mining is becoming increasingly important. Applications of the technology are also increasing, such as semantic annotation of images and videos, gene functional gr...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/16, G06F16/182, G06F16/2465, G06F2216/03
Inventor: 胡峰, 张其龙, 邓维斌, 于洪, 张清华
Owner: CHONGQING UNIV OF POSTS & TELECOMM