Three-decision unbalanced data oversampling method based on Spark big data platform

A Spark big data platform and oversampling technique, classified under instruments, character and pattern recognition, and computer components. It addresses reduced efficiency and aims to solve classification problems, improve performance, and maintain the recognition rate.

Active Publication Date: 2017-04-26
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Another feature of an RDD is that it is elastic. When the machine's memory overflows while a job is running, the RDD exchanges data with the hard disk. This reduces efficiency, but it ensures that the job still runs to completion.



Examples


Embodiment Construction

[0035] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0036] The technical solution by which the present invention solves the above technical problems is as follows:

[0037] The three-way decision imbalanced data oversampling method based on the Spark big data platform includes the following steps:

[0038] Obtain the sample set to be sampled from the system; HDFS stores it in a distributed manner automatically. Then use Spark to transform the entire sample set into a normalized sample set in LabeledPoint format. Specific steps: first create a SparkContext object, then call its textFile(URL) function to create a distributed dataset (RDD). Once created, this distributed dataset can be operated on in parallel; se...
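The loading and normalization step in [0038] can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the patent's implementation: the `label,f1,f2,...` line layout, the min-max scaling, and the names `parse_line` and `min_max_normalize` are all hypothetical, since the excerpt does not say which normalization is used. On Spark, the parsing step would typically be applied to the RDD returned by `textFile` through `map`, and each record would be a `LabeledPoint` rather than a tuple.

```python
from typing import List, Tuple

# Hypothetical stand-in for Spark's LabeledPoint: (label, [features]).
Sample = Tuple[float, List[float]]


def parse_line(line: str) -> Sample:
    """Parse 'label,f1,f2,...' (an assumed input layout) into (label, features)."""
    parts = [float(x) for x in line.strip().split(",")]
    return parts[0], parts[1:]


def min_max_normalize(samples: List[Sample]) -> List[Sample]:
    """Scale each feature column to [0, 1]; constant columns map to 0.0.

    Min-max scaling is an assumption; the source only says 'normalized'.
    """
    columns = list(zip(*(features for _, features in samples)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for label, features in samples:
        scaled = [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(features, lows, highs)
        ]
        normalized.append((label, scaled))
    return normalized


# Three made-up input lines standing in for a file read from HDFS.
raw_lines = ["1,2.0,10.0", "0,4.0,30.0", "0,6.0,20.0"]
samples = min_max_normalize([parse_line(line) for line in raw_lines])
```

Computing the column minima and maxima needs a pass over the whole data set before any record can be scaled, which is why the sketch separates parsing (per record) from normalization (global statistics first, then per record).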



Abstract

The invention discloses a three-way decision imbalanced data oversampling method based on the Spark big data platform, and relates to Spark big data technology in the field of data mining. The method comprises the following steps: first, transform the data with Spark's RDD (Resilient Distributed Dataset) to obtain a normalized sample set in the LabeledPoint format <label: [features]>, and divide the sample set into a training set and a test set; second, transform the data with Spark's RDD again, calculate the distances between samples, determine the neighborhood radius, and classify the samples of the whole training set into positive region samples, boundary region samples and negative region samples according to a neighborhood three-way decision model; then oversample the boundary region samples and the negative region samples separately; and finally, call a Spark MLlib machine learning algorithm to verify the sampling result. The method effectively addresses the classification of large-scale imbalanced data sets in the fields of machine learning and pattern recognition.
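The region partition and oversampling steps summarized above can be sketched in plain Python as follows. This is an illustration under several assumptions and not the patented procedure: the agreement-ratio rule, the thresholds `alpha` and `beta`, the fixed `radius`, and the SMOTE-style linear interpolation are all hypothetical choices, because the excerpt does not give the exact three-way decision rule, how the radius is determined, or how the two regions are oversampled differently.

```python
import math
import random
from typing import Dict, List, Tuple

Sample = Tuple[int, List[float]]  # (label, features); label 1 = minority (assumed)


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def partition(samples: List[Sample], radius: float,
              alpha: float = 0.7, beta: float = 0.3) -> Dict[str, List[int]]:
    """Assign each sample index to a region by neighbor label agreement.

    Assumed rule: agreement >= alpha is positive, <= beta is negative,
    anything in between is boundary. Samples with no neighbors count as 0.0.
    """
    regions: Dict[str, List[int]] = {"positive": [], "boundary": [], "negative": []}
    for i, (label, features) in enumerate(samples):
        neighbors = [
            other_label
            for j, (other_label, other) in enumerate(samples)
            if j != i and distance(features, other) <= radius
        ]
        agree = (
            sum(1 for n in neighbors if n == label) / len(neighbors)
            if neighbors else 0.0
        )
        if agree >= alpha:
            regions["positive"].append(i)
        elif agree <= beta:
            regions["negative"].append(i)
        else:
            regions["boundary"].append(i)
    return regions


def oversample(samples: List[Sample], indices: List[int], minority_label: int,
               per_sample: int, rng: random.Random) -> List[Sample]:
    """Create synthetic minority points by interpolating toward a random
    minority sample (SMOTE-style; an assumed strategy)."""
    minority = [f for label, f in samples if label == minority_label]
    synthetic = []
    for i in indices:
        label, features = samples[i]
        if label != minority_label:
            continue  # only minority samples are oversampled
        for _ in range(per_sample):
            partner = rng.choice(minority)
            t = rng.random()
            point = [x + t * (y - x) for x, y in zip(features, partner)]
            synthetic.append((minority_label, point))
    return synthetic


# Made-up training set: a majority cluster near the origin plus a few
# minority points.
samples = [
    (0, [0.0, 0.0]), (0, [0.1, 0.0]), (0, [0.0, 0.1]), (0, [0.1, 0.1]),
    (1, [0.15, 0.15]),
    (1, [0.8, 0.8]), (1, [0.9, 0.8]), (0, [0.8, 0.9]),
]
regions = partition(samples, radius=0.2)
targets = regions["boundary"] + regions["negative"]
synthetic = oversample(samples, targets, minority_label=1, per_sample=2,
                       rng=random.Random(0))
```

The pairwise distance loop is quadratic in the number of samples, which is the part a distributed platform such as Spark would be expected to parallelize; the sketch keeps it sequential for readability.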

Description

Technical field

[0001] The invention belongs to the fields of data mining, pattern recognition and big data processing, and specifically relates to a three-way decision imbalanced data oversampling method based on the Spark big data platform.

Background

[0002] In recent years, mobile phones have become daily necessities, and users replace them more and more frequently. On the one hand, the faster users change their mobile phones, the greater the market value and the higher the manufacturers' income, so manufacturers do everything they can to design new products that encourage users to replace their phones. On the other hand, major operators are increasingly using data mining technology to improve marketing efficiency. In actual work, the analysis of customer terminal preferences in the current communication industry is simply based on busine...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62
CPC: G06F18/217, G06F18/24133, G06F18/214
Inventor: 胡峰, 王蕾, 欧阳卫华, 于洪, 王进, 雷大江, 李智星, 瞿原, 赵蕊, 张其龙
Owner: CHONGQING UNIV OF POSTS & TELECOMM