A large data set clustering method based on MapReduce

A data set clustering technology applied in the field of big data processing. It addresses the low accuracy of traditional clustering methods, their inability to effectively mine hidden information, and computational overhead that cannot meet practical needs, and it achieves improved performance.

Status: Inactive
Publication Date: 2019-01-25
CHONGQING UNIV OF EDUCATION
Cites: 0. Cited by: 5.

AI Technical Summary

Problems solved by technology

[0005] To sum up, the problems with the existing technology are as follows: at such a huge data scale, traditional clustering methods cannot meet practical needs in terms of data storage and computing overhead, and their accuracy is low, so they cannot effectively mine reliable and useful hidden information.
[0010] (4) During iteration, the sum of squared errors of the improved algorithm does not decrease faster than that of the traditional algorithm.


Examples

Detailed Description of the Embodiments

[0033] To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the examples. The specific embodiments described here serve only to explain the present invention and do not limit it.

[0034] The application principle of the present invention will be further described below in conjunction with the accompanying drawings.

[0035] As shown in Figure 1, the MapReduce-based large-scale data set clustering method provided by the embodiment of the present invention includes the following steps:

[0036] S101: Input and format conversion of raw data. Hadoop defines three input data formats: TextInputFormat, KeyValueInputFormat, and SequenceFileInputFormat. Because the data for cluster analysis takes the form of high-dimensional vectors, SequenceFileInputFormat is selected; call the InputDriver class that comes...
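The idea behind step S101, turning raw text rows into key/value records whose values are numeric vectors, can be illustrated with a minimal plain-Python sketch. This is not the Hadoop or InputDriver API: the function name `to_vector_records`, the whitespace-separated input layout, and the use of a running row number as the key are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

# A record pairs a row id (key) with a numeric vector (value).
Record = Tuple[int, Tuple[float, ...]]


def to_vector_records(lines: Iterable[str]) -> Iterator[Record]:
    """Convert whitespace-separated text rows into (row id, vector) records.

    Hypothetical stand-in for a text-to-vector conversion step; blank rows
    are skipped and do not consume a row id.
    """
    next_id = 0
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank rows
            continue
        yield next_id, tuple(float(f) for f in fields)
        next_id += 1


# Example input: two data rows separated by a blank line.
records = list(to_vector_records(["1.0 2.0 3.0", "", "4.5 5.5 6.5"]))
```

Keeping each vector under an explicit key mirrors the key/value shape that later map and reduce steps consume.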

Abstract

The invention belongs to the technical field of big data processing and discloses a MapReduce-based clustering method for large data sets and an application thereof. The method comprises: raw data input and format conversion; Canopy partitioning and screening to obtain the initial clustering partition; K-Means iteration, using the result of Canopy clustering as the initial clustering partition; and data point allocation, which, after the K-Means iteration, yields the complete information of the k clusters. To address the problems of initial cluster center selection and the large amount of iterative computation in the traditional K-Means algorithm, an improved K-Means algorithm based on Canopy partitioning and filtering is proposed, implemented within the MapReduce framework, and studied in depth. The results show that the improved algorithm achieves a clear performance improvement in clustering accuracy and computational overhead.
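The pipeline in the abstract can be sketched in a few lines of single-process Python. This is a minimal illustration under stated assumptions, not the patented implementation: the thresholds `t1` and `t2`, all function names, and the toy data are hypothetical; the screening step is omitted because this excerpt does not specify it; and the map and reduce functions only mimic the shape of MapReduce tasks without using Hadoop.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def canopy(points: Sequence[Point], t1: float, t2: float) -> List[Tuple[Point, List[Point]]]:
    """Canopy partition with a loose threshold t1 and a tight threshold t2 (t1 > t2)."""
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every point within t1 of the center belongs to this canopy.
        members = [p for p in points if math.dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within t2 of the center can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return canopies


def kmeans_map(point: Point, centers: Sequence[Point]) -> Tuple[int, Point]:
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def kmeans_reduce(members: Sequence[Point]) -> Point:
    """Reduce step: the new center is the mean of the points sharing a key."""
    n = len(members)
    return tuple(sum(coord) / n for coord in zip(*members))


def kmeans(points: Sequence[Point], centers: Sequence[Point],
           max_iter: int = 20, tol: float = 1e-9) -> Tuple[List[Point], List[int]]:
    """Iterate map and reduce until the centers stop moving, then assign points."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            idx, q = kmeans_map(p, centers)
            groups[idx].append(q)
        # A center that attracted no points keeps its previous position.
        new_centers = [kmeans_reduce(groups[i]) if i in groups else centers[i]
                       for i in range(len(centers))]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    labels = [kmeans_map(p, centers)[0] for p in points]
    return centers, labels


def sse(points: Sequence[Point], centers: Sequence[Point], labels: Sequence[int]) -> float:
    """Sum of squared errors: squared distance of each point to its assigned center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))


# Toy data: two well-separated groups; thresholds chosen by hand for this example.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_centers = [center for center, _ in canopy(data, t1=6.0, t2=3.0)]
final_centers, labels = kmeans(data, initial_centers)
total_sse = sse(data, final_centers, labels)
```

Seeding K-Means with the Canopy centers fixes both the number of clusters and their starting positions, which is the role the abstract gives to the Canopy result.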

Description

Technical field

[0001] The invention belongs to the technical field of big data processing, and in particular relates to a clustering method for large data sets based on MapReduce.

Background

[0002] With the advent of the era of big data, in more and more application scenarios the scale of the data to be processed has expanded to the TB or even PB level, and people hope to quickly and effectively mine reliable and useful hidden information from it (AlexeyB et al. 2018). How to quickly and accurately mine valuable information from big data is therefore of great significance. As one of the core technologies in the field of data mining, cluster analysis is often used as a pre-processing step for other data mining algorithms (Treu T et al. 2018). However, in the face of such a huge data scale, traditional clustering methods cannot meet practical needs in terms of data storage and computing overhead (Efstathiou G et al. 2018).

[0003] The MapReduce computing ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/2458, G06F16/27
Inventor: 韦鹏程, 蔡银应, 邹杨, 黄思行, 张艳霞
Owner: CHONGQING UNIV OF EDUCATION