Clustering method of massive high-dimensional audio data based on central index

A clustering method for audio data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the high computational cost of k-means, shortening clustering time and reducing both computation cost and clustering cost.

Inactive Publication Date: 2017-01-04
JIANGSU WEISHI TECH

AI Technical Summary

Problems solved by technology

For massive high-dimensional data, the computational cost of k-means is very high.



Examples


Embodiment 1

[0035] Figures 1 to 4 are schematic diagrams of a simulation example of the present invention, in which the k center points are clustered into 5 partitions, i.e. m = 5. The center point and radius of each partition are calculated. For a given point P in a data set containing massive high-dimensional data, the distance between P and the center point of each partition is calculated. The partition whose center point is closest to P is the selected partition, and its center point is the selected partition center point.

[0036] Once the selected partition and its center point are obtained, the distances between the given point P and the clustering points in the selected partition are calculated. Following the k-means clustering method, these distances are then used to cluster the data points in the data set. In the em...
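The two-level lookup described in [0035] and [0036] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`build_center_index`, `assign`), the use of a few Lloyd iterations to group the k centers, and the evenly spaced seeding are all hypothetical choices. The radius is computed because the text says it is obtained for each partition, but the excerpt does not show how it is used, so the sketch does not use it during lookup.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def nearest(point, candidates):
    """Index of the candidate vector closest to point."""
    return min(range(len(candidates)), key=lambda i: dist(point, candidates[i]))

def build_center_index(centers, m, iters=10):
    """Group the k centers into at most m partitions (assumed: Lloyd iterations)."""
    k = len(centers)
    # Assumption: seed partition centers with evenly spaced picks from the k centers.
    part_centers = [list(centers[i * k // m]) for i in range(m)]
    members = [[] for _ in range(m)]
    for _ in range(iters):
        members = [[] for _ in range(m)]
        for ci, c in enumerate(centers):
            members[nearest(c, part_centers)].append(ci)
        # Recompute each partition center; keep the old one if the partition is empty.
        part_centers = [mean([centers[i] for i in ids]) if ids else pc
                        for ids, pc in zip(members, part_centers)]
    index = []
    for ids, pc in zip(members, part_centers):
        if ids:
            radius = max(dist(pc, centers[i]) for i in ids)
            index.append({"center": pc, "radius": radius, "members": ids})
    return index

def assign(point, centers, index):
    """Pick the nearest partition, then the nearest center inside it."""
    part = index[nearest(point, [p["center"] for p in index])]
    return min(part["members"], key=lambda i: dist(point, centers[i]))

# Toy data: six 2-D centers in two well-separated groups, m = 2.
centers = [(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101)]
index = build_center_index(centers, m=2)
print(assign((99, 99), centers, index))  # index of the chosen center
```

Because only the centers inside the selected partition are compared, the result is the nearest center within that partition, which may differ from the globally nearest center when P lies close to a partition boundary.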

Embodiment 2

[0038] In this embodiment, high-dimensional audio data is the research object. The implementation environment is a cluster of 14 computers, each with two dual-core chips (2.70 GHz; the CPU is an E5400) and 4 GB of memory, running a Linux operating system. The Hadoop version is 0.20.3, and all experiments on the MapReduce system use Java 1.6.

[0039] The audio database includes about 100,000 MP3 songs downloaded from the Internet, most of which are pop music, with the rest classical and folk music. Main features are extracted from the audio data to obtain a 26-dimensional data set, in which each point in the 26-dimensional space represents one frame of a song. The data set contains 167,876,767 26-dimensional vectors in total. The benchmark is the k-means implementation in Mahout, a well-known machine learning library developed by Apache.

[0040] The audio data of 10,000 songs is clustered with k set to 50, 500, and 5,000 respectively. In order to remove the influe...
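To give a feel for why an index over the centers can reduce cost at these values of k, the sketch below counts distance computations per data point for an exhaustive comparison against all k centers versus a two-level lookup. It is a back-of-the-envelope illustration, not a measured result: it assumes m = 5 (the value used in the simulation example of Embodiment 1; the value of m for this experiment is not stated here) and perfectly balanced partitions. The function name `distances_per_point` is hypothetical.

```python
def distances_per_point(k, m):
    """Return (exhaustive, indexed) distance counts for a single data point.

    Assumption: partitions are balanced, so each holds ceil(k / m) centers.
    """
    exhaustive = k                  # compare against every center
    indexed = m + -(-k // m)        # m partition centers + one partition's centers
    return exhaustive, indexed

for k in (50, 500, 5000):
    full, idx = distances_per_point(k, m=5)
    print(f"k={k}: exhaustive={full}, indexed={idx}")
```

Real partitions are rarely balanced, so the actual saving depends on how the k centers are distributed across the m partitions.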


Abstract

The invention relates to a clustering method for massive high-dimensional data based on center indexing. The method comprises the following steps:
a. k center points are selected from a data set containing massive high-dimensional data, and the k selected center points are clustered into m zones according to distance;
b. the center point and radius of each of the m zones are obtained, and the center point of each zone is taken as an index entry;
c. the distances between a data point in the data set and the center point of each of the m zones are compared to obtain the selected zone and the selected zone center point, the selected zone being the one whose center point is closest to the data point;
d. the distances between the data point and the clustering points in the selected zone are compared, so that clustering analysis is performed on the data points in the data set.
The method can effectively shorten the clustering time for massive data and reduce its clustering cost.

Description

Technical field

[0001] The invention relates to a clustering method, in particular to a central index-based clustering method for massive high-dimensional audio data, belonging to the technical field of data clustering.

Background technique

[0002] Clustering is an important data analysis method. It distinguishes and classifies data objects in a data set according to certain requirements and rules, and then divides a data set without category marks into several subsets (classes) according to certain criteria, classifying similar data objects into one class as much as possible and dissimilar data objects into different classes as much as possible.

[0003] At the same time, with the rapid development of information technology, clustering is not only faced with the problem of increasing data volume, but more importantly, the problem of high-dimensional data. In other words, due to the variety of data sources, graphics, audio, video and even video have gradually ...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 李秋虹, 赵航涛
Owner: JIANGSU WEISHI TECH