Mean clustering method and apparatus

By selecting a similarity threshold in the K-means clustering algorithm to form the sample cluster centers as initial centroids, and combining the min-max strategy to optimize the selection of centroids, the slow convergence speed of the K-means clustering algorithm is solved, and the clustering efficiency is improved.

CN114662578B · Active · Publication Date: 2026-06-19 · INDUSTRIAL AND COMMERCIAL BANK OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INDUSTRIAL AND COMMERCIAL BANK OF CHINA
Filing Date
2022-03-10
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The existing K-means clustering algorithm suffers from slow convergence speed due to the random selection of centroids during the initialization phase.

Method used

Sample clusters are formed by selecting sample data with similarity greater than or equal to a preset threshold from the sample data, and the center value of the cluster is used as the initial centroid. The initial centroid is determined by combining the min-max strategy, thus optimizing the centroid selection process.

Benefits of technology

It improves the convergence speed of the K-means clustering algorithm, reduces the time for calculating similarity, reduces the initialization time, and reduces the probability of outliers or noisy data.



Abstract

This application provides a mean clustering method and apparatus, which can be used in the field of big data. The mean clustering method includes: obtaining a first data set, which includes N sample data; obtaining N1 sample data from the N sample data to obtain a second data set; determining sample data from the second data set whose similarity to the first sample data is greater than or equal to a preset threshold, wherein the first sample data includes any sample data randomly selected from the second data set; forming a first sample cluster by combining the first sample data and the sample data whose similarity to the first sample data is greater than or equal to the preset threshold; determining the center value of the first sample cluster as a first initial centroid; and clustering the N sample data into K clusters based on the first initial centroid, wherein K indicates a preset number of clusters. The mean clustering method of this application can improve the convergence speed when using the K-means clustering algorithm.

Description

Technical Field

[0001] This application relates to the field of machine learning technology, and in particular to a mean clustering method and apparatus.

Background Technology

[0002] K-means clustering is a commonly used sample classification method. Specifically, the K-means clustering algorithm works as follows: First, K samples are randomly selected as centroids, where K represents the number of clusters. Then, the distance between each sample and each centroid is calculated, and each sample is assigned to the nearest centroid, ultimately forming K sample clusters. Afterward, for each cluster, based on all the samples included in that cluster, the centroids are redefined until the termination condition of K-means clustering is met.
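The baseline procedure described in [0002] can be sketched as follows. This is a minimal illustration only, assuming samples are numeric tuples, Euclidean distance, and "centroids no longer change" as the termination condition; the function name `kmeans` and its parameters are hypothetical and do not come from the application.

```python
import math
import random

def kmeans(samples, k, max_iter=100, seed=0):
    """Plain K-means with randomly chosen initial centroids (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)  # random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: math.dist(s, centroids[j]))
            clusters[nearest].append(s)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # assumed termination: centroids stopped moving
            break
        centroids = new_centroids
    # clusters holds the result of the last assignment step
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_clusters = kmeans(points, k=2)
```

How many iterations the loop needs depends on which samples `rng.sample` happens to pick, which is the sensitivity to initialization that the following paragraphs discuss.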

[0003] However, current K-means clustering algorithms select centroids randomly during the initialization phase. This results in the initial values having a significant impact on the computational results of the K-means clustering algorithm. In such cases, poorly selected centroids may lead to slow convergence speed when using the K-means clustering algorithm.

[0004] Therefore, how to select the centroids in the initialization phase so as to improve the convergence speed of the K-means clustering algorithm has become an urgent technical problem to be solved.

Summary of the Invention

[0005] This application provides a mean clustering method and apparatus that can improve the convergence speed when using the K-means clustering algorithm.

[0006] In a first aspect, embodiments of this application provide a mean clustering method, comprising: obtaining a first data set, the first data set including N sample data, where N is a positive integer; obtaining N1 sample data from the N sample data to obtain a second data set, where N1 is less than or equal to N; determining sample data from the second data set whose similarity to the first sample data is greater than or equal to a preset threshold, wherein the first sample data includes any sample data randomly selected from the second data set; forming a first sample cluster by combining the first sample data and the sample data whose similarity to the first sample data is greater than or equal to the preset threshold; determining the center value of the first sample cluster as a first initial centroid; and clustering the N sample data into K clusters based on the first initial centroid, wherein K indicates a preset number of clusters.

[0007] In this embodiment, when determining the K initial centroids, one of the initial centroids is not selected randomly, but rather by using the center of the first sample cluster formed by the first sample data and sample data with a similarity greater than or equal to a preset threshold as the selected initial centroid. It can be understood that by using the center of the first sample cluster as the selected initial centroid, the probability that the selected initial centroid is an isolated point or noisy data can be reduced, thus improving the convergence speed when using the K-means clustering algorithm. Furthermore, since the first initial centroid is determined based on the second data set, and the amount of data in the second data set is smaller than that in the first data set, the time for calculating similarity is reduced, thereby helping to reduce the initialization time.

[0008] In conjunction with the first aspect, in one possible implementation, clustering the N sample data into K classes based on the first initial centroid includes: determining sample data whose similarity to second sample data is greater than or equal to the preset threshold, wherein the second sample data is the sample data that, among the sample data in the second data set excluding the sample data included in the first sample cluster, has the largest distance from the first initial centroid; forming a second sample cluster from the second sample data and the sample data that, among the sample data in the second data set excluding the sample data included in the first sample cluster, has a similarity to the second sample data greater than or equal to the preset threshold; and determining the center value of the second sample cluster as a second initial centroid. Correspondingly, clustering the N sample data into K classes based on the first initial centroid includes: clustering the N sample data into K classes based on the first initial centroid and the second initial centroid.

[0009] In this implementation, the center of the second sample cluster formed by the second sample data and the sample data whose similarity to the second sample data is greater than or equal to a preset threshold is used as the selected initial centroid. It can be understood that by using the center of the second sample cluster as the selected initial centroid, the probability that the selected second initial centroid is an isolated point or noisy data can be reduced. Therefore, the convergence speed of the K-means clustering algorithm can be further improved based on the determination of the first initial centroid.

[0010] In conjunction with the first aspect, in one possible implementation, the center value of the first sample cluster is equal to the average value of all samples included in the first sample cluster.

[0011] In conjunction with the first aspect, in one possible implementation, clustering the N sample data into K classes based on the first initial centroid and the second initial centroid includes performing the following processing on the second data set to obtain p initial centroids: deleting the sample data included in each of the already determined first i-1 sample clusters to update the second data set, wherein the first i-1 sample clusters correspond one-to-one with the already determined first i-1 initial centroids; calculating, for each sample data in the updated second data set, the minimum distance to the first i-1 initial centroids, to form a minimum distance value set; grouping into an i-th sample cluster the sample data in the updated second data set whose similarity to the sample data corresponding to the maximum value in the minimum distance value set is greater than or equal to the preset threshold; determining the center value of the i-th sample cluster as the i-th initial centroid, and deleting from the second data set the sample data included in each of the first i-1 sample clusters and in the i-th sample cluster; and repeating the above process until i equals K or the updated second data set is an empty set, where i is a positive integer greater than 2 and less than or equal to K. Correspondingly, clustering the N sample data into K classes based on the first initial centroid includes: clustering the N sample data into K classes based on the first initial centroid, the second initial centroid and the p initial centroids.

[0012] In conjunction with the first aspect, in one possible implementation, if the updated second data set is an empty set and the number of determined initial centroids is less than K, the method further includes: randomly selecting the remaining initial centroids from the sample data in the first data set excluding the sample data included in the second data set.

[0013] In conjunction with the first aspect, in one possible implementation, obtaining N1 sample data from the N sample data to obtain a second data set includes: obtaining N1 sample data from the N sample data according to a preset ratio value, wherein the preset ratio value indicates the ratio of N1 to N.

[0014] In a second aspect, this application provides a mean clustering apparatus, comprising: an acquisition module, configured to acquire a first data set, the first data set including N sample data, where N is a positive integer; the acquisition module is further configured to acquire N1 sample data from the N sample data to obtain a second data set, where N1 is less than or equal to N; a processing module, configured to determine sample data from the second data set whose similarity to the first sample data is greater than or equal to a preset threshold, the first sample data including any sample data randomly selected from the second data set; the processing module is further configured to form a first sample cluster from the first sample data and the sample data whose similarity to the first sample data is greater than or equal to the preset threshold; the processing module is further configured to determine the center value of the first sample cluster as a first initial centroid; the processing module is further configured to cluster the N sample data into K clusters based on the first initial centroid, where K indicates a preset number of clusters.

[0015] In conjunction with the second aspect, in one possible implementation, the processing module is further configured to: determine sample data whose similarity to second sample data is greater than or equal to the preset threshold, wherein the second sample data is the sample data that, among the sample data in the second data set excluding the sample data included in the first sample cluster, has the largest distance from the first initial centroid; form a second sample cluster from the second sample data and the sample data that, among the sample data in the second data set excluding the sample data included in the first sample cluster, has a similarity to the second sample data greater than or equal to the preset threshold; and determine the center value of the second sample cluster as a second initial centroid. Correspondingly, the processing module is further configured to cluster the N sample data into K classes based on the first initial centroid and the second initial centroid.

[0016] In conjunction with the second aspect, in one possible implementation, the center value of the first sample cluster is equal to the average value of all samples included in the first sample cluster.

[0017] In conjunction with the second aspect, in one possible implementation, the processing module is further configured to perform the following processing on the second data set to obtain p initial centroids: delete the sample data included in each of the already determined first i-1 sample clusters to update the second data set, wherein the first i-1 sample clusters correspond one-to-one with the already determined first i-1 initial centroids; calculate, for each sample data in the updated second data set, the minimum distance to the first i-1 initial centroids, to form a minimum distance value set; group into an i-th sample cluster the sample data in the updated second data set whose similarity to the sample data corresponding to the maximum value in the minimum distance value set is greater than or equal to the preset threshold; determine the center value of the i-th sample cluster as the i-th initial centroid, and delete from the second data set the sample data included in each of the first i-1 sample clusters and in the i-th sample cluster; and repeat the above process until i equals K or the updated second data set is an empty set, where i is a positive integer greater than 2 and less than or equal to K. Correspondingly, the processing module is further configured to cluster the N sample data into K classes based on the first initial centroid, the second initial centroid and the p initial centroids.

[0018] In conjunction with the second aspect, in one possible implementation, if the updated second data set is an empty set and the number of determined initial centroids is less than K, the processing module is further configured to: randomly select the remaining initial centroids from the sample data in the first data set excluding the sample data included in the second data set.

[0019] In conjunction with the second aspect, in one possible implementation, the acquisition module is specifically used to: acquire N1 sample data from the N sample data according to a preset ratio value, wherein the preset ratio value indicates the ratio of N1 to N.

[0020] In a third aspect, this application provides a mean clustering apparatus, comprising: a memory and a processor; the memory is used to store program instructions; the processor is used to invoke the program instructions in the memory to execute the method as described in the first aspect or any of the possible implementations thereof.

[0021] In a fourth aspect, this application provides a chip including at least one processor and a communication interface, the communication interface and the at least one processor being interconnected via a line, the at least one processor being configured to run a computer program or instructions to perform the method as described in the first aspect or any of the possible implementations thereof.

[0022] In a fifth aspect, this application provides a computer-readable medium storing program code for computer execution, the program code including instructions for performing the method described in the first aspect or any of the possible implementations thereof.

[0023] In a sixth aspect, this application provides a computer program product including computer program code that, when executed on a computer, causes the computer to implement the method described in the first aspect or any of its possible implementations.

[0024] For the technical effects of any implementation of the second through sixth aspects, refer to the technical effects of the corresponding implementation of the first aspect above; they are not repeated here.

Attached Figure Description

[0025] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0026] Figure 1 is a structural diagram illustrating an application scenario provided in one embodiment of this application;

[0027] Figure 2 is a flowchart illustrating a mean clustering method provided in one embodiment of this application;

[0028] Figure 3 is a flowchart illustrating a mean clustering method provided in another embodiment of this application;

[0029] Figure 4 is a structural schematic diagram of a mean clustering apparatus provided in one embodiment of this application;

[0030] Figure 5 is a flowchart illustrating a mean clustering apparatus provided in another embodiment of this application.

Detailed Implementation

[0031] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0032] Currently, in the era of big data, how to analyze and utilize massive amounts of data has become a key focus. In the process of data analysis, it is often necessary to use classification algorithms to train large amounts of data to obtain a classification model, and finally use the classification model to predict the category of new data.

[0033] For example, Figure 1 is a structural diagram illustrating an application scenario provided in one embodiment of this application. As shown in Figure 1, this application scenario includes a sample database 101 and a classification model 102. The sample database 101 contains a large amount of sample data. By training on the large amount of sample data in the sample database 101 using a classification algorithm, the classification model 102 can be obtained. For new sample data, the classification model 102 can be used to predict the category of the new sample data.

[0034] Currently, a classification algorithm used for the large amount of sample data in the sample database 101 is the K-means clustering algorithm. Specifically, the K-means clustering algorithm works as follows: First, K samples are randomly selected as centroids, where K represents the number of clusters; then, the distance between each sample data and each centroid is calculated, and each sample data is assigned to the nearest centroid, ultimately forming K sample clusters; then, for each sample cluster, the centroids are re-determined based on all the sample data included in the cluster, until the termination condition of K-means clustering is met.

[0035] However, current K-means clustering algorithms select centroids randomly during the initialization phase. This means that the results of K-means clustering are highly random; the final clustering result may differ depending on the initial randomly selected centroids. In this case, poorly selected centroids can lead to slow convergence speed when using K-means clustering. Therefore, how to select centroids during the initialization phase to improve the convergence speed of K-means clustering has become a pressing technical problem.

[0036] In view of this, embodiments of this application provide a mean clustering method to address the slow convergence of the K-means clustering algorithm.

[0037] Figure 2 is a flowchart illustrating a mean clustering method provided in one embodiment of this application. As shown in Figure 2, the method in this embodiment may include steps S201, S202, S203, S204, S205, and S206. This method can be executed by a mean clustering apparatus.

[0038] S201, Obtain a first data set, which includes N sample data, where N is a positive integer.

[0039] In this embodiment, the first data set refers to a set including N sample data. These N sample data can be considered as the sample data used to train the classification model.

[0040] It should be noted that this embodiment does not limit the specific attributes of the sample data in the first data set. For example, it can be data with text attributes, data with image attributes, or data with other attributes.

[0041] S202, obtain N1 sample data from the N sample data to obtain a second data set, where N1 is less than or equal to N.

[0042] In one possible implementation, N1 sample data points can be obtained from the N sample data points according to a preset ratio value, where the preset ratio value indicates the ratio of N1 to N. That is, a percentage can be set first, and then N in the first data set can be multiplied by this percentage to determine the number Y of sample data points included in the second data set. Afterwards, Y sample data points can be randomly selected from the N sample data points.

[0043] For example, suppose the first dataset contains 1000 sample data points, and the percentage is set to 10%, meaning the second dataset contains 100 sample data points. Then, 100 sample data points can be randomly selected from the 1000 sample data points and added to the second dataset.
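The ratio-based sampling of S202 can be sketched as follows. This is a minimal illustration under assumptions: the function name `draw_second_dataset`, the `seed` parameter, and rounding N × ratio to the nearest integer are hypothetical choices, since the text only says that a percentage of N is taken and the samples are selected randomly.

```python
import random

def draw_second_dataset(first_dataset, ratio, seed=None):
    """Randomly draw N1 = ratio * N samples, without replacement, from the first data set."""
    # Rounding and the lower bound of 1 are assumptions; the text does not specify them.
    n1 = max(1, round(len(first_dataset) * ratio))
    rng = random.Random(seed)
    return rng.sample(first_dataset, n1)

first_dataset = list(range(1000))  # stand-in for 1000 sample data
second_dataset = draw_second_dataset(first_dataset, 0.10, seed=42)
print(len(second_dataset))  # 100
```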

[0044] S203, from the second data set, determine sample data whose similarity to the first sample data is greater than or equal to a preset threshold, wherein the first sample data includes any sample data randomly selected from the second data set.

[0045] Here, the first sample data refers to a sample data randomly selected from the second data set. For example, if the first data set includes 1000 sample data, and 100 sample data are randomly selected from these 1000 sample data as the sample data in the second data set, then a sample data can be randomly selected from these 100 sample data in the second data set as the first sample data.

[0046] In this embodiment, after determining the first sample data, sample data with a similarity greater than or equal to a preset threshold with the first sample data are also determined from the second data set.

[0047] Taking the 100 sample data included in the second data set as an example, assuming the preset threshold is 70%, when a certain sample data in the 100 sample data is used as the first sample data, it is necessary to continue to calculate the similarity between each sample data in the 100 sample data and the first sample data in order to obtain sample data with a similarity greater than or equal to 70% with the first sample.

[0048] It should be noted that this embodiment does not impose any restrictions on the algorithm used to calculate similarity.

[0049] S204, the first sample data and sample data with a similarity to the first sample data greater than or equal to a preset threshold are formed into a first sample cluster.

[0050] It should be understood that when the similarity between two sample data is greater than or equal to a preset threshold, it indicates that the two sample data are very similar. Therefore, in this embodiment, if the similarity between a sample data in the second data set and the first sample data is greater than or equal to the preset threshold, it indicates that the sample data is very similar to the first sample data, meaning there is a probability that they might be classified into the same category. Based on this, in this embodiment, the first sample data and all sample data in the second data set whose similarity to the first sample data is greater than or equal to the preset threshold are grouped into one cluster.
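Steps S203 and S204 can be sketched as follows. Because the embodiment does not restrict how similarity is calculated, the measure used here, 1 / (1 + Euclidean distance), is purely an assumption for illustration, as are the names `similarity` and `form_first_cluster` and the threshold value 0.4.

```python
import math
import random

def similarity(a, b):
    # Assumed similarity measure: 1 / (1 + Euclidean distance), in (0, 1].
    return 1.0 / (1.0 + math.dist(a, b))

def form_first_cluster(second_dataset, threshold, seed=None):
    """Pick a random first sample and gather every sample at least `threshold` similar to it."""
    rng = random.Random(seed)
    first_sample = rng.choice(second_dataset)
    # The first sample has similarity 1.0 to itself, so it is always a member
    # as long as the threshold does not exceed 1.
    cluster = [s for s in second_dataset if similarity(s, first_sample) >= threshold]
    return first_sample, cluster

second_dataset = [(0, 0), (0, 1), (1, 0), (10, 10)]
first_sample, first_cluster = form_first_cluster(second_dataset, threshold=0.4, seed=0)
```

With these sample values, the three points near the origin are mutually similar above the threshold, while (10, 10) only ever forms a cluster by itself.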

[0051] S205, the center value of the first sample cluster is determined as the first initial centroid.

[0052] In this embodiment, after the first sample cluster is determined, the center value of the first sample cluster is determined as the first initial centroid.

[0053] For example, if the first sample cluster includes 4 sample data, then the center value of these 4 sample data can be determined as the first initial centroid.

[0054] It should be noted that this embodiment does not limit the method of determining the center value of the first sample cluster.

[0055] For example, in one possible implementation, the centroid of the first sample cluster is equal to the average of all samples included in the first sample cluster. For instance, taking the first sample cluster as having four sample data points, assuming these four sample data points are represented as w1, w2, w3, and w4 respectively, and w1 = 2, w2 = 3, w3 = 4, and w4 = 7, then the average of w1, w2, w3, and w4 can be used as the first initial centroid.
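The arithmetic in this example can be checked with a few lines. The helper name `center_value` is hypothetical, and taking the arithmetic mean is only the one option named in the text.

```python
def center_value(cluster):
    """Center value of a cluster, taken here as the arithmetic mean of its samples."""
    return sum(cluster) / len(cluster)

w1, w2, w3, w4 = 2, 3, 4, 7
first_initial_centroid = center_value([w1, w2, w3, w4])
print(first_initial_centroid)  # (2 + 3 + 4 + 7) / 4 = 4.0
```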

[0056] S206, based on the first initial centroid, the N sample data are clustered into K clusters, where K indicates the preset number of clusters.

[0057] In this embodiment, after the first initial centroid is determined, the N sample data can be clustered into K classes based on it. It should be understood that clustering the N sample data into K classes based on the first initial centroid includes determining the other K-1 initial centroids (for example, by selecting them randomly) and then clustering the N sample data based on the first initial centroid and these K-1 initial centroids. The principles of the clustering algorithm itself can be found in the related art and are not elaborated here.
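One way to read S206 is sketched here: the first initial centroid is kept, the other K-1 initial centroids are drawn randomly from the samples, and ordinary K-means iterations follow. The names `cluster_from` and `s206`, the Euclidean distance, and the stopping rule are assumptions for illustration, not details given in the application.

```python
import math
import random

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def cluster_from(initial_centroids, samples, max_iter=100):
    """K-means iterations starting from the given initial centroids."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iter):
        # Assign each sample to the index of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(s, centroids[j]))
                  for s in samples]
        updated = []
        for j, old in enumerate(centroids):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            updated.append(mean(members) if members else old)
        if updated == centroids:  # assumed termination: no centroid moved
            break
        centroids = updated
    return centroids, labels

def s206(first_centroid, samples, k, seed=None):
    rng = random.Random(seed)
    others = rng.sample(samples, k - 1)  # the remaining K-1 centroids, chosen randomly
    return cluster_from([first_centroid] + others, samples)

samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = s206((1 / 3, 1 / 3), samples, k=2, seed=0)
```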

[0058] In this embodiment, when determining the K initial centroids, one of the initial centroids is not selected randomly, but rather by using the center of the first sample cluster formed by the first sample data and sample data with a similarity greater than or equal to a preset threshold as the selected initial centroid. It can be understood that by using the center of the first sample cluster as the selected initial centroid, the probability that the selected initial centroid is an isolated point or noisy data can be reduced, thus improving the convergence speed when using the K-means clustering algorithm. Furthermore, since the first initial centroid is determined based on the second data set, and the amount of data in the second data set is smaller than that in the first data set, the time for calculating similarity is reduced, thereby helping to reduce the initialization time.

[0059] As an optional embodiment, in one possible implementation, clustering the N sample data into K classes based on the first initial centroid includes: determining sample data whose similarity to second sample data is greater than or equal to the preset threshold, wherein the second sample data is the sample data that, among the sample data in the second data set excluding the sample data included in the first sample cluster, has the largest distance from the first initial centroid; forming a second sample cluster from the second sample data and the sample data that, among the sample data in the second data set excluding the sample data included in the first sample cluster, has a similarity to the second sample data greater than or equal to the preset threshold; and determining the center value of the second sample cluster as a second initial centroid. Correspondingly, clustering the N sample data into K classes based on the first initial centroid includes: clustering the N sample data into K classes based on the first initial centroid and the second initial centroid.

[0060] In this embodiment, after determining the first initial centroid, the distance between each sample data in the second data set (excluding the sample data included in the first sample cluster) and the first initial centroid is calculated. Then, the sample data with the largest distance from the first initial centroid is taken as the second sample data. The similarity between each sample data in the second data set (excluding the sample data included in the first sample cluster) and the second sample data is calculated. Then, the center value of the second sample cluster formed by the second sample data and the sample data with a similarity greater than or equal to a preset threshold is taken as the selected second initial centroid.
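The selection of the second initial centroid described in [0060] can be sketched as follows. The similarity measure 1 / (1 + Euclidean distance), the mean as center value, and the function names are assumptions; the sketch also assumes the samples are distinct, since cluster members are excluded by value.

```python
import math

def similarity(a, b):
    # Assumed similarity measure; the text leaves the choice open.
    return 1.0 / (1.0 + math.dist(a, b))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def second_initial_centroid(second_dataset, first_cluster, c1, threshold):
    # Exclude the samples already placed in the first sample cluster.
    remaining = [s for s in second_dataset if s not in first_cluster]
    # The second sample data is the remaining sample farthest from C1.
    x = max(remaining, key=lambda s: math.dist(s, c1))
    # Group the remaining samples that are similar enough to x (x itself included).
    second_cluster = [s for s in remaining if similarity(s, x) >= threshold]
    return x, second_cluster, mean(second_cluster)

second_dataset = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
first_cluster = [(0, 0), (0, 1), (1, 0)]
c1 = mean(first_cluster)
x, second_cluster, c2 = second_initial_centroid(second_dataset, first_cluster, c1, 0.4)
```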

[0061] In this implementation, the center of the second sample cluster formed by the second sample data and sample data with a similarity greater than or equal to a preset threshold is used as the selected initial centroid. It is understandable that using the center of the second sample cluster as the selected initial centroid reduces the probability that the selected second initial centroid is an outlier or noisy data point. Therefore, based on the determination of the first initial centroid, the convergence speed of the K-means clustering algorithm can be further improved.

[0062] As an optional embodiment, clustering the N sample data into K classes based on the first and second initial centroids includes performing the following processing on the second data set to obtain p initial centroids: deleting the sample data included in each of the already determined first i-1 sample clusters to update the second data set, wherein the first i-1 sample clusters correspond one-to-one with the already determined first i-1 initial centroids; calculating, for each sample data in the updated second data set, the minimum distance to the first i-1 initial centroids, to form a minimum distance value set; grouping into an i-th sample cluster the sample data in the updated second data set whose similarity to the sample data corresponding to the maximum value in the minimum distance value set is greater than or equal to the preset threshold; determining the center value of the i-th sample cluster as the i-th initial centroid, and deleting from the second data set the sample data included in each of the first i-1 sample clusters and in the i-th sample cluster; and repeating the above process until i equals K or the updated second data set is an empty set, where i is a positive integer greater than 2 and less than or equal to K. Correspondingly, clustering the N sample data into K classes based on the first initial centroid includes: clustering the N sample data into K classes based on the first initial centroid, the second initial centroid and the p initial centroids.
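The whole initialization loop, including the random fallback of [0012] for the case where the second data set is exhausted before K centroids are found, might look like the sketch below. Everything beyond the steps named in the text is an assumption: the similarity measure, Euclidean distance, the mean as center value, distinct samples, and the function name `initial_centroids`.

```python
import math
import random

def similarity(a, b):
    # Assumed similarity measure; the application does not fix one.
    return 1.0 / (1.0 + math.dist(a, b))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def initial_centroids(first_dataset, second_dataset, k, threshold, seed=None):
    rng = random.Random(seed)
    remaining = list(second_dataset)
    centroids = []
    while remaining and len(centroids) < k:
        if not centroids:
            # First cluster: start from a randomly selected sample.
            anchor = rng.choice(remaining)
        else:
            # Min-max step: take the sample whose minimum distance to the
            # already chosen centroids is largest.
            anchor = max(remaining,
                         key=lambda s: min(math.dist(s, c) for c in centroids))
        cluster = [s for s in remaining if similarity(s, anchor) >= threshold]
        centroids.append(mean(cluster))
        # Delete the new cluster's members from the (updated) second data set.
        remaining = [s for s in remaining if s not in cluster]
    if len(centroids) < k:
        # Fallback: draw the rest randomly from first-data-set samples
        # that are not in the second data set.
        pool = [s for s in first_dataset if s not in second_dataset]
        centroids.extend(rng.sample(pool, min(len(pool), k - len(centroids))))
    return centroids

second = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
first = second + [(5, 5), (15, 5)]
print(sorted(initial_centroids(first, second, k=3, threshold=0.4, seed=0)))
```

With one centroid chosen, the min-max rule reduces to picking the sample farthest from it, which matches the way the second initial centroid is selected.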

[0063] For ease of understanding, Figure 3 is a flowchart illustrating a mean clustering method provided in another embodiment of this application. As shown in Figure 3, this mean clustering method includes:

[0064] S301, Obtain the first data set, which includes N sample data.

[0065] S302, obtain N1 sample data from N sample data to obtain a second data set, where N1 is less than or equal to N.

[0066] S303, from the second data set, sample data with a similarity greater than or equal to a preset threshold with the first sample data are grouped into a first sample cluster; the center of the first sample cluster is taken as the first initial centroid C1, and all sample data included in the first sample cluster are deleted from the second data set.

[0067] For details on the implementation of S301 to S303, refer to the embodiment shown in Figure 2; they are not described in detail here.

[0068] S304, from the second data set after deleting all sample data included in the first sample cluster, find the sample data x that is farthest from the selected first initial centroid C1, and form a second sample cluster from the sample data whose similarity to x is greater than or equal to the preset threshold; take the center of the second sample cluster as the second initial centroid C2.

[0069] The implementation of this step can be found in the previous description of how to determine the second initial centroid, which will not be repeated here.
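A minimal sketch of S304 is given below, under the same assumptions as before: one-dimensional samples, absolute difference as the distance, and 1 / (1 + distance) as the similarity. The function name `second_centroid` and the sample values are hypothetical.

```python
from statistics import mean


def similarity(a: float, b: float) -> float:
    """Assumed similarity measure: 1 / (1 + distance)."""
    return 1.0 / (1.0 + abs(a - b))


def second_centroid(remaining: list[float], c1: float, threshold: float):
    """Pick the sample farthest from C1 and build the second cluster around it."""
    # x is the remaining sample with the largest distance from the first centroid.
    x = max(remaining, key=lambda s: abs(s - c1))
    cluster = [s for s in remaining if similarity(s, x) >= threshold]
    left = [s for s in remaining if similarity(s, x) < threshold]
    return x, mean(cluster), left


# Second data set after the first sample cluster has been deleted (illustrative values).
x, c2, left = second_centroid([10.0, 11.0, 19.0, 20.0], c1=2.0, threshold=0.5)
```

Here x is 20.0, the second sample cluster is [19.0, 20.0], and C2 is their mean, 19.5.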

[0070] S305, calculate, for each remaining sample data in the second data set, the minimum distance to all selected centroids; find the sample data corresponding to the largest minimum distance; form the next sample cluster from the remaining sample data whose similarity to that sample data is greater than or equal to the preset threshold; take the center of the next sample cluster as the next initial centroid; and delete all sample data included in the next sample cluster from the second data set.

[0071] In this step, the remaining sample data in the second data set refers to the sample data left in the second data set after all sample data included in the already formed sample clusters have been deleted. For the concept of a sample cluster, refer to the description in the above embodiments; it is not repeated here.

[0072] In this embodiment, when determining an initial centroid from the remaining sample data, the distance between each remaining sample data and each selected initial centroid is first calculated, and the smallest of these distances is taken as the distance corresponding to that sample data. It can be understood that once the minimum distance corresponding to each remaining sample data has been calculated, a minimum distance set is formed.

[0073] In this embodiment, after the minimum distance set is formed, all sample data in the remaining sample data that have a similarity greater than or equal to a preset threshold to the sample data corresponding to the largest minimum distance in the minimum distance set will form the next sample cluster. Then, the center of the next sample cluster will be used as the next initial centroid, and all sample data included in the next sample cluster will be deleted from the second data set.

[0074] For example, assume that the first initial centroid is determined to be 3, the second initial centroid is 5, and the second data set after deleting the first sample cluster (the sample cluster corresponding to the first initial centroid) and the second sample cluster (the sample cluster corresponding to the second initial centroid) is {1, 2, 3, 4, 5, 6, 7, 8}. First, the distance between each sample data in {1, 2, 3, 4, 5, 6, 7, 8} and each of the first initial centroid 3 and the second initial centroid 5 is calculated. Then, the minimum distance value corresponding to each sample data forms a minimum distance set. It can be understood that, in this example, the minimum distance set includes 8 minimum distance values. After that, the sample data corresponding to the largest of the 8 minimum distance values (referred to as the target data) is selected. All sample data in the second data set {1, 2, 3, 4, 5, 6, 7, 8} whose similarity to the target data is greater than or equal to the preset threshold form the next sample cluster. The center value of the next sample cluster is determined as the next initial centroid, and all sample data included in the next sample cluster are deleted from the second data set.
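The numbers in this example can be worked through directly. The sketch below assumes the distance between two scalar samples is their absolute difference, which the example does not state explicitly.

```python
centroids = [3, 5]                      # first and second initial centroids
remaining = [1, 2, 3, 4, 5, 6, 7, 8]    # second data set after deleting both clusters

# Minimum distance from each remaining sample to any already selected centroid.
min_distances = [min(abs(x - c) for c in centroids) for x in remaining]

# The target data is the sample with the largest minimum distance.
target = remaining[min_distances.index(max(min_distances))]

print(min_distances)  # [2, 1, 0, 1, 0, 1, 2, 3]
print(target)         # 8
```

Under this assumption the target data is 8, so the next sample cluster is formed from the samples whose similarity to 8 meets the preset threshold.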

[0075] S306, determine whether the number of initial centroids has reached K. If yes, execute S309; otherwise, execute S307.

[0076] Understandably, when the number of initial centroids reaches K, that is, when K initial centroids are determined, these K initial centroids will be output so that they can be used to cluster N sample data.

[0077] S307, determine if the second dataset is empty. If yes, execute S308; otherwise, execute S305.

[0078] Understandably, if the number of initial centroids is less than K and the second data set is empty, the remaining initial centroids still need to be determined, so S308 is executed. If the number of initial centroids is less than K and the second data set is not empty, execution returns to S305 to determine the next initial centroid.

[0079] S308, the remaining initial centroids are randomly drawn from the sample data in the first dataset excluding all sample data included in the second dataset.

[0080] In this embodiment, the remaining initial centroids are randomly selected from the sample data in the first dataset, excluding all sample data included in the second dataset.

[0081] For example, the first data set is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}.

[0082] The second data set is {7, 8, 9, 10, 11, 12, 13, 14, 15}. When the above process is executed until the second data set is empty and the number of initial centroids has not yet reached K, then the remaining centroids can be randomly selected from the sample data in the first data set excluding all sample data included in the second data set, i.e., randomly selected from {1, 2, 3, 4, 5, 6}.
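The fallback in S308 amounts to a set difference followed by a random draw. The sketch below uses the two data sets from this example; the number of missing centroids (2) and the fixed random seed are assumptions made only so the sketch can be run.

```python
import random

first_data_set = list(range(1, 16))     # {1, ..., 15}
second_data_set = list(range(7, 16))    # {7, ..., 15}

# Candidates are the samples of the first data set that were never in the second.
sampled = set(second_data_set)
candidates = [x for x in first_data_set if x not in sampled]

missing = 2                             # assumed: K minus the centroids found so far
rng = random.Random(0)                  # seeded only for reproducibility
extra_centroids = rng.sample(candidates, missing)
```

Here `candidates` is [1, 2, 3, 4, 5, 6], matching the set named in the example.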

[0083] S309, output the K initial centroids.

[0084] In this embodiment, once K initial centroids are determined, these K initial centroids can be output so that they can be used to cluster N sample data.
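Putting S301 to S309 together, the control flow of Figure 3 might look like the sketch below. It is not the reference implementation: it assumes one-dimensional samples with distinct values, absolute difference as the distance, 1 / (1 + distance) as the similarity, and the cluster mean as the center. The function name `init_centroids` and its parameters are hypothetical.

```python
import random
from statistics import mean


def similarity(a: float, b: float) -> float:
    """Assumed similarity measure: 1 / (1 + distance)."""
    return 1.0 / (1.0 + abs(a - b))


def init_centroids(first_set, ratio, k, threshold, rng):
    # S302: draw N1 = ratio * N samples to form the second data set.
    n1 = max(1, round(len(first_set) * ratio))
    second_set = rng.sample(first_set, n1)
    remaining = list(second_set)
    centroids = []

    # S306 / S307: stop once K centroids exist or the second data set is empty.
    while len(centroids) < k and remaining:
        if not centroids:
            # S303: the first seed is drawn at random.
            seed = rng.choice(remaining)
        else:
            # S304 / S305: seed is the sample with the largest minimum distance
            # to the centroids selected so far (min-max strategy).
            seed = max(remaining, key=lambda x: min(abs(x - c) for c in centroids))
        cluster = [x for x in remaining if similarity(x, seed) >= threshold]
        centroids.append(mean(cluster))
        # Delete the new cluster's members from the second data set.
        remaining = [x for x in remaining if similarity(x, seed) < threshold]

    # S308: fill any missing centroids from samples outside the second data set.
    if len(centroids) < k:
        sampled = set(second_set)
        pool = [x for x in first_set if x not in sampled]
        missing = min(k - len(centroids), len(pool))  # may yield fewer than K
        centroids.extend(rng.sample(pool, missing))

    return centroids  # S309


data = [float(i) for i in range(1, 31)]
centroids = init_centroids(data, ratio=0.5, k=3, threshold=0.5, rng=random.Random(0))
```

With a single selected centroid, the largest minimum distance is simply the largest distance to C1, so the same branch covers both S304 and S305.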

[0085] The clustering method proposed in this embodiment improves the initialization phase of the K-means algorithm. During initialization, each centroid is isolated by deleting the neighboring points around the already determined initial centroid. At the same time, a min-max strategy is used to distribute the selected centroids reasonably, which speeds up convergence and improves the quality of initial centroid selection. Furthermore, since the center of a sample cluster is used as the initial centroid, isolated points and noisy data can be kept from becoming centroids.
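Once the initial centroids are available, the N samples can be clustered with the usual K-means assignment and update iterations. The sketch below is a plain one-dimensional version written for illustration; `kmeans_1d` and its parameters are hypothetical names, and the source does not prescribe this particular loop.

```python
from statistics import mean


def kmeans_1d(data, initial_centroids, max_iter=100):
    """Standard K-means iterations started from the given initial centroids."""
    centroids = list(initial_centroids)
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        groups = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            groups[nearest].append(x)
        # Update step: each centroid moves to the mean of its group
        # (an empty group keeps its previous centroid).
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, groups


data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, clusters = kmeans_1d(data, initial_centroids=[2.5, 10.5])
```

Starting from centroids that already lie inside the two groups, this run converges on the second iteration, which illustrates why well-placed initial centroids shorten the iteration phase.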

[0086] Figure 4 is a structural schematic diagram of a mean clustering device provided in one embodiment of this application. As shown in Figure 4, the device 400 includes an acquisition module 401 and a processing module 402.

[0087] The acquisition module 401 is used to acquire a first data set, which includes N sample data, where N is a positive integer; the acquisition module 401 is also used to acquire N1 sample data from the N sample data to obtain a second data set, where N1 is less than or equal to N; the processing module 402 is used to determine sample data from the second data set whose similarity to the first sample data is greater than or equal to a preset threshold, where the first sample data includes any sample data randomly selected from the second data set; the processing module 402 is also used to form a first sample cluster from the first sample data and the sample data whose similarity to the first sample data is greater than or equal to the preset threshold; the processing module 402 is also used to determine the center value of the first sample cluster as a first initial centroid; the processing module 402 is also used to cluster the N sample data into K classes based on the first initial centroid, where K indicates a preset number of clusters.

[0088] As an example, the acquisition module 401 can be used to perform the step, in the method shown in Figure 2, of obtaining N1 sample data from the N sample data. For example, the acquisition module 401 is used to execute S202.

[0089] In one possible implementation, the processing module 402 is further configured to: determine sample data whose similarity to second sample data is greater than or equal to the preset threshold, wherein the second sample data is the sample data that, among the sample data in the second data set other than the sample data included in the first sample cluster, has the largest distance from the first initial centroid; form a second sample cluster from the second sample data and the sample data that, among the sample data in the second data set other than the sample data included in the first sample cluster, has a similarity to the second sample data greater than or equal to the preset threshold; and determine the center value of the second sample cluster as a second initial centroid. Accordingly, the processing module 402 is further configured to cluster the N sample data into K classes based on the first initial centroid and the second initial centroid.

[0090] In one possible implementation, the center value of the first sample cluster is equal to the average value of all samples included in the first sample cluster.

[0091] In one possible implementation, the processing module 402 is further configured to perform the following processing on the second data set to obtain p initial centroids: delete the sample data included in each of the already determined first i-1 sample clusters to update the second data set, wherein the first i-1 sample clusters correspond one-to-one with the already determined first i-1 initial centroids; calculate, for each sample data in the updated second data set, the minimum distance to the first i-1 initial centroids to form a minimum distance value set; group the sample data in the updated second data set whose similarity to the sample data corresponding to the maximum value in the minimum distance value set is greater than or equal to the preset threshold into the i-th sample cluster; determine the center value of the i-th sample cluster as the i-th initial centroid, and delete from the second data set the sample data included in each of the first i-1 sample clusters and the sample data included in the i-th sample cluster; and repeat the above process until i equals K or the updated second data set is an empty set, where i is a positive integer greater than 2 and less than or equal to K. Correspondingly, the processing module 402 is further configured to cluster the N sample data into K classes based on the first initial centroid, the second initial centroid, and the p initial centroids.

[0092] In one possible implementation, if the updated second data set is an empty set and the number of determined initial centroids is less than K, the processing module 402 is further configured to: randomly select the remaining initial centroids from the sample data in the first data set excluding the sample data included in the second data set.

[0093] In one possible implementation, the acquisition module 401 is specifically used to: acquire N1 sample data from the N sample data according to a preset ratio value, wherein the preset ratio value indicates the ratio of N1 to N.
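The division of work between the two modules of device 400 could be mirrored in software as sketched below. The class and method names are illustrative only, and the sketch again assumes one-dimensional samples, a similarity of 1 / (1 + distance), and the cluster mean as the center value.

```python
import random
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AcquisitionModule:
    """Illustrative counterpart of acquisition module 401."""
    rng: random.Random = field(default_factory=random.Random)

    def second_data_set(self, first_data_set: list[float], ratio: float) -> list[float]:
        # The preset ratio value indicates N1 / N.
        n1 = round(len(first_data_set) * ratio)
        return self.rng.sample(first_data_set, n1)


@dataclass
class ProcessingModule:
    """Illustrative counterpart of processing module 402."""
    threshold: float

    def first_centroid(self, second_data_set: list[float], seed: float) -> float:
        # Assumed similarity: 1 / (1 + distance); the center is the cluster mean.
        cluster = [x for x in second_data_set
                   if 1.0 / (1.0 + abs(x - seed)) >= self.threshold]
        return mean(cluster)


acquisition = AcquisitionModule(random.Random(1))
subset = acquisition.second_data_set([float(i) for i in range(10)], ratio=0.4)
processing = ProcessingModule(threshold=0.5)
c1 = processing.first_centroid([1.0, 2.0, 3.0, 9.0], seed=2.0)
```

With a ratio of 0.4 and N = 10, the acquisition step returns N1 = 4 samples.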

[0094] Figure 5 is a structural schematic diagram of a mean clustering device provided in another embodiment of this application. The apparatus shown in Figure 5 can be used to perform the method described in any of the foregoing embodiments.

[0095] As shown in Figure 5, the device 500 of this embodiment includes: a memory 501, a processor 502, a communication interface 503, and a bus 504. The memory 501, processor 502, and communication interface 503 are interconnected via the bus 504.

[0096] The memory 501 can be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 501 can store programs, and when a program stored in the memory 501 is executed by the processor 502, the processor 502 is used to execute the steps of the method shown in Figure 2 or Figure 3.

[0097] The processor 502 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, used to execute relevant programs to implement the method shown in Figure 2 or Figure 3 of this application.

[0098] The processor 502 can also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the method shown in Figure 2 or Figure 3 of the embodiments of this application can be accomplished through integrated logic circuits in the hardware of the processor 502 or through instructions in software form.

[0099] The processor 502 described above can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0100] The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in the memory 501. The processor 502 reads the information in the memory 501 and, in conjunction with its hardware, completes the functions required by the units included in the device of this application. For example, it can execute the various steps or functions of the embodiments shown in Figure 2 or Figure 3.

[0101] The communication interface 503 can use, but is not limited to, transceivers to enable communication between the device 500 and other devices or communication networks.

[0102] Bus 504 may include a pathway for transmitting information between various components of device 500 (e.g., memory 501, processor 502, communication interface 503).

[0103] It should be understood that the device 500 shown in the embodiments of this application may be an electronic device, or it may be a chip configured in an electronic device.

[0104] It should be noted that the mean clustering method and apparatus in this application can be used in the field of big data, as well as in any other field. This application does not limit the application field of the mean clustering method and apparatus.

[0105] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless (e.g., infrared, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.

[0106] It should be understood that the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. A and B can be singular or plural. Additionally, the character " / " in this article generally indicates an "or" relationship between the preceding and following related objects, but it can also represent an "and / or" relationship. Please refer to the context for a more accurate understanding.

[0107] In this application, "at least one" means one or more, and "more than one" means two or more. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or multiple items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.

[0108] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0109] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0110] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0111] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0112] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0113] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0114] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory, random access memory, magnetic disks, or optical disks.

[0115] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A mean clustering method, characterized by comprising: obtaining a first data set, which includes N sample data, where N is a positive integer, and the sample data is image attribute data or text attribute data; obtaining N1 sample data from the N sample data to obtain a second data set, where N1 is less than N; determining, from the second data set, sample data whose similarity to first sample data is greater than or equal to a preset threshold, wherein the first sample data includes any sample data randomly selected from the second data set; forming a first sample cluster from the first sample data and the sample data whose similarity to the first sample data is greater than or equal to the preset threshold; determining the center value of the first sample cluster as a first initial centroid; determining sample data whose similarity to second sample data is greater than or equal to the preset threshold, wherein the second sample data is the sample data that, among the sample data in the second data set other than the sample data included in the first sample cluster, has the largest distance from the first initial centroid; forming a second sample cluster from the second sample data and the sample data that, among the sample data in the second data set other than the sample data included in the first sample cluster, has a similarity to the second sample data greater than or equal to the preset threshold; determining the center value of the second sample cluster as a second initial centroid; performing the following processing on the second data set to obtain p initial centroids: deleting the sample data included in each of the already determined first i-1 sample clusters to update the second data set, wherein the first i-1 sample clusters correspond one-to-one with the already determined first i-1 initial centroids; calculating, from the updated second data set, the minimum distance between each sample data and the first i-1 initial centroids to form a minimum distance value set; grouping the sample data in the updated second data set whose similarity to the sample data corresponding to the maximum value in the minimum distance value set is greater than or equal to the preset threshold into the i-th sample cluster; determining the center value of the i-th sample cluster as the i-th initial centroid, and deleting from the second data set the sample data included in each of the first i-1 sample clusters and the sample data included in the i-th sample cluster; and repeating the above process until i equals K or the updated second data set is empty, where i is a positive integer greater than 2 and less than or equal to K; and clustering the N sample data into K clusters based on the first initial centroid, the second initial centroid, and the p initial centroids, where K indicates a preset number of clusters.

2. The method of claim 1, wherein the center value of the first sample cluster is equal to the average value of all samples included in the first sample cluster.

3. The method of claim 1, wherein, if the updated second data set is empty and the number of determined initial centroids is less than K, the method further includes: randomly selecting the remaining initial centroids from the sample data in the first data set excluding the sample data included in the second data set.

4. The method according to any one of claims 1 to 3, characterized in that the step of obtaining N1 sample data from the N sample data to obtain the second data set includes: obtaining N1 sample data from the N sample data according to a preset ratio value, where the preset ratio value indicates the ratio of N1 to N.

5. A mean clustering apparatus, characterized in that the mean clustering apparatus is used to implement the mean clustering method according to any one of claims 1 to 4, and the apparatus comprises: an acquisition module, used to acquire a first data set, which includes N sample data, where N is a positive integer, and the sample data is image attribute data or text attribute data; the acquisition module being further configured to acquire N1 sample data from the N sample data to obtain a second data set, where N1 is less than N; a processing module, used to determine, from the second data set, sample data whose similarity to first sample data is greater than or equal to a preset threshold, wherein the first sample data includes any sample data randomly selected from the second data set; the processing module being further configured to form a first sample cluster from the first sample data and the sample data whose similarity to the first sample data is greater than or equal to the preset threshold; the processing module being further configured to determine the center value of the first sample cluster as a first initial centroid; and the processing module being further configured to cluster the N sample data into K clusters based on the initial centroids, where K indicates a preset number of clusters.

6. A mean clustering device, characterized by comprising: a memory and a processor; the memory is used to store program instructions; the processor is used to invoke the program instructions in the memory to execute the method as described in any one of claims 1 to 4.

7. A computer-readable medium, characterized in that the computer-readable medium stores program code for computer execution, the program code including instructions for performing the method as described in any one of claims 1 to 4.

8. A computer program product comprising computer program instructions, characterized in that, when the computer program instructions are executed on a computer, the computer is caused to perform the method as described in any one of claims 1 to 4.