K-means log classification method and apparatus based on distributed sample screening
By employing a distributed sample selection and centroid fusion method, the problem of uneven centroid selection in the initialization phase of the k-means algorithm is solved, thereby improving the convergence speed and effectiveness of the clustering algorithm. This method is particularly suitable for text clustering in big data environments.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: INDUSTRIAL AND COMMERCIAL BANK OF CHINA
- Filing Date: 2023-06-16
- Publication Date: 2026-06-19
AI Technical Summary
Existing k-means clustering algorithms tend to have excessively dispersed or concentrated centroids when randomly selecting initial centroids during the initialization phase, resulting in slow clustering convergence and unsatisfactory performance, especially when dealing with large amounts of text data.
A distributed sample screening method is adopted to divide the massive log samples into N sets for distributed parallel processing. The quality of the sample sets is measured by calculating the distance between each sample set and the central centroid. The cluster centers are merged, and samples of high quality are extracted to form a high-quality sample set. The centroid is selected, and the centroid positions are reasonably distributed by combining the strategies of deleting neighboring points around the centroid and the maximum distance product.
It improves the quality of centroid selection and initialization convergence speed, reduces initialization time, and enhances clustering performance.
[Figure: abstract drawing of CN116610987B]
Description
Technical Field
[0001] This application relates to the field of big data technology, and in particular to a k-means log classification method and apparatus based on distributed sample screening.
Background Technology
[0002] Log clustering aims to identify similar logs. By analyzing the errors that users encounter while using an application, logs with the same error type are grouped together. The groups can then be further categorized to uncover error-prone user habits and to provide suggestions for addressing similar issues in the future.
[0003] Current technologies typically employ the k-means clustering algorithm for text clustering. The steps are as follows: first, the data is pre-divided into K groups; then, K objects are randomly selected as initial cluster centers; next, the distance between each object and each seed cluster center is calculated, and each object is assigned to the nearest cluster center. The cluster centers and the objects assigned to them represent a cluster. Each time a sample is assigned, the cluster centers are recalculated based on the existing objects in the cluster. This process is repeated until a certain termination condition is met.
[0004] However, existing k-means clustering algorithms select the initial centroids randomly during the initialization phase. The selected centroids may therefore be too scattered or too concentrated rather than evenly distributed, resulting in slow clustering convergence and poor clustering performance.
Summary of the Invention
[0005] This application provides a k-means log classification method and apparatus based on distributed sample screening to solve the problems of slow convergence speed and poor clustering effect of current clustering algorithms.
[0006] Firstly, this application provides a k-means log classification method based on distributed sample selection, including:
[0007] Obtain N log sample sets and a replica of each log sample set, and determine K centers in each log sample set. Each log sample set includes at least one log sample, and the replica is the same as the log sample in the corresponding log sample set. N and K are positive integers.
[0008] Based on the K centers in each log sample set, the log samples in the replica of the log sample set are divided into clusters to obtain the K clusters of the replica and the cluster center of each cluster;
[0009] The cluster centers of all the replicas are combined into an initial center set. The cluster centers are then merged according to the cosine distance between each cluster center in the initial center set until the preset fusion termination condition is met, thus obtaining the first center set.
[0010] Based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set, the K minimum distances of the log sample set are calculated.
[0011] Determine the level label of the log sample set based on the K minimum distances of the log sample set;
[0012] Based on the level label of each log sample set, a target number of log samples are extracted from all log sample sets to form a sample dataset;
[0013] K centroids are determined from the sample dataset, and K-means clustering is performed.
[0014] Secondly, this application provides a k-means log classification device based on distributed sample screening, comprising:
[0015] The acquisition module is used to acquire N log sample sets and the corresponding replicas of each log sample set, and to determine K centers in each log sample set. The log sample set includes at least one log sample, and the replica is the same as the log sample in the corresponding log sample set. N and K are positive integers.
[0016] The center determination module is used to divide the log samples in the replica of each log sample set into clusters based on the K centers in each log sample set, so as to obtain the K clusters of the replica and the cluster center of each cluster;
[0017] The center set module is used to form an initial center set from the cluster centers of all replicas. Based on the cosine distance between the cluster centers in the initial center set, the cluster centers are merged until the preset fusion end condition is met, and the first center set is obtained.
[0018] The distance calculation module is used to calculate the K minimum distances of the log sample set based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set.
[0019] The label determination module is used to determine the level label of the log sample set based on the K minimum distances of the log sample set;
[0020] The dataset composition module is used to extract a target number of log samples from all log sample sets based on the level label of each log sample set, and form a sample dataset.
[0021] The clustering module is used to determine K centroids from the sample dataset and perform K-means clustering.
[0022] Thirdly, this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the method described above.
[0023] Fourthly, this application provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method described above.
[0024] Fifthly, this application provides a computer program product that, when executed by a processor, is used to implement the method described above.
[0025] This application provides a k-means log classification method and apparatus based on distributed sample selection. By dividing a massive amount of log samples into N different log sample sets, it enables distributed and parallel processing of log samples, reducing initialization time. Furthermore, it improves the initialization phase of the k-means algorithm by proposing a method to merge the centroids of each sample set into a central centroid. This obtains the centroid of the overall sample set, and the quality of each sample set is measured by calculating the distance between each sample set and the central centroid. Subsequently, samples of different proportions are extracted according to their quality level to form a high-quality sample set, which is used to select the centroid, improving the quality of centroid selection and accelerating the initialization convergence speed.
Attached Figure Description
[0026] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0027] Figure 1 is a schematic diagram of the clustering algorithm provided in an embodiment of this application;
[0028] Figure 2 is a flowchart of the k-means log classification method based on distributed sample screening provided in an embodiment of this application;
[0029] Figure 3 is a schematic diagram of a log sample provided in an embodiment of this application;
[0030] Figure 4 is a schematic diagram of the stage-one MapReduce processing flow of the k-means log classification method based on distributed sample screening provided in an embodiment of this application;
[0031] Figure 5 is a schematic diagram of the stage-two MapReduce processing flow of the k-means log classification method based on distributed sample screening provided in an embodiment of this application;
[0032] Figure 6 is a schematic diagram of the initial centroid acquisition process of the k-means log classification method based on distributed sample screening provided in an embodiment of this application;
[0033] Figure 7 is a schematic diagram of the structure of the k-means log classification device based on distributed sample screening provided in an embodiment of this application;
[0034] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0035] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0036] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0037] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with relevant laws, regulations and standards, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0038] It should be noted that the k-means log classification method and apparatus based on distributed sample screening provided in this application can be used in the field of big data technology, or in any field other than big data technology. The application fields of the k-means log classification method and apparatus based on distributed sample screening provided in this application are not limited.
[0039] The k-means clustering algorithm is an iterative clustering analysis algorithm. Its steps are as follows: First, the data is pre-divided into K groups. Then, K objects are randomly selected as initial cluster centers. Next, the distance between each object and each seed cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the cluster center is recalculated based on the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition could be that no objects (or fewer than a minimum number of objects) are reassigned to different clusters.
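The iteration described above can be illustrated with a minimal sketch. It is not the implementation of this application: it uses the common batch variant, in which centers are recomputed after all objects have been assigned rather than after each single assignment, and it assumes Euclidean distance on numeric tuples. The names kmeans, squared_distance, and mean are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    """Batch k-means: assign every point, then recompute every center."""
    rng = random.Random(seed)
    if initial_centers is not None:
        centers = list(initial_centers)
    else:
        # Classic initialization: K objects chosen at random.
        centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Termination condition: no object changed cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return centers, assignment


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(points, 2, initial_centers=[(0, 0), (10, 10)])
```

Passing initial_centers explicitly is the hook through which better initial centroids, such as those produced by the method of this application, would replace the random choice.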
[0040] Text clustering is the process of finding similar texts and is valuable for data mining. Traditional clustering algorithms have the following drawbacks: (1) The centroids are selected randomly in the initialization stage, so the initial centroids may be too close together or too scattered rather than evenly distributed, which slows the convergence of the algorithm. (2) Selecting outlying points or noisy data as the initial centroids hinders the convergence of subsequent clustering, resulting in an unsatisfactory clustering effect. (3) The larger the data volume, the more pronounced these defects become, and the longer the initialization takes.
[0041] To address the issues of slow convergence and unsatisfactory clustering results in existing K-means clustering algorithms when performing K-means clustering on a large number of texts, this application provides a k-means log classification method and apparatus based on distributed sample selection. The initialization process for massive log samples is processed in a distributed parallel manner, reducing initialization time. Furthermore, the initialization phase of the k-means algorithm is improved. During initialization, a method is proposed to merge the centroids of each sample set into a central centroid, obtaining the centroid of the entire sample set. The quality of the sample set is measured by calculating the distance between each sample set and the central centroid. Then, samples of different proportions are extracted according to the quality level to form a high-quality sample set, which is used to select centroids. Finally, in the centroid selection process, a method of deleting neighboring points around the centroid is used to isolate the centroid, and a maximum distance product strategy combined with a virtual centroid method is used to reasonably distribute the selected centroid positions. This accelerates the initialization convergence speed and improves the quality of the initial centroid selection.
[0042] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will be described below with reference to the accompanying drawings.
[0043] For example, Figure 1 is a schematic diagram of the clustering algorithm provided in an embodiment of this application. As shown in Figure 1, taking 10 sample logs (Z1 to Z10) as an example, Z1 to Z8 are clustered into one cluster, and Z9 and Z10 are clustered into another. If there is an outlier that does not belong to either of these two clusters, the sample log represented by that outlier can be identified as an abnormal log. Here, "log" is used in its computer science sense: application platforms generate logs during operation, and each log line records the date, time, user, and action description. This clustering algorithm can be applied to text clustering scenarios, clustering massive amounts of historical error logs into multiple groups. After high-quality centroids are obtained from the large amount of input log data using this algorithm, K log groups can be obtained quickly and effectively. By organizing the error logs of each group, users' error-prone habits when using the application, and the corresponding solutions, can be analyzed to form solution assets that serve as a reference for maintenance personnel when they later analyze similar problems.
[0044] Figure 2 is a flowchart of the k-means log classification method based on distributed sample screening provided in an embodiment of this application. This method can be applied to text clustering scenarios. As shown in Figure 2, the method may specifically include the following steps:
[0045] Step S201: Obtain N log sample sets and the corresponding replicas of each log sample set, and determine K centers in each log sample set. Each log sample set includes at least one log sample, and the replicas are the same as the log samples in the corresponding log sample sets. N and K are positive integers.
[0046] In this embodiment, the log samples contained in each log sample set may be different. For example, the log sample set R1 includes log samples R11, R12, and R13, while the log sample set R2 includes log samples R21, R22, and R23.
[0047] The number of log sample sets can be more than two. Before proceeding to the next step, each log sample set can be copied to obtain a corresponding copy, which is exactly the same as the log sample set. When processing the log sample set later, the log sample set may change, for example, the number of log samples may decrease. This change can be observed through the copy, which makes it easier to perform targeted operations on the copy. This will be explained in detail later.
[0048] In this embodiment, for a massive number of log samples, these massive log samples can be classified into different sets, thereby forming N log sample sets in this embodiment.
[0049] In this embodiment, since each log sample set may contain multiple log samples, there will be a central log sample among these log samples, which can be called the center. The log samples in the log sample set can be filtered by region, for example, divided into region A, region B, and region C. Thus, region A has a center, region B has a center, and region C has a center.
[0050] Step S202: Based on the K centers in each log sample set, the log samples in the replica of the log sample set are divided into clusters to obtain the K clusters of the replica and the cluster center of each cluster.
[0051] In this embodiment, the log sample set has K centers (each center can be understood as a log sample). In addition, there may be other log samples in the log sample set. These log samples can be clustered based on the K centers to obtain K clusters.
[0052] For example, taking a log sample set R1 with two centers (e.g., log samples R11 and R12 are both centers), when there are log samples R13, R14, and R15 in the replica corresponding to the log sample set R1, if log sample R13 is near the center R12, then log sample R13 and center R12 are assigned to the first cluster. If log sample R14 and log sample R15 are near the center R11, then log samples R14, R11, and R15 are assigned to the second cluster. In this way, the replica has two clusters (i.e., the first cluster and the second cluster).
[0053] For example, Figure 3 is a schematic diagram of a log sample provided in an embodiment of this application. As shown in Figure 3, by digitizing the log samples and representing each log sample as a coordinate point, Z11 and Z12 can be selected as the centers. Log sample Z13 is near the center Z12, so log sample Z13 and the center Z12 are assigned to the first cluster. Log samples Z14 and Z15 are near the center Z11, so log samples Z14, Z11, and Z15 are assigned to the second cluster.
[0054] Step S203: The cluster centers of all the replicas are combined into an initial center set. The cluster centers are then merged according to the cosine distance between each cluster center in the initial center set until the preset fusion end condition is met, thus obtaining the first center set.
[0055] In this embodiment, the cluster centers and log samples can be vectorized, and the cosine distance can be calculated using the vectorized cluster centers and log samples. For example, when the cosine distance between two cluster centers is too small (e.g., less than a certain threshold), the two cluster centers can be merged to obtain a new cluster center.
[0056] A fusion termination condition can be set. For example, when the number of cluster centers in the initial center set has been reduced to a certain threshold, the fusion stops, and the resulting center set is used as the first center set.
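The vectorization and cosine distance mentioned for step S203 can be sketched as follows. The application does not specify a vectorization scheme, so the whitespace tokenizer and term-count vector below are assumptions, as are the function names vectorize and cosine_distance. Cosine distance is taken here in its common form, 1 minus cosine similarity.

```python
import math
from collections import Counter


def vectorize(log_line, vocabulary):
    """Term-count vector of a log line over a fixed vocabulary (assumed scheme)."""
    counts = Counter(log_line.lower().split())
    return [float(counts[term]) for term in vocabulary]


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: no direction means no similarity
    return 1.0 - dot / (norm_u * norm_v)


vocabulary = ["error", "timeout", "login"]
a = vectorize("ERROR timeout timeout", vocabulary)
b = vectorize("login error", vocabulary)
distance = cosine_distance(a, b)
```

Two logs that share no vocabulary terms get a distance of 1.0, and two logs with proportional term counts get a distance close to 0.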
[0057] Step S204: Calculate the K minimum distances of the log sample set based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set.
[0058] In this embodiment, the first center set may include multiple cluster centers. The cosine distance between each center in the log sample set and each of these cluster centers can be calculated, and the cluster center closest to that center can then be found. The cosine distance between that cluster center and the center is the minimum distance corresponding to the center, which can be denoted dist_i. That is, the minimum distance of the i-th center is dist_i, and the minimum distance of the K-th center is dist_K.
[0059] Step S205: Determine the level label of the log sample set based on the K minimum distances of the log sample set.
[0060] In this embodiment, since the massive log samples are divided into N different log sample sets, each log sample set has K corresponding minimum distances. Based on these K minimum distances, the level label of the corresponding log sample set can be determined.
[0061] For example, the level labels of the log sample set can be divided into level 1, level 2, level 3, etc. The larger the K minimum distances are, the higher the level number assigned to the corresponding log sample set.
[0062] Step S206: Based on the level label of each log sample set, extract the target number of log samples from all log sample sets to form a sample dataset.
[0063] In this embodiment, extraction ratios can be set. For example, the extraction ratio for level label 1 can be set to 60%, the extraction ratio for level label 2 to 40%, and so on. Since the level labels of each log sample set are different, the target number of samples extracted will also be different. The final extracted log samples are combined to form a sample dataset.
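A minimal sketch of ratio-based extraction follows. The 60% and 40% ratios come from the example above. The function name extract_by_level, the use of uniform random sampling within a set, and the rounding of the sample count are assumptions, since the application does not state how individual samples are picked.

```python
import random


def extract_by_level(sample_sets, level_labels, ratios, seed=0):
    """Draw round(len(set) * ratio) samples from each set, by its level label."""
    rng = random.Random(seed)
    sample_dataset = []
    for samples, level in zip(sample_sets, level_labels):
        count = round(len(samples) * ratios[level])
        # Assumption: samples within a set are drawn uniformly at random.
        sample_dataset.extend(rng.sample(samples, count))
    return sample_dataset


sample_sets = [list(range(10)), list(range(100, 110))]
level_labels = [1, 2]            # one level label per log sample set
ratios = {1: 0.6, 2: 0.4}        # extraction ratio per level label
sample_dataset = extract_by_level(sample_sets, level_labels, ratios)
```

With two sets of ten samples each, six samples are drawn from the level 1 set and four from the level 2 set.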
[0064] Step S207: Determine K centroids from the sample dataset and perform K-means clustering.
[0065] In this embodiment, the k-means clustering algorithm has been described above. It initially selects K objects randomly as the initial cluster centers. In this embodiment, the selected cluster centers (i.e., the initial centroids) are used as the K objects instead of the k-means clustering algorithm randomly selecting K objects. This can improve the convergence speed of the algorithm and improve the clustering effect.
[0066] This application's embodiments divide massive log samples into N different log sample sets, enabling distributed and parallel processing of log samples and reducing initialization time. Simultaneously, the initialization phase of the k-means algorithm is improved by proposing a method to merge the centroids of each sample set into a central centroid during initialization. This obtains the centroid of the overall sample set, and the quality of the sample set is measured by calculating the distance between each sample set and the central centroid. Subsequently, samples of different proportions are extracted according to their quality level to form a high-quality sample set, which is used to select the centroid, improving the quality of centroid selection and accelerating the initialization convergence speed.
[0067] In some embodiments, step S201 can be implemented through the following steps: obtaining an initial log set, which includes at least one log sample; dividing the initial log set equally into N log sample sets; backing up each log sample set to obtain a copy of each log sample set; randomly selecting a first target sample from the log sample set and deleting a first log sample from the log sample set whose cosine distance to the first target sample is less than a first preset threshold; obtaining the center of the deleted first log sample as the first center; randomly selecting a second target sample from the log sample set and deleting a second log sample from the log sample set whose cosine distance to the second target sample is less than a first preset threshold; obtaining the center of the deleted second log sample as the second center; randomly selecting a Kth target sample from the log sample set and deleting a Kth log sample from the log sample set whose cosine distance to the Kth target sample is less than a first preset threshold; obtaining the center of the deleted Kth log sample as the Kth center.
[0068] In this embodiment, the initial log set is divided into several datasets dataset_i (for example, dataset1, dataset2, and so on). The i-th log sample set dataset_i is then backed up to obtain its copy, dataset_bak_i.
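The split-and-backup step can be sketched as below. The function name split_and_backup is hypothetical, and placing any leftover samples in the last set when the total is not divisible by N is an assumption, since the application only says the set is divided equally.

```python
import copy


def split_and_backup(initial_logs, n):
    """Split logs into n consecutive sets and make an independent copy of each."""
    size = len(initial_logs) // n
    datasets = [initial_logs[i * size:(i + 1) * size] for i in range(n)]
    # Assumption: leftover samples (if any) are appended to the last set.
    datasets[-1].extend(initial_logs[n * size:])
    backups = [copy.deepcopy(d) for d in datasets]
    return datasets, backups


logs = ["a", "b", "c", "d", "e", "f", "g"]
datasets, backups = split_and_backup(logs, 3)
datasets[0].pop()  # later deletions affect the working set, not the backup
```

Because each backup is an independent copy, samples deleted from a working set during center selection remain available in its replica.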
[0069] Specifically, a first target sample Cn1 can be randomly selected from dataset1. Log samples in dataset1 whose cosine distance to the first target sample Cn1 is less than a first preset threshold T1 are deleted, and the center of these first log samples is denoted as C1. Then, a second target sample Cn2 can be randomly selected from dataset1. Second log samples in dataset1 whose cosine distance to the second target sample Cn2 is less than the first preset threshold T1 are deleted, and the center of these second log samples is denoted as C2. This process continues until the Kth center Ck is obtained. The K centers are backed up as klist_bak1.
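The center selection just described can be sketched as follows, under stated assumptions: samples are already numeric vectors, the "center" of the deleted samples is taken to be their component-wise mean, and the threshold T1 is positive. The names pick_k_centers, mean_vector, and cosine_distance are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def pick_k_centers(dataset, k, threshold, seed=0):
    """Repeat K times: pick a random target, delete its neighbors, keep their mean."""
    rng = random.Random(seed)
    remaining = list(dataset)   # working set; shrinks as neighbors are deleted
    centers = []                # plays the role of klist_bak
    while len(centers) < k and remaining:
        target = rng.choice(remaining)
        near = [v for v in remaining if cosine_distance(v, target) < threshold]
        remaining = [v for v in remaining if cosine_distance(v, target) >= threshold]
        # Assumption: the center of the deleted samples is their mean.
        centers.append(mean_vector(near))
    return centers, remaining


dataset1 = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
klist_bak1, leftover = pick_k_centers(dataset1, k=2, threshold=0.1)
```

Deleting the neighbors of each target keeps later random picks away from regions that already have a center, which is what spreads the K centers apart.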
[0070] This application's embodiments improve processing speed and shorten processing time by processing each log sample set separately. Simultaneously, by continuously filtering log samples in the log sample set using cosine distance, it is possible to accurately obtain K centroid samples, ensuring the accuracy of subsequent initial centroid selection and improving the K-means clustering effect.
[0071] In some embodiments, step S202 can be implemented by the following steps: calculating the cosine distance between each log sample in the replica and each center; and assigning each log sample to the same cluster as the center to which its cosine distance is smallest, so as to obtain K clusters and the cluster center of each cluster.
[0072] In this embodiment, the log samples in the replica corresponding to each log sample set are all complete (not deleted; as mentioned in step S201 above, the first log sample, the second log sample, etc., will be deleted). After K centers are identified in the log sample set, the corresponding centers can be found in the replicas, and the cluster to which a log sample belongs can be determined based on the cosine distance between each log sample and each center.
[0073] For example, referring again to Figure 3, if the cosine distance between log sample Z13 and center Z12 is small, then log sample Z13 and center Z12 form a cluster. Similarly, if the cosine distances from log samples Z14 and Z15 to center Z11 are small, then log samples Z14 and Z15 and center Z11 form a cluster.
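The assignment of replica samples to their nearest center can be sketched as below. The application does not state how the cluster center of each resulting cluster is computed, so using the mean of the cluster members is an assumption, as are the names assign_to_clusters, mean_vector, and cosine_distance.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def assign_to_clusters(replica, centers):
    """Put each replica sample in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for sample in replica:
        nearest = min(
            range(len(centers)),
            key=lambda j: cosine_distance(sample, centers[j]),
        )
        clusters[nearest].append(sample)
    # Assumption: a cluster center is the mean of its members; an empty
    # cluster keeps the center it was built around.
    cluster_centers = [
        mean_vector(members) if members else tuple(centers[j])
        for j, members in enumerate(clusters)
    ]
    return clusters, cluster_centers


replica = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
centers = [(0.95, 0.05), (0.05, 0.95)]
clusters, cluster_centers = assign_to_clusters(replica, centers)
```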
[0074] This application embodiment divides log samples into corresponding clusters based on the cosine distance, which can prevent log samples that are far from a certain center from being divided into the same cluster as other centers, effectively improving the accuracy of clustering and further improving the accuracy of initial centroid selection.
[0075] In some embodiments, step S203 can be implemented by the following steps: comparing the cosine distances of each cluster center in the initial center set, and obtaining the first and second cluster centers with the closest cosine distances; merging the first and second cluster centers as new cluster centers, and calculating the total number of all cluster centers in the current initial center set; if the total number is K, stopping the merging process of cluster centers to obtain the first center set; if the total number is not K, continuing the merging process to obtain new cluster centers.
[0076] In this embodiment, assume there are several Map functions. Each Map function partitions the log samples in one replica into clusters, yielding K clusters for that replica and the cluster center of each cluster. The initial center set therefore contains (number of Map functions × K) cluster centers. Let the initial center set be list_cen. The elements in list_cen are then merged as follows:
[0077] (1) Calculate the pairwise cosine distances between the cluster centers in list_cen, find the two cluster centers with the smallest cosine distance, and merge them. The merging method is: given cluster centers C1 and C2, the new cluster center after merging is Cnew = (C1 + C2) / 2.
[0078] (2) Delete the two cluster centers that were just merged from list_cen, and put the new cluster center into list_cen. At this point, the number of cluster centers in list_cen is (number of Map functions × K) − 1.
[0079] Following the same principle, the pairwise cosine distances between the cluster centers in list_cen are calculated again, the two closest cluster centers are removed from list_cen, and they are merged into a new cluster center that is added to list_cen. This process is repeated until the number of centers in list_cen is K. Finally, the K centers are output and denoted as the central center set Kcenter.
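The merging loop described in steps (1) and (2) can be sketched as follows. The midpoint rule Cnew = (C1 + C2) / 2 and the stop-at-K condition come from the text. The function names merge_centers and cosine_distance are hypothetical, and ties between equally close pairs are resolved in favor of the first pair found, which is an assumption.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)


def merge_centers(list_cen, k):
    """Repeatedly merge the two closest centers until only k remain."""
    centers = [tuple(c) for c in list_cen]
    while len(centers) > k:
        best = None  # (distance, i, j) of the closest pair so far
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                d = cosine_distance(centers[i], centers[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Cnew = (C1 + C2) / 2
        merged = tuple((a + b) / 2 for a, b in zip(centers[i], centers[j]))
        centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
        centers.append(merged)
    return centers


# Two Map functions with K = 2 each give 2 * 2 = 4 initial centers.
list_cen = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
Kcenter = merge_centers(list_cen, k=2)
```

Each pass costs a full pairwise comparison, so the loop is quadratic per merge. That is acceptable here because list_cen holds only (number of Map functions × K) centers rather than the raw log samples.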
[0080] This application embodiment utilizes the cosine distance to merge the cluster centers in the replicas corresponding to each log sample set into a new cluster center, which can improve the overall quality of cluster center acquisition and further improve the quality of subsequent initial centroid selection.
[0081] In some embodiments, step S204 can be implemented by the following steps: selecting the Kth log sample from the replicas of the log sample set as the Kth element based on the K centers in the log sample set; obtaining the target cluster center corresponding to the Kth element from the first center set based on the Kth element; and calculating the cosine distance between the target cluster center and the Kth element as the Kth minimum distance.
[0082] In this embodiment, the processing is performed using the Map function. The inputs are the central center set Kcenter output by the first-stage Reduce in the above embodiment and the backup of the K centers of a log sample set, for example klist_bak1 for dataset1 (the other datasets and their backups klist_bak_i are processed in the same way as described below).
[0083] (1) Calculate the minimum distance between klist_bak1 and Kcenter using the following method:
[0084] Take an element from klist_bak1 (that is, select the Kth element from the copy of the log sample set), find the cluster center in Kcenter (the first center set) that is closest to this element in terms of cosine distance, and calculate the cosine distance dist1 between that cluster center and the element. At the same time, mark the center taken from Kcenter with the use tag.
[0085] (2) Take the next element from klist_bak1 and find its nearest center in Kcenter. If this center does not have the use tag, calculate the cosine distance dist2 between the element and the center, and add the use tag to the center. If the center already has the use tag, stop the search.
[0086] Following this principle, continue until all elements of klist_bak1 have been retrieved, obtaining K minimum distances dist_i.
[0087] The output of the Map function:
[0088] Case 1: The center searched from Kcenter is different each time, that is, the elements in klist_bak1 have a one-to-one relationship with the center in Kcenter. Then output the sum of the K minimum distances, dist_sum1.
[0089] Case 2: A center found in Kcenter is repeated, that is, the elements in klist_bak1 are not in a one-to-one relationship with the centers in Kcenter. The output is then empty.
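The second-stage Map step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names `cosine_distance` and `map_min_distance_sum` are hypothetical, log samples are assumed to be already vectorized as non-zero numeric vectors, and the empty output of Case 2 is represented by `None`.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance = 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def map_min_distance_sum(klist_bak: List[Vector],
                         kcenter: List[Vector]) -> Optional[float]:
    """Sum the K minimum distances between klist_bak and Kcenter.

    Returns None (the "empty" output of Case 2) as soon as two elements
    share the same nearest center.
    """
    used = set()      # indices of centers in kcenter carrying the "use" tag
    dist_sum = 0.0
    for element in klist_bak:
        distances = [cosine_distance(element, c) for c in kcenter]
        nearest = min(range(len(kcenter)), key=distances.__getitem__)
        if nearest in used:
            return None           # Case 2: not a one-to-one relationship
        used.add(nearest)         # add the "use" tag to this center
        dist_sum += distances[nearest]
    return dist_sum               # Case 1: dist_sum for this sample set
```

A call such as `map_min_distance_sum(klist_bak1, Kcenter)` would be issued once per log sample set, so the N calls can run in parallel.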
[0090] This application embodiment calculates the minimum distance between each sample set and the central centroid, which can be used to measure the quality of the sample set and assign different level labels to log sample sets of different quality. This can improve the quality of subsequent initial centroid selection and further improve the clustering effect of the K-means algorithm.
[0091] In some embodiments, step S205 can be implemented by the following steps: summing the K minimum distances of the log sample set to obtain the cumulative distance sum corresponding to the log sample set; forming a distance set from the cumulative distance sums of all log sample sets, and sorting the cumulative distance sums in the distance set according to their magnitude; and determining the level label of the log sample set corresponding to each cumulative distance sum according to its sorting order.
[0092] In this embodiment, the process can be handled using the concept of the Reduce function:
[0093] (1) Input the set of minimum distance sums output by all map functions, list_dist_sum.
[0094] (2) Sort the elements in list_dist_sum in ascending order. Take the first element (the smallest cumulative sum) and label the sample set dataseti to which it belongs as level 1. In the same way, take the second element (the second smallest cumulative sum) and label the sample set dataseti to which it belongs as level 2. Label the sample sets dataseti to which the third and all subsequent cumulative sums belong as level 3.
[0095] The Reduce function outputs each sample set dataseti together with its assigned level label.
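The labeling rule of this Reduce step can be sketched as below. The function name `reduce_assign_levels` and the dictionary-based input are assumptions made for illustration. Sample sets whose Map output was empty are simply absent from the input here, since the text does not state how they are labeled.

```python
from typing import Dict


def reduce_assign_levels(list_dist_sum: Dict[str, float]) -> Dict[str, int]:
    """Map each sample set name to a level label.

    The smallest cumulative sum gets level 1, the second smallest level 2,
    and every other sample set level 3.
    """
    # Sort sample set names by their cumulative distance sum, ascending.
    ordered = sorted(list_dist_sum, key=list_dist_sum.get)
    return {name: min(rank, 3) for rank, name in enumerate(ordered, start=1)}
```

For example, with sums `{"dataset1": 0.5, "dataset2": 0.1, "dataset3": 0.9}`, dataset2 receives level 1, dataset1 level 2, and dataset3 level 3.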
[0096] In this embodiment, the log sample sets are sorted by the sum of their K minimum distances, which determines the quality of each log sample set. This allows for targeted log sample extraction from log sample sets of different quality, improving the quality of the log samples used to select the initial centroids and ensuring the quality of the initial centroid selection.
[0097] In some embodiments, step S206 can be implemented by the following steps: extracting a proportion of log samples corresponding to the level label from the log sample set; forming a high-quality sample set from the log samples extracted from all log sample sets; and extracting a target proportion of log samples from the high-quality sample set to form a sample dataset.
[0098] In this embodiment, the first-stage sample sets dataseti are sampled according to their level labels. If the level label of dataseti is 1, 80% of the log samples in dataseti are randomly selected as input. If the level label is 2, 60% of the log samples are randomly selected. If the level label is 3, 40% of the log samples are randomly selected. The selected samples are then combined into a high-quality sample set list_g. Finally, a sample dataset list_g1 is randomly selected from list_g, accounting for 30% of list_g (that is, the target proportion is 30% in this example).
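The level-dependent sampling can be sketched as follows. The proportions 80%, 60%, 40% and 30% are the example values from the text; the function name `build_sample_dataset`, the rounding of sample counts to the nearest integer, and the `seed` parameter are assumptions added for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Example proportions from the embodiment: level label -> share of samples kept.
LEVEL_PROPORTION = {1: 0.8, 2: 0.6, 3: 0.4}


def build_sample_dataset(datasets: Dict[str, Sequence],
                         levels: Dict[str, int],
                         target_proportion: float = 0.3,
                         seed=None) -> Tuple[List, List]:
    """Return (list_g, list_g1) built from the labeled sample sets."""
    rng = random.Random(seed)
    list_g: List = []
    for name, samples in datasets.items():
        # Number of samples to draw; rounding is an assumption of this sketch.
        count = round(len(samples) * LEVEL_PROPORTION[levels[name]])
        list_g.extend(rng.sample(list(samples), count))
    # Draw the target proportion of the high-quality set without replacement.
    list_g1 = rng.sample(list_g, round(len(list_g) * target_proportion))
    return list_g, list_g1
```

With three sample sets of 10 logs each labeled 1, 2 and 3, list_g holds 8 + 6 + 4 = 18 samples and list_g1 holds 5 of them.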
[0099] This application's embodiments determine the quality of each log sample set based on the level of its rank label, thereby extracting samples of different proportions to form a high-quality sample set for selecting initial centroids. This improves the accuracy of initial centroid selection and enhances the clustering performance of the subsequent K-means clustering algorithm.
[0100] In some embodiments, step S207 can be implemented by the following steps: randomly selecting a centroid neighboring point from the sample dataset, deleting a first log sample from the sample dataset whose cosine distance to the centroid neighboring point is less than a preset second threshold, and obtaining the center of the first log sample as the first centroid; obtaining the farthest centroid neighboring point from the sample dataset that is farthest from the first centroid, deleting a second log sample from the sample dataset whose cosine distance to the farthest centroid neighboring point is less than a preset second threshold, and obtaining the center of the second log sample as the second centroid; determining K centroids from the sample dataset based on the distances between the remaining log samples in the sample dataset and the first and second centroids.
[0101] In this embodiment, a sample centroid neighboring point Cen1 is randomly selected from the set list_g1. The first log samples in list_g1 whose cosine distance to Cen1 is less than a preset second threshold T2 are deleted, and the center of these first log samples is denoted as centroid Cz1. Then, the sample centroid neighboring point Cen2 that is farthest from Cz1 is found from the set list_g1. The second log samples in list_g1 whose cosine distance to Cen2 is less than a preset second threshold T2 are deleted, and the center of these second log samples is denoted as centroid Cz2.
[0102] Furthermore, in some other embodiments, K centroids can be determined by the following steps: for each remaining log sample, multiply together its distances to all selected centroids to obtain the distance product of that log sample; form a distance set from the distance products of all remaining log samples; select the maximum distance product from the distance set; take the log sample corresponding to the maximum distance product as the next new centroid neighbor; delete the third log samples in the sample dataset whose cosine distance to the new centroid neighbor is less than the preset second threshold, and obtain the center of the third log samples as the third centroid; determine whether the number of centroids is K; if it is not K, continue to select new centroid neighbors and determine new centroids based on them.
[0103] In this embodiment, the specific steps can be divided into the following:
[0104] Step 1: Calculate the distance between each remaining sample data point in list_g1 and all selected centroids. Record the product of the distances between a log sample and all centroids as the distance product of that log sample. The distance products of all log samples form a set d. Then, select the distance product with the largest value from set d and use the log sample data corresponding to this distance product as the next centroid neighbor. Simultaneously, delete the log samples surrounding the newly selected centroid neighbor and use their centers as the centroids. The distance product formula is expressed as:
[0105] The distance product set d is: list{dist(v, Cz1) * dist(v, Cz2) * ... * dist(v, Czk)}, where dist(v, Czi) is the distance between node v and centroid Czi, v ranges over the remaining log samples, and Czk is the most recently selected centroid.
[0106] In this embodiment, step 1 is repeated until K centroids are found. If fewer than K centroids have been found once all log samples have been deleted from list_g1, the remaining required centroids are randomly selected from (list_g - list_g1). Finally, the K centroids obtained are used as the initial centroids for K-means clustering.
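The centroid selection described above can be sketched as a single loop. This is an illustrative sketch under several assumptions: all names are hypothetical, samples are non-zero numeric vectors, the "center" of a group of deleted samples is taken to be their arithmetic mean, and cosine distance is used for the distance product as well as for the deletion test. Choosing Cen2 as the sample farthest from Cz1 is treated as the one-centroid case of the maximum distance product. The parameter `fallback_pool` stands for (list_g - list_g1).

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance = 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean, used here as the center of a group of samples."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def select_initial_centroids(list_g1: List[Vector], fallback_pool: List[Vector],
                             k: int, t2: float, seed=None) -> List[List[float]]:
    rng = random.Random(seed)
    remaining = list(list_g1)
    centroids: List[List[float]] = []
    while len(centroids) < k and remaining:
        if not centroids:
            # First centroid neighboring point: chosen at random.
            pick = rng.randrange(len(remaining))
        else:
            # Later ones: the sample with the largest product of distances
            # to all centroids selected so far.
            pick = max(range(len(remaining)),
                       key=lambda i: math.prod(
                           cosine_distance(remaining[i], c) for c in centroids))
        point = remaining[pick]
        # Samples closer than T2 to the chosen point (and the point itself).
        near = [i == pick or cosine_distance(s, point) < t2
                for i, s in enumerate(remaining)]
        deleted = [s for s, flag in zip(remaining, near) if flag]
        remaining = [s for s, flag in zip(remaining, near) if not flag]
        centroids.append(mean_vector(deleted))
    # list_g1 exhausted before K centroids were found: draw the rest at random.
    missing = k - len(centroids)
    if missing > 0:
        centroids.extend(list(v) for v in rng.sample(list(fallback_pool), missing))
    return centroids
```

Note that `random.Random.sample` raises `ValueError` if `fallback_pool` holds fewer samples than are still missing; the text does not describe that situation.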
[0107] This application embodiment isolates the centroid by deleting neighboring points around it, and uses a maximum distance product strategy combined with a virtual centroid to reasonably disperse the selected centroid positions, thereby accelerating the initial convergence speed and improving the quality of the initial centroid selection.
[0108] For example, the overall process of the k-means log classification method based on distributed sample screening provided in this application includes a Map function processing stage and a Reduce function processing stage. Multiple Map functions can be configured, each performing the same processing on one log sample set in parallel. Specifically, the method can be divided into two stages of MapReduce processing. Figure 4 is a schematic diagram of the first-stage MapReduce processing flow of the k-means log classification method based on distributed sample screening provided in this application embodiment. As shown in Figure 4, the first-stage MapReduce process is divided into Map functions and a Reduce function. The Map function processing stage includes steps S4011, S4012, ..., S401i (where i takes values from 1 to N). The Reduce function processing stage includes step S402.
[0109] For example, Figure 5 is a schematic diagram of the second-stage MapReduce processing flow of the k-means log classification method based on distributed sample screening provided in this application embodiment. As shown in Figure 5, the second-stage MapReduce process is divided into Map functions and a Reduce function. The Map function processing stage includes steps S5011, S5012, ..., S501i (where i takes values from 1 to N). The Reduce function processing stage includes step S502.
[0110] For example, Figure 6 is a schematic diagram of the initial centroid acquisition process of the k-means log classification method based on distributed sample screening provided in this application embodiment. As shown in Figure 6, the process includes the following steps. Step S601: Based on the level label of each first-stage log sample set dataseti, extract the corresponding proportion of log samples from it; denote all extracted log samples as the high-quality sample set list_g, and randomly extract a sample dataset list_g1 from list_g. Step S602: Randomly select a sample centroid neighboring point Cen1 from list_g1, delete the log samples in list_g1 whose cosine distance to Cen1 is less than the threshold T2, and denote the center of the deleted samples as centroid Cz1. Then find the sample centroid neighboring point Cen2 in list_g1 that is farthest from Cz1, delete the log samples in list_g1 whose cosine distance to Cen2 is less than the threshold T2, and denote the center of the deleted samples as centroid Cz2. Step S603: Calculate the distance between each remaining log sample in list_g1 and all selected centroids. Record the product of the distances between a log sample and all centroids as the distance product of that log sample. All distance products form a distance product set d. Select the largest distance product in set d and use the corresponding log sample as the next centroid neighbor. Then delete the log samples surrounding the newly selected centroid neighbor and use their center as the new centroid. Step S604: Determine whether the number of centroids is K. Step S605: Determine whether the sample dataset list_g1 is empty. Step S606: Randomly select the remaining required centroids from (list_g - list_g1). Step S607: Output the initial centroids.
[0111] This application's embodiments reduce initialization time by implementing distributed parallel processing for the initialization of massive log samples. Simultaneously, improvements are made to the initialization phase of the k-means algorithm. During initialization, a method is proposed to merge the centroids of each sample set into a central centroid, obtaining the centroid of the entire sample set. The quality of the sample set is measured by calculating the distance between each sample set and the central centroid. Then, samples of different proportions are extracted according to their quality level to form a high-quality sample set, which is used to select the initial centroid. Furthermore, during the centroid selection process, a method of deleting neighboring points around the centroid is used to isolate it. Simultaneously, a maximum distance product strategy combined with a virtual centroid method is used to reasonably distribute the selected centroid positions. This accelerates the initialization convergence speed and improves the quality of the initial centroid selection.
[0112] The following are embodiments of the apparatus described in this application, which can be used to execute the embodiments of the method described in this application. For details not disclosed in the apparatus embodiments of this application, please refer to the embodiments of the method described in this application.
[0113] Figure 7 is a schematic diagram of the k-means log classification device based on distributed sample screening provided in this application embodiment. As shown in Figure 7, the log classification device 700 includes an acquisition module 710, a center determination module 720, a center set module 730, a distance calculation module 740, a label determination module 750, a dataset composition module 760, and a clustering module 770. The acquisition module 710 acquires N log sample sets and the corresponding replica of each log sample set, and determines K centers in each log sample set. Each log sample set includes at least one log sample, and each replica is identical to the log samples in its corresponding log sample set. N and K are positive integers. The center determination module 720, based on the K centers in each log sample set, clusters the log samples in the replica of that log sample set, obtaining K clusters in that replica and the cluster center of each cluster. The center set module 730 forms an initial center set from the cluster centers of all replicas, and performs a fusion process on the cluster centers based on the cosine distance between the cluster centers in the initial center set until a preset fusion termination condition is met, resulting in a first center set. The distance calculation module 740 calculates the K minimum distances of the log sample set based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set. The label determination module 750 determines the level label of the log sample set based on the K minimum distances. The dataset composition module 760 extracts a target number of log samples from all log sample sets based on the level label of each log sample set to form a sample dataset. The clustering module 770 determines the K centroids from the sample dataset and performs K-means clustering.
[0114] Optionally, the acquisition module can be specifically used for: acquiring an initial log set, which includes at least one log sample; dividing the initial log set equally into N log sample sets; backing up each log sample set to obtain a copy of each log sample set; randomly selecting a first target sample from the log sample set and deleting the first log sample in the log sample set whose cosine distance to the first target sample is less than a first preset threshold; acquiring the center of the deleted first log sample as the first center; randomly selecting a second target sample from the log sample set and deleting the second log sample in the log sample set whose cosine distance to the second target sample is less than a first preset threshold; acquiring the center of the deleted second log sample as the second center; randomly selecting a Kth target sample from the log sample set and deleting the Kth log sample in the log sample set whose cosine distance to the Kth target sample is less than a first preset threshold; acquiring the center of the deleted Kth log sample as the Kth center.
[0115] Optionally, the center determination module can be used to: calculate the cosine distance between each log sample in the replica and each center; and assign each log sample to the cluster of the center to which its cosine distance is smallest, so as to obtain K clusters and the cluster center of each cluster.
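The cluster assignment performed by the center determination module can be sketched as below. The names are hypothetical, and two points are assumptions because the text does not specify them: the cluster center is computed as the mean of the cluster's members, and a cluster that receives no samples keeps its original center.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance = 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign_to_clusters(replica: List[Vector], centers: List[Vector]
                       ) -> Tuple[List[List[Vector]], List[List[float]]]:
    """Group each sample with its nearest center and return (clusters, centers)."""
    clusters: List[List[Vector]] = [[] for _ in centers]
    for sample in replica:
        nearest = min(range(len(centers)),
                      key=lambda i: cosine_distance(sample, centers[i]))
        clusters[nearest].append(sample)
    # Assumption: cluster center = mean of members; empty clusters keep theirs.
    cluster_centers = [mean_vector(c) if c else list(centers[i])
                       for i, c in enumerate(clusters)]
    return clusters, cluster_centers
```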
[0116] Optionally, the center set module can be used to: compare the cosine distances between the cluster centers in the initial center set, and obtain the first and second cluster centers whose cosine distance is smallest; merge the first and second cluster centers into a new cluster center, and calculate the total number of cluster centers in the current initial center set; if the total number is K, stop the fusion process and obtain the first center set; if the total number is not K, continue the fusion process to obtain further new cluster centers.
[0117] Optionally, the center set module can be used to calculate new cluster centers, Cnew = (C1 + C2) / 2, where Cnew is the new cluster center, C1 is the first cluster center, and C2 is the second cluster center.
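The fusion of cluster centers with Cnew = (C1 + C2) / 2 can be sketched as follows. The function names are hypothetical, and the sketch assumes that the two merged centers are removed from the set and replaced by Cnew, which is what makes the total number of centers decrease toward K.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance = 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def fuse_centers(initial_centers: List[Vector], k: int) -> List[List[float]]:
    """Merge the closest pair of centers repeatedly until K centers remain."""
    centers = [list(c) for c in initial_centers]
    while len(centers) > k:
        # Pair (C1, C2) with the smallest cosine distance.
        i, j = min(combinations(range(len(centers)), 2),
                   key=lambda p: cosine_distance(centers[p[0]], centers[p[1]]))
        c_new = [(x + y) / 2 for x, y in zip(centers[i], centers[j])]
        # Replace C1 and C2 with Cnew (assumption of this sketch).
        centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
        centers.append(c_new)
    return centers  # the first center set (Kcenter)
```

Each pass scans all pairs, so the cost per merge grows quadratically with the number of centers; with N*K initial centers this is usually small compared with the number of log samples.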
[0118] Optionally, the distance calculation module can be used to: select the Kth log sample from the replicas of the log sample set based on the K centers in the log sample set, as the Kth element; obtain the target cluster center corresponding to the Kth element from the first center set based on the Kth element; and calculate the cosine distance between the target cluster center and the Kth element as the Kth minimum distance.
[0119] Optionally, the distance calculation module can be used to: obtain the cluster center in the first center set that has the closest cosine distance to the Kth element, and use it as the target cluster center.
[0120] Optionally, the label determination module can be used to: sum the K minimum distances of the log sample set to obtain the cumulative distance sum corresponding to the log sample set; form a distance set from the cumulative distance sums of all log sample sets, and sort the cumulative distance sums in the distance set according to their magnitude; and determine the level label of the log sample set corresponding to each cumulative distance sum according to the sorting order of each cumulative distance sum.
[0121] Optionally, the dataset composition module can be used to: extract a proportion of log samples corresponding to the level label from the log sample set; form a high-quality sample set from the log samples extracted from all log sample sets; and extract a target proportion of log samples from the high-quality sample set to form a sample dataset.
[0122] Optionally, the clustering module can be used to: randomly select a centroid neighbor from the sample dataset, delete a first log sample in the sample dataset whose cosine distance to the centroid neighbor is less than a preset second threshold, and obtain the center of the first log sample as the first centroid; obtain the farthest centroid neighbor from the sample dataset that is farthest from the first centroid, delete a second log sample in the sample dataset whose cosine distance to the farthest centroid neighbor is less than a preset second threshold, and obtain the center of the second log sample as the second centroid; and determine K centroids from the sample dataset based on the distances between the remaining log samples in the sample dataset and the first and second centroids.
[0123] Optionally, the clustering module can be used to: multiply the distances of the remaining log sample with all centroids to obtain the distance product of the log sample; form a distance set from the distance products of all log samples; select the maximum distance product from the distance set; take the log sample corresponding to the maximum distance product as the next new centroid neighbor; delete the third log sample in the sample dataset whose cosine distance to the new centroid neighbor is less than a preset second threshold; obtain the center of the third log sample as the third centroid; determine whether the number of centroids in the sample dataset is K; if not, continue to select new centroid neighbors and determine the new centroid based on the new centroid neighbors.
[0124] The apparatus provided in this application embodiment can be used to execute the methods in the above embodiments, and its implementation principle and technical effect are similar, so they will not be described again here.
[0125] It should be noted that the division of the various modules in the above device is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software via processing element calls; they can be fully implemented in hardware; or some modules can be implemented by processing element calls to software, while others are implemented in hardware. For example, the acquisition module can be a separate processing element, or it can be integrated into a chip in the above device. Alternatively, it can be stored as program code in the memory of the above device, and its function can be called and executed by a processing element. The implementation of other modules is similar. Moreover, these modules can be fully or partially integrated together, or they can be implemented independently. The processing element here can be an integrated circuit with signal processing capabilities. In the implementation process, each step of the above method or each of the above modules can be completed through integrated logic circuits in the hardware of the processor element or through software instructions.
[0126] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 8, the electronic device 800 includes at least one processor 801, a memory 802, a bus 803, and a communication interface 804. The processor, communication interface, and memory communicate with each other via the bus. The communication interface is used to communicate with other devices. This communication interface includes a communication interface for data transmission and a display interface or operation interface for human-computer interaction. The processor executes computer instructions stored in the memory, specifically performing the relevant steps in the methods described in the above embodiments.
[0127] The processor may be a central processing unit, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The electronic device may include one or more processors of the same type, such as one or more CPUs; or it may include processors of different types, such as one or more CPUs and one or more ASICs.
[0128] Memory is used to store instructions executed by a computer. Memory may include high-speed RAM, and may also include non-volatile memory, such as at least one disk drive.
[0129] This embodiment also provides a computer-readable storage medium storing computer instructions, which, when executed by at least one processor of an electronic device, enable the electronic device to perform the methods provided in the various embodiments described above.
[0130] This embodiment also provides a computer program product including computer instructions stored in a readable storage medium. At least one processor of an electronic device can read the computer instructions from the readable storage medium, and the at least one processor executes the computer instructions to cause the electronic device to perform the methods provided in the various embodiments described above.
[0131] In this application, "at least one" means one or more, and "more than one" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone, where A and B can be singular or plural. The character " / " generally indicates an "or" relationship between the preceding and following related objects; in formulas, the character " / " indicates a "division" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can represent: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.
[0132] It is understood that the various numerical designations used in the embodiments of this application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of this application. In the embodiments of this application, the order of the above-mentioned process numbers does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0133] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.
[0134] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
Claims
1. A kmeans log classification method based on distributed sample screening, characterized in that the method includes: Obtain N log sample sets and a copy of each log sample set. By deleting log samples whose cosine distance to the centroid of a randomly selected sample is less than a first preset threshold, obtain the center of the deleted log samples to determine K centers in each log sample set. The log sample set includes at least one log sample. The copy is the same as the log samples in the corresponding log sample set. N and K are positive integers. Based on the K centers in each log sample set, the log samples in the replica of the log sample set are divided into clusters to obtain the K clusters of the replica and the cluster center of each cluster; The cluster centers of all replicas are combined into an initial center set. The cosine distances between the cluster centers in the initial center set are compared, and the first and second cluster centers with the closest cosine distance are obtained. The first and second cluster centers are merged to form a new cluster center, and the total number of cluster centers in the current initial center set is calculated. If the total number is K, the merging process is stopped, and a first center set is obtained. If the total number is not K, the merging process continues to obtain a new cluster center. Based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set, the K minimum distances of the log sample set are calculated. Sum the K minimum distances of the log sample set to obtain the cumulative distance sum of the log sample set; The cumulative distance sums of all log sample sets are combined into a distance set, and the cumulative distance sums in the distance set are sorted according to their magnitude.
Based on the sorting order of each cumulative distance sum, the grade label of the log sample set corresponding to that cumulative distance sum is determined; where, the smaller the cumulative distance sum, the higher the sample quality grade indicated by the corresponding grade label; Extract a proportion of log samples corresponding to the level label from the log sample set; Log samples extracted from all log sample sets will form a high-quality sample set. Log samples of a target proportion will be extracted from this high-quality sample set to form a sample dataset. K centroids are determined from the sample dataset, and K-means clustering is performed.
2. The method of claim 1, wherein, The step of merging the first cluster center and the second cluster center to form a new cluster center includes: Cnew = (C1 + C2) / 2 In the above formula, Cnew is the new cluster center, C1 is the first cluster center, and C2 is the second cluster center.
3. The method of claim 1, wherein, The step of calculating the K minimum distances of the log sample set based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set includes: Based on the K centers in the log sample set, select the Kth log sample from the replicas of the log sample set as the Kth element; Based on the Kth element, obtain the target cluster center corresponding to the Kth element from the first center set; Calculate the cosine distance between the center of the target cluster and the Kth element, and use it as the Kth minimum distance.
4. The method of claim 3, wherein, The step of obtaining the target cluster center corresponding to the Kth element from the first center set based on the Kth element includes: Obtain the cluster center in the first center set that has the closest cosine distance to the Kth element, and use it as the target cluster center.
5. The method according to claim 1, characterized in that, The step of determining K centroids from the sample dataset includes: Randomly select a centroid neighboring point from the sample dataset, delete the first log sample in the sample dataset whose cosine distance to the centroid neighboring point is less than a preset second threshold, and obtain the center of the first log sample as the first centroid; Obtain from the sample dataset the farthest centroid neighboring point, that is, the point farthest from the first centroid; delete the second log sample from the sample dataset whose cosine distance to the farthest centroid neighboring point is less than the preset second threshold; and obtain the center of the second log sample as the second centroid. Based on the distances between the remaining log samples in the sample dataset and the first and second centroids, K centroids are determined from the sample dataset.
6. The method of claim 5, wherein, The step of determining K centroids from the sample dataset based on the distances between the remaining log samples in the sample dataset and the first and second centroids includes: For each remaining log sample, multiply together its distances to all centroids to obtain the distance product of the log sample. Collect the distance products of all log samples into a distance set, and select the maximum distance product from this distance set; The log sample corresponding to the maximum distance product is taken as the next new centroid neighbor point. The third log sample in the sample dataset whose cosine distance to the new centroid neighbor point is less than the preset second threshold is deleted, and the center of the third log sample is obtained as the third centroid. Determine whether the number of centroids in the sample dataset is K. If it is not K, continue to select new centroid neighbors and determine new centroids based on these new neighbors.
7. A k-means log classification device based on distributed sample screening, characterized in that, The distributed sample screening-based k-means log classification device is used to implement the distributed sample screening-based k-means log classification method according to any one of claims 1-6, and the device includes: The acquisition module is used to acquire N log sample sets and the corresponding replicas of each log sample set, and to determine K centers in each log sample set. The log sample set includes at least one log sample, and the replica is the same as the log sample in the corresponding log sample set. N and K are positive integers. The center determination module is used to divide the log samples in the replica of each log sample set into clusters based on the K centers in each log sample set, so as to obtain the K clusters of the replica and the cluster center of each cluster; The center set module is used to form an initial center set from the cluster centers of all replicas. Based on the cosine distance between the cluster centers in the initial center set, the cluster centers are merged until the preset fusion end condition is met, and the first center set is obtained. The distance calculation module is used to calculate the K minimum distances of the log sample set based on the cosine distances between the cluster centers in the first center set and the K centers in the log sample set. The label determination module is used to determine the level label of the log sample set based on the K minimum distances of the log sample set; The dataset composition module is used to extract a target number of log samples from all log sample sets based on the level label of each log sample set, and form a sample dataset. The clustering module is used to determine K centroids from the sample dataset and perform K-means clustering.
8. An electronic device, characterized in that it includes: A processor, and a memory communicatively connected to the processor; The memory stores computer-executable instructions; The processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the method as described in any one of claims 1 to 6.
10. A computer program product, characterized in that, when executed by a processor, the computer program product is used to implement the method as described in any one of claims 1 to 6.