A log clustering method and device, electronic equipment and storage medium

By using a preset support threshold to determine frequent words and a similarity threshold to merge candidate clusters in log clustering, the log clustering process is optimized, solving the problem of poor clustering accuracy in existing technologies and improving the accuracy and reliability of log clustering.

CN117332083B · Active · Publication Date: 2026-06-19 · CHINA UNIONPAY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA UNIONPAY
Filing Date
2023-09-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing log clustering methods rely on user-defined support thresholds, resulting in poor clustering accuracy. Furthermore, different sizes of log data require different support threshold settings, which can easily lead to cluster dispersion or overfitting.

Method used

Frequent words in log data are identified by setting a support threshold. Candidate clusters and linear templates are generated based on the frequent words. Candidate clusters with high similarity are merged by using a similarity threshold to generate target clusters and target linear templates, thus optimizing the log clustering process.

Benefits of technology

It improves the accuracy of log clustering, solves the problem of cluster dispersion or overfitting caused by inaccurate support thresholds, and enhances the reliability of clustering results.


Abstract

This application discloses a log clustering method, apparatus, electronic device, and storage medium. In this application, frequent words in each log data are determined based on a preset support threshold. After determining each first candidate cluster and its corresponding first linear template based on the frequent words in each log data, the similarity between each first linear template is determined based on the frequent words in each first linear template. Then, the first candidate clusters corresponding to the first linear templates with similarity greater than a preset similarity threshold are merged to determine each target cluster and its corresponding target linear template. By merging the first candidate clusters corresponding to the first linear templates with similarity greater than the preset similarity threshold, the problem of scattered or overfitted log clustering caused by inaccurate preset support thresholds can be solved, thus improving the accuracy of log clustering.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a log clustering method, apparatus, electronic device and storage medium.

Background Technology

[0002] With the technological advancements in data processing capabilities, scale, and complexity of data center and computer applications, application systems typically generate massive amounts of log data daily, exceeding tens or even hundreds of gigabytes. For example, a security log management system might receive nearly 100 million events daily. This places immense pressure on operations and maintenance personnel. To simplify log data management, many studies recommend using data mining methods to discover event patterns in event logs. These methods can ultimately be used for a variety of purposes, such as: developing log event correlation rules, detecting system failures and network anomalies, visualizing event correlation patterns, identifying and reporting network traffic, and automatically building intrusion detection system alert classifiers.

[0003] Existing log clustering methods, such as SLCT, are designed to mine linear patterns and anomalous events from logs. During clustering, SLCT assigns log lines that match the same linear pattern to the same cluster and reports all detected clusters as linear patterns to the user. To find clusters in the log data, the user needs to define a support threshold `s`, which is the minimum number of lines in each cluster. The SLCT method begins clustering by making a pass over the input dataset to identify frequent words, that is, words that appear in at least `s` lines. Furthermore, each frequent word is recorded together with its position within the log line.
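The first pass described above can be sketched as follows. This is a minimal illustration rather than the SLCT implementation: the function name `slct_frequent_words`, the whitespace tokenization, and the sample log lines are assumptions made for the example.

```python
from collections import Counter


def slct_frequent_words(lines, s):
    """Return the (position, word) pairs that occur in at least `s` lines."""
    counts = Counter()
    for line in lines:
        # The position is part of the key, so "down" at index 2 and
        # "down" at index 3 are counted as different frequent-word candidates.
        for position, word in enumerate(line.split()):
            counts[(position, word)] += 1
    return {pair for pair, n in counts.items() if n >= s}


# Hypothetical log lines, used only for illustration.
logs = [
    "Interface eth0 down",
    "Interface eth1 down",
    "Backup Interface eth2 down",
]
frequent = slct_frequent_words(logs, s=2)
print(sorted(frequent))  # [(0, 'Interface'), (2, 'down')]
```

The third line contains the same words shifted by one position, so it contributes nothing to the counts of the first two lines. This position sensitivity is the behavior the frequent words of this application are later described as avoiding.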

[0004] The problem with existing technologies is that log clustering effectiveness relies on user-defined support thresholds, which are set based on user experience. The accuracy of these thresholds cannot be guaranteed, and the support thresholds differ for logs of different sizes. Improperly set support thresholds can lead to scattered or overfitting log clustering, resulting in poor accuracy.

Summary of the Invention

[0005] This application provides a log clustering method, apparatus, electronic device, and storage medium to solve the problem of poor accuracy in existing log clustering techniques.

[0006] Firstly, this application provides a log clustering method, the method comprising:

[0007] Obtain each log data entry to be clustered, and determine the frequent words in each log data entry based on a preset support threshold;

[0008] Based on the frequent words in each log data, determine each first candidate cluster and its corresponding first linear template;

[0009] Based on the frequent words in each of the first linear templates, the similarity between the first linear templates is determined, and the first candidate clusters corresponding to the first linear templates with similarity greater than a preset similarity threshold are merged to determine each target cluster and its corresponding target linear template.

[0010] Secondly, this application provides a log clustering apparatus, the apparatus comprising:

[0011] The first determining module is used to acquire each log data to be clustered and determine the frequent words in each log data according to a preset support threshold.

[0012] The second determining module is used to determine each first candidate cluster and its corresponding first linear template based on the frequent words in each log data.

[0013] The third determining module is used to determine the similarity between the first linear templates based on the frequent words in each first linear template, merge the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold, and determine each target cluster and its corresponding target linear template.

[0014] Thirdly, this application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;

[0015] Memory, used to store computer programs;

[0016] A processor, when executing a program stored in memory, implements the steps of the method described.

[0017] Fourthly, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method described.

[0018] This application provides a log clustering method, apparatus, electronic device, and storage medium. The method includes: acquiring log data to be clustered; determining frequent words in each log data according to a preset support threshold; determining each first candidate cluster and its corresponding first linear template according to the frequent words in each log data; determining the similarity between the first linear templates according to the frequent words in each first linear template; merging the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold; and determining each target cluster and its corresponding target linear template.

[0019] The above technical solution has the following advantages or beneficial effects:

[0020] In this application, frequent words in each log data are determined based on a preset support threshold. After determining each first candidate cluster and its corresponding first linear template based on the frequent words in each log data, the similarity between each first linear template is determined based on the frequent words in each first linear template. Then, the first candidate clusters corresponding to the first linear templates with similarity greater than a preset similarity threshold are merged to determine each target cluster and its corresponding target linear template. By merging the first candidate clusters corresponding to the first linear templates with similarity greater than the preset similarity threshold, the problem of scattered or overfitted log clustering caused by inaccurate preset support thresholds can be solved, thus improving the accuracy of log clustering.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a schematic diagram of the log clustering process provided in this application;

[0023] Figure 2 is a diagram of the log clustering process provided in this application;

[0024] Figure 3 is a schematic diagram illustrating the generation of candidate clusters and linear templates provided in this application;

[0025] Figure 4 is a schematic diagram illustrating an example of merging candidate clusters provided in this application;

[0026] Figure 5 is an example diagram illustrating the connection of candidate clusters provided in this application;

[0027] Figure 6 is a schematic diagram of the log clustering device provided in this application;

[0028] Figure 7 is a schematic diagram of the structure of the electronic device provided in this application.

Detailed Implementation

[0029] To make the objectives and implementation methods of this application clearer, the exemplary implementation methods of this application will be clearly and completely described below with reference to the accompanying drawings of the exemplary embodiments of this application. Obviously, the exemplary embodiments described are only some embodiments of this application, and not all embodiments.

[0030] It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood in their ordinary and common meaning.

[0031] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar or related objects or entities, and do not necessarily imply a specific order or sequence, unless otherwise specified. It should be understood that such terms are interchangeable where appropriate.

[0032] The terms “comprising” and “having”, and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device that includes a series of components is not necessarily limited to the components that are clearly listed, but may include other components that are not clearly listed or that are inherent to such product or device.

[0033] The term "module" refers to any known or subsequently developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and / or software code that is capable of performing the functions associated with that element.

[0034] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

[0035] For ease of explanation, the above description has been provided in conjunction with specific embodiments. However, the above exemplary discussion is not intended to be exhaustive or to limit the embodiments to the specific forms disclosed above. Various modifications and variations can be obtained based on the above teachings. The selection and description of the above embodiments are for the purpose of better explaining the principles and practical applications, thereby enabling those skilled in the art to better utilize the described embodiments and various different variations of embodiments suitable for specific use considerations.

[0036] Figure 1 is a schematic diagram of the log clustering process provided in this application. The process includes the following steps:

[0037] S101: Obtain each log data to be clustered, and determine the frequent words in each log data according to the preset support threshold.

[0038] S102: Based on the frequent words in each log data, determine each first candidate cluster and its corresponding first linear template.

[0039] S103: Based on the frequent words in each of the first linear templates, determine the similarity between the first linear templates, merge the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold, and determine each target cluster and its corresponding target linear template.

[0040] The log clustering method provided in this application is applied to electronic devices, which may be PCs, tablets, or servers.

[0041] The electronic device first acquires the log data to be clustered. The electronic device also stores a preset support threshold, and the frequent words in each log data are determined based on this threshold. Specifically, words whose frequency exceeds the preset support threshold are considered frequent words. Based on these frequent words, the first candidate clusters are determined. For example, if the preset support threshold is 10, and the words whose frequency exceeds the preset support threshold include "Application", "start", "at", and "node", then the log data "Application Gateway start at node Sfmmbkas" and the log data "Application NG app start at node Ngap" belong to the same first candidate cluster. Preferably, while determining the frequent words in each log data based on the preset support threshold, variable words can also be determined; that is, words whose frequency does not exceed the preset support threshold are considered variable words. Wildcards are used to replace variable words, and these wildcards are also included in the first linear template corresponding to the first candidate cluster. For example, if the wildcard is represented by the wildcard symbol "*", then the first linear template corresponding to the first candidate cluster can be represented as "Application * start at node *". To further improve the accuracy of the representation of the first linear template corresponding to the first candidate cluster, the wildcard can include not only the wildcard symbol but also a range for the number of variable words it replaces. For example, the first linear template corresponding to the first candidate cluster can be represented as "Application *{1,2} start at node *{1,1}", where {1,2} and {1,1} are both such ranges. {1,2} means that in each log data entry there is at least one and at most two variable words between "Application" and "start". {1,1} means that in each log data entry there is exactly one variable word after "node".
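To make the meaning of the ranges concrete, the sketch below checks whether a log line fits a linear template. It is an illustration under stated assumptions, not the method of this application: template tokens are written separated by spaces, words are split on whitespace, and the helper names `template_to_regex` and `matches` are hypothetical.

```python
import re

# A wildcard token looks like *{low,high}.
WILDCARD = re.compile(r"\*\{(\d+),(\d+)\}")


def template_to_regex(template):
    """Translate a linear template into a regex over space-terminated words."""
    parts = []
    for token in template.split():
        m = WILDCARD.fullmatch(token)
        if m:
            low, high = int(m.group(1)), int(m.group(2))
            # Between `low` and `high` arbitrary words, each followed by a space.
            parts.append(r"(?:\S+ ){%d,%d}" % (low, high))
        else:
            # A frequent word must appear literally.
            parts.append(re.escape(token) + " ")
    return re.compile("".join(parts))


def matches(template, line):
    # Normalize whitespace and terminate every word with one space.
    normalized = " ".join(line.split()) + " "
    return template_to_regex(template).fullmatch(normalized) is not None


template = "Application *{1,2} start at node *{1,1}"
print(matches(template, "Application Gateway start at node Sfmmbkas"))  # True
print(matches(template, "Application NG app start at node Ngap"))       # True
print(matches(template, "Application a b c start at node Ngap"))        # False
```

The last line is rejected because three variable words sit between "Application" and "start", which is outside the range {1,2}.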

[0042] After determining each first candidate cluster and its corresponding first linear template, the similarity between the first linear templates is determined based on the frequent words in each template. Specifically, for any two first linear templates, the more identical frequent words they share, the greater the similarity; conversely, the fewer identical frequent words, the smaller the similarity. Optionally, for any two first linear templates, the number of identical frequent words and the total number of word segments contained in the two templates can be counted, and the ratio of the number of identical frequent words to the total number of word segments is used as the similarity between the two first linear templates. Alternatively, for any two first linear templates, the number of identical frequent words, the number of identical wildcards, and the total number of word segments and wildcards contained in the two templates can be counted. The ratio of the sum of the first two counts to that total is then used as the similarity between the two first linear templates. Two wildcards are considered identical when the ranges of the number of variable words they replace are the same.

[0043] For example, consider the first linear template "remote_drr 172.18.179.21 access @timetamp *{1,1}" and the first linear template "remote_drr *{1,1} access @timetamp 20210517164312". The first template contains the tokens "remote_drr", "172.18.179.21", "access", "@timetamp", and "*{1,1}". The second template contains the tokens "remote_drr", "*{1,1}", "access", "@timetamp", and "20210517164312". The two templates share three frequent words ("remote_drr", "access", and "@timetamp") and one wildcard ("*{1,1}"), and together they contain six distinct tokens. If the similarity between the two first linear templates is determined by the ratio of the number of identical frequent words to the total number of word segments contained in both templates, then the similarity is 3 / 6 = 0.5. If the similarity is determined by the ratio of the sum of the number of identical frequent words and the number of identical wildcards to the total number of word segments contained in both templates, then the similarity is 4 / 6 = 0.67.
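The arithmetic of this example can be reproduced with a short sketch. It assumes that templates are split on whitespace and that the denominator is the number of distinct tokens across both templates, an interpretation chosen because it yields the 3 / 6 and 4 / 6 given above. The function name `similarity` and the flag `count_wildcards` are hypothetical.

```python
def is_wildcard(token):
    # Wildcard tokens have the form *{low,high}.
    return token.startswith("*")


def similarity(template_a, template_b, count_wildcards=False):
    """Shared tokens divided by distinct tokens across both templates."""
    a, b = set(template_a.split()), set(template_b.split())
    union = a | b
    if not union:
        return 0.0
    shared = a & b
    if not count_wildcards:
        # First variant: only identical frequent words count as shared.
        shared = {t for t in shared if not is_wildcard(t)}
    return len(shared) / len(union)


t1 = "remote_drr 172.18.179.21 access @timetamp *{1,1}"
t2 = "remote_drr *{1,1} access @timetamp 20210517164312"
print(similarity(t1, t2))                                   # 0.5
print(round(similarity(t1, t2, count_wildcards=True), 2))   # 0.67
```

Because sets are used, the position of a token inside the template does not affect the result, so the wildcard "*{1,1}" counts as shared even though it appears at different positions in the two templates.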

[0044] The electronic device stores a preset similarity threshold, such as 0.4 or 0.5. After the similarity between the first linear templates is determined, the first candidate clusters corresponding to first linear templates whose similarity is greater than the preset similarity threshold are merged to determine each target cluster and its corresponding target linear template. Continuing the example above, the similarity between the first linear templates “remote_drr 172.18.179.21 access @timetamp *{1,1}” and “remote_drr *{1,1} access @timetamp 20210517164312” is greater than the preset similarity threshold. The merged target linear template is represented as “remote_drr *{1,1} access @timetamp *{1,1}”. The first candidate cluster corresponding to the first linear template “remote_drr 172.18.179.21 access @timetamp *{1,1}” and the first candidate cluster corresponding to the first linear template “remote_drr *{1,1} access @timetamp 20210517164312” are merged to obtain the target cluster.
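The text does not spell out how two templates are combined into the merged template, so the following is only one possible sketch that reproduces the example. It assumes both templates have the same number of space-separated tokens and contain only fixed words and the single-word wildcard "*{1,1}"; every position where the tokens differ becomes "*{1,1}". The function name `merge_aligned_templates` and the sample cluster contents are hypothetical.

```python
def merge_aligned_templates(template_a, template_b):
    """Merge two equally long templates position by position."""
    a, b = template_a.split(), template_b.split()
    if len(a) != len(b):
        raise ValueError("this sketch only handles templates of equal length")
    # Keep a token where both templates agree, otherwise use a wildcard.
    merged = [x if x == y else "*{1,1}" for x, y in zip(a, b)]
    return " ".join(merged)


t1 = "remote_drr 172.18.179.21 access @timetamp *{1,1}"
t2 = "remote_drr *{1,1} access @timetamp 20210517164312"
target_template = merge_aligned_templates(t1, t2)
print(target_template)  # remote_drr *{1,1} access @timetamp *{1,1}

# The clusters themselves are merged by pooling their log lines
# (hypothetical log lines, for illustration only).
cluster_1 = ["remote_drr 172.18.179.21 access @timetamp 20210517164311"]
cluster_2 = ["remote_drr 172.18.179.22 access @timetamp 20210517164312"]
target_cluster = cluster_1 + cluster_2
```

Templates of different lengths or with wider wildcard ranges would need an alignment step that this sketch does not attempt.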

[0045] In this application, frequent words in each log data are determined based on a preset support threshold. After determining each first candidate cluster and its corresponding first linear template based on the frequent words in each log data, the similarity between each first linear template is determined based on the frequent words in each first linear template. Then, the first candidate clusters corresponding to the first linear templates with similarity greater than a preset similarity threshold are merged to determine each target cluster and its corresponding target linear template. By merging the first candidate clusters corresponding to the first linear templates with similarity greater than the preset similarity threshold, the problem of scattered or overfitted log clustering caused by inaccurate preset support thresholds can be solved, thus improving the accuracy of log clustering.

[0046] In this application, the step of obtaining each log data to be clustered and determining the frequent words in each log data according to a preset support threshold includes:

[0047] Obtain each log data entry to be clustered, determine each word in the log data using a word segmentation algorithm, and identify the word segments whose frequency is greater than a preset support threshold as the frequent words.

[0048] After obtaining the log data to be clustered, each log data is first segmented using a word segmentation algorithm to determine the individual words in each log data. Then, words whose frequency exceeds a preset support threshold are identified as frequent words. Word segmentation algorithms include, but are not limited to, maximum matching algorithm, shortest path algorithm, N-gram language model algorithm, HMM algorithm, and CRF algorithm.

[0049] To make the identification of frequent words more accurate, taking the word segments whose frequency is greater than a preset support threshold as the frequent words includes:

[0050] For each log data entry, each word segment in the log data is deduplicated, and the word segments whose frequency after deduplication is greater than a preset support threshold are taken as frequent words.

[0051] In this application, for each log data entry, word segmentation and deduplication are performed, meaning that only one instance of the same word is retained within that log data entry. Then, the frequency of each word in each log data entry after word segmentation and deduplication is counted to determine frequent words. Specifically, words whose frequency after word segmentation and deduplication is greater than a preset support threshold are considered frequent words. It should be noted that word segmentation and deduplication are only performed on each log data entry when determining frequent words; the determination of each first candidate cluster and its corresponding first linear template is still performed within the original log data.
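The deduplicated counting described above can be sketched as follows. Whitespace splitting stands in for the word segmentation algorithm, and the function name `find_frequent_words` and the sample log lines are assumptions made for the example.

```python
from collections import Counter


def find_frequent_words(lines, support_threshold):
    """Return words that occur in more than `support_threshold` log entries."""
    counts = Counter()
    for line in lines:
        # set() keeps one instance of each word per log entry, so a word
        # repeated inside a single entry is counted only once.
        counts.update(set(line.split()))
    return {word for word, n in counts.items() if n > support_threshold}


# Hypothetical log entries, used only for illustration.
logs = [
    "Application Gateway start at node Sfmmbkas",
    "Application NG app start at node Ngap",
    "retry retry retry at node Ngap",
]
frequent_words = find_frequent_words(logs, support_threshold=1)
print(sorted(frequent_words))  # ['Application', 'Ngap', 'at', 'node', 'start']
```

"retry" occurs three times but only in one entry, so after deduplication its frequency is 1 and it is not a frequent word. The original entries are left untouched, which matches the note that later steps still operate on the original log data.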

[0052] The step of obtaining each log data entry to be clustered and determining the frequent words in each log data entry based on a preset support threshold includes:

[0053] Obtain each log data to be clustered, determine each word in the log data through a word segmentation algorithm, perform word segmentation and deduplication processing on each word, and designate the word segment whose occurrence frequency after word segmentation and deduplication processing is greater than a preset support threshold as the frequent word; designate the word segment whose occurrence frequency after word segmentation and deduplication processing is not greater than the preset support threshold as the variable word.

[0054] Based on the frequent words in each log data entry, the determination of each first candidate cluster and its corresponding first linear template includes:

[0055] Log data with the same frequent words and the same relative position information of the frequent words in the log data are regarded as log data in the same first candidate cluster; wildcards are used to replace the variable words in the log data in the same first candidate cluster to obtain the first linear template corresponding to the first candidate cluster; wherein, the first linear template includes frequent words and wildcards, and the wildcards include wildcard symbols and the range of the number of replacement variable words.

[0056] The relative position information of frequent words in log data refers to the order in which the frequent words appear relative to one another within the log data. For example, if the frequent words are "Application", "start", "at", and "node", the log data "Application Gateway start at node Sfmmbkas" and the log data "Application NG app start at node Ngap" contain the same frequent words and have the same relative position information. Therefore, these two log data belong to the same first candidate cluster. Replacing the variable words in the log data with wildcards yields the first linear template corresponding to the first candidate cluster. As another example, if the frequent words are "Application", "start", "at", and "node", the log data "Application Gateway start at node Sfmmbkas" and the log data "NG app start at node Application Ngap" contain the same frequent words, but their relative position information is different, so they do not belong to the same first candidate cluster.
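A minimal sketch of this grouping step is shown below. It assumes whitespace tokenization, uses the ordered sequence of frequent words in a log entry as the cluster key, and writes template tokens separated by spaces. The function name `build_candidate_clusters` is hypothetical, and emitting no wildcard for a gap that is empty in every entry is a choice made for the sketch.

```python
from collections import defaultdict


def build_candidate_clusters(lines, frequent_words):
    """Group lines by their ordered frequent words and derive templates."""
    clusters = defaultdict(list)  # key -> log lines in the cluster
    gaps = defaultdict(list)      # key -> per-line counts of variable words
    for line in lines:
        key, gap_counts, run = [], [], 0
        for word in line.split():
            if word in frequent_words:
                key.append(word)
                gap_counts.append(run)  # variable words before this frequent word
                run = 0
            else:
                run += 1
        gap_counts.append(run)  # variable words after the last frequent word
        clusters[tuple(key)].append(line)
        gaps[tuple(key)].append(gap_counts)

    templates = {}
    for key, per_line in gaps.items():
        tokens = []
        # Each column holds the size of the same gap across all lines.
        for i, column in enumerate(zip(*per_line)):
            low, high = min(column), max(column)
            if high > 0:
                tokens.append("*{%d,%d}" % (low, high))
            if i < len(key):
                tokens.append(key[i])
        templates[key] = " ".join(tokens)
    return clusters, templates


frequent = {"Application", "start", "at", "node"}
logs = [
    "Application Gateway start at node Sfmmbkas",
    "Application NG app start at node Ngap",
    "NG app start at node Application Ngap",
]
clusters, templates = build_candidate_clusters(logs, frequent)
for key, template in templates.items():
    print(len(clusters[key]), template)
# 2 Application *{1,2} start at node *{1,1}
# 1 *{2,2} start at node Application *{1,1}
```

The third entry contains the same frequent words in a different order, so it receives a different key and lands in its own first candidate cluster.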

[0057] Determining the similarity between the first linear templates based on the frequent words in each of the first linear templates includes:

[0058] For any two first linear templates, the similarity between them is determined based on the number of common frequent words, the number of common wildcards, and the total number of word segments in the two first linear templates.

[0059] Optionally, the sum of the number of identical frequent words and the number of identical wildcards in any two first linear templates is determined, and the ratio of the sum to the total number of word segments is used as the similarity between the two first linear templates.

[0060] To further improve the accuracy of the identified target clusters, this application merges the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold, and determines each target cluster and its corresponding target linear template as follows:

[0061] The first candidate clusters corresponding to the first linear templates whose similarity is greater than the preset similarity threshold are merged to obtain each second candidate cluster and its corresponding second linear template.

[0062] For each frequent word in each of the second linear templates, the matching degree of the frequent word is determined based on two quantities measured in the second candidate cluster corresponding to the second linear template: the frequency with which the frequent word appears together with each of the other frequent words of the same template, and the total frequency of each frequent word. The frequent words to be merged are determined based on the matching degree of the frequent words and a preset matching degree threshold.

[0063] The second candidate clusters corresponding to the second linear templates with the same frequent words but different frequent words to be merged are merged to obtain each target cluster and its corresponding target linear template.

[0064] In this application, the first candidate clusters corresponding to the first linear templates with similarity greater than a preset similarity threshold are merged, and the merged clusters are used as the second candidate clusters. The first linear templates are also merged to obtain the second linear templates. For example, the second linear template is “Interface *{1,1} down at node router1”, where “Interface”, “down”, “at”, “node” and “router1” are frequent words. For "Interface", based on the log data in the second candidate cluster corresponding to the second linear template, determine the frequency with which "Interface" appears together with each of "down", "at", "node", and "router1". Also determine the frequency of occurrence of each of "Interface", "down", "at", "node", and "router1". Optionally, calculate a first sum, which is the sum of the four co-occurrence frequencies, and a second sum, which is the sum of the five individual frequencies of occurrence. The ratio of the first sum to the second sum is used as the matching degree of "Interface". Preferably, the average of the four co-occurrence frequencies is calculated instead of their sum, and the ratio of this average to the second sum is taken as the matching degree of "Interface".
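The two ways of computing the matching degree can be sketched as follows. This is an illustration under assumptions: frequency is taken as the number of log entries in the cluster that contain a word (or both words of a pair), words are split on whitespace, and the function name `matching_degree`, the flag `use_average`, and the sample cluster are hypothetical.

```python
def matching_degree(word, frequent_words, cluster_lines, use_average=False):
    """Co-occurrence of `word` with the other frequent words, relative to
    the summed frequencies of all frequent words in the cluster."""
    token_sets = [set(line.split()) for line in cluster_lines]
    others = [w for w in frequent_words if w != word]
    # Number of entries in which `word` appears together with each other word.
    co_occurrence = [
        sum(1 for tokens in token_sets if word in tokens and other in tokens)
        for other in others
    ]
    # Second sum: individual frequencies of all frequent words.
    total_frequency = sum(
        sum(1 for tokens in token_sets if w in tokens) for w in frequent_words
    )
    if not others or total_frequency == 0:
        return 0.0
    first = sum(co_occurrence)
    if use_average:
        first = first / len(others)
    return first / total_frequency


# Hypothetical cluster contents, used only for illustration.
cluster = [
    "Interface eth0 down at node router1",
    "Interface eth1 down at node router1",
    "Interface eth2 down at node router2",
]
words = ["Interface", "down", "at", "node", "router1"]
for w in words:
    print(w, round(matching_degree(w, words, cluster), 3))
# Interface 0.786, down 0.786, at 0.786, node 0.786, router1 0.571
```

In this sample "router1" co-occurs with the other frequent words in only two of the three entries, so it receives the lowest matching degree and would be the first word to fall below a matching degree threshold.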

[0065] For the second linear template "Interface *{1,1} down at node router1", the matching degrees of the frequent words "Interface", "down", "at", "node", and "router1" can be determined respectively through the above method. The electronic device stores a preset matching degree threshold, and the frequent words in the second linear template with matching degrees less than the preset matching degree threshold are used as the frequent words to be merged.

[0066] The second candidate clusters corresponding to second linear templates that have the same frequent words apart from their differing frequent words to be merged are merged to obtain each target cluster and its corresponding target linear template. For example, the frequent word to be merged in the second linear template "Interface *{1,1} down at node router1" is "router1", and the frequent word to be merged in the second linear template "Interface *{2,3} down at node router2" is "router2". These two second linear templates have the same frequent words but different frequent words to be merged. Merging these two second linear templates yields the target linear template "Interface *{1,3} down at node router1|router2". The second candidate clusters corresponding to these two second linear templates are merged into one cluster, that is, the target cluster corresponding to the target linear template. In the target linear template, "{1,3}" is the result of merging the wildcards of the two second linear templates, and "router1|router2" is the result of merging their frequent words to be merged, where "|" represents an "or" relationship.
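The merge in this example can be sketched as below. The sketch assumes the two templates have the same number of space-separated tokens, widens wildcard ranges by taking the smaller lower bound and the larger upper bound (which gives {1,3} from {1,1} and {2,3}), and writes the "or" relationship with the ASCII character "|". The function name `merge_second_templates` is hypothetical.

```python
import re

WILDCARD = re.compile(r"\*\{(\d+),(\d+)\}")


def merge_second_templates(template_a, to_merge_a, template_b, to_merge_b):
    """Merge two templates that differ only in wildcards and words to be merged."""
    a, b = template_a.split(), template_b.split()
    if len(a) != len(b):
        raise ValueError("this sketch only handles templates of equal length")
    merged = []
    for x, y in zip(a, b):
        wx, wy = WILDCARD.fullmatch(x), WILDCARD.fullmatch(y)
        if wx and wy:
            # Widen the range so that it covers both wildcards.
            low = min(int(wx.group(1)), int(wy.group(1)))
            high = max(int(wx.group(2)), int(wy.group(2)))
            merged.append("*{%d,%d}" % (low, high))
        elif x == y:
            merged.append(x)
        elif x in to_merge_a and y in to_merge_b:
            # Differing low-matching-degree words become alternatives.
            merged.append(x + "|" + y)
        else:
            raise ValueError("templates differ outside wildcards and words to be merged")
    return " ".join(merged)


target = merge_second_templates(
    "Interface *{1,1} down at node router1", {"router1"},
    "Interface *{2,3} down at node router2", {"router2"},
)
print(target)  # Interface *{1,3} down at node router1|router2
```

Two templates whose remaining frequent words differ raise an error here, reflecting the condition that only templates with otherwise identical frequent words are merged.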

[0067] This application proposes a method for clustering log data. Building on earlier clustering methods, it introduces candidate cluster merging optimization and candidate cluster connection strategies, which can optimize the candidate clusters (hereinafter referred to as linear templates) discovered and parsed from the logs, as well as abnormal log data, and improve the accuracy of log clustering. In this application, linear templates and candidate clusters are generated, the similarity between the randomly selected linear templates is calculated, the candidate clusters corresponding to the linear templates with higher similarity are merged, and the linear templates are updated at the same time. Then cluster selection is performed. After the cluster selection is completed, the matching degree of the frequent words is calculated according to the dependency relationship between the frequent words, and the frequent words with lower matching degrees are replaced with auxiliary identifiers. Finally, the clusters whose linear templates are exactly the same after this replacement are merged into a new cluster.

[0068] The frequent words in the linear templates determined by this application do not include frequent word position information and are insensitive to changes in the position of word segments. After the candidate clusters are determined, the candidate cluster merging optimization and candidate cluster connection strategies are applied to ensure that the linear templates are not overfitted.

[0069] This application provides a log parsing clustering method based on a linear template of frequent words. The method can be summarized in four steps: Step 1: Traverse the event data and count frequent words based on a support threshold; Step 2: Traverse the event data again to create a linear template of frequent words and candidate clusters; Step 3: To address the problem of excessively fragmented candidate clusters due to improper support threshold settings, merge and optimize the candidate clusters after generation; Step 4: Perform connection optimization on the merged and optimized clusters.

[0070] Figure 2 is a log clustering process diagram provided by this application. The process includes the following steps:

[0071] S201: Input the log data L = [I1, I2, ..., In] to be clustered and the support threshold s.

[0072] S202: Count frequent words based on the support threshold and create a frequent word set.

[0073] S203: Determine the first linear template based on each log data and the set of frequent words.

[0074] S204: Based on the frequent words in each first linear template, determine the similarity between each first linear template, merge the first candidate clusters corresponding to the first linear templates with similarity greater than a preset similarity threshold, and obtain each second candidate cluster and its corresponding second linear template.

[0075] S205: For each frequent word in each second linear template, determine the matching degree of that frequent word based on how often it appears together with each frequent word in the second candidate cluster corresponding to the second linear template, and on the total frequency of each frequent word. Determine the frequent words to be merged based on the matching degrees and a preset matching degree threshold. Merge the second candidate clusters whose second linear templates have the same frequent words apart from the frequent words to be merged, to obtain each target cluster and its corresponding target linear template.

[0076] The construction of a frequent word set is explained below:

[0077] This application considers the frequency (number of occurrences or frequency of occurrence) of each word in the log data, but not the positional information of the words. For a linear template to reach the support threshold s, each of its frequent words must occur in at least s log data entries. Each line of log data is segmented into words, and the words within the same line are deduplicated so that a word is counted at most once per line. The frequency of every word is then counted. Words with a frequency exceeding the support threshold s are considered frequent words and are included in the frequent word set.
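The construction of the frequent word set can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function name `find_frequent_words` is hypothetical, whitespace splitting stands in for the unspecified word segmentation algorithm, and the comparison uses "greater than s" as in the claims.

```python
from collections import Counter


def find_frequent_words(lines, support_threshold):
    """Return the words that occur in more than `support_threshold` lines."""
    counts = Counter()
    for line in lines:
        # Assumption: whitespace splitting stands in for word segmentation.
        # set() deduplicates, so a word counts at most once per line.
        counts.update(set(line.split()))
    return {word for word, n in counts.items() if n > support_threshold}


logs = [
    "Application Gateway start at node Sfmmbkas",
    "Application NG app start at node Ngap",
]
frequent = find_frequent_words(logs, support_threshold=1)
print(sorted(frequent))
```

With these two sample lines and s = 1, only the words present in both lines are kept, which matches the frequent words used in the Figure 3 example.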

[0078] The following explains how to create a linear template to generate candidate class clusters:

[0079] After the frequent word set is constructed, candidate clusters are generated; a candidate cluster is a group of log entries that share the same linear template. For each line of log data, all frequent words are extracted and stored as a tuple that preserves their order in the original line. This tuple serves as the identifier of a candidate cluster, and the line is assigned to that cluster. If the candidate cluster does not yet exist, it is created with its support count initialized to 1, and a new linear template is created for the line. If the candidate cluster already exists, its support count is incremented and its linear template is adjusted to cover the current line. The support count is the number of log entries that conform to the linear template.

[0080] Figure 3 illustrates the generation of candidate clusters and linear templates provided in this application. The identified frequent words are "Application", "start", "at", and "node". For log data 1, "Application Gateway start at node Sfmmbkas", it is determined whether a matching linear template exists. If not, the linear template "Application*{1,1}start at node*{1,1}" is generated. If it does exist, the support count of the linear template "Application*{1,1}start at node*{1,1}" is incremented by 1. For log data 2, "Application NG app start at node Ngap", the linear template already exists, so its support count is incremented by 1 and the linear template is updated to "Application*{1,2}start at node*{1,1}". Here *{1,2} indicates that at least one and at most two word segments lie between "Application" and "start".
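The Figure 3 example can be reproduced with a short sketch. The names `Cluster`, `build_clusters` and `render` are hypothetical, whitespace splitting is assumed for segmentation, and the rendered template separates tokens with spaces for readability, unlike the compact form printed above.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    words: tuple      # frequent words in order of appearance (cluster identifier)
    gaps: list        # [min, max] count of variable words in each gap
    support: int = 1  # number of lines that conform to the template


def build_clusters(lines, frequent):
    clusters = {}
    for line in lines:
        words, gaps, run = [], [], 0
        for token in line.split():
            if token in frequent:
                gaps.append(run)  # variable words seen before this frequent word
                words.append(token)
                run = 0
            else:
                run += 1
        gaps.append(run)          # variable words after the last frequent word
        if not words:
            continue              # no frequent word: the line joins no cluster
        key = tuple(words)
        if key not in clusters:
            clusters[key] = Cluster(key, [[g, g] for g in gaps])
        else:
            cluster = clusters[key]
            cluster.support += 1
            for rng, g in zip(cluster.gaps, gaps):
                rng[0], rng[1] = min(rng[0], g), max(rng[1], g)
    return clusters


def render(cluster):
    parts = []
    for i, (lo, hi) in enumerate(cluster.gaps):
        if hi > 0:
            parts.append(f"*{{{lo},{hi}}}")
        if i < len(cluster.words):
            parts.append(cluster.words[i])
    return " ".join(parts)


logs = [
    "Application Gateway start at node Sfmmbkas",
    "Application NG app start at node Ngap",
]
clusters = build_clusters(logs, {"Application", "start", "at", "node"})
template = clusters[("Application", "start", "at", "node")]
print(render(template), template.support)
```

Keying clusters on the ordered tuple of frequent words means that a line only widens the wildcard ranges of an existing template and never changes its fixed words.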

[0081] In practice, log data with the same frequent words and the same relative position information can be merged based on the frequent words in each log data, and the corresponding linear template can be extracted. Then the linear template extraction result can be obtained.

[0082] The optimization for merging candidate clusters is explained below:

[0083] After candidate clusters have been constructed from all log data, they need to be adjusted to prevent the linear templates from overfitting. This is achieved through a heuristic strategy: after the candidate clusters are generated, the more specific linear templates covered by a given candidate linear template are found and merged into it, and the support count of that linear template is updated accordingly. Linear templates whose support count is still less than the support threshold s after the merging optimization are deleted. The remaining linear templates form the extraction result.

[0084] Figure 4 is a schematic diagram of candidate cluster merging provided by this application. As shown in Figure 4, the support count of the linear template "remote_drr172.18.179.21access@timetamp*{1,1}" is 10, the support count of the linear template "remote_drr*{1,1}access@timetamp 20210517164312" is 25, and the support count of the linear template "remote_drr*{1,1}access@timetamp*{1,1}" is 115. The linear template obtained by merging the three linear templates is "remote_drr*{1,1}access@timetamp*{1,1}", and its support count is 10 + 25 + 115 = 150.
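The merging in Figure 4 can be sketched as below. This is a simplified illustration under stated assumptions: templates are tuples whose tokens are either a word (a string) or a wildcard written as a `(min, max)` tuple, coverage is checked position by position on templates of equal length, and the names `covers` and `aggregate_supports` are hypothetical.

```python
def covers(general, specific):
    """True if `general` matches everything that `specific` matches."""
    if len(general) != len(specific):
        return False
    for g, s in zip(general, specific):
        if isinstance(g, tuple):
            # A wildcard covers one fixed word or a narrower wildcard.
            lo, hi = s if isinstance(s, tuple) else (1, 1)
            if not (g[0] <= lo and hi <= g[1]):
                return False
        elif g != s:
            return False  # fixed words must be identical
    return True


def aggregate_supports(templates, support_threshold=0):
    """templates: dict mapping template -> support count."""
    merged = {}
    for template, support in templates.items():
        covering = [g for g in templates if g != template and covers(g, template)]
        # Move the support to the covering template with the most wildcards,
        # or keep it on the template itself when nothing covers it.
        target = max(
            covering,
            key=lambda g: sum(isinstance(t, tuple) for t in g),
            default=template,
        )
        merged[target] = merged.get(target, 0) + support
    # Templates still below the support threshold are deleted.
    return {t: s for t, s in merged.items() if s >= support_threshold}


a = ("remote_drr", "172.18.179.21", "access@timetamp", (1, 1))
b = ("remote_drr", (1, 1), "access@timetamp", "20210517164312")
c = ("remote_drr", (1, 1), "access@timetamp", (1, 1))
result = aggregate_supports({a: 10, b: 25, c: 115})
print(result)
```

Templates `a` and `b` are both covered by `c`, so their support counts are added to it, giving the single merged template with support count 150 described for Figure 4.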

[0085] The description of candidate cluster connection is as follows:

[0086] For each frequent word in a linear template, the fitness (the matching degree of step S205, based on the frequency of co-occurrence) between that frequent word and the other frequent words is calculated and expressed as a fitness weight; a threshold t is set as the fitness threshold. For the linear template of each candidate cluster, an auxiliary identifier is created according to the weight model to mark the frequent words whose fitness is below t. Finally, the clusters whose linear templates are exactly the same after this marking are merged into a new cluster.
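The fitness calculation can be illustrated with one plausible weight model. The formula below is an assumption, since this passage does not state it: the fitness of a word w in a template is taken as the average, over the template's frequent words v, of the share of lines containing v that also contain w. The name `word_fitness`, the sample lines and the threshold value 0.8 are hypothetical.

```python
from collections import Counter
from itertools import permutations


def word_fitness(lines, frequent):
    """Return fitness(word, template_words) built from co-occurrence counts."""
    freq = Counter()  # number of lines containing each frequent word
    pair = Counter()  # number of lines containing both words of an ordered pair
    for line in lines:
        present = set(line.split()) & frequent
        freq.update(present)
        pair.update(permutations(present, 2))

    def fitness(word, template_words):
        # Assumed weight model: mean of P(word in line | v in line) over v.
        deps = [
            1.0 if v == word else pair[(v, word)] / freq[v]
            for v in template_words
        ]
        return sum(deps) / len(deps)

    return fitness


lines = (
    ["Interface eth0 down at node router1"] * 50
    + ["Interface eth1 eth2 down at node router2"] * 50
)
frequent = {"Interface", "down", "at", "node", "router1", "router2"}
fitness = word_fitness(lines, frequent)

template_words = ("Interface", "down", "at", "node", "router1")
t = 0.8  # hypothetical fitness threshold
low = [w for w in template_words if fitness(w, template_words) < t]
print(low)
```

Under this assumed model, "router1" appears in only half of the lines that contain the other words, so its fitness is (0.5 × 4 + 1) / 5 = 0.6 and it falls below t, while the remaining words keep a fitness of 1.0.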

[0087] When two or more clusters are connected, set the support count of the combined cluster as the sum of the support counts of the original clusters, and at the same time adjust the linear template of the combined cluster to cover all the linear templates in the original clusters.

[0088] Since the linear template of the combined cluster consists of highly correlated words, it is not prone to overfitting. In addition, frequent words with insufficient dependence weights are merged into the linear template as a list of alternatives, so no data is lost. Finally, connecting clusters reduces the number of linear templates, making it easier for human experts to review the clusters.

[0089] Figure 5 is a schematic diagram of candidate cluster connection provided by this application. As shown in Figure 5, the support count of the linear template "Interface*{1,1}down at node router1" is 50, the support count of the linear template "Interface*{2,3}down at node router2" is 50, and the target linear template obtained after connection is "Interface*{1,3}down at node router1|router2", with a support count of 100.
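The connection in Figure 5 can be sketched as follows. This is an illustration under simplifying assumptions: templates are tuples of words and `(min, max)` wildcard tuples, the templates being joined have the same number of tokens, and the low-fitness words are supplied as one set rather than computed per template. The names `join_clusters` and `<VAR>` are hypothetical.

```python
def join_clusters(clusters, low_fitness):
    """clusters: list of (template, support); low_fitness: words below threshold t."""
    groups = {}
    for template, support in clusters:
        # Auxiliary identifier: mask low-fitness words and wildcard ranges so
        # that only the stable frequent words decide which clusters are joined.
        key = tuple(
            "<VAR>" if tok in low_fitness
            else "*" if isinstance(tok, tuple)
            else tok
            for tok in template
        )
        groups.setdefault(key, []).append((template, support))

    joined = []
    for members in groups.values():
        merged = []
        for column in zip(*(template for template, _ in members)):
            if isinstance(column[0], tuple):
                # Widen the wildcard range to cover every original template.
                merged.append((min(lo for lo, _ in column), max(hi for _, hi in column)))
            else:
                # Keep the word, or list the alternatives in order of appearance.
                merged.append("|".join(dict.fromkeys(column)))
        joined.append((tuple(merged), sum(support for _, support in members)))
    return joined


clusters = [
    (("Interface", (1, 1), "down", "at", "node", "router1"), 50),
    (("Interface", (2, 3), "down", "at", "node", "router2"), 50),
]
result = join_clusters(clusters, low_fitness={"router1", "router2"})
print(result)
```

The combined cluster keeps "router1" and "router2" as an alternatives list, widens the wildcard to {1,3}, and sums the support counts to 100, as described for Figure 5.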

[0090] Figure 6 is a schematic diagram of the log clustering device provided in this application. The device includes:

[0091] The first determining module 61 is used to acquire each log data to be clustered and determine the frequent words in each log data according to a preset support threshold.

[0092] The second determining module 62 is used to determine each first candidate cluster and its corresponding first linear template based on the frequent words in each log data.

[0093] The third determining module 63 is used to determine the similarity between the first linear templates based on the frequent words in each first linear template, merge the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold, and determine each target cluster and its corresponding target linear template.

[0094] The first determining module 61 is used to acquire each log data to be clustered, determine each word in the log data through a word segmentation algorithm, and take the word whose frequency of occurrence is greater than a preset support threshold as the frequent word.

[0095] The first determining module 61 is used, for each log data entry, to deduplicate the word segments in that entry, and to take the word segments whose frequency of occurrence after segmentation and deduplication is greater than a preset support threshold as the frequent words.

[0096] The first determining module 61 is used to acquire each log data to be clustered, determine each word segment in the log data through a word segmentation algorithm, deduplicate the word segments, take the word segments whose frequency is greater than a preset support threshold as the frequent words, and take the word segments whose frequency is not greater than the preset support threshold as the variable words.

[0097] The second determining module 62 is used to classify log data in which the frequent words are the same and the relative position information of the frequent words in the log data is the same as the log data in the same first candidate cluster; and to replace the variable words in the log data in the same first candidate cluster with wildcards to obtain the first linear template corresponding to the first candidate cluster; wherein, the first linear template includes frequent words and wildcards, and the wildcards include wildcard symbols and a range of the number of replacement variable words.

[0098] The third determining module 63 is used to determine the similarity between any two first linear templates based on the number of identical frequent words, the number of identical wildcards, and the total number of word segments in the two first linear templates.

[0099] The third determining module 63 is used to merge the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold to obtain each second candidate cluster and its corresponding second linear template; for each frequent word in each second linear template, to determine the matching degree of the frequent word according to the frequency with which it appears together with each frequent word in the second candidate cluster corresponding to the second linear template and the total frequency of each frequent word; to determine the frequent words to be merged according to the matching degrees of the frequent words and a preset matching degree threshold; and to merge the second candidate clusters whose second linear templates have the same frequent words apart from the frequent words to be merged, to obtain each target cluster and its corresponding target linear template.
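The similarity used by the third determining module 63 depends on the number of identical frequent words, the number of identical wildcards, and the total number of word segments in the two templates. The exact formula is not restated in this passage, so the sketch below assumes a Dice-style ratio purely for illustration; the function name `similarity` and the token representation (strings for words, `(min, max)` tuples for wildcards) are hypothetical.

```python
from collections import Counter


def similarity(a, b):
    """Assumed measure: 2 * (shared words + shared wildcards) / total tokens."""
    words_a = Counter(t for t in a if isinstance(t, str))
    words_b = Counter(t for t in b if isinstance(t, str))
    wild_a = Counter(t for t in a if isinstance(t, tuple))
    wild_b = Counter(t for t in b if isinstance(t, tuple))
    # Counter intersection keeps the minimum count of each shared token.
    common = sum((words_a & words_b).values()) + sum((wild_a & wild_b).values())
    return 2 * common / (len(a) + len(b))


a = ("remote_drr", "172.18.179.21", "access@timetamp", (1, 1))
c = ("remote_drr", (1, 1), "access@timetamp", (1, 1))
print(similarity(a, c))
```

Under this assumed formula the two Figure 4 templates share two frequent words and one wildcard out of eight tokens in total, giving 0.75; the pair would be merged for any similarity threshold below that value.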

[0100] This application also provides an electronic device. As shown in Figure 7, it includes a processor 71, a communication interface 72, a memory 73 and a communication bus 74, wherein the processor 71, the communication interface 72 and the memory 73 communicate with each other through the communication bus 74.

[0101] The memory 73 stores a computer program, which, when executed by the processor 71, causes the processor 71 to perform any of the above method steps.

[0102] The communication bus mentioned in the above electronic devices can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0103] Communication interface 72 is used for communication between the above-mentioned electronic device and other devices.

[0104] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0105] The processors mentioned above can be general-purpose processors, including central processing units, network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits, field-programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.

[0106] This application also provides a computer-readable storage medium storing a computer program executable by an electronic device, which, when run on the electronic device, causes the electronic device to perform any of the above method steps.

[0107] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.

[0108] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A log clustering method, characterized in that the method includes: obtaining each log data entry to be clustered, and determining the frequent words in each log data entry based on a preset support threshold; determining, based on the frequent words in each log data entry, each first candidate cluster and its corresponding first linear template; and determining, based on the frequent words in each first linear template, the similarity between the first linear templates, and merging the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold, to determine each target cluster and its corresponding target linear template; wherein merging the first candidate clusters corresponding to the first linear templates whose similarity is greater than the preset similarity threshold to determine each target cluster and its corresponding target linear template includes: merging the first candidate clusters corresponding to the first linear templates whose similarity is greater than the preset similarity threshold to obtain each second candidate cluster and its corresponding second linear template; for each frequent word in each second linear template, determining the matching degree of the frequent word based on the sum of the frequencies with which the frequent word and each frequent word appear in the same second candidate cluster, in the second candidate cluster corresponding to the second linear template, and on the sum of the total frequencies of each frequent word; determining the frequent words to be merged based on the matching degree of the frequent words and a preset matching degree threshold; and merging the second candidate clusters corresponding to the second linear templates with the same frequent words but different frequent words to be merged, to obtain each target cluster and its corresponding target linear template.

2. The method of claim 1, wherein obtaining each log data entry to be clustered and determining the frequent words in each log data entry based on a preset support threshold includes: obtaining each log data entry to be clustered, determining each word segment in the log data using a word segmentation algorithm, and taking the word segments whose frequency is greater than the preset support threshold as the frequent words.

3. The method of claim 2, wherein taking the word segments whose frequency is greater than the preset support threshold as the frequent words includes: for each log data entry, deduplicating the word segments in the log data entry, and taking the word segments whose frequency after deduplication is greater than the preset support threshold as the frequent words.

4. The method of claim 3, wherein obtaining each log data entry to be clustered and determining the frequent words in each log data entry based on a preset support threshold includes: obtaining each log data entry to be clustered, determining each word segment in the log data through a word segmentation algorithm, deduplicating the word segments, taking the word segments whose frequency of occurrence after segmentation and deduplication is greater than the preset support threshold as the frequent words, and taking the word segments whose frequency of occurrence after segmentation and deduplication is not greater than the preset support threshold as the variable words; and wherein determining each first candidate cluster and its corresponding first linear template based on the frequent words in each log data entry includes: treating log data entries that have the same frequent words and the same relative position information of the frequent words as log data in the same first candidate cluster; and replacing the variable words in the log data in the same first candidate cluster with wildcards to obtain the first linear template corresponding to the first candidate cluster; wherein the first linear template includes frequent words and wildcards, and each wildcard includes a wildcard symbol and the range of the number of variable words it replaces.

5. The method of claim 4, wherein determining the similarity between the first linear templates based on the frequent words in each first linear template includes: for any two first linear templates, determining the similarity between them based on the number of identical frequent words, the number of identical wildcards, and the total number of word segments in the two first linear templates.

6. A log clustering apparatus, characterized in that the apparatus includes: a first determining module, used to acquire each log data entry to be clustered and determine the frequent words in each log data entry according to a preset support threshold; a second determining module, used to determine each first candidate cluster and its corresponding first linear template based on the frequent words in each log data entry; and a third determining module, used to determine the similarity between the first linear templates based on the frequent words in each first linear template, merge the first candidate clusters corresponding to the first linear templates whose similarity is greater than a preset similarity threshold, and determine each target cluster and its corresponding target linear template; wherein the third determining module is specifically used to merge the first candidate clusters corresponding to the first linear templates whose similarity is greater than the preset similarity threshold to obtain each second candidate cluster and its corresponding second linear template; for each frequent word in each second linear template, to determine the matching degree of the frequent word based on the sum of the frequencies with which the frequent word and each frequent word appear in the same second candidate cluster, in the second candidate cluster corresponding to the second linear template, and on the sum of the total frequencies of each frequent word; to determine the frequent words to be merged based on the matching degree of the frequent words and a preset matching degree threshold; and to merge the second candidate clusters corresponding to the second linear templates with the same frequent words but different frequent words to be merged, to obtain each target cluster and its corresponding target linear template.

7. The apparatus of claim 6, wherein the first determining module is used to acquire each log data entry to be clustered, determine each word segment in the log data through a word segmentation algorithm, and take the word segments whose frequency is greater than a preset support threshold as the frequent words.

8. The apparatus of claim 7, wherein the first determining module is used, for each log data entry, to deduplicate the word segments in the log data entry, and to take the word segments whose frequency of occurrence after segmentation and deduplication is greater than the preset support threshold as the frequent words.

9. The apparatus of claim 8, wherein the first determining module is used to acquire each log data entry to be clustered, determine each word segment in the log data through a word segmentation algorithm, deduplicate the word segments, take the word segments whose frequency after deduplication is greater than the preset support threshold as the frequent words, and take the word segments whose frequency after deduplication is not greater than the preset support threshold as the variable words; and the second determining module is used to treat log data entries that have the same frequent words and the same relative position information of the frequent words as log data in the same first candidate cluster, and to replace the variable words in the log data in the same first candidate cluster with wildcards to obtain the first linear template corresponding to the first candidate cluster; wherein the first linear template includes frequent words and wildcards, and each wildcard includes a wildcard symbol and the range of the number of variable words it replaces.

10. The apparatus of claim 9, wherein the third determining module is used to determine the similarity between any two first linear templates based on the number of identical frequent words, the number of identical wildcards, and the total number of word segments in the two first linear templates.

11. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the steps of the method described in any one of claims 1-5.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method described in any one of claims 1-5.
