Abnormal behavior detection method, device, equipment and storage medium
By calculating the density and clustering of user behavior data, abnormal behavior data is identified, solving the problem that existing anti-fraud methods are easily bypassed, and achieving effective detection and protection against abnormal behavior.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: BEIJING QIYI CENTURY SCI & TECH CO LTD
- Filing Date: 2023-08-15
- Publication Date: 2026-06-23
AI Technical Summary
Existing anti-fraud methods are easily bypassed by illegal tools, leading to financial losses for companies and users. Therefore, it is crucial to utilize user behavior data to identify disguised user behavior.
By calculating the density between data nodes corresponding to user behavior data, multiple data nodes are clustered based on density. Cluster sets with fewer than a preset threshold number of data nodes are selected, or the target cluster set is determined based on the fluctuation of the number of data nodes in each cluster set. The user behavior data corresponding to the target cluster set is then identified as abnormal behavior data.
It enables effective detection of abnormal behavior, uncovers patterns in user behavior data, identifies a minority of user behaviors that differ from the majority, and protects the assets of the company and users.
Abstract figure: CN117272297B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer application technology, and in particular to an abnormal behavior detection method, apparatus, device, and storage medium.
Background Technology
[0002] Currently, many applications allow users to earn points through various activities, which incentivizes user engagement and promotes the application. However, points, as a resource, are constantly threatened by various emerging illegal tools. Current anti-fraud measures mostly rely on rule engines built from general filter chains, primarily using methods such as preventing tampering with users' Internet Protocol Addresses (IP addresses), symmetric encryption (DFP, quasi-Newton's method), and call frequency limits. These rule engines are easily breached, leading to financial losses for both companies and users. Therefore, the key to solving this problem lies in utilizing user behavior data to discover patterns and identify deceptive user behavior.
Summary of the Invention
[0003] The purpose of this invention is to provide an abnormal behavior detection method, apparatus, device, and storage medium to mine patterns in user behavior data and achieve the detection of abnormal behavior. The specific technical solution is as follows:
[0004] In a first aspect of this invention, an abnormal behavior detection method is provided, comprising:
[0005] Acquire multiple first data nodes, each of which corresponds to a user behavior data;
[0006] For each first data node, the neighbor density of the first data node is calculated, wherein the neighbor density represents the minimum distance between the first data node and the second data node, the mutual density of the second data node is greater than the mutual density of the first data node, the mutual density of the first data node represents the closeness between the first data node and the third data node, the mutual density of the second data node represents the closeness between the second data node and the fourth data node, the third data node includes data nodes whose distance from the first data node meets a first preset condition, and the fourth data node includes data nodes whose distance from the second data node meets a second preset condition;
[0007] Multiple first data nodes are clustered based on their neighbor density to obtain multiple cluster sets;
[0008] Select a cluster set with fewer than a preset threshold number of data nodes as the target cluster set, or determine the target cluster set based on the fluctuation of the number of data nodes in each cluster set;
[0009] The user behavior data corresponding to the first data node included in the target cluster set is identified as abnormal behavior data.
[0010] Optionally, clustering the multiple first data nodes based on their neighbor densities to obtain multiple cluster sets includes:
[0011] Selecting each first data node whose neighbor density is greater than a preset density threshold as a cluster center;
[0012] Traversing each first data node in turn and finding the fifth data node whose neighbor density with respect to the first data node is smallest. If the fifth data node is a cluster center or has already been clustered, the fifth data node and the first data node are clustered into the cluster set corresponding to that cluster center.
[0013] Optionally, determining the target cluster set based on the fluctuation of the number of data nodes in each cluster set includes:
[0014] Sort the cluster sets in ascending or descending order of the number of data nodes in each cluster set.
[0015] For each sorted cluster set except the last one, calculate the fluctuation value of its number of data nodes relative to the next cluster set. Here, the next cluster set of a given cluster set is the cluster set ranked after it in the sorted order, and the last cluster set is the final cluster set in the sorted order.
[0016] Select the largest fluctuation value of the number of data nodes;
[0017] The outlier cluster set is determined based on the maximum fluctuation value of the number of data nodes.
[0018] Optionally, determining the outlier cluster set based on the maximum data node number fluctuation value includes:
[0019] If the cluster sets are sorted in ascending order of the number of data nodes in each cluster set, then the target cluster set corresponding to the maximum fluctuation value of the number of data nodes, and the cluster sets ranked before the target cluster set in each sorted cluster set, are taken as the outlier cluster set.
[0020] If the cluster sets are sorted in descending order of the number of data nodes in each cluster set, then the target cluster set corresponding to the maximum fluctuation value of the number of data nodes, and the cluster sets in each sorted cluster set that are ranked after the target cluster set, are taken as the outlier cluster set.
[0021] Optionally, the mutual density is calculated using the following formula:
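[0022] [Equation (1): mutual density $md(p)$]
[0023] where $MuN(p)$ is the set of third data nodes corresponding to the first data node $p$, $d(p, q)$ is the distance between the first data node $p$ and a third data node $q$, and $md(p)$ is the mutual density of the first data node $p$;
[0024] The neighbor density of the first data node is calculated using the following formula:
[0025] $$\gamma(p) = \min_{q:\ md(q) > md(p)} d(p, q) \tag{2}$$
[0026] where $\gamma(p)$ is the neighbor density of the first data node $p$, and $d(p, q)$ is the distance between the first data node $p$ and the second data node $q$.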
[0027] Optionally, after determining the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data, the method further includes:
[0028] For each piece of abnormal behavior data, determine the user device identifier corresponding to the abnormal behavior data;
[0029] Send the user device identifier to the business end so that the business end can intercept operations of the device corresponding to the user device identifier.
[0030] Optionally, the method further includes:
[0031] Retrieve task execution records corresponding to multiple user identifiers within a preset time range;
[0032] For each user identifier, the number of times a task is executed, the duration of the task, the time point of the task, the number of times the account is changed, and the task type are counted within the preset time range. The number of times a task is executed, the duration of the task, the time point of the task, the number of times the account is changed, and the task type corresponding to a user identifier are collected as user behavior data.
[0033] In a second aspect of the invention, an abnormal behavior detection device is also provided, comprising:
[0034] The acquisition module is used to acquire multiple first data nodes, each of which corresponds to a user behavior data.
[0035] The calculation module is used to calculate the neighbor density of each first data node, wherein the neighbor density represents the minimum distance between the first data node and the second data node, the mutual density of the second data node is greater than the mutual density of the first data node, the mutual density of the first data node represents the closeness between the first data node and the third data node, the mutual density of the second data node represents the closeness between the second data node and the fourth data node, the third data node includes data nodes whose distance from the first data node meets a first preset condition, and the fourth data node includes data nodes whose distance from the second data node meets a second preset condition;
[0036] The clustering module is used to cluster multiple first data nodes based on the neighbor density of multiple first data nodes to obtain multiple cluster sets;
[0037] The anomaly detection module is used to select cluster sets with fewer than a preset threshold number of data nodes as target cluster sets, or to determine target cluster sets based on the fluctuation of the number of data nodes in each cluster set; and to determine the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data.
[0038] In a third aspect of the present invention, an electronic device is also provided, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus.
[0039] The memory is used to store a computer program;
[0040] The processor, when executing the program stored in the memory, implements the steps of any method described in the first aspect.
[0041] In another aspect of the present invention, a computer-readable storage medium is also provided, wherein a computer program is stored therein, and when the computer program is executed by a processor, it implements any of the above-described abnormal behavior detection methods.
[0042] In another aspect of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the above-described abnormal behavior detection methods.
[0043] The abnormal behavior detection method, apparatus, device, and storage medium provided in this invention calculate the density of data nodes from the distances between the data nodes corresponding to user behavior data. Based on the density of data nodes, multiple data nodes are clustered, and a cluster set with fewer data nodes than a preset threshold is selected as the target cluster set. Alternatively, the target cluster set is determined based on the fluctuation of the number of data nodes in each cluster set. The user behavior data corresponding to the first data nodes included in the target cluster set is identified as abnormal behavior data. In this way, by mining the patterns in user behavior data, the minority of behavior data that differs from the majority of user behavior data can be identified, thus realizing the detection of abnormal behavior.
Attached Figure Description
[0044] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.
[0045] Figure 1 is a flowchart of the abnormal behavior detection method in an embodiment of the present invention;
[0046] Figure 2A is a schematic diagram of the clustering results in an embodiment of the present invention;
[0047] Figure 2B is a schematic diagram of a decision graph in an embodiment of the present invention;
[0048] Figure 3 is a flowchart of the abnormal behavior detection method provided in the embodiments of the present invention;
[0049] Figure 4 is a schematic diagram of the data acquisition in Figure 3;
[0050] Figure 5 is a schematic diagram of the abnormal behavior detection in Figure 3;
[0051] Figure 6 is a schematic diagram of the abnormal behavior detection device provided in an embodiment of the present invention;
[0052] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0053] The technical solutions of the present invention will now be described with reference to the accompanying drawings in the embodiments of the present invention.
[0054] Points and coins earned by users through various activities within the app are constantly threatened by a plethora of malicious tools. Current anti-fraud measures largely rely on rule engines built from general filter chains, primarily employing risk control methods such as preventing tampering with users' Internet Protocol Addresses (IP addresses), using symmetric encryption (DFP, quasi-Newton's method), and limiting call frequency. However, professional hackers or malicious actors can easily breach these rule engines, causing losses to both companies and users. Therefore, the key to solving this problem lies in leveraging user behavior data to uncover patterns and identify deceptive user behavior.
[0055] This invention provides a method for detecting abnormal behavior. As shown in Figure 1, the method includes:
[0056] S101, acquire multiple first data nodes, each first data node corresponding to a user behavior data;
[0057] S102, For each first data node, calculate the neighbor density of the first data node, where the neighbor density represents the minimum distance between the first data node and the second data node, the mutual density of the second data node is greater than the mutual density of the first data node, the mutual density of the first data node represents the closeness between the first data node and the third data node, and the mutual density of the second data node represents the closeness between the second data node and the fourth data node.
[0058] S103, cluster the multiple first data nodes based on the neighbor density of the multiple first data nodes to obtain multiple cluster sets;
[0059] S104, Select a cluster set with fewer than a preset threshold number of data nodes as the target cluster set, or determine the target cluster set based on the fluctuation of the number of data nodes in each cluster set;
[0060] S105, the user behavior data corresponding to the first data node included in the target cluster set is identified as abnormal behavior data.
[0061] In this embodiment of the invention, the density of data nodes is calculated by calculating the distance between data nodes corresponding to user behavior data. Multiple data nodes are clustered based on the density of data nodes, and a cluster set with the number of data nodes less than a preset threshold is selected as the target cluster set. Alternatively, the target cluster set is determined based on the fluctuation of the number of data nodes in each cluster set. The user behavior data corresponding to the first data node included in the target cluster set is identified as abnormal behavior data. In this way, abnormal behavior data can be identified. That is, by mining the patterns between user behavior data, a minority of behavior data that is different from the majority of user behavior data can be identified, thus realizing the detection of abnormal behavior.
[0062] The abnormal behavior detection method provided in this embodiment of the invention can be applied to the server providing the application.
[0063] In S101, each piece of user behavior data can generate a corresponding first data node. User behavior data can be understood as behavioral data generated during the user's operation of the application through their device. For example, in a video playback application, this includes data on the user watching videos.
[0064] In one possible approach, task execution records corresponding to multiple user identifiers within a preset time range are obtained; for each user identifier, the number of times the user identifier performs tasks, the duration of the tasks, the time point of the tasks, the number of times the account is changed, and the task type are counted within the preset time range, and the number of times the user identifier performs tasks, the duration of the tasks, the time point of the tasks, the number of times the account is changed, and the task type are treated as user behavior data.
[0065] The preset time range can be determined according to actual needs, such as 1 hour, 2 hours, etc.
[0066] This can be understood as the process of the server periodically detecting abnormal behavior. It can periodically obtain the task execution records corresponding to multiple user identifiers within a certain period of time, and for each user identifier, it can count the number of times the user identifier executed tasks, the duration of the task execution, the time point of the task execution, the number of times the account was changed, and the task type within a preset time range.
[0067] The tasks can include watching videos. Specifically, performing a task might involve actions related to the video, such as clicking on a video.
[0068] When a user watches a video on a user device, such as a terminal, the device can record video information, such as "User watched video XX at XX o'clock." The user device uploads this video information to the server, which then stores it. The user device can periodically send the video information recorded during each time period to the server.
[0069] The acquired task execution records, such as video information, can be understood as feature data. In order to more accurately reflect the information related to the operation task, feature data can be abstracted, or feature attributes can be extracted. For example, for each user identifier, the number of times the user identifier performs tasks, the duration of the task, the time point of the task, the number of times the account is changed, and the task type can be counted within a preset time range. The feature attribute corresponding to a user identifier can be understood as a piece of user behavior data.
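The feature extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record fields (`user_id`, `duration_s`, `hour`, `account_switch`, `task_type`) and the function name `build_behavior_data` are hypothetical, chosen only to mirror the five statistics named in the text.

```python
from collections import defaultdict


def build_behavior_data(records):
    """Aggregate task execution records into one feature dict per user identifier.

    Each record is a hypothetical dict with the keys user_id, duration_s,
    hour, account_switch and task_type; these names are assumptions.
    """
    stats = defaultdict(lambda: {
        "task_count": 0,          # number of times tasks were executed
        "total_duration_s": 0,    # total task duration in seconds
        "hours": [],              # time points (hour of day) of each task
        "account_switches": 0,    # number of account changes
        "task_types": set(),      # distinct task types seen
    })
    for record in records:
        user = stats[record["user_id"]]
        user["task_count"] += 1
        user["total_duration_s"] += record["duration_s"]
        user["hours"].append(record["hour"])
        user["account_switches"] += int(record["account_switch"])
        user["task_types"].add(record["task_type"])
    return dict(stats)
```

Each value of the returned dictionary then plays the role of one piece of user behavior data, that is, one first data node.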
[0070] In S102, the density of the first data nodes is calculated based on the distance between the first data nodes.
[0071] In this process, each user behavior data can be encoded to obtain a feature vector, and the distance between data nodes corresponding to user behavior data can be represented by the distance between the feature vectors corresponding to user behavior data.
[0072] The distance between feature vectors corresponding to user behavior data can include Euclidean distance, etc.
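As an illustration of the encoding and distance steps, the sketch below turns one piece of user behavior data into a numeric feature vector and computes the Euclidean distance between two vectors. The input layout and the choice of features (mean hour, number of distinct task types) are assumptions made for this example; the text does not fix an encoding. In practice the features would usually be scaled to comparable ranges before distances are computed.

```python
import math


def encode(behavior):
    """Encode one behavior record as a feature vector (hypothetical layout)."""
    hours = behavior["hours"]
    mean_hour = sum(hours) / len(hours) if hours else 0.0
    return [
        float(behavior["task_count"]),
        float(behavior["total_duration_s"]),
        mean_hour,
        float(behavior["account_switches"]),
        float(len(behavior["task_types"])),
    ]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```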
[0073] In the process of calculating mutual density, the mutual density of the first data node represents the degree of closeness between the first data node and the third data node, and the mutual density of the second data node represents the degree of closeness between the second data node and the fourth data node.
[0074] Here, a third data node is a data node whose distance from the first data node meets a first preset condition, for example a data node whose distance from the first data node is less than a first preset distance threshold. Alternatively, a third data node is a data node that is a mutual neighbor of the first data node under the same k value, where k is the number of nearest neighbors used in the density clustering process, such as 2 or 3. For example, if p is a k-nearest neighbor of q and q is a k-nearest neighbor of p, then p and q are a pair of mutual neighbors, and q is a third data node corresponding to the first data node p. The mutual density of the first data node represents the closeness between the first data node and its third data nodes. In simple terms, a third data node can be understood as a neighboring data node of the first data node.
[0075] Similarly, a fourth data node is a data node whose distance from the second data node satisfies a second preset condition, for example a data node whose distance from the second data node is less than a second preset distance threshold. The second preset distance threshold can be the same as or different from the first preset distance threshold. Alternatively, a fourth data node is a data node that is a mutual neighbor of the second data node under the same k value. For example, if p1 is a k-nearest neighbor of q1 and q1 is a k-nearest neighbor of p1, then p1 and q1 are a pair of mutual neighbors, and q1 is a fourth data node corresponding to the second data node p1. The mutual density of the second data node represents the closeness between the second data node and its fourth data nodes. The k value here can be the same as the k value used for the third data nodes, or a different k value can be used. In simple terms, a fourth data node can be understood as a neighboring data node of the second data node.
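The mutual-neighbor relation described above can be sketched as a brute-force search. The function names are hypothetical and the quadratic search is chosen only for clarity; the text does not prescribe how the neighbors are found.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    # Ties in distance are broken by index so the result is deterministic.
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return set(others[:k])


def mutual_neighbors(points, k):
    """For each point p, the set of q where p and q are in each other's k-NN."""
    knn = [k_nearest(points, i, k) for i in range(len(points))]
    return [{j for j in knn[i] if i in knn[j]} for i in range(len(points))]
```

With `points = [(0, 0), (1, 0), (0, 1), (10, 10)]` and `k = 2`, the three nearby points are mutual neighbors of one another, while the distant point has no mutual neighbors: it counts the others among its nearest neighbors, but none of them count it.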
[0076] In one possible implementation, the mutual density is calculated using the following formula:
[0077] [Equation (1): mutual density $md(p)$]
[0078] where $MuN(p)$ is the set consisting of the third data nodes corresponding to the first data node $p$, $d(p, q)$ is the distance between the first data node $p$ and a third data node $q$, and $md(p)$ is the mutual density of the first data node $p$.
[0079] The third data nodes corresponding to the first data node $p$ can also be understood as the neighbor data nodes of $p$.
[0080] In a dataset consisting of multiple first data nodes, dense regions have a large number of mutual neighbors, the distance between data points and their mutual neighbors is relatively small, and the mutual density of data points is high. Similarly, sparse regions have low mutual density.
[0081] The neighbor density of the first data node is calculated using the following formula:
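[0082] $$\gamma(p) = \min_{q:\ md(q) > md(p)} d(p, q) \tag{2}$$
[0083] where $\gamma(p)$ is the neighbor density of the first data node $p$, and $d(p, q)$ is the distance between the first data node $p$ and the second data node $q$.
[0084] The neighbor density is also called the γ density (gamma density) of p. It describes the minimum distance between the first data node p and the data nodes whose mutual density is greater than that of p.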
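A minimal sketch of the γ density computation follows. It takes the mutual densities as given input, since only their ordering matters here. One point is an assumption of this sketch and is not stated in the text: for a node with no denser node, γ is set to its largest distance to any other node, a convention borrowed from density-peak clustering.

```python
import math


def gamma_density(points, md):
    """gamma[p] = minimum distance from p to any node with greater mutual density."""
    n = len(points)
    gamma = [0.0] * n
    for p in range(n):
        to_denser = [math.dist(points[p], points[q])
                     for q in range(n) if md[q] > md[p]]
        if to_denser:
            gamma[p] = min(to_denser)
        else:
            # Assumption: no denser node exists, so fall back to the
            # largest distance from p to any other node.
            gamma[p] = max((math.dist(points[p], points[q])
                            for q in range(n) if q != p), default=0.0)
    return gamma
```

A node that is both dense and far from every denser node ends up with a large γ, which is what makes it stand out as a candidate cluster center.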
[0085] S103, cluster multiple first data nodes based on the neighbor density of multiple first data nodes to obtain multiple cluster sets.
[0086] In this embodiment of the invention, density and distance-based clustering (DDC) can be used for clustering.
[0087] In this embodiment of the invention, when clustering with the DDC algorithm, density is characterized by the neighbor density, specifically by the mutual density and the γ density. In other words, this embodiment clusters with an improved DDC algorithm.
[0088] S104. Select a cluster set with fewer than a preset threshold number of data nodes as the target cluster set, or determine the target cluster set based on the fluctuation of the number of data nodes in each cluster set.
[0089] In this embodiment of the invention, a cluster set with fewer than a preset threshold number of data nodes can be selected as the target cluster set, also known as an outlier cluster set. The preset threshold number can be determined based on actual needs, selecting a set with a significantly smaller number of data nodes than other cluster sets as the outlier cluster set.
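The threshold-based selection can be sketched in a few lines. The cluster representation (a dict mapping a cluster id to its member list) and the name `small_clusters` are assumptions for illustration.

```python
def small_clusters(clusters, threshold):
    """Return the cluster sets that have fewer than `threshold` data nodes.

    `clusters` maps a cluster id to the list of data nodes it contains.
    """
    return {cid: members for cid, members in clusters.items()
            if len(members) < threshold}
```

For example, with clusters of sizes 3, 3, 300 and 1000 and a threshold of 10, only the two clusters of size 3 are returned as target (outlier) cluster sets.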
[0090] In this embodiment of the invention, the fluctuation of the number of data nodes in each cluster set can also be calculated, and the outlier cluster set can be determined based on the fluctuation.
[0091] This fluctuation can also be referred to as the outlier factor.
[0092] The cluster sets can be sorted in ascending or descending order of the number of data nodes. For each sorted cluster set except the last one, the fluctuation value of the number of data nodes in each cluster set relative to the next cluster set is calculated. For each cluster set, the next cluster set is the cluster set that is ranked after the current cluster set in the sorted cluster set. The maximum fluctuation value of the number of data nodes is selected. The outlier cluster set is determined based on the maximum fluctuation value of the number of data nodes.
[0093] The last cluster set is the last cluster set among all the sorted cluster sets.
[0094] If the cluster sets are sorted in ascending order of the number of data nodes in each cluster set, then after selecting the maximum fluctuation value of the number of data nodes, the target cluster set corresponding to the maximum fluctuation value of the number of data nodes, as well as the cluster sets in each sorted cluster set that are ranked before the target cluster set, are taken as the outlier cluster set.
[0095] If the cluster sets are sorted in descending order of the number of data nodes in each cluster set, then after selecting the maximum fluctuation value of the number of data nodes, the target cluster set corresponding to the maximum fluctuation value of the number of data nodes, as well as the cluster sets in each sorted cluster set that are ranked after the target cluster set, are taken as the outlier cluster set.
[0096] The fluctuation in the number of data nodes, also known as the fluctuation value, can be calculated using the following formula:
[0097] [Equation: $COF(C_i)$]
[0098] where $i = 1, 2, \ldots, n-1$; $COF(C_i)$ is the fluctuation in the number of data nodes between cluster set $C_i$ and cluster set $C_{i+1}$, with a value range of $(0, 1]$. The larger the value, the more likely $C_1, C_2, \ldots, C_i$ are to be outlier clusters; $n$ is the number of cluster sets.
[0099] When the cluster sets are sorted in ascending order of the number of data nodes, the fluctuation value of the number of data nodes relative to the next cluster set is calculated for each sorted cluster set except the last one, where the next cluster set is the one ranked immediately after the current cluster set. The largest fluctuation value is selected. The cluster set corresponding to the largest fluctuation value, together with the cluster sets ranked before it, are taken as the outlier cluster sets.
[0100] This fluctuation is the cluster outlier factor (COF). For example, if the cluster sets $C = \{C_1, C_2, \ldots, C_n\}$ are sorted so that $|C_1| \le |C_2| \le \ldots \le |C_n|$, the outlier factor of a cluster, i.e., the fluctuation in the number of data nodes, can be calculated using the following formula:
[0101] [Equation: $COF(C_i)$]
[0102] where $COF(C_i)$ is the fluctuation in the number of data nodes between $C_i$ and $C_{i+1}$, also known as the fluctuation value, with a value range of $(0, 1]$. The larger the value, the more likely $C_1, C_2, \ldots, C_i$ are to be outlier clusters; $n$ is the number of cluster sets.
[0103] As shown in Figure 2A, there are four clusters, C1, C2, C3, and C4, with 3, 3, 300, and 1000 data points respectively. COF(C1) = 0.6321, COF(C2) = 1, and COF(C3) = 0.0328. The maximum value is COF(C2) = 1, so C1 and C2 are outlier clusters. In Figure 2A, the horizontal and vertical axes represent the two-dimensional attributes of data points in a two-dimensional space.
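The selection step for the ascending-order case can be sketched as below. The sketch deliberately takes the fluctuation values as input rather than computing them, and the function name and argument layout are hypothetical. It covers only the rule "the cluster with the largest fluctuation value and every cluster ranked before it".

```python
def outlier_clusters_ascending(sizes, fluctuation):
    """Return indices of outlier clusters for cluster sizes sorted ascending.

    sizes[i] is the number of data nodes in the i-th sorted cluster set;
    fluctuation[i] is the fluctuation value of cluster i relative to
    cluster i + 1, so it has one entry fewer than sizes.
    """
    if len(fluctuation) != len(sizes) - 1:
        raise ValueError("need one fluctuation value per adjacent pair")
    if sizes != sorted(sizes):
        raise ValueError("sizes must be sorted in ascending order")
    # Index of the cluster set with the largest fluctuation value.
    target = max(range(len(fluctuation)), key=lambda i: fluctuation[i])
    # The target cluster set and all cluster sets ranked before it.
    return list(range(target + 1))
```

Using the values from the Figure 2A example, `outlier_clusters_ascending([3, 3, 300, 1000], [0.6321, 1.0, 0.0328])` returns `[0, 1]`, that is, C1 and C2.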
[0104] In another scenario, the cluster sets can be sorted in descending order of the number of data nodes. For each sorted cluster set except the last one, the fluctuation value of its number of data nodes relative to the next cluster set is calculated, where the next cluster set is the one ranked immediately after the current cluster set. The largest fluctuation value is selected. The cluster set corresponding to the largest fluctuation value, together with the cluster sets ranked after it, are taken as the outlier cluster sets.
[0105] In this embodiment of the invention, detection operates at the granularity of small clusters: outlier cluster sets are determined by clustering and by the number of data nodes in each cluster set. The time complexity is O(n^2 log n), and the algorithm is efficient.
[0106] S105, the user behavior data corresponding to the first data node included in the target cluster set is identified as abnormal behavior data.
[0107] In this embodiment of the invention, the density of data nodes is calculated by calculating the distance between data nodes corresponding to user behavior data. Multiple data nodes are clustered based on the density of data nodes, and a cluster set with the number of data nodes less than a preset threshold is selected as the target cluster set. Alternatively, the target cluster set is determined based on the fluctuation of the number of data nodes in each cluster set. The user behavior data corresponding to the first data node included in the target cluster set is identified as abnormal behavior data. In this way, abnormal behavior data can be identified. That is, by mining the patterns between user behavior data, a minority of behavior data that is different from the majority of user behavior data can be identified, thus realizing the detection of abnormal behavior.
[0108] In an alternative embodiment, after S105, the following may also be included:
[0109] For each abnormal behavior data, determine the user device identifier corresponding to the abnormal behavior data; send the user device identifier to the business end so that the business end can intercept the operation of the device corresponding to the user device identifier.
[0110] The business end can be notified asynchronously, and the user device identifier can be the device number.
[0111] The business end stores the device number corresponding to the abnormal behavior data in Redis (a key-value database that can run in memory and supports log-based persistence).
[0112] The device number is stored in Redis to improve business response speed. When a user initiates a request from the user device corresponding to the device number, the request is intercepted by an interceptor. Here, the interceptor can use the preHandle method of HandlerInterceptor. When a request is intercepted, developers and operations personnel are notified by an alert email.
[0113] Simply put, in this embodiment of the invention, the detected user device is reported to the business end, which then blocks it, thereby protecting company assets.
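The interception flow can be illustrated with a language-neutral sketch. Everything here is a stand-in: an in-memory set plays the role of Redis, `pre_handle` mimics the role of an interceptor's pre-handling hook, and `notify` stands for the alert email. None of these names come from the text or from a real framework API.

```python
class DeviceBlocklist:
    """In-memory stand-in for the store of abnormal device numbers."""

    def __init__(self):
        self._blocked = set()

    def add(self, device_id):
        self._blocked.add(device_id)

    def contains(self, device_id):
        return device_id in self._blocked


def pre_handle(request, blocklist, notify):
    """Return False (reject) if the request comes from a blocked device."""
    device_id = request.get("device_id")
    if device_id is not None and blocklist.contains(device_id):
        # Stand-in for the alert sent to developers and operations staff.
        notify(f"blocked request from device {device_id}")
        return False
    return True
```

A set membership check is used because the lookup happens on every request, which matches the stated motivation of keeping the device numbers in a fast store.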
[0114] This invention utilizes user behavior data to identify users exhibiting abnormal behavior, thereby addressing vulnerabilities in the risk control system's rule engine and safeguarding the assets of both the company and its users.
[0115] This invention proposes mutual density and γ density (neighbor density). The cluster center and the number of clusters are determined based on the abnormally large γ density. The remaining dataset is assigned to the cluster of the nearest neighbor with a density greater than its own, thus completing the clustering in one step.
[0117] In one possible implementation, S103 includes: selecting each first data node whose neighbor density is greater than a preset density threshold as a cluster center; traversing the first data nodes in turn and, for each first data node, finding the fifth data node, namely the data node that gives the first data node its smallest neighbor density (its nearest neighbor with a greater density); and, if the fifth data node is a cluster center or has already been clustered, clustering the fifth data node and the first data node into the cluster set corresponding to that cluster center.
[0118] Specifically, this can be achieved through the following process:
[0119] Points with abnormally high mutual density and γ density are identified as cluster centers, and clustering is performed using the DDC algorithm:
[0120] Input: Dataset D (including multiple user behavior data, one user behavior data corresponds to one data node);
[0121] Output: Outlier cluster OC = {C1, C2, ..., Cb};
[0122] COF(D) (the fluctuation in the number of data nodes in each cluster relative to the next cluster in outlier clustering);
[0123] (1) BEGIN
[0124] Initialization: Clusters=∅, templist=∅, Ci=∅, OC=∅
[0125] (2) Run the MuN-Searching algorithm to calculate the mutual neighbors of each data node.
[0126] (3) For each data node p∈D
[0127] (4) Calculate md(p) according to equation (1)
[0128] (5) End For
[0129] (6) For each data node q∈D
[0130] (7) Calculate the γ density of q according to equation (2).
[0131] (8) End For
[0132] (9) Construct a decision graph using mutual density and γ density;
[0133] The decision graph can be a two-dimensional graph whose two axes are the mutual density and the γ density. For example, as shown in Figure 2B, the horizontal axis represents the neighbor density (γ density) and the vertical axis represents the mutual density.
[0134] (10) Identify the data nodes mi with abnormally high γ density in the decision graph as cluster centers;
[0135] Select data nodes whose γ density is greater than a preset density threshold;
[0136] (11) Ci = Ci∪mi , i = 1, 2, ..., k
[0137] // Initially, each cluster Ci contains only its cluster center mi.
[0138] (12) For each data node p∈D
[0139] // Loop through all data nodes
[0140] (13) If visited(p)≠true
[0141] // If p has not been clustered
[0142] (14) templist=templist∪p
[0143] // Place p in a temporary container
[0144] (15) Find the nearest neighbor q with a density greater than that of point p.
[0145] // q is the point closest to p among the points whose mutual density is greater than that of p
[0146] (16) If q=mi || q∈Ci
[0147] // If q is the cluster center of Ci or q has already been clustered into Ci
[0148] (17) Ci=Ci∪templist
[0149] // Copy all data nodes in the temporary container to Ci
[0150] (18) templist=∅ // Clear the temporary container for later use
[0151] (19) Else
[0152] (20) templist=templist∪q
[0153] // If q is neither a cluster center nor already clustered, put q into the temporary container (q will be clustered into Ci together with p).
[0154] (21) visited(q)=true
[0155] // Mark q as visited
[0156] (22) p=q
[0157] // Set p to q (that is, continue searching for the nearest neighbor with a density greater than that of q, recursively).
[0158] (23) goto step(15)
[0159] (24) End If
[0160] (25) End If
[0161] (26) End For
[0162] (27) Sort C1, C2, ..., Ck in ascending order of the number of data nodes.
[0163] (28) According to COF(Cb)=max{COF(Ci)},
[0164] find the boundary b of the outlier clusters.
[0165] (29) OC = OC∪Ci, i = 1, 2, ..., b
[0166] (30) Output OC
[0167] (31) END
[0168] In the pseudocode, Ci is the i-th cluster, mi is the cluster center, templist stores the temporary cluster where point p is located, and the algorithm output OC is the union of the outlier clusters. The points in the outlier clusters are the outlier points.
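The clustering part of the pseudocode (steps 6 to 26) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the embodiment's implementation: the mutual density `md` is taken as a precomputed input because equation (1) is not restated here; distances are assumed to be Euclidean; a node with no denser neighbor is assumed to have infinite γ density and therefore always becomes a cluster center; the function names `nearest_denser` and `ddc_cluster` are hypothetical.

```python
import math


def nearest_denser(points, md):
    """For each node p, find the closest node q with md[q] > md[p].

    Returns a list of (q, distance) pairs; (None, inf) when no denser
    node exists. The distance plays the role of the gamma density.
    """
    result = []
    for p in range(len(points)):
        best, best_d = None, math.inf
        for q in range(len(points)):
            if md[q] > md[p]:
                d = math.dist(points[p], points[q])  # Euclidean (assumed)
                if d < best_d:
                    best, best_d = q, d
        result.append((best, best_d))
    return result


def ddc_cluster(points, md, gamma_threshold):
    """Cluster node indices by following nearest-denser-neighbor links."""
    nd = nearest_denser(points, md)
    # Step (10): nodes with gamma above the threshold are cluster centers.
    centers = [p for p, (_, g) in enumerate(nd) if g > gamma_threshold]
    label = {c: i for i, c in enumerate(centers)}
    # Steps (12)-(26): walk from each unclustered node toward denser
    # neighbors until a clustered node (or a center) is reached.
    for p in range(len(points)):
        chain, cur = [], p
        while cur not in label:
            chain.append(cur)        # plays the role of templist
            cur = nd[cur][0]
        for node in chain:
            label[node] = label[cur]
    clusters = [[] for _ in centers]
    for node, i in label.items():
        clusters[i].append(node)
    return clusters
```

Because each link leads to a node of strictly greater mutual density, every walk ends at a node that has no denser neighbor, and under the assumption above that node is always a center, so the loop terminates.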
[0169] Figure 3 is a flowchart illustrating the abnormal behavior detection method provided in the embodiments of the present invention.
[0170] Step 1, access the app (application);
[0171] The user device accesses the application.
[0172] Step 2, collect data;
[0173] To detect abnormal behavior using business data, the selection of feature data is crucial. Attributes that are strongly correlated with the behavior of the executed tasks should be chosen. The acquisition and storage of feature attribute data need to consider data volume, real-time performance, and high availability.
[0174] After obtaining the feature data, a high-performance and accurate detection algorithm is selected. In this embodiment of the invention, an improved DDC algorithm is used to cluster the data nodes corresponding to these feature data. Based on the outlier factor of the clusters, outlier small clusters are detected. That is, cluster sets with fewer than a preset threshold number of data nodes are selected as target cluster sets. Alternatively, the target cluster set is determined based on the fluctuation of the number of data nodes in each cluster set. The user behavior data corresponding to the first data node included in the target cluster set is identified as abnormal behavior data. Finally, the detected devices are notified to the business side. That is, for each abnormal behavior data, the user device identifier corresponding to the abnormal behavior data is determined. The user device identifier is sent to the business side for blocking, thereby protecting company assets.
[0175] Data collection proceeds as shown in Figure 4. While a user watches a video, video information is reported. This video information, that is, feature data such as viewing duration and number of views, can be reported through a message queue. After being abstracted into feature attributes, it can be stored centrally, for example by the company's big data department.
[0176] Feature data is reported and stored incrementally as users view content.
[0177] Step 3, data abstraction;
[0178] Feature attribute selection focuses on attributes highly relevant to the task, such as watching videos. The following indicators are extracted from the original feature data: the number of task executions within a specified time period (abnormal behavior involves frequent task executions); the duration of task execution; the time points of task execution (abnormal behavior is characterized by a concentrated time distribution); the number of times the user device changed accounts during task execution (abnormal behavior typically involves multiple accounts); and the type of activity participated in (abnormal behavior generally involves concentrated and frequent participation in activities such as watching videos). Each data point is abstracted into these five dimensions.
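The abstraction into five dimensions can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the record fields (`user_id`, `account`, `task_type`, `duration_s`, `hour`), the choice to summarize time points by their mean hour, the choice to summarize the task type by the most frequent one, and the function name `abstract_features`. Records are assumed to arrive in chronological order.

```python
from collections import Counter, defaultdict
from statistics import mean


def abstract_features(records):
    """Turn raw task records into one five-dimensional tuple per user.

    Dimensions: (execution count, total duration in seconds,
    mean hour of execution, number of account changes, dominant task type).
    """
    by_user = defaultdict(list)
    for r in records:
        by_user[r["user_id"]].append(r)

    features = {}
    for uid, rs in by_user.items():
        accounts = [r["account"] for r in rs]
        # Count transitions between consecutive records with different accounts.
        switches = sum(1 for a, b in zip(accounts, accounts[1:]) if a != b)
        dominant_type = Counter(r["task_type"] for r in rs).most_common(1)[0][0]
        features[uid] = (
            len(rs),
            sum(r["duration_s"] for r in rs),
            mean(r["hour"] for r in rs),
            switches,
            dominant_type,
        )
    return features
```

Categorical values such as the task type would still need a numeric encoding before distances between data nodes can be computed; that step is left out of this sketch.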
[0179] Step 4, scheduled task;
[0180] Step 5, DDC algorithm detection;
[0181] Abnormal behavior detection based on the improved DDC algorithm provided in this invention is performed as an offline scheduled task.
[0182] The abnormal behavior detection process is shown in Figure 5. A scheduled task is executed once every hour. The feature data from that hour is input into the DDC algorithm detection module, where the improved DDC algorithm performs clustering and outputs the outlier clusters. These outlier clusters, also called the outlier cluster set, are the target cluster set mentioned above. They are labeled as abnormal clusters, the user behavior data within them is marked as abnormal behavior data, and the device IDs corresponding to that data are output. The process for determining abnormal behavior data is detailed in the embodiment illustrated in Figure 1 above.
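One hourly run of the scheduled task might look like the sketch below. The names are hypothetical: `run_hourly_detection` is an illustrative function, `reported_at` and `device_id` are assumed record fields, and `detect` is an assumed callable that returns the records judged abnormal (for example, a wrapper around the DDC detection module). How the task is triggered each hour is outside the sketch.

```python
from datetime import datetime, timedelta


def run_hourly_detection(records, now, detect):
    """Run detection on the records reported during the hour before `now`.

    Returns the sorted, de-duplicated device IDs of abnormal records.
    """
    window_start = now - timedelta(hours=1)
    # Keep only the feature data reported in the last hour.
    batch = [r for r in records if window_start <= r["reported_at"] < now]
    abnormal = detect(batch)
    return sorted({r["device_id"] for r in abnormal})
```

The returned device IDs are what the later steps report to the business side and store as abnormal device numbers.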
[0183] Step 6: Report abnormal behavior;
[0184] Step 7, store the abnormal device number;
[0185] The device IDs output above, which correspond to the user behavior data in the outlier clusters, are the abnormal device IDs.
[0186] Step 8: Block abnormal behavior.
[0187] If an abnormal device number is detected performing a task, the abnormal behavior is blocked.
[0188] The interceptor can intercept requests made from the user device corresponding to the abnormal device number. The interceptor can use the preHandle method of HandlerInterceptor. When a request is intercepted, developers and operations personnel are notified by alert email.
[0189] The embodiments of this invention use business data to detect and block abnormal behavior that occurs while tasks are performed to obtain points. This compensates for the shortcomings of general-purpose risk-control rule engines, intercepts malicious tools, and improves the security of the company's sensitive assets.
[0190] Corresponding to the abnormal behavior detection method provided in the above embodiments, this invention also provides an abnormal behavior detection device. As shown in Figure 6, the device includes:
[0191] The acquisition module 601 is used to acquire multiple first data nodes, each first data node corresponding to a user behavior data;
[0192] The calculation module 602 is used to calculate the neighbor density of each first data node, wherein the neighbor density represents the minimum distance between the first data node and the second data node, the mutual density of the second data node is greater than the mutual density of the first data node, the mutual density of the first data node represents the closeness between the first data node and the third data node, the mutual density of the second data node represents the closeness between the second data node and the fourth data node, the third data node includes data nodes whose distance from the first data node meets a first preset condition, and the fourth data node includes data nodes whose distance from the second data node meets a second preset condition;
[0193] Clustering module 603 is used to cluster multiple first data nodes based on the neighbor density of multiple first data nodes to obtain multiple cluster sets;
[0194] The anomaly detection module 604 is used to select cluster sets with fewer than a preset threshold number of data nodes as target cluster sets, or to determine target cluster sets based on the fluctuation of the number of data nodes in each cluster set; and to determine the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data.
[0195] Optionally, the clustering module 603 is specifically used to: select each first data node whose neighbor density is greater than a preset density threshold as a cluster center; traverse the first data nodes in turn and, for each first data node, find the fifth data node, namely the data node that gives the first data node its smallest neighbor density (its nearest neighbor with a greater density); and, if the fifth data node is a cluster center or has already been clustered, cluster the fifth data node and the first data node into the cluster set corresponding to that cluster center.
[0196] Optionally, the anomaly detection module 604 is specifically used to sort the cluster sets according to the ascending order of the number of data nodes in each cluster set or according to the descending order of the number of data nodes in each cluster set; for each cluster set except the last cluster set in the sorted cluster sets, calculate the fluctuation value of the number of data nodes of each cluster set relative to the next cluster set, wherein, for each cluster set except the last cluster set in the sorted cluster sets, the next cluster set corresponding to the cluster set is the cluster set that follows the cluster set in the sorted cluster sets, and the last cluster set is the last cluster set in the sorted cluster sets; select the maximum fluctuation value of the number of data nodes; and determine the outlier cluster set based on the maximum fluctuation value of the number of data nodes.
[0197] Optionally, the anomaly detection module 604 is specifically configured to, if the cluster sets are sorted in ascending order of the number of data nodes in each cluster set, take the target cluster set corresponding to the largest fluctuation value of the number of data nodes, and the cluster sets in the sorted cluster sets that are ranked before the target cluster set, as outlier cluster sets; if the cluster sets are sorted in descending order of the number of data nodes in each cluster set, take the target cluster set corresponding to the largest fluctuation value of the number of data nodes, and the cluster sets in the sorted cluster sets that are ranked after the target cluster set, as outlier cluster sets.
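The two ways of choosing the target cluster sets can be sketched as follows. The fluctuation value is not defined in this passage, so the sketch assumes one for illustration: the ratio of the size of the next cluster set to the size of the current one, with cluster sets sorted in ascending order. The function names `small_clusters` and `outlier_clusters_by_fluctuation` are hypothetical.

```python
def small_clusters(clusters, min_size):
    """Option 1: cluster sets with fewer than `min_size` data nodes."""
    return [c for c in clusters if len(c) < min_size]


def outlier_clusters_by_fluctuation(clusters):
    """Option 2: split the sorted cluster sets at the largest size jump.

    Assumption: fluctuation of cluster i = len(next) / len(current).
    Cluster sets are assumed to be non-empty.
    """
    ordered = sorted(clusters, key=len)  # ascending by number of nodes
    if len(ordered) < 2:
        return []
    cof = [len(ordered[i + 1]) / len(ordered[i])
           for i in range(len(ordered) - 1)]
    # b is the index of the cluster set with the largest fluctuation.
    b = max(range(len(cof)), key=cof.__getitem__)
    # With ascending order, that cluster set and all before it are outliers.
    return ordered[: b + 1]
```

For cluster sizes 2, 3, 50, and 60 the assumed ratios are 1.5, about 16.7, and 1.2, so the boundary falls after the cluster of size 3 and the two smallest cluster sets are returned.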
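[0198] Optionally, the mutual density is calculated using the following formula:
[0199] (Equation (1); the formula is not reproduced in this text.)
[0200] where the symbols denote, in order: the set of third data nodes corresponding to the first data node; the distance between the first data node and a third data node; and the mutual density of the first data node.
[0201] The neighbor density of the first data node is calculated using the following formula:
[0202] (Equation (2); the formula is not reproduced in this text.)
[0203] where the symbols denote, in order: the neighbor density of the first data node; and the distance between the first data node and the second data node.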
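Read from the verbal definition given for the calculation module (the neighbor density is the minimum distance from a first data node to a second data node whose mutual density is greater), the neighbor density can be written as shown below. This is a reconstruction from that wording, not a copy of the original equation: md and D follow the pseudocode, while d(p, q) is assumed notation for the distance between two data nodes. The case of a data node with no denser neighbor is not covered by this wording and is left open. The mutual density formula cannot be recovered from the surrounding text and is not reconstructed.

```latex
\gamma(p) \;=\; \min_{\substack{q \in D \\ md(q) > md(p)}} d(p, q)
```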
[0204] Optionally, the device further includes:
[0205] The determination module is used to determine the user device identifier corresponding to each abnormal behavior data after determining the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data.
[0206] The sending module is used to send the user device identifier to the business side so that the business side can intercept the operations of the device corresponding to the user device identifier.
[0207] Optionally, the device further includes:
[0208] The statistics module is used to obtain task execution records corresponding to multiple user identifiers within a preset time range. For each user identifier, it counts the number of tasks executed, the duration of the tasks, the time point of the tasks, the number of times the account was changed, and the task type within the preset time range. The number of tasks executed, the duration of the tasks, the time point of the tasks, the number of times the account was changed, and the task type corresponding to a user identifier are collected as user behavior data.
[0209] This invention also provides an electronic device. As shown in Figure 7, it includes a processor 701, a communication interface 702, a memory 703, and a communication bus 704, wherein the processor 701, the communication interface 702, and the memory 703 communicate with each other through the communication bus 704.
[0210] Memory 703 is used to store computer programs;
[0211] The processor 701, when executing the program stored in the memory 703, implements the above-described abnormal behavior detection method steps.
[0212] The communication bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.
[0213] The communication interface is used for communication between the aforementioned electronic device and other devices.
[0214] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
[0215] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0216] In another embodiment of the present invention, a computer-readable storage medium is also provided, wherein a computer program is stored therein, and when the computer program is executed by a processor, it implements any of the abnormal behavior detection methods described in the above embodiments.
[0217] In another embodiment of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the abnormal behavior detection methods described in the above embodiments.
[0218] The above embodiments can be implemented, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0219] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0220] The various embodiments in this specification are described in a related manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the embodiments of apparatus, electronic devices, computer-readable storage media, and computer program products are basically similar to the method embodiments, and therefore the descriptions are relatively simple; relevant parts can be referred to the descriptions of the method embodiments.
[0221] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.
Claims
1. A method for detecting abnormal behavior, characterized in that the method comprises: acquiring multiple first data nodes, each of which corresponds to one piece of user behavior data; for each first data node, calculating the neighbor density of the first data node, wherein the neighbor density represents the minimum distance between the first data node and the second data node, the mutual density of the second data node is greater than the mutual density of the first data node, the mutual density of the first data node represents the closeness between the first data node and the third data node, the mutual density of the second data node represents the closeness between the second data node and the fourth data node, the third data node includes data nodes whose distance from the first data node meets a first preset condition, and the fourth data node includes data nodes whose distance from the second data node meets a second preset condition; clustering the multiple first data nodes based on their neighbor densities to obtain multiple cluster sets; selecting a cluster set with fewer than a preset threshold number of data nodes as the target cluster set, or determining the target cluster set based on the fluctuation of the number of data nodes in each cluster set; and identifying the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data; wherein the mutual density is calculated using a formula (not reproduced in this text) whose symbols denote, in order: the set consisting of the third data nodes corresponding to the first data node, the distance between the first data node and a third data node, and the mutual density of the first data node; and the neighbor density of the first data node is calculated using a formula (not reproduced in this text) whose symbols denote, in order: the neighbor density of the first data node, and the distance between the first data node and the second data node.
2. The method according to claim 1, characterized in that clustering the multiple first data nodes based on their neighbor densities to obtain multiple cluster sets includes: selecting each first data node whose neighbor density is greater than a preset density threshold as a cluster center; traversing the first data nodes in turn and, for each first data node, finding the fifth data node that gives the first data node its smallest neighbor density; and, if the fifth data node is a cluster center or has already been clustered, clustering the fifth data node and the first data node into the cluster set corresponding to that cluster center.
3. The method according to claim 1, characterized in that determining the target cluster set based on the fluctuation of the number of data nodes in each cluster set includes: sorting the cluster sets in ascending order or in descending order of the number of data nodes in each cluster set; for each cluster set in the sorted cluster sets except the last cluster set, calculating the fluctuation value of the number of data nodes of that cluster set relative to the next cluster set, wherein the next cluster set corresponding to a cluster set is the cluster set ranked immediately after it in the sorted cluster sets, and the last cluster set is the last cluster set in the sorted cluster sets; selecting the maximum fluctuation value of the number of data nodes; and determining the outlier cluster set based on the maximum fluctuation value of the number of data nodes.
4. The method according to claim 3, characterized in that, The process of determining the outlier cluster set based on the maximum data node number fluctuation value includes: If the cluster sets are sorted in ascending order of the number of data nodes in each cluster set, then the target cluster set corresponding to the maximum fluctuation value of the number of data nodes, and the cluster sets ranked before the target cluster set in each sorted cluster set, are taken as the outlier cluster set. If the cluster sets are sorted in descending order of the number of data nodes in each cluster set, then the target cluster set corresponding to the maximum fluctuation value of the number of data nodes, and the cluster sets in each sorted cluster set that are ranked after the target cluster set, are taken as the outlier cluster set.
5. The method according to any one of claims 1 to 4, characterized in that, after identifying the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data, the method further includes: for each piece of abnormal behavior data, determining the user device identifier corresponding to the abnormal behavior data; and sending the user device identifier to the business side so that the business side can intercept the operation of the device corresponding to the user device identifier.
6. The method according to any one of claims 1 to 4, characterized in that, The method further includes: Retrieve task execution records corresponding to multiple user identifiers within a preset time range; For each user identifier, the number of times a task is executed, the duration of the task, the time point of the task, the number of times the account is changed, and the task type are counted within the preset time range. The number of times a task is executed, the duration of the task, the time point of the task, the number of times the account is changed, and the task type corresponding to a user identifier are collected as user behavior data.
7. An abnormal behavior detection device, characterized in that the device comprises: an acquisition module, used to acquire multiple first data nodes, each of which corresponds to one piece of user behavior data; a calculation module, used to calculate the neighbor density of each first data node, wherein the neighbor density represents the minimum distance between the first data node and the second data node, the mutual density of the second data node is greater than the mutual density of the first data node, the mutual density of the first data node represents the closeness between the first data node and the third data node, the mutual density of the second data node represents the closeness between the second data node and the fourth data node, the third data node includes data nodes whose distance from the first data node meets a first preset condition, and the fourth data node includes data nodes whose distance from the second data node meets a second preset condition; a clustering module, used to cluster the multiple first data nodes based on their neighbor densities to obtain multiple cluster sets; and an anomaly detection module, used to select cluster sets with fewer than a preset threshold number of data nodes as target cluster sets, or to determine target cluster sets based on the fluctuation of the number of data nodes in each cluster set, and to determine the user behavior data corresponding to the first data node included in the target cluster set as abnormal behavior data; wherein the mutual density is calculated using a formula (not reproduced in this text) whose symbols denote, in order: the set consisting of the third data nodes corresponding to the first data node, the distance between the first data node and a third data node, and the mutual density of the first data node; and the neighbor density of the first data node is calculated using a formula (not reproduced in this text) whose symbols denote, in order: the neighbor density of the first data node, and the distance between the first data node and the second data node.
8. An electronic device, characterized in that, It includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; Memory, used to store computer programs; A processor, when executing a program stored in memory, implements the method of any one of claims 1-6.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method described in any one of claims 1-6.