Abnormal account identification method and device, electronic equipment and storage medium
By constructing a directed acyclic graph for abnormal account detection, the method addresses frequent account registration after repeated device resets, improves the recall rate of abnormal accounts, and helps ensure the effectiveness and trustworthiness of user growth activities on the application platform.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
- Filing Date
- 2023-04-10
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, when users repeatedly restore their devices to factory settings and register new accounts on the reset devices, the recall rate for abnormal accounts is low. Abnormal accounts are therefore hard to identify and handle, which undermines the effectiveness and trustworthiness of user growth activities on the application platform.
Accounts are first filtered at a coarse granularity based on their login information, and a directed acyclic graph (DAG) is constructed for each resulting account cluster. The sequences in the DAG are then used for anomaly detection to quickly identify abnormal accounts and improve the recall rate.
This improves the recall rate of abnormal accounts, protects the quality and reliability of user growth activities on the application platform, and helps prevent cheating.
Figure CN116383465B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and more specifically, to methods, apparatus, electronic devices, and storage media for identifying abnormal accounts.
Background Technology
[0002] In the application (App) business, some activities are designed to grow the number of users on the application platform. The platform rewards accounts that actively bring in new users of the App, thereby increasing the platform's number of daily active users (DAU).
[0003] However, in order to obtain more rewards, some users frequently restore their devices to factory settings and use these devices to keep registering new accounts on the application platform. Each newly registered account then looks like a first registration on a brand-new device, so it is virtually impossible to identify an abnormal distribution, and the recall rate for abnormal accounts is low.
Summary of the Invention
[0004] This disclosure provides a method, apparatus, electronic device, and storage medium for identifying abnormal accounts, in order to at least solve the problem of low recall rate of abnormal accounts in the aforementioned related technologies.
[0005] According to a first aspect of the present disclosure, an abnormal account identification method is provided, comprising: clustering the multiple accounts based on login information of multiple accounts of a target application to obtain multiple account clusters; constructing a directed acyclic graph (DAG) corresponding to each account cluster based on login information of each account in each account cluster, wherein each node in the DAG corresponds to an account; determining multiple sequences contained in each DAG for each DAG, wherein each sequence contains at least two sequentially connected nodes; determining an abnormal sequence in the multiple sequences based on sequence information or node behavior information of each sequence in the multiple sequences, and identifying the account corresponding to the node contained in the abnormal sequence as an abnormal account.
[0006] Optionally, the login information of the account includes information about the login device used by the account and the start and end times of the account's operation on the target application; the step of clustering the multiple accounts based on the login information of multiple accounts of the target application to obtain multiple account clusters includes: clustering multiple accounts associated with the login device information of the multiple accounts to obtain multiple initial account clusters; determining the operation duration corresponding to each account according to the start and end times of each account's operation on the target application, wherein the operation duration represents the time difference between the start and end times of the account's operation on the target application; and filtering the accounts in each initial account cluster according to the operation duration to obtain the multiple account clusters.
[0007] Optionally, the step of filtering accounts in each initial account cluster according to the operation duration to obtain the multiple account clusters includes: for each initial account cluster, removing accounts whose operation duration is greater than or equal to an operation duration threshold from the accounts included in each initial account cluster; and obtaining the multiple account clusters based on the remaining accounts in each initial account cluster.
[0008] Optionally, the login information of the account includes the start time and end time of the account's operation on the target application. The step of constructing a directed acyclic graph (DAG) corresponding to each account cluster based on the login information of each account in each cluster includes: sorting multiple nodes in the DAG corresponding to each account cluster based on the start time, wherein the earlier the start time, the higher the ranking of the corresponding node; for any two nodes among the sorted nodes, calculating the time interval between the start time of the later node and the end time of the earlier node; and, if the time interval is within a preset time interval, establishing an edge connecting the earlier node to the later node to obtain the DAG corresponding to each account cluster.
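The edge rule in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Account` type, the `build_dag` name, and the reading of "within a preset time interval" as a gap between 0 and `max_gap` seconds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    account_id: str
    start: float  # start time of the account's operation, in seconds
    end: float    # end time of the account's operation, in seconds


def build_dag(accounts, max_gap):
    """Return adjacency lists mapping each account id to its successors."""
    # Sort nodes so that an earlier start time ranks higher.
    ordered = sorted(accounts, key=lambda a: a.start)
    edges = {a.account_id: [] for a in ordered}
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            # Interval between the later node's start and the earlier node's end.
            gap = later.start - earlier.end
            # Assumption: "within the preset interval" means 0 <= gap <= max_gap.
            if 0 <= gap <= max_gap:
                edges[earlier.account_id].append(later.account_id)
    return edges
```

Because edges only ever point from a node to one that comes later in the sorted order, the resulting graph cannot contain a cycle.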
[0009] Optionally, determining the multiple sequences contained in the directed acyclic graph includes: obtaining multiple first sequences based on the pointing relationships between nodes in the directed acyclic graph; decomposing each first sequence to obtain multiple second sequences contained in each first sequence; and obtaining multiple sequences contained in the directed acyclic graph based on the multiple first sequences and the multiple second sequences contained in each first sequence.
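One possible reading of this step is sketched below. The paragraph does not define how first sequences are derived or decomposed, so the sketch assumes that a "first sequence" is a maximal path from a node with no incoming edge to a node with no outgoing edge, and that "second sequences" are its shorter contiguous runs of at least two nodes. The function names are hypothetical.

```python
def maximal_paths(edges):
    """Enumerate source-to-sink paths with at least two nodes (assumed 'first sequences')."""
    targets = {v for successors in edges.values() for v in successors}
    sources = [n for n in edges if n not in targets]
    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        if not edges[node]:
            # Reached a sink; keep the path only if it links at least two nodes.
            if len(path) >= 2:
                paths.append(tuple(path))
            return
        for successor in edges[node]:
            walk(successor, path)

    for source in sources:
        walk(source, [])
    return paths


def contiguous_subsequences(path):
    """Shorter contiguous runs of length >= 2 (assumed 'second sequences')."""
    n = len(path)
    return [path[i:j] for i in range(n) for j in range(i + 2, n + 1) if j - i < n]


def all_sequences(edges):
    """Union of the first sequences and the second sequences they contain."""
    sequences = set()
    for path in maximal_paths(edges):
        sequences.add(path)
        sequences.update(contiguous_subsequences(path))
    return sequences
```

Path enumeration can grow exponentially on dense graphs, so a production system would likely need to bound path length or cluster size.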
[0010] Optionally, the node behavior information includes the login duration corresponding to the node, and the step of determining the abnormal sequence among the multiple sequences based on the sequence information or node behavior information of each sequence includes: for each sequence among the multiple sequences, determining the average login duration corresponding to the node contained in each sequence; and determining each sequence as the abnormal sequence if the average login duration is less than a preset duration threshold.
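The average login duration check might look like the sketch below. The function name, the `login_seconds` mapping, and the use of seconds as the unit are assumptions made for illustration.

```python
def has_short_average_login(sequence, login_seconds, threshold_seconds):
    """Flag a sequence whose mean login duration is below the preset threshold."""
    # login_seconds maps each node (account id) to its login duration in seconds.
    mean = sum(login_seconds[node] for node in sequence) / len(sequence)
    return mean < threshold_seconds
```

The comparison is strict, matching the "less than a preset duration threshold" wording.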
[0011] Optionally, the node behavior information further includes click behavior trajectory information within the target application. The step of determining the abnormal sequence among the multiple sequences based on the sequence information or node behavior information of each sequence in the multiple sequences further includes: for each of the multiple sequences, determining the click behavior trajectory information of each of the at least two nodes contained in each sequence within the target application; and determining each sequence as the abnormal sequence if the number of overlapping click behavior trajectory information in the at least two click behavior trajectory information corresponding to each sequence is greater than or equal to a preset number threshold.
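The paragraph does not say how trajectories are represented or when two of them "overlap". The sketch below assumes each trajectory is an ordered list of hypothetical click event names, and counts a trajectory as overlapping when it is identical to that of at least one other node in the same sequence.

```python
from collections import Counter


def count_overlapping_trajectories(sequence, trajectories):
    """Count nodes whose click trajectory is shared with another node in the sequence."""
    # trajectories maps each node to its ordered list of click events.
    counts = Counter(tuple(trajectories[node]) for node in sequence)
    return sum(c for c in counts.values() if c > 1)


def has_repeated_clicks(sequence, trajectories, min_overlap):
    """Flag the sequence when the overlap count reaches the preset number threshold."""
    return count_overlapping_trajectories(sequence, trajectories) >= min_overlap
```

A looser notion of overlap, such as a shared prefix or an edit distance bound, could be substituted without changing the thresholding step.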
[0012] Optionally, the sequence information includes the sequence length, and the step of determining the abnormal sequence among the multiple sequences based on the sequence information or node behavior information of each sequence among the multiple sequences includes: determining the sequence length of each sequence for each of the multiple sequences; and determining each sequence as the abnormal sequence if the sequence length of each sequence is greater than or equal to a preset length threshold.
[0013] Optionally, the login information of the account includes the start time and end time of the account's operation on the target application, and the sequence information also includes node connectivity. The step of determining abnormal sequences among the multiple sequences based on the sequence information or node behavior information of each sequence further includes: for each of the multiple sequences, determining the node connectivity between any two adjacent nodes among the at least two nodes contained in each sequence, wherein the node connectivity is the time difference between the start time of the later node and the end time of the earlier node among the two adjacent nodes; and determining each sequence as the abnormal sequence if the node connectivity between any two adjacent nodes among the at least two nodes contained in each sequence is less than a preset connectivity threshold.
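The node connectivity check can be sketched as below. The function names and the `start`/`end` mappings are hypothetical, and the sketch reads "any two adjacent nodes" as requiring every adjacent pair in the sequence to fall below the threshold, which is an interpretation rather than something the text states.

```python
def node_connectivities(sequence, start, end):
    """Gap between each later node's start time and the preceding node's end time."""
    return [start[later] - end[earlier] for earlier, later in zip(sequence, sequence[1:])]


def is_tightly_chained(sequence, start, end, connectivity_threshold):
    """Flag the sequence when every adjacent gap is below the preset threshold."""
    gaps = node_connectivities(sequence, start, end)
    return all(gap < connectivity_threshold for gap in gaps)
```

Replacing `all` with `any` gives the alternative reading in which a single tight hand-off is enough to flag the sequence.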
[0014] According to a second aspect of the present disclosure, an abnormal account identification device is provided, comprising: a clustering processing module configured to perform clustering processing on the multiple accounts based on login information of multiple accounts of a target application to obtain multiple account clusters; a directed acyclic graph (DAG) construction module configured to construct a DAG corresponding to each account cluster based on login information of each account in each account cluster, wherein each node in the DAG corresponds to an account; a multiple sequence determination module configured to determine multiple sequences contained in each DAG for each DAG, wherein each sequence contains at least two sequentially connected nodes; and an abnormal sequence determination module configured to determine abnormal sequences in the multiple sequences based on sequence information or node behavior information of each sequence in the multiple sequences, and determine the accounts corresponding to the nodes contained in the abnormal sequences as abnormal accounts.
[0015] Optionally, the login information of the account includes information about the login device used by the account and the start and end times of the account's operation on the target application; the clustering module is configured to: perform clustering processing on multiple accounts associated with the login device information of the multiple accounts to obtain multiple initial account clusters; determine the operation duration corresponding to each account based on the start and end times of each account's operation on the target application, wherein the operation duration represents the time difference between the start and end times of the account's operation on the target application; and filter the accounts in each initial account cluster based on the operation duration to obtain the multiple account clusters.
[0016] Optionally, the clustering processing module is configured to: for each initial account cluster, remove accounts whose operation duration is greater than or equal to an operation duration threshold from the accounts included in each initial account cluster; and obtain the multiple account clusters based on the remaining accounts in each initial account cluster.
[0017] Optionally, the login information of the account includes the start time and end time of the account's operation on the target application. The directed acyclic graph (DAG) construction module is configured to: sort multiple nodes in the DAG corresponding to each account cluster based on the start time, wherein the earlier the start time, the higher the corresponding node is in the sort; for any two nodes among the sorted nodes, calculate the time interval between the start time of the later node and the end time of the earlier node; if the time interval is within a preset time interval, establish an edge connecting the earlier node to the later node to obtain the DAG corresponding to each account cluster.
[0018] Optionally, the multiple sequence determination module is configured to: obtain multiple first sequences based on the pointing relationships between nodes in the directed acyclic graph; decompose each first sequence to obtain multiple second sequences contained in each first sequence; and obtain multiple sequences contained in the directed acyclic graph based on the multiple first sequences and the multiple second sequences contained in each first sequence.
[0019] Optionally, the node behavior information includes the login duration corresponding to the node, and the abnormal sequence determination module is configured to: for each of the plurality of sequences, determine the average login duration corresponding to the nodes contained in each sequence; if the average login duration is less than a preset duration threshold, determine each sequence as the abnormal sequence.
[0020] Optionally, the node behavior information further includes click behavior trajectory information within the target application, and the abnormal sequence determination module is configured to: for each of the plurality of sequences, determine the click behavior trajectory information of each of the at least two nodes contained in each sequence within the target application; and determine each sequence as the abnormal sequence if the number of overlapping click behavior trajectory information in the at least two click behavior trajectory information corresponding to each sequence is greater than or equal to a preset number threshold.
[0021] Optionally, the sequence information includes the sequence length, and the abnormal sequence determination module is configured to: determine the sequence length of each of the plurality of sequences; and determine each sequence as the abnormal sequence if the sequence length of each sequence is greater than or equal to a preset length threshold.
[0022] Optionally, the login information of the account includes the start time and end time of the account's operation on the target application, and the sequence information also includes node connectivity. The abnormal sequence determination module is configured to: for each of the plurality of sequences, determine the node connectivity between any two adjacent nodes among the at least two nodes contained in each sequence, wherein the node connectivity is the time difference between the start time of the later node and the end time of the earlier node among the two adjacent nodes; and determine each sequence as the abnormal sequence if the node connectivity between any two adjacent nodes among the at least two nodes contained in each sequence is less than a preset connectivity threshold.
[0023] According to a third aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement an abnormal account identification method according to the present disclosure.
[0024] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, enables the electronic device to perform an abnormal account identification method according to the present disclosure.
[0025] According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements an abnormal account identification method according to the present disclosure.
[0026] The technical solutions provided by the embodiments of this disclosure bring at least the following beneficial effects:
[0027] First, a coarse-grained screening is performed across multiple accounts based on login information. For the account clusters obtained from this initial screening, a finer-grained screening looks for relationships between different accounts by constructing a directed acyclic graph (DAG). Performing anomaly detection on the sequences contained in the DAG allows the user's intent to be determined quickly, so that abnormal accounts are identified rapidly and the recall rate improves. This provides effective risk control against cheating methods such as repeated device reflashing and helps ensure the quality of activities aimed at increasing the number of users on the application platform.
[0028] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0029] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.
[0030] Figure 1 is a flowchart illustrating an abnormal account identification method according to an exemplary embodiment of the present disclosure;
[0031] Figure 2 is a schematic diagram illustrating a directed acyclic graph according to an exemplary embodiment of the present disclosure;
[0032] Figure 3 is a flowchart illustrating a specific implementation of an abnormal account identification method according to an exemplary embodiment of the present disclosure;
[0033] Figure 4 is a block diagram illustrating an abnormal account identification device according to an exemplary embodiment of the present disclosure;
[0034] Figure 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Implementation
[0035] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.
[0036] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following examples do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0037] It should be noted that the phrase "at least one of several items" in this disclosure refers to three parallel cases: "any one of the several items", "a combination of any number of the several items", and "all of the several items". For example, "including at least one of A and B" includes the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" indicates the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing both step one and step two.
[0038] In the app business, there are often campaigns aimed at increasing the number of users on the application platform. These campaigns reward accounts that actively recruit new users, thereby increasing the platform's Daily Active Users (DAU). However, some users may take advantage of these campaigns to fraudulently obtain platform resources. If these user growth campaigns are filled with a large number of fraudulent accounts used to steal resources, it will not only cause significant losses to the application platform and fail to achieve the expected results, but it will also generate negative public opinion and reduce customer trust in the platform.
[0039] In related technologies, group mining (also called gang mining) methods are mainly used to combat abnormal accounts that fraudulently obtain resources during user growth activities on application platforms. Group mining frameworks include mining based on shared media, on similar behavioral chains, on financial relationships, and on social relationships, among others. The main algorithms include the Label Propagation Algorithm (LPA), connected component analysis, the k-core algorithm, the Louvain algorithm, and a series of improved algorithms aimed at optimizing modularity.
[0040] Group mining methods often build a graph from media shared among devices, including mobile phone numbers, ID card numbers, bank card numbers, device identifiers (device_id) provided by the device group, network card physical addresses (mac_id), etc. These media can be called "strong media," meaning they connect accounts strongly. Accounts associated through strong media can often be treated as account groups with similar relationships or consistent risk. However, as criminals' methods constantly evolve, relying solely on strong media relationships is no longer sufficient to identify account groups that share media. Risk control personnel have therefore shifted their focus to weak media combinations. Weak media can include the first three segments of the Internet Protocol (IP) address of the network used by the device, the province and city where the account was registered, the province and city where the account was logged in, the legal representative's name, the application list (applist), etc. Dynamically characterizing the scope of abnormal clusters can serve as the basis for weak media combinations.
[0041] As defenses against abnormal accounts escalate, some users frequently restore their devices to factory settings and then use these devices to keep registering new accounts on application platforms. Each newly registered account appears to be a first-time registration on a completely new device. At this point, the group controlling the devices can no longer be identified accurately through combinations of various media. In this kind of account environment, where every device is essentially in a cold-start state, no dimension of media relationship can be found that ties the group together, so group mining schemes achieve extremely low recall rates for such abnormal accounts.
[0042] In related technologies, anomaly detection schemes can also be used to identify abnormal accounts. These schemes identify abnormal accounts by judging whether the values of existing features are anomalous. The main methods include:
[0043] a. Statistical test method: The basic assumption is that normal data follows a specific distribution pattern and accounts for a large proportion, while the location of outliers is significantly different from that of normal data.
[0044] b. Depth-based method: This method locates outliers from the edge of the point space and determines the number of layers and outliers according to different needs.
[0045] c. Deviation-based method: Each point is checked, and if its value deviates too much from the overall indicators of the set, the point is an outlier.
[0046] d. Distance-based method: Calculate the distance between each point and its surrounding points to determine if a point is abnormal. This method is based on the assumption that normal points have many nearest neighbors, while abnormal points are relatively far from their surroundings.
[0047] e. Density-based method: For the point under study, calculate the density around it and the density around its neighboring points, and calculate the relative density based on these two density values as an anomaly score.
[0048] f. Deep learning methods: Autoencoders (AEs) encode and then decode the samples to uncover anomalous samples that reconstruct poorly and whose distributions differ from most samples.
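As one concrete instance of the methods listed above, the distance-based idea in item d can be sketched as a mean k-nearest-neighbor distance score. This is a generic textbook illustration, not a scheme taken from the related technologies themselves, and the function name and the choice of Euclidean distance are assumptions.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores
```

Points with many close neighbors receive a low score, while isolated points receive a high one. A freshly reset device produces feature values that sit among the normal points, which is why such scores do not separate it from ordinary devices.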
[0049] By proactively defining abnormal cheating behaviors, especially clarifying the boundaries of these behaviors, appropriate model strategies can be set up.
[0050] Criminals repeatedly perform manual factory resets to make devices appear new. From an anomaly detection perspective, each such device looks essentially normal and cannot be identified as abnormal. Statistically, the device models used by criminals are common, widely used models, which makes anomalies difficult to detect. Anomaly detection schemes therefore also have extremely low recall rates and cannot solve the problem of reset black-market devices.
[0051] To address the aforementioned issues in related technologies, the abnormal account identification method, apparatus, electronic device, and storage medium proposed in this disclosure first perform coarse-grained screening among multiple accounts based on login information. For the account clusters obtained from this initial screening, fine-grained screening is then performed to find the relationships between different accounts by constructing a directed acyclic graph (DAG). Performing anomaly detection on the sequences contained in the DAG allows the user's intent to be determined quickly, so that abnormal accounts are identified rapidly and the recall rate improves. This provides effective risk control against cheating methods such as repeated device reflashing and helps ensure the quality of activities that promote the growth of application platform users.
[0052] Figure 1 is a flowchart illustrating an abnormal account identification method according to an exemplary embodiment of the present disclosure.
[0053] Referring to Figure 1, in step 101, multiple accounts can be clustered based on their login information to obtain multiple account clusters. The accounts included in each "account cluster" are those that may be abnormal.
[0054] It should be noted that if an account is an abnormal account used to fraudulently obtain resources, it will quickly log out of the target application after its first login and will not log in again subsequently; while a normal account may log in to the target application multiple times. Therefore, this disclosure requires identifying abnormal accounts used to fraudulently obtain resources based on the relevant information of each account when it first logs into the target application.
[0055] The "account login information" can include partial information about the IP address of the network used by the login device, the apps installed on the login device, the start time of the account's operation on the target application, and the end time of the account's operation on the target application, etc. Furthermore, filtering abnormal accounts based on login information is a coarse-grained method; therefore, the accounts included in the "account cluster" here should be "suspected abnormal" accounts. Additionally, the "start time of the account's operation on the target application" can be the installation time of the target application on the login device, and the "end time of the account's operation on the target application" can be the registration time of the account on the target application.
[0056] According to an exemplary embodiment of this disclosure, as described above, the login information of an account may include information about the login device used by the account, the start time of the account's operation on the target application, and the end time of the account's operation on the target application.
[0057] Accounts whose login device information is associated with one another can be clustered to obtain multiple initial account clusters. In other words, each initial account cluster consists of accounts that share the same media combination when first logging in to the target application.
[0058] Typically, criminals will quickly log out of their accounts after participating in activities aimed at increasing the number of users on an application platform, and then switch to another account to participate in the same activities. Therefore, devices controlled by the same group of criminals will exhibit a near-repeating phenomenon in both time and space. In this case, a coarse-grained screening can be performed across multiple accounts using information about the logged-in devices. This involves using weak media combinations to roughly filter out similar accounts. These roughly filtered similar accounts can be considered to be potentially fraudulent in their attempts to obtain resources.
[0059] According to an exemplary embodiment of this disclosure, the information related to the login device may include at least one of the following: a portion of the IP address of the network used by the login device, the App installed on the login device, and the installation time of the target application on the login device.
[0060] The "partial content of the IP address of the network used by the login device" can be the first three segments of that IP address. For example, if the IP address of the network used by the login device is "168.160.66.119", its first three segments are "168.160.66". When a user frequently restores a device to factory settings, that is, frequently reflashes the device, the first three or even the first two segments of the IP address of the network used by the device generally remain unchanged. Furthermore, if the first three segments of the IP addresses used by the devices of different accounts are the same, those devices can be taken to be located in the same area, for example the same district of the same city. Based on these characteristics, if at least two accounts share the same partial content of the IP address of the network used by their login devices, it can be roughly assumed that these accounts are new accounts registered through repeated device reflashing.
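Extracting the leading segments of an IPv4 address takes only a few lines. The helper name `ip_prefix` is hypothetical, and the sketch assumes dotted-decimal IPv4 input, which it validates with the standard `ipaddress` module.

```python
import ipaddress


def ip_prefix(ip, segments=3):
    """Return the first `segments` dot-separated parts of an IPv4 address."""
    # Raises ValueError if `ip` is not a valid IPv4 address.
    address = ipaddress.IPv4Address(ip)
    return ".".join(str(address).split(".")[:segments])
```

Calling `ip_prefix("168.160.66.119")` yields `"168.160.66"`, the value used in the example above. Passing `segments=2` gives the coarser two-segment prefix.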
[0061] "Apps installed on the login device" refers to the apps that come pre-installed with the device at the factory, excluding apps downloaded manually by the user later. When a user frequently resets their device, the pre-installed apps generally remain unchanged after each reset. Therefore, if at least two accounts have the same pre-installed apps on their login devices, it can be reasonably assumed that these accounts were newly registered through repeated device re-flashing.
[0062] "Installation time of the target application on the logged-in device" refers to the time when the user of the logged-in device downloaded and installed the target application. It's important to note that after each device reflash, the user needs to download and install the target application and register a new account on it. Furthermore, a user may reflash their device multiple times a day, requiring the download and installation of the target application each time. Therefore, if the installation times of the target application on at least two logged-in devices are very close, for example, both on the same day, it can be roughly assumed that these accounts were newly registered through repeated device reflashes.
[0063] In this way, we can first perform coarse-grained screening on multiple accounts by using partial content of the network IP, the installed apps, or the installation time of the target application. Then, we can filter out clusters of accounts with similar characteristics, which can roughly filter out accounts suspected of fraudulently obtaining resources, thus narrowing the screening scope and facilitating subsequent targeted screening of abnormal accounts with finer granularity.
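The coarse-grained screening can be sketched as grouping accounts by a weak media key. The record fields (`account_id`, `ip`, `preinstalled_apps`, `installed_at`) are hypothetical, and combining all three signals into one key with "same calendar day" as the closeness rule is an assumption; the text allows any one of the signals to be used on its own.

```python
from collections import defaultdict


def cluster_key(record):
    """Weak media combination: IP prefix, pre-installed app set, install day."""
    prefix = ".".join(record["ip"].split(".")[:3])
    apps = frozenset(record["preinstalled_apps"])
    day = record["installed_at"].date()  # installed_at is a datetime
    return (prefix, apps, day)


def initial_clusters(records):
    """Group account ids by key and keep groups with at least two accounts."""
    groups = defaultdict(list)
    for record in records:
        groups[cluster_key(record)].append(record["account_id"])
    return [ids for ids in groups.values() if len(ids) >= 2]
```

Groups with a single account are dropped here because a lone account has nothing to be chained with in the later graph step.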
[0064] Then, the operation duration for each account can be determined based on the start and end times of each account's operation on the target application. The operation duration represents the time difference between the start and end times of an account's operation on the target application. Next, accounts in each initial account cluster can be filtered based on their operation durations to obtain multiple account clusters.
[0065] It's important to note that cheating accounts that rely on device flashing always aim for maximum gain in the shortest time. Therefore, once the goal of an activity that promotes user growth on the application platform has been achieved on a device, it is highly likely that the device will be restored to factory settings shortly afterward. In this case, the cheating device will register a new account on the target application immediately after downloading and installing it. The time interval between the installation of the target application and the registration of this new account is very short, possibly only a few minutes. The shorter this interval, the more likely it is that the device is being flashed frequently. Conversely, a longer interval suggests that the user did not register immediately after downloading the target application, but used the device for a while before registering. This contradicts the cheating account's "shortest time, maximum gain" principle. Therefore, such a new account is likely a legitimate account, not an abnormal one.
[0066] According to an exemplary embodiment of this disclosure, for each initial account cluster, accounts with operation durations greater than or equal to an operation duration threshold can be removed from the accounts included in each initial account cluster. Then, multiple account clusters can be obtained based on the remaining accounts in each initial account cluster, that is, multiple account clusters can be obtained based on the remaining accounts in each initial account cluster whose corresponding operation durations are less than the operation duration threshold.
[0067] It should be noted that for each of the multiple accounts associated with the login device's information, the single-operation time corresponding to that account can be determined. Specifically, for each similar category of accounts with similar characteristics included in the aforementioned "initial account cluster," the single-operation time corresponding to that similar category account can be determined. This "single-operation time" is the same as the aforementioned "operation duration," which is the time interval between the installation time of the target application on the login device corresponding to each account and the registration time on the installed target application.
[0068] Next, accounts whose single operation time is greater than or equal to a preset operation time threshold among multiple accounts associated with the login device's relevant information can be removed. In other words, accounts with less suspicion of flashing can be directly identified as normal accounts among similar category accounts with similar characteristics included in the aforementioned "initial account cluster".
[0069] Furthermore, accounts whose single-operation time is less than a preset operation time threshold among multiple accounts associated with login device information can be identified as "suspected abnormal accounts." In other words, accounts with similar characteristics within the aforementioned "initial account cluster" that are highly suspected of being involved in firmware flashing can be directly identified as "suspected abnormal accounts." Then, a directed acyclic graph can be constructed based on the registration and installation times of similar accounts with high firmware flashing suspicion within the aforementioned "initial account cluster."
[0070] In this way, by analyzing the single-operation time, we can distinguish, among the accounts with similar characteristics in each initial account cluster, those that are highly suspected of flashing from those with less suspicion. Accounts with less suspicion of flashing can then be directly classified as normal accounts. Furthermore, a directed acyclic graph (DAG) can be constructed from the accounts with similar characteristics that are highly suspected of flashing, further narrowing the scope of abnormal account screening. This facilitates finer-grained screening of abnormal accounts using the constructed DAG.
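The filtering by operation duration can be sketched as follows. This is a minimal illustration that takes the operation duration to be the registration time minus the installation time, as stated above; the function name and the 30-minute threshold are hypothetical, since this disclosure does not fix a value for the operation duration threshold.

```python
from datetime import datetime, timedelta


def split_by_operation_duration(cluster, threshold):
    """Split one initial account cluster by operation duration.

    `cluster` maps an account id to (installation_time, registration_time).
    Accounts whose duration is >= threshold are treated as normal; the rest
    are kept as suspected abnormal accounts.
    """
    suspected, normal = {}, {}
    for account, (installed, registered) in cluster.items():
        duration = registered - installed
        target = normal if duration >= threshold else suspected
        target[account] = (installed, registered)
    return suspected, normal


# Illustrative data only.
day = datetime(2023, 4, 10)
cluster = {
    "a1": (day.replace(hour=2, minute=0), day.replace(hour=2, minute=5)),
    "a2": (day.replace(hour=3, minute=0), day.replace(hour=9, minute=0)),
}
suspected, normal = split_by_operation_duration(cluster, timedelta(minutes=30))
```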
[0071] In step 102, a directed acyclic graph (DAG) can be constructed for each account cluster based on the login information of each account within that account cluster. Specifically, the DAG can be constructed based on the installation and registration times of the similar-category accounts in each initial account cluster that are highly suspected of flashing. Each node in the DAG can correspond to one account.
[0072] According to an exemplary embodiment of this disclosure, the login information of an account may include the start time and end time of the account's operation on the target application.
[0073] Based on the start operation time, the nodes in the directed acyclic graph (DAG) corresponding to each account cluster can be sorted. The earlier the start operation time, the higher the corresponding node can be in the sorted order. Then, for any two nodes in the sorted list, the time interval between the start operation time of the later node and the end operation time of the earlier node can be calculated. If this time interval falls within a preset time range, an edge can be established between the earlier node and the later node, resulting in the DAG corresponding to each account cluster.
[0074] It should be noted that for each of the multiple accounts, a typical data structure of {account:(installation time, registration time)} can be created. Furthermore, multiple nodes can be sorted based on installation time, meaning that similar accounts with a high suspicion of flashing within the same initial account cluster can be sorted by installation time. Specifically, the earlier the installation time, the higher the corresponding node can be in the ranking; the later the installation time, the lower the corresponding node can be in the ranking.
[0075] For any two nodes among the sorted nodes, the time interval between the installation time of the later node and the registration time of the earlier node can be calculated. If this time interval is within a preset time range, an edge can be established between the earlier node and the later node; otherwise, no edge can be established between the earlier and later nodes.
[0076] The aforementioned "preset time range" can be [0h, 3h]. In this case, each account can be used as a node, and an edge can be established between two nodes when 0h <= (installation time of the later node - registration time of the earlier node) <= 3h. This allows for the construction of a directed acyclic graph for similar categories of accounts with a high suspicion of flashing within the aforementioned initial account cluster.
[0077] In step 103, for each directed acyclic graph, multiple sequences contained in the directed acyclic graph can be determined. Each sequence may contain at least two nodes connected in sequence.
[0078] According to an exemplary embodiment of this disclosure, multiple first sequences can be obtained based on the pointing relationships between nodes in a directed acyclic graph. Then, each first sequence can be decomposed to obtain multiple second sequences contained within each first sequence. Next, multiple sequences contained in the directed acyclic graph can be obtained based on the multiple first sequences and the multiple second sequences contained within each first sequence.
[0079] It's important to note that a Depth-First Search (DFS) algorithm can be run on a directed acyclic graph (DAG) to obtain multiple first sequences. In other words, all possible paths can be found by running DFS on this DAG. For example, we can choose any node as a starting point and use a stack to record the nodes along the path. Each time node n-1 is reached, the path recorded in the stack is added to the output list, where n is the number of nodes in the DAG and the nodes are numbered from 0 to n-1. It's worth noting that because the constructed graph is a DAG, the same node will not be visited twice along a single path during the DFS search; therefore, there is no need to check whether the current node has been visited.
[0080] The time complexity is O(n*2^n). In the worst case, each node has an edge to every node with a higher number than itself. In this case, the number of paths is on the order of 2^n, and the length of each path is at most n. Therefore, the total time complexity is O(n*2^n). The space complexity is O(n), where n is the number of nodes in the directed acyclic graph, primarily due to stack space overhead.
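The worst-case bound can be worked out as follows. This is a sketch under the assumption that the nodes are numbered 0 to n-1 (with n >= 2), that every node has an edge to every higher-numbered node, and that paths run from node 0 to node n-1. Each such path is determined by which of the n-2 intermediate nodes it visits, so:

```latex
\#\text{paths} = \sum_{k=0}^{n-2} \binom{n-2}{k} = 2^{\,n-2},
\qquad
\text{total work} \le n \cdot 2^{\,n-2} = O\!\left(n \cdot 2^{n}\right)
```

Each path contains at most n nodes, which gives the factor n when the path is copied from the stack into the output list.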
[0081] For example, suppose there are four similar accounts with a high suspicion of ROM flashing in the aforementioned "initial account cluster," that is, there are four nodes: Node 1: {Zhang San: (2:00, 3:00)}, Node 2: {Li Si: (3:00, 5:00)}, Node 3: {Wang Wu: (4:00, 6:00)}, and Node 4: {Zhao Liu: (8:00, 10:00)}. Using each similar account as a node, and establishing edges between nodes based on 0h <= (installation time of the later node - registration time of the earlier node) <= 3h, a directed acyclic graph corresponding to the four similar accounts with a high suspicion of ROM flashing within the aforementioned "initial account cluster" can be constructed.
[0082] Figure 2 is a schematic diagram illustrating a directed acyclic graph according to an exemplary embodiment of the present disclosure. Referring to Figure 2, the diagram shows four nodes: node 1, node 2, node 3, and node 4. Since nodes installed earlier are listed earlier in the order, the result of sorting these four nodes by installation time is: node 1 - node 2 - node 3 - node 4.
[0083] The installation time of node 2 (3:00) - the registration time of node 1 (3:00) = 0h, and 0h <= 0h <= 3h. Therefore, there is a directed edge between node 1 and node 2, which points from node 1 to node 2.
[0084] The installation time of node 3 (4:00) - the registration time of node 1 (3:00) = 1h, and 0h <= 1h <= 3h. Therefore, there is a directed edge between node 1 and node 3, which points from node 1 to node 3.
[0085] The installation time of node 4 (8:00) - the registration time of node 1 (3:00) = 5h, and 5h is not within the interval [0h, 3h]. Therefore, a directed edge cannot be established between node 1 and node 4.
[0086] The installation time of node 3 (4:00) - the registration time of node 2 (5:00) = -1h, and -1h is not within the interval [0h, 3h]. Therefore, a directed edge cannot be established between node 2 and node 3.
[0087] The installation time of node 4 (8:00) - the registration time of node 2 (5:00) = 3h, and 0h <= 3h <= 3h. Therefore, there is a directed edge between node 2 and node 4, which points from node 2 to node 4.
[0088] The installation time of node 4 (8:00) - the registration time of node 3 (6:00) = 2h, and 0h <= 2h <= 3h. Therefore, there is a directed edge between node 3 and node 4, which points from node 3 to node 4.
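The edge rule and the four-node example above can be reproduced with a short sketch. The helper `hm` and the function `build_dag` are hypothetical names introduced only for illustration; times are represented as offsets from midnight.

```python
from datetime import timedelta


def hm(text):
    """Parse "H:MM" into a timedelta since midnight (illustrative helper)."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def build_dag(nodes, max_gap=timedelta(hours=3)):
    """Build an adjacency list from {node: (installation, registration)}.

    Nodes are sorted by installation time. An edge earlier -> later is added
    when 0h <= installation(later) - registration(earlier) <= max_gap.
    """
    ordered = sorted(nodes, key=lambda n: nodes[n][0])
    edges = {n: [] for n in ordered}
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            gap = nodes[later][0] - nodes[earlier][1]
            if timedelta(0) <= gap <= max_gap:
                edges[earlier].append(later)
    return edges


# The four nodes from the example: (installation time, registration time).
nodes = {
    1: (hm("2:00"), hm("3:00")),
    2: (hm("3:00"), hm("5:00")),
    3: (hm("4:00"), hm("6:00")),
    4: (hm("8:00"), hm("10:00")),
}
dag = build_dag(nodes)
```

Because edges only ever point from a node to one later in the sorted order, the resulting graph cannot contain a cycle.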
[0089] By running DFS on this directed acyclic graph, two paths can be found, which means two first sequences can be found, namely 1->2->4 and 1->3->4.
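The path enumeration can be sketched as below. This is one possible reading, not necessarily the exact procedure of this disclosure: it assumes that a first sequence runs from a node with no incoming edge to a node with no outgoing edge, which reproduces the two paths of the example. The function name `all_paths` is hypothetical.

```python
def all_paths(dag):
    """Enumerate every source-to-sink path of a DAG given as an adjacency list.

    No visited set is needed: in a DAG a node cannot reappear on the same path.
    """
    has_incoming = {child for children in dag.values() for child in children}
    sources = [node for node in dag if node not in has_incoming]
    paths, stack = [], []

    def dfs(node):
        stack.append(node)
        # Record the path when a sink is reached; a sequence needs >= 2 nodes.
        if not dag[node] and len(stack) >= 2:
            paths.append(list(stack))
        for child in dag[node]:
            dfs(child)
        stack.pop()

    for source in sources:
        dfs(source)
    return paths


# Adjacency list of the four-node example.
dag = {1: [2, 3], 2: [4], 3: [4], 4: []}
paths = all_paths(dag)
```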
[0090] Then, for each of the multiple first sequences, specifically for each of the two first sequences 1->2->4 and 1->3->4, we can use dynamic programming with a two-dimensional array to discover the multiple second sequences contained within each first sequence. Next, we can determine that the multiple first sequences and the multiple second sequences contained within each first sequence constitute multiple sequences within a directed acyclic graph.
[0091] Specifically, dp[i][j] can be used to represent the number of ways in which the first i characters of s can form the first j characters of t. When calculating dp[i][j], if s[i-1] is not equal to t[j-1], then dp[i][j] = dp[i-1][j], because the i-th character of s cannot be matched to the j-th position of t, so the count is inherited from the number of ways in which the first i-1 characters of s form the first j characters of t. Otherwise, the i-th character of s can additionally be matched to the j-th character of t, which contributes dp[i-1][j-1] further ways, so dp[i][j] = dp[i-1][j] + dp[i-1][j-1]. The complete state transition equation can be:
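[0092] $$dp[i][j] = \begin{cases} dp[i-1][j] + dp[i-1][j-1], & s[i-1] = t[j-1] \\ dp[i-1][j], & s[i-1] \neq t[j-1] \end{cases}$$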
[0093] The boundary condition is dp[i][0] = 1, which means that not selecting any character is considered a solution.
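The recurrence and boundary condition described above can be sketched as follows. The function name `count_subsequences` is hypothetical. Treating s as a first sequence and t as a candidate second sequence is an assumption made for illustration, since the text does not define s and t.

```python
def count_subsequences(s, t):
    """Count the ways t can be formed as an order-preserving subsequence of s.

    dp[i][j] is the number of ways the first i items of s form the first j
    items of t, with dp[i][0] = 1 (selecting nothing counts as one way).
    """
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = 1
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = dp[i - 1][j]
            if s[i - 1] == t[j - 1]:
                dp[i][j] += dp[i - 1][j - 1]
    return dp[len(s)][len(t)]


# The path 1->2->4 contains 1->4 in order, but not 4->1.
forward = count_subsequences([1, 2, 4], [1, 4])
backward = count_subsequences([1, 2, 4], [4, 1])
```

A count greater than zero indicates that the candidate sequence is contained, in order, in the first sequence.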
[0094] For example, for the first sequence 1->2->4, the mined second sequence can be 1->2; 2->4; 1->4; for the first sequence 1->3->4, the mined second sequence can be 1->3; 3->4; 1->4.
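The second sequences in this example can also be listed directly. The sketch below is a simplification that uses `itertools.combinations` in place of the dynamic programming approach; it assumes that a second sequence is any order-preserving selection of at least two nodes that is shorter than the first sequence. The function name `second_sequences` is hypothetical.

```python
from itertools import combinations


def second_sequences(first_sequence):
    """List all order-preserving sub-sequences with 2..len-1 nodes."""
    result = []
    for length in range(2, len(first_sequence)):
        for combo in combinations(first_sequence, length):
            result.append(list(combo))
    return result


subs_a = second_sequences([1, 2, 4])
subs_b = second_sequences([1, 3, 4])
```

`combinations` emits items in the order they appear in the input, so the node order of the first sequence is preserved in every result.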
[0095] In step 104, based on the sequence information or node behavior information of each sequence in multiple sequences, abnormal sequences can be identified, and the accounts corresponding to the nodes contained in the abnormal sequences can be identified as abnormal accounts. That is, risk assessment can be performed on all sequences, specifically by fusing the attributes of all nodes in each sequence and extracting risk information codes (infocode). Here, "node behavior information" can include "login duration of the node" and "click behavior trajectory information within the target application"; "sequence information" can include "sequence length" and "node connectivity". For example, the "Isolation Forest algorithm" can be used to identify abnormal sequences.
[0096] It should be noted that when users fraudulently obtain resources on application platforms, they will repeatedly restore their devices to factory settings to register new accounts. Therefore, the overall cheating process should be: "Restore factory settings -> Install the target application -> Register a new account on the target application -> Log in to the target application using the newly registered account -> Log out of the target application" — "Restore factory settings -> Install the target application -> Register a new account on the target application -> Log in to the target application using the newly registered account -> Log out of the target application" — ...
[0097] Cheating users register new accounts by repeatedly flashing their devices, thus repeatedly defrauding the application platform of commissions. Furthermore, to maximize profits, cheating users flash their devices in very short cycles, resulting in excessively long sequences. That is, the sequences contain a large number of nodes, and the connections between any two adjacent nodes appear very tight. Based on these characteristics, abnormal sequences can be quickly identified.
[0098] According to an exemplary embodiment of this disclosure, the aforementioned "node behavior information" may include the login duration corresponding to the node. For each of the multiple sequences, the average login duration corresponding to the nodes included in that sequence can be determined. If the average login duration is less than a preset duration threshold, the sequence can be determined to be an abnormal sequence.
[0099] For example, the first sequence 1->3->4 contains three nodes: node 1, node 3, and node 4. Therefore, the average login duration corresponding to the three nodes in the first sequence 1->3->4 can be determined. Here, "login duration" refers to the time elapsed from when the account logs into the target application until when it logs out of the target application.
[0100] If the average of at least two login durations is less than a preset duration threshold, the sequence can be determined as an abnormal sequence. The "preset duration threshold" can be set according to actual needs. For example, it can be determined based on the login duration of normal accounts, such as 30 minutes.
[0101] As mentioned earlier, because the cycle of fraudulent users flashing their devices is very short, the login duration of each new account registered through this method will be relatively short, generally much shorter than that of a normal account. Therefore, if the average of at least two login durations is less than a preset duration threshold—that is, if the average of at least two login durations is less than the login duration of a normal account—it is highly likely that the sequence is an abnormal sequence. In other words, it is highly likely that the nodes contained in the sequence are abnormal accounts used to fraudulently obtain commissions from the application platform, effectively ensuring the accuracy of identifying abnormal accounts.
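The login-duration check can be sketched as follows. The function name is hypothetical, and the default of 30 minutes is the example threshold given above.

```python
from datetime import timedelta


def is_abnormal_by_login_duration(login_durations,
                                  threshold=timedelta(minutes=30)):
    """Flag a sequence whose average node login duration is below threshold.

    `login_durations` holds one timedelta per node of the sequence.
    """
    if len(login_durations) < 2:
        raise ValueError("a sequence contains at least two nodes")
    average = sum(login_durations, timedelta(0)) / len(login_durations)
    return average < threshold


# Illustrative data only.
short_logins = [timedelta(minutes=2), timedelta(minutes=3), timedelta(minutes=4)]
long_logins = [timedelta(hours=1), timedelta(hours=2)]
flag_short = is_abnormal_by_login_duration(short_logins)
flag_long = is_abnormal_by_login_duration(long_logins)
```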
[0102] According to an exemplary embodiment of this disclosure, the aforementioned "node behavior information" may further include click behavior trajectory information within the target application. For each of the plurality of sequences, click behavior trajectory information of each of the at least two nodes contained in the sequence within the target application can be determined. If the number of overlapping click behavior trajectory information in the at least two click behavior trajectory information corresponding to the sequence is greater than or equal to a preset number threshold, the sequence can be determined to be an abnormal sequence.
[0103] It should be noted that a user's click behavior within the target application is equivalent to accessing the target application's API. For cheating users who register new accounts by repeatedly flashing their devices, their click behavior after each login to the target application is essentially the same.
[0104] For example, the typical click behavior pattern of a cheating user is: "Entering an invitation code to prove their identity as a user who recruited them to become a new account in the target application, used to pay commissions to users who actively recruited them -> Randomly clicking on a few tabs in the target application -> Logging out of the target application." For instance, if the target application is a video application, the typical click behavior pattern of a cheating user is: "Entering an invitation code to prove their identity as a user who recruited them to become a new account in the video application, used to pay commissions to users who actively recruited them -> Randomly clicking on a few tabs in the video application's 'Local Area Page,' 'Discover Page,' and 'Following Page' -> Logging out of the video application."
[0105] For normal users, due to differences in individual interests and hobbies, the click behavior patterns of different normal accounts are generally different. Therefore, it is possible to determine whether a sequence is abnormal based on the click behavior pattern information of each node in at least two nodes within the target application.
[0106] If the number of overlapping click behavior trajectories in at least two corresponding click behavior trajectories of a sequence is greater than or equal to a preset threshold, the sequence can be identified as an abnormal sequence. Thus, when the overlap between at least two corresponding click behavior trajectories of a sequence is high, it indicates that the click order and click area of each node within the target application are essentially the same. In this case, it is highly likely that the sequence is an abnormal sequence, meaning it is highly likely that the nodes contained in the sequence are abnormal accounts used to fraudulently obtain commissions from the application platform, effectively ensuring the accuracy of identifying abnormal accounts.
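The click-trajectory check can be sketched as below. The text does not define how overlapping trajectories are counted, so this sketch makes an explicit assumption: a trajectory counts as overlapping when it is identical, step for step, to the trajectory of at least one other node in the same sequence. The function names, the click labels, and the threshold of 2 are all hypothetical.

```python
from collections import Counter


def overlapping_trajectory_count(trajectories):
    """Count trajectories that exactly match at least one other trajectory."""
    counts = Counter(tuple(trajectory) for trajectory in trajectories)
    return sum(count for count in counts.values() if count >= 2)


def is_abnormal_by_clicks(trajectories, min_overlaps=2):
    """Flag a sequence when enough node trajectories coincide."""
    return overlapping_trajectory_count(trajectories) >= min_overlaps


# Illustrative data only: one click trajectory per node of a sequence.
cheat = ["invite_code", "local_page", "discover_page", "logout"]
suspicious = [cheat, list(cheat), ["search", "profile", "logout"]]
ordinary = [["search", "logout"], ["discover_page", "follow"], ["profile"]]
flag_suspicious = is_abnormal_by_clicks(suspicious)
flag_ordinary = is_abnormal_by_clicks(ordinary)
```

A looser similarity measure could replace exact matching if trajectories vary slightly between logins.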
[0107] According to exemplary embodiments of this disclosure, the aforementioned "sequence information" may include sequence length. For each of a plurality of sequences, the sequence length can be determined. If the sequence length is greater than or equal to a preset length threshold, the sequence can be determined to be an abnormal sequence.
[0108] It should be noted that when the sequence information is the sequence length, for each of multiple sequences, the sequence length can be determined, which in turn determines the number of nodes contained in the sequence. If the sequence length is greater than or equal to a preset length threshold, that is, if the number of nodes contained in the sequence is large, the sequence can be determined as an abnormal sequence.
[0109] As mentioned earlier, to maximize profits, cheating users flash their devices in very short cycles, resulting in excessively long sequences containing a large number of nodes. Therefore, if the sequence length is greater than or equal to a preset length threshold, it is likely the result of multiple device resets. In this case, it is highly probable that the sequence is abnormal, meaning the nodes within it are likely abnormal accounts used to defraud the application platform of commissions, which effectively ensures the accuracy of identifying abnormal accounts.
[0110] According to an exemplary embodiment of this disclosure, the login information of an account may include the start time and end time of the account's operation on the target application, and the aforementioned "sequence information" may also include node connectivity.
[0111] For each of a plurality of sequences, the node connectivity between any two adjacent nodes among the at least two nodes contained in the sequence can be determined. The node connectivity can be the time difference between the start time of the later node and the end time of the earlier node among any two adjacent nodes. If the node connectivity between any two adjacent nodes among the at least two nodes contained in the sequence is less than a preset connectivity threshold, the sequence can be determined as an abnormal sequence.
[0112] It should be noted that the lower the node connectivity between any two adjacent nodes in a sequence, the higher the risk that the sequence is an anomalous sequence, that is, the greater the probability that the sequence is an anomalous sequence.
[0113] As mentioned earlier, to maximize profits, cheating users flash their devices in very short cycles. This results in a very tight connection between any two adjacent nodes in the sequence. Therefore, the smaller the node connectivity value between any two adjacent nodes in a sequence, the higher the risk that the sequence is an anomalous sequence, meaning the greater the probability that the sequence is abnormal, which can effectively ensure the accuracy of identifying anomalous accounts.
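The node connectivity check can be sketched as follows. It assumes that "any two adjacent nodes" means every adjacent pair in the sequence must fall below the threshold. The function names and threshold values are hypothetical; the times reuse the four-node example, with the start time taken as the installation time and the end time as the registration time.

```python
from datetime import timedelta


def node_connectivities(sequence, times):
    """Return start(later) - end(earlier) for each adjacent pair of nodes.

    `times` maps a node to (start_time, end_time) of its operation.
    """
    return [times[later][0] - times[earlier][1]
            for earlier, later in zip(sequence, sequence[1:])]


def is_abnormal_by_connectivity(sequence, times, threshold):
    """Flag a sequence when every adjacent gap is below the threshold."""
    return all(gap < threshold for gap in node_connectivities(sequence, times))


def hours(value):
    return timedelta(hours=value)


times = {1: (hours(2), hours(3)), 3: (hours(4), hours(6)), 4: (hours(8), hours(10))}
gaps = node_connectivities([1, 3, 4], times)
flag_loose = is_abnormal_by_connectivity([1, 3, 4], times, hours(3))
flag_strict = is_abnormal_by_connectivity([1, 3, 4], times, hours(2))
```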
[0114] It should be noted that the hyperparameters throughout the experiment can be further adjusted by verifying the account retention rates in these anomalous sequences. For example, if an account is identified as an anomalous account, and that account has not logged into the target application since its initial login, then the parameter values of the various strategies in this disclosure are considered reasonable and require no adjustment. However, if an account is identified as an anomalous account, and that account has logged into the target application multiple times after its initial login, it indicates that the account is actually a normal account, meaning a "false positive" has occurred. In this case, the parameter values of the various strategies in this disclosure are considered unreasonable and need to be adjusted to improve the accuracy of identifying anomalous accounts.
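The retention-based check can be sketched as a simple ratio. This is an illustration only: it approximates "logged in again after the initial login" by a total login count greater than one, and the function name and data are hypothetical.

```python
def false_positive_rate(flagged_accounts, login_counts):
    """Share of flagged accounts that logged in again after the first login.

    `login_counts` maps an account id to its total number of logins.
    """
    if not flagged_accounts:
        return 0.0
    returned = [a for a in flagged_accounts if login_counts.get(a, 0) > 1]
    return len(returned) / len(flagged_accounts)


# Illustrative data only.
flagged = ["a1", "a2", "a3", "a4"]
login_counts = {"a1": 1, "a2": 1, "a3": 5, "a4": 1}
rate = false_positive_rate(flagged, login_counts)
```

A high rate would suggest that thresholds such as the preset time range or the operation duration threshold should be adjusted.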
[0115] Figure 3 is a flowchart illustrating a specific implementation of an abnormal account identification method according to an exemplary embodiment of the present disclosure.
[0116] Referring to Figure 3, in step 301, among the multiple accounts of the target application, multiple accounts that contain the same media combination type during the first login to the target application are obtained. The "media combination type" may include "the first three segments of the IP address of the network used by the login device", "the App installed on the login device", and "the installation time of the target application on the login device".
[0117] At this point, a coarse-grained screening can be performed across multiple accounts using weak media combinations. This allows for a rough selection of accounts in similar categories, forming an "account cluster." These roughly selected similar accounts can be considered to be potentially fraudulent in their attempts to obtain resources.
[0118] In step 302, among multiple accounts that contain the same media combination type during the initial login to the target application, accounts whose single operation time is greater than or equal to a preset operation time threshold and accounts whose single operation time is less than the preset operation time threshold are identified.
[0119] The "single operation time" refers to the time interval between the installation time of the target application on the login device corresponding to the account and the registration time on the installed target application. This allows us to identify accounts with a high suspicion of firmware flashing and those with a low suspicion of firmware flashing within the aforementioned "initial account cluster".
[0120] In step 303, among multiple accounts that share the same media combination type during the initial login to the target application, accounts whose single operation time is greater than or equal to a preset operation time threshold are identified as normal accounts. In other words, accounts with less suspicion of flashing within the aforementioned "initial account cluster" can be directly identified as normal accounts.
[0121] In step 304, a directed acyclic graph is constructed based on the registration time and installation time of accounts whose single operation time is less than a preset operation time threshold among multiple accounts containing the same media combination type during the first login to the target application. In other words, a directed acyclic graph can be constructed based on the registration time and installation time of similar category accounts with a high suspicion of flashing among similar category accounts included in the aforementioned "initial account cluster".
[0122] In step 305, the depth-first search algorithm (DFS) is run on the directed acyclic graph to obtain multiple first sequences, that is, all possible paths can be found by running DFS on the directed acyclic graph.
[0123] In step 306, for each of the multiple first sequences, multiple second sequences contained in each first sequence are mined using a two-dimensional array dynamic programming approach.
[0124] In step 307, for each first sequence or each second sequence, based on the sequence information or node behavior information of each sequence, it is determined whether the sequence is an abnormal sequence, and the nodes contained in the abnormal sequence can be identified as abnormal accounts.
[0125] For example, risk assessment can be performed on all sequences by fusing attributes of all nodes in each sequence and extracting risk infocodes. "Node behavior information" can include "login duration" and "click behavior trajectory information within the target application"; "sequence information" can include "sequence length" and "node connectivity". For example, the "Isolation Forest algorithm" can be used to identify abnormal sequences.
[0126] Figure 4 is a block diagram illustrating an abnormal account identification device according to an exemplary embodiment of the present disclosure.
[0127] Referring to Figure 4, the device 400 may include a clustering processing module 401, a directed acyclic graph construction module 402, a multiple sequence determination module 403, and an abnormal sequence determination module 404.
[0128] The clustering module 401 can cluster multiple accounts based on their login information for the target application, resulting in multiple account clusters. The accounts included in each "account cluster" are those that may be abnormal.
[0129] It should be noted that if an account is an abnormal account used to fraudulently obtain resources, it will quickly log out of the target application after its first login and will not log in again subsequently; while a normal account may log in to the target application multiple times. Therefore, this disclosure requires identifying abnormal accounts used to fraudulently obtain resources based on the relevant information of each account when it first logs into the target application.
[0130] The "account login information" can include partial information about the IP address of the network used by the login device, the apps installed on the login device, the start time of the account's operation on the target application, and the end time of the account's operation on the target application, etc. Furthermore, filtering abnormal accounts based on login information is a coarse-grained method; therefore, the accounts included in the "account cluster" here should be "suspected abnormal" accounts. Additionally, the "start time of the account's operation on the target application" can be the installation time of the target application on the login device, and the "end time of the account's operation on the target application" can be the registration time of the account on the target application.
[0131] According to an exemplary embodiment of this disclosure, as described above, the login information of an account may include information about the login device used by the account, the start time of the account's operation on the target application, and the end time of the account's operation on the target application.
[0132] The clustering processing module 401 can cluster multiple accounts that are associated with the relevant information of the login device in multiple accounts to obtain multiple initial account clusters, that is, to obtain the initial account clusters composed of accounts that contain the same media combination type during the first login to the target application.
[0133] Typically, criminals will quickly log out of their accounts after participating in activities aimed at increasing the number of users on an application platform, and then switch to another account to participate in the same activities. Therefore, devices controlled by the same group of criminals will exhibit a near-repeating phenomenon in both time and space. In this case, a coarse-grained screening can be performed across multiple accounts using information about the logged-in devices. This involves using weak media combinations to roughly filter out similar accounts. These roughly filtered similar accounts can be considered to be potentially fraudulent in their attempts to obtain resources.
[0134] Then, the clustering module 401 can determine the operation duration for each account based on the start and end operation times of each account to the target application. The operation duration represents the time difference between the start and end operation times of an account to the target application. Next, the clustering module 401 can filter the accounts in each initial account cluster based on the operation duration, resulting in multiple account clusters.
[0135] According to an exemplary embodiment of this disclosure, for each initial account cluster, the clustering processing module 401 can remove accounts whose operation time is greater than or equal to an operation time threshold from the accounts included in each initial account cluster. Then, the clustering processing module 401 can obtain multiple account clusters based on the remaining accounts in each initial account cluster, that is, it can obtain multiple account clusters based on the remaining accounts in each initial account cluster whose corresponding operation time is less than the operation time threshold.
[0136] It should be noted that for each of the multiple accounts associated with the login device's information, the single-operation time corresponding to that account can be determined. Specifically, for each similar category of accounts with similar characteristics included in the aforementioned "initial account cluster," the single-operation time corresponding to that similar category account can be determined. This "single-operation time" is the same as the aforementioned "operation duration," which is the time interval between the installation time of the target application on the login device corresponding to each account and the registration time on the installed target application.
[0137] Next, among the multiple accounts associated through the login device information, the accounts whose single-operation time is greater than or equal to a preset operation time threshold can be removed. In other words, among the similar accounts included in the aforementioned "initial account cluster", those less suspected of device flashing can be directly identified as normal accounts.
[0138] Furthermore, among the multiple accounts associated through the login device information, the accounts whose single-operation time is less than the preset operation time threshold can be identified as "suspected abnormal accounts". In other words, the similar accounts within the aforementioned "initial account cluster" that are highly suspected of device flashing can be directly identified as "suspected abnormal accounts". A directed acyclic graph can then be constructed based on the installation and registration times of these highly suspected accounts.
[0139] In this way, by analyzing the single-operation time, the accounts with similar characteristics in each initial account cluster can be divided into those highly suspected of device flashing and those less suspected of it. Accounts less suspected of device flashing can be directly classified as normal accounts, while a directed acyclic graph (DAG) can be constructed from the installation and registration times of the accounts highly suspected of device flashing. This further narrows the scope of abnormal account screening and facilitates finer-grained screening of abnormal accounts using the constructed DAG.
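As a non-limiting illustration, the filtering by single-operation time described above might be sketched as follows. The function name `split_cluster`, the dictionary layout, and the 10-minute threshold are assumptions made for this example only; the disclosure states merely that the threshold is preset.

```python
from datetime import datetime, timedelta

# Hypothetical threshold; the disclosure only says it is "preset".
OPERATION_DURATION_THRESHOLD = timedelta(minutes=10)


def split_cluster(cluster, threshold=OPERATION_DURATION_THRESHOLD):
    """Split one initial account cluster by single-operation time.

    `cluster` maps an account id to (installation time, registration time).
    Accounts whose duration is below the threshold are kept as suspected
    abnormal accounts; the others are treated as normal accounts.
    """
    suspected, normal = {}, {}
    for account, (installed, registered) in cluster.items():
        duration = registered - installed
        if duration < threshold:
            suspected[account] = (installed, registered)
        else:
            normal[account] = (installed, registered)
    return suspected, normal


# Example: u1 registers 3 minutes after installation, u2 after 2 hours.
example_cluster = {
    "u1": (datetime(2023, 4, 10, 9, 0), datetime(2023, 4, 10, 9, 3)),
    "u2": (datetime(2023, 4, 10, 9, 0), datetime(2023, 4, 10, 11, 0)),
}
suspected_accounts, normal_accounts = split_cluster(example_cluster)
```

Only the `suspected` part of each cluster would then be passed on to graph construction.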
[0140] The directed acyclic graph (DAG) construction module 402 can construct a DAG for each account cluster based on the login information of each account within that account cluster. Specifically, it can construct the DAG for each account cluster based on the installation and registration times of the similar accounts within each initial account cluster that are highly suspected of device flashing. Each node in the DAG can correspond to one account.
[0141] According to an exemplary embodiment of this disclosure, the login information of an account may include the start time and end time of the account's operation on the target application.
[0142] The directed acyclic graph (DAG) construction module 402 can sort the multiple nodes contained in the DAG corresponding to each account cluster based on the start operation time. The earlier the start operation time, the higher the corresponding node can be in the sorted order. Then, for any two nodes among the sorted nodes, the time interval between the start operation time of the later node and the end operation time of the earlier node can be calculated. If this time interval is within a preset time interval, an edge can be established between the earlier node and the later node, thus obtaining the DAG corresponding to each account cluster.
[0143] It should be noted that for each of the multiple accounts, a typical data structure of {account:(installation time, registration time)} can be created. Furthermore, multiple nodes can be sorted based on installation time, meaning that similar accounts with a high suspicion of flashing within the same initial account cluster can be sorted by installation time. Specifically, the earlier the installation time, the higher the corresponding node can be in the ranking; the later the installation time, the lower the corresponding node can be in the ranking.
[0144] For any two nodes among the sorted nodes, the directed acyclic graph construction module 402 can calculate the time interval between the installation time of the later node and the registration time of the earlier node. If this time interval is within a preset time interval, the directed acyclic graph construction module 402 can establish an edge between the earlier node and the later node; otherwise, it may not establish an edge between the earlier and later nodes.
[0145] The aforementioned "preset time interval" can be [0h, 3h]. In this case, each account can be used as a node, and edges can be established between nodes based on 0h <= (installation time of the later node - registration time of the earlier node) <= 3h. This allows for the construction of a directed acyclic graph for similar categories of accounts with a high suspicion of flashing within the aforementioned initial account cluster.
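As a non-limiting illustration, the sorting and edge rule described above might be sketched as follows. The function name `build_dag` and the adjacency-dictionary representation are assumptions of this example; the [0h, 3h] interval is the one mentioned above.

```python
from datetime import datetime, timedelta


def build_dag(accounts, min_gap=timedelta(0), max_gap=timedelta(hours=3)):
    """Build a DAG from {account: (installation time, registration time)}.

    Nodes are sorted by installation time. An edge earlier -> later is added
    when min_gap <= (installation time of the later node
                     - registration time of the earlier node) <= max_gap.
    Edges only ever point forward in the sorted order, so no cycle can form.
    """
    ordered = sorted(accounts, key=lambda a: accounts[a][0])
    edges = {account: [] for account in ordered}
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            gap = accounts[later][0] - accounts[earlier][1]
            if min_gap <= gap <= max_gap:
                edges[earlier].append(later)
    return ordered, edges


def at(hour, minute):
    # Helper for the example data only.
    return datetime(2023, 4, 10, hour, minute)


example_accounts = {
    "a": (at(9, 0), at(9, 5)),
    "b": (at(9, 30), at(9, 34)),
    "c": (at(12, 20), at(12, 25)),
    "d": (at(20, 0), at(20, 2)),
}
order, edges = build_dag(example_accounts)
# edges == {"a": ["b"], "b": ["c"], "c": [], "d": []}
```

In this example, account "d" is installed more than 3 hours after every other account is registered, so it stays isolated and cannot appear in any sequence.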
[0146] For each directed acyclic graph, the multiple sequence determination module 403 can determine multiple sequences contained in the directed acyclic graph. Each sequence can contain at least two nodes connected in sequence.
[0147] According to an exemplary embodiment of this disclosure, the multiple sequence determination module 403 can obtain multiple first sequences based on the pointing relationships between nodes in a directed acyclic graph. Then, the multiple sequence determination module 403 can decompose each first sequence to obtain multiple second sequences contained within each first sequence. Next, the multiple sequence determination module 403 can obtain multiple sequences contained in the directed acyclic graph based on the multiple first sequences and the multiple second sequences contained within each first sequence.
[0148] It should be noted that the multiple sequence determination module 403 can run a depth-first search (DFS) algorithm on the directed acyclic graph (DAG) to obtain multiple first sequences. In other words, all possible paths can be found by running DFS on this DAG. For example, one can choose any node as the starting point and use a stack to record the points on the path. Each time a node n-1 is reached, the path recorded in the stack is added to the output list, where n is the number of nodes in the DAG. It should be noted that because the constructed graph is a DAG, the same node will not be repeatedly traversed during the DFS search; therefore, it is not necessary to check whether the current node has been traversed.
[0149] The time complexity is O(n × 2^n). In the worst case, every node has an edge to every node with a higher number than itself. The number of paths is then on the order of 2^n, and the length of each path is at most n, so the total time complexity is O(n × 2^n). The space complexity is O(n), primarily due to stack space overhead, where n is the number of nodes in the directed acyclic graph.
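As a non-limiting illustration, the depth-first search described above might be sketched as follows. The sketch assumes that the nodes are numbered 0 to n-1 in sorted order, that the graph is given as an adjacency list, and that the search starts at node 0 and records a path each time node n-1 is reached; the function name `all_paths` is hypothetical.

```python
def all_paths(graph, start=0):
    """Return every path from `start` to node n-1 in a DAG.

    `graph[i]` lists the nodes that node i points to. Because the graph is
    acyclic, a node cannot reappear on the current path, so no visited-set
    check is needed.
    """
    target = len(graph) - 1
    paths = []
    stack = [start]  # nodes on the path currently being explored

    def dfs(node):
        if node == target:
            paths.append(list(stack))  # copy the recorded path
            return
        for successor in graph[node]:
            stack.append(successor)
            dfs(successor)
            stack.pop()

    dfs(start)
    return paths


example_graph = [[1, 2], [3], [3], []]
first_sequences = all_paths(example_graph)
# first_sequences == [[0, 1, 3], [0, 2, 3]]
```

For long chains an iterative version with an explicit stack may be preferable, since the recursive form is bounded by the interpreter's recursion limit.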
[0150] Then, for each of the multiple first sequences, the multiple sequence determination module 403 can mine the multiple second sequences contained in each first sequence using a two-dimensional array dynamic programming approach. Next, the multiple sequence determination module 403 can determine that the multiple first sequences and the multiple second sequences contained in each first sequence are multiple sequences contained in a directed acyclic graph.
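The disclosure does not spell out how a first sequence is decomposed. As one possible reading only, the second sequences might be taken to be the contiguous sub-paths (of at least two nodes) of a first sequence, filled into a two-dimensional table indexed by start and end position. The function name `second_sequences` and this interpretation are assumptions of the example.

```python
def second_sequences(first_sequence):
    """Enumerate contiguous sub-paths of a first sequence (assumed reading).

    table[i][j] holds the sub-path from position i to position j and is
    built from table[i][j - 1] by appending one node. Sub-paths with at
    least two nodes, other than the full sequence itself, are returned.
    """
    n = len(first_sequence)
    table = [[None] * n for _ in range(n)]
    result = []
    for i in range(n):
        table[i][i] = [first_sequence[i]]
        for j in range(i + 1, n):
            table[i][j] = table[i][j - 1] + [first_sequence[j]]
            if j - i + 1 < n:  # skip the first sequence itself
                result.append(table[i][j])
    return result


sub_paths = second_sequences([1, 3, 4])
# sub_paths == [[1, 3], [3, 4]]
```

Because every pair of consecutive nodes in a first sequence is already joined by an edge, each such sub-path is itself a valid sequence in the directed acyclic graph.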
[0151] The abnormal sequence determination module 404 can identify abnormal sequences among multiple sequences based on the sequence information or node behavior information of each sequence, and can identify the accounts corresponding to the nodes contained in the abnormal sequence as abnormal accounts. That is, it can perform risk assessment on all sequences, specifically by fusing the attributes of all nodes in each sequence and extracting risk information codes (infocode). Among these, "node behavior information" can include "login duration of the node" and "click behavior trajectory information within the target application"; "sequence information" can include "sequence length" and "node connectivity". For example, the "Isolation Forest algorithm" can be used to identify abnormal sequences.
[0152] According to an exemplary embodiment of this disclosure, the aforementioned "node behavior information" may include the login duration corresponding to the node. For each of the multiple sequences, the abnormal sequence determination module 404 can determine the average login duration corresponding to the nodes included in the sequence. If the average login duration is less than a preset duration threshold, the abnormal sequence determination module 404 can determine that the sequence is an abnormal sequence.
[0153] For example, the first sequence 1->3->4 contains three nodes: node 1, node 3, and node 4. Therefore, the average login duration corresponding to the three nodes in the first sequence 1->3->4 can be determined. Here, "login duration" refers to the time elapsed from when the account logs into the target application until when it logs out of the target application.
[0154] If the average of at least two login durations is less than a preset duration threshold, the sequence can be determined as an abnormal sequence. The "preset duration threshold" can be set according to actual needs. For example, it can be determined based on the login duration of normal accounts, such as 30 minutes.
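As a non-limiting illustration, the average-login-duration check might be sketched as follows. The function name `is_short_lived` and the dictionary of per-node login durations are assumptions of this example; the 30-minute default mirrors the example threshold mentioned above.

```python
from datetime import timedelta


def is_short_lived(sequence, login_durations, threshold=timedelta(minutes=30)):
    """Return True if the average login duration of the nodes is below threshold.

    `login_durations` maps a node to the time between its login to and
    logout from the target application.
    """
    durations = [login_durations[node] for node in sequence]
    average = sum(durations, timedelta(0)) / len(durations)
    return average < threshold


example_durations = {
    1: timedelta(minutes=4),
    3: timedelta(minutes=6),
    4: timedelta(minutes=5),
}
flagged = is_short_lived([1, 3, 4], example_durations)  # average is 5 minutes
```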
[0155] As mentioned earlier, because the cycle of fraudulent users flashing their devices is very short, the login duration of each new account registered through this method will be relatively short, generally much shorter than that of a normal account. Therefore, if the average of at least two login durations is less than a preset duration threshold—that is, if the average of at least two login durations is less than the login duration of a normal account—it is highly likely that the sequence is an abnormal sequence. In other words, it is highly likely that the nodes contained in the sequence are abnormal accounts used to fraudulently obtain commissions from the application platform, effectively ensuring the accuracy of identifying abnormal accounts.
[0156] According to an exemplary embodiment of this disclosure, the aforementioned "node behavior information" may further include click behavior trajectory information within the target application. For each of the plurality of sequences, the abnormal sequence determination module 404 may determine the click behavior trajectory information within the target application for each of the at least two nodes contained in the sequence. If the number of overlapping click behavior trajectory information in the at least two click behavior trajectory information corresponding to the sequence is greater than or equal to a preset number threshold, the abnormal sequence determination module 404 may determine that the sequence is an abnormal sequence.
[0157] It should be noted that a user's click behavior within the target application is equivalent to accessing the target application's API. For cheating users who register new accounts by repeatedly flashing their devices, their click behavior after each login to the target application is essentially the same.
[0158] For example, the typical click behavior pattern of a cheating user is: "Entering an invitation code (which identifies the user who recruited the new account to the target application, so that the commission can be paid to that recruiting user) -> Randomly clicking on a few tabs in the target application -> Logging out of the target application." For instance, if the target application is a video application, the typical click behavior pattern of a cheating user is: "Entering an invitation code (which identifies the user who recruited the new account to the video application, so that the commission can be paid to that recruiting user) -> Randomly clicking on a few of the video application's tabs, such as the 'Local Area Page,' 'Discover Page,' and 'Following Page' -> Logging out of the video application."
[0159] For normal users, due to differences in individual interests and hobbies, the click behavior patterns of different normal accounts are generally different. Therefore, it is possible to determine whether a sequence is abnormal based on the click behavior pattern information of each node in at least two nodes within the target application.
[0160] If the number of overlapping click behavior trajectories in at least two corresponding click behavior trajectories of a sequence is greater than or equal to a preset threshold, the sequence can be identified as an abnormal sequence. Thus, when the overlap between at least two corresponding click behavior trajectories of a sequence is high, it indicates that the click order and click area of each node within the target application are essentially the same. In this case, it is highly likely that the sequence is an abnormal sequence, meaning it is highly likely that the nodes contained in the sequence are abnormal accounts used to fraudulently obtain commissions from the application platform, effectively ensuring the accuracy of identifying abnormal accounts.
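As a non-limiting illustration, the overlap check might be sketched as follows. The example assumes that each click behavior trajectory is recorded as an ordered list of page or API names and that "overlapping" means identical trajectories; the function name `has_overlapping_trajectories`, the page names, and the default threshold of 2 are hypothetical.

```python
from collections import Counter


def has_overlapping_trajectories(sequence, trajectories, min_overlap=2):
    """Return True if at least `min_overlap` nodes share the same trajectory.

    `trajectories` maps a node to the ordered list of pages it clicked.
    """
    counts = Counter(tuple(trajectories[node]) for node in sequence)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count >= min_overlap


example_trajectories = {
    1: ["invite_code", "local", "discover", "logout"],
    3: ["invite_code", "local", "discover", "logout"],
    4: ["invite_code", "following", "logout"],
}
overlapping = has_overlapping_trajectories([1, 3, 4], example_trajectories)
```

A looser similarity measure (for example, an edit distance between trajectories) could replace exact matching if near-identical trajectories should also count.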
[0161] According to an exemplary embodiment of this disclosure, the aforementioned "sequence information" may include the sequence length. For each of the plurality of sequences, the abnormal sequence determination module 404 may determine the sequence length of that sequence. If the sequence length of the sequence is greater than or equal to a preset length threshold, the abnormal sequence determination module 404 may determine that the sequence is an abnormal sequence.
[0162] It should be noted that when the sequence information is the sequence length, for each of multiple sequences, the sequence length can be determined, which in turn determines the number of nodes contained in the sequence. If the sequence length is greater than or equal to a preset length threshold, that is, if the number of nodes contained in the sequence is large, the sequence can be determined as an abnormal sequence.
[0163] As mentioned earlier, to maximize profits, cheating users keep their device-flashing cycles very short, which results in long sequences containing a large number of nodes. Therefore, if the sequence length is greater than or equal to a preset length threshold, the sequence is likely the result of multiple device resets. In this case, it is highly probable that the sequence is abnormal, meaning that the nodes within it are likely abnormal accounts used to defraud the application platform of commissions, which effectively ensures the accuracy of identifying abnormal accounts.
[0164] According to an exemplary embodiment of this disclosure, the login information of an account may include the start time and end time of the account's operation on the target application, and the aforementioned "sequence information" may also include node connectivity.
[0165] For each of the multiple sequences, the abnormal sequence determination module 404 can determine the node connectivity between any two adjacent nodes among the at least two nodes contained in the sequence. The node connectivity can be the time difference between the start time of the later node and the end time of the earlier node among any two adjacent nodes. If the node connectivity between any two adjacent nodes among the at least two nodes contained in the sequence is less than a preset connectivity threshold, the abnormal sequence determination module 404 can determine that the sequence is an abnormal sequence.
[0166] It should be noted that the lower the node connectivity between any two adjacent nodes in a sequence, the higher the risk that the sequence is an anomalous sequence, that is, the greater the probability that the sequence is an anomalous sequence.
[0167] As mentioned earlier, to maximize profits, cheating users keep their device-flashing cycles very short. As a result, any two adjacent nodes in the sequence follow each other very closely in time. Therefore, the smaller the node connectivity (that is, the time difference) between any two adjacent nodes in a sequence, the higher the risk, and hence the greater the probability, that the sequence is an anomalous sequence, which can effectively ensure the accuracy of identifying anomalous accounts.
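As a non-limiting illustration, the node connectivity check might be sketched as follows. The example reads "any two adjacent nodes" as "every adjacent pair"; if the intended reading is "at least one adjacent pair", `all` would be replaced by `any`. The function names and the 30-minute threshold are assumptions of this example.

```python
from datetime import datetime, timedelta


def node_connectivity(sequence, times):
    """Time gaps between adjacent nodes of a sequence.

    `times` maps a node to (start operation time, end operation time).
    Each gap is the start time of the later node minus the end time of
    the earlier node.
    """
    return [times[b][0] - times[a][1] for a, b in zip(sequence, sequence[1:])]


def is_tightly_chained(sequence, times, threshold=timedelta(minutes=30)):
    """Return True if every adjacent gap is below the connectivity threshold."""
    return all(gap < threshold for gap in node_connectivity(sequence, times))


example_times = {
    1: (datetime(2023, 4, 10, 9, 0), datetime(2023, 4, 10, 9, 5)),
    3: (datetime(2023, 4, 10, 9, 10), datetime(2023, 4, 10, 9, 14)),
    4: (datetime(2023, 4, 10, 9, 20), datetime(2023, 4, 10, 9, 22)),
}
gaps = node_connectivity([1, 3, 4], example_times)  # 5 minutes, then 6 minutes
tight = is_tightly_chained([1, 3, 4], example_times)
```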
[0168] It should be noted that the hyperparameters throughout the experiment can be further adjusted by verifying the account retention rates in these anomalous sequences. For example, if an account is identified as an anomalous account, and that account has not logged into the target application since its initial login, then the parameter values of the various strategies in this disclosure are considered reasonable and require no adjustment. However, if an account is identified as an anomalous account, and that account has logged into the target application multiple times after its initial login, it indicates that the account is actually a normal account, meaning a "false positive" has occurred. In this case, the parameter values of the various strategies in this disclosure are considered unreasonable and need to be adjusted to improve the accuracy of identifying anomalous accounts.
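As a non-limiting illustration, the retention-based check of the parameter values might be sketched as follows. The function name `false_positive_rate`, the input layout, and the reading of "multiple times" as at least two later logins are assumptions of this example.

```python
def false_positive_rate(flagged_accounts, later_login_counts, min_later_logins=2):
    """Fraction of flagged accounts that kept logging in after the first login.

    `later_login_counts` maps an account to the number of logins observed
    after its initial login; accounts missing from the mapping count as 0.
    """
    if not flagged_accounts:
        return 0.0
    returned = sum(
        1
        for account in flagged_accounts
        if later_login_counts.get(account, 0) >= min_later_logins
    )
    return returned / len(flagged_accounts)


rate = false_positive_rate(["a", "b", "c", "d"], {"a": 0, "b": 5, "c": 1})
# rate == 0.25: only account "b" returned multiple times
```

A rate that stays high after deployment would suggest that thresholds such as the operation duration threshold or the preset time interval should be tightened.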
[0169] Figure 5 is a block diagram illustrating an electronic device 500 according to an exemplary embodiment of the present disclosure.
[0170] Referring to Figure 5, the electronic device 500 includes at least one memory 501 and at least one processor 502. The at least one memory 501 stores instructions that, when executed by the at least one processor 502, perform an abnormal account identification method according to an exemplary embodiment of the present disclosure.
[0171] As an example, electronic device 500 may be a PC, tablet, personal digital assistant, smartphone, or other device capable of executing the aforementioned instructions. Here, electronic device 500 is not necessarily a single electronic device, but may be a collection of any devices or circuits capable of executing the aforementioned instructions (or instruction sets) individually or in combination. Electronic device 500 may also be part of an integrated control system or system manager, or may be configured to interconnect with a portable electronic device locally or remotely (e.g., via wireless transmission) through an interface.
[0172] In electronic device 500, processor 502 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, processor 502 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, etc.
[0173] The processor 502 can execute instructions or code stored in the memory 501, which can also store data. Instructions and data can also be sent and received over a network via a network interface device, which can employ any known transmission protocol.
[0174] The memory 501 may be integrated with the processor 502, for example, by arranging RAM or flash memory within an integrated circuit microprocessor. Alternatively, the memory 501 may include a separate device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory 501 and the processor 502 may be operatively coupled, or may communicate with each other, for example, via I / O ports, network connections, etc., enabling the processor 502 to read files stored in the memory.
[0175] In addition, electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of electronic device 500 can be interconnected via a bus and / or network.
[0176] According to exemplary embodiments of this disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by a processor of an electronic device, enable the electronic device to perform the aforementioned abnormal account identification method. Examples of computer-readable storage media include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card storage (such as multimedia cards, secure digital (SD) cards, or extreme digital (xD) cards), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer program. The computer program in the aforementioned computer-readable storage medium can run in an environment deployed in computer devices such as clients, hosts, agent devices, servers, etc. Furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system, such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner through one or more processors or computers.
[0177] According to exemplary embodiments of the present disclosure, a computer program product may also be provided, including a computer program that, when executed by a processor, implements the abnormal account identification method according to the present disclosure.
[0178] According to the abnormal account identification method, apparatus, electronic device, and storage medium disclosed herein, coarse-grained screening can be performed on multiple accounts based on login information. For the initially screened account clusters, fine-grained screening can then be performed to find the relationships between different accounts. That is, a directed acyclic graph can be constructed. By performing anomaly detection on the sequences contained in the directed acyclic graph, the user's intent can be quickly determined, thereby quickly identifying abnormal accounts and improving the recall rate of abnormal accounts. This has a significant risk control effect on cheating methods such as continuous device flashing, effectively ensuring the quality of activities that promote the growth of application platform users.
[0179] Furthermore, coarse-grained screening can be performed on multiple accounts by analyzing part of the network IP address, the installed Apps, or the installation time of the target application. This allows initial account clusters with similar characteristics to be selected, which roughly identifies accounts suspected of fraudulently obtaining resources. This narrows the screening scope and facilitates more targeted, fine-grained screening of abnormal accounts afterwards.
[0180] Furthermore, by analyzing the single-operation time, the accounts with similar characteristics in each initial account cluster can be divided into those highly suspected of device flashing and those less suspected of it. Accounts less suspected of device flashing can be directly classified as normal accounts, while a directed acyclic graph (DAG) can be constructed from the installation and registration times of the accounts highly suspected of device flashing. This further narrows the scope of abnormal account screening and facilitates finer-grained screening of abnormal accounts using the constructed DAG.
[0181] Furthermore, because cheating users complete each device-flashing cycle very quickly, the login duration of each new account registered in this way will be relatively short, generally much shorter than that of a normal account. Therefore, if the average of at least two login durations is less than a preset duration threshold, that is, less than the login duration of a normal account, it is highly likely that the sequence is an abnormal sequence. In other words, it is highly likely that the nodes contained in the sequence are abnormal accounts used to fraudulently obtain commissions from the application platform, which effectively ensures the accuracy of identifying abnormal accounts.
[0182] Furthermore, if the overlap between at least two click behavior trajectory information corresponding to the sequence is high, it indicates that the click order and click area of each node within the target application are basically the same. In this case, it is highly likely that the sequence is an abnormal sequence, that is, it is highly likely that the nodes contained in the sequence are abnormal accounts used to fraudulently obtain commissions from the application platform, which can effectively ensure the accuracy of identifying abnormal accounts.
[0183] Furthermore, to maximize profits, cheating users keep their device-flashing cycles very short, which results in long sequences containing a large number of nodes. Therefore, if the sequence length is greater than or equal to a preset length threshold, the sequence is likely the result of multiple device resets. In this case, it is highly probable that the sequence is abnormal, meaning that the nodes contained within it are likely abnormal accounts used to defraud the application platform of commissions, which effectively ensures the accuracy of identifying abnormal accounts.
[0184] Furthermore, to maximize profits, cheating users keep their device-flashing cycles very short. This is reflected in the sequence by any two adjacent nodes following each other very closely in time. Therefore, the smaller the node connectivity (that is, the time difference) between any two adjacent nodes in a sequence, the higher the risk, and hence the greater the probability, that the sequence is an anomalous sequence, which can effectively ensure the accuracy of identifying anomalous accounts.
[0185] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.
[0186] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A method for identifying abnormal accounts, characterized in that the method includes: Based on the login information of multiple accounts of the target application, the multiple accounts are clustered to obtain multiple account clusters, wherein the login information of an account includes relevant information of the login device used by the account and the start and end operation times of the account on the target application, and the relevant information of the login device used by the account includes at least one of the following: part of the IP address of the network used by the login device, the Apps installed on the login device, and the installation time of the target application on the login device; Based on the login information of each account in each account cluster, a directed acyclic graph corresponding to each account cluster is constructed, wherein each node in the directed acyclic graph corresponds to an account; For each directed acyclic graph, a plurality of sequences contained in the directed acyclic graph are determined, wherein each sequence contains at least two nodes connected in sequence; Based on the sequence information or node behavior information of each of the multiple sequences, abnormal sequences are identified among the multiple sequences, and the accounts corresponding to the nodes contained in the abnormal sequences are identified as abnormal accounts.
2. The abnormal account identification method as described in claim 1, characterized in that, Based on the login information of multiple accounts of the target application, the multiple accounts are clustered to obtain multiple account clusters, including: Clustering is performed on the accounts, among the multiple accounts, that are associated through the relevant information of the login device, to obtain multiple initial account clusters; Based on the start and end times of each account's operation on the target application, the operation duration for each account is determined, wherein the operation duration represents the time difference between the start and end times of the operation on the target application; Based on the operation duration, the accounts in each initial account cluster are filtered to obtain the multiple account clusters.
3. The abnormal account identification method as described in claim 2, characterized in that, The step of filtering accounts in each initial account cluster based on the operation duration to obtain the multiple account clusters includes: For each initial account cluster, remove accounts whose operation duration is greater than or equal to the operation duration threshold from the accounts included in each initial account cluster; Based on the remaining accounts in each initial account cluster, the multiple account clusters are obtained.
4. The abnormal account identification method as described in claim 1, characterized in that, The login information of the account includes the start and end times of the account's operation on the target application. The step of constructing a directed acyclic graph corresponding to each account cluster based on the login information of each account in each account cluster includes: Based on the start operation time, the nodes contained in the directed acyclic graph corresponding to each account cluster are sorted, wherein the earlier the start operation time, the higher the sorting of the corresponding node. For any two nodes among the sorted plurality of nodes, calculate the time interval between the start operation time of the later node and the end operation time of the earlier node. If the time interval is within a preset time range, establish the connection between the preceding node and the following node to obtain the directed acyclic graph corresponding to each account cluster.
5. The abnormal account identification method as described in claim 1, characterized in that, Determining the multiple sequences contained in the directed acyclic graph includes: Based on the pointing relationships between nodes in the directed acyclic graph, multiple first sequences are obtained; Each first sequence is decomposed to obtain multiple second sequences contained in each first sequence; Based on the plurality of first sequences and the plurality of second sequences contained in each of the first sequences, the plurality of sequences contained in the directed acyclic graph are obtained.
6. The abnormal account identification method as described in claim 1, characterized in that, The node behavior information includes the login duration corresponding to the node. The step of determining abnormal sequences among the multiple sequences based on the sequence information or node behavior information of each sequence includes: For each of the plurality of sequences, determine the average login duration corresponding to the nodes contained in each sequence; If the average login duration is less than a preset duration threshold, each sequence is determined to be an abnormal sequence.
7. The abnormal account identification method as described in claim 1, characterized in that, The node behavior information also includes click behavior trajectory information within the target application. The step of determining abnormal sequences among the multiple sequences based on the sequence information or node behavior information of each sequence further includes: For each of the plurality of sequences, determine the click behavior trajectory information of each of the at least two nodes contained in each sequence within the target application; If the number of overlapping click behavior trajectory information in at least two click behavior trajectory information corresponding to each sequence is greater than or equal to a preset number threshold, then each sequence is determined to be the abnormal sequence.
8. The abnormal account identification method as described in claim 1, characterized in that the sequence information includes the sequence length, and the step of determining abnormal sequences among the multiple sequences based on the sequence information or node behavior information of each sequence includes: for each of the multiple sequences, determining the sequence length of that sequence; and if the sequence length is greater than or equal to a preset length threshold, determining that sequence to be an abnormal sequence.
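As an illustration only, and not part of the claims, the length test in claim 8 reduces to a single comparison. The sketch assumes the sequence length is the number of nodes in the sequence; the function names are hypothetical.

```python
from typing import List, Sequence


def is_abnormal_by_length(sequence: Sequence[str], length_threshold: int) -> bool:
    """Flag the sequence when it has at least `length_threshold` nodes."""
    return len(sequence) >= length_threshold


def abnormal_by_length(
    sequences: List[Sequence[str]], length_threshold: int
) -> List[Sequence[str]]:
    """Keep only the sequences that meet or exceed the preset length threshold."""
    return [s for s in sequences if is_abnormal_by_length(s, length_threshold)]
```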
9. The abnormal account identification method as described in claim 1, characterized in that the login information of the account includes the start and end operation times of the account for the target application, the sequence information also includes node connectivity, and the step of determining abnormal sequences among the multiple sequences based on the sequence information or node behavior information of each sequence further includes: for each of the multiple sequences, determining the node connectivity between any two adjacent nodes among the at least two nodes contained in that sequence, wherein the node connectivity is the time difference between the start operation time corresponding to the later node and the end operation time corresponding to the earlier node of the two adjacent nodes; and if the node connectivity between any two adjacent nodes among the at least two nodes contained in that sequence is less than a preset connectivity threshold, determining that sequence to be an abnormal sequence.
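As an illustration only, and not part of the claims, node connectivity as defined in claim 9 could be computed as in the sketch below. The sketch assumes that "any two adjacent nodes" means every adjacent pair must fall below the threshold. The session dictionary and function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Sequence, Tuple

# Hypothetical: account id -> (start operation time, end operation time).
Sessions = Dict[str, Tuple[datetime, datetime]]


def node_connectivity(sequence: Sequence[str], sessions: Sessions) -> List[timedelta]:
    """Gap between the later node's start time and the earlier node's end time, per adjacent pair."""
    return [
        sessions[later][0] - sessions[earlier][1]
        for earlier, later in zip(sequence, sequence[1:])
    ]


def is_abnormal_by_connectivity(
    sequence: Sequence[str],
    sessions: Sessions,
    connectivity_threshold: timedelta,  # the "preset connectivity threshold"
) -> bool:
    """Flag the sequence when every adjacent gap is below the threshold (assumed reading)."""
    gaps = node_connectivity(sequence, sessions)
    return bool(gaps) and all(gap < connectivity_threshold for gap in gaps)
```

A small gap means one account began operating shortly after the previous one stopped, which is the pattern the background describes for a device that is reset and re-registered repeatedly.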
10. An abnormal account identification device, characterized in that it comprises: a clustering module configured to cluster multiple accounts of a target application based on login information of the multiple accounts, thereby obtaining multiple account clusters, wherein the login information of an account includes relevant information of the login device used by the account and the start and end operation times of the account for the target application, and the relevant information of the login device used by the account includes at least one of the following: part of the IP address of the network used by the login device, the Apps installed on the login device, and the installation time of the target application on the login device; a directed acyclic graph construction module configured to construct a directed acyclic graph corresponding to each account cluster based on the login information of each account in that account cluster, wherein each node in the directed acyclic graph corresponds to an account; a multi-sequence determination module configured to determine multiple sequences contained in each directed acyclic graph, wherein each sequence contains at least two nodes connected in sequence; and an abnormal sequence determination module configured to determine an abnormal sequence among the multiple sequences based on the sequence information or node behavior information of each of the multiple sequences, and to determine the accounts corresponding to the nodes contained in the abnormal sequence as abnormal accounts.
11. An electronic device, characterized in that it comprises: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the abnormal account identification method as described in any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the abnormal account identification method as described in any one of claims 1 to 9.