A method and apparatus for identifying attack behavior

By extracting triples from host logs, constructing an attribute heterogeneous graph, and training a model on it, the method addresses the problem that logs from security detection devices carry too little semantic information for attack behaviors to be identified automatically, and thereby identifies attack behaviors efficiently.

CN113935028B · Active · Publication Date: 2026-06-16 · NSFOCUS INFORMATION TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NSFOCUS INFORMATION TECHNOLOGY CO LTD
Filing Date
2021-11-12
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Most of the massive logs generated by existing security detection equipment are low-level, isolated, and semantically meaningless, making it difficult to automatically identify attack behaviors.

Method used

Triples are extracted from host logs to construct an attribute heterogeneous graph, and a semantic reasoning model is trained on that graph. Based on the model, the graph is divided into subgraphs, and attack behaviors are identified from the subgraphs.

Benefits of technology

Improves the accuracy of attack behavior identification and reduces labor costs.

Abstract

A method and device for attack behavior recognition are used to automatically identify attack behaviors from massive logs obtained by a security detection device. In the present application, the method comprises: extracting a plurality of triplets from host logs, wherein the triplets comprise a source node, a target node and an edge, and the edge is used to indicate an operation between the source node and the target node; determining an attribute heterogeneous graph of the host logs according to the source nodes, the target nodes and the edges in the plurality of triplets; performing model training to obtain a semantic reasoning model according to the attribute heterogeneous graph; obtaining one or more subgraphs according to the semantic reasoning model and the attribute heterogeneous graph; and determining attack behaviors from behaviors corresponding to the one or more subgraphs.

Description

Technical Field

[0001] This application relates to the field of cybersecurity technology, and in particular to a method and apparatus for identifying attack behavior.

Background Technology

[0002] To defend against internal or external attacks, enterprises typically deploy a large number of security detection devices, such as Web Application Firewalls (WAF), Endpoint Detection and Response (EDR) systems, and honeypots.

[0003] Security detection equipment generates massive amounts of logs, most of which are low-level, isolated, and lack semantic meaning. Automatically identifying attack behaviors from these massive logs is a current challenge for security operations.

Summary of the Invention

[0004] This application provides a method and apparatus for identifying attack behaviors, which are used to automatically identify attack behaviors from host logs or host operation logs.

[0005] In a first aspect, this application provides a method for identifying attack behavior, which can be executed by a security detection device.

[0006] In one possible implementation, multiple triples are extracted from the host logs, each triple including a source node, a target node, and an edge, the edge indicating the operation between the source node and the target node; an attribute heterogeneous graph of the host logs is determined based on the source nodes, target nodes, and edges in the multiple triples; a model is trained based on the attribute heterogeneous graph to obtain a semantic reasoning model; one or more subgraphs are obtained based on the semantic reasoning model and the attribute heterogeneous graph; and attack behaviors are determined from the behaviors corresponding to the one or more subgraphs.

[0007] In one possible implementation, the host log includes multiple log types, each type corresponding to the keyword of the source node and the keyword of the target node; the extraction of multiple triples from the host log includes: for any log type: extracting the source node from the log based on the keyword of the source node corresponding to the type; extracting the target node from the log based on the keyword of the target node corresponding to the type; determining an edge based on the operation between the source node and the target node in the log; and forming a triple of the log by combining the source node, the edge, and the target node.

[0008] In one possible implementation, a semantic reasoning model is trained based on the attribute heterogeneous graph, including: performing vector initialization for the source node, edge, and target node corresponding to each triple in the attribute heterogeneous graph to obtain the vector representation corresponding to the triple, thereby obtaining the vector representation corresponding to the plurality of triples; and training the model based on the vector representation corresponding to the plurality of triples to obtain the semantic reasoning model.

[0009] In one possible implementation, obtaining one or more subgraphs based on the semantic reasoning model and the attribute heterogeneous graph includes: performing cluster analysis on multiple nodes in the attribute heterogeneous graph based on the semantic reasoning model to obtain one or more node sets corresponding to the attribute heterogeneous graph; and forming a subgraph corresponding to the node set by combining multiple nodes in each node set and the edges corresponding to each node, thereby obtaining one or more subgraphs corresponding to the attribute heterogeneous graph.

[0010] In one possible implementation, before performing clustering analysis on multiple nodes in the attribute heterogeneous graph according to the semantic reasoning model, the method further includes: filtering out edges in the attribute heterogeneous graph that do not conform to temporal relationships and the target nodes corresponding to the edges according to the timestamps corresponding to the multiple edges in the attribute heterogeneous graph, so as to obtain a filtered attribute heterogeneous graph.

[0011] In one possible implementation, determining the attack behavior from the behaviors corresponding to the one or more subgraphs includes: determining the importance of each element in the attribute heterogeneous graph based on the number of subgraphs corresponding to the attribute heterogeneous graph and the frequency of occurrence of each element in the attribute heterogeneous graph, wherein the element is a source node, edge, or target node; for any subgraph, determining the vector representation corresponding to the subgraph based on the importance and vector representation of each element contained in the subgraph, wherein the vector representation corresponding to the subgraph is used to indicate the behavior corresponding to the subgraph; and determining the attack behavior based on the vector representations corresponding to the one or more subgraphs.

[0012] In one possible implementation, determining the attack behavior based on the vector representations corresponding to the one or more subgraphs includes: selecting a vector representation from the vector representations corresponding to the one or more subgraphs that has a similarity greater than a first preset similarity to a preset vector representation, and using it to indicate the attack behavior; and / or determining the similarity between the vector representations corresponding to any two subgraphs in the plurality of subgraphs, and determining the attack behavior based on the two vector representations that have a similarity less than a second preset similarity.

[0013] In the above technical solution, using attribute heterogeneous graphs to model host logs can effectively analyze the contextual semantics of behaviors in host logs and integrate heterogeneous data of various types. A semantic reasoning model is obtained by training the attribute heterogeneous graph. Based on the semantic reasoning model, the attribute heterogeneous graph is divided into one or more subgraphs, thereby realizing the semantic extraction and vector representation of the host log context. Furthermore, based on the vector representations of each subgraph, attack behaviors are effectively identified. Thus, while improving the accuracy of attack behavior identification, labor costs are reduced.

[0014] In a second aspect, this application provides an apparatus for identifying attack behavior, which can be a security detection device.

[0015] In one possible implementation, the apparatus includes: an extraction module for extracting multiple triples from host logs, wherein each triple includes a source node, a target node, and an edge, the edge indicating an operation between the source node and the target node; a processing module for determining an attribute heterogeneous graph of the host logs based on the source node, target node, and edge in the multiple triples; training a model based on the attribute heterogeneous graph to obtain a semantic reasoning model; obtaining one or more subgraphs based on the semantic reasoning model and the attribute heterogeneous graph; and determining attack behaviors from the behaviors corresponding to the one or more subgraphs.

[0016] In one possible implementation, the host log includes multiple types of logs, each type corresponding to the keyword of the source node and the keyword of the target node; the extraction module is specifically used to: for any type of log: extract the source node from the log based on the keyword of the source node corresponding to the type; extract the target node from the log based on the keyword of the target node corresponding to the type; determine the edge based on the operation between the source node and the target node in the log; and form a triple of the log by combining the source node, the edge, and the target node.

[0017] In one possible implementation, the processing module is specifically used to: perform vector initialization for the source node, edge, and target node corresponding to each triple in the attribute heterogeneous graph to obtain the vector representation corresponding to the triple, thereby obtaining the vector representation corresponding to the plurality of triples; and perform model training based on the vector representation corresponding to the plurality of triples to obtain the semantic reasoning model.

[0018] In one possible implementation, the processing module is specifically used to: perform cluster analysis on multiple nodes in the attribute heterogeneous graph according to the semantic reasoning model to obtain one or more node sets corresponding to the attribute heterogeneous graph; and form a subgraph corresponding to the node set by combining multiple nodes in each node set and the edges corresponding to each node, thereby obtaining one or more subgraphs corresponding to the attribute heterogeneous graph.

[0019] In one possible implementation, before performing clustering analysis on multiple nodes in the attribute heterogeneous graph according to the semantic reasoning model, the processing module is further configured to: filter the edges in the attribute heterogeneous graph that do not conform to the temporal relationship and the target nodes corresponding to the edges according to the timestamps corresponding to the multiple edges in the attribute heterogeneous graph, so as to obtain the filtered attribute heterogeneous graph.

[0020] In one possible implementation, the processing module is specifically used to: determine the importance of each element in the attribute heterogeneous graph based on the number of subgraphs corresponding to the attribute heterogeneous graph and the frequency of occurrence of each element in the attribute heterogeneous graph, wherein the element is a source node, edge, or target node; for any subgraph, determine the vector representation corresponding to the subgraph based on the importance and vector representation of each element contained in the subgraph, wherein the vector representation corresponding to the subgraph is used to indicate the behavior corresponding to the subgraph; and determine the attack behavior based on the vector representations corresponding to the one or more subgraphs.

[0021] In one possible implementation, the processing module is specifically used to: select a vector representation from the vector representations corresponding to the one or more subgraphs that has a similarity greater than a first preset similarity to a preset vector representation, for indicating the attack behavior; and / or determine the similarity between the vector representations corresponding to any two subgraphs in the plurality of subgraphs, and determine the attack behavior based on the two vector representations with a similarity less than a second preset similarity.

[0022] In a third aspect, this application provides a computer-readable storage medium storing a computer program or instructions that, when executed by a device, cause the device to perform the method described in the first aspect or any possible implementation thereof.

[0023] In a fourth aspect, this application provides a computer program product comprising a computer program or instructions that, when executed by a device, implement the method described in the first aspect or any possible implementation thereof.

[0024] In a fifth aspect, this application provides a computing device including a processor connected to a memory, wherein the memory is used to store a computer program and the processor is used to execute the computer program stored in the memory, so that the computing device implements the method in the first aspect or any possible implementation thereof.

[0025] For the technical effects achievable by any of the second to fifth aspects, refer to the description of the beneficial effects of the first aspect; they are not repeated here.

Attached Figure Description

[0026] Figure 1 is a flowchart of an attack behavior identification method provided in this application;

[0027] Figure 2 is an attribute heterogeneous graph provided in this application;

[0028] Figure 3 shows the relationship between the vectors corresponding to the elements of a triple, provided in this application;

[0029] Figure 4 is an attribute heterogeneous graph of a program compilation process provided in this application;

[0030] Figure 5 is a schematic diagram of dividing an attribute heterogeneous graph into multiple subgraphs, provided in this application;

[0031] Figure 6 is a flowchart of a method for determining attack behavior provided in this application;

[0032] Figure 7 is a schematic structural diagram of an attack behavior identification device provided in this application;

[0033] Figure 8 is a schematic structural diagram of another attack behavior identification device provided in this application.

Detailed Implementation

[0034] This application provides a method for identifying attack behaviors. In this method, a security detection device models system behavior patterns by aggregating semantic information from massive logs, then automatically extracts context as behavioral semantics by analyzing the logs, and finally identifies attack behaviors from the logs based on the behavioral semantics.

[0035] Figure 1 shows an exemplary flowchart of the attack behavior identification method:

[0036] Step 101: Extract multiple triples from the host logs.

[0037] The triple is a concept in knowledge graphs. A triple can include three elements: a source node, a target node, and an edge. The edge is used to indicate the operation or relationship between the source node and the target node.

[0038] In this application, the host log may also be a host runtime log; the description below uses the host log as an example. The host log can include various types, such as network connections, process behavior, registry operations, terminal logins, system services, operating system scheduled tasks, terminal applications, and file operations. Depending on the type, the triple can take different forms:

[0039] 1. Network connection

[0040] Network connection type logs are used to record the network behavior of the terminal.

[0041] For logs related to network connection types, they need to be differentiated based on the direction of the network connection as follows:

[0042] A connection direction of "in" indicates that a remote host is accessing the local machine via a network connection. The triple can be represented as remote_ip→process, where the source node remote_ip represents the IP address of the remote host, and the target node process represents the relevant process on the local machine.

[0043] A connection direction of "out" indicates that the local machine accesses a remote host via a network connection. The triple can be represented as process→remote_ip, where the source node "process" represents the relevant process on the local machine, and the target node "remote_ip" represents the IP address of the remote host.

[0044] 2. Process behavior

[0045] Logs of process behavior types can be used to describe the call relationships between processes.

[0046] Its triple can be represented as parent_process→children_process. Here, the source node parent_process of the triple represents the parent process, the target node children_process of the triple represents the child process, and the edges of the triple represent process operations.

[0047] 3. Registry operations

[0048] Registry operation type logs are used to represent operations performed on the registry.

[0049] The triple can be represented as process→registry_path. Here, the source node 'process' represents the process, the target node 'registry_path' represents the registry path, and the edges represent related registry operations.

[0050] 4. Terminal Login

[0051] Terminal login type logs are used to describe the terminal's login behavior.

[0052] Its triple can be represented as IP→process, where the source node IP of the triple represents the login IP or user, the target node process of the triple represents the terminal and login-related processes, and the edges of the triple represent login operations.

[0053] 5. System Services

[0054] System service logs are used to describe the operating system's service operations.

[0055] The triple can be represented as process→service, where the source node process represents the process related to the system service, the target node service represents the related service, and the edges of the triple represent the service type.

[0056] 6. Operating System Scheduled Tasks

[0057] The operating system scheduled task type log is used to represent the operating system scheduled tasks executed by the user.

[0058] Its triple can be represented as user→service, where the source node user represents the user, the target node service represents the related task, and the edges of the triple represent the task behavior.

[0059] 7. Terminal Application

[0060] Logs of the terminal application type are used to represent the relevant log records of the terminal application.

[0061] Its triple can be represented as user→app, where the source node user represents the end user, the target node app represents the application, and the edges of the triple represent the user's related operations on the application.

[0062] 8. File Operations

[0063] File operation logs are used to describe file operations performed on the terminal.

[0064] Its triple can be represented as process→file, where the source node process represents a process, the target node file represents a file, and the edges of the triple represent the processes' operations on the file.

[0065] In one possible implementation, each type corresponds to a keyword of its source node and a keyword of its target node. For example, if the type is network connection, then the keywords of the source node and the target node corresponding to network connection are remote_ip and process, respectively.

[0066] When extracting triples from host logs, the process involves determining the log type for each log entry. Then, based on the log type, the keywords of the source node and target node corresponding to that type are obtained, allowing the extraction of the corresponding source and target nodes. Edges are then identified based on the operations between the source and target nodes within the log. Finally, the source node, edges, and target node are combined to form a log triple.

[0067] Taking network connection as an example again, when extracting triples from the network connection log, the corresponding source node and target node can be extracted based on remote_ip and process, and the edges of the triple can be determined based on the operation between remote_ip and process in the network connection log.
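The extraction procedure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each log entry has already been parsed into a dictionary, and the field names `type`, `direction`, `operation`, and `timestamp`, the log-type strings, and the `NODE_KEYWORDS` table layout are all hypothetical. Only the node keywords (remote_ip, process, parent_process, and so on) come from the text.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    source: str
    edge: str
    target: str
    timestamp: float


# Hypothetical keyword table: log type -> (source-node keyword, target-node keyword).
NODE_KEYWORDS = {
    "network_connection": ("remote_ip", "process"),
    "process_behavior": ("parent_process", "children_process"),
    "registry_operation": ("process", "registry_path"),
    "file_operation": ("process", "file"),
}


def extract_triple(log: dict) -> Optional[Triple]:
    """Build one (source, edge, target) triple from a parsed log record."""
    keywords = NODE_KEYWORDS.get(log.get("type"))
    if keywords is None:
        return None  # unknown log type: nothing to extract
    src_key, dst_key = keywords
    # For network connections the direction decides which end is the source:
    # "in" is remote_ip -> process, "out" is process -> remote_ip.
    if log["type"] == "network_connection" and log.get("direction") == "out":
        src_key, dst_key = dst_key, src_key
    if src_key not in log or dst_key not in log:
        return None  # the record lacks one of the expected keywords
    return Triple(log[src_key], log.get("operation", "unknown"),
                  log[dst_key], log.get("timestamp", 0.0))


# Made-up log records for illustration only.
logs = [
    {"type": "network_connection", "direction": "out", "remote_ip": "203.0.113.7",
     "process": "Test.exe", "operation": "connect", "timestamp": 2.0},
    {"type": "file_operation", "process": "Explorer.exe", "file": "Test.exe",
     "operation": "write", "timestamp": 1.0},
]
triples = [t for t in map(extract_triple, logs) if t is not None]
```

Keeping the keywords in a lookup table means that supporting a further log type only requires one more table entry, which mirrors the per-type keyword correspondence described in the text.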

[0068] Step 102: Determine the attribute heterogeneous graph of the host log based on the source node, target node, and edges in multiple triples.

[0069] An attribute heterogeneous graph is a directed graph composed of vertices, edges, labels, relation types, and properties. A vertex, also called a node, can be either the source node or the target node of a triple. An edge, also called a relation, can have various types.

[0070] An attribute heterogeneous graph of the host log can be constructed by combining the source node, target node, and edge of each of the multiple triples. For example, Figure 2 shows an attribute heterogeneous graph provided in this application.

[0071] As shown in Figure 2, the logs may involve the following process: a shellcode tool is created and successfully delivered; the shellcode tool is used to establish a reverse connection (e.g., the reverse shell shown in Figure 2); the session is then quickly migrated; and the registry is modified for persistence.

[0072] Taking the triple Explorer.exe→Test.exe highlighted by the dashed box in Figure 2 as an example, the edge represents a write operation, that is, the process Explorer.exe performs a write operation on the file Test.exe.

[0073] Other similar cases will not be elaborated upon further.
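As a rough sketch of Step 102, the triples can be merged into a directed, labeled graph by sharing nodes that have the same name. The adjacency-list layout and the function name `build_graph` are assumptions made for illustration, and the second triple below is a made-up example rather than one taken from Figure 2. Node types and further properties could be stored alongside in the same way.

```python
from collections import defaultdict


def build_graph(triples):
    """Merge (source, edge_label, target, timestamp) tuples into one directed graph."""
    nodes = set()
    out_edges = defaultdict(list)  # source -> [(edge_label, target, timestamp), ...]
    for source, label, target, timestamp in triples:
        nodes.update((source, target))
        out_edges[source].append((label, target, timestamp))
    return nodes, dict(out_edges)


triples = [
    ("Explorer.exe", "write", "Test.exe", 1.0),   # the triple discussed for Figure 2
    ("Test.exe", "connect", "203.0.113.7", 2.0),  # hypothetical follow-up action
]
nodes, out_edges = build_graph(triples)
```

Because Test.exe is the target of the first triple and the source of the second, the two triples become a two-hop path in the graph, which is what lets later steps reason about context across individual log entries.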

[0074] Step 103: Based on the attribute heterogeneous graph, train the model to obtain the semantic reasoning model.

[0075] In one possible implementation, for each triple in the attribute heterogeneous graph, vector initialization is performed on the source node, edge, and target node to obtain the vector representation of the triple, thereby obtaining multiple vector representations of triples; based on the multiple vector representations of triples, the model is trained to obtain the semantic reasoning model.

[0076] Taking any triple as an example, the triple can include three elements: a source node, a target node, and an edge. Specifically, vector initialization can be performed on the source node to obtain its corresponding vector representation; vector initialization can be performed on the edge to obtain its corresponding vector representation; and vector initialization can be performed on the target node to obtain its corresponding vector representation. The vector initialization can be implemented using the existing one-hot encoding technique.

[0077] Thus, the vectors corresponding to the source node, edge, and target node in each triple can be obtained, which can be represented as h, r, and t respectively. Equivalently, the vector corresponding to the triple is (h, r, t).
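The one-hot initialization mentioned above might look like the following sketch. The helper name `one_hot_vectors` and the choice to keep separate vocabularies for nodes and edges are assumptions. In this sketch the node and edge vectors have different lengths; a translation model such as TransE would additionally need both in one shared vector space.

```python
def one_hot_vectors(items):
    """Map each distinct item to a one-hot vector (a list of 0.0 / 1.0)."""
    vocab = sorted(set(items))
    index = {item: i for i, item in enumerate(vocab)}
    return {
        item: [1.0 if i == index[item] else 0.0 for i in range(len(vocab))]
        for item in vocab
    }


# Made-up triples for illustration.
triples = [
    ("Explorer.exe", "write", "Test.exe"),
    ("Test.exe", "connect", "203.0.113.7"),
]
node_vec = one_hot_vectors([n for s, _, t in triples for n in (s, t)])
edge_vec = one_hot_vectors([r for _, r, _ in triples])

# Initial (h, r, t) for the first triple.
h, r, t = node_vec["Explorer.exe"], edge_vec["write"], node_vec["Test.exe"]
```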

[0078] The vectors corresponding to multiple triples in the attribute heterogeneous graph can be used as training data to train the model and obtain a trained semantic reasoning model. The semantic reasoning model can be used to indicate that two nodes with semantic association are relatively close in the vector space.

[0079] It can also be understood that during model training, the contextual relationships (or semantic associations) of the nodes in the attribute heterogeneous graph are taken into account, and the vector representation of each element (source node, edge, or target node) is adjusted so that two nodes with a contextual relationship lie closer together in the vector space than two nodes without one.

[0080] In a specific implementation, the semantic reasoning model can be the TransE model. The TransE model belongs to the translation model. The TransE model can intuitively view the edges in each triple instance (source node, edge, target node) as translations from the source node to the target node. The source node can be represented as the entity head, the target node can be represented as the entity tail, and the edge can be represented as the relation.

[0081] The TransE model continuously adjusts h, r, and t in the vector space so that (h+r) is as close to t as possible, i.e., h+r≈t, as shown in Figure 3.

[0082] The TransE model defines a distance function d(h+r, t), which measures the distance between h+r and t. In practical applications, L1 or L2 norms can be used. During model training, TransE employs the maximum margin method, with the following objective function:

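[0083] $$L = \sum_{(h,r,t)\in S}\ \sum_{(h',r,t')\in S'} \left[\gamma + d(h+r,\,t) - d(h'+r,\,t')\right]_+$$

[0084] Where S is the set of triples in the knowledge base, i.e., the training set; S′ is the set of negatively sampled triples, each obtained by replacing h or t of a triple in S; γ is a margin parameter with a value greater than 0; and [x]₊ denotes max(0, x).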
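To make the maximum-margin idea concrete, the sketch below evaluates one term of a TransE-style objective for a positive triple and a corrupted one. It assumes the standard hinge form max(0, γ + d(h+r, t) − d(h′+r, t′)); the function names and the example vectors are hypothetical, and no gradient update is shown.

```python
import math


def distance(h, r, t, norm=2):
    """d(h + r, t) using the L1 or L2 norm."""
    diffs = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(x) for x in diffs)
    return math.sqrt(sum(x * x for x in diffs))


def margin_loss(positive, negative, gamma=1.0, norm=2):
    """Hinge term max(0, gamma + d(positive) - d(negative)) for one triple pair."""
    return max(0.0, gamma + distance(*positive, norm=norm)
               - distance(*negative, norm=norm))


h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]  # h + r equals t exactly
t_far = [0.0, 3.0]    # corrupted tail that lies far from h + r
t_near = [1.0, 0.5]   # corrupted tail that lies close to h + r

loss_far = margin_loss((h, r, t), (h, r, t_far))    # margin already satisfied
loss_near = margin_loss((h, r, t), (h, r, t_near))  # margin violated
```

A corrupted triple that is already farther from h+r than the margin contributes nothing to the loss, while one that lies too close yields a positive term that training would push down by moving the vectors.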

[0085] Step 104: Based on the semantic reasoning model and the attribute heterogeneous graph, obtain one or more subgraphs.

[0086] Each subgraph can be used to indicate a behavior or a behavior summary.

[0087] The behavior summary can be understood as data or information flow related to a certain behavior.

[0088] System behavior summary analysis can reduce an attribute heterogeneous graph to the subgraphs that are causally related to a specific behavior, such as an attack behavior. In this application, behavior summary extraction can be achieved using path partitioning or subgraph partitioning methods.

[0089] In one possible implementation, the attribute heterogeneous graph can be partitioned into subgraphs based on a semantic reasoning model to obtain one or more subgraphs corresponding to the attribute heterogeneous graph. Specifically, based on the semantic reasoning model, multiple nodes in the attribute heterogeneous graph can be clustered to obtain one or more node sets corresponding to the attribute heterogeneous graph; multiple nodes in each node set, along with the edges corresponding to each node, are used to form a subgraph corresponding to the node set, thus obtaining one or more subgraphs corresponding to the attribute heterogeneous graph.

[0090] Figure 4 illustrates an attribute heterogeneous graph corresponding to a program compilation process. Specifically, this attribute heterogeneous graph can be a source graph, corresponding to a scenario in which a software test engineer wants to abuse their privileges to steal sensitive information. The software test engineer's daily work includes using Git to synchronize code, using gcc to compile source code, and using the apt command to test security-related dependency packages. To steal a sensitive file (such as secret.txt), the software test engineer needs to imitate their daily work behavior to evade detection. Specifically, the software test engineer can first copy the sensitive file to their usual working directory and rename it to pro2.c, then compile pro2.c using gcc, and finally upload pro2.c to GitHub.

[0091] Based on the trained semantic reasoning model, the attribute heterogeneous graph shown in Figure 4, for example, can be partitioned by clustering its nodes. The clustering method can be hierarchical cluster analysis (HCA), which yields the three subgraphs shown in Figure 5, denoted subgraph 1, subgraph 2, and subgraph 3.
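One way to picture the partitioning is the small agglomerative (bottom-up hierarchical) clustering sketch below. It is only an illustration under several assumptions: single linkage, Euclidean distance, and a distance threshold as the stopping rule are choices made here, not details given in the text. The node embeddings are made-up numbers, and taking the induced subgraph (edges with both endpoints in the node set) is one possible reading of "the edges corresponding to each node".

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_clusters(vectors, threshold):
    """Single-linkage hierarchical clustering; stop once clusters are farther apart than threshold."""
    clusters = [{name} for name in vectors]
    while len(clusters) > 1:
        # Closest pair of clusters, measured by their two nearest members.
        (i, j), best = min(
            (((i, j), min(euclidean(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def induced_subgraph(node_set, edges):
    """Keep the edges whose two endpoints both fall in the node set."""
    return [(s, label, t) for s, label, t in edges if s in node_set and t in node_set]


# Hypothetical 2-D embeddings and edges, loosely named after the Figure 4 scenario.
vectors = {"git": [0.0, 0.0], "gcc": [0.1, 0.0], "cp": [5.0, 5.0], "secret.txt": [5.1, 5.0]}
edges = [("git", "exec", "gcc"), ("cp", "read", "secret.txt"), ("gcc", "exec", "cp")]

clusters = agglomerative_clusters(vectors, threshold=1.0)
subgraphs = [induced_subgraph(c, edges) for c in clusters]
```

Edges that cross two node sets, such as the hypothetical gcc to cp edge, end up in neither subgraph under this reading.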

[0092] Furthermore, before the attribute heterogeneous graph is partitioned into subgraphs, the temporal relationship of actions can be considered: each subsequent action must occur after the preceding action. This temporal constraint can filter out a large portion of dependencies. In one possible implementation, an ancestor node, i.e., a node that does not depend on any other node, is first obtained from the attribute heterogeneous graph. Then, starting from this ancestor node, it is determined in sequence whether there are edges or nodes that need to be filtered out.

[0093] Specifically, each triple in the attribute heterogeneous graph can correspond to its own timestamp. Based on the timestamp of each edge in the attribute heterogeneous graph, edges that do not conform to the temporal relationship and the target nodes corresponding to those edges can be filtered out to obtain the filtered attribute heterogeneous graph.

[0094] For example, triple 1 includes source node 1, edge 1, and target node 1, while triple 2 includes source node 2, edge 2, and target node 2. In practice, triple 2 occurs after triple 1, and the target node 1 in triple 1 and the source node 2 in triple 2 are the same node. However, in the current heterogeneous graph, the timestamp corresponding to edge 2 is earlier than the timestamp corresponding to edge 1, meaning triple 2 occurs before triple 1. Therefore, edge 2 and target node 2 can be filtered out based on their respective timestamps.

[0095] Based on this filtering method, nodes and edges that clearly do not conform to temporal relationships can be filtered out, thereby obtaining a more simplified attribute heterogeneous graph. Then, the filtered attribute heterogeneous graph can be further divided into subgraphs, which helps to improve processing efficiency.
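The temporal filtering can be sketched as a traversal from the ancestor nodes that only follows an edge if its timestamp is not earlier than the time at which its source node was first reached. This is one possible interpretation, not the patented procedure: the earliest-arrival traversal with a heap, the function name `filter_by_time`, and the edge tuple layout are assumptions, and the sketch presumes the graph has at least one ancestor node.

```python
import heapq
from collections import defaultdict


def filter_by_time(edges):
    """Drop edges, and targets reachable only through them, that violate temporal order.

    edges: list of (source, label, target, timestamp) tuples.
    """
    out_edges = defaultdict(list)
    targets = set()
    for edge in edges:
        out_edges[edge[0]].append(edge)
        targets.add(edge[2])
    # Ancestor nodes never appear as a target, so they have no earlier dependency.
    arrival = {node: float("-inf") for node in list(out_edges) if node not in targets}
    heap = [(time, node) for node, time in arrival.items()]
    heapq.heapify(heap)
    while heap:
        time, node = heapq.heappop(heap)
        if time > arrival[node]:
            continue  # stale heap entry
        for _, _, target, ts in out_edges.get(node, []):
            # An action is kept only if it happens after its source was reached.
            if ts >= time and ts < arrival.get(target, float("inf")):
                arrival[target] = ts
                heapq.heappush(heap, (ts, target))
    kept = [e for e in edges if e[0] in arrival and e[3] >= arrival[e[0]]]
    return kept, set(arrival)


# Mirrors the example above: edge 2 is stamped earlier than edge 1, which leads into its source.
edges = [
    ("source 1", "edge 1", "target 1", 5.0),
    ("target 1", "edge 2", "target 2", 3.0),
    ("target 1", "edge 3", "target 3", 7.0),  # hypothetical valid follow-up
]
kept_edges, kept_nodes = filter_by_time(edges)
```

Here edge 2 and target 2 are dropped because edge 2 is stamped before target 1 could have been reached, while the later edge 3 survives.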

[0096] Step 105: Identify the attack behavior from the behaviors corresponding to one or more subgraphs.

[0097] Figure 6 shows an exemplary flowchart for determining attack behavior:

[0098] Step 601: Determine the importance of each element in the attribute heterogeneous graph based on the number of subgraphs corresponding to the attribute heterogeneous graph and the frequency of occurrence of elements in the attribute heterogeneous graph. The frequency of occurrence of an element in the attribute heterogeneous graph can be the total number of times the element appears in the attribute heterogeneous graph, the number of subgraphs containing the element in the attribute heterogeneous graph, or other methods.

[0099] For a given action, there is a series of underlying related operations, but the importance and necessity of each underlying operation are different for that action.

[0100] For example, in the program compilation process of Figure 4, software test engineers typically do not compile the source code directly. Instead, they first use commands such as `ls` or `dir` to locate the source code. While commands such as `ls` and `dir` do represent actions of the software test engineer, they contribute little to the semantics of higher-level behaviors. Thus, the more frequently an operation / element appears in the attribute heterogeneous graph, the lower its importance may be.

[0101] In this application, Inverse Document Frequency (IDF) can be used to define the importance of an operation to a behavior. To correspond with the use of IDF, an operation can be viewed as a word in a document, and a behavior as a document. The IDF calculation formula can be expressed as follows:

[0102]

[0103] Where e represents an element in the attribute heterogeneous graph (i.e., a source node, edge, or target node), N represents the total number of subgraphs, and ω_e represents the number of subgraphs containing element e. IDF(e) indicates the importance of the element (also called its weight, importance weight, relative importance, or contribution).
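The importance calculation of Step 601 can be sketched as follows. This is a minimal illustration, assuming the standard IDF form IDF(e) = log(N / ω_e), where ω_e is the number of subgraphs containing e; the exact formula of this application may differ (for example, by a smoothing term). The function name `idf_weights` and the sample subgraphs are hypothetical.

```python
import math
from typing import Dict, Hashable, Iterable, Set


def idf_weights(subgraphs: Iterable[Set[Hashable]]) -> Dict[Hashable, float]:
    """Return IDF(e) = log(N / omega_e) for every element e.

    N is the total number of subgraphs and omega_e is the number of
    subgraphs containing e (standard IDF form, assumed here).
    """
    subgraph_list = [set(sg) for sg in subgraphs]
    n = len(subgraph_list)
    counts: Dict[Hashable, int] = {}
    for sg in subgraph_list:
        for element in sg:
            counts[element] = counts.get(element, 0) + 1
    return {element: math.log(n / c) for element, c in counts.items()}


# Hypothetical subgraphs, each given as the set of its elements.
subgraphs = [
    {"bash", "ls", "gcc", "main.c"},
    {"bash", "ls", "vim", "notes.txt"},
    {"bash", "ls", "curl", "payload.bin"},
]
weights = idf_weights(subgraphs)
```

In this sample, `ls` appears in every subgraph and receives weight 0, while `gcc` appears in only one and receives weight log 3, which matches the intuition that frequent low-level operations carry little behavioral meaning.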

[0104] Step 602: For any subgraph, determine the vector representation corresponding to the subgraph based on the importance and vector representation of each element contained in the subgraph.

[0105] The vector representation corresponding to the subgraph is used to indicate the behavior of the subgraph.

[0106] For example, a subgraph may include multiple elements. A weighted sum can be performed based on the importance and vector representation of each element in the subgraph to obtain the corresponding vector representation. For instance, if a subgraph includes three elements with importance values of ω1, ω2, and ω3 and vector representations of a1, a2, and a3, the vector representation of this subgraph can be calculated as ω1×a1 + ω2×a2 + ω3×a3.

[0107] In this way, the vector representations corresponding to the one or more subgraphs can be obtained.
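The weighted sum of Step 602 can be sketched as follows. The function name `subgraph_vector`, the element names, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def subgraph_vector(
    elements: Sequence[str],
    importance: Dict[str, float],
    vectors: Dict[str, Sequence[float]],
) -> List[float]:
    """Return the importance-weighted sum of the element vectors."""
    if not elements:
        return []
    dim = len(vectors[elements[0]])
    total = [0.0] * dim
    for element in elements:
        weight = importance[element]
        for k, component in enumerate(vectors[element]):
            total[k] += weight * component
    return total


# Hypothetical subgraph with three elements, as in the example above.
importance = {"e1": 0.5, "e2": 1.0, "e3": 2.0}
vectors = {"e1": [1.0, 0.0], "e2": [0.0, 1.0], "e3": [1.0, 1.0]}
result = subgraph_vector(["e1", "e2", "e3"], importance, vectors)
# result is 0.5*[1,0] + 1.0*[0,1] + 2.0*[1,1] = [2.5, 3.0]
```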

[0108] Step 603: Determine the attack behavior based on the vector representations corresponding to one or more subgraphs.

[0109] In the current scenario, a behavior can be regarded as a cluster of semantically similar behavior instances. The labeled behavior instances in a cluster therefore serve as representative instances (e.g., the cluster centroid). If valid behavior labels can be identified, the representative behavior instances can be investigated, thus improving the automation level of attack investigation.

[0110] Given the vector representations of different behavioral instances (i.e., subgraphs), it is possible to determine whether an attack behavior exists based on the vector representation of each subgraph.

[0111] In one example, a vector representation of the attack behavior can be preset (referred to as a preset vector representation), and then the similarity between the preset vector representation and the vector representations of one or more subgraphs can be determined. Vector representations with a similarity greater than a first preset similarity are then selected to indicate the attack behavior.

[0112] Thus, a vector representation that is closer to a preset vector representation can be selected from the vector representations of one or more subgraphs, that is, a behavior that is closer to a preset attack behavior can be selected from one or more behaviors as the attack behavior.
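The first example can be sketched as follows, assuming cosine similarity as the similarity measure. The function names, the preset vector, the threshold value, and the sample subgraph vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_preset(
    subgraph_vectors: Dict[str, Sequence[float]],
    preset: Sequence[float],
    first_threshold: float,
) -> List[str]:
    """Return subgraphs whose similarity to the preset vector exceeds the threshold."""
    return [
        name
        for name, vec in subgraph_vectors.items()
        if cosine(vec, preset) > first_threshold
    ]


# Hypothetical preset attack vector and subgraph vectors.
preset_attack = [1.0, 0.0]
subgraph_vectors = {
    "sg1": [0.9, 0.1],  # close to the preset vector
    "sg2": [0.0, 1.0],  # orthogonal to the preset vector
    "sg3": [1.0, 1.0],  # similarity about 0.71
}
flagged = match_preset(subgraph_vectors, preset_attack, first_threshold=0.8)
```

Here only `sg1` exceeds the threshold of 0.8, so only its behavior would be reported as matching the preset attack behavior.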

[0113] In another example, the similarity between the vector representations of any two subgraphs in multiple subgraphs can be determined, and the attack behavior can be determined based on the two vector representations whose similarity is less than a second preset similarity.

[0114] It can be understood that the similarity between the vector representation corresponding to an attack behavior and the vector representation corresponding to other normal behaviors (or non-attack behaviors) is less than the similarity between the vector representations corresponding to two normal behaviors. Thus, it can be determined that one of the two vector representations with a smaller similarity is the vector representation corresponding to an attack behavior.

[0115] For example, an attribute heterogeneous graph corresponds to five subgraphs, denoted subgraph 1, subgraph 2, subgraph 3, subgraph 4, and subgraph 5. If the similarities between the vector representations of subgraph 1 and subgraph 2, subgraph 1 and subgraph 3, subgraph 1 and subgraph 4, and subgraph 1 and subgraph 5 are all less than the second preset similarity, while the similarity between the vector representations of any two of subgraphs 2, 3, 4, and 5 is greater than or equal to the second preset similarity, then the behavior indicated by subgraph 1 can be determined to be an attack behavior.
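The pairwise comparison in this example can be sketched as follows. The decision rule used here, flagging a subgraph only if its similarity to every other subgraph is below the second preset similarity, is one reading of the example and is an assumption. Cosine similarity, the function names, the threshold, and the sample vectors are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def isolated_subgraphs(
    subgraph_vectors: Dict[str, Sequence[float]],
    second_threshold: float,
) -> List[str]:
    """Return subgraphs that are dissimilar to every other subgraph."""
    names = list(subgraph_vectors)
    flagged = []
    for a in names:
        others = [b for b in names if b != a]
        # Assumed rule: all pairwise similarities must fall below the threshold.
        if others and all(
            cosine(subgraph_vectors[a], subgraph_vectors[b]) < second_threshold
            for b in others
        ):
            flagged.append(a)
    return flagged


# Hypothetical vectors: subgraph 1 differs from the other four.
five_subgraphs = {
    "subgraph1": [0.0, 1.0],
    "subgraph2": [1.0, 0.0],
    "subgraph3": [0.9, 0.1],
    "subgraph4": [1.0, 0.1],
    "subgraph5": [0.95, 0.05],
}
attacks = isolated_subgraphs(five_subgraphs, second_threshold=0.9)
```

With these sample vectors, subgraphs 2 to 5 are mutually similar, and only `subgraph1` is flagged.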

[0116] Of course, this application may also have other ways to determine the attack behavior from the behavior indicated by the one or more subgraphs, without limitation.

[0117] In this application, the similarity between two vector representations can be determined by calculating cosine similarity, as detailed in the formula:

[0118]

[0119] Where F_m and F_n are the vector representations of any two subgraphs, and e_i and e_j are the vector representations of the elements in those subgraphs.
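For reference, the standard cosine similarity between two subgraph vectors can be written as follows. This is a sketch under assumptions: the symbol S, the reading of F_m and F_n as the importance-weighted sums of Step 602, and the use of m and n to denote the element sets of the two subgraphs are not taken from the original formula.

```latex
\[
S(F_m, F_n) = \frac{F_m \cdot F_n}{\lVert F_m \rVert \, \lVert F_n \rVert},
\qquad
F_m = \sum_{e_i \in m} \mathrm{IDF}(e_i)\, e_i,
\quad
F_n = \sum_{e_j \in n} \mathrm{IDF}(e_j)\, e_j
\]
```

The value lies in [-1, 1]; a larger value indicates that the two subgraphs represent more similar behaviors.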

[0120] In the above technical solution, multiple triples are extracted from the host logs. Based on the source nodes, target nodes, and edges in the triples, an attribute heterogeneous graph of the host logs is determined. Then, model training is performed based on the attribute heterogeneous graph to obtain a semantic reasoning model. Next, one or more subgraphs are obtained based on the semantic reasoning model and the attribute heterogeneous graph, where each subgraph indicates a behavior. Finally, the attack behavior is determined from the behaviors corresponding to the one or more subgraphs.

[0121] Therefore, by using attribute heterogeneous graphs to model host logs, the contextual semantics of behaviors within the host logs can be effectively analyzed, and heterogeneous data of various types can be integrated. Furthermore, a semantic reasoning model is obtained through model training using the attribute heterogeneous graph. Based on this model, the attribute heterogeneous graph is divided into one or more subgraphs, thereby enabling semantic extraction and vector representation of the host log context. Further, based on the vector representations of each subgraph, attack behaviors can be effectively identified. Thus, while improving the accuracy of attack behavior identification, labor costs are reduced.

[0122] Based on the above content and the same concept, Figure 7 and Figure 8 are schematic diagrams of possible apparatuses provided in this application. These apparatuses can be used to implement the above-described method embodiments, and therefore can also achieve the beneficial effects of the above-described method embodiments.

[0123] As shown in Figure 7, the device includes: an extraction module 701, used to extract multiple triples from host logs, wherein each triple includes a source node, a target node, and an edge, the edge indicating the operation between the source node and the target node; and a processing module 702, used to determine an attribute heterogeneous graph of the host logs based on the source nodes, target nodes, and edges in the multiple triples; to perform model training based on the attribute heterogeneous graph to obtain a semantic reasoning model; to obtain one or more subgraphs based on the semantic reasoning model and the attribute heterogeneous graph; and to determine attack behaviors from the behaviors corresponding to the one or more subgraphs.

[0124] In one possible implementation, the host log includes multiple types of logs, each type corresponding to the keyword of the source node and the keyword of the target node; the extraction module 701 is specifically used to: for any type of log: extract the source node from the log based on the keyword of the source node corresponding to the type; extract the target node from the log based on the keyword of the target node corresponding to the type; determine the edge based on the operation between the source node and the target node in the log; and form a triple of the log by combining the source node, the edge, and the target node.
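The keyword-based extraction can be sketched as follows. The log layout (a dictionary of fields), the field names, the `type` and `operation` keys, and the keyword table are all hypothetical; this application does not fix a concrete log format in this passage.

```python
from typing import Dict, NamedTuple, Tuple


class Triple(NamedTuple):
    source: str
    edge: str
    target: str


# Hypothetical table: log type -> (source-node keyword, target-node keyword).
KEYWORDS: Dict[str, Tuple[str, str]] = {
    "process": ("parent_image", "image"),
    "file": ("image", "target_filename"),
}


def extract_triple(log: Dict[str, str]) -> Triple:
    """Build one (source, edge, target) triple from a parsed log record."""
    source_key, target_key = KEYWORDS[log["type"]]
    source = log[source_key]   # source node, found by its keyword
    target = log[target_key]   # target node, found by its keyword
    edge = log["operation"]    # operation between the two nodes
    return Triple(source, edge, target)


# Hypothetical parsed log records.
file_log = {
    "type": "file",
    "image": "gcc",
    "operation": "read",
    "target_filename": "main.c",
}
process_log = {
    "type": "process",
    "parent_image": "bash",
    "operation": "exec",
    "image": "gcc",
}
triples = [extract_triple(process_log), extract_triple(file_log)]
```

Keeping the keywords in a per-type table means that supporting a new log type only requires adding one entry rather than changing the extraction logic.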

[0125] In one possible implementation, the processing module 702 is specifically used to: perform vector initialization for the source node, edge, and target node corresponding to each triple in the attribute heterogeneous graph to obtain the vector representation corresponding to the triple, thereby obtaining the vector representation corresponding to the plurality of triples; and perform model training based on the vector representation corresponding to the plurality of triples to obtain the semantic reasoning model.

[0126] In one possible implementation, the processing module 702 is specifically used to: perform cluster analysis on multiple nodes in the attribute heterogeneous graph according to the semantic reasoning model to obtain one or more node sets corresponding to the attribute heterogeneous graph; and form a subgraph corresponding to the node set by combining multiple nodes in each node set and the edges corresponding to each node, thereby obtaining one or more subgraphs corresponding to the attribute heterogeneous graph.
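The step of forming subgraphs from node sets can be sketched as follows. The clustering itself is taken as given: the node sets are supplied as input. Treating "the edges corresponding to each node" as the edges whose source and target both lie in the node set is an assumption, and the function name and sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

TripleT = Tuple[str, str, str]  # (source node, edge, target node)


def subgraphs_from_node_sets(
    triples: Iterable[TripleT],
    node_sets: Iterable[Set[str]],
) -> List[List[TripleT]]:
    """For each node set, keep the triples whose endpoints are both in the set."""
    triple_list = list(triples)
    return [
        [t for t in triple_list if t[0] in nodes and t[2] in nodes]
        for nodes in node_sets
    ]


# Hypothetical graph and clustering result.
graph_triples = [
    ("bash", "exec", "gcc"),
    ("gcc", "read", "main.c"),
    ("bash", "exec", "curl"),
    ("curl", "write", "payload.bin"),
]
node_sets = [{"bash", "gcc", "main.c"}, {"curl", "payload.bin"}]
subgraphs = subgraphs_from_node_sets(graph_triples, node_sets)
```

Under this assumption, the edge from `bash` to `curl` crosses two node sets and is therefore dropped; an implementation that keeps every edge incident to a node in the set would retain it.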

[0127] In one possible implementation, before performing clustering analysis on multiple nodes in the attribute heterogeneous graph according to the semantic reasoning model, the processing module 702 is further configured to: filter the edges in the attribute heterogeneous graph that do not conform to the temporal sequence relationship and the target nodes corresponding to the edges according to the timestamps corresponding to the multiple edges in the attribute heterogeneous graph, so as to obtain the filtered attribute heterogeneous graph.

[0128] In one possible implementation, the processing module 702 is specifically configured to: determine the importance of each element in the attribute heterogeneous graph based on the number of subgraphs corresponding to the attribute heterogeneous graph and the frequency of occurrence of each element in the attribute heterogeneous graph, wherein the element is a source node, edge, or target node; for any subgraph, determine the vector representation corresponding to the subgraph based on the importance and vector representation of each element contained in the subgraph, wherein the vector representation corresponding to the subgraph is used to indicate the behavior corresponding to the subgraph; and determine the attack behavior based on the vector representations corresponding to the one or more subgraphs.

[0129] In one possible implementation, the processing module 702 is specifically used to: select a vector representation from the vector representations corresponding to the one or more subgraphs that has a similarity greater than a first preset similarity to a preset vector representation, for indicating the attack behavior; and / or determine the similarity between the vector representations corresponding to any two subgraphs in the plurality of subgraphs, and determine the attack behavior based on the two vector representations with a similarity less than a second preset similarity.

[0130] Figure 8 shows the apparatus 800 provided in an embodiment of this application. The device shown in Figure 8 can be a hardware-circuit implementation of the device shown in Figure 7.

[0131] For ease of explanation, Figure 8 shows only the main components of the device.

[0132] The device 800 shown in Figure 8 includes a communication interface 810, a processor 820, and a memory 830, wherein the memory 830 is used to store program instructions and/or data. The processor 820 may operate in conjunction with the memory 830. The processor 820 may execute the program instructions stored in the memory 830. When the instructions or program stored in the memory 830 are executed, the processor 820 is used to perform the operations performed by the processing module 702 in the above embodiments, and the communication interface 810 is used to perform the operations performed by the extraction module 701 in the above embodiments.

[0133] The memory 830 and the processor 820 are coupled. The coupling in this embodiment is an indirect coupling or communication connection between devices, units, or modules, and can be electrical, mechanical, or other forms, used for information exchange between devices, units, or modules. At least one of the memories 830 may be included in the processor 820.

[0134] In this embodiment, the communication interface can be a transceiver, circuit, bus, module, or other type of communication interface. In this embodiment, when the communication interface is a transceiver, the transceiver can include an independent receiver, an independent transmitter, or a transceiver integrating transceiver functions, or simply a communication interface.

[0135] The device 800 may also include a communication line 840. The communication interface 810, processor 820, and memory 830 can be interconnected via the communication line 840. The communication line 840 can be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication line 840 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, Figure 8 represents the bus with a single thick line, but this does not mean that there is only one bus or one type of bus.

[0136] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the scope of protection of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A method for identifying attack behavior, characterized by comprising: extracting multiple triples from host logs, wherein each triple includes a source node, a target node, and an edge, and the edge is used to indicate the operation between the source node and the target node; determining an attribute heterogeneous graph of the host logs based on the source nodes, target nodes, and edges in the multiple triples; performing model training based on the attribute heterogeneous graph to obtain a semantic reasoning model; obtaining one or more subgraphs based on the semantic reasoning model and the attribute heterogeneous graph; and determining the attack behavior from the behaviors corresponding to the one or more subgraphs; wherein obtaining one or more subgraphs based on the semantic reasoning model and the attribute heterogeneous graph includes: performing cluster analysis on multiple nodes in the attribute heterogeneous graph based on the semantic reasoning model to obtain one or more node sets corresponding to the attribute heterogeneous graph, wherein the semantic reasoning model is used to indicate that two nodes with a semantic association are closer in the vector space than two nodes without a semantic association; and combining the multiple nodes in each node set and the edges corresponding to each node to form a subgraph corresponding to the node set, thereby obtaining the one or more subgraphs corresponding to the attribute heterogeneous graph.

2. The method as described in claim 1, characterized in that the host logs include multiple types of logs, each type corresponding to a keyword of the source node and a keyword of the target node; and extracting multiple triples from the host logs includes, for any type of log: extracting the source node from the log based on the keyword of the source node corresponding to the type; extracting the target node from the log based on the keyword of the target node corresponding to the type; determining the edge based on the operation between the source node and the target node in the log; and combining the source node, the edge, and the target node to form a triple of the log.

3. The method as described in claim 1, characterized in that performing model training based on the attribute heterogeneous graph to obtain a semantic reasoning model includes: for the source node, edge, and target node corresponding to each triple in the attribute heterogeneous graph, performing vector initialization to obtain the vector representation corresponding to the triple, thereby obtaining the vector representations corresponding to the multiple triples; and performing model training based on the vector representations corresponding to the multiple triples to obtain the semantic reasoning model.

4. The method as described in claim 1, characterized in that, before performing cluster analysis on multiple nodes in the attribute heterogeneous graph based on the semantic reasoning model, the method further includes: based on the timestamps corresponding to multiple edges in the attribute heterogeneous graph, filtering out the edges in the attribute heterogeneous graph that do not conform to the temporal relationship, together with the target nodes corresponding to those edges, to obtain the filtered attribute heterogeneous graph.

5. The method according to any one of claims 1 to 4, characterized in that determining the attack behavior from the behaviors corresponding to the one or more subgraphs includes: determining the importance of each element in the attribute heterogeneous graph based on the number of subgraphs corresponding to the attribute heterogeneous graph and the frequency of occurrence of each element in the attribute heterogeneous graph, wherein the element is a source node, an edge, or a target node; for any subgraph, determining the vector representation corresponding to the subgraph based on the importance and vector representation of each element contained in the subgraph, wherein the vector representation corresponding to the subgraph is used to indicate the behavior corresponding to the subgraph; and determining the attack behavior based on the vector representations corresponding to the one or more subgraphs.

6. The method as described in claim 5, characterized in that determining the attack behavior based on the vector representations corresponding to the one or more subgraphs includes: selecting, from the vector representations corresponding to the one or more subgraphs, a vector representation whose similarity to a preset vector representation is greater than a first preset similarity, to indicate the attack behavior; and/or determining the similarity between the vector representations corresponding to any two subgraphs in the plurality of subgraphs, and determining the attack behavior based on the two vector representations whose similarity is less than a second preset similarity.

7. A device for identifying attack behavior, characterized by comprising a module for performing the method as described in any one of claims 1 to 6.

8. A computing device, characterized in that the device includes a processor connected to a memory, the memory is used to store a computer program, and the processor is configured to execute the computer program stored in the memory to cause the computing device to perform the method of any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program or instructions that, when executed by a device, implement the method as described in any one of claims 1 to 6.