Script file processing method, electronic device, and storage medium

By obtaining the function call tree topology feature vector and variant confidence of Webshell files, and comprehensively measuring similarity, the problem of inaccurate classification of Webshell files is solved, achieving effective defense against Webshell attacks and improving network security.

CN116186714B · Active · Publication Date: 2026-06-19 · ALIBABA CLOUD COMPUTING CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ALIBABA CLOUD COMPUTING CO LTD
Filing Date: 2023-02-21
Publication Date: 2026-06-19

IPC: G06F21/57; G06F18/22; G06F18/2433; G06F18/241

AI Tagging

Application Domain

Platform integrity maintenance

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately measure the similarity between Webshell files, leading to inaccurate classification, difficulty in effectively defending against Webshell attacks, and impacting network security.

Method used

By obtaining the topological feature vector of the function call tree of the script file and combining it with the variant confidence, the similarity between script files is comprehensively measured, including topological similarity and variant confidence, so as to accurately classify script files.

Benefits of technology

It improves the accuracy of Webshell file classification, effectively defends against Webshell attacks, and ensures network security.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a script file processing method, an electronic device, and a storage medium. In this application, considering the complex structure and numerous variations of script files, both the topological similarity between the function call trees of two script files and the variant confidence score reflecting whether the topological structure of the function call trees has changed are considered. Therefore, by combining the topological similarity and variant confidence scores of the function call trees, the similarity between script files is measured more accurately, providing a reliable foundation for accurate script file classification. In particular, this is beneficial for effectively defending against various script attacks such as Webshell attacks, thus ensuring network security.

Description

Technical Field

[0001] This application relates to the field of network security technology, and in particular to a script file processing method, electronic device, and storage medium.

Background Technology

[0002] A webshell (website backdoor) file is a script program that can be executed remotely on a web server. Because webshells are flexible to write, easy to use, and powerful, they are often modified by attackers and used as intrusion tools against web servers. Attackers exploit website vulnerabilities to upload malicious webshell files to the web server's page directory, then access the pages to gain control of the web server, thus negatively impacting network security.

[0003] To effectively defend against webshell attacks, accurate classification of webshell files is crucial. Classifying webshell files involves first determining the similarity between each pair of files. Then, files with high similarity are grouped into the same category, while those with low similarity are grouped into different categories. Clearly, if the determined similarity scores are inaccurate, the classification of webshell files will be inaccurate, making it difficult to effectively defend against webshell attacks and negatively impacting network security.

Summary of the Invention

[0004] This application provides a script file processing method, an electronic device, and a storage medium for accurately determining the similarity between script files.

[0005] This application provides a script file processing method, including: obtaining a first function call tree of a first script file and a second function call tree of a second script file; determining the topological similarity between the first and second function call trees based on their respective topological feature vectors; determining the variant confidence level between the first and second function call trees, reflecting the variation of the topological structure, based on the tree structure information of the remaining tree structure, wherein the remaining tree structure refers to the tree structure in the first and second function call trees excluding the common subtrees; and determining the similarity between the first and second script files based on the topological similarity and the variant confidence level.

[0006] This application also provides a script file processing method applied to a cloud server, comprising: acquiring multiple script files sent by an application server; acquiring a first function call tree of the first script file and a second function call tree of the second script file for any first script file and a second script file among the multiple script files; determining the topological similarity between the first function call tree and the second function call tree based on their respective topological feature vectors; determining the variant confidence between the first function call tree and the second function call tree based on the tree structure information of the remaining tree structure, wherein the remaining tree structure refers to the tree structure of the first function call tree and the second function call tree excluding the common subtrees, and the variant confidence represents the confidence that a variant has occurred between the topological structures of the first function call tree and the second function call tree; determining the similarity between the first script file and the second script file based on the topological similarity and the variant confidence; and classifying the multiple script files according to the similarity between any two script files among the multiple script files to obtain a script file classification result.

[0007] This application also provides an electronic device, including: a memory and a processor; the memory for storing a computer program; and the processor coupled to the memory for executing the computer program to perform steps in a script file processing method.

[0008] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, enables the processor to implement the steps in the script file processing method.

[0009] In this embodiment, considering the complex structure and numerous variations of script files, both the topological similarity between the function call trees of two script files and the variant confidence, which reflects whether the topological structure of the function call trees has changed, are considered. Thus, by combining the topological similarity between function call trees and the variant confidence, the similarity between script files can be measured more accurately, providing a reliable foundation for accurate script file classification. In particular, it is beneficial for effectively defending against various script attacks such as Webshell attacks and ensuring network security.

Attached Figure Description

[0010] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0011] Figure 1 is an exemplary application scenario diagram provided by an embodiment of this application;

[0012] Figure 2 is a flowchart of a script file processing method provided by an embodiment of this application;

[0013] Figure 3 shows exemplary cyclic and acyclic function call trees provided by an embodiment of this application;

[0014] Figure 4 is an exemplary heat map provided by an embodiment of this application;

[0015] Figure 5 is a flowchart of another script file processing method provided by an embodiment of this application;

[0016] Figure 6 is a schematic structural diagram of a script file processing device provided by an embodiment of this application;

[0017] Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of this application.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0019] In the embodiments of this application, "at least one" refers to one or more, and "multiple" refers to two or more. "And / or" describes the association relationship between associated objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone, where A and B can be singular or plural. In the textual description of this application, the character " / " generally indicates that the preceding and following associated objects have an "or" relationship. Furthermore, in the embodiments of this application, "first," "second," "third," etc., are only used to distinguish different objects and have no other special meaning.

[0020] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.

[0021] In practical applications, when measuring the similarity between two webshell files, the first step is to extract the sequence of function calls made during execution from each webshell file. This sequence contains multiple function calls arranged in chronological order. Next, the similarity between the two function call sequences is determined; this similarity is called sequence similarity. Finally, the sequence similarity is used as the similarity between the webshell files. However, webshell files are characterized by complex structures and numerous variations. Complex structure refers to the complexity of the function call tree topology, and numerous variations refer to the variety of topological structures within the webshell files. Simply using sequence similarity as the similarity metric between webshell files is insufficient to distinguish the many variations. Therefore, measuring the similarity of webshell files is a challenging task.

[0022] Based on the above, embodiments of this application provide a script file processing method, an electronic device, and a storage medium. In these embodiments, considering the complex structure and numerous variations of script files, both the topological similarity between the function call trees of two script files and the variant confidence score reflecting whether the topological structure of the function call trees has changed are considered. Therefore, by combining the topological similarity and variant confidence scores between function call trees, the similarity between script files is measured more accurately, providing a reliable foundation for accurate script file classification. In particular, this is beneficial for effectively defending against various script attacks such as Webshell attacks, thus ensuring network security.

[0023] Figure 1 is an exemplary application scenario diagram provided by an embodiment of this application. As shown in ① and ② of Figure 1, a legitimate visitor uploads a legitimate webshell file to the web server, while an attacker tampers with the legitimate webshell file on the web server and replaces it with a malicious one. Thus, the webshell files on the web server may include both legitimate and malicious files. As shown in ③ of Figure 1, the web server uploads multiple webshell files to the cloud server. As shown in ④, ⑤, and ⑥ of Figure 1, the cloud server first determines the similarity between Webshell files, then classifies the Webshell files based on their similarity, and finally performs anomaly detection on each category of Webshell files to determine their status labels. The status labels indicate whether the Webshell files are abnormal or normal, thus providing a foundation for effective defense against Webshell attacks.

[0024] In the abnormal file detection phase, as shown in ⑦ of Figure 1, the web server sends an anomaly detection request to the cloud server for the webshell file to be identified. The cloud server responds to the anomaly detection request and determines the anomaly detection result of the webshell file to be identified. Specifically, the cloud server first calculates the similarity between the webshell file to be identified and the webshell files of each category to determine the category to which the file belongs. If the status label of that category is "abnormal," the anomaly detection result of the webshell file to be identified is determined to be abnormal; if the status label of that category is "normal," the anomaly detection result is determined to be normal. As shown in ⑨ and ⑩ of Figure 1, the cloud server returns the anomaly detection result to the Web server. If the Webshell file to be identified is abnormal, the Web server deletes it and intercepts access requests initiated by the attacker who tampered with the file, in order to defend against Webshell attacks.
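The category lookup described above can be sketched in Python as follows. This is only an illustrative sketch: the function names `detect` and `toy_similarity` are hypothetical, the token-overlap similarity is a placeholder standing in for the similarity measure of this application, and taking the best-matching member of each category as the category score is an assumption, since the text does not state how per-category similarity is aggregated.

```python
from typing import Callable, Dict, List


def toy_similarity(a: str, b: str) -> float:
    """Placeholder similarity (assumption): Jaccard overlap of whitespace tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def detect(
    candidate: str,
    categories: Dict[str, List[str]],
    labels: Dict[str, str],
    similarity: Callable[[str, str], float] = toy_similarity,
) -> str:
    """Return the status label of the category most similar to `candidate`.

    Each category is assumed to be non-empty. The category score is the
    highest similarity to any member (an assumption of this sketch).
    """
    best_category = max(
        categories,
        key=lambda name: max(similarity(candidate, m) for m in categories[name]),
    )
    return labels[best_category]


# Hypothetical data: two categories that already carry status labels.
categories = {"c1": ["eval base64_decode"], "c2": ["echo date"]}
labels = {"c1": "abnormal", "c2": "normal"}
result = detect("eval base64_decode system", categories, labels)
```

In a real deployment the `similarity` argument would be replaced by the combined measure built from topological similarity and variant confidence.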

[0025] In the scenario shown in Figure 1, the cloud server combines the topological similarity and the variant confidence between function call trees to accurately measure the similarity between Webshell files. Since the similarity measurement between Webshell files is relatively accurate, Webshell files can be accurately classified based on their similarity, thus providing a reliable foundation for effective defense against Webshell attacks and ensuring network security.

[0026] It is worth noting that the application scenario shown in Figure 1 is merely exemplary, and the embodiments of this application do not limit the application scenario.

[0027] The technical solutions provided by the various embodiments of this application are described in detail below with reference to the accompanying drawings.

[0028] Figure 2 is a flowchart of a script file processing method provided by an embodiment of this application. The method can be executed by a script file processing device, which may consist of software and / or hardware, and is generally configured in an electronic device. Referring to Figure 2, the method may include the following steps:

[0029] 201. Obtain the first function call tree of the first script file and the second function call tree of the second script file.

[0030] 202. Determine the topological similarity between the first function call tree and the second function call tree based on their respective topological feature vectors.

[0031] 203. Based on the tree structure information of the remaining tree structure, determine the variant confidence between the first function call tree and the second function call tree that reflects the variation of the topology. The remaining tree structure refers to the tree structure in the first function call tree and the second function call tree excluding the same subtrees.

[0032] 204. Determine the similarity between the first script file and the second script file based on the topological similarity and variant confidence.

[0033] In this embodiment, there are no restrictions on the type of script file. Script files include, but are not limited to, Webshell files, PHP (Hypertext Preprocessor) files, and ASP (Active Server Pages) files.

[0034] In this embodiment, for any two script files that need to be measured for similarity, for ease of understanding and distinction, one script file is referred to as the first script file and the other as the second script file. It is worth noting that the two script files can be the same or different script files; there is no restriction on this.

[0035] First, extract at least one function call record from the script file. From this record, extract the calling and called functions for each function call. Then, construct a function call tree for the script file based on these records. The function call tree describes the call relationship between the calling and called functions. It consists of multiple nodes and edges connecting them. Each node corresponds to a calling or called function, and edges connect the nodes corresponding to the calling and called functions that have a call relationship.

[0036] The MD5 message-digest algorithm is a widely used cryptographic hash function that produces a 128-bit hash value to ensure the integrity and consistency of transmitted information. Because the MD5 algorithm has low complexity and is irreversible, and is universal, stable, and fast, in this embodiment, the MD5 algorithm can also be used to determine the MD5 value of each script file, thus distinguishing different script files.
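As a minimal sketch of this step, the MD5 value of a script file can be computed with Python's standard `hashlib` module. The function name `md5_of_script` is hypothetical and introduced only for illustration.

```python
import hashlib


def md5_of_script(content: bytes) -> str:
    """Return the 128-bit MD5 digest of a script file as 32 hex characters."""
    return hashlib.md5(content).hexdigest()


# Two script files with different content receive different identifiers.
id_a = md5_of_script(b"<?php echo 'hello'; ?>")
id_b = md5_of_script(b"<?php echo 'world'; ?>")
```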

[0037] Optionally, to accurately construct the function call tree, data cleaning can be performed on at least one function call data point before constructing the function call tree corresponding to the script file based on at least one function call data point. Data cleaning can remove some dirty data, delete duplicate information, correct existing errors, and ensure data consistency. In this embodiment, data cleaning can be used to remove invalid and duplicate function call data. Invalid function call data is, for example, function call data where no function was called.
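The cleaning step might look like the following sketch. It assumes, for illustration only, that each function call record is a `(caller, callee)` pair and that a record with a missing caller or callee counts as invalid; the name `clean_call_records` is hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed record shape: (calling function name, called function name).
CallRecord = Tuple[Optional[str], Optional[str]]


def clean_call_records(records: List[CallRecord]) -> List[Tuple[str, str]]:
    """Drop invalid records (no function called) and exact duplicates.

    The order of first occurrence is preserved.
    """
    seen = set()
    cleaned = []
    for caller, callee in records:
        if not caller or not callee:
            continue  # invalid: no caller or no function was called
        key = (caller, callee)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw = [("main", "init"), ("main", "init"), ("main", None), ("", "x"), ("init", "eval")]
cleaned = clean_call_records(raw)
```

If repeated calls are meant to contribute to the number of function calls on an edge, they would need to be counted before duplicates are dropped; the text does not specify this, so the sketch leaves it out.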

[0038] In practical applications, when creating a function call tree, the main function (which serves as the entry point) is set as the root node. The extracted functions are then treated as nodes within the tree, and an edge is created between nodes corresponding to functions with a calling relationship. In practice, the called function might be a self-calling function, a recursively called function, or a child function of multiple parent functions. This can easily lead to cycles in the function call tree, which is detrimental to measuring the similarity of the function call tree's topology.

[0039] Therefore, to avoid cycles in the function call tree, when creating the function call tree corresponding to the script file, each called function is abstracted as a node. For example, at least one function call data can be extracted from the script file, reflecting the calling relationship between the calling function and the called function. Based on the at least one function call data, at least one calling function and at least one called function called by each calling function are determined. The at least one calling function and the at least one called function called by each calling function are abstracted as nodes of the first function call tree, and edges are created connecting the nodes corresponding to the calling functions and the called functions that have a calling relationship.

[0040] Take Figure 3 as an example. Figure 3(a) shows a cyclic function call tree, and Figure 3(b) shows an acyclic function call tree. In Figure 3(a), the calling function corresponding to node 6 is a self-calling function. The functions corresponding to nodes 5, 7, and 8 also have calling relationships with the functions corresponding to nodes 1 and 6, respectively. As a result, the function call tree in Figure 3(a) contains cycles. In Figure 3(b), for cases where the called function is a self-calling function, a recursively called function, or a child function of multiple parent functions, a node with the same name is added to the function call tree. The nodes with the same name in Figure 3(b) include node 5, node 6, node 7, and node 8. An edge connects the node of the calling function to the newly added same-name node of the called function, which avoids cycles in the function call tree.
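One possible way to realize the same-name node idea is sketched below. It is an illustration under stated assumptions rather than the construction used in this application: call records are assumed to be `(caller, callee)` pairs, the entry function is assumed to be named `main`, and a function that has already been expanded elsewhere in the tree is added as a same-name leaf instead of being linked back to its existing node. The name `build_acyclic_call_tree` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def build_acyclic_call_tree(
    calls: List[Tuple[str, str]], root: str = "main"
) -> Tuple[Dict[int, str], List[Tuple[int, int]]]:
    """Build a call tree in which every called function gets its own node.

    Returns (names, edges): names maps node id -> function name, and
    edges lists (parent id, child id) pairs.
    """
    callees = defaultdict(list)
    for caller, callee in calls:
        if callee not in callees[caller]:
            callees[caller].append(callee)

    names = {0: root}
    edges = []
    expanded = {root}  # function names whose callees were already expanded
    queue = deque([0])
    while queue:
        parent = queue.popleft()
        for callee in callees.get(names[parent], []):
            child = len(names)  # fresh node id, even if the name repeats
            names[child] = callee
            edges.append((parent, child))
            if callee not in expanded:
                expanded.add(callee)
                queue.append(child)
            # Otherwise the same-name node stays a leaf, so self-calls,
            # recursion, and shared callees cannot create a cycle.
    return names, edges


# "c" is called by both "a" and "b" and also calls itself.
calls = [("main", "a"), ("main", "b"), ("a", "c"), ("b", "c"), ("c", "c")]
names, edges = build_acyclic_call_tree(calls)
```

Because every edge introduces a new node id, each node has exactly one parent and the result has one fewer edge than nodes, which is the defining property of a tree.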

[0041] In this embodiment, assuming the function call tree is denoted as T, the node as V, and the edge as E, the function call tree can be labeled as T = {V, E}. Node attributes include, but are not limited to: function name, function category, sensitivity attribute, and node weight. Assuming the function name is denoted as name, the function category as type, the sensitivity attribute as sensitive, and the node weight as pointWeight, the node V can be labeled as V = {name, type, sensitive, pointWeight}. Here, the function name refers to the name of the function corresponding to the node; the function category can be defined according to the function's purpose, such as the main function, initialization function, printing function, and specific function implementing technical logic; the sensitivity attribute is used to indicate whether the function is sensitive. For example, if the sensitivity attribute value is 0, the function is sensitive; if the sensitivity attribute value is 1, the function is not sensitive. Whether a function is sensitive refers to whether the function involves data security operations. If the function involves data security operations, then the function is a sensitive function; if it does not, then the function is an insensitive function. The node weight is used to reflect the importance of the node.

[0042] In this embodiment, the edge attributes include, but are not limited to, the number of function calls. The number of function calls refers to the number of times the calling function calls the called function, for the two nodes connected by the edge. Assuming the number of function calls is denoted as frequency, the edge can be labeled as E = {V_i, V_j, frequency}, where V_i and V_j are the identifiers of the nodes connected by the edge.
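The node and edge notation above maps naturally onto small record types. The following Python sketch is one possible representation; the class names `Node`, `Edge`, and `CallTree` and the snake-case field names are choices made for illustration, and the sample attribute values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str            # function name
    type: str            # function category, e.g. "main" or "print"
    sensitive: int       # sensitivity attribute, as defined in the text
    point_weight: float  # node weight (pointWeight in the text)


@dataclass
class Edge:
    caller: int     # identifier of the calling node (V_i)
    callee: int     # identifier of the called node (V_j)
    frequency: int  # number of times the caller calls the callee


@dataclass
class CallTree:
    nodes: List[Node] = field(default_factory=list)  # V
    edges: List[Edge] = field(default_factory=list)  # E


# Hypothetical two-node tree: main calls eval three times.
tree = CallTree()
tree.nodes.append(Node(name="main", type="main", sensitive=1, point_weight=1.0))
tree.nodes.append(Node(name="eval", type="exec", sensitive=0, point_weight=2.0))
tree.edges.append(Edge(caller=0, callee=1, frequency=3))
```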

[0043] For ease of understanding and distinction, the function call tree corresponding to the first script file is referred to as the first function call tree, and the one corresponding to the second script file is referred to as the second function call tree. The first function call tree is created based on at least one function call data extracted from the first script file, and the second function call tree is created based on at least one function call data extracted from the second script file. The creation process of the first and second function call trees can be found in the aforementioned creation process of the function call tree, and will not be repeated here.

[0044] In this embodiment, the topological feature vector representing the topological structure information of the function call tree can be extracted using the node attributes and / or edge attributes of several nodes in the function call tree. Several exemplary methods for extracting the topological feature vector of a function call tree are described below.

[0045] Method 1: For any node in at least a subset of nodes included in any function call tree, determine the node feature vector based on at least one of the corresponding sensitive attributes and node weights; perform fusion processing on the node feature vectors of at least a subset of nodes included in the function call tree to obtain the topological feature vector of the function call tree.

[0046] In this embodiment, sensitive attributes or node weights can be used as node feature vectors, or various operations such as addition, subtraction, and difference can be performed on sensitive attributes and node weights to obtain node feature vectors.

[0047] In this embodiment, the node feature vectors of all or some of the nodes in the function call tree can be fused to obtain the topological feature vector of the function call tree. The fusion processing methods include, but are not limited to, concatenation, element-wise product, and element-wise sum.
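Method 1 can be sketched as follows. The sketch assumes, for illustration, that each node feature vector is the pair (sensitive, pointWeight); the text equally allows either attribute alone or an arithmetic combination of the two. All function names are hypothetical.

```python
from typing import List, Sequence


def node_feature(sensitive: int, point_weight: float) -> List[float]:
    """Node feature vector built from the two attributes (an assumed layout)."""
    return [float(sensitive), float(point_weight)]


def fuse_concat(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Fusion by concatenation."""
    return [x for vec in vectors for x in vec]


def fuse_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Fusion by element-wise sum (vectors must share one length)."""
    return [sum(column) for column in zip(*vectors)]


def fuse_product(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Fusion by element-wise product (vectors must share one length)."""
    result = []
    for column in zip(*vectors):
        value = 1.0
        for x in column:
            value *= x
        result.append(value)
    return result


features = [node_feature(0, 0.5), node_feature(1, 2.0)]
tvec_concat = fuse_concat(features)
tvec_sum = fuse_sum(features)
tvec_product = fuse_product(features)
```

Concatenation keeps per-node information but makes the vector length depend on the number of nodes, while the element-wise variants yield a fixed-length vector.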

[0048] Method 2: For any function call tree, perform hash operations on the function categories of at least some nodes included in the function call tree to obtain multiple hash values, and determine the quantity of each hash value; for any node among the nodes included in the function call tree, process the quantity of hash values of the node's category using the node's sensitivity attributes and / or node weights to determine the node's node feature vector; fuse the node feature vectors of at least some nodes included in the function call tree to obtain the topological feature vector of the function call tree.

[0049] Specifically, a hash function is used to hash the function category of any node in at least some of the nodes, resulting in a hash value. Thus, in the case of multiple nodes, multiple hash values can be obtained. The number of occurrences of each resulting hash value is then counted. For example, suppose the hash value of node 1 is 1 and the hash value of node 2 is 1.

[0050] Suppose further that the hash value of node 3 is 2, the hash value of node 4 is 3, the hash value of node 5 is 4, and the hash value of node 6 is 4. Counting shows that hash value 1 occurs 2 times, hash value 2 occurs 1 time, hash value 3 occurs 1 time, and hash value 4 occurs 2 times. The number of hash values (1) for the category of node 1 is 2, the number of hash values (1) for the category of node 2 is 2, the number of hash values (2) for the category of node 3 is 1, the number of hash values (3) for the category of node 4 is 1, the number of hash values (4) for the category of node 5 is 2, and the number of hash values (4) for the category of node 6 is 2. When determining the node feature vector of any node, the number of hash values corresponding to that node can be weighted using the node's corresponding sensitive attributes and / or node weights to obtain the node feature vector. Of course, when weighting the number of hash values corresponding to a node by combining the node's corresponding sensitive attributes and node weight, various operations such as addition, subtraction, and difference can be performed on the node's corresponding sensitive attributes and node weight to obtain a weight coefficient. The weight coefficient is then used to weight the number of hash values corresponding to the node.

[0051] Assume the i-th hash value is denoted as identity_i and the number of occurrences of the i-th hash value is denoted as count(identity_i), where i is a positive integer. The node feature vector of a node having identity_i can be denoted as pointWeight × sensitive × count(identity_i), where × is the multiplication symbol.

[0052] Assuming the topological feature vector of the function call tree is denoted as Tvec, and the function call tree has n nodes, where n is a positive integer, then Tvec = [pointWeight × sensitive × count(identity_1), pointWeight × sensitive × count(identity_2), ..., pointWeight × sensitive × count(identity_n)].
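The Tvec formula can be illustrated with a short Python sketch. The text only requires "a hash function"; using MD5 from `hashlib` here is an assumption, as are the function names and the representation of a node as a `(type, sensitive, pointWeight)` tuple. The sample data reproduces the six-node counting example from the text, with all sensitivity values and node weights set to 1 for simplicity.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

# Assumed node shape: (function category, sensitive, pointWeight).
NodeAttrs = Tuple[str, int, float]


def category_hash(category: str) -> str:
    """Hash a function category (MD5 is an assumption of this sketch)."""
    return hashlib.md5(category.encode("utf-8")).hexdigest()


def topological_feature_vector(nodes: List[NodeAttrs]) -> List[float]:
    """Tvec[i] = pointWeight_i * sensitive_i * count(identity_i)."""
    identities = [category_hash(category) for category, _, _ in nodes]
    counts = Counter(identities)
    return [
        weight * sensitive * counts[identity]
        for (_, sensitive, weight), identity in zip(nodes, identities)
    ]


# Six nodes whose categories hash to four distinct values,
# occurring 2, 1, 1, and 2 times respectively.
nodes = [
    ("A", 1, 1.0), ("A", 1, 1.0), ("B", 1, 1.0),
    ("C", 1, 1.0), ("D", 1, 1.0), ("D", 1, 1.0),
]
tvec = topological_feature_vector(nodes)
```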

[0053] Method 3: For any node among at least some nodes in the function call tree, iteratively update the node's attributes based on the node's and its neighboring nodes' attributes until the required number of iterations is met; if the required number of iterations is met, perform hash operations on the function categories of at least some nodes in the function call tree to obtain multiple hash values and determine the quantity of each hash value; for any node among at least some nodes in the function call tree, process the number of hash values of the node's category using the node's sensitivity attributes and / or node weights to determine the node's node feature vector; fuse the node feature vectors of at least some nodes in the function call tree to obtain the topological feature vector of the function call tree.

[0054] In this embodiment, compared to method 2, method 3 iteratively updates the node attributes of nodes in the function call tree before performing hash operations on the function categories of at least some of the nodes included in the function call tree. In each iteration, at least one of the following steps may be performed: updating the function name of the node based on the function names of the node and its neighboring nodes; updating the function category of the node based on the function categories of the node and its neighboring nodes; updating the sensitivity attributes of the node based on the sensitivity attributes of the node and its neighboring nodes; and updating the node weight of the node based on the node weights of the node and its neighboring nodes, and the number of function calls between the node and its neighboring nodes.

[0055] In this embodiment, during each iteration, for any node, the function name of that node and the function names of each of its neighboring nodes are aggregated, and the node's function name is updated to the aggregated result.

[0056] In this embodiment, during each iteration, for any node, the function category of that node and the function categories of each of its neighboring nodes are aggregated, and the node's function category is updated to the aggregated result.

[0057] In this embodiment, during each iteration, for any node, if the sensitivity attribute of the node or of any of its neighboring nodes indicates a sensitive function, the node's sensitivity attribute is updated to indicate a sensitive function. If neither the node nor any of its neighboring nodes has a sensitivity attribute indicating a sensitive function, the node's sensitivity attribute is updated to indicate an insensitive function.

[0058] In this embodiment, during each iteration, for any node, various operations such as addition, subtraction, or averaging can be performed on the node weights of the node and its neighboring nodes, as well as the number of function calls between the node and its neighboring nodes, and the node's weight is updated to the result of the operation.

[0059] In this embodiment, the number of iterations is not limited. Let k be the number of iterations, where k is a positive integer. After k iterations, the node attributes of each node have been updated k times. Further, optionally, to better iteratively update node attributes, graph kernel algorithms such as the Weisfeiler-Lehman (WL) graph kernel algorithm and the WL Subtree (Weisfeiler-Lehman subtree kernel) graph kernel algorithm can be used for iterative updates. The WL graph kernel algorithm obtains a new representation of a node by continuously aggregating neighbor information. The WL Subtree graph kernel algorithm uses the counts of the different node labels during the iteration process as the feature vector of the graph. For more information on graph kernel algorithms, please refer to the introduction of related technologies.
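The neighbor-aggregation idea behind Method 3 can be sketched as a simplified Weisfeiler-Lehman style relabeling. This is not the full WL subtree kernel and not necessarily the procedure of this application: only the function category label is updated, edges are treated as undirected, the aggregated label is compressed with MD5 in every round, and only the labels of the final round are counted. All names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def wl_relabel(
    labels: Dict[int, str], edges: List[Tuple[int, int]], k: int
) -> Dict[int, str]:
    """Run k rounds of neighbor aggregation on node labels.

    In each round a node's new label is a hash of its own label followed
    by the sorted labels of its neighbors.
    """
    neighbors: Dict[int, List[int]] = {node: [] for node in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    current = dict(labels)
    for _ in range(k):
        updated = {}
        for node, label in current.items():
            around = ",".join(sorted(current[m] for m in neighbors[node]))
            aggregated = label + "|" + around
            updated[node] = hashlib.md5(aggregated.encode("utf-8")).hexdigest()
        current = updated
    return current


def label_counts(labels: Dict[int, str]) -> Counter:
    """Count how often each label occurs after the iterations."""
    return Counter(labels.values())


# Hypothetical tree: a main node with two children of the same category.
labels = {0: "main", 1: "print", 2: "print"}
edges = [(0, 1), (0, 2)]
relabeled = wl_relabel(labels, edges, k=1)
counts = label_counts(relabeled)
```

The resulting counts play the role of count(identity_i) in Method 2; nodes that share both a category and a neighborhood end up with the same label, while nodes that differ in either receive different labels.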

[0060] It is worth noting that, in order of increasing accuracy, the extracted topological feature vectors are: those extracted by method 1, those extracted by method 2, and those extracted by method 3.

[0061] It is worth noting that the topological feature vector of the first function call tree is extracted based on the node attributes and / or edge attributes of several nodes in the first function call tree. Similarly, the topological feature vector of the second function call tree is extracted based on the node attributes and / or edge attributes of several nodes in the second function call tree. For details on the extraction methods of the topological feature vectors of the first and second function call trees, please refer to the aforementioned extraction method for the topological feature vector of function call trees; these will not be repeated here.

[0062] In this embodiment, after obtaining the topological feature vectors of the first function call tree and the second function call tree respectively, the topological similarity between the first function call tree and the second function call tree is determined based on their respective topological feature vectors.

[0063] In practical applications, the Euclidean distance, cosine similarity, or Manhattan distance between the topological feature vectors of the first and second function call trees can be determined, and this measure can be used as the topological similarity between the first and second function call trees. Further, optionally, to better measure topological similarity, the Euclidean distance between the topological feature vectors of the first and second function call trees can be determined, and the topological similarity between the first and second function call trees can be determined based on this Euclidean distance.
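
For reference, the three measures named above can be computed as in the following sketch. The function names and the sample vectors are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical topological feature vectors of two function call trees.
tvec_1, tvec_2 = [3.0, 0.0], [0.0, 4.0]
```

Note that the two distances grow as the trees differ more, whereas cosine similarity shrinks, so a distance generally has to be mapped to a similarity before the values are compared against each other.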

[0064] In practical applications, the Euclidean distance can be directly used as the topological similarity between the first and second function call trees. Alternatively, to better measure topological similarity, the Euclidean distance between the respective topological feature vectors of the first and second function call trees can be determined, and the variance of these feature vectors can be calculated. The topological similarity between the first and second function call trees can then be determined based on the Euclidean distance and variance. When calculating the variance of the respective topological feature vectors of the first and second function call trees, each element in the feature vector is treated as a variable.

[0065] As an example, suppose the topological feature vector of function call tree $T_i$ is denoted as $Tvec_i$, the topological feature vector of function call tree $T_j$ is denoted as $Tvec_j$, and the topological similarity between function call tree $T_i$ and function call tree $T_j$ is denoted as $C_{ij}$. $C_{ij}$ can be calculated using the following formula:

[0066]

[0067] In formula (1), $C_{ij} \in [0,1]$, $\|\cdot\|$ represents the Euclidean distance, $\sigma$ represents the variance, and $e$ is an irrational constant.
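
One way to combine a Euclidean distance and a variance into a value in [0, 1] is a Gaussian-kernel form, sketched below. This is an assumption for illustration, not a reproduction of formula (1): the exact expression, and the way the variance is pooled over the vector elements, are not given in this text.

```python
import math
from statistics import pvariance

def topological_similarity(tvec_i, tvec_j):
    """Map the distance between two feature vectors into [0, 1]."""
    # Squared Euclidean distance between the two vectors.
    dist_sq = sum((x - y) ** 2 for x, y in zip(tvec_i, tvec_j))
    # ASSUMPTION: variance is taken over the elements of both vectors pooled.
    variance = pvariance(list(tvec_i) + list(tvec_j))
    if variance == 0:
        # All elements are equal, so the vectors are identical.
        return 1.0
    # ASSUMPTION: Gaussian-kernel form exp(-d^2 / (2 * variance)).
    return math.exp(-dist_sq / (2 * variance))
```

Under these assumptions, identical vectors give a similarity of 1, and the value decays toward 0 as the distance grows relative to the spread of the feature values.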

[0068] In this embodiment, in addition to determining the topological similarity between the function call trees of two script files, it is also necessary to determine the variant confidence between the function call trees of the two script files. Variant confidence refers to the confidence level reflecting whether the topological structure of the function call trees has changed. The higher the variant confidence, the greater the difference between the topological structures of the two function call trees, and the greater the probability that the two function call trees belong to two different types of function call trees; that is, the greater the probability that one function call tree is a variant of the other. Conversely, the lower the variant confidence, the smaller the difference between the topological structures of the two function call trees, and the lower the probability that the two function call trees belong to two different types of function call trees; that is, the lower the probability that one function call tree is a variant of the other.

[0069] In this embodiment, to accurately measure the variant confidence between function call trees, the tree structure information of the remaining tree structure (the function call trees involved in the variant confidence determination, excluding identical subtrees) is considered. The tree structure information of the remaining tree structure includes, but is not limited to: the number of nodes in the remaining tree structure, the number of connected components in the remaining tree structure, and the number of nodes corresponding to each of the at least one connected component. A connected component refers to a maximal connected subgraph in the graph.

[0070] Taking the first and second function call trees as the function call trees involved in the variant confidence determination, the two trees can be traversed with the goal of finding common subgraphs until the common subgraphs of both trees are found. The tree structures left in the first and second function call trees after excluding the common subgraphs are taken as the remaining tree structure. Optionally, the nodes in the first and second function call trees can be traversed; if the currently traversed node is a common node of both trees, the tree structure consisting of the common node and its neighboring nodes is determined to be a common subtree, and the tree structures left after excluding the common subtrees are taken as the remaining tree structure. It is worth noting that finding common subgraphs by traversing the function call trees for common nodes is more efficient than traversing them for common subgraphs directly, which improves the overall efficiency of script file similarity measurement.

[0071] It is worth noting that, taking the first and second function call trees as the function call trees involved in the variant confidence determination, the common subtrees in the first and second function call trees are identified. The first remaining tree structure (the first function call tree excluding the common subtrees) and the second remaining tree structure (the second function call tree excluding the common subtrees) together form the entire remaining tree structure. The tree structure information of the entire remaining tree structure is used in the variant confidence determination task. The number of nodes in the entire remaining tree structure is the sum of the number of nodes in the first and second remaining tree structures. The number of nodes in the first remaining tree structure refers to the number of nodes included in the first remaining tree structure; similarly, the number of nodes in the second remaining tree structure refers to the number of nodes included in the second remaining tree structure.

[0072] The number of connected components in the entire remaining tree structure is the sum of the number of connected components in the first remaining tree structure and in the second remaining tree structure. The number of connected components in the first remaining tree structure refers to the total number of connected components in the first remaining tree structure, and the number of connected components in the second remaining tree structure refers to the total number of connected components in the second remaining tree structure. The number of nodes in each connected component is the number of nodes included in that connected component.
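
The bookkeeping described above can be sketched as follows: remove the nodes of the common subtrees from each tree, then count the connected components and their sizes in what is left. The function names, the edge-list representation, and the sample trees are assumptions; a tree consisting of a single node with no edges is not represented in this sketch.

```python
from collections import deque

def _component_sizes(edges, removed):
    """Sizes of the connected components left after deleting `removed` nodes."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    remaining = set(adjacency) - set(removed)
    sizes, seen = [], set()
    for start in sorted(remaining):
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search within one component
            node = queue.popleft()
            size += 1
            for nxt in adjacency[node]:
                if nxt in remaining and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sizes

def remaining_structure(edges_1, common_1, edges_2, common_2):
    """Return (total node count, list of component sizes) over both trees."""
    sizes = _component_sizes(edges_1, common_1) + _component_sizes(edges_2, common_2)
    return sum(sizes), sizes

# Hypothetical trees sharing the subtree {a, b}.
edges_1 = [("a", "b"), ("a", "c"), ("c", "d"), ("c", "e")]
edges_2 = [("a", "b"), ("a", "x"), ("a", "y")]
q, sizes = remaining_structure(edges_1, {"a", "b"}, edges_2, {"a", "b"})
```

Here the first tree contributes one component of three nodes and the second contributes two isolated nodes, so the total node count is 5 across 3 connected components.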

[0073] This embodiment does not limit the implementation method of determining the variant confidence between the first function call tree and the second function call tree based on the tree structure information of the remaining tree structure. For example, the variant confidence between function call trees can be obtained by weighted summing or averaging the number of nodes in the remaining tree structure and the number of nodes in each connected component of the remaining tree structure. Further optionally, in order to accurately measure the variant confidence between function call trees, the ratio of the number of nodes in each connected component to the number of nodes in the remaining tree structure can be determined for at least one connected component; the variant confidence between the first function call tree and the second function call tree can be determined based on the ratio corresponding to each of the at least one connected component.

[0074] In practical applications, the variant confidence between the first and second function call trees can be obtained by summing or averaging the ratios corresponding to at least one connected component. Further, optionally, to accurately measure the variant confidence between function call trees, the variant confidence between the first and second function call trees can be determined based on the ratios corresponding to at least one connected component and the topological similarity. For example, various operations such as weighted summation, averaging, or subtraction can be performed on the ratios corresponding to at least one connected component and the topological similarity to obtain the variant confidence between the first and second function call trees.

[0075] For example, suppose the variant confidence between function call tree $T_i$ and function call tree $T_j$ is denoted as $M_{ij}$; the number of nodes in the remaining tree structure of function call tree $T_i$ and function call tree $T_j$, excluding identical subtrees, is denoted as $Q$, where $Q$ is a positive integer; the number of connected components in the remaining tree structure is denoted as $P$, where $P$ is a positive integer; and the number of nodes in the $c$-th connected component is denoted as $R_c$, where $c$ takes values in $[1, P]$. Applying the principle of information entropy, $M_{ij}$ is calculated using the following formulas:

[0076] $M_{ij} = C_{ij} \times H$ …… (2)

[0077]

[0078] H can be viewed as a form of information entropy.
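
A minimal sketch of formula (2) follows. Because the expression for H is not reproduced in this text, the sketch assumes H is the Shannon entropy (natural logarithm) of the ratios R_c / Q; the function names are hypothetical.

```python
import math

def entropy_of_components(component_sizes):
    """ASSUMPTION: H = -sum((R_c / Q) * ln(R_c / Q)) over all components."""
    q = sum(component_sizes)
    if q == 0:
        return 0.0
    return -sum((r / q) * math.log(r / q) for r in component_sizes if r > 0)

def variant_confidence(c_ij, component_sizes):
    """Formula (2): M_ij = C_ij * H."""
    return c_ij * entropy_of_components(component_sizes)
```

Under this assumption, a remaining tree structure that forms a single connected component gives H = 0, while one split evenly across many small components gives a larger H.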

[0079] In this embodiment, after obtaining the topological similarity and variant confidence between the first function call tree and the second function call tree, the similarity between the first script file and the second script file is determined based on the topological similarity and variant confidence. For example, various operations such as addition, subtraction, averaging, or multiplication are performed on the topological similarity and variant confidence to obtain the similarity between the first script file and the second script file.

[0080] For example, suppose the similarity between the script files corresponding to function call tree $T_i$ and function call tree $T_j$ is denoted as $S_{ij}$. $S_{ij}$ is calculated according to the following formula:

[0081] $S_{ij} = C_{ij} \times M_{ij} = C_{ij}^{2} \times H$ …… (4)

[0082] In some optional embodiments, in order to more intuitively evaluate the effectiveness of the similarity measurement method between script files, a heatmap can be generated based on the similarity between any two script files. The heatmap includes multiple regions of different colors, and the color of each region is used to represent the similarity between script files. The intensity of the color is related to the magnitude of the similarity.

[0083] Specifically, referring to Figure 4 and taking K script files as an example, where K is a positive integer, a similarity matrix of K×K elements is generated based on the similarity between any two script files. The element in the i-th row and j-th column of the similarity matrix corresponds to the similarity between the i-th and j-th script files, where i and j are any positive integers from 1 to K. A heatmap of K×K regions is then rendered based on the similarity matrix. The region in the i-th row and j-th column of the heatmap corresponds to the element in the i-th row and j-th column of the similarity matrix. The intensity of the color of each region is related to the magnitude of the similarity; for example, a darker color indicates a greater similarity, and a lighter color indicates a smaller similarity. If the regions on the diagonal of the heatmap (corresponding to the similarity between identical script files) are darker, and the regions off the diagonal (corresponding to the similarity between different script files) are lighter, then the similarity measurement method between script files is considered valid. If the color difference between the regions on the diagonal and those off the diagonal is small, or if the regions on the diagonal are lighter and those off the diagonal are darker, then the similarity measurement method between script files is considered invalid.
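
The matrix construction and the diagonal check described above can be sketched without a plotting library. The `jaccard` function is only a stand-in similarity chosen so that the example runs on its own; it is not the similarity defined in this application, and all names are hypothetical.

```python
def similarity_matrix(files, similarity):
    """Build the K x K matrix whose (i, j) entry is similarity(files[i], files[j])."""
    return [[similarity(a, b) for b in files] for a in files]

def diagonal_dominates(matrix):
    """True when every diagonal entry exceeds every off-diagonal entry."""
    k = len(matrix)
    diagonal = [matrix[i][i] for i in range(k)]
    off_diagonal = [matrix[i][j] for i in range(k) for j in range(k) if i != j]
    return not off_diagonal or min(diagonal) > max(off_diagonal)

def jaccard(a, b):
    """Stand-in similarity: overlap of two sets of called function names."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical script files, each reduced to the set of functions it calls.
files = [{"eval", "base64_decode"}, {"eval", "system"}, {"print"}]
matrix = similarity_matrix(files, jaccard)
```

The matrix can then be passed to any heatmap renderer; `diagonal_dominates` is one possible numeric counterpart of the visual check on diagonal versus off-diagonal color intensity.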

[0084] Understandably, heatmaps can be used to visually assess the effectiveness of similarity measurement methods between script files, thereby providing a basis for decision-making on whether to classify script files and ensuring the reliability of script file classification.

[0085] The technical solution provided in this application takes into account the complex structure and numerous variations of script files. It focuses not only on the topological similarity between the function call trees of two script files, but also on the variant confidence, which reflects whether the topological structure of the function call tree has changed. Therefore, by combining the topological similarity between function call trees and the variant confidence, the similarity between script files can be measured more accurately. This provides a reliable foundation for accurate classification of script files in the future. In particular, it is beneficial for effectively defending against various script attacks such as Webshell attacks and ensuring network security.

[0086] Figure 5 is a flowchart of another script file processing method provided in an embodiment of this application. Referring to Figure 5, the method may include the following steps:

[0087] 501. Obtain multiple script files sent by the application server.

[0088] 502. For any first script file and second script file among multiple script files, obtain the first function call tree of the first script file and the second function call tree of the second script file.

[0089] 503. Determine the topological similarity between the first function call tree and the second function call tree based on their respective topological feature vectors.

[0090] 504. Based on the tree structure information of the remaining tree structure, determine the variant confidence between the first function call tree and the second function call tree that reflects the variation of the topology. The remaining tree structure refers to the tree structure in the first function call tree and the second function call tree excluding the same subtrees.

[0091] 505. Determine the similarity between the first script file and the second script file based on the topological similarity and variant confidence.

[0092] 506. Based on the similarity between any two script files, classify the multiple script files to obtain the script file classification results.

[0093] Specifically, this method can be executed by a cloud server. Various application servers, such as web servers, social network platform servers, and e-commerce platform servers, can send multiple script files to the cloud server. The cloud server executes steps 502 to 505 to determine the similarity between any two script files. Then, based on the similarity between any two script files, the cloud server classifies the multiple script files to obtain a script file classification result. The script file classification result includes the script files in each category. When classifying multiple script files, if the similarity between two script files is greater than or equal to a preset similarity threshold, the two script files are classified into the same category; if the similarity between two script files is less than the preset similarity threshold, the two script files are classified into different categories. The preset similarity threshold is set according to the actual application requirements.
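
Threshold-based classification over a similarity matrix might be sketched as below. The sketch assumes that pairs whose similarity reaches the threshold share a category and that categories are merged transitively with a union-find structure; the text does not specify how conflicting pairs are resolved, and the function name is hypothetical.

```python
def classify(similarity, threshold):
    """Group file indices; pairs with similarity >= threshold share a category."""
    k = len(similarity)
    parent = list(range(k))

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(k):
        for j in range(i + 1, k):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two categories

    groups = {}
    for i in range(k):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical similarity matrix for three script files.
similarity = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
categories = classify(similarity, threshold=0.8)
```

With a threshold of 0.8, files 0 and 1 end up in one category and file 2 in its own.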

[0094] In some optional embodiments, the cloud server can also perform anomaly detection on script files of various categories to obtain status labels for each category, thereby providing a basis for subsequent anomaly file detection. When performing anomaly detection on script files of each category, existing malicious script detection methods can be used. Further optionally, to improve the accuracy of anomaly detection, a machine learning model capable of identifying malicious scripts can be trained. This machine learning model can then be used to perform anomaly detection on script files of each category, obtaining status labels for each category. These status labels indicate whether the script files of the corresponding category are abnormal or normal.

[0095] When the cloud server receives an abnormal file detection request from the application server, the abnormal file detection request includes the script file to be identified. In response to the abnormal file detection request from the application server, the cloud server determines the category to which the script file to be identified belongs based on the similarity between the script file to be identified and script files of various categories. Based on the status label of the category to which the script file to be identified belongs, the cloud server determines whether the script file to be identified is abnormal. The cloud server then returns an abnormal detection result indicating whether the script file to be identified is abnormal to the application server.

[0096] Specifically, the cloud server determines the category to which the script file to be identified belongs based on the similarity between the script file to be identified and the script files in the various categories. If the status label of that category indicates that the script files in the category are abnormal, the script file to be identified is also abnormal; if the status label of that category indicates that the script files in the category are normal, the script file to be identified is also normal. After the cloud server returns the anomaly detection result of the script file to be identified to the application server, the application server continues to retain locally any script file to be identified that is normal; the application server deletes any script file to be identified that is abnormal and blocks access requests initiated by attackers who tampered with that script file, thus defending against malicious script attacks.
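
The lookup described above can be sketched as follows. Choosing the category with the highest mean similarity is an assumption, since the text does not say how per-file similarities are combined; `jaccard` is a stand-in similarity used only to make the example runnable, and all names are hypothetical.

```python
def detect(candidate, categories, status_labels, similarity):
    """Assign the candidate to its most similar category; return (category, label)."""
    def mean_similarity(members):
        return sum(similarity(candidate, m) for m in members) / len(members)

    # ASSUMPTION: the best category is the one with the highest mean similarity.
    best = max(categories, key=lambda name: mean_similarity(categories[name]))
    return best, status_labels[best]

def jaccard(a, b):
    """Stand-in similarity: overlap of two sets of called function names."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical categories produced by an earlier classification step.
categories = {
    "A": [{"eval", "system"}, {"eval", "exec"}],
    "B": [{"print", "echo"}],
}
status_labels = {"A": "abnormal", "B": "normal"}
result = detect({"eval", "system", "exec"}, categories, status_labels, jaccard)
```

The candidate inherits the status label of the category it lands in, so no per-file malicious-script detection is needed at request time.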

[0097] The technical solution provided in this application embodiment enables cloud servers to accurately measure the similarity between script files and accurately classify script files, thereby providing a reliable foundation for effectively defending against malicious script attacks and ensuring network security.

[0098] Figure 6 is a schematic diagram of a script file processing device provided in an embodiment of this application. Referring to Figure 6, the device may include:

[0099] The acquisition module 61 is used to obtain the first function call tree of the first script file and the second function call tree of the second script file;

[0100] The first determining module 62 is used to determine the topological similarity between the first function call tree and the second function call tree based on their respective topological feature vectors.

[0101] The second determining module 63 is used to determine the variant confidence between the first function call tree and the second function call tree, reflecting the variation of the topology, based on the tree structure information of the remaining tree structure. The remaining tree structure refers to the tree structure in the first function call tree and the second function call tree excluding the same subtrees.

[0102] The third determining module 64 is used to determine the similarity between the first script file and the second script file based on the topological similarity and the variant confidence.

[0103] Further optionally, a node in the first function call tree corresponds to a calling function or a called function, and the node attributes of the node include at least the function category, a sensitivity attribute indicating whether the function is sensitive, and the node weight;

[0104] The first determining module 62 is further configured to: perform hash operations on the function categories of at least some nodes included in the first function call tree to obtain multiple hash values, and determine the quantity of each hash value; for any node among the at least some nodes included in the first function call tree, process the quantity of hash values of the node's category using the node's sensitive attributes and / or node weights to determine the node's node feature vector; and perform fusion processing on the node feature vectors of at least some nodes included in the first function call tree to obtain the topological feature vector of the first function call tree.
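
The hash, count, weight, and fuse steps described above might look like the sketch below. The fixed dimension `DIM`, the use of SHA-256 to pick a bucket, the doubling factor for sensitive nodes, and fusion by element-wise summation are all assumptions for illustration.

```python
import hashlib
from collections import Counter

DIM = 8  # hypothetical length of the topological feature vector

def bucket(category):
    """Hash a function category to a stable index in [0, DIM)."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIM

def topological_feature_vector(nodes):
    """nodes: list of dicts with 'category', 'sensitive' (bool), and 'weight'."""
    # Quantity of each hash value across the selected nodes.
    counts = Counter(bucket(n["category"]) for n in nodes)
    fused = [0.0] * DIM
    for n in nodes:
        b = bucket(n["category"])
        # ASSUMPTION: sensitive nodes count double.
        factor = 2.0 if n["sensitive"] else 1.0
        # Node feature vector is nonzero only at index b; fuse by summation.
        fused[b] += counts[b] * n["weight"] * factor
    return fused

# Hypothetical nodes of a function call tree.
nodes = [
    {"category": "exec", "sensitive": True, "weight": 1.0},
    {"category": "exec", "sensitive": True, "weight": 1.0},
    {"category": "string", "sensitive": False, "weight": 0.5},
]
vector = topological_feature_vector(nodes)
```

A cryptographic digest is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would make feature vectors incomparable across runs.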

[0105] Further optionally, before performing hash operations on the function categories of at least some of the nodes included in the first function call tree, the first determining module 62 is further configured to: for any node among the at least some nodes, iteratively execute updating the node attributes of the node based on the node attributes of the node and its neighboring nodes until the number of iterations meets the requirement.

[0106] Further optionally, when the first determining module 62 iteratively executes the update of the node attributes based on the node attributes of the node and its neighboring nodes, it is specifically used to: in each iteration, perform at least one of the following steps: update the function name of the node based on the function name of the node and its neighboring nodes; update the function category of the node based on the function category of the node and its neighboring nodes; update the sensitive attributes of the node based on the sensitive attributes of the node and its neighboring nodes; update the node weight of the node based on the node weight of the node and its neighboring nodes, and the number of function calls between the node and its neighboring nodes.

[0107] Further optionally, when the first determining module 62 determines the topological similarity between the first function call tree and the second function call tree, it is specifically used to: determine the Euclidean distance between the respective topological feature vectors of the first function call tree and the second function call tree; and determine the topological similarity between the first function call tree and the second function call tree based on the Euclidean distance.

[0108] Further optionally, the tree structure information includes at least the number of nodes in the remaining tree structure and the number of nodes corresponding to each of the at least one connected component;

[0109] Accordingly, when the second determining module 63 determines the variant confidence between the first function call tree and the second function call tree based on the tree structure information of the remaining tree structure, it is specifically used to: for each connected component in at least one connected component, determine the ratio of the number of nodes in the connected component to the number of nodes in the remaining tree structure; and determine the variant confidence between the first function call tree and the second function call tree based on the ratio corresponding to each of the at least one connected component.

[0110] Further optionally, when the second determining module 63 determines the variant confidence between the first function call tree and the second function call tree based on the ratio corresponding to each of the at least one connected component, it is specifically used to: determine the variant confidence between the first function call tree and the second function call tree based on the ratio corresponding to each of the at least one connected component and the topological similarity.

[0111] Further optionally, before determining the variant confidence between the first function call tree and the second function call tree based on the tree structure information of the remaining tree structure, the second determining module 63 is also used to: traverse the nodes in the first function call tree and the second function call tree; if the currently traversed node is the same node in the first function call tree and the second function call tree, then the tree structure composed of the same node and its neighboring nodes in the first function call tree and the second function call tree is determined as the same subtree.

[0112] Optionally, when the acquisition module 61 acquires the first function call tree of the first script file, it is specifically used to: extract at least one function call data from the first script file, the function call data reflecting the call relationship between the calling function and the called function in the first script file; determine at least one calling function and at least one called function called by each calling function based on the at least one function call data; abstract the at least one calling function and the at least one called function called by each calling function as a node of the first function call tree; and create an edge connecting the nodes corresponding to the calling function and the called function with the call relationship, wherein the node attributes of the node include at least the function name, function category, sensitivity attribute and node weight, and the edge attributes include at least the number of function calls.
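
The construction of a function call tree from extracted call data can be sketched as follows. The `Node` dataclass, its default attribute values, and the `(caller, callee)` pair format are assumptions; how call data is extracted from a script and how categories and weights are assigned are outside the scope of this sketch.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Node:
    name: str                  # function name
    category: str = "unknown"  # function category (placeholder default)
    sensitive: bool = False    # whether the function is sensitive
    weight: float = 1.0        # node weight (placeholder default)

def build_call_tree(call_data, sensitive_names=frozenset()):
    """call_data: iterable of (caller, callee) pairs extracted from one script."""
    nodes, edges = {}, Counter()
    for caller, callee in call_data:
        for name in (caller, callee):
            if name not in nodes:
                nodes[name] = Node(name, sensitive=name in sensitive_names)
        # Edge attribute: number of times the caller calls the callee.
        edges[(caller, callee)] += 1
    return nodes, edges

# Hypothetical call data: main calls decode once and eval twice.
call_data = [("main", "decode"), ("main", "eval"), ("main", "eval")]
nodes, edges = build_call_tree(call_data, sensitive_names={"eval"})
```

Repeated calls between the same pair of functions collapse into a single edge whose count records the number of function calls.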

[0113] Further optionally, after determining the similarity between the first script file and the second script file, the third determining module 64 is also configured to: generate a heatmap based on the similarity between any two script files among the multiple script files, the heatmap including multiple regions of different colors, the color of each region being used to characterize the similarity between the script files, the intensity of the color being related to the magnitude of the similarity.

[0114] The device shown in Figure 6 can perform the method of the embodiment shown in Figure 2; its implementation principle and technical effects will not be elaborated further. The specific ways in which each module and unit of the device shown in Figure 6 performs operations have been described in detail in the method embodiments above, and will not be elaborated here.

[0115] It should be noted that the execution subject of each step of the method provided in the above embodiments can be the same device, or the method can be executed by different devices. For example, the execution subject of steps 201 to 204 can be device A; or the execution subject of steps 201 and 202 can be device A, and the execution subject of steps 203 and 204 can be device B; and so on.

[0116] Furthermore, in some of the processes described in the above embodiments and accompanying drawings, multiple operations appear in a specific order. However, it should be clearly understood that these operations may not be executed in the order they appear herein, or they may be executed in parallel. The operation numbers, such as 201, 202, etc., are merely used to distinguish different operations and do not represent any execution order. Additionally, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that the descriptions such as "first" and "second" in this document are used to distinguish different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit "first" and "second" to different types.

[0117] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 7, the electronic device includes: a memory 71 and a processor 72;

[0118] Memory 71 is used to store computer programs and can be configured to store various other data to support operation on the computing platform. Examples of this data include instructions for any application or method operating on the computing platform, contact data, phone book data, messages, pictures, videos, etc.

[0119] The memory 71 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0120] The processor 72, coupled to the memory 71, is used to execute the computer program in the memory 71 so as to implement the steps in the script file processing method in the above embodiments.

[0121] Furthermore, as shown in Figure 7, the electronic device also includes other components such as a communication component 73, a display 74, a power supply component 75, and an audio component 76. Figure 7 only schematically shows some components, which does not mean that the electronic device includes only the components shown in Figure 7. Additionally, the components within the dashed box in Figure 7 are optional, not mandatory, depending on the product form of the electronic device. The electronic device in this embodiment can be a terminal device such as a desktop computer, laptop computer, smartphone, or IoT (Internet of Things) device, or a server-side device such as a conventional server, cloud server, or server array. If the electronic device in this embodiment is implemented as a desktop computer, laptop computer, or smartphone, it may include the components within the dashed box in Figure 7; if the electronic device in this embodiment is implemented as a conventional server, cloud server, or server array, the components within the dashed box in Figure 7 may be omitted.

[0122] For a detailed description of the implementation process of each action by the processor, please refer to the relevant descriptions in the foregoing method embodiments or device embodiments, which will not be repeated here.

[0123] Accordingly, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed, can implement the steps that can be performed by an electronic device in the above method embodiments.

[0124] Accordingly, this application also provides a computer program product, including a computer program / instructions, which, when executed by a processor, enables the processor to perform the steps that can be executed by an electronic device in the above method embodiments.

[0125] The aforementioned communication components are configured to facilitate wired or wireless communication between the device containing the communication components and other devices. The device containing the communication components can access wireless networks based on communication standards, such as WiFi, 2G, 3G, 4G / LTE, 5G, or combinations thereof. In one exemplary embodiment, the communication components receive broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication components also include a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wide Band (UWB), Bluetooth (BT), and other technologies.

[0126] The aforementioned display includes a screen, which may include a Liquid Crystal Display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors can sense not only the boundaries of touch or swipe actions but also the duration and pressure associated with the touch or swipe operation.

[0127] The aforementioned power supply components provide power to various components within the device in which they reside. These power supply components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power to the device in which they reside.

[0128] The aforementioned audio component can be configured to output and / or input audio signals. For example, the audio component includes a microphone (MIC) configured to receive external audio signals when the device containing the audio component is in an operating mode, such as call mode, recording mode, or voice recognition mode. The received audio signals can be further stored in memory or transmitted via a communication component. In some embodiments, the audio component also includes a speaker for outputting audio signals.

[0129] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-readable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0130] This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0131] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0132] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0133] In a typical configuration, a computing device includes one or more processors (central processing unit, CPU), input / output interfaces, network interfaces, and memory.

[0134] Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0135] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change RAM (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

[0136] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0137] The above are merely embodiments of this application and are not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A script file processing method, characterized in that it comprises: obtaining a first function call tree of a first script file and a second function call tree of a second script file; determining the topological similarity between the first function call tree and the second function call tree based on the topological feature vectors of the first function call tree and the second function call tree; determining, based on tree structure information of a remaining tree structure, a variant confidence between the first function call tree and the second function call tree, the variant confidence reflecting the variation of the topological structure, wherein the remaining tree structure is the tree structure that remains in the first function call tree and the second function call tree after the same subtrees are excluded; and determining the similarity between the first script file and the second script file based on the topological similarity and the variant confidence.

2. The method according to claim 1, characterized in that each node in the first function call tree corresponds to a calling function or a called function, and the node attributes of the node include at least a function category, a sensitivity attribute indicating whether the function is sensitive, and a node weight; the method further comprises: performing hash operations on the function categories of at least some of the nodes included in the first function call tree to obtain multiple hash values, and determining the count of each hash value; for any node among the at least some nodes, processing the count of the hash value to which the node belongs using the node's sensitivity attribute and/or node weight to determine the node feature vector of the node; and fusing the node feature vectors of the at least some nodes to obtain the topological feature vector of the first function call tree.
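The feature construction recited in claim 2 can be illustrated with a minimal sketch. The claim does not fix the hash function, the vector length, how the count is scaled, or how node vectors are fused, so everything below is an assumption made for illustration: SHA-256 bucketing into a hypothetical `DIM` slots, scaling by `weight` and a hypothetical `sensitive_boost`, and fusion by summation.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass

DIM = 8  # hypothetical length of the topological feature vector


@dataclass
class Node:
    category: str    # function category
    sensitive: bool  # sensitivity attribute
    weight: float    # node weight


def bucket(category: str) -> int:
    """Hash a function category into one of DIM slots (assumed scheme)."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIM


def topological_feature_vector(nodes, sensitive_boost=2.0):
    """Fuse per-node feature vectors by summation (assumed fusion rule)."""
    # Count how many nodes fall on each hash value.
    counts = Counter(bucket(n.category) for n in nodes)
    vector = [0.0] * DIM
    for n in nodes:
        slot = bucket(n.category)
        # Node feature vector: the count of the node's hash value, scaled by
        # the node weight and boosted when the function is sensitive.
        scale = n.weight * (sensitive_boost if n.sensitive else 1.0)
        vector[slot] += counts[slot] * scale
    return vector


example = [Node("code_execution", True, 1.0), Node("code_execution", False, 1.0)]
print(topological_feature_vector(example))
```

Summation keeps the vector length independent of the tree size, which is what allows two trees of different sizes to be compared by a vector distance.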

3. The method according to claim 2, characterized in that, before performing hash operations on the function categories of at least some of the nodes included in the first function call tree, the method further comprises: for any node among the at least some nodes, iteratively updating the node attributes of the node based on the node attributes of the node and of its neighboring nodes until the required number of iterations is reached.

4. The method according to claim 3, characterized in that iteratively updating the node attributes of the node based on the node attributes of the node and of its neighboring nodes comprises: in each iteration, performing at least one of the following steps: updating the function name of the node based on the function names of the node and its neighboring nodes; updating the function category of the node based on the function categories of the node and its neighboring nodes; updating the sensitivity attribute of the node based on the sensitivity attributes of the node and its neighboring nodes; and updating the node weight of the node based on the node weights of the node and its neighboring nodes and the number of function calls between the node and its neighboring nodes.
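The iterative update of claims 3 and 4 resembles neighborhood relabeling as used in Weisfeiler-Lehman style graph kernels. The sketch below is one possible reading, not the claimed update rules, which the claims leave open: the category is relabeled by concatenating sorted neighbor categories, the sensitivity attribute is propagated with a logical OR, and the weight accumulates neighbor weights multiplied by call counts. The function name `refine_attributes` and all update formulas are assumptions, and the function-name update is omitted for brevity.

```python
from collections import defaultdict


def refine_attributes(category, sensitive, weight, calls, iterations=1):
    """Update node attributes from neighbors for a fixed number of rounds.

    category, sensitive, weight: dicts keyed by node id.
    calls: dict mapping (caller, callee) to the number of function calls.
    """
    neighbors = defaultdict(dict)
    for (u, v), count in calls.items():
        neighbors[u][v] = count
        neighbors[v][u] = count

    for _ in range(iterations):
        new_cat, new_sens, new_weight = {}, {}, {}
        for node in category:
            adj = neighbors[node]
            # Category: own label plus the sorted labels of the neighbors.
            around = ",".join(sorted(category[n] for n in adj))
            new_cat[node] = f"{category[node]}({around})"
            # Sensitivity: sensitive if the node or any neighbor is sensitive.
            new_sens[node] = sensitive[node] or any(sensitive[n] for n in adj)
            # Weight: own weight plus neighbor weights scaled by call counts.
            new_weight[node] = weight[node] + sum(weight[n] * c for n, c in adj.items())
        # Synchronous update: every node reads the previous round's values.
        category, sensitive, weight = new_cat, new_sens, new_weight
    return category, sensitive, weight
```

After such a refinement, two nodes with the same original category but different surroundings receive different labels, so the subsequent hashing step encodes local topology rather than only the bag of categories.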

5. The method according to any one of claims 1 to 4, characterized in that determining the topological similarity between the first function call tree and the second function call tree comprises: determining the Euclidean distance between the topological feature vectors of the first function call tree and the second function call tree; and determining the topological similarity between the first function call tree and the second function call tree based on the Euclidean distance.
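As a small illustration of claim 5, the Euclidean distance between two feature vectors can be mapped to a similarity score. The claim only requires that the similarity be determined from the distance; the mapping 1 / (1 + d) used here is an assumption chosen because it yields 1 for identical vectors and decreases toward 0 as the distance grows.

```python
import math


def topological_similarity(vec_a, vec_b):
    """Map the Euclidean distance between two vectors into (0, 1]."""
    if len(vec_a) != len(vec_b):
        raise ValueError("feature vectors must have the same length")
    distance = math.dist(vec_a, vec_b)  # Euclidean distance
    return 1.0 / (1.0 + distance)       # assumed distance-to-similarity mapping


print(topological_similarity([0.0, 0.0], [3.0, 4.0]))
```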

6. The method according to any one of claims 1 to 4, characterized in that the tree structure information includes at least the number of nodes in the remaining tree structure and the number of nodes in each of at least one connected component; accordingly, determining the variant confidence between the first function call tree and the second function call tree based on the tree structure information of the remaining tree structure comprises: for each of the at least one connected component, determining the ratio of the number of nodes in the connected component to the number of nodes in the remaining tree structure; and determining the variant confidence between the first function call tree and the second function call tree based on the ratios corresponding to the at least one connected component.
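The ratio computation of claim 6 can be sketched as follows. The connected components of the remaining tree structure are found with a breadth-first search and each component's share of the remaining nodes is computed. The claim does not say how the ratios are turned into a confidence value, so the aggregation in `variant_confidence` (taking the largest ratio) is purely a placeholder assumption.

```python
from collections import deque


def component_ratios(nodes, edges):
    """Return, per connected component, its share of the remaining nodes."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    ratios = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search over one connected component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            current = queue.popleft()
            size += 1
            for nxt in adjacency[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        ratios.append(size / len(nodes))
    return ratios


def variant_confidence(ratios):
    """Placeholder aggregation (assumption): the largest component ratio."""
    return max(ratios, default=0.0)
```

Under this placeholder, a remaining structure dominated by one large connected component scores close to 1, while many scattered single-node differences score low. Whether that direction matches the intended meaning of the confidence is not stated in the claim.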

7. The method according to claim 6, characterized in that determining the variant confidence between the first function call tree and the second function call tree based on the ratios corresponding to the at least one connected component comprises: determining the variant confidence between the first function call tree and the second function call tree based on the ratios corresponding to the at least one connected component and on the topological similarity.

8. The method according to any one of claims 1 to 4, characterized in that, before determining the variant confidence between the first function call tree and the second function call tree based on the tree structure information of the remaining tree structure, the method further comprises: traversing the nodes in the first function call tree and the second function call tree; and if the currently traversed node is a node that is the same in both the first function call tree and the second function call tree, determining the tree structure composed of that node and its neighboring nodes in both the first function call tree and the second function call tree as a same subtree.
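One possible reading of claim 8 is sketched below. It assumes that two nodes are "the same" when they carry the same function name, and that the same subtree around such a node consists of the node plus the neighbors it has in both trees. Both assumptions, and the adjacency-dictionary representation, are illustrative only.

```python
def same_subtrees(tree_a, tree_b):
    """Map each node present in both trees to the neighbors shared by both.

    tree_a, tree_b: dicts mapping a function name to the set of neighboring
    function names in that tree.
    """
    result = {}
    for node in tree_a:            # traverse the first tree
        if node in tree_b:         # the same node exists in the second tree
            result[node] = tree_a[node] & tree_b[node]
    return result


def remaining_nodes(tree_a, tree_b):
    """Nodes of either tree that are not covered by any same subtree."""
    covered = set()
    for node, shared in same_subtrees(tree_a, tree_b).items():
        covered |= {node} | shared
    return (set(tree_a) | set(tree_b)) - covered


tree_a = {"main": {"decode", "run"}, "decode": {"main"}, "run": {"main"}}
tree_b = {"main": {"decode", "wrap"}, "decode": {"main"},
          "wrap": {"main", "run"}, "run": {"wrap"}}
print(remaining_nodes(tree_a, tree_b))
```

In the example, the second tree inserts a hypothetical `wrap` function between `main` and `run`, so `wrap` is the only node left in the remaining structure.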

9. The method according to any one of claims 1 to 4, characterized in that obtaining the first function call tree of the first script file comprises: extracting at least one piece of function call data from the first script file, wherein the function call data reflects the calling relationship between a calling function and a called function in the first script file; determining, based on the at least one piece of function call data, at least one calling function and at least one called function called by each calling function; and abstracting the at least one calling function and the at least one called function called by each calling function as nodes in the first function call tree, and creating an edge between the nodes corresponding to a calling function and a called function that have a calling relationship, wherein the node attributes of a node include at least a function name, a function category, a sensitivity attribute, and a node weight, and the edge attributes include at least the number of function calls.
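The tree construction of claim 9 can be sketched as follows, assuming the function call data has already been extracted as (caller, callee) pairs. The category table, the sensitive-function set, and the initial node weight of 1.0 are hypothetical values for illustration; the claim does not define them.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical lookup tables; a real system would maintain its own.
CATEGORY = {"eval": "code_execution", "system": "command_execution",
            "base64_decode": "encoding"}
SENSITIVE = {"eval", "system"}


@dataclass
class FunctionNode:
    name: str
    category: str
    sensitive: bool
    weight: float


def build_call_tree(call_data):
    """Build nodes and weighted edges from (caller, callee) pairs."""
    # Edge attribute: how many times the caller invokes the callee.
    edges = Counter(call_data)
    nodes = {}
    for caller, callee in edges:
        for name in (caller, callee):
            if name not in nodes:
                nodes[name] = FunctionNode(
                    name=name,
                    category=CATEGORY.get(name, "other"),
                    sensitive=name in SENSITIVE,
                    weight=1.0,  # assumed initial node weight
                )
    return nodes, dict(edges)


calls = [("main", "base64_decode"), ("main", "eval"), ("main", "eval")]
nodes, edges = build_call_tree(calls)
print(edges[("main", "eval")], nodes["eval"].sensitive)
```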

10. A script file processing method, characterized in that it is applied to a cloud server and comprises: obtaining multiple script files sent by an application server; for any first script file and second script file among the multiple script files, obtaining a first function call tree of the first script file and a second function call tree of the second script file; determining the topological similarity between the first function call tree and the second function call tree based on the topological feature vectors of the first function call tree and the second function call tree; determining, based on tree structure information of a remaining tree structure, a variant confidence between the first function call tree and the second function call tree, wherein the remaining tree structure is the tree structure that remains in the first function call tree and the second function call tree after the same subtrees are excluded, and the variant confidence represents the confidence that a variation has occurred between the topological structures of the first function call tree and the second function call tree; determining the similarity between the first script file and the second script file based on the topological similarity and the variant confidence; and classifying the multiple script files based on the similarity between any two script files to obtain a script file classification result.

11. The method according to claim 10, characterized in that it further comprises: performing anomaly detection on the script files of each category to obtain a status label for each category, the status label indicating whether the script files of the corresponding category are normal or abnormal; in response to an abnormal file detection request sent by the application server, the abnormal file detection request including a script file to be identified, determining the category to which the script file to be identified belongs based on the similarity between the script file to be identified and the script files of each category; determining whether the script file to be identified is abnormal based on the status label of the category to which it belongs; and returning an anomaly detection result to the application server, the anomaly detection result indicating whether the script file to be identified is abnormal.
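The classification and lookup flow of claims 10 and 11 can be sketched under stated assumptions. The claims do not name a classification algorithm, so this example groups files whose pairwise similarity reaches a hypothetical `threshold` (single-linkage grouping via union-find) and assigns a new file to the category with the highest average similarity. The `similarity` callable stands in for the combined score of topological similarity and variant confidence.

```python
def classify(names, similarity, threshold=0.8):
    """Group script files whose pairwise similarity reaches the threshold."""
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def detect(candidate, categories, labels, similarity):
    """Return the status label of the category most similar to the candidate.

    categories: list of file-name lists; labels: parallel list of labels.
    """
    def mean_similarity(index):
        members = categories[index]
        return sum(similarity(candidate, m) for m in members) / len(members)

    best = max(range(len(categories)), key=mean_similarity)
    return labels[best]
```

In use, `labels` would come from the per-category anomaly detection step, and the returned label would be sent back to the application server as the detection result.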

12. An electronic device, characterized in that it comprises: a memory and a processor; the memory is used to store a computer program; and the processor is coupled to the memory and is configured to execute the computer program to perform the steps of the method according to any one of claims 1-11.

13. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it causes the processor to perform the steps of the method according to any one of claims 1-11.