An Android application gray behavior classification method, system, device and storage medium
By obtaining the function call graph of the APK file and using a graph embedding neural network model for clustering and real device testing, the problem of incomplete feature extraction in existing technologies is solved, the accuracy of gray behavior classification in Android applications is improved, and the security of Android applications is enhanced.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAZHONG UNIV OF SCI & TECH
- Filing Date
- 2022-10-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies fail to extract comprehensive features when detecting gray behaviors in Android applications, resulting in low classification accuracy and failing to effectively enhance the security of Android applications.
By obtaining the function call graph of the APK file, a graph embedding neural network model is used to generate graph embedding vectors, and clustering operations are performed. Combined with real device testing and decompilation, gray behavior categories in the file clusters are obtained.
It achieves more comprehensive feature extraction, improves the accuracy of gray behavior classification in Android applications, and enhances the security of Android applications.
Figure CN115905897B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of network security technology, and in particular to a method, system, device and storage medium for classifying gray behaviors of Android applications.
Background Technology
[0002] Currently, the Android operating system holds a 69.74% global market share, ranking first in the global mobile operating system market and boasting a massive user base and numerous usage scenarios. Based on Android's open-source nature, users can freely obtain applications with different functions from the open-source platform and install them on various devices. However, it is precisely this open-source nature that makes Android vulnerable to serious security threats, such as privacy breaches, remote control vulnerabilities, and backdoor attacks.
[0003] Empirical studies of unlabeled Android application APK files provided by enterprises and open-source platforms have revealed that some Android application APK files exhibit sensitive operations on the microphone and camera, such as silent recording and SMS remote control. On the one hand, these covert behaviors of applications can lead to the leakage of users' important privacy, such as voice and image data, posing a serious security threat; on the other hand, features like silent recording can record evidence when users are illegally violated, thus protecting users' legitimate rights. This type of application behavior is defined by the academic community as Android gray behavior, and the research task of detecting this gray behavior is called Android gray behavior detection.
[0004] Existing technologies mainly collect unlabeled Android application APK files from the industry, extract basic features such as permissions, API calls, Intents, and strings, combine them with community detection clustering algorithms to obtain clustered APK file clusters, and finally obtain gray behavior categories through manual analysis. However, this feature extraction is relatively simple and may omit key features.
Summary of the Invention
[0005] This invention aims to at least solve the technical problems existing in the prior art. To this end, this invention proposes a method, system, device, and storage medium for classifying gray behaviors in Android applications, which can more comprehensively extract the characteristics of Android installation packages, improve the accuracy of gray behavior classification in Android applications, and enhance the security of Android applications.
[0006] A first aspect of the present invention provides a method for classifying gray behaviors in Android applications, comprising the following steps:
[0007] Obtaining the function call graph of the APK file;
[0008] Inputting the function call graph into the graph embedding neural network model to obtain the graph embedding vector;
[0009] Clustering is performed on the graph embedding vectors to obtain APK file clusters;
[0010] Performing real device testing and decompilation on the files in the APK file cluster to obtain the gray behavior categories of the files in the file cluster.
[0011] According to embodiments of the present invention, at least the following technical effects are achieved:
[0012] This method obtains the function call graph of the APK file, inputs the function call graph into a graph embedding neural network model to obtain graph embedding vectors, performs clustering operations on the graph embedding vectors to obtain APK file clusters, and performs real device testing and decompilation on the files in the APK file clusters to obtain the gray behavior categories of the files in the file clusters. It thereby extracts the features of Android installation packages more comprehensively, improves the accuracy of gray behavior classification of Android applications, and enhances the security of Android applications.
[0013] According to some embodiments of the present invention, obtaining the function call graph of the APK file includes:
[0014] Generate a candidate function call graph for the APK file based on Androguard;
[0015] Calculate the node degree of the candidate function call graph;
[0016] A function call graph is selected from the candidate function call graphs, wherein the degree of the nodes in the function call graph is greater than a preset degree.
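The degree-based selection above can be sketched as follows. This is a minimal illustration, assuming the candidate call graph is already available as a list of (caller, callee) edges. The function names `node_degrees` and `filter_by_degree`, the example method names, and the threshold value are hypothetical and do not come from the patent or from Androguard. Degree is taken here as in-degree plus out-degree, which is only one possible reading of the text.

```python
from collections import defaultdict

def node_degrees(edges):
    """Return the total degree (in + out) of every node in a directed edge list."""
    degree = defaultdict(int)
    for caller, callee in edges:
        degree[caller] += 1
        degree[callee] += 1
    return dict(degree)

def filter_by_degree(edges, preset_degree):
    """Keep only edges whose two endpoints both have degree > preset_degree."""
    degree = node_degrees(edges)
    kept = {node for node, d in degree.items() if d > preset_degree}
    return [(u, v) for u, v in edges if u in kept and v in kept]

# Hypothetical candidate call graph (names are illustrative only)
candidate_edges = [
    ("onCreate", "startRecording"),
    ("onCreate", "initUi"),
    ("startRecording", "MediaRecorder.start"),
    ("onReceive", "startRecording"),
]
selected = filter_by_degree(candidate_edges, preset_degree=1)
print(selected)
```

In this toy graph only `onCreate` and `startRecording` exceed the threshold, so the single edge between them survives.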
[0017] According to some embodiments of the present invention, the graph embedding neural network model is a GPT-GNN model, and the step of inputting the function call graph into the graph embedding neural network model to obtain the graph embedding vector includes:
[0018] Obtain the initial feature vector of each node in the function call graph;
[0019] The function call graph is input into the GPT-GNN model so that the GPT-GNN model obtains the graph embedding vector based on the initial feature vector of each node in the function call graph.
[0020] According to some embodiments of the present invention, obtaining the initial feature vector of each node in the function call graph includes:
[0021] Use APKtool to decompile the APK file to obtain a smali file;
[0022] Each node of the function call graph is matched with the smali code in the smali file to obtain the smali code matching result for each node;
[0023] Based on the smali code matching results and Androguard, a control flow graph corresponding to each node of the function call graph is generated;
[0024] Based on the node positions in the control flow graph and the smali code of the function corresponding to the node, the smali instruction sequence of the function corresponding to the node is calculated;
[0025] The sequence feature vector of the smali instruction sequence is generated according to the SimCSE model, and the initial feature vector of each node is obtained.
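To make the notion of a smali instruction sequence concrete, the sketch below extracts the opcode of each instruction from a smali method body. It is a simplification under stated assumptions: the helper name `smali_opcode_sequence` and the sample method are hypothetical, instructions are taken in textual order rather than ordered by their position in the control flow graph, and the SimCSE encoding step is not reproduced.

```python
def smali_opcode_sequence(method_body):
    """Extract the opcode of each instruction line from a smali method body."""
    opcodes = []
    for raw in method_body.splitlines():
        line = raw.strip()
        # Skip blank lines, directives (.locals), labels (:cond_0) and comments (#)
        if not line or line[0] in ".:#":
            continue
        opcodes.append(line.split()[0])
    return opcodes

# Hypothetical smali method used only for illustration
example = """
.method public startRecording()V
    .locals 1
    new-instance v0, Landroid/media/MediaRecorder;
    invoke-virtual {v0}, Landroid/media/MediaRecorder;->prepare()V
    :cond_0
    invoke-virtual {v0}, Landroid/media/MediaRecorder;->start()V
    return-void
.end method
"""
sequence = smali_opcode_sequence(example)
print(" ".join(sequence))
```

The resulting space-joined string is the kind of text that a sentence embedding model could then turn into a fixed-length vector for the node.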
[0026] According to some embodiments of the present invention, the step of clustering the graph embedding vectors to obtain APK file clusters includes:
[0027] The similarity matrix of the graph embedding vectors is constructed based on the fully connected Gaussian kernel distance, and the adjacency matrix and degree matrix are constructed based on the similarity matrix;
[0028] The standardized Laplacian matrix is calculated based on the adjacency matrix and the degree matrix, wherein the formula for calculating the standardized Laplacian matrix based on the adjacency matrix and the degree matrix is as follows:
[0029]
[0030] Where L is the Laplacian matrix, D is the degree matrix, and W is the adjacency matrix;
[0031] Calculate the first k1 eigenvalues of the Laplacian matrix and the eigenvectors corresponding to the first k1 eigenvalues, where k1 is a preset value;
[0032] The eigenvectors corresponding to the first k1 eigenvalues are standardized by row to form an N×k1 dimensional feature matrix, where N is the total number of APK files;
[0033] The feature matrix is clustered using an improved K-means clustering algorithm to obtain APK file clusters.
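The formula of paragraph [0029] is not reproduced in this text. For orientation only, the expressions below show a Gaussian kernel similarity and the symmetric normalized Laplacian commonly used in spectral clustering, written with the symbols L, D and W defined above. The embedding vectors $z_i$, the bandwidth $\sigma$, and the assumption that the patent uses exactly this normalized form are not taken from the source.

```latex
w_{ij} = \exp\!\left(-\frac{\lVert z_i - z_j \rVert^{2}}{2\sigma^{2}}\right), \qquad
D_{ii} = \sum_{j} w_{ij}, \qquad
L = D^{-1/2}\,(D - W)\,D^{-1/2} = I - D^{-1/2}\, W\, D^{-1/2}
```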
[0034] According to some embodiments of the present invention, the step of clustering the feature vectors using an improved K-means clustering algorithm to obtain APK file clusters includes:
[0035] Step S1: Randomly select k APK files from the APK files as the first cluster centers, where k is the pre-set number of APK file clusters;
[0036] Step S2: Traverse the function call graph of the APK file according to the types defined in the Android official documentation to obtain the type tags of the APK file, and calculate the initial similarity based on the type tags. The formula for calculating the initial similarity based on the type tags is as follows:
[0037]
[0038] Where $\alpha(x, C_k)$ is the initial similarity between the x-th APK file and the cluster center of the k-th APK file cluster, and $\beta$ is a random number between 0 and 1;
[0039] Step S3: Calculate the shortest distance between each APK file and the first cluster center based on the initial similarity, wherein the formula for calculating the shortest distance between each APK file and the cluster center based on the initial similarity is:
[0040] $D(x) = \min\{\mathrm{dist}(x, C_k) \times \alpha(x, C_k),\ x \in X,\ k = 1, 2, \ldots\}$
[0041] Where $X$ is the set of all APK files, $C_k$ is the cluster center of the k-th APK file cluster, and $\mathrm{dist}(x, C_k)$ is the Euclidean distance between the x-th APK file and the cluster center of the k-th APK file cluster;
[0042] Step S4: Classify the APK files into the file clusters corresponding to the cluster centers with the smallest shortest distance, obtaining k second APK file clusters, and calculate the probability of each APK file being selected as the next cluster center based on the shortest distance:
[0043]
[0044] Where P(x) is the probability that the x-th APK file is selected as the next cluster center;
[0045] Step S5: Based on the probability of being selected as the next cluster center, obtain k second cluster centers;
[0046] Step S6: Calculate the shortest distance between each APK file and the second cluster center and the probability of being selected as the next cluster center based on the initial similarity, to obtain the third cluster center and k third APK file clusters. Iterate and train sequentially until the preset maximum number of iterations is reached to obtain the APK file clusters.
[0047] According to some embodiments of the present invention, after clustering the feature vectors using an improved K-means clustering algorithm to obtain APK file clusters, the Android application gray behavior classification method further includes:
[0048] The clustering results are evaluated using a silhouette coefficient based on similarity weights, a Calinski-Harabasz score based on similarity weights, and a Davies-Bouldin index based on similarity weights. The formula for calculating the silhouette coefficient based on similarity weights is as follows:
[0049] $distC = \min\{\mathrm{dist}(i, j_{kc}),\ k = 1, 2, \ldots\}$
[0050]
[0051]
[0052]
[0053] Where $j$ denotes the j-th APK file belonging to the same APK file cluster as the i-th APK file, $j_{kc}$ is the cluster center of the k-th file cluster, $j_k$ is the j-th APK file in the k-th file cluster, $distC$ is the distance between the i-th APK file and the nearest cluster center, $b(i)$ is the distance between the i-th APK file and the nearest different file cluster, $a(i)$ is the intra-cluster cohesion of the i-th APK file, $n_k$ is the number of APK files in the k-th file cluster, and $n$ is the number of APK files in the file cluster containing the i-th APK file.
[0054] The formula for calculating the Calinski-Harabasz score based on similarity weights is as follows:
[0055]
[0056]
[0057]
[0058]
[0059] Where BGSS is the inter-cluster sum of squares, $C$ is the cluster center of all APK files, $WGSS_k$ is the intra-cluster sum of squares of the k-th file cluster, $X_{ik}$ is the i-th APK file in the k-th file cluster, and CH is the Calinski-Harabasz score.
[0060] The formula for calculating the Davies-Bouldin index based on similarity weights is as follows:
[0061]
[0062]
[0063]
[0064]
[0065] Where the formula uses the average distance between the APK files in the i-th file cluster and their cluster center and the average distance between the APK files in the j-th file cluster and their cluster center, $D_{i,j}$ is the intra-cluster distance ratio between the i-th and j-th file clusters, and DB is the Davies-Bouldin index based on similarity weights.
[0066] A second aspect of the present invention provides an Android application gray behavior classification system, the Android application gray behavior classification system comprising:
[0067] The data acquisition module is used to obtain the function call graph of the APK file;
[0068] The data input module is used to input the function call graph into the graph embedding neural network model to obtain the graph embedding vector;
[0069] The data clustering module is used to perform clustering operations on the graph embedding vectors to obtain APK file clusters;
[0070] The data classification module is used to perform real device testing and decompilation on the files in the APK file cluster to obtain the gray behavior categories of the files in the file cluster.
[0071] This system obtains the function call graph of an APK file, inputs the function call graph into a graph embedding neural network model to obtain graph embedding vectors, performs clustering operations on the graph embedding vectors to obtain APK file clusters, and performs real device testing and decompilation on the files in the APK file clusters to obtain the gray behavior categories of the files in the file clusters. It thereby extracts the features of Android installation packages more comprehensively, improves the accuracy of gray behavior classification of Android applications, and enhances the security of Android applications.
[0072] A third aspect of the present invention provides an Android application gray behavior classification electronic device, including at least one control processor and a memory for communicatively connecting to the at least one control processor; the memory stores instructions executable by the at least one control processor, the instructions being executed by the at least one control processor to enable the at least one control processor to perform the above-described Android application gray behavior classification method.
[0073] In a fourth aspect, the present invention provides a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the above-described Android application gray behavior classification method.
[0074] It should be noted that the beneficial effects of the second to fourth aspects of the present invention compared with the prior art are the same as the beneficial effects of the above-described Android application gray behavior classification system compared with the prior art, and will not be described in detail here.
[0075] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Attached Figure Description
[0076] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments taken in conjunction with the following drawings, in which:
[0077] Figure 1 is a flowchart of an Android application gray behavior classification method according to an embodiment of the present invention;
[0078] Figure 2 is a flowchart of step S101 in Figure 1;
[0079] Figure 3 is a flowchart of step S102 in Figure 1;
[0080] Figure 4 is a flowchart of step S301 in Figure 3;
[0081] Figure 5 is a flowchart of step S103 in Figure 1;
[0082] Figure 6 is a flowchart of step S505 in Figure 5;
[0083] Figure 7 is a schematic diagram of the overall process of an Android application gray behavior classification method according to an embodiment of the present invention;
[0084] Figure 8 is a flowchart of an Android application gray behavior classification system according to an embodiment of the present invention.
Detailed Implementation
[0085] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0086] In the description of this invention, the use of terms such as "first," "second," etc., is for the purpose of distinguishing technical features only and should not be construed as indicating or implying relative importance, or implicitly indicating the number of technical features indicated, or implicitly indicating the order of the technical features indicated.
[0087] In the description of this invention, it should be understood that the orientation descriptions, such as up, down, etc., are based on the orientation or positional relationship shown in the drawings and are only for the convenience of describing this invention and simplifying the description, and are not intended to indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this invention.
[0088] In the description of this invention, it should be noted that, unless otherwise explicitly defined, terms such as "setting," "installation," and "connection" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this invention in conjunction with the specific content of the technical solution.
[0089] Currently, the Android operating system holds a 69.74% global market share, ranking first in the global mobile operating system market and boasting a massive user base and numerous usage scenarios. Due to Android's open-source nature, users can freely obtain applications with different functions from the open-source platform and install them on various devices. However, this very open-source nature also exposes Android to serious security threats, such as privacy leaks, remote control vulnerabilities, and backdoor attacks. Existing technologies primarily collect unlabeled Android application APK files from the industry, extract basic features such as permissions, API calls, Intents, and strings, and combine these with community detection clustering algorithms to obtain clustered APK file clusters. Finally, manual analysis is used to categorize gray behaviors. However, this feature extraction is relatively simple and may omit crucial features.
[0090] To address the aforementioned technical deficiencies, and referring to Figure 1, the present invention provides a method for classifying gray behaviors in Android applications, including:
[0091] Step S101: Obtain the function call graph of the APK file.
[0092] Step S102: Input the function call graph into the graph embedding neural network model to obtain the graph embedding vector.
[0093] Step S103: Perform clustering operation on the graph embedding vector to obtain APK file clusters.
[0094] Step S104: Perform real device testing and decompilation on the files in the APK file cluster to obtain the gray behavior categories of the files in the file cluster.
[0095] This method obtains the function call graph of the APK file, inputs the function call graph into a graph embedding neural network model to obtain graph embedding vectors, performs clustering operations on the graph embedding vectors to obtain APK file clusters, and performs real device testing and decompilation on the files in the APK file clusters to obtain the gray behavior categories of the files in the file clusters. It thereby extracts the features of Android installation packages more comprehensively, improves the accuracy of gray behavior classification of Android applications, and enhances the security of Android applications.
[0096] Referring to Figure 2, in some embodiments, step S101 may include, but is not limited to, steps S201 to S203:
[0097] Step S201: Generate a candidate function call graph for the APK file based on Androguard.
[0098] Step S202: Calculate the node degree of the candidate function call graph.
[0099] Step S203: Select a function call graph from the candidate function call graph, wherein the degree of the nodes in the function call graph is greater than a preset degree.
[0100] Referring to Figure 3, in some embodiments, the graph embedding neural network model is a GPT-GNN model, and step S102 may include, but is not limited to, steps S301 to S302:
[0101] Step S301: Obtain the initial feature vector of each node in the function call graph.
[0102] Step S302: Input the function call graph into the GPT-GNN model so that the GPT-GNN model obtains the graph embedding vector based on the initial feature vector of each node in the function call graph. The GPT-GNN model is based on node prediction, edge prediction and mutual information contrastive learning, and the graph embedding vector has a dimension of 128.
[0103] Referring to Figure 4, in some embodiments, step S301 may include, but is not limited to, steps S401 to S405:
[0104] Step S401: Use APKtool to decompile the APK file to obtain the smali file.
[0105] Step S402: Match each node of the function call graph with the smali code in the smali file to obtain the smali code matching result for each node.
[0106] Step S403: Based on the smali code matching results and Androguard, generate the control flow graph corresponding to each node of the function call graph.
[0107] Step S404: Calculate the smali instruction sequence of the function corresponding to the node based on the node position in the control flow graph and the smali code of the function corresponding to the node.
[0108] Step S405: Generate the sequence feature vector of the smali instruction sequence based on the SimCSE model to obtain the initial feature vector of each node.
[0109] This invention uses node embedding instead of whole graph embedding to avoid losing detailed features between functions.
[0110] Referring to Figure 5, in some embodiments, step S103 may include, but is not limited to, steps S501 to S505:
[0111] Step S501: Construct a similarity matrix of graph embedding vectors based on the fully connected Gaussian kernel distance, and construct an adjacency matrix and a degree matrix based on the similarity matrix.
[0112] Step S502: Calculate the standardized Laplacian matrix based on the adjacency matrix and the degree matrix. The formula for calculating the standardized Laplacian matrix based on the adjacency matrix and the degree matrix is as follows:
[0113]
[0114] Where L is the Laplacian matrix, D is the degree matrix, and W is the adjacency matrix.
[0115] Step S503: Calculate the first k1 eigenvalues of the Laplacian matrix and the eigenvectors corresponding to the first k1 eigenvalues, where k1 is a preset value.
[0116] Step S504: Standardize the eigenvectors corresponding to the first k1 eigenvalues by row to form an N×k1 dimensional feature matrix, where N is the total number of APK files.
[0117] Step S505: Cluster the feature matrix using the improved K-means clustering algorithm to obtain APK file clusters.
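Steps S501, S502 and S504 can be illustrated with a small pure-Python sketch. Several points are assumptions rather than statements from the patent: the bandwidth parameter `sigma`, the zero diagonal of the similarity matrix, the symmetric normalized form L = I - D^{-1/2} W D^{-1/2} (the formula of paragraph [0113] is not reproduced in this text), and all function names. The eigen-decomposition of step S503 is omitted and would typically be delegated to a numerical library.

```python
import math

def gaussian_similarity(vectors, sigma=1.0):
    """Fully connected similarity: W[i][j] = exp(-||zi - zj||^2 / (2 sigma^2))."""
    n = len(vectors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # assumption: no self-similarity on the diagonal
                sq = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
                W[i][j] = math.exp(-sq / (2.0 * sigma ** 2))
    return W

def normalized_laplacian(W):
    """Assumed form: L = I - D^{-1/2} W D^{-1/2}, with D the row-sum degree matrix."""
    n = len(W)
    deg = [sum(row) for row in W]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            L[i][j] = identity - W[i][j] / math.sqrt(deg[i] * deg[j])
    return L

def normalize_rows(matrix):
    """Scale each row to unit Euclidean length (the row standardization step)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out

# Toy 2-dimensional embeddings standing in for the 128-dimensional vectors
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
W = gaussian_similarity(embeddings, sigma=1.0)
L = normalized_laplacian(W)
```

After the eigenvectors for the first k1 eigenvalues are stacked as columns, `normalize_rows` would be applied to the resulting N×k1 matrix before clustering.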
[0118] Referring to Figure 6, in some embodiments, step S505 may include, but is not limited to, steps S601 to S606:
[0119] Step S601: Randomly select k APK files from the APK files as the first cluster centers, where k is the pre-set number of APK file clusters.
[0120] Step S602: Traverse the function call graph of the APK file according to the types defined in the Android official documentation to obtain the type labels of the APK file. Calculate the initial similarity based on the type labels. The formula for calculating the initial similarity based on the type labels is as follows:
[0121]
[0122] Where $\alpha(x, C_k)$ is the initial similarity between the x-th APK file and the cluster center of the k-th APK file cluster, and $\beta$ is a random number between 0 and 1.
[0123] Step S603: Calculate the shortest distance between each APK file and the first cluster center based on the initial similarity. The formula for calculating the shortest distance between each APK file and the cluster center based on the initial similarity is as follows:
[0124] $D(x) = \min\{\mathrm{dist}(x, C_k) \times \alpha(x, C_k),\ x \in X,\ k = 1, 2, \ldots\}$
[0125] Where $X$ is the set of all APK files, $C_k$ is the cluster center of the k-th APK file cluster, and $\mathrm{dist}(x, C_k)$ is the Euclidean distance between the x-th APK file and the cluster center of the k-th APK file cluster.
[0126] Step S604: Classify the APK files into the file clusters corresponding to the cluster centers with the smallest shortest distance, obtaining k second APK file clusters, and calculate the probability of each APK file being selected as the next cluster center based on the shortest distance:
[0127]
[0128] Where P(x) is the probability that the x-th APK file is selected as the next cluster center.
[0129] Step S605: Based on the probability of being selected as the next cluster center, obtain k second cluster centers.
[0130] Step S606: Calculate the shortest distance between each APK file and the second cluster center and the probability of being selected as the next cluster center based on the initial similarity, to obtain the third cluster center and k third APK file clusters. Iterate and train in sequence until the preset maximum number of iterations is reached to obtain the APK file clusters.
[0131] The improved K-means algorithm avoids the local optima that random selection of the initial cluster centers may cause, and introduces the initial similarity of APK files into the similarity calculation, which improves the accuracy of distance calculation between APK files.
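The weighted shortest distance of step S603 and the selection probability of step S604 can be sketched as below. Two points are assumptions: the similarity weight is passed in as a callable because its formula (paragraph [0121]) is not reproduced in this text, and the probability rule P(x) = D(x)^2 / sum of D(x')^2 is borrowed from the k-means++ initialization scheme rather than confirmed by the source. All function names and the toy data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def weighted_shortest_distance(x, centers, alpha):
    """Return (D(x), k) where D(x) = min over k of dist(x, C_k) * alpha(x, C_k)."""
    scored = [(euclidean(x, c) * alpha(x, c), k) for k, c in enumerate(centers)]
    return min(scored)

def selection_probabilities(points, centers, alpha):
    """Assumed k-means++ style rule: P(x) = D(x)^2 / sum of D(x')^2."""
    squared = [weighted_shortest_distance(x, centers, alpha)[0] ** 2 for x in points]
    total = sum(squared)
    if total == 0.0:  # every point coincides with a center
        return [1.0 / len(points)] * len(points)
    return [s / total for s in squared]

def alpha(x, center):
    """Placeholder weight; the patent derives this value from APK type labels."""
    return 1.0

points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 3.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
assignments = [weighted_shortest_distance(x, centers, alpha)[1] for x in points]
probabilities = selection_probabilities(points, centers, alpha)
print(assignments, probabilities)
```

With the placeholder weight fixed at 1.0 the sketch reduces to ordinary distance-based assignment; a weight below 1.0 for files that share a type label with a center would pull them toward that center.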
[0132] In some embodiments, after clustering the feature vectors using an improved K-means clustering algorithm to obtain APK file clusters, the Android application gray behavior classification method further includes steps S701 to S703:
[0133] Step S701: The clustering results are evaluated using the silhouette coefficient based on similarity weights, the Calinski-Harabasz score based on similarity weights, and the Davies-Bouldin index based on similarity weights. The formula for calculating the silhouette coefficient based on similarity weights is as follows:
[0134] $distC = \min\{\mathrm{dist}(i, j_{kc}),\ k = 1, 2, \ldots\}$
[0135]
[0136]
[0137]
[0138] Where $j$ denotes the j-th APK file belonging to the same APK file cluster as the i-th APK file, $j_{kc}$ is the cluster center of the k-th file cluster, $j_k$ is the j-th APK file in the k-th file cluster, $distC$ is the distance between the i-th APK file and the nearest cluster center, $b(i)$ is the distance between the i-th APK file and the nearest different file cluster, $a(i)$ is the intra-cluster cohesion of the i-th APK file, $n_k$ is the number of APK files in the k-th file cluster, and $n$ is the number of APK files in the file cluster containing the i-th APK file.
[0139] Introducing α(i,j) can better characterize the original similarity relationship between the i-th APK file and the j-th APK file, which is beneficial for characterizing the intra-cluster cohesion of APK files in the same cluster. At the same time, by first calculating the distance from the i-th APK file to the nearest cluster center and using this as a threshold to calculate b(i), the computational efficiency of b(i) is improved.
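As a baseline for comparison, the sketch below computes the standard, unweighted silhouette coefficient. It does not reproduce the similarity weight α(i,j) or the distC threshold described above, whose formulas are not preserved in this text; the function names and toy data are hypothetical, and at least two clusters are assumed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def silhouette_coefficient(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: s(i) is conventionally 0
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficient(points, [0, 0, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1])
print(good, bad)
```

A well-separated labeling scores close to 1, while a labeling that mixes the two groups scores below 0.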
[0140] Step S702: The formula for calculating the Calinski-Harabasz score based on similarity weights is as follows:
[0141]
[0142]
[0143]
[0144]
[0145] Where BGSS is the inter-cluster sum of squares, $C$ is the cluster center of all APK files, $WGSS_k$ is the intra-cluster sum of squares of the k-th file cluster, $X_{ik}$ is the i-th APK file in the k-th file cluster, and CH is the Calinski-Harabasz score.
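For reference, the standard, unweighted Calinski-Harabasz score can be computed as in the sketch below, with BGSS and WGSS as defined above. The similarity weighting of the patent is not reproduced because its formulas are not preserved in this text; the function names and toy data are hypothetical, and the sketch assumes at least two clusters and more points than clusters.

```python
def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def calinski_harabasz(points, labels):
    """CH = (BGSS / (K - 1)) / (WGSS / (N - K)) in its standard, unweighted form."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    overall = centroid(points)
    bgss = 0.0  # between-cluster sum of squares
    wgss = 0.0  # within-cluster sum of squares, summed over clusters
    for members in clusters.values():
        center = centroid(members)
        bgss += len(members) * squared_distance(center, overall)
        wgss += sum(squared_distance(p, center) for p in members)
    k, n = len(clusters), len(points)
    return (bgss / (k - 1)) / (wgss / (n - k))

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
score = calinski_harabasz(points, [0, 0, 1, 1])
print(score)
```

In the toy data BGSS is 100 and WGSS is 4, so with K = 2 and N = 4 the score is 50; larger values indicate tighter, better separated clusters.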
[0146] Step S703: The formula for calculating the Davies-Bouldin index based on similarity weights is as follows:
[0147]
[0148]
[0149]
[0150]
[0151] Where the formula uses the average distance between the APK files in the i-th file cluster and their cluster center and the average distance between the APK files in the j-th file cluster and their cluster center, $D_{i,j}$ is the intra-cluster distance ratio between the i-th and j-th file clusters, and DB is the Davies-Bouldin index based on similarity weights.
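The standard, unweighted Davies-Bouldin index is sketched below as a point of reference. For each cluster it takes the worst ratio of summed average scatter to center distance, then averages over clusters. The similarity weighting of the patent is not reproduced, and the function names and toy data are hypothetical; at least two clusters with distinct centers are assumed.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def davies_bouldin(points, labels):
    """DB = mean over clusters i of max over j != i of (s_i + s_j) / d(C_i, C_j)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centers = {k: centroid(m) for k, m in clusters.items()}
    # Average distance of each cluster's members to that cluster's center
    scatter = {
        k: sum(euclidean(p, centers[k]) for p in m) / len(m)
        for k, m in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
index = davies_bouldin(points, [0, 0, 1, 1])
print(index)
```

Each toy cluster has an average scatter of 1 and the centers are 10 apart, giving an index of 0.2; lower values indicate better clustering.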
[0152] Referring to Figure 7, to facilitate understanding by those skilled in the art, a set of preferred embodiments is provided below:
[0153] I. Experimental Environment Setup:
[0154] The experimental environment for this embodiment is a 64-bit Ubuntu system server with an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz and 64GB of memory. The following are the tools used in this embodiment:
[0155] Androguard: Androguard is a framework for analyzing Android APKs, developed in Python, used to generate function call graphs and control flow graphs for applications.
[0156] PyTorch: PyTorch is an open-source deep learning framework from Facebook that supports a large number of machine learning algorithms and is used to build GPT-GNN models.
[0157] SimCSE Embedding Generator: This tool is an open-source tool used to directly convert smali instruction sequence information in the control flow graph into sequence feature vectors in the form of Sentence Embedding.
[0158] II. Collect APK files:
[0159] A large number of APK files from the past three years were downloaded from the AndroZoo platform, ignoring their built-in tags, to create a collection of 100,000 untagged APK files.
[0160] Simultaneously, APK files from well-known domestic app stores were collected and added to the APK file collection, with a total of 20,000 APK files downloaded from 360 App Store, Huawei App Market, and Baidu App Market. In addition, a dataset of malicious APK files was collected from industry partners (NSFOCAC), including loan apps, pornographic apps, and fraudulent apps—malicious applications that may leak users' voice and portrait privacy information.
[0161] The Android application APK file deduplication operation is performed by comparing the SHA256 hash codes of all APK files. If two APK files have the same SHA256 hash code, they are considered to be the same APK file, and only one is kept in the APK file set.
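The SHA256-based deduplication can be sketched with the Python standard library. The helper names `sha256_of_file` and `deduplicate`, the chunk size, and the temporary sample files are illustrative assumptions; the sketch keeps the first file seen for each digest.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large APKs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(paths):
    """Keep the first path seen for every distinct SHA-256 digest."""
    unique = {}
    for path in paths:
        unique.setdefault(sha256_of_file(path), path)
    return list(unique.values())

# Demonstration with throwaway files standing in for APKs
with tempfile.TemporaryDirectory() as folder:
    contents = {
        "a.apk": b"same bytes",
        "b.apk": b"other bytes",
        "a_copy.apk": b"same bytes",
    }
    paths = []
    for name, data in contents.items():
        path = os.path.join(folder, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    unique_names = [os.path.basename(p) for p in deduplicate(paths)]

print(unique_names)
```

Here `a_copy.apk` has the same digest as `a.apk` and is dropped, leaving two files.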
[0162] The 120,000 unlabeled APK files collected were divided into two groups in a 4:6 ratio. 48,000 of these APK files were selected as the APK file set for pre-training the GPT-GNN model, and 72,000 of these APK files were selected as the APK file set for detecting gray behaviors.
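The 4:6 split works out to 120,000 × 0.4 = 48,000 files for pre-training and 120,000 × 0.6 = 72,000 files for detection. A minimal sketch of such a split follows; shuffling before the cut, the seed, and the function name `split_dataset` are assumptions, since the text does not say how files were assigned to each group.

```python
import random

def split_dataset(items, pretrain_fraction=0.4, seed=0):
    """Shuffle deterministically, then cut into pre-training and detection sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * pretrain_fraction))
    return shuffled[:cut], shuffled[cut:]

apk_ids = range(120_000)  # integer stand-ins for APK files
pretrain_set, detection_set = split_dataset(apk_ids)
print(len(pretrain_set), len(detection_set))
```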
[0163] The final APK file collection results are shown in Table 1.
[0164]
[0165] Table 1
[0166] III. Obtaining the function call graph of the APK files:
[0167] Use APKtool to decompile the APK file to obtain the smali file.
[0168] Each node in the function call graph is matched with the smali code in the smali file to obtain the smali code matching result for each node.
[0169] Based on the smali code matching results and Androguard, the control flow graph corresponding to each node of the function call graph is generated.
[0170] Based on the node positions in the control flow graph and the smali code of the corresponding function, the smali instruction sequence of the function corresponding to the node is calculated.
[0171] The sequence feature vector of the smali instruction sequence is generated based on the SimCSE model, and the initial feature vector of each node is obtained.
[0172] The function call graph is input into the GPT-GNN model so that the GPT-GNN model can obtain a graph embedding vector based on the initial feature vector of each node in the function call graph. The GPT-GNN model is based on node prediction, edge prediction and mutual information contrastive learning, and the graph embedding vector has a dimension of 128.
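As a rough illustration of turning a function's smali code into an instruction sequence, the sketch below collects the opcode of each instruction line, grouped by method. It is a simplification under stated assumptions: it reads instructions in file order rather than following node positions in the Androguard control flow graph, it does not treat payload blocks such as array data specially, and the helper name `smali_instruction_sequences` is hypothetical.

```python
def smali_instruction_sequences(smali_text):
    """Map each method signature to the list of opcodes in its body.

    Directives (lines starting with '.'), labels (':') and comments ('#')
    are skipped; the first token of every other line is taken as the opcode.
    """
    sequences = {}
    current = None
    for raw in smali_text.splitlines():
        line = raw.strip()
        if line.startswith(".method"):
            current = line.split()[-1]  # e.g. "foo()V"
            sequences[current] = []
        elif line.startswith(".end method"):
            current = None
        elif current is not None and line and line[0] not in ".:#":
            sequences[current].append(line.split()[0])
    return sequences
```

The resulting opcode lists are the kind of text sequence that a sentence-embedding model could then convert into a node's initial feature vector.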
[0173] IV. Obtaining the graph embedding vector:
[0174] The output of graph embedding models falls into two categories. One category directly outputs the complete embedding vector of the input graph, directly representing the embedding vector of the entire graph. The other category outputs the node embeddings of each node in the graph. Finally, the node embeddings are integrated as the embedding of the input graph. Common integration methods include summation, alignment, etc. In this embodiment, node embedding is used instead of whole-graph embedding, thereby avoiding the loss of detailed features between functions.
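To make the idea of integrating node embeddings into a graph-level embedding concrete, here is a minimal sketch of the summation readout mentioned above. The mean variant is added only for comparison and is an assumption, not something stated in this embodiment; both function names are hypothetical.

```python
def sum_readout(node_embeddings):
    """Element-wise sum of all node embeddings, giving one graph-level vector."""
    if not node_embeddings:
        raise ValueError("graph has no nodes")
    return [sum(values) for values in zip(*node_embeddings)]


def mean_readout(node_embeddings):
    """Element-wise mean of all node embeddings (assumed alternative)."""
    total = sum_readout(node_embeddings)
    return [value / len(node_embeddings) for value in total]
```

Summation keeps information about graph size, while the mean is insensitive to the number of functions in the APK; which is preferable depends on whether graph size should influence clustering.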
[0175] The function call graph is input into the GPT-GNN model, which is based on node prediction, edge prediction, and mutual information contrastive learning.
[0176] The GPT-GNN model obtains a graph embedding vector based on the initial feature vector of each node in the function call graph. The graph embedding vector has a dimension of 128, the learning rate is set to 0.0001, and the number of iterations is set to 50.
[0177] This invention employs a GPT-GNN model pre-training task that combines node prediction, edge prediction, and mutual information contrastive learning, constructing a pre-trained APK file pair. Some nodes in the input graph are masked, and the network is trained to re-predict the masked target nodes based on their contextual relationships.
[0178] The steps for pre-training the GPT-GNN model using mutual information contrastive learning are as follows:
[0179] Step 1: Construct positive and negative sample instances in the original space. The positive instance consists of the adjacency matrix $A$ and the attribute information $X$ of the input graph. The negative instance is obtained by applying random transformations to $A$ and $X$ and is denoted $\tilde{A}$ and $\tilde{X}$.
[0180] Step 2: Feed $(X, A)$ and $(\tilde{X}, \tilde{A})$ into the GPT-GNN model encoder to generate the corresponding latent-space nodes $(H, A)$ and $(\tilde{H}, \tilde{A})$.
[0181] Step 3: Use the readout function to transform the nodes $H$ of the positive example graph into the whole-graph representation $s$ of the positive example graph in the latent space.
[0182] Step 4: The goal of the GPT-GNN model pre-training task is to maximize the mutual information between each node $h_i$ of the positive example graph in the latent space and the whole-graph representation $s$, while minimizing the mutual information between each node $\tilde{h}_j$ of the negative example graph and the whole-graph representation $s$ of the positive example graph.
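The four steps above follow the general pattern of contrastive pre-training between node representations and a whole-graph summary. The sketch below illustrates that pattern in plain Python under several assumptions that are not taken from this embodiment: the encoder is a stand-in that averages each node's features with its neighbours' (not the GPT-GNN encoder), the negative instance is built by shuffling only the attribute rows, the discriminator is a sigmoid of a dot product, and the loss is binary cross-entropy.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(features, adjacency):
    """Stand-in encoder: average each node's features with its neighbours'."""
    hidden = []
    for i, row in enumerate(adjacency):
        group = [i] + [j for j, linked in enumerate(row) if linked and j != i]
        hidden.append([sum(features[j][d] for j in group) / len(group)
                       for d in range(len(features[0]))])
    return hidden


def readout(hidden):
    """Whole-graph summary: sigmoid of the mean node representation."""
    return [sigmoid(sum(column) / len(hidden)) for column in zip(*hidden)]


def score(node, summary):
    """Discriminator: probability that a node belongs to the summarised graph."""
    return sigmoid(sum(a * b for a, b in zip(node, summary)))


def contrastive_loss(features, adjacency, rng):
    """Binary cross-entropy that pulls positive nodes towards the summary
    and pushes negative (corrupted) nodes away from it."""
    corrupted = features[:]
    rng.shuffle(corrupted)  # negative instance: shuffled attributes, same adjacency
    positive = encode(features, adjacency)
    negative = encode(corrupted, adjacency)
    summary = readout(positive)
    eps = 1e-12  # guards against log(0)
    terms = [math.log(score(h, summary) + eps) for h in positive]
    terms += [math.log(1.0 - score(h, summary) + eps) for h in negative]
    return -sum(terms) / len(terms)
```

Minimizing this loss with respect to the encoder parameters is one common way to approximate the mutual information objective of Step 4; here the encoder has no trainable parameters, so the function only shows how the loss is assembled.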
[0183] V. Output APK file cluster:
[0184] Step 11: Construct a similarity matrix for graph embedding vectors based on the fully connected Gaussian kernel distance, and construct an adjacency matrix and degree matrix based on the similarity matrix;
[0185] Step 12: Calculate the standardized Laplacian matrix based on the adjacency matrix and the degree matrix. The formula for calculating the standardized Laplacian matrix based on the adjacency matrix and the degree matrix is as follows:
[0186]
[0187] Where L is the Laplacian matrix, D is the degree matrix, and W is the adjacency matrix.
[0188] Step 13: Calculate the first k1 eigenvalues of the Laplacian matrix and the eigenvectors corresponding to the first k1 eigenvalues, where k1 is a preset value.
[0189] Step 14: Standardize the eigenvectors corresponding to the first k1 eigenvalues by row to form an N×k1 dimensional feature matrix, where N is the total number of APK files.
[0190] Step 15: Cluster the feature matrix using the improved K-means clustering algorithm to obtain APK file clusters.
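Steps 11, 12 and 14 can be sketched as follows. The normalization formula itself is not reproduced in the text, so this example assumes the common symmetric form $L = D^{-1/2}(D - W)D^{-1/2}$; the bandwidth `sigma`, the absence of self-loops, and all function names are likewise assumptions. The eigen-decomposition of Step 13 is omitted, since it would normally be delegated to a numerical library.

```python
import math


def gaussian_similarity(vectors, sigma=1.0):
    """Fully connected similarity: W[i][j] = exp(-||xi - xj||^2 / (2 sigma^2))."""
    n = len(vectors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # assumption: no self-loops, so the diagonal stays 0
                sq = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
                w[i][j] = math.exp(-sq / (2.0 * sigma ** 2))
    return w


def normalized_laplacian(w):
    """Assumed symmetric normalisation: L = D^(-1/2) (D - W) D^(-1/2)."""
    degree = [sum(row) for row in w]  # diagonal of the degree matrix D
    n = len(w)
    return [[(1.0 if i == j else 0.0) - w[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(n)] for i in range(n)]


def normalize_rows(matrix):
    """Scale each row to unit Euclidean length (the row standardisation of Step 14)."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        result.append([v / norm for v in row])
    return result
```

With at least two graph embedding vectors every degree is positive, so the division is well defined unless the similarities underflow to zero for extremely distant points.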
[0191] In step 15, the feature matrix is clustered using an improved K-means clustering algorithm to obtain APK file clusters, including:
[0192] Step 21: Randomly select k APK files from the APK files as the first cluster centers, where k is the pre-set number of APK file clusters.
[0193] Step 22: Traverse the function call graph of the APK file according to the types defined in the official Android documentation to obtain the type labels of the APK file. Calculate the initial similarity based on the type labels. The formula for calculating the initial similarity based on the type labels is as follows:
[0194]
[0195] Where $\alpha(x, C_k)$ represents the initial similarity between the x-th APK file and the cluster center of the k-th APK file cluster, and $\beta$ is a random number between 0 and 1.
[0196] Step 23: Calculate the shortest distance between each APK file and the first cluster center based on the initial similarity. The formula for calculating the shortest distance between each APK file and the cluster center based on the initial similarity is as follows:
[0197] $D(x) = \min\{\mathrm{dist}(x, C_k) \times \alpha(x, C_k),\ x \in X,\ k = 1, 2, \ldots\}$
[0198] Where $X$ is the set of all APK files, $C_k$ is the cluster center of the k-th APK file cluster, and $\mathrm{dist}(x, C_k)$ represents the Euclidean distance between the x-th APK file and the cluster center of the k-th APK file cluster.
[0199] Step 24: Classify the APK files into the file clusters corresponding to the cluster centers with the smallest shortest distance, obtaining k second APK file clusters, and calculate the probability of each APK file being selected as the next cluster center based on the shortest distance:
[0200]
[0201] Where P(x) is the probability that the x-th APK file is selected as the next cluster center.
[0202] Step 25: Based on the probability of being selected as the next cluster center, obtain k second cluster centers.
[0203] Step 26: Calculate the shortest distance between each APK file and the second cluster center and the probability of being selected as the next cluster center based on the initial similarity. This yields the third cluster center and k third APK file clusters. Iterate through the training until the preset maximum number of iterations is reached to obtain the APK file clusters.
[0204] The improved K-means algorithm avoids the local-optimum defect that randomly selecting the initial cluster centers may cause, and it introduces the initial similarity of APK files into the similarity calculation, which improves the accuracy of the distance calculation between APK files.
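Steps 23 and 24 can be illustrated with the short sketch below. The shortest-distance formula follows D(x) as given above. The initial similarity is passed in as a caller-supplied function `alpha`, because its formula is not reproduced in the text, and the selection probability assumes the usual K-means++ rule P(x) = D(x)² / Σ D(x')², which is an assumption rather than the formula of this embodiment. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def shortest_weighted_distance(x, centers, alpha):
    """D(x) = min over k of dist(x, C_k) * alpha(x, C_k)."""
    return min(euclidean(x, c) * alpha(x, c) for c in centers)


def assign_to_clusters(points, centers, alpha):
    """Index of the center with the smallest weighted distance, per point."""
    labels = []
    for x in points:
        weighted = [euclidean(x, c) * alpha(x, c) for c in centers]
        labels.append(weighted.index(min(weighted)))
    return labels


def selection_probabilities(points, centers, alpha):
    """Assumed K-means++ rule: P(x) = D(x)^2 / sum of D(x')^2."""
    squared = [shortest_weighted_distance(x, centers, alpha) ** 2 for x in points]
    total = sum(squared)
    if total == 0.0:  # every point coincides with a center
        return [1.0 / len(points)] * len(points)
    return [s / total for s in squared]
```

Points that already coincide with a center get probability zero, so new centers are drawn preferentially from files that are far from, or dissimilar to, every current center.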
[0205] After clustering the feature vectors using the improved K-means clustering algorithm to obtain APK file clusters, the clustering results are evaluated using the silhouette coefficient based on similarity weights, the Calinski-Harabasz score based on similarity weights, and the Davies-Bouldin index based on similarity weights. The formula for calculating the silhouette coefficient based on similarity weights is as follows:
[0206] $\mathrm{distC} = \min\{\mathrm{dist}(i, j_{kc}),\ k = 1, 2, \ldots\}$
[0207]
[0208]
[0209]
[0210] Where $j$ represents the j-th APK file belonging to the same APK file cluster as the i-th APK file, $j_{kc}$ is the cluster center of the k-th file cluster, $j_k$ is the j-th APK file in the k-th file cluster, distC is the distance between the i-th APK file and the nearest cluster center, $b(i)$ is the distance between the i-th APK file and the nearest different file cluster, $a(i)$ is the intra-cluster cohesion of the i-th APK file, $n_k$ is the number of APK files in the k-th file cluster, and $n$ is the number of APK files in the file cluster containing the i-th APK file.
[0211] Introducing α(i,j) can better characterize the original similarity relationship between the i-th APK file and the j-th APK file, which is beneficial for characterizing the intra-cluster cohesion of APK files in the same cluster. At the same time, by first calculating the distance from the i-th APK file to the nearest cluster center and using this as a threshold to calculate b(i), the computational efficiency of b(i) is improved.
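For orientation, the sketch below computes distC as defined above together with the textbook, unweighted silhouette value s(i) = (b(i) - a(i)) / max(a(i), b(i)). It does not include the similarity weight α(i,j) or the distC-based thresholding of this embodiment, whose formulas are not reproduced in the text; the function names are hypothetical and at least two clusters are assumed.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean(values):
    return sum(values) / len(values)


def nearest_center_distance(point, centers):
    """distC: distance from a point to its nearest cluster center."""
    return min(euclidean(point, c) for c in centers)


def silhouette(index, points, labels):
    """Textbook silhouette of one point; requires at least two clusters."""
    own = labels[index]
    distances = {}
    for j, (p, label) in enumerate(zip(points, labels)):
        if j != index:
            distances.setdefault(label, []).append(euclidean(points[index], p))
    if own not in distances:  # the point is alone in its cluster
        return 0.0
    a = mean(distances[own])  # intra-cluster cohesion
    b = min(mean(d) for label, d in distances.items() if label != own)
    denom = max(a, b)
    return 0.0 if denom == 0.0 else (b - a) / denom
```

Values close to 1 indicate that a file sits well inside its own cluster; values near 0 or below indicate overlap with a neighbouring cluster.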
[0212] Step 31: The formula for calculating the Calinski-Harabasz score based on similarity weights is as follows:
[0213]
[0214]
[0215]
[0216]
[0217] Where BGSS is the inter-cluster sum of squares, $C$ is the cluster center of all APK files, WGSS is the intra-cluster sum of squares, $WGSS_k$ is the sum of squares of the k-th file cluster, $X_{ik}$ is the i-th APK file in the k-th file cluster, and CH is the Calinski-Harabasz score.
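The quantities defined above can be illustrated with the standard, unweighted Calinski-Harabasz score, CH = (BGSS / (K - 1)) / (WGSS / (N - K)), where K is the number of clusters and N the number of files. The similarity weighting used in this embodiment is not included because its formula is not reproduced in the text; the function names are hypothetical.

```python
def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def calinski_harabasz(clusters):
    """Unweighted CH score; `clusters` is a list of lists of points."""
    all_points = [p for cluster in clusters for p in cluster]
    n, k = len(all_points), len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least two clusters and more points than clusters")
    overall = centroid(all_points)  # C: center of all files
    centers = [centroid(cluster) for cluster in clusters]
    # BGSS: size-weighted squared distance of each cluster center from C
    bgss = sum(len(cluster) * squared_distance(center, overall)
               for cluster, center in zip(clusters, centers))
    # WGSS: sum over clusters of squared distances to the cluster's own center
    wgss = sum(squared_distance(p, center)
               for cluster, center in zip(clusters, centers) for p in cluster)
    return (bgss / (k - 1)) / (wgss / (n - k))
```

A higher score means clusters are compact and well separated; the function raises `ZeroDivisionError` if every point coincides with its cluster center.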
[0218] Step 32: The formula for calculating the Davies-Bouldin index based on similarity weights is as follows:
[0219]
[0220]
[0221]
[0222]
[0223] Where the first average-distance term is the average distance between the APK files of the i-th file cluster and its cluster center, the second average-distance term is the average distance between the APK files of the j-th file cluster and its cluster center, $D_{i,j}$ is the intra-cluster distance ratio between the i-th and j-th file clusters, and DB is the Davies-Bouldin index based on similarity weights.
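As a point of reference, the standard, unweighted Davies-Bouldin index can be sketched as below: for each pair of clusters the sum of their average distances to their own centers is divided by the distance between the two centers, and the index is the mean over clusters of the worst such ratio. The similarity weighting of this embodiment is not included because its formula is not reproduced in the text; the function names are hypothetical.

```python
import math


def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def davies_bouldin(clusters):
    """Unweighted DB index; `clusters` is a list of lists of points."""
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(cluster) for cluster in clusters]
    # average distance of each cluster's members to its own center
    scatter = [sum(euclidean(p, center) for p in cluster) / len(cluster)
               for cluster, center in zip(clusters, centers)]
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k
```

Lower values indicate better clustering; two clusters with identical centers would cause a division by zero, which this sketch does not guard against.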
[0224] Table 2 shows the clustering evaluation indicators for different clustering results.
[0225]
[0226]
[0227] Table 2
[0228] VI. Manually analyze and confirm the behavioral classification of APK files within clustered APK file clusters:
[0229] Select the top 100 Android application APK files in each APK file cluster for manual analysis. This mainly consists of two steps: (1) real device testing and (2) decompiling to identify code snippets.
[0230] During the real device testing phase, a Google Pixel 3 phone running Android 9 was used. The APK files in the APK file cluster were installed and run on the device, and the behavior of each application was observed during the test. If the application exhibited a gray behavior such as embedding ads, that behavior could be captured during the test. This testing process also excludes corrupted APK files and APK files with outdated API development versions.
[0231] The decompilation and code segment location operation involves using decompilation tools to attempt to locate relevant code segments based on gray behaviors captured during real device testing, achieving code-level gray behavior judgment. The APK decompilation tools used in this invention include APKtool and JEB.
[0232] This invention identified two novel gray behaviors in the final classification results: silent recording and SMS remote control. Representative APK files detected were the open-source applications Echo and Simples Remote downloaded from the F-Droid platform. Compared to existing research, this invention offers more comprehensive APK feature extraction and identifies two new gray behavior categories, thus expanding the benchmark dataset for existing gray software research.
[0233] Additionally, referring to Figure 8, an embodiment of the present invention provides an Android application gray behavior classification system, including a data acquisition module 1100, a data input module 1200, a data clustering module 1300, and a data classification module 1400, wherein:
[0234] The data acquisition module 1100 is used to obtain the function call graph of the APK file.
[0235] The data input module 1200 is used to input the function call graph into the graph embedding neural network model to obtain the graph embedding vector.
[0236] The data clustering module 1300 is used to perform clustering operations on the graph embedding vectors to obtain APK file clusters.
[0237] The data classification module 1400 is used to perform real device testing and decompilation on files in the APK file cluster to obtain the gray behavior categories of the files in the file cluster.
[0238] This system obtains the function call graph of an APK file, inputs the function call graph into a graph embedding neural network model to obtain graph embedding vectors, performs clustering operations on the graph embedding vectors to obtain APK file clusters, and performs real device testing and decompilation on the files in the APK file clusters to obtain the gray behavior categories of the files in the file clusters. It thereby achieves more comprehensive extraction of Android installation package features, improves the accuracy of gray behavior classification of Android applications, and enhances the security of Android applications.
[0239] It should be noted that this system embodiment is based on the same inventive concept as the above method embodiment. Therefore, the relevant content of the above method embodiment is also applicable to this system embodiment and will not be repeated here.
[0240] This application also provides an electronic device for classifying gray behaviors of Android applications, including: a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, it implements the gray behavior classification method for Android applications as described above.
[0241] The processor and memory can be connected via a bus or other means.
[0242] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0243] The non-transitory software programs and instructions required to implement the Android application gray behavior classification method in the above embodiments are stored in the memory. When they are executed by the processor, the Android application gray behavior classification method in the above embodiments is performed, for example, method steps S101 to S104 in Figure 1 described above.
[0244] This application also provides a computer-readable storage medium storing computer-executable instructions for executing the Android application gray behavior classification method described above.
[0245] The computer-readable storage medium stores computer-executable instructions that are executed by a processor or controller, for example, by a processor in the above-described electronic device embodiment, causing the processor to execute the Android application gray behavior classification method in the above-described embodiment, for example, method steps S101 to S104 in Figure 1 described above.
[0246] It will be understood by those skilled in the art that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components can be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software can be distributed on a computer-readable medium, which can include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program units, or other data). Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer. Furthermore, as is known to those skilled in the art, communication media typically contain computer-readable instructions, data structures, program units, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.
[0247] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A method for classifying gray behaviors in Android applications, characterized in that the Android application gray behavior classification method includes: obtaining the function call graph of the APK file, specifically: generating a candidate function call graph for the APK file based on Androguard; calculating the node degree of the candidate function call graph; and selecting a function call graph from the candidate function call graphs, wherein the degree of the nodes in the function call graph is greater than a preset node degree; inputting the function call graph into a graph embedding neural network model to obtain a graph embedding vector, wherein the graph embedding neural network model is a GPT-GNN model, specifically: obtaining the initial feature vector of each node in the function call graph, specifically: using APKtool to decompile the APK file to obtain a smali file; matching each node of the function call graph with the smali code in the smali file to obtain the smali code matching result for each node; generating, based on the smali code matching results and Androguard, a control flow graph corresponding to each node of the function call graph; calculating, based on the node positions in the control flow graph and the smali code of the function corresponding to the node, the smali instruction sequence of the function corresponding to the node; and generating the sequence feature vector of the smali instruction sequence according to the SimCSE model to obtain the initial feature vector of each node; inputting the function call graph into the GPT-GNN model so that the GPT-GNN model obtains the graph embedding vector based on the initial feature vector of each node in the function call graph; performing clustering on the graph embedding vectors to obtain APK file clusters; and performing real device testing and decompilation on the files in the APK file cluster to obtain the gray behavior categories of the files in the file cluster.
2. The method for classifying gray behaviors in Android applications according to claim 1, characterized in that the step of clustering the graph embedding vectors to obtain APK file clusters includes: constructing the similarity matrix of the graph embedding vectors based on the fully connected Gaussian kernel distance, and constructing the adjacency matrix and degree matrix based on the similarity matrix; calculating the standardized Laplacian matrix based on the adjacency matrix and the degree matrix, wherein the formula for calculating the standardized Laplacian matrix based on the adjacency matrix and the degree matrix is as follows: where L is the Laplacian matrix, D is the degree matrix, and W is the adjacency matrix; calculating the first k1 eigenvalues of the Laplacian matrix and the eigenvectors corresponding to the first k1 eigenvalues, where k1 is a preset value; standardizing the eigenvectors corresponding to the first k1 eigenvalues by row to form an N×k1 dimensional feature matrix, where N is the total number of APK files; and clustering the feature matrix using an improved K-means clustering algorithm to obtain APK file clusters.
3. The method for classifying gray behaviors in Android applications according to claim 2, characterized in that the step of clustering the feature vectors using an improved K-means clustering algorithm to obtain APK file clusters includes: Step S1: Randomly select k APK files from the APK files as the first cluster centers, where k is the preset number of APK file clusters; Step S2: Traverse the function call graph of the APK file according to the types defined in the official Android documentation to obtain the type labels of the APK file, and calculate the initial similarity based on the type labels, wherein the formula for calculating the initial similarity based on the type labels is as follows: where $\alpha(x, C_k)$ is the initial similarity between the x-th APK file and the cluster center of the k-th APK file cluster, and $\beta$ is a random number between 0 and 1; Step S3: Calculate the shortest distance between each APK file and the first cluster center based on the initial similarity, wherein the formula for calculating the shortest distance between each APK file and the cluster center based on the initial similarity is: $D(x) = \min\{\mathrm{dist}(x, C_k) \times \alpha(x, C_k),\ x \in X,\ k = 1, 2, \ldots\}$, where $X$ is the set of all APK files, $C_k$ is the cluster center of the k-th APK file cluster, and $\mathrm{dist}(x, C_k)$ is the Euclidean distance between the x-th APK file and the cluster center of the k-th APK file cluster; Step S4: Classify the APK files into the file clusters corresponding to the cluster centers with the smallest shortest distance to obtain k second APK file clusters, and calculate the probability of each APK file being selected as the next cluster center based on the shortest distance: where $P(x)$ is the probability that the x-th APK file is selected as the next cluster center; Step S5: Based on the probability of being selected as the next cluster center, obtain k second cluster centers; Step S6: Calculate the shortest distance between each APK file and the second cluster center, and the probability of it being selected as the next cluster center, based on the initial similarity, to obtain the third cluster centers and k third APK file clusters, and iterate until the preset maximum number of iterations is reached, thus obtaining the APK file clusters.
4. The method for classifying gray behaviors in Android applications according to claim 3, characterized in that, after clustering the feature vectors using the improved K-means clustering algorithm to obtain APK file clusters, the Android application gray behavior classification method further includes: evaluating the clustering results using a silhouette coefficient based on similarity weights, a Calinski-Harabasz score based on similarity weights, and a Davies-Bouldin index based on similarity weights. The formula for calculating the silhouette coefficient based on similarity weights is as follows: where $j$ is the j-th APK file belonging to the same APK file cluster as the i-th APK file, $j_{kc}$ is the cluster center of the k-th file cluster, $j_k$ is the j-th APK file in the k-th file cluster, distC is the distance between the i-th APK file and its nearest cluster center, $b(i)$ is the distance between the i-th APK file and its nearest different file cluster, $a(i)$ is the intra-cluster cohesion of the i-th APK file, $n_k$ is the number of APK files in the k-th file cluster, and $n$ is the number of APK files in the file cluster containing the i-th APK file. The formula for calculating the Calinski-Harabasz score based on similarity weights is as follows: where BGSS is the inter-cluster sum of squares, $C$ is the cluster center of all APK files, WGSS is the intra-cluster sum of squares, $WGSS_k$ is the sum of squares of the k-th file cluster, $X_{ik}$ is the i-th APK file in the k-th file cluster, and CH is the Calinski-Harabasz score. The formula for calculating the Davies-Bouldin index based on similarity weights is as follows: where the first average-distance term is the average distance between the APK files of the i-th file cluster and its cluster center, the second average-distance term is the average distance between the APK files of the j-th file cluster and its cluster center, $D_{i,j}$ is the intra-cluster distance ratio between the i-th file cluster and the j-th file cluster, and DB is the Davies-Bouldin index based on similarity weights.
5. An Android application gray behavior classification system, characterized in that, The Android application gray behavior classification system includes: The data acquisition module is used to obtain the function call graph of the APK file, specifically: Generate a candidate function call graph for the APK file based on Androguard; Calculate the node degree of the candidate function call graph; A function call graph is selected from the candidate function call graphs, wherein the degree of the nodes in the function call graph is greater than a preset node degree; The data input module is used to input the function call graph into a graph embedding neural network model to obtain a graph embedding vector, wherein the graph embedding neural network model is a GPT-GNN model, specifically: Obtain the initial feature vector of each node in the function call graph, specifically as follows: Use APKtool to decompile the APK file to obtain a smali file; Each node of the function call graph is matched with the smali code in the smali file to obtain the smali code matching result for each node; Based on the smali code matching results and Androguard, a control flow graph corresponding to each node of the function call graph is generated; Based on the node positions in the control flow graph and the smali code of the function corresponding to the node, the smali instruction sequence of the function corresponding to the node is calculated; The sequence feature vector of the smali instruction sequence is generated according to the SimCSE model, and the initial feature vector of each node is obtained. The function call graph is input into the GPT-GNN model so that the GPT-GNN model obtains the graph embedding vector based on the initial feature vector of each node in the function call graph. 
The data clustering module is used to perform clustering operations on the graph embedding vectors to obtain APK file clusters; The data classification module is used to perform real device testing and decompilation on the files in the APK file cluster to obtain the gray behavior categories of the files in the file cluster.
6. An Android application gray behavior classification device, characterized in that it includes at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, the instructions being executed by the at least one control processor to enable the at least one control processor to perform the Android application gray behavior classification method as described in any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores computer-executable instructions for causing a computer to perform the Android application gray behavior classification method as described in any one of claims 1 to 4.