A method for detecting unknown Android malicious applications based on behavioral feature text embedding and transfer learning
By using behavioral feature-based text embedding and transfer learning, behavioral description features are generated and the LSTM classification model is optimized, which solves the problem of poor detection performance of unknown Android malicious applications in the existing technology and achieves efficient identification and robust detection of unknown malicious applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN TEXTILE UNIV
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing Android malware detection methods are ineffective when faced with unknown malicious applications in the real world, and cannot effectively identify malicious applications that have not been learned.
We employ a method based on behavioral feature text embedding and transfer learning. By extracting permission information, API call records, and URL features, we generate behavioral description features and use a pre-trained embedding model for text embedding. We combine an LSTM classification model and L1 regularization to optimize the model's generalization ability, remove easily confused features, and improve detection performance.
The method significantly improves the accuracy and robustness of detecting unknown Android malicious applications, enhances the model's ability to identify unknown malicious applications in the real world, reduces the resources needed for feature extraction, and extracts more representative features.
Figure CN119903512B_ABST
Description
Technical Field
[0001] This invention relates to the field of application detection methods, specifically to a method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning.
Background Technology
[0002] Android malware poses a significant threat to mobile device security. It not only compromises users' privacy and financial security but can also degrade device performance, crash the system, and allow the device to be controlled remotely. As the number of new malware applications increases significantly, many malicious applications have emerged in the real world that are still unknown to existing detectors. Therefore, effectively detecting unknown real-world malware applications is a pressing issue that needs to be addressed.
[0003] In the field of Android malicious application detection, numerous studies have proposed detection methods with excellent performance based on recognized datasets. However, with the continuous emergence of unknown malicious applications in the real world, the detection effectiveness of these methods risks failing. Unknown malicious applications exist in the real world, representing knowledge that the detector has never learned. To simulate real-world scenarios, this invention collects real-world software datasets while ensuring that the data in these datasets has not been learned by the model. Simultaneously, the application strategy of transfer learning provides a novel approach to identifying unknown malicious applications in the real world. Its core lies in treating the malicious application detection model built based on existing datasets as a source domain, and the detection task targeting unknown applications as a target domain, where transfer learning techniques can be used to effectively utilize the knowledge and experience of the source domain.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to provide a method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning, which addresses the above-mentioned shortcomings.
[0005] To solve the above technical problems, the present invention adopts the following technical solution:
[0006] A method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning, characterized by the following steps:
[0007] Step 1: Establish a dataset of malicious application samples and a dataset of benign application samples;
[0008] Step 2: Extract features from each APK sample in the dataset. The extracted features include permission information, API call records, and URL features in the source code.
[0009] Step 3: Set up a behavior description template. Insert the permission information, API call records, and URL features extracted in the previous step, along with the interpretable text of the corresponding permissions and API calls, into the behavior description template to generate behavior description features.
[0010] Step 4: Embed the behavioral description features of all samples into text using a pre-trained embedding model BGE;
[0011] Step 5: Calculate the dot product similarity between the text embedded by each malicious sample and the text embedded by all benign samples. Select the benign sample with the highest dot product similarity as the most similar sample. Remove the benign behavior description features corresponding to the most similar sample from the behavior description features of each malicious sample to obtain the processed malicious application sample dataset.
[0012] Step 6: Build an optimized LSTM classification model. Use the processed malicious application sample dataset obtained in the previous step and the benign application sample dataset from Step 1 to train the optimized LSTM classification model to obtain the trained malicious application detection model. The loss function of the optimized LSTM classification model adopts the L1 regularized loss function.
[0013] Step 7: Input the application to be detected into the malicious application detection model, and the model outputs the detection results.
[0014] Furthermore, the extraction method in step 2 specifically includes the following steps:
[0015] Step 2-1: Use Apktool to decompile the APK file and obtain the permission android:name information from the Manifest.xml file of the original APK file as the permission feature of the APK file. Remove redundant fields and symbols from the permission feature.
[0016] Step 2-2: Use Apktool to decompile the classes.dex file and extract the API call features and URL features of the APK. When extracting API call features, remove API calls that are not semantically meaningful or irrelevant, and only keep meaningful API call information.
[0017] Furthermore, step 3 specifically includes the following steps:
[0018] Step 3-1: Create a behavior description template and construct appropriate contexts for the features that represent the behavior of the application within the template;
[0019] Step 3-2: According to the developer documentation, obtain interpretability information related to permissions and API calls;
[0020] Step 3-3: Match the extracted features with their corresponding interpretability descriptions and input them into the behavior description template created in Step 3-1 to generate complete behavior description features.
[0021] Furthermore, step 4 specifically includes the following steps:
[0022] Step 4-1: Input the behavioral description features from the known dataset into the pre-trained embedding model BGE to perform text embedding and convert them into vectors. The text embedding process is expressed as follows: given a dataset, let $x_i$ denote the $i$-th sample in the dataset and $N$ the total number of samples. The pre-trained embedding model BGE, denoted $M$, converts each $x_i$ into a vector representation $v_i$: $v_i = M(x_i)$ for all $i = 1, 2, \ldots, N$;
[0023] Step 4-2: Calculate text similarity for the samples after text embedding. The malicious application sample set contains $M$ malicious application samples and the benign software sample set contains $B$ benign software samples. For any pair consisting of a malicious application sample $m_j$ and a benign software sample $b_k$, the dot product similarity between them is defined as $\mathrm{sim}(m_j, b_k) = v_{m_j} \cdot v_{b_k}$, where $v_{m_j}$ and $v_{b_k}$ are their embedding vectors. The most similar pairs of malicious applications and benign software are obtained based on the calculated similarity.
[0024] Step 4-3: For the most similar sample pairs obtained in Step 4-2, compare the behavioral description features of each malicious application with the corresponding most similar benign software, and remove the benign behavioral description features of the malicious sample that correspond to the most similar benign software.
[0025] Furthermore, in step 5, the malicious application sample dataset obtained in the previous step and 20% of the samples in the benign application sample dataset from step 1 are first input into a multi-layer stacked autoencoder neural network for sample reconstruction to obtain the reconstructed dataset. Then, together with other unreconstructed dataset data, the optimized LSTM classification model is trained.
[0026] The beneficial effects of this invention are as follows:
[0027] (1) This invention employs a training method that enhances the generalization ability of the model to combat unknown malicious applications on Android;
[0028] (2) This invention designs and implements a feature optimization and enhancement method based on NLP. By using text embedding technology to convert features into vector representations, and further using a text similarity measurement algorithm to filter out easily confused features used by malicious applications, this significantly reduces the resources used for feature extraction and extracts more representative features, thereby improving the detection effect of the model.
[0029] (3) This invention designs and implements a behavior description template, which better integrates static features into a complete and comprehensive natural language text, and finally generates behavior description features;
[0030] (4) This invention employs L1 regularization and stacked autoencoders to enhance the generalization ability of the model. Introducing L1 regularization into model training makes the model depend strongly on a few key features, rather than on all features, when facing unknown malicious applications in the real world. The stacked autoencoder (SAE) consists of multiple autoencoders, with the hidden layer of each autoencoder serving as the input layer of the next autoencoder, forming a multi-layer structure. Each layer can extract different features from the data.
[0031] (5) A novel method based on behavioral feature text embedding and an improved transfer learning model is proposed for detecting unknown Android malicious applications in the real world. This method extracts permissions, API calls, and URLs from the target application to capture its behavior, and maps these behavioral features to natural language descriptions by combining them with tailored behavioral description templates. The method encodes the behavioral descriptions of the target application based on text embedding and calculates the semantic similarity of the feature texts, thereby further extracting key features that influence the detection results. To enhance the model's robustness against unknown malicious applications in the real world, this method integrates transfer learning and employs L1 regularization during the training phase.
[0032] The present invention will now be described in detail with reference to the accompanying drawings and examples.
Attached Figure Description
[0033] Figure 1 is a flowchart of the method for detecting unknown malicious Android applications based on behavioral feature text embedding and improved transfer learning according to the present invention;
[0034] Figure 2 is a flowchart of the dataset construction process;
[0035] Figure 3 is a flowchart for generating behavioral description features;
[0036] Figure 4 is a flowchart for removing easily confused benign behaviors from malicious applications;
[0037] Figure 5 is a flowchart for the detection of unknown malicious applications based on model optimization;
[0038] Figure 6 is a schematic diagram of the behavior description template.
Detailed Implementation
[0039] The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given are for illustrative purposes only and are not intended to limit the scope of the invention.
[0040] This invention provides a method for detecting unknown Android malicious applications based on behavioral feature text embedding and improved transfer learning. When the Android malicious application has not been learned by the model, behavioral description features are generated and optimized using NLP. A deep learning model is adopted, combined with L1 regularization and stacked autoencoders, to obtain the Android unknown malicious application detection model of this invention, which predicts whether an Android application is a malicious application.
[0041] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions used in this invention will be described clearly and completely below with reference to the accompanying drawings. The examples given are only for explaining this invention and are not intended to limit the scope of this invention.
[0042] This invention provides a method for detecting unknown malicious Android applications based on behavioral feature text embedding and improved transfer learning. As shown in Figure 1, it includes:
[0043] S1: Dataset collection;
[0044] In step S1, this invention uses an authoritative dataset and a collected dataset of unknown malicious applications to construct two datasets. These two datasets are named the "Known Dataset" and the "Unknown Dataset," respectively. Furthermore, when labeling applications, this invention follows established standards: software with a malicious threat level greater than 4 in VirusTotal is classified as a malicious application, and software with a malicious threat level less than 4 is classified as benign software.
[0045] S2: Extract permissions, API calls, and URL characteristics;
[0046] To extract features, this invention uses Apktool to decompile APK files and obtain a Manifest.xml file containing basic APK information. This Manifest.xml file contains the APK's permission information; the file is read and the permission features are saved to the corresponding txt files. To remove redundant fields from the permission features, the txt file containing the permission information for each APK is first read, ordinary string replacement is used to replace the fields "android.permission." and "com.android.launcher.permission." with empty strings, and the remaining permission information is converted to lowercase. This invention also uses Apktool to extract the program's API call features and the URLs present in the program's source code. The API call features retain their class names, return values, and parameter information. All features are stored in the permission.txt, api.txt, and url.txt files located in the APK's directory.
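The permission clean-up and URL collection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expressions, and the assumption that permissions appear in `<uses-permission>` entries of the decompiled manifest are all choices made for this example. Only the two prefixes and the lowercasing come from the description.

```python
import re

# Prefixes that the description says are replaced with empty strings.
REDUNDANT_PREFIXES = ("android.permission.", "com.android.launcher.permission.")

# Hypothetical URL pattern; the description does not say how URLs are matched.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def normalize_permission(raw: str) -> str:
    """Strip the redundant prefixes and lowercase what remains."""
    name = raw.strip()
    for prefix in REDUNDANT_PREFIXES:
        name = name.replace(prefix, "")
    return name.lower()


def extract_permissions(manifest_xml: str) -> list[str]:
    """Collect android:name values from <uses-permission> entries (assumed layout)."""
    names = re.findall(r'<uses-permission[^>]*android:name="([^"]+)"', manifest_xml)
    return [normalize_permission(name) for name in names]


def extract_urls(source_text: str) -> list[str]:
    """Return every http(s) URL literal found in decompiled source text."""
    return URL_PATTERN.findall(source_text)
```

In a full pipeline the returned lists would be written to permission.txt and url.txt in the APK's directory, as the paragraph above describes.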
[0047] S3: Generate behavioral description features using behavioral description templates;
[0048] First, this step requires accessing the Android developer website to obtain, based on permission and API names, the description information of permissions and the behavioral logic description information of APIs. Then, this invention designs a behavioral description template that integrates permissions, API calls, interpretability information, and URLs. Specifically, behavioral description features are semantic-level features describing application behavior. Malicious application analysts can determine whether an application is malicious by analyzing its behavioral descriptions. When an application executes an API call, it needs to request the relevant permissions. Based on this operational logic, this invention first represents a permission and its associated API calls, in logical order, as a single part. Then, this invention constructs appropriate contexts in the template for the features representing application behavior. Not all API call features have behavioral description information in the documentation; for those that do not, this invention inserts the parameters and return values of the API calls into the template. Furthermore, URLs do not have specific description information, but considering that malicious application developers may include botnets in their source code, this invention incorporates URL features into the template. The template is shown in Figure 6.
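As a rough illustration of how such a template might be filled, the sketch below joins one permission, its associated API calls, and any URLs into a single natural-language description. The template wording, the dictionary keys, and the function names are hypothetical; the actual template is the one shown in Figure 6 and may differ substantially.

```python
# Hypothetical template sentences; the real wording is defined by Figure 6.
PERMISSION_PART = "The application requests the permission {permission}, which {permission_desc}."
API_DOCUMENTED = "It then calls {api}, which {api_desc}."
API_UNDOCUMENTED = "It then calls {api} with parameters ({params}) and return type {returns}."
URL_PART = "The source code contains the URL {url}."


def describe_api(api: dict) -> str:
    """Use the documentation text when available, else fall back to the signature."""
    if api.get("desc"):
        return API_DOCUMENTED.format(api=api["name"], api_desc=api["desc"])
    return API_UNDOCUMENTED.format(
        api=api["name"],
        params=api.get("params", ""),
        returns=api.get("returns", "void"),
    )


def build_description(permission, permission_desc, apis, urls):
    """Place a permission first, then its API calls, then any URLs."""
    parts = [PERMISSION_PART.format(permission=permission, permission_desc=permission_desc)]
    parts.extend(describe_api(api) for api in apis)
    parts.extend(URL_PART.format(url=url) for url in urls)
    return " ".join(parts)
```

Keeping the permission sentence ahead of its API calls mirrors the ordering described above, where a permission and its associated calls form a single part of the description.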
[0049] S4: Combine similarity metrics to eliminate non-malicious behaviors in the description features of malicious application behavior.
[0050] Before similarity can be measured and the features fed into the neural network for training, the behavioral description features must first be converted into text embeddings. This invention uses the pre-trained BGE model for this purpose. The process can be expressed concisely: given a dataset of $N$ samples, let $x_i$ denote the $i$-th sample; the pre-trained embedding model BGE, denoted $M$, converts each $x_i$ into a vector representation $v_i$: $v_i = M(x_i)$ for all $i = 1, 2, \ldots, N$. However, in the real world, some malicious applications have behavioral logic highly similar to that of benign software. These malicious applications mimic the interface and functions of benign software to deceive users into downloading and installing them. Therefore, this invention designs an algorithm to remove easily confused benign behaviors from malicious applications that are highly similar to benign software. The malicious application sample set contains $M$ malicious application samples and the benign software sample set contains $B$ benign software samples. For any pair consisting of a malicious application sample $m_j$ and a benign software sample $b_k$, the dot product similarity between them is defined as $\mathrm{sim}(m_j, b_k) = v_{m_j} \cdot v_{b_k}$, where $v_{m_j}$ and $v_{b_k}$ are their embedding vectors. The most similar pairs of malicious applications and benign software are obtained from the calculated similarity scores. For each such pair, the malicious application is compared with its most similar benign software, and the similar benign behaviors within the malicious application are removed.
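The pairing and removal step can be sketched in plain Python as below. The embeddings are taken as given lists of floats, since running a BGE model is outside this sketch. The description does not specify how "similar benign behaviors" are matched, so exact matching of description sentences is an assumption made here, and all function names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_similar_benign(mal_vec, benign_vecs):
    """Index of the benign embedding with the highest dot product similarity."""
    return max(range(len(benign_vecs)), key=lambda k: dot(mal_vec, benign_vecs[k]))


def remove_shared_behaviors(mal_sentences, benign_sentences):
    """Drop sentences that also occur in the benign description (assumed exact match)."""
    benign = set(benign_sentences)
    return [s for s in mal_sentences if s not in benign]


def filter_malicious(mal_samples, benign_samples):
    """Each sample is (embedding, sentences); returns the cleaned malicious sentences."""
    benign_vecs = [vec for vec, _ in benign_samples]
    cleaned = []
    for mal_vec, mal_sentences in mal_samples:
        k = most_similar_benign(mal_vec, benign_vecs)
        cleaned.append(remove_shared_behaviors(mal_sentences, benign_samples[k][1]))
    return cleaned
```

A softer matching rule, such as thresholding sentence-level similarity, would fit the same structure by replacing `remove_shared_behaviors`.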
[0051] S5: Optimize the model trained on an existing dataset using L1 regularization and stacked autoencoders.
[0052] To detect unknown malicious applications, this invention focuses on optimizing the model's generalization ability. The model is trained on the known dataset. The neural network used in this invention is an LSTM neural network, which can attend to the contextual information of behavioral description features. To build this neural network, a working PyTorch environment needs to be configured and the model loaded. Because the task is binary classification of malicious applications, this invention uses binary cross-entropy as the loss function during training.
[0053] $L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$
[0054] Where $L$ represents the loss function, $N$ represents the number of samples, $y_i$ represents the true label of the $i$-th sample, and $p_i$ represents the probability that the model predicts the $i$-th sample to be of the positive class.
[0055] To enhance the model's generalization ability, this invention incorporates L1 regularization during model training. Regularization improves generalization ability by introducing an additional penalty term into the loss function to constrain model complexity. L1 regularization adds to the loss function a penalty term proportional to the sum of the absolute values of the weights, causing some weights to tend towards zero. Therefore, the final loss function of this invention is:
[0056] $L_{\mathrm{final}} = L + \lambda \sum_{j} \lvert w_j \rvert$, where $L$ is the binary cross-entropy loss, $w_j$ are the model weights, and $\lambda$ is the regularization coefficient.
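To make the loss concrete, the following sketch computes binary cross-entropy and adds an L1 penalty on a flat list of weights, using only the standard library. The coefficient `lam` and its default value are assumptions for illustration; the description does not state the value used. In a PyTorch training loop the same penalty would typically be summed over the model's parameters.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def regularized_loss(y_true, p_pred, weights, lam=1e-4):
    """Binary cross-entropy plus lam times the L1 penalty (lam is hypothetical)."""
    return binary_cross_entropy(y_true, p_pred) + lam * l1_penalty(weights)
```

Because the penalty grows with every non-zero weight, minimizing this loss pushes weights attached to uninformative features towards zero, which is the sparsity effect the paragraph above relies on.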
[0057] Furthermore, this invention enhances the model's generalization ability by increasing the richness of samples. As a neural network-based model architecture, the autoencoder learns feature vectors to achieve data compression and subsequent decompression. The stacked autoencoder consists of multiple autoencoders, with the hidden layer of each autoencoder serving as the input layer for the next autoencoder, forming a multi-layer structure. Each layer can extract different features from the data. Therefore, this invention selects 20% of the feature vectors from the existing dataset, reconstructs them with the stacked autoencoder, and inputs the reconstructed vectors into the model for training. During the training process, the parameters are tuned, and the recommended parameters are shown in Table 1. Comparative experiments with different parameters showed that the model for detecting obfuscated malicious applications performs best with L1 regularization combined with a binary cross-entropy loss function, a batch size of 62, a learning rate of 1e-5, and 50 training epochs.
[0058] Table 1 Training parameters of LSTM
[0059]
[0060] To make the simulated scenarios in the dataset of this invention more realistic, the extracted features are randomly split into a training set and a test set at a ratio of 8:2. During training, sample features are randomly selected from the dataset and fed into the model. After multiple rounds of training, the unknown malicious application detection model is obtained.
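A random 8:2 split of this kind can be sketched with the standard library alone. The function name, the fixed seed, and the use of shuffling followed by a cut are illustrative choices, not details taken from the description.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Every sample lands in exactly one of the two parts, so no test sample is seen during training.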
[0061] S6: Verify its effectiveness by combining transfer learning with input of unknown malicious applications.
[0062] To verify the detection performance of the model of this invention against unknown malicious applications, we combine transfer learning methods, treating the model trained on an existing dataset as the source model. This invention also processes the unknown dataset through feature extraction and the other steps before inputting it into the source model for evaluation. The model's evaluation metrics are Accuracy, Precision, Recall, and F1-score. The specific formulas are:
[0063] $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, $\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
[0064] Wherein, TP (True Positive Examples) is the number of samples that the model predicts as positive and that are actually positive; TN (True Negative Examples) is the number of samples that the model predicts as negative and that are actually negative; FP (False Positive Examples) is the number of samples that the model predicts as positive but that are actually negative; and FN (False Negative Examples) is the number of samples that the model predicts as negative but that are actually positive.
[0065] Accuracy, Precision, Recall, and F1-score are important metrics used in machine learning to evaluate model performance. Accuracy measures the accuracy of a model's predictions across all samples; Precision focuses on the proportion of samples predicted as positive that were actually positive, reflecting the reliability of the positive predictions; Recall focuses on the proportion of all true positive samples correctly predicted by the model, measuring the model's ability to identify all true positives; and F1-score is the harmonic mean of Precision and Recall, used to balance the trade-offs between the two, providing a comprehensive way to evaluate model performance.
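The four metrics follow directly from the confusion-matrix counts defined above. The helper below is a minimal sketch; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For instance, with 8 true positives, 7 true negatives, 2 false positives, and 3 false negatives, accuracy is 15/20 and precision is 8/10, while recall is 8/11.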
[0066] The method provided by this invention has the following advantages or beneficial technical effects:
[0067] The number of malicious Android applications is on the rise, posing a serious security threat to users' privacy and assets. Many studies have achieved good detection results on known datasets, but these detectors become significantly less effective when faced with malicious applications they haven't learned from. Our method achieves better detection results when dealing with unknown malicious applications, contributing to the detection of mobile network security threats.
[0068] The above description provides examples of the preferred embodiments of the present invention. Parts not detailed herein are common knowledge to those skilled in the art. The scope of protection of the present invention is determined by the claims. Any equivalent modifications based on the technical teachings of the present invention are also within the scope of protection of the present invention.
Claims
1. A method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning, characterized in that it includes the following steps: Step 1: Establish a dataset of malicious application samples and a dataset of benign application samples; Step 2: Extract features from each APK sample in the dataset, the extracted features including permission information, API call records, and URL features in the source code; Step 3: Set up a behavior description template, and insert the permission information, API call records, and URL features extracted in the previous step, along with the interpretable text of the corresponding permissions and API calls, into the behavior description template to generate behavior description features; Step 4: Perform text embedding on the behavioral description features of all samples using the pre-trained embedding model BGE; Step 5: Calculate the dot product similarity between the text embedding of each malicious sample and the text embeddings of all benign samples, select the benign sample with the highest dot product similarity as the most similar sample, and remove the benign behavior description features corresponding to the most similar sample from the behavior description features of each malicious sample to obtain the processed malicious application sample dataset; Step 6: Build an optimized LSTM classification model, and use the processed malicious application sample dataset obtained in the previous step and the benign application sample dataset from Step 1 to train the optimized LSTM classification model to obtain the trained malicious application detection model, the loss function of the optimized LSTM classification model being an L1-regularized loss function; Step 7: Input the application to be detected into the malicious application detection model, and the model outputs the detection results.
2. The method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning according to claim 1, characterized in that the extraction method in step 2 specifically includes the following steps: Step 2-1: Use Apktool to decompile the APK file and obtain the permission android:name information from the Manifest.xml file of the original APK file as the permission feature of the APK file, and remove redundant fields and symbols from the permission feature; Step 2-2: Use Apktool to decompile the obtained classes.dex file and extract the API call features and URL features of the APK.
3. The method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning according to claim 1, characterized in that step 3 specifically includes the following steps: Step 3-1: Create a behavior description template and construct appropriate contexts for the features that represent the behavior of the application within the template; Step 3-2: According to the developer documentation, obtain interpretability information related to permissions and API calls; Step 3-3: Match the extracted features with their corresponding interpretability descriptions and input them into the behavior description template created in Step 3-1 to generate complete behavior description features.
4. The method for detecting unknown malicious Android applications based on behavioral feature text embedding and transfer learning according to claim 1, wherein step 4 specifically includes the following steps: Step 4-1: Input the behavioral description features from the known dataset into the pre-trained embedding model BGE to perform text embedding and convert them into vectors, the text embedding process being expressed as follows: given a dataset, let $x_i$ denote the $i$-th sample in the dataset and $N$ the total number of samples; the pre-trained embedding model BGE, denoted $M$, converts each $x_i$ into a vector representation $v_i$: $v_i = M(x_i)$ for all $i = 1, 2, \ldots, N$; Step 4-2: Calculate text similarity for the samples after text embedding, where the malicious application sample set contains $M$ malicious application samples and the benign application sample set contains $B$ benign application samples, and for any pair consisting of a malicious application sample $m_j$ and a benign application sample $b_k$, the dot product similarity between them is defined as $\mathrm{sim}(m_j, b_k) = v_{m_j} \cdot v_{b_k}$, where $v_{m_j}$ and $v_{b_k}$ are their embedding vectors; the most similar malicious and benign application pairs are obtained based on the calculated similarity; Step 4-3: For the most similar sample pairs obtained in Step 4-2, compare the behavioral description features of each malicious application with those of the corresponding most similar benign application, and remove the benign behavioral description features of the malicious sample that correspond to the most similar benign application.
5. The method for detecting unknown Android malicious applications based on behavioral feature text embedding and transfer learning according to claim 1, wherein in step 5, the malicious application sample dataset obtained after the previous step and 20% of the samples in the benign application sample dataset of step 1 are first input into a multi-layer stacked autoencoder neural network for sample reconstruction to obtain the reconstructed dataset, and then the optimized LSTM classification model is trained together with other unreconstructed dataset data.