Android application packing identification method, device, and storage medium

By using a machine learning model, in particular a random forest, and exploiting differences in string frequencies within dex files to identify packed Android applications, the invention addresses the poor recognition performance and low efficiency of existing techniques and achieves fast, accurate packing identification.

CN115618346B · Active · Publication Date: 2026-06-26 · DALIAN MARITIME UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DALIAN MARITIME UNIVERSITY
Filing Date
2022-10-28
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing techniques identify packed Android applications inefficiently and with poor accuracy. In particular, artificial-intelligence methods do little to process the sample data itself, which leads to unsatisfactory predictions and excessively long processing times.

Method used

A machine learning model, in particular a random forest, is trained on feature data obtained by statistically analyzing string frequency differences in dex files, so that whether an Android application is packed can be determined quickly.

Benefits of technology

It significantly improves the accuracy and efficiency of packing detection, can determine whether an application of about 150 MB is packed within 1 second, and reduces false positives.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a method and device for identifying packed Android applications, and a storage medium. The method comprises the following steps: obtaining training Android applications and extracting the dex file after decompression; counting the number of occurrences of each string in the dex file, and calculating character occurrence frequencies from those counts; selecting, as feature data, the strings whose occurrence frequencies differ most between packed and unpacked Android applications; training a packing identification model on the feature data; obtaining an Android application to be identified, decompressing it to obtain the dex file to be identified, and classifying that dex file with the trained packing identification model to obtain an identification result. The application provides a way to quickly determine whether an Android application is packed, and uses a machine learning model for the identification, further improving efficiency and accuracy.
Description

Technical Field

[0001] This invention relates to the field of computer technology, and more particularly to a method, apparatus, and storage medium for identifying packed Android applications.

Background Technology

[0002] Packing refers to using special algorithms to compress the resources within an executable file. The compressed file can still run independently; decompression is completely hidden and takes place entirely in memory. The packer stub (the "shell") is attached to the original program, is loaded into memory by the loader, and executes before the original program, thereby gaining control. During execution, it decrypts and restores the original program and then returns control to it so that the original code runs. Once packed, the original program code typically exists on disk only in encrypted form and is restored only in memory at run time. This effectively prevents attackers from modifying the program file without authorization and also prevents static decompilation.

[0003] Determining whether an application is packed is a necessary step before analyzing it. Early packing detection mainly relied on hooks to extract the information needed for a judgment. However, because packing techniques evolve rapidly, this approach can no longer keep up with complex and ever-changing packers. Artificial-intelligence-based packing detection methods also exist, but they focus on the design of the recognition algorithm and neglect processing of the sample data itself, resulting in unsatisfactory predictions. Furthermore, the large volume of sample data makes training and prediction excessively slow, so recognition performance is poor and efficiency is low.

Summary of the Invention

[0004] To address the poor recognition performance and low efficiency of existing methods for identifying packed Android applications, this application provides a method, apparatus, and storage medium for identifying packed Android applications. The invention provides a way to quickly determine whether an Android application is packed, and uses a machine learning model for the identification, further improving efficiency and accuracy.

[0005] The technical means employed in this invention are as follows:

[0006] A method for identifying packed Android mobile applications includes the following steps:

[0007] Obtain the training Android applications and extract the dex file after decompression; each training sample includes the Android application itself and a label indicating whether it is packed.

[0008] Count the number of times each string appears in the dex file, and calculate the frequency of each string from that count.

[0009] Compare the character frequencies of packed and unpacked Android applications, and use the strings with the largest frequency difference as feature data.

[0010] Train the packing identification model on the feature data. The model classifies the feature data to obtain the category corresponding to each feature; the category indicates whether the Android program is packed.

[0011] Obtain the Android application to be identified, decompress it to obtain the dex file to be identified, and classify that dex file with the trained packing identification model to obtain the identification result.

[0012] Furthermore, after the training Android application is obtained and decompressed to obtain the dex file, the method also includes obtaining the AndroidManifest.xml file and using it to determine whether the dex file of the Android application has been split into multiple dex files.

[0013] Furthermore, the packing identification model is a random forest model.

[0014] This invention also discloses a device for identifying packed Android mobile applications, comprising:

[0015] A training data acquisition module, used to obtain the training Android applications and extract the dex file after decompression; each training sample includes the Android application itself and a label indicating whether it is packed.

[0016] A statistics module, used to count the number of times each string appears in the dex file and to calculate character occurrence frequencies from those counts.

[0017] A feature data acquisition module, used to compare the character frequencies of packed and unpacked Android applications and to take the strings with the largest frequency difference as feature data.

[0018] A model training module, used to train the packing identification model on the feature data. The model classifies the feature data to obtain the category corresponding to each feature; the category indicates whether the Android program is packed.

[0019] An identification module, used to obtain the Android application to be identified, decompress it to obtain the dex file to be identified, and classify that dex file with the trained packing identification model to obtain the identification result.

[0020] Furthermore, the device also includes an AndroidManifest.xml file acquisition module, used to obtain the AndroidManifest.xml file and determine from it whether the dex file of the Android application has been split into multiple dex files.

[0021] Furthermore, the packing identification model is a random forest model.

[0022] The present invention also discloses a storage medium comprising a stored program, wherein, when the program is executed, it performs the method for identifying packed Android mobile applications as described in any of the preceding claims.

[0023] Compared with the prior art, the present invention has the following advantages:

[0024] 1. The packing detection method provided by this invention can determine whether an application is packed simply by counting the frequency of strings in the dex file, greatly improving detection speed. Experiments show that it takes only 1 second to obtain a result for an application of about 150 MB.

[0025] 2. This invention learns with the random forest algorithm, which prevents applications that happen to contain certain characteristic data for special reasons from being mistakenly identified as packed, as can occur with traditional methods.

Attached Figure Description

[0026] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0027] Figure 1 is a flowchart of the method for identifying packed Android mobile applications according to the present invention.

[0028] Figure 2 shows the training process of the random forest classifier in the example.

Detailed Implementation

[0029] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0030] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0031] As shown in Figure 1, this invention provides a method for identifying packed Android mobile applications. First, the corresponding dex file is obtained from each sample and the occurrence count of every string is collected. These counts are passed to a second stage, which calculates the occurrence probability of each string and statistically analyzes the differences. The strings that differ most significantly after packing are identified, and a threshold is calculated. The corresponding feature data is then passed to a random forest algorithm to train the model. Finally, the model is evaluated on test data, and the results determine whether further learning is needed. Specifically, the method includes the following steps:

[0032] S1. Obtain the training Android applications and extract the dex file after decompression; each training sample includes the Android application itself and a label indicating whether it is packed.

[0033] In a preferred embodiment of the present invention, the AndroidManifest.xml file is extracted along with the dex file. The multiDexEnabled field in the AndroidManifest.xml file is used to determine whether the dex file of the Android application has been split. If it has, all dex files need to be extracted in a loop; otherwise, extraction only needs to be performed once.

[0034] A dex file is a binary file containing all the code of an Android program, so whether or not the program is packed directly affects the structure of the code and thus the final form of the dex file (for example, obfuscating variables can turn a regular string into a mapping to another string, with the mapping logic added to the code). The AndroidManifest.xml file, in turn, indicates whether the application's dex file has been split, i.e., whether the multi-dex pattern is present, which affects how the dex files are extracted.
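The extraction in S1 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. It relies on the fact that an APK is a ZIP archive. Instead of reading the multiDexEnabled field, it collects every top-level entry named classes.dex, classes2.dex, and so on, because the manifest inside an APK is stored as compiled binary XML and would need a dedicated parser. That substitution and the name extract_dex_files are assumptions made for this sketch.

```python
import re
import zipfile

# Top-level dex entries in an APK: classes.dex, classes2.dex, classes3.dex, ...
DEX_NAME = re.compile(r"^classes\d*\.dex$")


def extract_dex_files(apk):
    """Return {entry name: raw bytes} for every top-level dex file in an APK.

    `apk` may be a filesystem path or a binary file-like object.
    A single-dex application yields one entry; a multi-dex one yields several.
    """
    with zipfile.ZipFile(apk) as archive:
        names = sorted(n for n in archive.namelist() if DEX_NAME.match(n))
        return {name: archive.read(name) for name in names}
```

Looping over the returned dictionary covers both the single-dex and the multi-dex case without a separate branch.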

[0035] S2. Count the number of occurrences of each string in the dex file, and calculate the frequency of each string from that count. Specifically, a purpose-built program reads the dex file and counts the occurrences of all strings in sequence; the frequency of each string is then calculated.
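A rough Python sketch of the counting in S2, under stated assumptions. It reads the string table using the standard dex header layout (string_ids_size at offset 0x38 and string_ids_off at 0x3C, both little-endian 32-bit values) and then computes per-character frequencies over all strings, which is one reading of the patent's "frequency" that matches the char(...) thresholds it quotes later. Decoding the MUTF-8 string data as UTF-8 with replacement is an approximation, and the function names are hypothetical.

```python
import struct
from collections import Counter


def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 value at `pos`; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def dex_strings(buf):
    """Return the entries of the dex string table as a list of Python strings."""
    size, table_off = struct.unpack_from("<II", buf, 0x38)
    strings = []
    for i in range(size):
        (data_off,) = struct.unpack_from("<I", buf, table_off + 4 * i)
        _, start = read_uleb128(buf, data_off)  # skip the UTF-16 length prefix
        end = buf.index(b"\x00", start)         # string data is NUL-terminated
        # Assumption: treat MUTF-8 as UTF-8; unusual sequences become U+FFFD.
        strings.append(buf[start:end].decode("utf-8", errors="replace"))
    return strings


def char_frequencies(strings):
    """Map each character to its share of all characters in `strings`."""
    counts = Counter()
    for s in strings:
        counts.update(s)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}
```

For a multi-dex application, the strings from all dex files could be concatenated before calling char_frequencies.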

[0036] S3. Compare the character frequencies of packed and unpacked Android applications, and take the strings with the largest frequency difference as feature data.

[0037] Specifically, the strings of packed and unpacked dex files are compared statistically, and the strings that differ most are selected as feature data, as shown in Table 1. The minimum frequency of each such string among all characters is then calculated and set as the threshold used in the random forest's decisions.

[0038] Table 1 Feature Data

[0039]
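The selection in S3 can be illustrated with a short Python sketch. It assumes that each sample has already been reduced to a dictionary mapping a character to its frequency, averages those frequencies per class, and ranks characters by the absolute difference between the two class means. The use of the mean, the parameter k, and the function name are assumptions for illustration; the patent does not specify how the difference is aggregated.

```python
def top_differing_chars(packed_freqs, unpacked_freqs, k=4):
    """Return the k characters whose mean frequency differs most between classes.

    Each argument is a non-empty list of {character: frequency} dictionaries,
    one dictionary per sample application.
    """
    def mean_freq(samples):
        totals = {}
        for freq in samples:
            for ch, value in freq.items():
                totals[ch] = totals.get(ch, 0.0) + value
        # A character missing from a sample counts as frequency 0 for it.
        return {ch: value / len(samples) for ch, value in totals.items()}

    packed_mean = mean_freq(packed_freqs)
    unpacked_mean = mean_freq(unpacked_freqs)
    diffs = {
        ch: abs(packed_mean.get(ch, 0.0) - unpacked_mean.get(ch, 0.0))
        for ch in set(packed_mean) | set(unpacked_mean)
    }
    # Largest difference first; ties are broken by character for determinism.
    return sorted(diffs, key=lambda ch: (-diffs[ch], ch))[:k]
```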

[0040] S4. The packing identification model is trained on the feature data. The model classifies the feature data to obtain the category corresponding to each feature; the category indicates whether the Android program is packed. Preferably, the present invention uses a random forest model as the packing identification model.

[0041] This invention preferably compares nearly one hundred programs, both packed and unpacked, to determine the strings that are present regardless of the packing algorithm. Once the corresponding feature data has been obtained, the classification task begins. As shown in Figure 2, the features listed in Table 1 are used as input data for training the random forest. The learning process is as follows:

[0042] First, check whether the frequency of char(127) = 'DEL' is greater than 0.002. If it is, the application is judged to be packed (D). If it is less than 0.002, go on to check whether the frequency of char(59) = ';' is less than 0.138.

[0043] If the frequency of char(59) = ';' is greater than 0.138, go on to check the frequency of char(62) = '>'. If the frequency of char(62) = '>' is greater than 0.011, the application is judged to be unpacked (ND). If it is less than 0.011, the application is judged to be packed (D).

[0044] If the frequency of char(59) = ';' is less than 0.138, go on to check the frequency of char(46) = '.'. If it is greater than 0.008, the application is judged to be unpacked (ND); otherwise it is judged to be packed (D).
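The decision rules just described can be written out directly as a small Python function. This is only a transcription of the single tree described in the text, using the thresholds quoted there; it is not a trained random forest. The text does not say what happens when a frequency equals a threshold exactly, so sending ties to the "less than" branch is an assumption, as is the function name.

```python
def classify_by_described_tree(freq):
    """Apply the described decision rules to a {character: frequency} mapping.

    Returns "D" (packed) or "ND" (not packed). Characters absent from `freq`
    are treated as having frequency 0.
    """
    if freq.get("\x7f", 0.0) > 0.002:   # char(127), DEL
        return "D"
    if freq.get(";", 0.0) > 0.138:      # char(59)
        # High ';' frequency: decide on char(62), '>'
        return "ND" if freq.get(">", 0.0) > 0.011 else "D"
    # Low ';' frequency: decide on char(46), '.'
    return "ND" if freq.get(".", 0.0) > 0.008 else "D"
```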

[0045] The decision tree analyzes the known variables to find the variable with the greatest influence and its threshold; this is the threshold obtained in S3.
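How a decision tree might pick "the variable with the greatest influence and the threshold" can be sketched as an exhaustive search for the split with the lowest weighted Gini impurity. The patent does not name a split criterion, so Gini impurity (as used in CART-style trees) is an assumption here, and the function names and data layout are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = packed, 0 = unpacked)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(samples, labels):
    """Find the (impurity, feature, threshold) split with the lowest impurity.

    `samples` is a list of {feature: value} dictionaries, `labels` the matching
    list of 0/1 labels. Candidate thresholds are midpoints between consecutive
    distinct values of each feature. Returns None if no split is possible.
    """
    best = None
    n = len(samples)
    for feat in sorted({key for s in samples for key in s}):
        values = sorted({s.get(feat, 0.0) for s in samples})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [y for s, y in zip(samples, labels) if s.get(feat, 0.0) <= threshold]
            right = [y for s, y in zip(samples, labels) if s.get(feat, 0.0) > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feat, threshold)
    return best
```

Applying the same search recursively to each side of the chosen split, down to a fixed depth, yields a tree of the kind described above.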

[0046] As can be seen from Table 1 and Figure 2, analyzing the frequency of strings in the code of packed and unpacked sample applications, and the differences between corresponding strings in the two cases, shows that the feature with the most significant difference can serve as the root node. The remaining differences can be used as leaf nodes for further discrimination.

[0047] Generally speaking, the deeper the trees of a random forest, the better they fit the given dataset, but overfitting may occur. Conversely, if a decision tree is not deep enough, many samples are misjudged because the tree underfits the data. The decision tree shown in Figure 2 is the result of learning from the samples. Through comparative learning, the present invention determined that four layers give the best result.

[0048] S5. Obtain the Android application to be identified, decompress it to obtain the dex file to be identified, and classify that dex file with the trained packing identification model to obtain the identification result.

[0049] The following specific application example will further illustrate the solution and effects of the present invention.

[0050] In this embodiment, the trained model is deployed on a designated service. A developer opens the web application and uploads the application (apk file) to the designated server. The data processing logic receives the apk, extracts the dex file, counts the strings, processes the string information, and passes the feature data to the pre-trained model for judgment. The result is returned directly to the web interface and shown to the developer in a popup indicating whether the application is packed.

[0051] This invention also discloses a device for identifying packed Android mobile applications, comprising:

[0052] A training data acquisition module, used to obtain the training Android applications and extract the dex file after decompression; each training sample includes the Android application itself and a label indicating whether it is packed.

[0053] A statistics module, used to count the number of times each string appears in the dex file and to calculate character occurrence frequencies from those counts.

[0054] A feature data acquisition module, used to compare the character frequencies of packed and unpacked Android applications and to take the strings with the largest frequency difference as feature data.

[0055] A model training module, used to train the packing identification model on the feature data. The model classifies the feature data to obtain the category corresponding to each feature; the category indicates whether the Android program is packed.

[0056] An identification module, used to obtain the Android application to be identified, decompress it to obtain the dex file to be identified, and classify that dex file with the trained packing identification model to obtain the identification result.

[0057] Furthermore, the device also includes an AndroidManifest.xml file acquisition module, used to obtain the AndroidManifest.xml file and determine from it whether the dex file of the Android application has been split into multiple dex files.

[0058] Furthermore, the packing identification model is a random forest model.

[0059] Because the device embodiment corresponds to the method embodiment for identifying packed Android mobile applications described above, it is described only briefly. For the parts they have in common, refer to the description of the method embodiment above; they are not repeated here.

[0060] The present invention also discloses a storage medium comprising a stored program, wherein, when the program is executed, it performs the method for identifying packed Android mobile applications as described in any of the preceding claims.

[0061] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0062] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0063] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.

[0064] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0065] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0066] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0067] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for identifying packed Android mobile applications, characterized in that it includes the following steps: obtaining training Android applications and extracting the dex file after decompression, wherein each training sample includes the Android application itself and whether it is packed; counting the number of times each string appears in the dex file, and calculating the frequency of each character based on the number of times the string appears; using, as feature data, the string with the highest difference between the character frequencies of packed and unpacked Android applications; training a packing identification model with the feature data, wherein the packing identification model classifies the feature data to obtain the category corresponding to each feature, and the category indicates whether the Android program is packed; and obtaining the Android application to be identified, decompressing it to obtain the dex file to be identified, and identifying that dex file with the trained packing identification model to obtain the identification result.

2. The method for identifying packed Android mobile applications according to claim 1, characterized in that, after the training Android application is obtained and decompressed to obtain the dex file, the method also includes obtaining the AndroidManifest.xml file and using the AndroidManifest.xml file to determine whether the dex file of the Android application has been split.

3. The method for identifying packed Android mobile applications according to claim 1, characterized in that the packing identification model is a random forest model.

4. A device for identifying packed Android mobile applications, characterized in that it includes: a training data acquisition module, used to obtain training Android applications and extract the dex file after decompression, wherein each training sample includes the Android application itself and whether it is packed; a statistics module, used to count the number of times each string appears in the dex file and to calculate character occurrence frequencies based on the number of times the string appears; a feature data acquisition module, used to obtain, as feature data, the string with the highest difference between the character frequencies of packed and unpacked Android applications; a model training module, used to train a packing identification model with the feature data, wherein the packing identification model classifies the feature data to obtain the category corresponding to each feature, and the category indicates whether the Android program is packed; and an identification module, used to obtain the Android application to be identified, decompress it to obtain the dex file to be identified, and identify that dex file with the trained packing identification model to obtain the identification result.

5. The device for identifying packed Android mobile applications according to claim 4, characterized in that the device also includes an AndroidManifest.xml file acquisition module, used to obtain the AndroidManifest.xml file and determine from it whether the dex file of the Android application has been split.

6. The device for identifying packed Android mobile applications according to claim 4, characterized in that the packing identification model is a random forest model.

7. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program is executed, it performs the method for identifying packed Android mobile applications as described in any one of claims 1 to 3.