[0057] The process flow of the present invention is shown in Figure 1. The main part is feature extraction using the sensitive behavior analysis method: by analyzing the installation file (APK) of an Android application, we obtain data that can be used by the support vector machine algorithm. For Android applications whose security is known, the features are obtained through analysis and organized into a training data set. For Android applications whose security is unknown, all features other than the security label are obtained through feature analysis, and the security is predicted by combining these features with the result of training on the data set.
[0058] Step 1: The APK file cannot be analyzed directly. After decompression, we obtain AndroidManifest.xml, classes.dex, and the layout-related XML files (layout.xml). These files cannot be used directly as input for the next step of analysis either. We use the tool dex2jar (http://sourceforge.net/projects/dex2jar/) to convert classes.dex into a jar package, from which the Java code of the Android application is obtained; at the same time, we use the tool apktool (http://ibotpeaches.github.io/Apktool/) to decode the XML files into readable XML.
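The decompression part of Step 1 can be sketched as follows. This is only an illustration under stated assumptions: it relies on the fact that an APK is a ZIP archive, it assumes that layout files sit under `res/layout/`, and the helper names (`select_key_entries`, `extract_key_files`) are hypothetical. The dex2jar and apktool steps are not reproduced here.

```python
import io
import zipfile

# Entries that the description names as inputs to the later analysis steps.
KEY_FILES = ("classes.dex", "AndroidManifest.xml")
# Assumed location of layout XML files; real APKs may also use variants
# such as res/layout-land/, which this sketch ignores.
LAYOUT_PREFIX = "res/layout/"


def select_key_entries(apk: zipfile.ZipFile) -> list[str]:
    """Return the archive entries needed for decompilation and XML decoding."""
    names = apk.namelist()
    wanted = [n for n in names if n in KEY_FILES]
    wanted += [n for n in names
               if n.startswith(LAYOUT_PREFIX) and n.endswith(".xml")]
    return sorted(wanted)


def extract_key_files(apk_path: str, out_dir: str) -> list[str]:
    """Unpack only the key entries of an APK into out_dir."""
    with zipfile.ZipFile(apk_path) as apk:
        wanted = select_key_entries(apk)
        apk.extractall(out_dir, members=wanted)
    return wanted


def _demo_apk() -> bytes:
    """Build a tiny in-memory archive standing in for a real APK."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name in ("classes.dex", "AndroidManifest.xml",
                     "res/layout/main.xml", "res/drawable/icon.png"):
            z.writestr(name, b"placeholder")
    return buf.getvalue()


with zipfile.ZipFile(io.BytesIO(_demo_apk())) as demo:
    DEMO_ENTRIES = select_key_entries(demo)
```

The XML files extracted this way are still in Android's binary XML format, which is why the description goes on to decode them with apktool.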
[0059] Step 2: After obtaining the code of the Android application, use the WALA tool (http://sourceforge.net/projects/wala/) to obtain the control flow graph (CFG) and the function call graph, and further extract the sensitive APIs, the sensitive data, and the related behaviors. Since we mainly focus on which sensitive APIs in an Android application operate on which sensitive data, we define the following predicates, expressed in the Datalog language, to find sensitive APIs and sensitive data in the code.
[0060] Whether there is a sensitive API, hasSenAPI(F, SenAPI, L): at line L of the program, inside function F, a sensitive API (SenAPI) is called.
[0061] Whether there is sensitive data, hasSenData(L, SenData): the sensitive data SenData is operated on at line L of the program.
[0062] After finding the sensitive APIs and sensitive data in the code, we need to know how the two are propagated through the application and which function triggers the sensitive API to operate on the sensitive data. Therefore, the following predicates are defined to express the call relationships between functions.
[0063] Direct function call relationship, directInvoke(F*, F', L): at line L of the program, function F* directly calls function F'.
[0064] Indirect function call relationship, indirectInvoke(F*, F'): in an event-driven environment, function F' is the eventual call target of F*.
[0065] In Android, the ICC (Inter-Component Communication) mechanism can be used to make function calls between components. We define the following predicates to fully express a function call made through the ICC mechanism.
[0066] Whether it is Intent, isIntent(L1, X): The type of parameter X defined at line L1 of the program is Intent.
[0067] Intent initialization, intentInitial(L2, X, Y): the parameter X is initialized at line L2 of the program, and the first actual argument of the initialization is Y.
[0068] A function call using the ICC mechanism, iccInvoke(F*, L, X): at line L of the program, the function F* calls another function using the parameter X of type Intent.
[0069] Whether in a certain component, inComponent(F', Y): The function F' is in the component Y.
[0070] The following uses the code to specifically illustrate a case of using the ICC mechanism to call a function.
[0071]
[0072] The above code shows that, in function F*, a parameter X of type Intent is defined and initialized with the parameters Y and url. When X is started by the statement "startActivity(X);", the function F' in component Y is called; that is, function F* calls function F' through the ICC mechanism.
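One way to read the four ICC predicates together is as a join on the Intent variable X and the component Y. The following is a minimal sketch of that reading, not the tool used by the invention; the fact tuples, the line numbers, and the function name `resolve_icc_calls` are hypothetical.

```python
from typing import NamedTuple


class IccFacts(NamedTuple):
    is_intent: set       # {(L1, X)}      X is declared with type Intent
    intent_initial: set  # {(L2, X, Y)}   X is initialized with first argument Y
    icc_invoke: set      # {(F_star, L, X)} F* starts X at line L
    in_component: set    # {(F_prime, Y)} F' belongs to component Y


def resolve_icc_calls(facts: IccFacts) -> set:
    """Join the predicates on X and Y; return (F*, F', L) call triples."""
    intents = {x for (_l1, x) in facts.is_intent}
    calls = set()
    for caller, line, x in facts.icc_invoke:
        if x not in intents:
            continue  # the started object is not known to be an Intent
        for _l2, x2, component in facts.intent_initial:
            if x2 != x:
                continue
            for callee, component2 in facts.in_component:
                if component2 == component:
                    calls.add((caller, callee, line))
    return calls


# Hypothetical facts for the situation described above.
demo_facts = IccFacts(
    is_intent={(3, "X")},
    intent_initial={(3, "X", "Y")},
    icc_invoke={("F*", 4, "X")},
    in_component={("F'", "Y")},
)
icc_calls = resolve_icc_calls(demo_facts)
```

With these facts the join yields the single triple (F*, F', 4), which says that F* reaches F' through the Intent started at line 4.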
[0073] Step 3: Taking the predicates obtained in the above steps as input, we define derivation rules, also expressed in the Datalog language, to derive the sensitive behaviors of concern: a UI component triggers some sensitive API to operate on sensitive data, which can be expressed as (UIFun, SenAPI, SenData). The derivation rules are as follows.
[0074] Derivation rule R1, call between functions (direct call relationship), invoke(F*, F', L) :- directInvoke(F*, F', L): F* directly calls F' at line L of the program, so it can be deduced that F* calls F' at line L of the program.
[0075] Derivation rule R2, call between functions (call relationship using the ICC mechanism), invoke(F*, F', L) :- isIntent(L1, X) & intentInitial(L2, X, Y) & iccInvoke(F*, L, X) & inComponent(F', Y): this rule is based on the Intent passing mechanism in Android applications. F* uses a parameter X of type Intent at line L of the program to communicate between components. X is defined as type Intent at line L1 of the program and initialized at line L2 of the program, the first actual argument of the initialization is Y, and the component Y contains the function F'. It can therefore be deduced that F* calls the function F' at line L of the program.
[0076] Derivation rule R3, transitivity of function calls (call relationship and indirect call relationship), invoke(F*, F', L) :- invoke(F*, F, L) & indirectInvoke(F, F'): F* calls F at line L of the program and F' is the eventual call target of F (F does not call F' directly), so it is deduced that F* calls F' at line L of the program.
[0077] Derivation rule R4, transitivity of function calls (call relationship and call relationship), invoke(F*, F', L) :- invoke(F*, F, L) & invoke(F, F', L'): F* calls F at line L of the program and F calls F' at line L' of the program, so it is deduced that F* calls F' at line L of the program.
[0078] Derivation rule R5, whether there is sensitive behavior, hasSenAction(F*, SenAPI, SenData, L) :- hasSenAPI(F*, SenAPI, L) & hasSenData(L, SenData): F* calls SenAPI at line L of the program and SenData is operated on at line L of the program, so it is deduced that F* uses SenAPI to operate on SenData at line L of the program.
[0079] Derivation rule R6, transfer of sensitive behavior between functions, hasSenAction(F*, SenAPI, SenData, L) :- hasSenAction(F', SenAPI, SenData, L') & invoke(F*, F', L): F' uses SenAPI to operate on SenData at line L' of the program and F* calls F' at line L of the program, so it is deduced that F* uses SenAPI to operate on SenData at line L of the program.
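To make the effect of the rules concrete, the following sketch evaluates R1, R4, R5 and R6 by naive fixpoint iteration over sets of tuples. It is an illustration, not a Datalog engine and not the implementation of the invention: R2 and R3 are omitted for brevity, and the function names and fact values (`loadPage`, `fetch`, line numbers) are hypothetical.

```python
def derive(direct_invoke, has_sen_api, has_sen_data):
    """Return (invoke, has_sen_action) derived by rules R1, R4, R5, R6."""
    # R1: every direct call is a call.
    invoke = set(direct_invoke)

    # R4: invoke(F*, F', L) :- invoke(F*, F, L) & invoke(F, F', L').
    changed = True
    while changed:
        changed = False
        for (a, b, line) in list(invoke):
            for (b2, c, _line2) in list(invoke):
                if b == b2 and (a, c, line) not in invoke:
                    invoke.add((a, c, line))
                    changed = True

    # R5: a sensitive API and sensitive data on the same line.
    actions = {(f, api, data, line)
               for (f, api, line) in has_sen_api
               for (line2, data) in has_sen_data
               if line2 == line}

    # R6: a caller inherits the sensitive actions of its callee,
    # recorded at the line of the call.
    changed = True
    while changed:
        changed = False
        for (callee, api, data, _line) in list(actions):
            for (caller, callee2, line) in invoke:
                fact = (caller, api, data, line)
                if callee2 == callee and fact not in actions:
                    actions.add(fact)
                    changed = True
    return invoke, actions


# Hypothetical input facts: onClick -> loadPage -> fetch, and fetch
# calls openStream on user account data at line 20.
invoke, actions = derive(
    direct_invoke={("onClick", "loadPage", 9), ("loadPage", "fetch", 16)},
    has_sen_api={("fetch", "openStream", 20)},
    has_sen_data={(20, "UserAccount")},
)
```

The derived set `actions` then contains a tuple for the UI callback `onClick`, which is the (UIFun, SenAPI, SenData) information that the later steps use.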
[0080] Step 4: Sensitive behaviors (UIFun, SenAPI, SenData) cannot be used directly as features of the application. What we care about is whether the intent expressed by the UI component itself is consistent with the sensitive API it triggers. To this end, we need to extract the text of the UI component. From the layout.xml file obtained in the first step, we obtain the text UIText on the UI component and substitute it for UIFun, which gives sensitive behaviors of the form (UIText, SenAPI, SenData).
[0081] Step 5: In order to control the scale and sparseness of the training data set, we further process the sensitive behaviors (UIText, SenAPI, SenData).
[0082] First, we mainly focus on the text of buttons and pop-up dialog boxes, which usually contains few characters. Any UIText with a large number of characters is replaced with "Long".
[0083] Second, there are a large number of sensitive APIs, but most of them fall into a few categories. We do not need to know the specific behavior of a sensitive API; we only need to know its function. SenAPI is therefore replaced with the SenAct corresponding to the type of operation it performs. SenAct includes the following types: SendMessage (sending messages), Call (making calls), Internet (accessing the Internet), Install (installing applications), and UseDevice (using devices such as the camera or peripherals such as GPS). Except for the Internet SenAct, the type of sensitive data processed by each sensitive action is fixed, so we set the corresponding SenData to "DEFAULT". For the Internet SenAct, SenData is replaced, according to its data type, with Message (messages), AddressBook (address book), UserAccount (user account information), SensitiveFile (private files), etc. After the above processing, we obtain sensitive behaviors of the form (UIText, SenAct, SenData).
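The normalization in Step 5 can be sketched as a small function. The following is illustrative only: the length limit `LONG_TEXT_LIMIT` and the API-to-category table `API_TO_ACT` are assumptions (the description gives neither a number nor a concrete table), and representing missing sensitive data for Internet behaviors as "NULL" is likewise an assumption.

```python
# Hypothetical cut-off; the description does not state how many
# characters make a text "Long".
LONG_TEXT_LIMIT = 20

# Hypothetical, incomplete table from concrete API names to SenAct types.
API_TO_ACT = {
    "sendTextMessage": "SendMessage",
    "openStream": "Internet",
    "getLastKnownLocation": "UseDevice",
}


def normalize(ui_text, sen_api, sen_data):
    """Map (UIText, SenAPI, SenData) to (UIText, SenAct, SenData).

    sen_data is assumed to be already classified as one of Message,
    AddressBook, UserAccount, SensitiveFile, or None if there is none.
    """
    text = "Long" if len(ui_text) > LONG_TEXT_LIMIT else ui_text
    act = API_TO_ACT[sen_api]
    if act == "Internet":
        # Only Internet behaviors keep a data category.
        data = sen_data if sen_data is not None else "NULL"
    else:
        # All other actions have a fixed data type.
        data = "DEFAULT"
    return (text, act, data)


short_case = normalize("OK", "openStream", "UserAccount")
long_case = normalize("x" * 50, "sendTextMessage", "Message")
```

Collapsing long texts and concrete APIs in this way keeps the number of distinct (UIText, SenAct, SenData) triples, and hence the number of feature columns, small.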
[0084] Step 6: By analyzing applications with known security, we learn which sensitive behaviors (UIText, SenAct, SenData) they have. Whether an application has a given sensitive behavior (UIText, SenAct, SenData) is used as one of its features: if the application has this sensitive behavior, the value of the feature is "1", otherwise it is "0". From the AndroidManifest.xml obtained in the first step, we obtain the application's list of permission requests, and whether the application makes a given permission request is also used as a feature: if the application makes this permission request, the value of the feature is "1", otherwise it is "0". In addition, the security of the application is used as its label, and a unique identification code is assigned to the application, thus forming one record in the training data set. The data structure of the final training data set is shown in Figure 2.
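The encoding described in Step 6 can be sketched as follows. The record layout (a dictionary with `id`, `features` and `label`) is an assumption made for illustration and is not taken from Figure 2; the column lists, the application id, and the label encoding are hypothetical.

```python
def build_record(app_id, behaviors, permissions, label,
                 behavior_columns, permission_columns):
    """Encode one application as a 0/1 feature row plus its label."""
    features = [1 if b in behaviors else 0 for b in behavior_columns]
    features += [1 if p in permissions else 0 for p in permission_columns]
    return {"id": app_id, "features": features, "label": label}


# Hypothetical feature columns shared by all records of the data set.
BEHAVIOR_COLUMNS = [
    ("OK", "Internet", "UserAccount"),
    ("send", "SendMessage", "DEFAULT"),
]
PERMISSION_COLUMNS = [
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
]

record = build_record(
    app_id="app-001",                       # hypothetical identifier
    behaviors={("OK", "Internet", "UserAccount")},
    permissions={"android.permission.INTERNET"},
    label=1,                                # hypothetical label encoding
    behavior_columns=BEHAVIOR_COLUMNS,
    permission_columns=PERMISSION_COLUMNS,
)
```

Because every record uses the same column order, the rows of n applications can be stacked directly into a feature matrix.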
[0085] Step 7: After performing feature extraction on applications with known security, we also need to perform feature extraction on applications with unknown security. Because the security of these applications is unknown, we can only obtain their sensitive behavior features and permission request features. We then analyze and predict the security of these Android applications: a support vector machine is trained on the training data set, and the trained model is used to predict the security of an application from its other features.
[0086] The feature extraction process of the present invention is described in detail below, taking as an example the APK installation file of Peapod (Wandoujia), a well-known Chinese Android application market. The present invention is not limited to this example. The specific process of extracting features for this application is as follows:
[0087] Step 1: Change the .apk suffix of the Peapod installation file obtained from the Internet to .zip, and after decompression obtain the key files classes.dex and AndroidManifest.xml and the layout files located in the res folder. By processing them as described above, we obtain the application's Java code and readable XML files.
[0088] Step 2: Use the tool WALA to analyze the Java code and obtain the function call graph. For example, by analyzing the following code segment, we obtain the function call graph shown in Figure 3.
[0089] code segment:
[0090]
[0091]
[0092]
[0093]
[0094] Step 3: By analyzing the function call graph, we can find the sensitive APIs and the function call relationships. For the function call relationships shown in Figure 3, the useful information is expressed in the Datalog language as follows.
[0095] hasSenAPI(Fn(), openStream(), 20);
[0096] directInvoke(Fm(), Fm+1(), 16); directInvoke(Fm+1(), Fm+2(), X); ...;
[0097] directInvoke(Fm+n-m-1(), Fn(), Y);
[0098] directInvoke(onClick(), Fm(), 09);
[0099] Step 4: From this information, we need to find the sensitive behavior of the application, that is, which UI control triggers which sensitive API. The process of deriving the sensitive behavior from the above code segment is expressed in the Datalog language as follows.
[0100] According to the derivation rule R1: invoke(Fm(), Fm+1(), 16); invoke(Fm+1(), Fm+2(), X); ... :- directInvoke(Fm(), Fm+1(), 16); directInvoke(Fm+1(), Fm+2(), X); ...;
[0101] According to the derivation rule R1: invoke(Fm+n-m-1(), Fn(), Y) :- directInvoke(Fm+n-m-1(), Fn(), Y);
[0102] According to the derivation rule R1: invoke(onClick(), Fm(), 09) :- directInvoke(onClick(), Fm(), 09);
[0103] According to the derivation rule R4: invoke(Fm(), Fn(), 16) :- invoke(Fm(), Fm+1(), 16) & invoke(Fm+1(), Fm+2(), X) & ... & invoke(Fm+n-m-1(), Fn(), Y);
[0104] According to the derivation rule R5: hasSenAction(Fn(), openStream(), NULL, 20) :- hasSenAPI(Fn(), openStream(), 20);
[0105] According to the derivation rule R4: invoke(onClick(), Fn(), 09) :- invoke(onClick(), Fm(), 09) & invoke(Fm(), Fn(), 16);
[0106] According to the derivation rule R6: hasSenAction(onClick(), openStream(), NULL, 09) :- hasSenAction(Fn(), openStream(), NULL, 20) & invoke(onClick(), Fn(), 09);
[0107] We thus find that the application contains the sensitive behavior (onClick(), openStream(), NULL). In this example, the sensitive API does not operate on sensitive data, so the sensitive data is set to NULL.
[0108] Step 5: For the sensitive behavior obtained in Step 4, we need to replace the UI control with the text shown on it. In the code snippet of this example, the id of the button is 2131296347, which is 7F09005B in hexadecimal. The button id can be looked up in the layout file obtained in the first step, where the text on the button is "View detailed page". Thus we get the sensitive behavior ("view detailed page", openStream(), NULL). Replacing the API with its type, we finally find that the application contains the sensitive behavior ("view detailed page", Internet, NULL). The code snippet in this example is relatively simple; if the text is long or sensitive data is involved, it needs to be processed accordingly.
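The id lookup in this step can be sketched as follows. This is a hedged illustration: it assumes that the decoded resources contain a table mapping resource ids to names (here a hypothetical `PUBLIC_XML` snippet) and that the layout refers to the button by that name; the resource name `detail_button`, both XML snippets, and the function `button_text` are invented for the example. Texts given as string references rather than literals are not handled.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical decoded resource table: maps a resource name to its id.
PUBLIC_XML = """<resources>
  <public type="id" name="detail_button" id="0x7f09005b" />
</resources>"""

# Hypothetical decoded layout file containing the button.
LAYOUT_XML = """<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android">
  <Button android:id="@id/detail_button" android:text="View detailed page" />
</LinearLayout>"""


def resource_id_to_hex(res_id: int) -> str:
    """Format a decimal resource id as 8 upper-case hexadecimal digits."""
    return format(res_id, "08X")


def button_text(res_id, public_xml, layout_xml):
    """Return the literal text of the widget with the given id, or None."""
    name = None
    for node in ET.fromstring(public_xml).iter("public"):
        if node.get("type") == "id" and int(node.get("id"), 16) == res_id:
            name = node.get("name")
    if name is None:
        return None
    for node in ET.fromstring(layout_xml).iter():
        ref = node.get(ANDROID_NS + "id", "")
        # Accept both "@id/name" and "@+id/name" references.
        if ref.split("/")[-1] == name:
            return node.get(ANDROID_NS + "text")
    return None


hex_id = resource_id_to_hex(2131296347)
text = button_text(2131296347, PUBLIC_XML, LAYOUT_XML)
```

The decimal id from the code and the hexadecimal id in the resource files denote the same number, so the conversion is only needed to match the notation used in the decoded files.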
[0109] Step 6: From the AndroidManifest.xml file obtained in the first step, we obtain the list of permission requests made by the application; whether the application makes a given permission request is one of its features. Likewise, whether the application contains a given sensitive behavior is one of its features. The feature values of the application in this example are shown below.
[0110] App name: wdj
[0111] Whether it contains the sensitive behavior ("view detailed page", Internet, NULL): 1
[0112] Whether it contains the sensitive behavior ("OK", Internet, UserAccount): 1
[0113] Whether it contains the sensitive behavior ("install", Install, NULL): 1
[0114] Whether it contains the sensitive behavior ("send", SendMessage, Message): 0
[0115] Whether it contains the sensitive behavior ("synchronized address book", Internet, AddressBook): 1
[0116] Whether it includes the permission request "android.permission.ACCESS_WIFI_STATE": 1
[0117] Whether it includes the permission request "android.permission.READ_PHONE_STATE": 1
[0118] Whether it includes the permission request "android.permission.ACCESS_NETWORK_STATE": 1
[0119] Whether it includes the permission request "android.permission.INTERNET": 1
[0120] Whether it includes the permission request "android.permission.GET_ACCOUNTS": 1
[0121] Whether it includes the permission request "android.permission.READ_CONTACTS": 1
[0122] Whether it includes the permission request "android.permission.SENSOR_ENABLE": 0
[0123] Whether it includes the permission request "android.permission.READ_SMS": 1
[0124] Whether it includes the permission request "android.permission.RECEIVE_SMS": 1
[0125] Whether it includes the permission request "android.permission.WAKE_LOCK": 1
[0126] After performing the above feature extraction on n applications with known security, they are integrated into a feature matrix to obtain the training data set we need.
[0127] Step 7: Use the support vector machine algorithm to judge the security of the Android application. The support vector machine algorithm first maps the vector into a high-dimensional space and then finds the generalized optimal classification surface in that space. The formula is $f(x) = \omega^{T} x + b$, where $\omega^{T}$ is used to map the vector $x$ into the high-dimensional space, $x$ is the feature vector of the application, and $\omega^{T} x^{*} + b$ is the optimal classification surface found through the training data set ($x^{*}$ is a feature vector satisfying the hyperplane, obtained from the training result). $f(x)$ is the predicted value of the security label of the application, and the security of the application is decided by comparing $f(x)$ with a threshold $t$. (The threshold $t$ is determined by comparing multiple sets of experimental results. Generally, the value of $f(x)$ is normalized to [0, 1], and the value of $t$ is 0.5.)
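The prediction step can be sketched as follows, assuming a linear decision function of the form w·x + b. This is not the trained model of the invention: the weights, the bias, and the feature vectors are hypothetical. The use of a logistic function to bring the value into [0, 1] and the choice that values at or above t map to label 1 are assumptions, since the description fixes neither.

```python
import math


def decision_value(w, x, b):
    """Linear decision function: dot product of w and x, plus b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def squash(value):
    """Map a real value into (0, 1); the logistic form is an assumption."""
    return 1.0 / (1.0 + math.exp(-value))


def predict(w, x, b, t=0.5):
    """Return label 1 if the squashed value reaches threshold t, else 0.

    Which label denotes a secure application is not specified here.
    """
    return 1 if squash(decision_value(w, x, b)) >= t else 0


# Hypothetical parameters standing in for a training result.
W = [1.0, -2.0]
B = 0.5

label_a = predict(W, [1, 0], B)  # decision value 1.5
label_b = predict(W, [0, 1], B)  # decision value -1.5
```

In practice the parameters would come from training a support vector machine on the feature matrix built in Step 6, and the threshold t would be tuned experimentally as the description states.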