[0053] Referring to Figure 1, which is a flowchart of the first embodiment of generating a model for identifying program types according to this application:
[0054] Step 101: Input a large number of extracted programs, where the programs include malicious programs and non-malicious programs.
[0055] Step 102: Extract class behavior features from each input program, and classify the extracted class behavior features.
[0056] Specifically, each program file is analyzed, predefined class behavior features are extracted from the program file, and a feature vector, together with the black and white attribute of each feature vector, is generated from the extracted class behavior features; the type of compiler that compiled the corresponding program is determined according to known compiler entry instruction sequences.
[0057] The class behavior features in the embodiments of the present application are described in detail below. As a whole, the class behavior features can be divided into import table library features and import table API (Application Programming Interface) features, which are described as follows:
[0058] 1. Import table library features
[0059] The dynamic libraries imported through the import table usually serve specific purposes and can therefore indicate the functions that the program itself may perform. For example, a program whose import table contains the library WS2_32.DLL generally needs to perform networking operations. Therefore, by checking the library names in the import table, several dynamic libraries commonly used by malicious programs can be preselected. Specifically, a HASH (hash) table can be established for these dynamic libraries: the selected dynamic library feature strings are normalized, a HASH value is calculated for each, and the HASH table is built from the calculated HASH values. After the import table of a program is extracted, the HASH table can be looked up to determine the import table features, and thereby to determine whether the program is a malicious program.
[0060] For example, the import table library features can be further subdivided into the following feature types:
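The hash table lookup described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the normalization rule (trim and uppercase), the choice of SHA-256 as the HASH function, and the names `PRESELECTED_LIBRARIES`, `build_hash_table`, and `match_import_libraries` are all assumptions made for the example.

```python
import hashlib

# Hypothetical preselected libraries, taken from the examples in this description.
PRESELECTED_LIBRARIES = ["WS2_32.DLL", "WININET.DLL", "ADVAPI32.DLL", "KERNEL32.DLL"]


def normalize_library_name(name):
    """Assumed normalization: strip whitespace and convert to uppercase."""
    return name.strip().upper()


def hash_feature(feature):
    """Hash a normalized feature string (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


def build_hash_table(names):
    """Build a table mapping hash value -> normalized library name."""
    table = {}
    for name in names:
        normalized = normalize_library_name(name)
        table[hash_feature(normalized)] = normalized
    return table


def match_import_libraries(imported_names, table):
    """Return the preselected libraries found among a program's imported names."""
    matches = []
    for name in imported_names:
        key = hash_feature(normalize_library_name(name))
        if key in table:
            matches.append(table[key])
    return matches


LIBRARY_TABLE = build_hash_table(PRESELECTED_LIBRARIES)
```

Because the names are normalized before hashing, `ws2_32.dll` and `WS2_32.DLL` resolve to the same table entry.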
[0061] 1) Network class features (including RPC), examples are as follows:
[0062] DNSAPI.DLL
[0063] MSWSOCK.DLL
[0064] NDIS.SYS
[0065] NETAPI32.DLL
[0066] WININET.DLL
[0067] WSOCK32.DLL
[0068] WS2_32.DLL
[0069] MPR.DLL
[0070] RPCRT4.DLL
[0071] URLMON.DLL
[0072] 2) Advanced Win32 application program interface class features, examples are as follows:
[0073] ADVAPI32.DLL
[0074] 3) System kernel class features, examples are as follows:
[0075] KERNEL32.DLL
[0076] NTDLL.DLL
[0077] NTOSKRNL.EXE
[0078] 4) Windows user interface related application program interface class features, examples are as follows:
[0079] USER32.DLL
[0080] 5) Common graphical user interface (GUI) module class features of Windows applications, examples are as follows:
[0081] COMCTL32.DLL
[0082] GDI32.DLL
[0083] GDIPLUS.DLL
[0084] 6) Windows hardware abstraction layer module class features, examples are as follows:
[0085] HAL.DLL
[0086] 7) Microsoft MFC library class features, examples are as follows:
[0087] MFC42.DLL
[0088] 8) Microsoft Visual Basic virtual machine related module class features, examples are as follows:
[0089] MSVBVM60.DLL
[0090] 9) Standard C runtime library class features, examples are as follows:
[0091] MSVCP60.DLL
[0092] MSVCR71.DLL
[0093] MSVCRT.DLL
[0094] 10) Object linking and embedding related module class features, examples are as follows:
[0095] OLE32.DLL
[0096] OLEAUT32.DLL
[0097] 11) Windows system process status support module class features, examples are as follows:
[0098] PSAPI.DLL
[0099] 12) Windows 32-bit shell dynamic link library file class features, examples are as follows:
[0100] SHELL32.DLL
[0101] 13) UNC and URL address dynamic link library file class features, used for registry key values and color settings, examples are as follows:
[0102] SHLWAPI.DLL
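The thirteen feature types listed above can be held in a simple lookup structure. The following sketch groups the libraries imported by a program by feature type; using the list numbers 1 to 13 as category identifiers and the function name `group_by_category` are assumptions made for illustration only.

```python
# Category number -> example library names, copied from the lists above.
LIBRARY_CATEGORIES = {
    1: ["DNSAPI.DLL", "MSWSOCK.DLL", "NDIS.SYS", "NETAPI32.DLL", "WININET.DLL",
        "WSOCK32.DLL", "WS2_32.DLL", "MPR.DLL", "RPCRT4.DLL", "URLMON.DLL"],
    2: ["ADVAPI32.DLL"],
    3: ["KERNEL32.DLL", "NTDLL.DLL", "NTOSKRNL.EXE"],
    4: ["USER32.DLL"],
    5: ["COMCTL32.DLL", "GDI32.DLL", "GDIPLUS.DLL"],
    6: ["HAL.DLL"],
    7: ["MFC42.DLL"],
    8: ["MSVBVM60.DLL"],
    9: ["MSVCP60.DLL", "MSVCR71.DLL", "MSVCRT.DLL"],
    10: ["OLE32.DLL", "OLEAUT32.DLL"],
    11: ["PSAPI.DLL"],
    12: ["SHELL32.DLL"],
    13: ["SHLWAPI.DLL"],
}

# Reverse index: library name -> category number.
CATEGORY_OF = {
    name: category
    for category, names in LIBRARY_CATEGORIES.items()
    for name in names
}


def group_by_category(imported_names):
    """Group a program's imported libraries by feature type; unknown names are skipped."""
    groups = {}
    for name in imported_names:
        normalized = name.strip().upper()
        category = CATEGORY_OF.get(normalized)
        if category is not None:
            groups.setdefault(category, []).append(normalized)
    return groups
```

For instance, a program importing `ws2_32.dll` and `USER32.DLL` would be grouped under categories 1 and 4 respectively.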
[0103] 2. Import table API features
[0104] The import table API features are function features selected from the import table libraries; these functions further describe the behavior of the program. The normalized format is as follows:
[0105] DLLNAME!APINAME
[0106] DLLNAME is uniformly converted to uppercase, for example ADVAPI32.DLL!AddAccessAllowedAce
[0107] For the advanced Win32 application program interface class feature ADVAPI32.DLL, function features can be further selected, for example:
[0108] ADVAPI32.DLL!AddAccessAllowedAce
[0109] ADVAPI32.DLL!AddAce
[0110] ADVAPI32.DLL!AdjustTokenPrivileges
[0111] ADVAPI32.DLL!AllocateAndInitializeSid
[0112] ADVAPI32.DLL!ChangeServiceConfig2A
[0113] ADVAPI32.DLL!ChangeServiceConfig2W
[0114] ADVAPI32.DLL!CheckTokenMembership
[0115] ADVAPI32.DLL!CloseServiceHandle
[0116] ADVAPI32.DLL!ControlService
[0117] ADVAPI32.DLL!ConvertSidToStringSidW
[0118] As another example, for COMCTL32.DLL, a common graphical user interface (GUI) module class feature of Windows applications, function features can be further selected, for example:
[0119] COMCTL32.DLL!13
[0120] COMCTL32.DLL!14
[0121] COMCTL32.DLL!17
[0122] COMCTL32.DLL!CreatePropertySheetPageA
[0123] COMCTL32.DLL!DestroyPropertySheetPage
[0124] COMCTL32.DLL!FlatSB_GetScrollInfo
[0125] COMCTL32.DLL!FlatSB_SetScrollInfo
[0126] COMCTL32.DLL!FlatSB_SetScrollPos
[0127] COMCTL32.DLL!ImageList_Add
[0128] COMCTL32.DLL!ImageList_AddMasked
[0129] The above is only an exemplary description; the function features corresponding to each specific import table library feature are not enumerated one by one here.
[0130] A HASH (hash) table can also be established for the above function features: the selected function feature strings are normalized, a HASH value is calculated for each, and the HASH table is built from the calculated HASH values. After the import table API features of a program are extracted, the HASH table can be looked up in order to determine whether the program is a malicious program.
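The normalization and lookup of import table API features might look like the sketch below. It assumes that the normalized string is written as `DLLNAME!APINAME` with no whitespace around the separator, that only the DLL name is uppercased, and that numeric entries such as `COMCTL32.DLL!13` are kept as their decimal text; SHA-256 and all function and variable names are illustrative choices, not taken from the application.

```python
import hashlib


def normalize_api_feature(dll_name, api):
    """Build 'DLLNAME!APINAME': DLL name uppercased, API name (or number) kept as-is."""
    return f"{dll_name.strip().upper()}!{str(api).strip()}"


def feature_hash(feature):
    """Hash a normalized feature string (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


# Hypothetical selection of function features, using examples from this description.
SELECTED_APIS = [
    ("advapi32.dll", "AdjustTokenPrivileges"),
    ("ADVAPI32.DLL", "ControlService"),
    ("comctl32.dll", 17),
]

# The HASH table is stored as a set of hash values of the normalized strings.
API_TABLE = {feature_hash(normalize_api_feature(d, a)) for d, a in SELECTED_APIS}


def is_selected_api(dll_name, api, table=API_TABLE):
    """Return True if the normalized DLL!API feature is present in the hash table."""
    return feature_hash(normalize_api_feature(dll_name, api)) in table
```

In this sketch the API name is compared case-sensitively, so `controlservice` would not match `ControlService`; whether that is desirable depends on how the features are extracted.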
[0131] Step 103: According to the classification result, features of different categories are trained using different decision machines to generate a training model or a set of training models for identifying malicious programs.
[0132] The different decision machines may train the features in the same way or in different ways, including training with a support vector machine or training with a decision tree; the training model may be an encoded training model or a compressed training model.
[0133] Referring to Figure 2, which is a schematic diagram of an application example of generating a model for identifying program types in this embodiment of the present application:
[0134] Here, the PE files are the large number of input executable program files, including malicious programs and non-malicious programs, and the example includes k decision machines and k training models corresponding to the k decision machines, organized according to the classification of class behavior features. After an executable program file is analyzed, the corresponding class behavior features are extracted and placed into a corresponding feature vector, and the features are classified according to the features that have been extracted. Taking the import table library features as an example, they are divided into: network class features; advanced Win32 application program interface class features; system kernel class features; operating system user interface related application program interface class features; common graphical user interface module class features of operating system applications; operating system hardware abstraction layer module class features; virtual machine related module class features; standard C runtime library class features; object linking and embedding related module class features; operating system process status support module class features; operating system 32-bit shell dynamic link library file class features; and address dynamic link library file class features. According to the classification results, the feature vectors and black and white attributes of the different categories of program files are trained by different decision machines to obtain the corresponding training models.
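The idea of training one decision machine per feature category can be sketched as below. To keep the example self-contained, a depth-one decision tree (a decision stump) on binary feature vectors stands in for the support vector machine or decision tree mentioned above; the toy data, the function names, and the dictionary layout are assumptions made for illustration only.

```python
def train_stump(vectors, labels):
    """Train a depth-one decision tree on binary feature vectors.

    Selects the feature index whose value (used directly or inverted as the
    prediction) gives the fewest training errors. Returns (index, invert).
    """
    best = None
    for index in range(len(vectors[0])):
        for invert in (False, True):
            errors = sum(
                1 for vec, label in zip(vectors, labels)
                if (vec[index] ^ invert) != label
            )
            if best is None or errors < best[0]:
                best = (errors, index, invert)
    return best[1], best[2]


def predict_stump(model, vector):
    """Return 1 (black) or 0 (white) for one binary feature vector."""
    index, invert = model
    return int(vector[index] ^ invert)


def train_per_category(samples_by_category, trainers):
    """Train one model per feature category with that category's decision machine."""
    return {
        category: trainers[category](vectors, labels)
        for category, (vectors, labels) in samples_by_category.items()
    }


# Toy data: category id -> (feature vectors, black/white labels with 1 = black).
SAMPLES = {
    1: ([[1, 0], [1, 1], [0, 0], [0, 1]], [1, 1, 0, 0]),
    3: ([[0, 1], [1, 1], [0, 0], [1, 0]], [1, 1, 0, 0]),
}
TRAINERS = {1: train_stump, 3: train_stump}
MODELS = train_per_category(SAMPLES, TRAINERS)
```

A different trainer could be registered for each category in `TRAINERS`, which mirrors the statement that the decision machines may train in the same or in different ways.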
[0135] Different feature classifications contain different numbers of specific features. Taking the network class features as an example, they may specifically include DNSAPI.DLL, MSWSOCK.DLL, NDIS.SYS, NETAPI32.DLL, WININET.DLL, WSOCK32.DLL, WS2_32.DLL, MPR.DLL, RPCRT4.DLL, URLMON.DLL, and so on. In this embodiment of the present application, a classification identifier may be assigned to each feature classification; for example, the classification identifier of the network class features is "1". A feature identifier may further be assigned to each specific network class feature; for example, the feature identifier of the dynamic library DNSAPI.DLL is "1", that of the dynamic library MSWSOCK.DLL is "2", and that of the dynamic library NETAPI32.DLL is "3". When a feature vector is generated from the extracted features, each feature entry in the feature vector is represented by its classification identifier and feature identifier. For example, if the extracted feature is the dynamic library DNSAPI.DLL among the network class features, the corresponding classification identifier is "1" and the feature identifier is "1", so the information corresponding to DNSAPI.DLL in the feature vector is represented as "1:1". Specific features belonging to other feature classifications are expressed in the same form. The following is an example of a feature vector with 4 features extracted from a program: 1:0 2:121 100:12345678 5000:365.
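The "classification identifier:feature identifier" notation can be illustrated with a short sketch. The identifier table below contains only the three assignments stated above; the function name `encode_features` and the space-separated output are assumptions for the example.

```python
# Library name -> (classification identifier, feature identifier), as stated above.
FEATURE_IDS = {
    "DNSAPI.DLL": (1, 1),
    "MSWSOCK.DLL": (1, 2),
    "NETAPI32.DLL": (1, 3),
}


def encode_features(extracted_names):
    """Encode extracted library features as 'classification:feature' tokens."""
    tokens = []
    for name in extracted_names:
        ids = FEATURE_IDS.get(name.strip().upper())
        if ids is not None:  # features without an assigned identifier are skipped
            tokens.append(f"{ids[0]}:{ids[1]}")
    return " ".join(tokens)
```

With this table, a program from which DNSAPI.DLL and NETAPI32.DLL were extracted would be encoded as `1:1 1:3`.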
[0136] The black and white attribute of a feature vector indicates whether the program containing the features in the feature vector is a malicious program or a non-malicious program: the attribute "white" corresponds to a non-malicious program, and the attribute "black" corresponds to a malicious program. Further, the white attribute may be marked as "0" and the black attribute as "1". After a feature vector is generated for each program, an attribute identifier can be assigned to it according to the information contained in the feature vector. For example, if the white attribute "0" is assigned to the above feature vector "1:0 2:121 100:12345678 5000:365", the corresponding information can be expressed as "0 1:0 2:121 100:12345678 5000:365". The above representation can also be expressed directly as an array, in which the value at the nth position is the value of the nth feature.
[0137] Referring to Figure 3, which is a flowchart of an embodiment of the machine learning-based program identification method of the present application:
[0138] Step 301: Analyze the input unknown program and extract the class behavior features of the unknown program, where the class behavior features include import table library features and import table API features.
[0139] As can be seen from the description of the embodiment illustrated in Figure 1 above, the import table library features include: network class features; advanced Win32 application program interface class features; system kernel class features; operating system user interface related application program interface class features; common graphical user interface module class features of operating system applications; operating system hardware abstraction layer module class features; virtual machine related module class features; standard C runtime library class features; object linking and embedding related module class features; operating system process status support module class features; operating system 32-bit shell dynamic link library file class features; and address dynamic link library file class features. The import table API features are the function features selected from the import table libraries.
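The labeled representation and its array form can be illustrated with a small parser. This sketch assumes that each `a:b` token is read as "position a holds value b", that positions are 1-based, and that unlisted positions hold 0; the function names are hypothetical.

```python
def parse_labeled_vector(line):
    """Split '<attribute> a:b a:b ...' into (attribute, {position: value})."""
    fields = line.split()
    label = int(fields[0])  # 0 = white (non-malicious), 1 = black (malicious)
    features = {}
    for token in fields[1:]:
        position, value = token.split(":")
        features[int(position)] = int(value)
    return label, features


def to_dense(features, length):
    """Expand the sparse form into an array; the nth position holds the nth feature."""
    dense = [0] * length
    for position, value in features.items():
        dense[position - 1] = value  # assumed 1-based positions
    return dense


LABEL, FEATURES = parse_labeled_vector("0 1:0 2:121 100:12345678 5000:365")
DENSE = to_dense(FEATURES, 5000)
```

The sparse form stores only four entries, whereas the equivalent array in this example needs 5000 positions.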
[0140] Step 302: Roughly classify the unknown program according to the extracted class behavior features.
[0141] Step 303: According to the result of the rough classification, input the unknown program into the generated training model and the corresponding decision machine for judgment.
[0142] Specifically, according to the result of the rough classification, the unknown program can be input into a plurality of generated training models and corresponding decision machines respectively for judgment, and, according to the preset weight of each feature classification in each training model, a weighted calculation is performed on the judgment results that the training models and corresponding decision machines produce for the unknown program.
[0143] Step 304: Output the identification result of the unknown program, where the identification result is a malicious program or a non-malicious program.
[0144] Specifically, the identification result of the unknown program is output according to the result of the weighted calculation; the identification result is a malicious program or a non-malicious program.
[0145] Referring to Figure 4, which is a schematic diagram of an application example of identifying program types in this embodiment of the present application:
[0146] Here, the PE file is the input unknown program file, and the example includes k decision machines and k training models corresponding to the k decision machines, organized according to the different feature classifications. After the PE file is analyzed, the corresponding class behavior features are extracted and placed into a corresponding feature vector, and the features are classified according to the class behavior features that have been extracted. For example, according to the import table library features, they can be divided into: network class features; advanced Win32 application program interface class features; system kernel class features; operating system user interface related application program interface class features; common graphical user interface module class features of operating system applications; operating system hardware abstraction layer module class features; virtual machine related module class features; standard C runtime library class features; object linking and embedding related module class features; operating system process status support module class features; operating system 32-bit shell dynamic link library file class features; and address dynamic link library file class features. According to the classification results, different decision machines and training models are used to make the corresponding judgments. The judgment results obtained by the corresponding decision machines and models are weighted according to the weights of the classifications to obtain a score, and whether the file is a malicious program or a normal program is determined according to the score.
[0147] For the input unknown program, when different decision machines and training models are used to make the corresponding judgments according to the classification results, the initial black and white attribute value of every class behavior feature can be set to 0. The class behavior features are then extracted from the unknown program, normalized, and looked up in the HASH table established above. If a corresponding feature is found, its black and white attribute value is changed from 0 to 1; otherwise no processing is performed.
[0148] When the score is obtained by weighting according to the classification weights, assume that there are k decision machines in total and m classifications, namely classifications 1, 2, ..., m. If the preset weights of the i-th classification are $(w_{i1}, w_{i2}, \ldots, w_{ik})$ and the judgment results of the decision machines for a sample of classification i are $(r_{i1}, r_{i2}, \ldots, r_{ik})$, then the comprehensive result is $(w_{i1}, w_{i2}, \ldots, w_{ik}) \cdot (r_{i1}, r_{i2}, \ldots, r_{ik})$. A judgment threshold may be preset: when the comprehensive result is less than the threshold, the unknown program is determined to be a non-malicious program, and when the comprehensive result is greater than the threshold, the unknown program is determined to be a malicious program.
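The initialization and lookup procedure described above can be sketched as follows. The example assumes that the HASH table maps each hash value to a position in the attribute vector, that normalization means trimming and uppercasing library names, and that SHA-256 serves as the HASH function; none of these details is specified in the application.

```python
import hashlib


def feature_hash(feature):
    """Hash a normalized feature string (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


# Hypothetical known features; each hash value maps to a vector position.
KNOWN_FEATURES = ["DNSAPI.DLL", "MSWSOCK.DLL", "WS2_32.DLL"]
HASH_INDEX = {feature_hash(name): i for i, name in enumerate(KNOWN_FEATURES)}


def build_attribute_vector(extracted_features, hash_index):
    """Start with all values at 0 and set a value to 1 for every feature found."""
    vector = [0] * len(hash_index)
    for feature in extracted_features:
        normalized = feature.strip().upper()  # assumed normalization rule
        position = hash_index.get(feature_hash(normalized))
        if position is not None:
            vector[position] = 1
        # features not found in the table are left unprocessed
    return vector
```

A program importing only `ws2_32.dll` and an unlisted library would yield the vector `[0, 0, 1]` under this table.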
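The weighted calculation and threshold comparison can be sketched as below. The example interprets the product of the weight tuple and the result tuple as an inner product (a weighted sum), which is an assumption, and it treats a score exactly equal to the threshold as non-malicious because the description does not cover that case. The weights, results, and threshold are illustrative values only.

```python
def weighted_score(weights, results):
    """Inner product of the per-decision-machine weights and judgment results."""
    if len(weights) != len(results):
        raise ValueError("weights and results must have the same length k")
    return sum(w * r for w, r in zip(weights, results))


def identify(weights, results, threshold):
    """Compare the weighted score with the preset threshold."""
    score = weighted_score(weights, results)
    # Assumption: a score equal to the threshold is treated as non-malicious.
    return "malicious" if score > threshold else "non-malicious"


# Illustrative values for k = 3 decision machines and one classification i.
WEIGHTS = (0.5, 0.25, 0.25)
RESULTS = (1, 0, 1)  # 1 = judged black, 0 = judged white
SCORE = weighted_score(WEIGHTS, RESULTS)
VERDICT = identify(WEIGHTS, RESULTS, threshold=0.5)
```

Here the first and third decision machines judge the sample to be black, giving a score of 0.75, which exceeds the threshold of 0.5.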
[0149] Corresponding to the embodiments of the machine learning-based program identification method of the present application, the present application also provides an embodiment of the machine learning-based program identification apparatus.