Program recognition method and device based on machine learning

A program identification and machine learning technology, applied in the computer field, can solve the problems of low efficiency and lag in identifying malicious programs, and achieve the effect of saving manpower and improving identification efficiency

Active Publication Date: 2012-07-11
三六零数字安全科技集团有限公司
2 Cites 30 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0004] The embodiment of the present application provides a program identification method and device based on ma...

Method used

[0166] Through the description of the above embodiments, it can be seen that when the embodiment of the present application identifies the type of an unknown program based on the class behavior feature, the input unknown program is analyzed, and the class behavior feature in the unknown program is extracted. The class behavior feature includes the import table library Features and import table API features, roughly classify the unknown program according to the extracted class behavior characteristics, according to the result of the rough classification, input the unknown program into the generated training model and the corresponding decision machine for judg...

Abstract

The embodiment of the invention discloses a program recognition method and device based on machine learning. The method comprises analyzing an inputted unknown program and extracting class behavior features of the unknown program, the class behavior features including library feature and application programming interface API (Application Program Interface) feature of an import table; coarsely classifying the unknown program according to the extracted class behavior features; inputting the unknown program to a generated training model and a corresponding decision machine to judge the unknown program according to the coarse classification result; and outputting the recognition result which shows that the unknown program is a malicious program or a non-malicious program. Based on machine learning technology, the method provided by the invention can obtain a model for recognizing malicious programs based on class behaviors by extracting and analyzing class behavior features of a large amount of program samples, and the model can save a large amount of man power and can improve malicious program recognition efficiency.


Examples

  • Experimental program(2)

Example

[0053] Referring to Figure 1, which is a flowchart of the first embodiment of generating a model for identifying program types in this application:
[0054] Step 101: Input the extracted massive programs, where the massive programs include malicious programs and non-malicious programs.
[0055] Step 102: Extract class behavior features from each input program, and classify the extracted class behavior features.
[0056] Specifically, each program file is analyzed and predefined class behavior features are extracted from it. A feature vector, together with the black/white attribute of that feature vector, is generated from the extracted class behavior features, and the type of compiler that produced the corresponding program is determined from the entry instruction sequences of known compilers.
[0057] The class behavior features in the embodiments of the present application are described in detail below. They can be divided into import table library features and import table API (Application Programming Interface) features, described as follows:
[0058] 1. Import table library features
[0059] A dynamic library imported through the import table usually provides specific functionality, so it indicates what the program itself may do. For example, a program whose import table includes the library WS2_32.DLL generally needs to perform network operations. Therefore, by checking the library names in the import table, several dynamic libraries commonly used by malicious programs can be preselected. Specifically, a HASH (hash) table can be established for these dynamic libraries: the selected dynamic library feature strings are normalized, a HASH value is calculated for each, and the HASH table is built from the calculated HASH values. After the import table of an input program is extracted, the HASH table can be searched to determine the program's import table features, which are then used in determining whether it is a malicious program.
[0060] For example, the import table library class feature can be further subdivided into the following feature types:
[0061] 1) Network class features (including RPC), examples are as follows:
[0062] DNSAPI.DLL
[0063] MSWSOCK.DLL
[0064] NDIS.SYS
[0065] NETAPI32.DLL
[0066] WININET.DLL
[0067] WSOCK32.DLL
[0068] WS2_32.DLL
[0069] MPR.DLL
[0070] RPCRT4.DLL
[0071] URLMON.DLL
[0072] 2) Advanced Win32 application program interface class features, examples are as follows:
[0073] ADVAPI32.DLL
[0074] 3) System kernel class features, examples are as follows:
[0075] KERNEL32.DLL
[0076] NTDLL.DLL
[0077] NTOSKRNL.EXE
[0078] 4) Windows user interface related application program interface class features, examples are as follows:
[0079] USER32.DLL
[0080] 5) Windows application common GUI graphical user interface module class features, examples are as follows:
[0081] COMCTL32.DLL
[0082] GDI32.DLL
[0083] GDIPLUS.DLL
[0084] 6) Windows hardware abstraction layer module class features, examples are as follows:
[0085] HAL.DLL
[0086] 7) Microsoft MFC library class features, examples are as follows:
[0087] MFC42.DLL
[0088] 8) Microsoft Visual Basic virtual machine related module class features, examples are as follows:
[0089] MSVBVM60.DLL
[0090] 9) Standard C runtime program class features, examples are as follows:
[0091] MSVCP60.DLL
[0092] MSVCR71.DLL
[0093] MSVCRT.DLL
[0094] 10) Object linking and embedding related module class features, examples are as follows:
[0095] OLE32.DLL
[0096] OLEAUT32.DLL
[0097] 11) Windows system process state support module class features, examples are as follows:
[0098] PSAPI.DLL
[0099] 12) Windows 32-bit shell dynamic link library file class features, examples are as follows:
[0100] SHELL32.DLL
[0101] 13) UNC and URL address dynamic link library file class features, used for registry key values and color settings, examples are as follows:
[0102] SHLWAPI.DLL
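The library lookup described in [0059] can be sketched as follows. This is a minimal illustration, not the patented implementation: the normalization rule (trim and uppercase), the choice of SHA-256 as the hash, and the names `LIBRARY_CATEGORIES`, `build_hash_table` and `match_imports` are all assumptions, since the source does not specify them. Only three of the thirteen categories listed above are included to keep the sketch short.

```python
import hashlib

# Category -> preselected libraries, copied from three of the categories above.
LIBRARY_CATEGORIES = {
    "network": ["DNSAPI.DLL", "MSWSOCK.DLL", "NDIS.SYS", "NETAPI32.DLL",
                "WININET.DLL", "WSOCK32.DLL", "WS2_32.DLL", "MPR.DLL",
                "RPCRT4.DLL", "URLMON.DLL"],
    "advanced_win32_api": ["ADVAPI32.DLL"],
    "system_kernel": ["KERNEL32.DLL", "NTDLL.DLL", "NTOSKRNL.EXE"],
}


def normalize(name):
    """Assumed normalization: strip whitespace and convert to uppercase."""
    return name.strip().upper()


def feature_hash(name):
    """Hash of the normalized name. SHA-256 is a stand-in (assumption)."""
    return hashlib.sha256(normalize(name).encode("utf-8")).hexdigest()


def build_hash_table(categories):
    """Map hash value -> (category, normalized library name)."""
    table = {}
    for category, libraries in categories.items():
        for library in libraries:
            table[feature_hash(library)] = (category, normalize(library))
    return table


def match_imports(imported_libraries, table):
    """Return the (category, library) entries found for a program's imports."""
    hits = []
    for library in imported_libraries:
        entry = table.get(feature_hash(library))
        if entry is not None:
            hits.append(entry)
    return hits


HASH_TABLE = build_hash_table(LIBRARY_CATEGORIES)
print(match_imports(["ws2_32.dll", "kernel32.dll", "custom.dll"], HASH_TABLE))
```

Libraries that are not in the preselected set, such as the hypothetical `custom.dll`, simply produce no entry.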
[0103] 2. Import table API features
[0104] The import table API feature is the function feature selected from the import table library, and these functions can further describe the behavior function of the program. The specific normalization format is as follows:
[0105] DLLNAME! APINAME
[0106] DLLNAME is uniformly converted to uppercase, for example ADVAPI32.DLL! AddAccessAllowedAce
[0107] For the advanced Win32 application program interface class feature ADVAPI32.DLL, its function features can be further selected. Examples are as follows:
[0108] ADVAPI32.DLL! AddAccessAllowedAce
[0109] ADVAPI32.DLL! AddAce
[0110] ADVAPI32.DLL! AdjustTokenPrivileges
[0111] ADVAPI32.DLL! AllocateAndInitializeSid
[0112] ADVAPI32.DLL! ChangeServiceConfig2A
[0113] ADVAPI32.DLL! ChangeServiceConfig2W
[0114] ADVAPI32.DLL! CheckTokenMembership
[0115] ADVAPI32.DLL! CloseServiceHandle
[0116] ADVAPI32.DLL! ControlService
[0117] ADVAPI32.DLL! ConvertSidToStringSidW
[0118] As another example, for the Windows application common GUI module class feature COMCTL32.DLL, function features can be further selected. Examples are as follows:
[0119] COMCTL32.DLL! 13
[0120] COMCTL32.DLL! 14
[0121] COMCTL32.DLL! 17
[0122] COMCTL32.DLL! CreatePropertySheetPageA
[0123] COMCTL32.DLL! DestroyPropertySheetPage
[0124] COMCTL32.DLL! FlatSB_GetScrollInfo
[0125] COMCTL32.DLL! FlatSB_SetScrollInfo
[0126] COMCTL32.DLL! FlatSB_SetScrollPos
[0127] COMCTL32.DLL! ImageList_Add
[0128] COMCTL32.DLL! ImageList_AddMasked
[0129] The above is only an exemplary description, and the function features corresponding to each specific import table library feature will not be repeated one by one.
[0130] A HASH (hash) table can also be established for the above function features: the selected function feature strings are normalized, a HASH value is calculated for each, and the HASH table is built from the calculated HASH values. After the import table API function features of an input program are extracted, the HASH table can be searched, and the result is used in determining whether the program is malicious.
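The DLLNAME!APINAME normalization and lookup in [0104] to [0130] might be sketched as below. The sketch joins the two parts with a bare `!` and no following space, which is an assumption about the intended format; a Python `set` (itself hash-based) stands in for the HASH table. The function names and the shape of the `import_table` argument (a mapping from library name to imported functions or ordinals) are hypothetical.

```python
def normalize_api_feature(dll_name, api):
    """Build DLLNAME!APINAME with the library name uppercased.

    `api` may be a function name or an integer ordinal, as in the
    COMCTL32.DLL examples that are imported by number.
    """
    return "{}!{}".format(dll_name.strip().upper(), str(api).strip())


# A few of the selected function features listed above.
SELECTED_API_FEATURES = {
    normalize_api_feature("ADVAPI32.DLL", "AdjustTokenPrivileges"),
    normalize_api_feature("ADVAPI32.DLL", "ControlService"),
    normalize_api_feature("COMCTL32.DLL", 13),
    normalize_api_feature("COMCTL32.DLL", "ImageList_Add"),
}


def extract_api_features(import_table):
    """Flatten {library: [functions or ordinals]} into normalized features."""
    features = []
    for dll_name, apis in import_table.items():
        for api in apis:
            features.append(normalize_api_feature(dll_name, api))
    return features


def match_api_features(import_table, selected=SELECTED_API_FEATURES):
    """Return the program's API features that appear in the selected set."""
    return [f for f in extract_api_features(import_table) if f in selected]


example_imports = {
    "advapi32.dll": ["AdjustTokenPrivileges", "RegOpenKeyExW"],
    "comctl32.dll": [13],
}
print(match_api_features(example_imports))
```

The function name itself is left in its original case here, because the source only states that DLLNAME is uppercased.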
[0131] Step 103: According to the classification result, different categories of features are trained using different decision machines to generate a training model or training model set for identifying malicious programs.
[0132] Different decision machines may train on the features in the same way or in different ways, including training with support vector machines or training with decision trees; the training model may be an encoded training model or a compressed training model.
[0133] Referring to Figure 2, which is a schematic diagram of an application example of generating a model for identifying program types in this embodiment of the present application:
[0134] Here, the PE files are the massive input executable program files, including malicious programs and non-malicious programs. According to the classification of class behavior features, there are k decision machines and k training models corresponding to the k decision machines. After an executable program file is analyzed, the corresponding class behavior features are extracted and placed into a corresponding feature vector, and the features are classified according to the features that have been extracted. Taking the import table library features as an example, they are divided into: network class features; advanced WIN32 application program interface class features; system kernel class features; operating system user interface related application program interface class features; operating system application common graphical user interface module class features; operating system hardware abstraction layer module class features; virtual machine related module class features; standard C runtime library program class features; object linking and embedding related module class features; operating system process state support module class features; operating system 32-bit shell dynamic link library file class features; and address dynamic link library file class features. According to the classification results, the feature vectors and black/white attributes of the different categories of program files are trained by different decision machines to obtain the corresponding training models.
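As a rough illustration of training one model per category, the sketch below groups labelled samples by category and fits a separate learner to each group. The learner is a depth-1 decision tree (a decision stump) over 0/1 features, chosen only to keep the example dependency-free; the source names support vector machines and decision trees without giving details. The category names, the sample data and all function names are hypothetical.

```python
from collections import defaultdict


def train_stump(vectors, labels):
    """Fit a depth-1 decision tree on binary features.

    Returns (feature_index, label_when_present): the single feature whose
    presence or absence best separates black (1) from white (0) samples.
    """
    best = None
    for j in range(len(vectors[0])):
        for present_label in (1, 0):
            errors = sum(
                1
                for x, y in zip(vectors, labels)
                if (present_label if x[j] else 1 - present_label) != y
            )
            if best is None or errors < best[0]:
                best = (errors, j, present_label)
    return best[1], best[2]


def predict_stump(model, vector):
    """Apply a trained stump to one binary feature vector."""
    j, present_label = model
    return present_label if vector[j] else 1 - present_label


def train_by_category(samples):
    """samples: iterable of (category, feature_vector, black_white_label)."""
    grouped = defaultdict(lambda: ([], []))
    for category, vector, label in samples:
        grouped[category][0].append(vector)
        grouped[category][1].append(label)
    return {c: train_stump(v, l) for c, (v, l) in grouped.items()}


# Made-up training data: two categories, two binary features each.
samples = [
    ("network", [1, 0], 1),
    ("network", [1, 1], 1),
    ("network", [0, 1], 0),
    ("system_kernel", [0, 1], 1),
    ("system_kernel", [1, 0], 0),
    ("system_kernel", [1, 1], 1),
]
models = train_by_category(samples)
print(models)
```

Each entry of `models` plays the role of one training model paired with its decision machine; a real system would substitute a stronger learner per category.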
[0135] Different feature classifications contain different numbers of specific features. Taking the network class features as an example, they may specifically include DNSAPI.DLL, MSWSOCK.DLL, NDIS.SYS, NETAPI32.DLL, WININET.DLL, WSOCK32.DLL, WS2_32.DLL, MPR.DLL, RPCRT4.DLL, URLMON.DLL, etc. In this embodiment of the present application, a classification identifier may be assigned to each feature classification; for example, the classification identifier of the network class features is "1". A feature identifier may further be assigned to each specific network class feature; for example, the feature identifier of the dynamic library DNSAPI.DLL is "1", that of MSWSOCK.DLL is "2", and that of NETAPI32.DLL is "3". When a feature vector is generated from the extracted features, each entry in the feature vector is represented by its classification identifier and feature identifier. For example, if the extracted feature is the dynamic library DNSAPI.DLL in the network class features, the corresponding classification identifier is "1" and the feature identifier is "1", so the information corresponding to DNSAPI.DLL in the feature vector is represented as "1:1". Specific features belonging to other feature classifications are expressed in the same form. The following is an example of a feature vector with 4 features extracted from a program: 1:0 2:121 100:12345678 5000:365.
[0136] The black/white attribute of a feature vector indicates whether the program containing the features in that vector is a malicious program or a non-malicious program: the attribute "white" corresponds to a non-malicious program, and the attribute "black" corresponds to a malicious program. Further, the white attribute may be marked as "0" and the black attribute as "1". After a feature vector is generated for each program, an attribute identifier can be assigned to it according to the information contained in the feature vector. For example, assigning the white attribute identifier "0" to the above feature vector "1:0 2:121 100:12345678 5000:365" gives "0 1:0 2:121 100:12345678 5000:365". The feature vector can also be represented directly as an array, in which the value at the nth position is the value of the nth feature.
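The textual representation in [0135] and [0136] can be illustrated with a small encoder and decoder. The sketch follows the worked example, treating each entry as a number followed by a value, with the black/white attribute as the leading token. Integer values, ascending order of entries, and 1-based positions in the array form are assumptions; the function names are hypothetical.

```python
def encode_sample(label, features):
    """Encode a label (0 = white, 1 = black) and {number: value} features."""
    parts = [str(label)]
    for number, value in sorted(features.items()):
        parts.append("{}:{}".format(number, value))
    return " ".join(parts)


def decode_sample(line):
    """Parse 'label number:value ...' back into (label, {number: value})."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        number, value = token.split(":", 1)
        features[int(number)] = int(value)
    return label, features


def to_array(features, size):
    """Array form: position n (1-based, an assumption) holds feature n."""
    dense = [0] * size
    for number, value in features.items():
        dense[number - 1] = value
    return dense


example = {1: 0, 2: 121, 100: 12345678, 5000: 365}
line = encode_sample(0, example)
print(line)
```

Decoding the printed line with `decode_sample` returns the original label and feature mapping, and `to_array(example, 5000)` gives the equivalent array with zeros at every position that has no entry.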
[0137] Referring to Figure 3, which is a flowchart of an embodiment of the machine learning-based program identification method of the present application:
[0138] Step 301: Analyze the input unknown program, and extract the class behavior feature in the unknown program. The class behavior feature includes the import table library feature and the import table API feature.
[0139] As described above for the embodiment illustrated in Figure 1, the import table library features include: network class features, advanced WIN32 application program interface class features, system kernel class features, operating system user interface related application program interface class features, operating system application common graphical user interface module class features, operating system hardware abstraction layer module class features, virtual machine related module class features, standard C runtime library program class features, object linking and embedding related module class features, operating system process state support module class features, operating system 32-bit shell dynamic link library file class features, and address dynamic link library file class features. The import table API features are function features selected from the import table libraries.
[0140] Step 302: Roughly classify the unknown program according to the extracted class behavior features.
[0141] Step 303: According to the result of the rough classification, input the unknown program into the generated training model and the corresponding decision machine for judgment.
[0142] Specifically, according to the result of the rough classification, the unknown program can be input into each of a plurality of generated training models and corresponding decision machines for judgment, and the judgment results of the training models and corresponding decision machines for the unknown program are combined by weighted calculation according to the preset weight of each feature classification in each training model.
[0143] Step 304: Output the identification result of the unknown program, where the identification result is a malicious program or a non-malicious program.
[0144] Specifically, the identification result of the unknown program is output according to the result of the weighted calculation, and the identification result is a malicious program or a non-malicious program.
[0145] Referring to Figure 4, which is a schematic diagram of an application example of identifying program types in this embodiment of the present application:
[0146] Here, the PE file is the input unknown program file. According to the different feature classifications, there are k decision machines and k training models corresponding to the k decision machines. After the PE file is analyzed, the corresponding class behavior features are extracted and placed into a corresponding feature vector, and the features are classified according to the class behavior features that have been extracted. For example, the import table library features can be divided into: network class features; advanced WIN32 application program interface class features; system kernel class features; operating system user interface related application program interface class features; operating system application common graphical user interface module class features; operating system hardware abstraction layer module class features; virtual machine related module class features; standard C runtime library program class features; object linking and embedding related module class features; operating system process state support module class features; operating system 32-bit shell dynamic link library file class features; and address dynamic link library file class features. According to the classification results, different decision machines and training models are used to make the corresponding judgments. The judgment results obtained from the decision machines and models are weighted according to the weights of the classifications to obtain a scoring result, and the scoring result determines whether the file is a malicious program or a normal program.
[0147] For the input unknown program, when different decision machines and training models are used to make the corresponding judgments according to the classification results, the initial black/white attribute value of every class behavior feature can be set to 0. The class behavior features are extracted from the unknown program, normalized, and looked up in the HASH table established above. If a corresponding feature is found, its black/white attribute value is changed from 0 to 1; otherwise no processing is performed.
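The initialization and lookup in [0147] can be sketched as follows. Every known feature starts at 0, and a value is switched to 1 only when the normalized feature extracted from the unknown program is found in the table. A Python `dict` stands in for the HASH table; the normalization rule, the feature list and the function names are assumptions for illustration.

```python
def normalize(name):
    """Assumed normalization: strip whitespace and convert to uppercase."""
    return name.strip().upper()


def build_index(known_features):
    """Hash table from normalized feature string to its vector position."""
    return {normalize(name): i for i, name in enumerate(known_features)}


def mark_features(extracted, index):
    """Start with all values at 0 and set 1 for each feature that is found."""
    values = [0] * len(index)
    for name in extracted:
        position = index.get(normalize(name))
        if position is not None:
            values[position] = 1
        # Features that are not in the table are left unprocessed.
    return values


KNOWN = ["DNSAPI.DLL", "MSWSOCK.DLL", "NETAPI32.DLL", "WS2_32.DLL"]
INDEX = build_index(KNOWN)
print(mark_features(["ws2_32.dll", "unknown.dll", "dnsapi.dll"], INDEX))
```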
[0148] When the scoring result is obtained by weighting according to the weights of the classifications, assume there are k decision machines in total and m classifications, namely classification 1, 2, ..., m. If the preset weights of the i-th classification are $(w_{i1}, w_{i2}, \ldots, w_{ik})$ and the judgment results of the k decision machines for a sample of classification i are $(r_{i1}, r_{i2}, \ldots, r_{ik})$, then the combined result is $(w_{i1}, w_{i2}, \ldots, w_{ik}) * (r_{i1}, r_{i2}, \ldots, r_{ik})$. A judgment threshold may be preset: when the combined result is less than the threshold, the unknown program is determined to be a non-malicious program, and when the combined result is greater than the threshold, the unknown program is determined to be a malicious program.
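A minimal sketch of the weighted combination in [0148] is given below. It reads the product of the weight vector and the result vector as an inner product, which is an assumption, since the source does not spell out the operation. The source also leaves open the case where the result equals the threshold; the sketch treats it as non-malicious. The weights, results and threshold are made-up values.

```python
def combined_score(weights, results):
    """Inner product of the k weights and the k decision machine results."""
    if len(weights) != len(results):
        raise ValueError("weights and results must have the same length")
    return sum(w * r for w, r in zip(weights, results))


def classify(weights, results, threshold):
    """Greater than the threshold: malicious; otherwise non-malicious."""
    score = combined_score(weights, results)
    return "malicious" if score > threshold else "non-malicious"


# Hypothetical weights for one classification i with k = 3 decision machines.
weights_i = (0.5, 0.25, 0.25)
print(classify(weights_i, (1, 0, 1), threshold=0.5))
print(classify(weights_i, (0, 1, 0), threshold=0.5))
```

With these values the first call scores 0.75 and the second 0.25, so they fall on opposite sides of the 0.5 threshold.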
[0149] Corresponding to the embodiments of the machine learning-based program identification method of the present application, the present application also provides an embodiment of the machine learning-based program identification apparatus.

Example

[0150] Referring to Figure 5, which is a block diagram of the first embodiment of the machine learning-based program identification device of the present application:
[0151] The apparatus includes: an extraction unit 510, a classification unit 520, a judging unit 530 and an output unit 540.
[0152] Wherein, the extraction unit 510 is used to analyze the input unknown program, and extract the class behavior feature in the unknown program, and the class behavior feature includes the import table library feature and the import table application programming interface API feature;
[0153] A classification unit 520, configured to roughly classify the unknown program according to the extracted class behavior feature;
[0154] A judging unit 530, configured to input the unknown program into the generated training model and the corresponding decision machine for judgment according to the result of the rough classification;
[0155] The output unit 540 is configured to output an identification result of the unknown program, where the identification result is a malicious program or a non-malicious program.
[0156] The extraction unit 510 is specifically configured to extract the import table library features and the import table API features in the unknown program. The import table library features include: network class features, advanced WIN32 application program interface class features, system kernel class features, operating system user interface related application program interface class features, operating system application common graphical user interface module class features, operating system hardware abstraction layer module class features, virtual machine related module class features, standard C runtime library program class features, object linking and embedding related module class features, operating system process state support module class features, operating system 32-bit shell dynamic link library file class features, and address dynamic link library file class features. The import table API features are function features selected from the import table libraries.
[0157] Specifically, the judging unit 530 may include (not shown in Figure 5): a program input unit, configured to input the unknown program into each of a plurality of generated training models and corresponding decision machines for judgment when multiple training models are included; and a weighted calculation unit, configured to perform a weighted calculation on the results of judging the unknown program by each training model and the corresponding decision machine, according to the preset weight of each class behavior feature classification in each training model. The output unit 540 is specifically configured to output the identification result of the unknown program according to the result of the weighted calculation.
[0158] Referring to Figure 6, which is a block diagram of the second embodiment of the machine learning-based program identification device of the present application. Compared with Figure 5, the program identification device further has the function of generating a model for identifying program types:
[0159] The apparatus includes: an input unit 610, an extraction unit 620, a classification unit 630, and a generation unit 640.
[0160] Wherein, the input unit 610 is used for inputting the extracted massive programs, and the massive programs include malicious programs and non-malicious programs;
[0161] an extraction unit 620, for extracting class behavior features from each program input;
[0162] A classification unit 630, configured to classify the extracted class behavior features;
[0163] The generating unit 640 is configured to use different decision-making machines to train different classes of behavioral features according to the result of the classification to generate a training model or a training model set for identifying malicious programs.
[0164] Specifically, the extraction unit 620 may include (not shown in Figure 6): a class behavior feature extraction unit, configured to analyze each program file and extract predefined class behavior features from it; and a vector attribute generation unit, configured to generate a feature vector, and the black/white attribute of each feature vector, according to the extracted class behavior features.
[0165] Specifically, the classification unit 630 is configured to determine the type of the compiler for compiling and generating the corresponding program according to the entry instruction sequence of the known compiler.
[0166] As can be seen from the description of the above embodiments, when the embodiment of the present application identifies the type of an unknown program based on class behavior features, the input unknown program is analyzed and its class behavior features are extracted, the class behavior features including import table library features and import table API features. The unknown program is roughly classified according to the extracted class behavior features, input into the generated training model and the corresponding decision machine for judgment according to the result of the rough classification, and the identification result of the unknown program is output. This application uses machine learning technology to extract and analyze the class behavior features of a large number of program samples and thereby obtain a model for identifying malicious programs based on class behavior features. Using this model can save a great deal of manpower and improve the efficiency of identifying malicious programs. Moreover, data mining over massive programs can reveal the inherent regularities of programs in terms of class behavior, so that malicious programs that have not yet appeared can be guarded against, making it difficult for malicious programs to evade detection and removal.
[0167] Those skilled in the art can clearly understand that the technology in the embodiments of the present application can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments, or in some parts of the embodiments, of the present application.