Webshell detection method based on image analysis, terminal equipment and storage medium

An image-analysis-based detection technology, applied in image analysis, image data processing, instruments, etc. It addresses the problems that existing detection cannot adequately define behavioral characteristics or completely cover risk models, and therefore hits an insurmountable bottleneck in false negative rate and false positive rate. It achieves the effects of improving detection performance, avoiding manual maintenance, and avoiding linear growth.

Pending Publication Date: 2021-06-25
XIAMEN FUYUN INFORMATION TECH CO LTD
2 Cites 1 Cited by

AI-Extracted Technical Summary

Problems solved by technology

The essence of a regular expression is a finite state automaton, which cannot adequately define behavioral characteristics or completely cover risk models, so there is an insurmountable bottleneck in detectin...

Abstract

The invention relates to a Webshell detection method based on image analysis, terminal equipment and a storage medium. The method comprises the following steps: S1, collecting a plurality of normal and abnormal Webshell samples; S2, according to the collected Webshell samples and the Opcode sequence and the Opcode frequency of the Webshell to be detected, generating a corresponding RGB image; S3, calculating the distance between the RGB image of the to-be-tested Webshell and the RGB image of each Webshell sample, sorting the Webshell samples according to the sequence of the distances from small to large, selecting the first N Webshell samples from the sorting result, judging the types of the first N Webshell samples, and taking the type with the maximum corresponding number as the type of the to-be-tested Webshell. According to the method, the malicious behavior of the Webshell is represented by the Opcode feature, the two-dimensional gray level image is generated by using the Opcode sequence, and then the RGB image is synthesized by combining the gray level image generated by the Opcode frequency, so that the malicious behavior of the Webshell can be represented more completely, and the detection performance is improved.

Example Embodiment

[0030] Example 1:
[0031] An embodiment of the present invention provides a Webshell detection method based on image analysis. As shown in Figure 1, the method includes the following steps:
[0032] S1: Collect a plurality of Webshell samples of two different types, normal and abnormal.
[0033] S2: Generate a corresponding RGB image for each collected Webshell sample and for the Webshell to be tested, based on its Opcode sequence and Opcode frequency.
[0034] This embodiment is described using PHP code as an example. When the Zend virtual machine executes PHP code, it generally goes through the following four steps:
[0035] 1) Scanning (lexing): the PHP code is converted into language fragments (tokens);
[0036] 2) Parsing: the tokens are converted into simple, meaningful expressions;
[0037] 3) Compilation: the expressions are compiled into Opcodes;
[0038] 4) Execution: the Zend engine executes the Opcodes sequentially.
[0039] PHP runs on top of the Zend virtual machine, and its Opcode is a kind of bytecode. A PHP Opcode is an instruction recognized by the Zend virtual machine, that is, a numeric identifier for an operation that the virtual machine can perform. After PHP scans the human-readable code and breaks it into language fragments, these fragments are combined into small expressions in the parsing phase, and the expressions are then compiled into Opcodes. An Opcode is a command that the Zend virtual machine executes as a single unit; the Opcodes are held in an array and executed one by one. From the above analysis, the Opcode can be regarded as the lowest-level unit of PHP: executing PHP code amounts to translating it into Opcode handler functions and executing them in order. An Opcode can be understood as consisting of two operands (OP1, OP2), a return value, and a handler function.
[0040] A PHP Webshell is in fact just a piece of PHP code that ultimately performs certain operations, such as executing commands or listing and viewing files, usually hidden behind encryption and obfuscation. If a traditional static detection method is applied directly to the written source code, many factors can affect the detection result. Detecting the Webshell at the Opcode level, however, bypasses these obfuscation methods.
[0041] For Opcode extraction, VLD (Vulcan Logic Dumper) is used. VLD is a PHP extension that outputs the intermediate code (execution units) that the Zend engine generates from a PHP script. The most common one-sentence Trojan is: , and its Opcode output is shown in Figure 2.
[0042] An Opcode reflects the smallest operation of the code, and it can be inferred that the Opcodes in a sequence are related to one another. If there is a connection between the N-th Opcode and the (N+1)-th Opcode, the entire code can be treated as a sequence, and this overall characteristic is referred to as the global feature. In this embodiment, the Opcode sequence of a simple piece of PHP code is observed; the process of extracting the Opcode sequence from a PHP code file is shown in Figure 3. As Figure 3 shows, the Opcodes of the PHP file are obtained after processing, and the Opcode sequence can be represented as: FETCH_R, FETCH_DIM_R, ECHO, ECHO, RETURN.
[0043] The extracted Opcode sequence is mapped to a two-dimensional matrix. The rows and columns of the matrix represent the Opcodes contained in the Webshell file, and the value of each element is the total number of times that the two Opcodes corresponding to the element's row and column appear consecutively in the Opcode sequence. The generated two-dimensional matrix is shown in Table 1.
[0044] Table 1
[0045]
[0046]
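The mapping described in paragraph [0043] can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `opcode_pair_matrix`, the alphabetical ordering of Opcodes along both axes, and the convention that the row is the first Opcode of an adjacent pair are all assumptions.

```python
from collections import Counter

def opcode_pair_matrix(opcodes):
    """Count how often each ordered pair of opcodes appears consecutively.

    Returns (labels, matrix), where matrix[i][j] is the number of times
    labels[i] is immediately followed by labels[j] in the sequence.
    """
    labels = sorted(set(opcodes))            # assumed axis order: alphabetical
    index = {op: i for i, op in enumerate(labels)}
    pairs = Counter(zip(opcodes, opcodes[1:]))
    matrix = [[0] * len(labels) for _ in labels]
    for (first, second), count in pairs.items():
        matrix[index[first]][index[second]] = count
    return labels, matrix

# Opcode sequence from the example in paragraph [0042]
sequence = ["FETCH_R", "FETCH_DIM_R", "ECHO", "ECHO", "RETURN"]
labels, matrix = opcode_pair_matrix(sequence)
```

For this short sequence every adjacent pair occurs once, so no element of the matrix exceeds 1.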
[0047] Since the Opcode sequence used in this embodiment is extremely short, the maximum element in the generated matrix is 1. When the Webshell sample file is relatively large, the Opcode sequence relationships change and the values of the corresponding matrix elements become larger. When the sample file is large enough, the values of some elements may exceed 255, the largest value a grayscale image can represent. In that case the element values must be normalized and mapped into the grayscale range, with an upper bound of 255.
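A minimal sketch of this range mapping, assuming a simple linear rescale that is applied only when the largest count exceeds 255. The function name `scale_to_gray` and the rounding rule are hypothetical; the text does not specify the exact normalization.

```python
def scale_to_gray(matrix, max_gray=255):
    """Linearly rescale non-negative counts so the largest maps to max_gray.

    A matrix whose maximum already fits in the grayscale range is
    returned unchanged (as a copy).
    """
    peak = max(max(row) for row in matrix)
    if peak <= max_gray:
        return [row[:] for row in matrix]
    return [[round(v * max_gray / peak) for v in row] for row in matrix]

counts = [[0, 510], [102, 4]]                # hypothetical pair counts
gray = scale_to_gray(counts)
```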
[0048] Because of the maximum value in the matrix, the values of most matrix elements fall in a very small range. Therefore, feature enhancement is applied to each element of the matrix based on its own value.
[0049]
[0050] Here, $\mathrm{VAL}_{\mathrm{enhance}}(OS_i \mid x_j)$ denotes the value of the element after enhancement, $\alpha$ denotes the enhancement coefficient, $\mathrm{VAL}(OS_i \mid x_j)$ denotes the value of the element before enhancement, and $\max$ denotes the maximum value.
[0051] After the value of each element in the matrix has been enhanced, the two-dimensional matrix is converted into a sequence grayscale image, in which the position of each processed matrix element is the position of the corresponding pixel.
[0052] To characterize more features of the Webshell, the grayscale image is further converted into an RGB image, and the Webshell is characterized by the RGB image.
[0053] An RGB image (color image) can be regarded as a three-dimensional matrix. For example, 400 * 400 * 3 represents three two-dimensional matrices of 400 rows and 400 columns, called the R, G, and B components, and each two-dimensional matrix can be viewed as the grayscale values of the corresponding component. Each pixel in the RGB image is composed of the grayscale values of the corresponding pixel in the R, G, and B components, written (R, G, B); the grayscale value referred to here is the monochrome intensity under the respective component. A grayscale image is called a single-channel image, and an RGB image can be called a three-channel image. By combining three two-dimensional grayscale images that characterize Webshell features into one RGB image, more features can be carried.
[0054] In this embodiment, the grayscale image is not converted directly into an RGB image. Instead, more meaningful information is used to populate the red, green, and blue channels of the RGB image. The blue channel is filled with the Opcode sequence grayscale image, and the green and red channels are each filled with an Opcode frequency grayscale image. Specifically, the Webshell file is divided into two parts, and the frequency of each Opcode is calculated separately for each part. The frequency grayscale image of the first part fills the red channel, and the frequency grayscale image of the second part fills the green channel.
[0055] In the Opcode frequency grayscale image, each Opcode that appears in the code file corresponds to one pixel, and the grayscale value of that pixel is the number of times the corresponding Opcode occurs in the PHP code.
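The two frequency maps can be sketched as below. This is an illustration under stated assumptions: the split point (the midpoint of the Opcode sequence), the one-row layout with one pixel per Opcode, and the name `frequency_rows` are not given in the text.

```python
from collections import Counter

def frequency_rows(opcodes, vocabulary):
    """Split the sequence into two halves and count each opcode per half.

    Returns two lists aligned with `vocabulary`; each entry is the pixel
    value (occurrence count) for one opcode.
    """
    middle = len(opcodes) // 2               # assumed split point: midpoint
    rows = []
    for half in (opcodes[:middle], opcodes[middle:]):
        counts = Counter(half)
        rows.append([counts[op] for op in vocabulary])
    return rows[0], rows[1]

sequence = ["FETCH_R", "FETCH_DIM_R", "ECHO", "ECHO", "RETURN"]
vocabulary = sorted(set(sequence))
red_row, green_row = frequency_rows(sequence, vocabulary)
```

Here `red_row` holds the counts for the first part of the file and `green_row` the counts for the second part, matching the channel assignment in paragraph [0054].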
[0056] As with the grayscale image generated from the Opcode sequence, the frequency grayscale image may contain a few pixels with large grayscale values while most values are extremely low, so the corresponding normalization and feature enhancement are applied to it as well.
[0057] Because the sizes of the red, green, and blue channels may be inconsistent, the three channels must also be normalized to the same size. The red, green, and blue channels are then combined into the RGB image. The RGB image synthesis process for a Webshell is shown in Figure 4.
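A sketch of the channel merge follows. The text does not say how the three channels are brought to the same size; zero-padding to the largest height and width is an assumption made here for illustration, and the helper names `pad_channel` and `merge_rgb` are hypothetical.

```python
def pad_channel(channel, height, width):
    """Zero-pad a 2-D list of gray values to height x width."""
    padded = [row + [0] * (width - len(row)) for row in channel]
    padded += [[0] * width for _ in range(height - len(padded))]
    return padded

def merge_rgb(red, green, blue):
    """Bring three gray channels to one size and merge them pixel by pixel."""
    channels = (red, green, blue)
    height = max(len(c) for c in channels)
    width = max(len(c[0]) for c in channels)
    r, g, b = (pad_channel(c, height, width) for c in channels)
    return [[(r[y][x], g[y][x], b[y][x]) for x in range(width)]
            for y in range(height)]

red = [[0, 1, 1, 0]]        # frequency map, first part of the file
green = [[2, 0, 0, 1]]      # frequency map, second part of the file
blue = [[1, 0], [0, 3]]     # hypothetical sequence map
image = merge_rgb(red, green, blue)
```

Each entry of `image` is an (R, G, B) tuple, so the result has the height x width x 3 layout described in paragraph [0053].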
[0058] Because the RGB images are relatively large, the training process becomes very slow, so their dimensionality is reduced with the principal component analysis (PCA) algorithm before training. PCA is an unsupervised dimensionality reduction method that maps n-dimensional features to k dimensions, where n > k; the mapped k dimensions are called the principal components of the image. The main idea is to find the directions of greatest variance in the high-dimensional space and to map the data into a lower-dimensional subspace. The process is mainly:
[0059] 1) Standardize the raw data;
[0060] 2) Construct the sample covariance matrix;
[0061] 3) Calculate the eigenvalues and eigenvectors of the covariance matrix;
[0062] 4) Select the eigenvectors corresponding to the k largest eigenvalues, where k is the dimension of the new feature space;
[0063] 5) Build a mapping matrix W from the first k eigenvectors;
[0064] 6) Convert the d-dimensional input data set X into the new k-dimensional feature subspace through the mapping matrix W.
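The six steps above can be sketched with NumPy as follows. This is a minimal illustration, not the patent's code: the function name `pca_fit_transform`, the random input data, and the handling of zero-variance features are assumptions.

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project the rows of X (samples x features) onto the top-k principal axes."""
    # 1) Standardize each feature (zero mean, unit variance where possible).
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # avoid division by zero
    Z = (X - mean) / std
    # 2) Sample covariance matrix of the standardized data.
    cov = np.cov(Z, rowvar=False)
    # 3) Eigenvalues and eigenvectors (eigh suits symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4) Indices of the k largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:k]
    # 5) Mapping matrix W with one eigenvector per column.
    W = eigenvectors[:, order]
    # 6) Project the data into the k-dimensional subspace.
    return Z @ W, W

rng = np.random.default_rng(0)
X = rng.random((20, 6))                       # 20 hypothetical flattened images
reduced, W = pca_fit_transform(X, k=2)
```

In practice a library routine such as scikit-learn's `PCA` would usually replace the hand-written eigendecomposition; the explicit version is shown only to mirror the listed steps.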
[0065] In this embodiment, the value of k is set to 50, meaning that the RGB image is reduced to 50 pixels after processing with the PCA algorithm.
[0066] S3: Calculate the distance between the RGB image of the Webshell to be tested and the RGB image of each Webshell sample, sort the Webshell samples by distance from smallest to largest, select the first N Webshell samples from the sorted result and determine their types, and take the type with the largest count as the type of the Webshell to be tested.
[0067] In this embodiment, before the distance is calculated, the RGB image is first converted into vector form, and the resulting feature vector is normalized.
[0068] The normalization method used in this embodiment is a linear transformation, expressed by the formula:
[0069]
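The formula for paragraph [0068] is not reproduced in this text. A common linear transformation is min-max scaling, x' = (x - min) / (max - min), and the sketch below assumes that form; the function name `min_max_normalize` and the treatment of constant vectors are hypothetical.

```python
def min_max_normalize(vector):
    """Linearly map the values of a vector onto [0, 1] (assumed min-max form)."""
    low, high = min(vector), max(vector)
    if high == low:                           # constant vector: nothing to scale
        return [0.0 for _ in vector]
    return [(v - low) / (high - low) for v in vector]

normalized = min_max_normalize([10, 20, 40])
```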
[0070] When the distance between RGB images is calculated, since the input is an RGB image, the calculation is in fact a distance calculation over 50 * 50 * 3 pixel values. As the image distance metric, this embodiment uses a pixel-wise method: it calculates the L1 distance between the vectors $I_1$ and $I_2$ corresponding to the two images, $d_1(I_1, I_2) = \sum_p \lvert I_1^p - I_2^p \rvert$, where the summation runs over all pixels in the image. The Webshell image distance calculation process is shown in Figure 5.
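Step S3 combines this L1 distance with a majority vote over the N nearest samples. A minimal sketch follows, assuming short hypothetical feature vectors and the label strings "normal" and "webshell"; the function names `l1_distance` and `classify` are illustrative.

```python
from collections import Counter

def l1_distance(a, b):
    """Sum of absolute differences between corresponding vector entries."""
    return sum(abs(x - y) for x, y in zip(a, b))

def classify(query, samples, n):
    """Label `query` by majority vote among its n nearest samples.

    `samples` is a list of (vector, label) pairs.
    """
    ranked = sorted(samples, key=lambda s: l1_distance(query, s[0]))
    votes = Counter(label for _, label in ranked[:n])
    return votes.most_common(1)[0][0]

# Hypothetical normalized feature vectors with their known types
samples = [
    ([0.0, 0.1, 0.0], "normal"),
    ([0.1, 0.0, 0.1], "normal"),
    ([0.9, 1.0, 0.8], "webshell"),
    ([1.0, 0.9, 1.0], "webshell"),
    ([0.8, 0.8, 0.9], "webshell"),
]
result = classify([0.85, 0.9, 0.9], samples, n=3)
```

Choosing an odd N avoids tied votes when there are only two types.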
[0071] In this embodiment of the present invention, the Opcode sequence is used as the representation of malicious features. It is mapped to a two-dimensional matrix and converted into a two-dimensional grayscale image, which is then combined with the grayscale images derived from the Opcode frequency to generate a three-channel RGB feature image that characterizes the Webshell. A dedicated data processing method reduces the complexity of the data, and a classification detection algorithm based on image distance is then used, which achieves the purpose of classification when compared with conventional machine learning algorithms.

[0072] Example 2:
[0073] The present invention also provides an image-analysis-based Webshell detection terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the above method embodiment of the present invention are implemented.