End-to-end deep learning malware classification method based on machine code byte streams

By employing an end-to-end deep learning method based on machine code byte streams and utilizing techniques such as one-dimensional convolution and self-attention mechanisms, this approach addresses the issues of unclear feature extraction and poor interpretability in existing malware detection technologies, achieving efficient and accurate malware classification.

CN115987580B · Active · Publication Date: 2026-06-19 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING UNIV OF POSTS & TELECOMM
Filing Date
2022-12-07
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing deep learning-based malware detection technologies suffer from problems such as unclear feature extraction rules, large model parameter count, high hardware resource requirements, poor interpretability, and strong reliance on expert knowledge, making it difficult to effectively detect modified malware.

Method used

An end-to-end deep learning approach based on machine code byte streams is adopted. Vertical and horizontal dimensionality reduction is performed through one-dimensional convolution and one-dimensional pooling. Combined with self-attention mechanism and residual linking module, the malware is classified using neighborhood and global information, and a multilayer perceptron is used for output.

Benefits of technology

It improves the accuracy and generalization ability of malware classification, reduces the reliance on expert knowledge, enhances the model's anti-interference ability, and can correctly detect modified malware, ensuring the safe and stable operation of computer systems.

Generated by Eureka AI based on patent content.


Abstract

This invention discloses an end-to-end deep learning-based malware classification method based on machine code byte streams. Taking the malware machine code byte stream as input, the method passes it through malware domain representation vector encoding, malware global information extraction, residual linking, and classification output modules to produce a classification result for the malware. This invention accurately represents the features of the original malware while minimizing the length of the feature vectors, thereby reducing computational load. It addresses the problem of strong reliance on expert knowledge in malware detection, improves the generalization ability of the malware classification model, reduces manpower consumption, and increases the speed of security emergency response. The invention provides technical details of the implementation, offers explainable classification judgments, and improves the accuracy and precision of malware classification. This ensures that even modified malware can be correctly detected and classified, guaranteeing the secure and stable operation of computer systems in response to the increasingly severe cybersecurity situation.

Description

Technical Field

[0001] This invention relates to the field of network security technology, and in particular to an end-to-end deep learning-based malware classification method based on machine code byte streams.

Background Technology

[0002] In an era dominated by cloud computing, big data, and paperless offices, sensitive data such as personal privacy data and company business data, along with computer systems like office automation systems, cloud document systems, and enterprise resource planning systems, have become indispensable parts of people's work and lives. However, with the development of computer technology, malicious attackers have developed a large number of new types of malware to steal data assets. The scale of cyberattacks is constantly expanding, the mutation rate of malware is gradually accelerating, and the difficulty of malware detection is significantly increasing. Malware poses a significant threat to cybersecurity and hinders normal work and production. Against this backdrop, the importance of cybersecurity is becoming increasingly prominent, and research on cybersecurity technology has become a major focus.

[0003] An Intrusion Detection System (IDS) is a network security technology used to detect and proactively defend against intrusions. A crucial element of an IDS is monitoring computer operation, detecting malware, and taking necessary security measures. Traditional malware intrusion detection methods primarily employ fingerprint-based approaches. This involves creating a fingerprint database using information such as malware hashes and abnormal behavior, and then matching each operation against this database to determine if the protected device has been attacked by malware. However, with the emergence of increasingly numerous and novel network attacks, this method suffers from drawbacks such as poor detection effectiveness, insufficient generalization ability, and a massive workload in maintaining the fingerprint database.

[0004] With the continuous development of artificial intelligence technology, security researchers are applying machine learning and deep learning methods to the field of malware detection. Existing machine learning-based malware detection methods are trained on feature vectors using optimization methods such as gradient descent, automatically fitting the training data and features and exhibiting a certain generalization ability when new samples appear. Although machine learning algorithms such as decision trees, random forests, and Naive Bayes avoid the workload of maintaining fingerprint databases, the selection of feature vectors still relies heavily on expert knowledge. When the feature vectors are poorly chosen, the model often performs poorly. Furthermore, because the model classifies based on pre-constructed feature vectors, false positives are frequent, resulting in the misclassification of legitimate software. Existing research indicates that applying a single machine learning algorithm is no longer sufficient to improve the performance of detection systems in complex data environments.

[0005] In recent years, deep learning technology has made breakthrough progress in fields such as computer vision and natural language processing, demonstrating stronger feature extraction and representation capabilities than traditional machine learning technology. Therefore, deep learning technology, such as convolutional neural networks and deep neural networks, is used as a means of feature selection or feature extraction in malware detection and classification systems in order to obtain features that better reflect the samples to be detected. The system performs pattern recognition on these features, which can not only effectively detect unknown attack types and variants of known attack types, but also reduce the dependence on expert knowledge in the feature extraction stage. Although deep learning can effectively extract high-dimensional vector representations of malware, it still has the following disadvantages: (1) The feature extraction rules are not clear, so simply modified or obfuscated variant samples can easily bypass detection or cause malware detectors to misclassify them; (2) The model has a large number of parameters, which places demands on the hardware resources of the device running it; (3) There is insufficient understanding of the operating principle of malware, the interpretability of deep learning models is poor, and it is difficult to provide security personnel with an explainable reason for a classification.

[0006] To address the increasingly severe cybersecurity situation, a new deep learning-based malware classification method needs to be developed. This method should not only improve the detection accuracy of malware, enhance the generalization ability of the model, and reduce reliance on expert knowledge, but also improve the anti-interference ability of the detection model, so that malware can be correctly detected and classified even after it has been modified, thus ensuring the safe and stable operation of computer systems.

Summary of the Invention

[0007] This invention addresses the shortcomings of existing deep learning-based malware detection technologies by proposing an end-to-end deep learning-based malware classification method based on machine code byte streams.

[0008] To achieve the above objectives, the present invention provides the following technical solution:

[0009] An end-to-end deep learning-based malware classification method based on machine code byte streams includes the following steps:

[0010] S1. Taking the machine code byte stream of the malware sample as input, it is processed by malware domain representation vector encoding to output malware neighborhood information; the malware domain representation vector encoding uses one-dimensional convolution to achieve vertical dimensionality reduction of the malware vector and one-dimensional pooling to achieve horizontal dimensionality reduction of the malware vector.

[0011] S2. The malware neighborhood information is processed by the malware global information extraction module to output the malware global information; the malware global information extraction module uses a self-attention mechanism to process the dimensionality-reduced malware representation vector.

[0012] S3. The malware neighborhood information and global information are input into the residual linking module, which outputs a malware representation vector.

[0013] S4. The malware representation vector is processed by the classification output module to output the malware classification result.
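The sizes quoted later in the description can be lined up against steps S1 to S4 as a quick consistency check. The following is a minimal sketch in Python using only those quoted sizes; the window width is an assumption, obtained by assuming the convolution uses non-overlapping windows so that 204,800 bytes collapse to 400 rows.

```python
# Shape bookkeeping for steps S1-S4, using only sizes quoted in the description.
INPUT_BYTES = 204_800      # truncated byte stream length
EMBED_DIM = 8              # embedding width, giving e of shape (204800, 8)
WINDOWS = 400              # rows of c_m, c_p, c_h and r_c
CHANNELS = 32              # columns of c_m, c_p and c_h
WORDS, WORD_DIM = 25, 16   # rearranged "sentence" fed to self-attention
CLASSES = 8                # malware categories

# Assumption: non-overlapping convolution windows (kernel size == stride).
window_bytes = INPUT_BYTES // WINDOWS


def trace_shapes():
    """Return the (name, shape) sequence of the intermediate results."""
    return [
        ("byte stream", (INPUT_BYTES,)),
        ("e (embedding)", (INPUT_BYTES, EMBED_DIM)),
        ("c_m, c_p, c_h", (WINDOWS, CHANNELS)),
        ("r_c (after pooling)", (WINDOWS, 1)),
        ("words (rearranged r_c)", (WORDS, WORD_DIM)),
        ("r_a (attention output)", (WINDOWS, 1)),
        ("r = r_c + r_a", (WINDOWS, 1)),
        ("class probabilities", (CLASSES, 1)),
    ]


for name, shape in trace_shapes():
    print(f"{name:26s} {shape}")
```

Under that assumption each convolution window would cover 512 bytes, and the 25 words of length 16 account for exactly the 400 entries of the pooled vector.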

[0014] Furthermore, before the byte stream is input in step S1, two flag bits of the machine code byte stream are modified to render the sample harmless. The stream is then truncated to its first 204,800 bytes, and this 204,800-byte machine code byte stream is used as the input and the basis for classification.

[0015] Furthermore, the process of encoding the malware domain representation vector in step S1 is as follows:

[0016] (1) First, the Embedding encoding result e is passed through two convolutional layers with independent parameters to extract high-dimensional features, yielding two (400*32) vectors c_m and c_p. The formula for convolution is shown below:

[0017]

[0018] Where x is the input vector, kernel is the convolution kernel, and i, j, p, q are the relative coordinates of the convolution;

[0019] (2) Applying the Hadamard operation to the two vectors c_m and c_p yields a low-dimensional vector c_h of size (400*32). The Hadamard operation is element-wise vector multiplication, and its calculation formula is as follows:

[0020] (c_m ⊙ c_p)_{i,j} = (c_m)_{i,j} · (c_p)_{i,j} (2)

[0021] (3) Then, one-dimensional pooling is performed to reduce the dimensionality, yielding a (400*1) intermediate malware vector representation r_c. The calculation formula is as follows:

[0022]

[0023] The Sigmoid activation function of the convolutional layer is calculated as follows:

[0024] Sigmoid(x) = 1 / (1 + e^{-x}) (4)

[0025] The ReLU activation function of the Hadamard operation is calculated as follows:

[0026] ReLU(x) = max(0, x) (5)

[0027] Where x is the input value.

[0028] Furthermore, the process of the malware global information extraction module in step S2 is as follows:

[0029] The malware representation vector r_c is rearranged into a (25*16) matrix and modeled as a sentence containing 25 words, each represented by a vector of length 16. A self-attention layer is then used to extract information that is linked over long distances within the malware, yielding a (400*1) malware summary vector representation r_a. The self-attention mechanism is calculated as follows:

[0030] Q = words · W^Q, K = words · W^K, V = words · W^V (6)

[0031] Where words denotes the matrix obtained by rearranging the malware representation vector r_c, consisting of 25 words, each of length 16; W^Q, W^K, and W^V are all parameter matrices obtained by training the model; and Q, K, and V are the matrices composed of the query, key, and value vectors defined in the self-attention mechanism for the 25 words, respectively.

[0032] Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (7), where d_k is the dimension of the key vectors.

[0033] An 8-head attention model is used, and the final output r_a is the representation vector containing the full-text information of the malware. Its formula is as follows:

[0034] MultiHead(Q, K, V) = concat(head_1, ..., head_8) W^0, with head_i = Attention(Q_i, K_i, V_i), Q_i = Q W_i^Q, K_i = K W_i^K, V_i = V W_i^V (8)

[0035] Where i denotes the i-th attention head; W_i^Q, W_i^K, and W_i^V are all parameter matrices obtained by training the model; Q_i, K_i, and V_i, which the i-th head computes from Q, K, and V, are the matrices composed of the query, key, and value vectors of the i-th attention head as defined in the self-attention mechanism; concat denotes concatenating the results computed by each head; and W^0 is a parameter matrix obtained by training the model.

[0036] Furthermore, in step S3, the residual connection module sums the malware neighborhood information r_c and the malware global information r_a to obtain the representation vector r of the malware. The residual connection is calculated as follows:

[0037] r = r_c + r_a (9).

[0038] Furthermore, in step S4, the classification output module inputs the malware representation vector into a multilayer perceptron and uses the Softmax function to calculate the probability that the test sample belongs to each malware category.

[0039] Furthermore, the structure of the multilayer perceptron includes two fully connected layers and one normalization layer. The Softmax activation function of the fully connected layer is calculated as follows:

[0040] Softmax(z_i) = e^{z_i} / Σ_{j=1}^{8} e^{z_j} (10)

[0041] Where i = 1, ..., 8, and z is the output of the last fully connected layer.

[0042] Furthermore, after the malware representation vector in step S4 is processed by the classification output module, the probability vector output of the malware belonging to each malware category is obtained. The category with the highest predicted probability is taken as the predicted category of the malware, and the prediction result is finally given.

[0043] The proposed end-to-end deep learning-based malware classification method based on machine code byte streams takes malware machine code byte streams as input and processes them sequentially through malware domain representation vector encoding, the malware global information extraction module, the residual linking module, and the classification output module to give the malware classification result.

[0044] On the other hand, the present invention also provides an end-to-end deep learning malware classification apparatus based on machine code byte streams, comprising the following modules for implementing the method described in any of the above:

[0045] Malware domain representation vector encoding takes the machine code byte stream of malware samples as input, uses one-dimensional convolution to achieve vertical dimensionality reduction of malware vectors, uses one-dimensional pooling to achieve horizontal dimensionality reduction of malware vectors, and finally outputs malware neighborhood information.

[0046] The malware global information extraction module takes malware neighborhood information as input, uses a self-attention mechanism to process the dimensionality-reduced malware representation vector, and outputs global malware information.

[0047] The residual linking module takes malware neighborhood information and global information as common inputs and outputs a malware representation vector.

[0048] The classification output module takes malware representation vectors as input and outputs malware classification results.

[0049] Thirdly, the present invention also provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store computer programs; and the processor, when executing the program stored in the memory, implements the steps of the method described in any of the above-mentioned embodiments.

[0050] Fourthly, the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method described in any of the preceding claims.

[0051] Fifthly, the present invention also provides a computer program product containing instructions, characterized in that, when run on a computer, it causes the computer to perform the steps of the method described in any of the preceding claims.

[0052] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0053] (1) Based on the program operation mechanism, the principle of temporal locality, the principle of spatial locality, and other technologies, this invention proposes a scheme to extract high-dimensional features of malware by using "one-dimensional convolution" technology to achieve vertical dimensionality reduction of malware vectors and "one-dimensional pooling" technology to achieve horizontal dimensionality reduction of malware vectors. This scheme accurately represents the original malware features while minimizing the length of the feature vectors, thereby reducing the amount of computation.

[0054] (2) A model structure for malware classification is implemented using basic building blocks such as one-dimensional convolution, the Hadamard product, one-dimensional pooling, the attention mechanism, and the multilayer perceptron. It uses both the neighborhood information of malware fragments and the full-text information of the malware to solve the problem of strong dependence on expert knowledge in malware detection, improve the generalization ability of the malware classification model, reduce manpower consumption, and improve the speed of security emergency response. The technical details of the implementation are given. At the same time, it addresses the insufficient understanding of the operating principle of malware in existing technologies and the poor interpretability of deep learning models. It provides security personnel with an explainable reason for each classification judgment and improves the accuracy of malware classification.

[0055] (3) This invention models the extraction of full-text information from malware as a summary extraction problem from the field of natural language processing. It uses a self-attention mechanism to process the dimensionality-reduced malware representation vector and extract the full-text information of the malware. This solves the problem of unclear feature extraction rules in existing technologies, which allows modified and obfuscated variant samples to evade detection or causes malware detectors to misclassify them. Malware can therefore be correctly detected and classified even after modification, ensuring the safe and stable operation of computer systems in response to the increasingly severe network security situation.

[0056] (4) By combining the malware neighborhood information and global information through the residual structure, the convolution module focuses on extracting malware neighborhood information, and the self-attention module focuses on extracting malware global information, which ultimately improves the model classification accuracy.

[0057] (5) Random sampling is used to address the sample imbalance among different types of malware and to accelerate the convergence of the malware classification model.

Attached Figure Description

[0058] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this invention. For those skilled in the art, other drawings can be obtained based on these drawings.

[0059] Figure 1 is an overview diagram of the end-to-end malware classification method based on machine code byte streams provided in the embodiments of the present invention.

[0060] Figure 2 is a detailed structural diagram of the end-to-end malware classification model based on machine code byte streams provided in this embodiment of the invention.

Detailed Implementation

[0061] This invention provides an end-to-end deep learning-based malware classification method based on machine code byte streams. As shown in Figure 1, it includes the following steps:

[0062] S1. Take the machine code byte stream of the malware sample as input, encode it through the malware domain representation vector, and output the malware neighborhood information;

[0063] S2. Malware neighborhood information is processed by the malware global information extraction module, which outputs global malware information.

[0064] S3. The malware neighborhood information and global information are input into the residual linking module, which outputs a malware representation vector.

[0065] S4. The malware representation vector is processed by the classification output module to output the malware classification result.

[0066] To better understand this technical solution, the method of the present invention will be described in detail below with reference to the accompanying drawings.

[0067] This method uses the BODMAS dataset of pre-labeled malware. The labels of this dataset are Virus, Worm, Trojan, Ransomware, InformationStealer, Downloader, Dropper, and Backdoor (as shown in Table 1).

[0068] Table 1 Malware Sample Dataset Information

[0069]
Malware category       Malware sample size
virus                  192
worm                   16697
trojan                 29972
ransomware             821
Information stealer    448
downloader             1031
dropper                715
backdoor               7331
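The counts in Table 1 are heavily skewed: trojan samples outnumber virus samples by roughly 156 to 1. The description states that random sampling is used to handle this imbalance but does not spell out the scheme. The sketch below shows one common reading, inverse-frequency weighted sampling with replacement; the function name `balanced_batch` and the batch size are hypothetical, and this should be treated as an illustration, not as the patented procedure.

```python
import random
from collections import Counter

# Sample counts copied from Table 1.
CLASS_COUNTS = {
    "virus": 192, "worm": 16697, "trojan": 29972, "ransomware": 821,
    "information stealer": 448, "downloader": 1031, "dropper": 715,
    "backdoor": 7331,
}


def balanced_batch(labels, batch_size, rng):
    """Draw sample indices so that every class is equally likely to appear.

    Assumption: "random sampling" is read here as inverse-frequency
    weighted sampling with replacement.
    """
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


# One label per sample, in dataset order.
labels = [name for name, n in CLASS_COUNTS.items() for _ in range(n)]
rng = random.Random(0)
batch = balanced_batch(labels, batch_size=8000, rng=rng)
drawn = Counter(labels[i] for i in batch)
for name in CLASS_COUNTS:
    print(f"{name:20s} {drawn[name]}")
```

With equal per-class probability, each of the eight categories is expected to contribute about one eighth of a batch regardless of its size in the dataset.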

[0070] The end-to-end malware classification method based on machine code byte streams of the present invention takes the malware machine code byte stream as input and processes it through malware domain representation vector encoding, the malware global information extraction module, the residual linking module, and the classification output module, thereby giving the probability of the malware belonging to each of the eight categories shown in Table 1.

[0071] Specifically, the end-to-end malware classification method based on machine code byte streams follows the process shown in Algorithm 1. It reads the machine code byte stream directly from the disk, modifies two flag bits in the byte stream to render the sample harmless (protecting our device from attack), and then truncates the stream to its first 204,800 bytes as the basis for classification. The 204,800-byte machine code byte stream is then input into the model shown in Figure 2, which outputs the probability vector of the malware belonging to each malware category. The category with the highest predicted probability is taken as the predicted category of the malware, and the predicted label is finally given.
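The length normalization in this step can be sketched as follows. The description does not say how samples shorter than 204,800 bytes are handled, so right-padding with zero bytes is an assumption here, as are the helper names `to_fixed_length` and `load_sample`. The flag-bit modification is not specified in the text and is deliberately left out.

```python
from pathlib import Path

MAX_BYTES = 204_800  # input length stated in the description
PAD_VALUE = 0        # assumption: short samples are right-padded with zero bytes


def to_fixed_length(raw: bytes, length: int = MAX_BYTES, pad: int = PAD_VALUE) -> list[int]:
    """Keep the first `length` bytes and right-pad shorter inputs."""
    head = raw[:length]
    return list(head) + [pad] * (length - len(head))


def load_sample(path: str) -> list[int]:
    """Read a sample's raw bytes from disk and normalise the length.

    The flag-bit modification that neutralises the sample is not specified
    in the text, so it is intentionally omitted from this sketch.
    """
    return to_fixed_length(Path(path).read_bytes())


print(to_fixed_length(b"MZ", length=4))
```

A dedicated padding token outside the 0-255 range is another common choice, since it keeps padding distinguishable from genuine zero bytes.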

[0072]

[0073] The detailed structure of the end-to-end malware classification model based on machine code byte streams is shown in Figure 2. Each module is described in detail below.

[0074] 1. Malware Domain Representation Vector Encoding

[0075] In step S1, before the byte stream is input, two flag bits of the machine code byte stream are modified to make the sample harmless. The stream is then truncated to its first 204,800 bytes as the basis for classification. The machine code byte stream of length 204,800 is used as input and, after Embedding, is encoded into a vector e of size (204800*8).

[0076] During program execution, two principles apply. The first is temporal locality: if an instruction in a program is executed, it is likely to be executed again soon afterward, and if a block of data is accessed, it is likely to be accessed again soon afterward. A typical cause of temporal locality is the presence of numerous loop operations within a program. The second is spatial locality: once a program accesses a memory location, nearby memory locations will likely be accessed soon afterward. That is, the addresses accessed by a program within a certain period tend to be concentrated within a certain range, because instructions are usually stored and executed sequentially, and data is generally stored in clusters in the form of vectors, arrays, or tables. Based on these principles, malware samples share common information within their neighborhood. Therefore, convolution and pooling operations can be used to extract information from the neighborhood of the sample under test while reducing the dimensionality of the feature vector, extracting from vector e a vector r_m that contains the neighborhood information of the malware sample.

[0077] The process of domain representation vector encoding for malware is as follows:

[0078] (1) First, the Embedding encoding result e is passed through two convolutional layers with independent parameters to extract high-dimensional features, yielding two (400*32) vectors c_m and c_p. The formula for convolution is shown below:

[0079]

[0080] Where x is the input vector, kernel is the convolution kernel, and i, j, p, q are the relative coordinates of the convolution;

[0081] (2) Applying the Hadamard operation to the two vectors c_m and c_p yields a low-dimensional vector c_h of size (400*32). The Hadamard operation is element-wise vector multiplication, and its calculation formula is as follows:

[0082] (c_m ⊙ c_p)_{i,j} = (c_m)_{i,j} · (c_p)_{i,j} (2)

[0083] (3) Then, one-dimensional pooling is performed to reduce the dimensionality, yielding a (400*1) intermediate malware vector representation r_c. The calculation formula is as follows:

[0084]

[0085] The detailed structure of the convolutional computation module used in the domain representation vector encoding of malware is shown in Table 2.

[0086] Table 2 Convolutional Module Structure

[0087]

[0088] The Sigmoid activation function of the convolutional layer is calculated as follows:

[0089] Sigmoid(x) = 1 / (1 + e^{-x}) (4)

[0090] The ReLU activation function of the Hadamard operation is calculated as follows:

[0091] ReLU(x) = max(0, x) (5)

[0092] Where x is the input value. Lower dimensionality can reduce the amount of subsequent computation.
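The encoding stage described in this module (two independent convolutions, a Hadamard product, then pooling) can be illustrated with a small standard-library sketch. Several details are assumptions because Table 2 and the convolution and pooling formulas are not reproduced in this text: the windows are taken as non-overlapping, the Sigmoid is applied to one branch as a gate, ReLU follows the product, and pooling is taken as a maximum across channels. The toy sizes and the helper names are hypothetical; the described model uses a (204800*8) input and 32 channels.

```python
import math
import random


def conv1d_windows(e, kernels, width):
    """Apply each kernel to non-overlapping windows of `width` rows of e.

    e:       list of rows, each a list of embedding values (length x dim)
    kernels: list of kernels, each width x dim (one kernel per channel)
    returns: (length // width) x len(kernels)
    """
    out = []
    for start in range(0, len(e) - width + 1, width):
        window = e[start:start + width]
        out.append([
            sum(window[p][q] * k[p][q]
                for p in range(width) for q in range(len(k[p])))
            for k in kernels
        ])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def encode_neighborhood(e, kernels_m, kernels_p, width):
    """Two independent convolutions, Hadamard product, then pooling."""
    c_m = conv1d_windows(e, kernels_m, width)
    c_p = conv1d_windows(e, kernels_p, width)
    # Assumption: Sigmoid gates one branch and ReLU follows the product.
    c_h = [[relu(m * sigmoid(p)) for m, p in zip(row_m, row_p)]
           for row_m, row_p in zip(c_m, c_p)]
    # Assumption: max pooling across channels, one value per window.
    return [max(row) for row in c_h]


def random_kernels(channels, width, dim, rng):
    """Hypothetical untrained kernels, only to make the sketch runnable."""
    return [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
            for _ in range(channels)]


rng = random.Random(0)
LENGTH, DIM, WIDTH, CHANNELS = 16, 2, 4, 3   # toy sizes
e = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(LENGTH)]
r_c = encode_neighborhood(e, random_kernels(CHANNELS, WIDTH, DIM, rng),
                          random_kernels(CHANNELS, WIDTH, DIM, rng), WIDTH)
print(len(r_c))  # one value per window
```

Each output entry summarizes one contiguous window of bytes, which is how the locality argument above translates into a shorter vector for the later stages.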

[0093] 2. Malware Global Information Extraction Module

[0094] In addition to local features, program runtime also exhibits global features (such as code within a code segment needing to access certain constants in the data segment). Therefore, we also need to extract the global features of malware, and self-attention mechanisms (Transformer / Attention) have demonstrated excellent full-text information retrieval capabilities in the field of natural language processing (especially in translation).

[0095] The malware representation vector r_c is rearranged into a (25*16) matrix and modeled as a sentence containing 25 words, with each word represented by a vector of length 16. This models the extraction of full-text information from malware as a summary extraction problem from the field of natural language processing. A self-attention layer extracts information that is linked over long distances within the malware, yielding a (400*1) malware summary vector representation r_a. The self-attention mechanism is calculated as follows:

[0096] Q = words · W^Q, K = words · W^K, V = words · W^V (6)

[0097] Where words denotes the matrix obtained by rearranging the malware representation vector r_c, consisting of 25 words, each of length 16; W^Q, W^K, and W^V are all parameter matrices obtained by training the model; and Q, K, and V are the matrices composed of the query, key, and value vectors defined in the self-attention mechanism for the 25 words, respectively.

[0098] Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (7), where d_k is the dimension of the key vectors.

[0099] This invention employs an 8-head attention model to counter malware development techniques such as obfuscation and encryption that bypass security detection. The final output r_a is the representation vector containing the full-text information of the malware. Its formula is as follows:

[0100] MultiHead(Q, K, V) = concat(head_1, ..., head_8) W^0, with head_i = Attention(Q_i, K_i, V_i), Q_i = Q W_i^Q, K_i = K W_i^K, V_i = V W_i^V (8)

[0101] Where i denotes the i-th attention head; W_i^Q, W_i^K, and W_i^V are all parameter matrices obtained by training the model; Q_i, K_i, and V_i, which the i-th head computes from Q, K, and V, are the matrices composed of the query, key, and value vectors of the i-th attention head as defined in the self-attention mechanism; concat denotes concatenating the results computed by each head; and W^0 is a parameter matrix obtained by training the model.

[0102] Using the matrix W^0, the multi-head attention result is transformed into a (25*16) summary information matrix, which is finally rearranged into the (400*1) malware summary information vector r_a.
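The rearrangement and 8-head attention described in this module can be sketched with the standard library alone. This is an illustration under assumptions: it uses standard scaled dot-product attention, gives each head 16 / 8 = 2 dimensions, and fills every parameter matrix with random values in place of trained ones. The helper names are hypothetical; a practical implementation would use a tensor library.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, list(zip(*k)))  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(words, w_q, w_k, w_v, heads, w_0):
    """heads: list of (W_i^Q, W_i^K, W_i^V) triples, one per attention head."""
    q, k, v = matmul(words, w_q), matmul(words, w_k), matmul(words, w_v)
    outputs = [attention(matmul(q, hq), matmul(k, hk), matmul(v, hv))
               for hq, hk, hv in heads]
    # concat: join the per-head results row by row, then project with W^0.
    concat = [[x for head in outputs for x in head[i]]
              for i in range(len(words))]
    return matmul(concat, w_0)


def rand_matrix(rows, cols, rng):
    """Hypothetical untrained parameters, only to make the sketch runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N_WORDS, D_MODEL, N_HEADS = 25, 16, 8
D_HEAD = D_MODEL // N_HEADS  # assumption: 2 dimensions per head

r_c = [rng.random() for _ in range(N_WORDS * D_MODEL)]  # stand-in for the (400*1) vector
words = [r_c[i * D_MODEL:(i + 1) * D_MODEL] for i in range(N_WORDS)]  # (25*16)
heads = [tuple(rand_matrix(D_MODEL, D_HEAD, rng) for _ in range(3))
         for _ in range(N_HEADS)]
summary = multi_head(words,
                     rand_matrix(D_MODEL, D_MODEL, rng),
                     rand_matrix(D_MODEL, D_MODEL, rng),
                     rand_matrix(D_MODEL, D_MODEL, rng),
                     heads,
                     rand_matrix(N_HEADS * D_HEAD, D_MODEL, rng))
r_a = [x for row in summary for x in row]  # rearranged back to (400*1)
print(len(summary), len(summary[0]), len(r_a))
```

Because every word attends to all 25 words, each entry of r_a can draw on any part of the sample, which is the long-distance information the convolution stage alone cannot capture.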

[0103] 3. Residual Connection Module

[0104] To avoid the vanishing or exploding gradient problem that occurs as the model becomes more complex, a residual link is added after the self-attention module. It sums the malware neighborhood information r_c and the malware global information r_a to obtain the representation vector r of the malware. The residual connection is calculated as follows:

[0105] r = r_c + r_a (9).

[0106] 4. Classification Output Module

[0107] The malware representation vector r is input into a multilayer perceptron (MLP layer), and the probability of the test sample belonging to each malware category is calculated using the Softmax function. The output vector has a size of (8*1), and each element lies in the range [0, 1].

[0108] The detailed structure of the MLP module is shown in Table 3.

[0109] Table 3 Detailed Structure of MLP Module

[0110]

[0111]

[0112] The Softmax activation function of the fully connected layer is calculated as follows:

[0113] Softmax(z_i) = e^{z_i} / Σ_{j=1}^{8} e^{z_j} (10)

[0114] Where i = 1, ..., 8, and z is the output of the last fully connected layer.

[0115] After the malware representation vector is processed by the classification output module, the probability vector output of the malware belonging to each malware category is obtained. The category with the highest predicted probability is taken as the predicted category of the malware, and the final prediction result is given.
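The residual sum and the classification output can be illustrated together. This is a minimal sketch under stated assumptions: Table 3 is not reproduced in this text, so the hidden width of 32 and the ReLU between the two fully connected layers are hypothetical, the normalization layer is omitted, and the weights are random, not trained. The label order follows Table 1.

```python
import math
import random

LABELS = ["virus", "worm", "trojan", "ransomware",
          "information stealer", "downloader", "dropper", "backdoor"]


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def classify(r_c, r_a, layer1, layer2):
    """Residual sum, two fully connected layers, Softmax, then arg-max."""
    r = [c + a for c, a in zip(r_c, r_a)]                # r = r_c + r_a
    hidden = [max(0.0, h) for h in linear(r, *layer1)]   # assumption: ReLU in between
    probs = softmax(linear(hidden, *layer2))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained weights, only to make the sketch runnable."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
HIDDEN = 32  # hypothetical hidden width
r_c = [rng.random() for _ in range(400)]
r_a = [rng.random() for _ in range(400)]
label, probs = classify(r_c, r_a,
                        random_layer(HIDDEN, 400, rng),
                        random_layer(len(LABELS), HIDDEN, rng))
print(label, round(sum(probs), 6))
```

The eight probabilities sum to 1, and the predicted category is simply the index of the largest one.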

[0116] This invention uses cross-entropy loss as the loss function and stochastic gradient descent as the optimizer, and addresses the sample imbalance problem through random sampling. After training for 10 epochs, the model achieved a macro-precision score of 91.6% and a macro-F1 score of 90.0%, surpassing current models such as DNN, MalConv, and 2D-CNN. This demonstrates the effectiveness and rationality of our end-to-end deep learning malware classifier.
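For readers unfamiliar with the two reported metrics, the sketch below computes macro-precision and macro-F1 as unweighted means of the per-class scores, which is the usual definition; the description does not state which variant was used, so that choice is an assumption. Scores with a zero denominator are counted as 0, and the toy labels are illustrative only.

```python
def macro_scores(y_true, y_pred, labels):
    """Return (macro-precision, macro-F1) as unweighted per-class means."""
    precisions, f1s = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        f1s.append(f1)
    return sum(precisions) / len(labels), sum(f1s) / len(labels)


# Toy example: one worm sample is mistaken for a trojan.
y_true = ["worm", "worm", "trojan", "virus"]
y_pred = ["worm", "trojan", "trojan", "virus"]
macro_p, macro_f1 = macro_scores(y_true, y_pred, ["virus", "worm", "trojan"])
print(round(macro_p, 4), round(macro_f1, 4))
```

Because every class contributes equally to a macro average, rare categories such as virus weigh as much as trojan, which makes these metrics suitable for an imbalanced dataset.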

[0117] In another aspect, the present invention also provides an end-to-end deep learning malware classification device based on machine code byte streams. As shown in Figure 1, it comprises the following modules for implementing the method described in any of the above:

[0118] Malware domain representation vector encoding takes the machine code byte stream of malware samples as input, uses one-dimensional convolution to achieve vertical dimensionality reduction of malware vectors, uses one-dimensional pooling to achieve horizontal dimensionality reduction of malware vectors, and finally outputs malware neighborhood information.

[0119] The malware global information extraction module takes malware neighborhood information as input, uses a self-attention mechanism to process the dimensionality-reduced malware representation vector, and outputs global malware information.

[0120] The residual linking module takes malware neighborhood information and global information as common inputs and outputs a malware representation vector.

[0121] The classification output module takes malware representation vectors as input and outputs malware classification results.

[0122] Corresponding to the end-to-end deep learning malware classification method based on machine code byte stream provided in the above embodiments of the present invention, the present invention also provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store computer programs; and the processor is used to implement the steps of the method described in any of the above embodiments when executing the program stored in the memory.

[0123] The communication bus of the above electronic device can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, it is represented by a single thick line in the diagram, but this does not mean that there is only one bus or one type of bus.

[0124] The communication interface is used for communication between the above electronic device and other devices.

[0125] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0126] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0127] A computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of any of the end-to-end deep learning malware classification methods based on machine code byte streams provided in the embodiments of the present invention.

[0128] In another embodiment of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to perform the steps of any of the end-to-end deep learning malware classification methods based on machine code byte streams provided in the embodiments of the present invention.

[0129] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).

[0130] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0131] The embodiments in this specification are described in a related manner. Similar or identical parts can be cross-referenced between embodiments, and each embodiment focuses on its differences from the others. In particular, the device, electronic device, computer-readable storage medium, and computer program product embodiments are substantially similar to the method embodiments, so they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.

[0132] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. However, these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An end-to-end deep learning malware classification method based on machine code byte streams, characterized in that it includes the following steps:
S1. Taking the machine code byte stream of the malware sample as input, malware domain representation vector encoding is performed to output malware neighborhood information; the malware domain representation vector encoding uses one-dimensional convolution to achieve vertical dimensionality reduction of the malware vector and one-dimensional pooling to achieve horizontal dimensionality reduction of the malware vector. The process of encoding the malware domain representation vector in step S1 is as follows:
(1) First, the embedding result e is passed through convolutional layers with mutually independent parameters to extract high-dimensional features, resulting in two 400*32 vectors $A$ and $B$. The convolution formula is: $\mathrm{conv}(x, \mathrm{kernel})_{i,j} = \sum_{p}\sum_{q} x_{i+p,\,j+q} \cdot \mathrm{kernel}_{p,q}$ (1), where x is the input vector, kernel is the convolution kernel, and i, j, p, q are the relative coordinates of the convolution. The embedding result e is obtained as follows: the first 204800 bytes of the sample are truncated as the classification basis, and this machine code byte stream of length 204800 is taken as input and encoded by the embedding into a 204800*8 vector e.
(2) The Hadamard operation is applied to $A$ and $B$ to obtain a low-dimensional 400*32 vector $H$. The Hadamard operation is element-wise multiplication, calculated as: $H_{i,j} = A_{i,j} \cdot B_{i,j}$ (2), where i is the row number and j is the column number.
(3) One-dimensional pooling is then performed for dimensionality reduction to obtain a 400*1 intermediate vector representation $M$ of the malware, calculated as: $M_{i} = \mathrm{pool}_{j}\,(H_{i,j})$ (3), where i is the row number and j is the column number.
The Sigmoid activation function of the convolutional layer is calculated as: $\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$ (4), where x is the input value. The ReLU activation function of the Hadamard operation is calculated as: $\mathrm{ReLU}(x) = \max(0, x)$ (5), where x is the input value.
S2. The malware neighborhood information is processed by the malware global information extraction module to output the malware global information; the malware global information extraction module uses a self-attention mechanism to process the dimensionality-reduced malware representation vector. The process of the malware global information extraction module in step S2 is as follows: the malware representation vector $M$ is rearranged into a 25*16 matrix, which is modeled as a sentence containing 25 words, each word represented by a vector of length 16. A self-attention layer is then used to extract long-range information within the malware, resulting in a 400*1 malware summary vector $G$. The self-attention mechanism is calculated as: $Q = \mathrm{words} \cdot W^{Q},\ K = \mathrm{words} \cdot W^{K},\ V = \mathrm{words} \cdot W^{V}$ (6), where words is the matrix obtained from the malware representation vector, consisting of 25 words each of length 16; $W^{Q}$, $W^{K}$, and $W^{V}$ are parameter matrices obtained by training the model; and Q, K, and V are the matrices composed of the query, key, and value vectors defined in the self-attention mechanism for the 25 words, respectively. $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (7), where $d_k$ is the length of the key vectors. An 8-head attention model is used to output the final result. The representation vector containing the global information of the malware is calculated as: $\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\,W^{O}$, with $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$ (8), where $\mathrm{head}_i$ represents the i-th attention head; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are parameter matrices obtained by training the model; the i-th attention head is calculated from Q, K, and V; $QW_i^{Q}$, $KW_i^{K}$, and $VW_i^{V}$ are the matrices composed of the query, key, and value vectors of the i-th attention head as defined in the self-attention mechanism; concat denotes concatenating the results of all heads; and $W^{O}$ is a parameter matrix obtained by training the model.
S3. The malware neighborhood information and the malware global information are input together into the residual connection module, which outputs a malware representation vector; the residual connection module sums the malware neighborhood information $M$ and the malware global information $G$ to obtain the representation vector $R$ of the malware. The residual connection is calculated as: $R = M + G$ (9).
S4. The malware representation vector is processed by the classification output module to output the malware classification result.

2. The end-to-end deep learning malware classification method based on machine code byte streams according to claim 1, characterized in that, before the byte stream is input in step S1, two flag bits of the machine code byte stream are modified to render the sample harmless.

3. The end-to-end deep learning malware classification method based on machine code byte streams according to claim 1, characterized in that, in step S4, the classification output module inputs the malware representation vector into a multilayer perceptron and uses the Softmax function to calculate the probability that the test sample belongs to each malware category.

4. The end-to-end deep learning malware classification method based on machine code byte streams according to claim 3, characterized in that the multilayer perceptron consists of two fully connected layers and one normalization layer, and the Softmax activation function of the fully connected layers is calculated as follows: $\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{8} e^{z_j}}$ (10), where i = 1, …, 8 and z is the output of the last fully connected layer.

5. The end-to-end deep learning malware classification method based on machine code byte streams according to claim 1, characterized in that, in step S4, after the malware representation vector is processed by the classification output module, a probability vector giving the probability that the malware belongs to each malware category is obtained; the category with the highest predicted probability is taken as the predicted category of the malware and is output as the prediction result.

6. An end-to-end deep learning malware classification device based on machine code byte streams, characterized in that it includes the following modules to implement the method described in any one of claims 1-5: a malware domain representation vector encoding module, which takes the machine code byte stream of malware samples as input, uses one-dimensional convolution to achieve vertical dimensionality reduction of malware vectors, uses one-dimensional pooling to achieve horizontal dimensionality reduction of malware vectors, and finally outputs malware neighborhood information; a malware global information extraction module, which takes malware neighborhood information as input, uses a self-attention mechanism to process the dimensionality-reduced malware representation vector, and outputs malware global information; a residual connection module, which takes malware neighborhood information and global information as joint inputs and outputs a malware representation vector; and a classification output module, which takes the malware representation vector as input and outputs the malware classification result.

7. An electronic device, characterized in that it includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus; the memory is used to store computer programs; and the processor, when executing the program stored in the memory, implements the steps of the method described in any one of claims 1-5.
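
As an illustration of steps S2 to S4 of claim 1 and the Softmax output of claims 3 to 5, the following sketch runs an 8-head self-attention layer over a 25*16 rearrangement of a 400-element vector, adds the residual, and classifies the result. It is a simplified sketch under stated assumptions, not the claimed implementation: the weights are random rather than trained, the per-head width of 2 (16 / 8) is assumed, the query, key, and value projections are folded into one matrix per head, and a single linear layer stands in for the multilayer perceptron. All names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of a [n][k] and b [k][m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(z):
    """Numerically stable softmax over a list of numbers."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(words, heads, w_o):
    """heads: list of (W_q, W_k, W_v); returns concat(head_1..head_h) W_o."""
    outs = [attention(matmul(words, wq), matmul(words, wk), matmul(words, wv))
            for wq, wk, wv in heads]
    concat = [sum((o[i] for o in outs), []) for i in range(len(words))]
    return matmul(concat, w_o)


rng = random.Random(0)


def rand(n, m):
    return [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(n)]


WORDS, DIM, HEADS, CLASSES = 25, 16, 8, 8
HEAD_DIM = DIM // HEADS  # assumed per-head width

# Stand-in for the 400*1 neighborhood vector produced by step S1.
neighborhood = [rng.random() for _ in range(WORDS * DIM)]

# S2: rearrange into 25 "words" of length 16 and apply 8-head self-attention.
words = [neighborhood[i * DIM:(i + 1) * DIM] for i in range(WORDS)]
heads = [(rand(DIM, HEAD_DIM), rand(DIM, HEAD_DIM), rand(DIM, HEAD_DIM))
         for _ in range(HEADS)]
attended = multi_head(words, heads, rand(DIM, DIM))
global_info = [v for row in attended for v in row]  # flatten back to 400 values

# S3: residual connection, element-wise sum of neighborhood and global information.
representation = [n + g for n, g in zip(neighborhood, global_info)]

# S4: one linear layer (stand-in for the MLP), then Softmax over 8 categories.
logits = matmul([representation], rand(WORDS * DIM, CLASSES))[0]
probs = softmax(logits)
predicted = max(range(CLASSES), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

Because the attention output is flattened back to the same 400 elements as its input, the residual sum needs no extra projection, and the predicted category is simply the index of the largest Softmax probability.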