A malicious sample data packet analysis method, device and electronic equipment
By screening, classifying, and grouping malicious sample data packets, and extracting common features using local sequence alignment algorithms, the problem of identifying unknown protocol threats has been solved, thereby improving the ability to detect network threats and enriching the feature database.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HARBIN ANTIY TECH
- Filing Date
- 2022-12-21
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies are insufficient to effectively identify and prevent malicious sample threats from unknown protocols, leading to increased internal network security risks and potential data breaches.
By filtering, classifying, and grouping malicious sample data packets, common features are extracted. Local sequence alignment algorithms such as the Smith-Waterman algorithm are then used for packet grouping and feature extraction to construct a malicious sample feature library.
It enhances the ability to identify malicious code threats from unknown protocols, enriches the malicious sample feature library, and improves the accuracy and efficiency of network threat detection.
Figure CN116192448B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of network security technology, and in particular to a method, apparatus and electronic device for analyzing malicious sample data packets.
Background Technology
[0002] With the development of network information technology, services and applications surrounding networks and data have experienced explosive growth. These diverse application scenarios have exposed an increasing number of cybersecurity risks and problems, generating widespread and profound impacts. The complexity and variability of the network environment, along with the vulnerability of information systems, determine that cybersecurity threats will continue to exist objectively.
[0003] Some organizations face the risk of intrusion through social engineering and exploitation of public network vulnerabilities. Because these organizations are generally large, malicious code spreading internally can cause widespread infection, potentially leading to the theft and leakage of core and confidential data. Failure to promptly detect and identify potential threats from malicious code using unknown protocols (also known as malicious samples) can result in significant losses.
Summary of the Invention
[0004] In view of this, embodiments of the present invention provide a method, apparatus and electronic device for analyzing malicious sample data packets, which facilitates the extraction of common features from malicious samples of unknown protocols, thereby improving the ability to identify and discover malicious code threats of unknown protocols.
[0005] In a first aspect, the malicious sample data packet analysis method provided by the embodiments of the present invention includes the following steps: acquiring multiple malicious sample data packets; filtering the multiple malicious sample data packets to obtain multiple target malicious sample data packets; classifying each target malicious sample data packet according to malicious sample feature information; for each category of target malicious sample data packets, performing message grouping on the target malicious sample data packets to obtain at least a first group of target malicious sample data packets, wherein the first group contains at least two similar target malicious sample data packets; and extracting common features from each group of malicious sample data packets.
[0006] Optionally, the common features include: invariant fields, variable fields, and/or constant fields.
[0007] Optionally, the malicious sample feature information includes: the virus family to which the malicious sample belongs;
[0008] The step of classifying each target malicious sample data packet according to the malicious sample feature information includes: classifying each target malicious sample data packet into multiple categories according to the virus family to which the malicious sample belongs, with each category containing multiple target malicious sample data;
[0009] Before message grouping is performed on the target malicious sample data packets, the method further includes: for each category of target malicious sample data packets, performing message classification on each target malicious sample data packet to determine the message type of the target malicious sample data packet;
[0010] The step of classifying each target malicious sample data packet to determine its packet type includes: for each virus family category, obtaining multiple malicious sample online data packets contained in that category; the target malicious sample data packet contains: malicious sample online data packets; for each malicious sample online data packet, extracting the byte type of the malicious sample online data packet; the byte type includes: character type and / or binary type; determining whether the number of character type bytes exceeds a predetermined threshold; if yes, then the malicious sample online data packet is determined to be a text-type packet; if no, then the malicious sample online data packet is determined to be a binary-type packet.
[0011] Optionally, classifying each target malicious sample data packet to determine its message type further includes: for each virus family category, obtaining multiple malicious sample instruction data packets contained in that category; the target malicious sample data packet contains: malicious sample instruction data packets; for each malicious sample instruction data packet, extracting the byte type of the malicious sample instruction data packet; the byte type includes: character type and / or binary type; determining whether the number of character type bytes exceeds a predetermined threshold; if yes, then the malicious sample instruction data packet is determined as a text message; if no, then the malicious sample instruction data packet is determined as a binary message.
[0012] Optionally, the malicious sample feature information further includes: communication data packet length and communication data packet transmission direction. The step of packet grouping each target malicious sample data packet to obtain at least a first group of target malicious sample data packets includes: after determining the packet type of the target malicious sample data packet, obtaining the preliminary grouping of the target malicious sample data packet.
[0013] For each target malicious sample data packet, that target malicious sample data packet is designated as the first target malicious sample data packet;
[0014] Based on the communication data packet length and transmission direction, a local sequence alignment algorithm is used to compare the first target malicious sample data packet with target malicious sample data packets of different message types to determine whether there is a message category that matches the first target malicious sample data packet. If there is, the first target malicious sample data packet is added to the preliminary group corresponding to the matching message category to form the first message group. If there is no matching message group, a new message group is created and the first target malicious sample data packet is added to the new message group to form the second message group.
[0015] Optionally, for binary messages, when determining whether there is a message category that matches the first target malicious sample data packet, the restrictions on whether the first target malicious sample data packet can be added to a given message category include: length restriction: the length difference between the two messages is not greater than a specified threshold; content restriction: the edit distance between the two messages is not greater than a specified threshold; format restriction: the two messages have the same number and order of text fields, binary fields and Unicode fields.
[0016] Optionally, for text-type messages, when determining whether there is a message category that matches the first target malicious sample data packet, the restrictions on whether the first target malicious sample data can be added to the message category of that type include: fragment restriction: the message is divided into different fragments using newline characters as delimiters, and the difference in the number of message sequences is not greater than a specified threshold; length restriction: the length difference between two corresponding fragments between messages does not exceed a specified threshold; content restriction: the edit distance between two corresponding fragments between two messages of the same type does not exceed a specified threshold; string restriction: for a predetermined fragment, a character set is divided using special symbols as boundaries, and the number of common strings between corresponding fragments between two messages of the same type meets a predetermined threshold condition.
[0017] Optionally, the step of packet grouping each target malicious sample data packet to obtain at least a first group of target malicious sample data packets further includes: when two target malicious sample data packets meet the restriction conditions for binary type message classification or text type message classification, then it is determined that the two target malicious sample data packets have similar segments; and the two target malicious sample data packets are assigned to the same message group.
[0018] The extraction of common features from each group of malicious sample data packets includes: extracting similar segments from two target malicious sample data packets in the group and identifying them as common features; or...
[0019] When two or more target malicious sample data meet the restrictions of binary message classification or text message classification, it is determined that the two or more target malicious sample data have similar segments; similar segments are extracted from the two or more target malicious sample data, and segments with smaller edit distances between similar segments are identified as common features.
[0020] In a second aspect, the malicious sample data packet analysis apparatus provided in this embodiment of the invention includes: an acquisition program module for acquiring multiple malicious sample data packets; a filtering program module for filtering the multiple malicious sample data packets to obtain multiple target malicious sample data packets; a preliminary classification program module for classifying each target malicious sample data packet according to malicious sample feature information; a grouping program module for performing, for each category of target malicious sample data packets, message grouping on the target malicious sample data packets to obtain at least a first group of target malicious sample data packets, wherein the first group contains at least two similar target malicious sample data packets; and a feature extraction program module for extracting common features from each group of malicious sample data packets.
[0021] In a third aspect, the electronic device provided in the embodiments of the present invention includes: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is disposed inside the space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is used to supply power to the circuits or devices of the electronic device; the memory is used to store executable program code; the processor reads the executable program code stored in the memory and runs the corresponding program in order to execute the malicious sample data packet analysis method described in any implementation of the first aspect.
[0022] The malicious sample data packet analysis method, apparatus, and electronic device provided in this invention reverse-engineer a large number of malicious samples of unknown network protocols, divide the malicious samples of each virus family into message groups, and extract common features from each group of malicious sample data packets. This facilitates the extraction of common features from malicious samples of unknown protocols, thereby identifying the malicious attributes of malicious samples of a certain virus family. By extracting common features, the feature library of malicious samples of various virus families can be enriched, thereby improving the ability to identify and detect malicious code threats of unknown protocols.
Attached Figure Description
[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a flowchart illustrating an embodiment of the malicious sample data packet analysis method of the present invention;
[0025] Figure 2 is a flowchart illustrating another embodiment of the malicious sample data packet analysis method of the present invention;
[0026] Figure 3 is a schematic partial order alignment diagram obtained based on the Smith-Waterman algorithm in this invention;
[0027] Figure 4 is an architectural diagram of an embodiment of the network asset fingerprint feature identification device of the present invention;
[0028] Figure 5 is a schematic diagram of the structure of an embodiment of the electronic device of the present invention.
Detailed Implementation
[0029] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0030] It should be understood that the described embodiments are merely some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0031] The malicious sample data packet analysis method provided in this embodiment collects a large number of communication data packets from threat samples (also called malicious samples, the term used in most of this document), preliminarily classifies them by virus family, further divides them through message grouping, and extracts the message structure and common features of the threat sample communication data packets through reverse analysis of unknown network protocols. This can enrich the threat sample feature library and effectively improve the ability to identify various types of unknown protocol malicious code programs lurking in network data traffic.
[0032] Example 1
[0033] Figure 1 is a flowchart illustrating an embodiment of the malicious sample data packet analysis method of the present invention. As shown in Figure 1, the malicious sample data packet analysis method provided in this embodiment of the invention can be applied to scenarios such as completing a malicious sample feature library and detecting and discovering network threats. It is used to analyze malicious sample data packets of unknown network protocols, thereby extracting corresponding common features that characterize malicious attributes.
[0034] It should be noted that this method can be embedded in a manufactured physical product in the form of software. When a user needs to analyze a malicious sample data packet, the method flow of this application can be triggered and reproduced.
[0035] As shown in Figure 1, the malicious sample data packet analysis method may include the following steps:
[0036] S110. Obtain multiple malicious sample data packets.
[0037] In this example, a Pcap packet capture tool such as Wireshark can be used to capture and collect Pcap packets from a large number of sites hosting active threat samples or from malicious sample libraries such as VirusShare.
[0038] Pcap is also a packet capture library, and many programs, such as the Wireshark mentioned above, use it for packet capture. Captured Pcap packets are generally stored in their own data format, which differs from the format of the original data stream.
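As an illustration of how captured packets could be read back for the later steps, the following is a minimal sketch that walks the records of a classic Pcap capture held in memory. It assumes the little-endian, microsecond-resolution variant of the file format; the function name `iter_pcap_records` is hypothetical and is not part of the described method.

```python
import struct

# Classic pcap global header: magic, version major/minor, thiszone,
# sigfigs, snaplen, link type (24 bytes, little-endian assumed here).
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")
# Per-record header: ts_sec, ts_usec, captured length, original length.
PCAP_RECORD_HEADER = struct.Struct("<IIII")


def iter_pcap_records(data: bytes):
    """Yield (timestamp_seconds, raw_frame_bytes) for each captured record."""
    magic = PCAP_GLOBAL_HEADER.unpack_from(data, 0)[0]
    if magic != 0xA1B2C3D4:
        raise ValueError("not a little-endian microsecond pcap capture")
    offset = PCAP_GLOBAL_HEADER.size
    while offset + PCAP_RECORD_HEADER.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = PCAP_RECORD_HEADER.unpack_from(data, offset)
        offset += PCAP_RECORD_HEADER.size
        frame = data[offset:offset + incl_len]
        offset += incl_len
        yield ts_sec + ts_usec / 1_000_000, frame
```

The sketch only separates records; decoding link-layer, IP and transport headers to reach the application payload is left out.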
[0039] S120. The multiple malicious sample data packets are filtered to obtain multiple target malicious sample data packets.
[0040] After acquiring a large number of Pcap data packets, the Pcap data packets are processed, such as through data cleaning and filtering, to extract a large number of online, heartbeat, and instruction data packets. As needed, the online data packets, heartbeat data packets, and instruction data packets can be used as the target malicious sample data packets.
[0041] It should be understood that malicious sample data packets often exist statically in malicious sample libraries and most of them do not have heartbeat data packets.
[0042] S130. Classify each target malicious sample data packet according to the malicious sample feature information.
[0043] The characteristics of malicious samples generally include: the virus family to which they belong, the length of the communication data packet, and the direction of communication data packet transmission. After obtaining the target malicious sample data packet, a preliminary simple classification of all target malicious sample data packets can be performed according to the virus family. The communication data packet length and transmission direction can be used as the basis for subsequent message grouping.
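The preliminary classification in step S130 can be pictured as a simple bucketing by family label. The sketch below is only illustrative: the `SamplePacket` record, its field names, and the `"c2s"`/`"s2c"` direction encoding are assumptions, and the family label is assumed to be supplied with each sample.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplePacket:
    family: str      # virus family label (assumed to be known for each sample)
    direction: str   # hypothetical encoding: "c2s" or "s2c"
    payload: bytes   # application-layer bytes of the communication packet

    @property
    def length(self) -> int:
        # Communication data packet length, kept for later message grouping.
        return len(self.payload)


def classify_by_family(packets):
    """Return {family: [packets]} as the preliminary classification."""
    categories = defaultdict(list)
    for pkt in packets:
        categories[pkt.family].append(pkt)
    return dict(categories)
```

Length and direction are carried along on each record so that the later grouping step can use them without another pass over the capture.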
[0044] S140. For each type of target malicious sample data packet, packet grouping is performed on each target malicious sample data packet to obtain at least a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets contains at least two similar target malicious sample data packets.
[0045] S150. Extract common features from each group of malicious sample data packets.
[0046] It should be understood that this common feature can be used to characterize the common malicious attributes of a specific type of malicious sample. When encountering a malicious sample with an unknown network protocol, by extracting the malicious attribute features it carries, the malicious sample with the unknown network protocol can be quickly detected and identified, thereby improving the ability to detect latent malicious samples.
[0047] To identify malicious samples of unknown network protocols, it is necessary to obtain the characteristics or malicious attributes of similar or identical samples. Therefore, in this embodiment, message grouping is performed on the target malicious sample data packets so that at least two similar target malicious sample data packets are placed in the same group. Reverse message parsing is then performed to extract their common features as the malicious attribute features of a specific type of malicious sample. This allows a rich malicious attribute feature library to be built, which enables the detection of malicious samples of an unknown network protocol and thereby improves the detection and identification of the various unknown protocol malicious code programs (i.e., malicious samples) hidden in network traffic.
[0048] It should be understood that the premise of detecting and identifying malicious samples is to perform reverse analysis of network traffic data on a large number of malicious samples and to uncover the common malicious attributes of a certain type of malicious samples. Therefore, in the reverse analysis process, classifying and grouping network traffic data according to virus families is an effective means of analyzing the malicious attribute characteristics of malicious samples with unknown network protocols.
[0049] Figure 2 is a schematic flowchart of a malicious sample data packet analysis method provided in another embodiment of this application. In some embodiments, the malicious sample feature information includes: the virus family to which the malicious sample belongs; each virus family has specific propagation characteristics, although viruses from different virus families can also share the same propagation characteristics. For example, one study found that most samples from a certain botnet virus family start new processes, and these new processes exhibit abnormal behavior, including setting delayed thread startup and attempting to connect to and communicate with a remote control terminal.
[0050] The step of classifying each target malicious sample data packet according to the malicious sample feature information (step S130) includes: classifying each target malicious sample data packet into multiple categories according to the virus family to which the malicious sample belongs, with each category containing multiple target malicious sample data;
[0051] In step S140, before message grouping is performed on the target malicious sample data packets, the method further includes: S135, for each category of target malicious sample data packets, message classification is performed on each target malicious sample data packet (since data packets are generally referred to as messages in the communication field, data packets are also referred to as messages in this document) to determine the message type of the target malicious sample data packet.
[0052] Among these steps, message classification, as the foundation for message format parsing and protocol state machine inference, is a crucial step (which can also be considered preliminary grouping). Specifically, for the acquired messages, a preliminary byte type extraction process is performed, and the result of this process is used for preliminary message grouping.
[0053] Specifically, as shown in Figure 2, the step of classifying each target malicious sample data packet and determining its message type includes: for each virus family category, obtaining multiple malicious sample online data packets contained in that category; the target malicious sample data packet contains: malicious sample online data packets; for each malicious sample online data packet, extracting the byte type of the malicious sample online data packet; the byte type includes: character type and / or binary type; determining whether the number of character type bytes exceeds a predetermined threshold; if yes, then the malicious sample online data packet is determined to be a text-type message; if no, then the malicious sample online data packet is determined to be a binary-type message.
[0054] Since the target malicious sample data packet also contains: instruction data packets and / or heartbeat packets, message classification of the instruction data packets and / or heartbeat packets is performed based on the same message classification method as the aforementioned online data packets. Specifically, the message classification of each target malicious sample data packet to determine the message type of the target malicious sample data packet (step S135) further includes: for each virus family category, obtaining multiple malicious sample instruction data packets contained in that category; the target malicious sample data packet contains: malicious sample instruction data packets;
[0055] In step S135, for each malicious sample instruction data packet, the byte type of the malicious sample instruction data packet is extracted; the byte type includes: character type and / or binary type; it is determined whether the number of character type bytes exceeds a predetermined threshold; if yes, the malicious sample instruction data packet is determined to be a text message; if no, the malicious sample instruction data packet is determined to be a binary message.
[0056] As can be seen from the message classification scheme for online data packets and instruction data packets described in this embodiment, during the message classification process, it is necessary to first determine the message type based on the type of bytes in the message, mainly based on the following two principles:
[0057] a. Analyze each byte in each message. If the byte is a printable character, or belongs to the newline sequence 0x0D0A that commonly appears in text, the byte is determined to be a character type byte. Otherwise, it is determined to be a binary byte.
[0058] b. After determining the type of each byte in each message, the entire message is analyzed. If the number of character bytes in the message exceeds a predetermined threshold, the message is determined to be a text message; otherwise, it is determined to be a binary message.
[0059] For principle b, the predetermined threshold can be set, for example, so that every byte in the message must be a character byte, or so that more than a predetermined number of bytes must be character bytes. A message that meets the chosen condition is determined to be a text message; otherwise, it is determined to be a binary message.
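Principles a and b can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes printable ASCII plus the CR and LF bytes count as character bytes, and it expresses the predetermined threshold as a ratio (`min_char_ratio`, a hypothetical parameter) rather than an absolute count.

```python
# Character bytes: printable ASCII plus the two bytes of the 0x0D0A newline.
CHARACTER_BYTES = set(range(0x20, 0x7F)) | {0x0D, 0x0A}


def is_character_byte(value: int) -> bool:
    """Principle a: decide the type of a single byte."""
    return value in CHARACTER_BYTES


def classify_message(payload: bytes, min_char_ratio: float = 0.9) -> str:
    """Principle b: label the whole message as 'text' or 'binary'."""
    if not payload:
        return "binary"  # assumption: an empty payload is treated as binary
    char_count = sum(1 for value in payload if is_character_byte(value))
    return "text" if char_count / len(payload) >= min_char_ratio else "binary"
```

Setting `min_char_ratio=1.0` corresponds to the stricter rule in which every byte must be a character byte.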
[0060] In this embodiment, by classifying different types of target malicious sample data packets according to virus families and based on message characteristics, message groups are initially determined, and one or more malicious samples are initially labeled with relatively specific category tags, which facilitates the accurate extraction of common features of a certain type of malicious sample in the future.
[0061] After the initial processing and classification of the individual messages, the next step is to group the messages (a more specific division than message classification). Continuing to refer to Figure 2, in some embodiments, the malicious sample feature information further includes: communication data packet length and communication data packet transmission direction. In step S140, the step of performing message grouping on the target malicious sample data packets to obtain at least a first group of target malicious sample data packets includes: after determining the message type of the target malicious sample data packet, obtaining a preliminary group of the target malicious sample data packet; for each target malicious sample data packet, determining that target malicious sample data packet as a first target malicious sample data packet; based on the communication data packet length and communication data packet transmission direction, comparing the first target malicious sample data packet with target malicious sample data packets of different message types using a local sequence alignment algorithm to determine whether there is a message category matching the first target malicious sample data packet; if there is, adding the first target malicious sample data packet to the preliminary group corresponding to the matching message category to form a first message group; if there is no such group, creating a new message group and adding the first target malicious sample data packet to the new message group to form a second message group.
[0062] The local sequence alignment algorithm is used to find highly similar segments in two message sequences in order to determine whether there is a message category that matches the first target malicious sample data packet, thereby determining the specific grouping of the target malicious sample data packet.
[0063] In some embodiments, the local sequence alignment algorithm adopts the Smith-Waterman local alignment algorithm, which traces back diagonally from a starting cell toward the upper left corner of the scoring matrix, and can also be figuratively called the partial order alignment algorithm.
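For reference, a minimal sketch of Smith-Waterman local alignment over two byte sequences is given here. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, since the description does not specify them, and a linear gap penalty is used for simplicity.

```python
def smith_waterman(a: bytes, b: bytes, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best_score, segment_of_a, segment_of_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell toward the upper left
    # until a zero cell is reached.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        current = score[i][j]
        if current == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            i, j = i - 1, j - 1
        elif current == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]
```

For example, `smith_waterman(b"xxHELLOyy", b"zzzHELLOw")` locates the shared segment `b"HELLO"` in both inputs.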
[0064] In this embodiment, the specific grouping strategy can be as follows: After obtaining the byte type of a message A, the Smith-Waterman partial order alignment algorithm is used to compare message A with the already divided message groups (initially the previous message categories, i.e., preliminary groups) to see if there is a matching message category. If so, the message is directly added to the corresponding message group; otherwise, a new message group is created. The number of iterations for the comparison can be set to achieve more accurate grouping.
[0065] Specifically, the Smith-Waterman partial order alignment algorithm is used to obtain partial order alignment maps for multiple messages. By comparing the partial order alignment maps, it can be determined whether two messages have the same segment. If so, they can be assigned to the same message group.
[0066] Binary messages can only be grouped with other binary messages, and text messages can only be grouped with other text messages. Similarly, only messages traveling in the same direction can be placed in the same group; messages traveling in different directions cannot.
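The grouping strategy can be sketched as a single pass over the messages. In this illustration the similarity test is passed in as a callable (`is_similar`, a hypothetical parameter) so that any alignment-based check can be plugged in; treating a match with any one existing member as sufficient is also an assumption, since the description does not state how a group is compared as a whole.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: bytes
    msg_type: str    # "text" or "binary", from message classification
    direction: str   # hypothetical encoding: "c2s" or "s2c"


@dataclass
class MessageGroup:
    msg_type: str
    direction: str
    members: list = field(default_factory=list)


def group_messages(messages, is_similar):
    """Add each message to the first compatible matching group, else open a new group."""
    groups = []
    for msg in messages:
        for group in groups:
            # Only groups with the same message type and direction are candidates.
            compatible = group.msg_type == msg.msg_type and group.direction == msg.direction
            if compatible and any(is_similar(msg.payload, m.payload) for m in group.members):
                group.members.append(msg)
                break
        else:
            groups.append(MessageGroup(msg.msg_type, msg.direction, [msg]))
    return groups
```

Repeating the pass with the groups from a previous run as the starting point would correspond to the configurable number of comparison iterations mentioned above.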
[0067] Specifically, for binary messages, when determining whether there is a message category that matches the first target malicious sample data packet, the restrictions on whether the first target malicious sample data packet can be added to a given message category include: 1. Length restriction: the length difference between the two messages is not greater than a specified threshold; 2. Content restriction: the edit distance between the two messages is not greater than a specified threshold; 3. Format restriction: the two messages have the same number and order of text fields, binary fields and Unicode fields.
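The three binary message restrictions can be combined into one predicate, as in the hedged sketch below. The thresholds are placeholder values, and `field_layout` is a simplified, hypothetical way of deriving the field sequence: it only distinguishes text runs from binary runs and does not detect Unicode fields.

```python
from itertools import groupby


def edit_distance(a, b) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def field_layout(payload: bytes, min_text_run: int = 4):
    """Return e.g. ['B', 'T', 'B']: alternating binary and text fields."""
    kinds = []
    for printable, run in groupby(payload, key=lambda v: 0x20 <= v < 0x7F):
        length = sum(1 for _ in run)
        # Short printable runs are treated as part of a binary field.
        kind = "T" if printable and length >= min_text_run else "B"
        if not kinds or kinds[-1] != kind:
            kinds.append(kind)
    return kinds


def binary_messages_match(a: bytes, b: bytes, max_len_diff: int = 4, max_edit: int = 8) -> bool:
    if abs(len(a) - len(b)) > max_len_diff:        # 1. length restriction
        return False
    if field_layout(a) != field_layout(b):         # 3. format restriction
        return False
    return edit_distance(a, b) <= max_edit         # 2. content restriction
```

The cheaper length and format checks run first so that the quadratic edit distance is only computed for plausible pairs.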
[0068] For text-type messages, when determining whether there is a message category that matches the first target malicious sample data packet, the restrictions on whether the first target malicious sample data packet can be added to a given message category include: 1. Fragment restriction: each message is divided into fragments using the newline sequence 0x0D0A as the delimiter, and the difference in the number of fragments between the two messages is not greater than a specified threshold; 2. Length restriction: the length difference between corresponding fragments of the two messages does not exceed a specified threshold; 3. Content restriction: the edit distance between corresponding fragments of the two messages does not exceed a specified threshold; 4. String restriction: for a predetermined fragment, the fragment is split into strings using special symbols as boundaries, and the number of strings common to the corresponding fragments of the two messages meets the predetermined threshold condition.
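The four text message restrictions can likewise be sketched as one predicate. All thresholds are placeholder values; taking the first fragment as the "predetermined fragment" and treating every non-alphanumeric byte as a "special symbol" are assumptions made only for this illustration.

```python
import re


def edit_distance(a, b) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def strings_in(fragment: bytes) -> set:
    """Split a fragment into strings at special (non-alphanumeric) symbols."""
    return {token for token in re.split(rb"[^A-Za-z0-9]+", fragment) if token}


def text_messages_match(a: bytes, b: bytes, max_fragment_diff: int = 1,
                        max_len_diff: int = 8, max_edit: int = 8,
                        min_common: int = 1) -> bool:
    frags_a, frags_b = a.split(b"\r\n"), b.split(b"\r\n")
    if abs(len(frags_a) - len(frags_b)) > max_fragment_diff:   # 1. fragment restriction
        return False
    for fa, fb in zip(frags_a, frags_b):
        if abs(len(fa) - len(fb)) > max_len_diff:              # 2. length restriction
            return False
        if edit_distance(fa, fb) > max_edit:                   # 3. content restriction
            return False
    common = strings_in(frags_a[0]) & strings_in(frags_b[0])   # 4. string restriction
    return len(common) >= min_common
```

Two HTTP-like requests that differ only in an identifier and a host name would pass all four checks with these placeholder thresholds.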
[0069] The step of packet grouping each target malicious sample data packet to obtain at least the first group of target malicious sample data packets further includes: when two target malicious sample data packets meet the restrictions of binary class message classification or text type message restrictions, then it is determined that the two target malicious sample data packets have similar segments; and the two target malicious sample data packets are assigned to the same message group.
[0070] The extraction of common features from each group of malicious sample data packets includes: extracting similar segments from the two target malicious sample data packets in the group and identifying them as common features.
[0071] Alternatively, in other embodiments, when two or more target malicious sample data meet the restrictions of binary message classification or text message classification, it is determined that the two or more target malicious sample data have similar segments; similar segments are extracted from the two or more target malicious sample data, and segments with smaller edit distances between similar segments are identified as common features.
[0072] Edit distance, also known as Levenshtein distance, is a quantitative measure of the difference between two strings (for example, strings of English characters). It is the minimum number of single-character edit operations (insertions, deletions, or substitutions) required to transform one string into the other. The smaller the edit distance, the smaller the difference between the two strings.
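As a minimal sketch of this definition, the edit distance can be computed with the standard dynamic-programming recurrence. The function name `levenshtein` is chosen here for illustration; the specification does not prescribe a particular implementation.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b.

    Works on any two sequences of comparable elements (str or bytes).
    """
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x
                current[j - 1] + 1,      # insert y
                previous[j - 1] + cost,  # substitute x with y, or keep on match
            ))
        previous = current
    return previous[-1]
```

Only two rows of the table are kept at a time, so memory use grows with the shorter input rather than with the product of both lengths.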
[0073] In this embodiment, for a given message type, the corresponding segments of two messages are judged to be similar only when all of the corresponding restrictions above are met simultaneously, in which case the two messages may belong to the same message group. If multiple messages are judged to be similar, the segment with the smallest edit distance is selected as the best matching feature of the message type, the message is added to the corresponding message type group, and the best matching feature is used as the invariant field in that group.
[0074] Referring again to Figure 2, in some embodiments, after the variable fields and invariant fields of multiple communication data packets are obtained based on the partial order alignment map, the method further includes: filtering the common characters in each communication data packet out of the variable fields and invariant fields to obtain the final constant fields, variable fields, and invariant fields.
[0075] Common characters are those frequently seen in normal samples, such as ASCII character set and encoding, GB2312 character set and encoding, Unicode character set and encoding, UTF-8 encoding, etc. Immutable character fields are immutable objects carried in communication data packets, while mutable character fields are mutable objects carried in communication data packets.
[0076] In this embodiment, after performing local sequence alignment operations on each communication data packet using the Smith-Waterman partial order alignment algorithm, the variable and invariant character fields of multiple communication data packets in the message group can be determined based on the obtained partial order alignment map. Thus, the message structure and common features of a certain type of malicious sample can be obtained from the character field, which can be used to enrich the malicious sample feature library of unknown network protocols.
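The alignment step can be illustrated with a minimal Smith-Waterman sketch that aligns two payloads and reads the invariant segments off the traceback. The scoring values (match 2, mismatch -1, linear gap -1) and the function names are assumptions made for illustration; the specification does not state the scoring scheme it uses.

```python
def smith_waterman(a: bytes, b: bytes, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best score, aligned pairs) of the best local alignment of a and b.

    Each aligned pair is (byte_from_a, byte_from_b); None marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    pairs = []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


def invariant_segments(pairs) -> list:
    """Collect runs of identical aligned bytes; the runs are candidate invariant fields."""
    segments, run = [], []
    for x, y in pairs:
        if x is not None and x == y:
            run.append(x)
        elif run:
            segments.append(bytes(run))
            run = []
    if run:
        segments.append(bytes(run))
    return segments
```

In this sketch, positions inside the aligned region where the two payloads differ or where a gap is opened correspond to the variable fields, while the identical runs correspond to the invariant fields.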
[0077] To help understand the technical solutions provided in the embodiments of the present invention and their effects, one embodiment is described in detail below with reference to Figure 3 and a specific example:
[0078] A cybersecurity department needs to detect and identify cyberattacks using unknown network protocols. To improve the detection rate, it needs to collect a large number of malicious samples (threat samples), perform reverse engineering on them, and extract the malicious attribute features of a specific type of malicious sample to build a detection feature library. The specific reverse engineering process is as follows:
[0079] Step S2: Data collection, which involves acquiring a large number of threat-active samples from a database containing a large number of malicious samples and collecting pcap data packets;
[0080] In step S2: the pcap data packets are processed, specifically by extracting a large number of data packets of specified categories such as online, heartbeat, and command from the pcap data packets;
[0081] Before step S2, step S1 may be included, which involves providing a large library of malicious sample pcap data packets.
[0082] Step S4: Perform preliminary classification of data packets, such as classifying all communication packets according to virus families; of course, further classification can be performed based on information such as communication packet length and transmission direction.
[0083] Before step S4, step S3 may be included: calculating basic information for each communication message, such as sample characteristic information like data packet length and transmission direction, to be used as a basis for subsequent grouping. Of course, this step can also be performed during subsequent grouping.
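Steps S3 and S4 can be sketched as computing a small key for each message and bucketing messages by that key. The `Packet` type, the direction labels, and the length bucket size of 16 bytes are hypothetical choices for illustration and do not come from the specification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    family: str      # virus family the sample belongs to
    direction: str   # assumed labels: "out" (sample to server) or "in" (server to sample)
    payload: bytes


def preliminary_classify(packets, length_bucket: int = 16) -> dict:
    """Group packets by (family, direction, coarse payload length)."""
    groups = defaultdict(list)
    for packet in packets:
        # Step S3: basic information used as the grouping basis.
        key = (packet.family, packet.direction, len(packet.payload) // length_bucket)
        # Step S4: preliminary classification by that information.
        groups[key].append(packet)
    return dict(groups)
```

Bucketing the length rather than using it exactly keeps messages with slightly different lengths in the same preliminary class, leaving finer separation to the later grouping step.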
[0084] Step S5, message classification, includes steps S5a and S5b. Depending on the type of the acquired target malicious sample data, only one step may be executed. Preliminary message classification is a crucial step, forming the basis for message format parsing and protocol state machine inference. Specifically, for the acquired messages, a preliminary byte type extraction process is performed, and the result is used for preliminary message grouping. During byte type extraction, the first step is to determine the message type based on the type of characters within the message, primarily based on the following two principles:
[0085] a. Analyze each different byte in each message. If it is a printable character, determine that the character is a character type; otherwise, determine that it is a binary byte.
[0086] b. After judging each byte of a message, analyze the message as a whole. If the message contains only character bytes (including the line-break sequence 0x0D0A), the message is determined to be a text message; otherwise, it is determined to be a binary message.
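Principles a and b can be sketched as below. Treating the bytes of Python's `string.printable` (which include carriage return and line feed) as the printable set, and treating an empty message as binary, are assumptions of this sketch.

```python
import string

# Printable ASCII bytes, including CR (0x0D) and LF (0x0A).
PRINTABLE = frozenset(string.printable.encode("ascii"))


def byte_types(message: bytes) -> list:
    """Principle a: label each byte as a character byte or a binary byte."""
    return ["char" if b in PRINTABLE else "binary" for b in message]


def message_type(message: bytes) -> str:
    """Principle b: a message made up only of character bytes is a text message."""
    if message and all(b in PRINTABLE for b in message):
        return "text"
    return "binary"
```

The claims describe a related variant in which the number of character-type bytes is compared against a predetermined threshold; that variant would replace the `all(...)` test with a count comparison.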
[0087] Steps S6 to S8: message grouping and feature extraction for each communication data packet. After the initial processing of each message, the next step is to group the messages. As shown in Figure 2, the specific grouping strategy is as follows: after the byte type of a message is obtained, the Smith-Waterman partial order alignment algorithm is used to compare the message with the previously divided message groups to determine whether there is a matching message category. If so, the message is added directly to the corresponding message group. A schematic partial order alignment diagram is shown in Figure 3. Based on this diagram, the identical segments in the two communication data packets are GTT and AC, which are treated as invariant fields. The segments that differ in the diagram can be treated as variable fields, and the filtered-out ASCII codes can be treated as constant fields.
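The grouping strategy can be sketched as a loop that compares each message with the existing groups and either joins a matching group or opens a new one. In this sketch the comparison is a stand-in: it uses `difflib.SequenceMatcher` from the Python standard library with an assumed similarity threshold of 0.8, not the Smith-Waterman comparison named in the text, and it compares against only the first member of each group.

```python
from difflib import SequenceMatcher


def is_similar(a: bytes, b: bytes, threshold: float = 0.8) -> bool:
    """Stand-in similarity test (assumption): ratio of matching bytes."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def group_messages(messages, similar=is_similar) -> list:
    """Assign each message to the first matching group, or start a new group."""
    groups = []
    for message in messages:
        for group in groups:
            if similar(message, group[0]):  # compare with the group's first member
                group.append(message)
                break
        else:
            groups.append([message])  # no matching category: create a new group
    return groups
```

Passing the comparison in as the `similar` parameter lets the same loop be reused with a different predicate, for example one built on a local alignment score.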
[0088] Step S9: Final common feature extraction. Through the above steps, the invariant field, variable field, and constant field of the instruction features under each family name can be obtained. From the invariant field, variable field, and constant field, the message structure and common features of the specific malicious sample message group can be obtained.
[0089] After obtaining the message structure and common features, the malicious attribute feature library can be enriched for the detection and identification of malicious samples of a specific category.
[0090] Therefore, the malicious sample data packet analysis method provided in this embodiment of the invention performs reverse message parsing on a large number of malicious samples of unknown network protocols, groups malicious samples of various virus families into packets, and extracts common features from each group of malicious sample data packets. This facilitates the extraction of common features from malicious samples of unknown protocols, thereby identifying the malicious attributes of malicious samples of a certain virus family. By extracting common features, the feature library of malicious samples of various virus families can be enriched, thereby improving the ability to identify and discover malicious code threats of unknown protocols.
[0091] Embodiment 2
[0092] Figure 4 is an architectural diagram of an embodiment of the malicious sample data packet analysis device of the present invention. As shown in Figure 4, the malicious sample data packet analysis device of this embodiment includes: an acquisition program module 210 for acquiring multiple malicious sample data packets; a filtering program module 220 for filtering the multiple malicious sample data packets to obtain multiple target malicious sample data packets; a preliminary classification program module 230 for classifying each target malicious sample data packet according to malicious sample feature information; a grouping program module 240 for packet grouping the target malicious sample data packets of each type to obtain at least a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets contains at least two similar target malicious sample data packets; and a feature extraction program module 260 for extracting common features from each group of malicious sample data packets.
[0093] The apparatus of this embodiment can be used to carry out the technical solution of the method embodiment shown in Figure 1. Its implementation principle and technical effect are similar and will not be described again here.
[0094] Furthermore, the apparatus of this embodiment can also be used to execute other embodiments of the corresponding malicious sample data packet analysis method in the aforementioned embodiment one. Where details are not described in detail, they can be referred to in relation to each other, and will not be repeated here.
[0095] Embodiment 3
[0096] Figure 5 is a schematic diagram of the structure of an embodiment of the electronic device of the present invention. Based on the method provided in Embodiment 1 and the device provided in Embodiment 2, the present invention also provides an electronic device which, as shown in Figure 5, can implement the steps of any of the embodiments described in Embodiment 1 of the present invention. The electronic device may include: a housing 41, a processor 42, a memory 43, a circuit board 44, and a power supply circuit 45. The circuit board 44 is disposed inside the space enclosed by the housing 41, and the processor 42 and the memory 43 are disposed on the circuit board 44. The power supply circuit 45 is used to supply power to the various circuits or devices of the electronic device. The memory 43 is used to store executable program code. The processor 42 runs the program corresponding to the executable program code by reading the executable program code stored in the memory 43, and is used to execute the malicious sample data packet analysis method described in any of the foregoing embodiments.
[0097] For details on the specific execution process of the above steps by the processor 42 and the steps further executed by the processor 42 by running executable program code, please refer to the description of Embodiment 1 of the present invention, which will not be repeated here.
[0098] The present invention also provides a computer-readable storage medium storing encrypted data as described in any of the embodiments in the first embodiment. The executable decryption program contained in the encrypted data can be executed by one or more processors to implement the malicious sample data packet analysis method as described in any of the preceding claims in the first embodiment.
[0099] In summary, compared to existing asset data identification schemes based on feature matching, the malicious sample data packet analysis method and apparatus provided in this invention reverse-engineer a large number of malicious samples of unknown network protocols, group malicious samples of various virus families into packets, and extract common features from each group of malicious sample data packets. This facilitates the extraction of common features from malicious samples of unknown protocols, thereby identifying the malicious attributes of malicious samples of a certain virus family. By extracting common features, the feature library of malicious samples of various virus families can be enriched, thereby improving the ability to identify and discover malicious code threats of unknown protocols.
[0100] Furthermore, during the reverse analysis process, a series of communication data packets are input, and based on packet classification, the packets are further grouped to infer the format structure information of each packet class, thereby obtaining the format model of the protocol packets. Common features are then extracted from this model, which can be used to accurately identify a specific type of malicious sample.
[0101] The aforementioned electronic devices exist in various forms, including but not limited to:
[0102] (1) Mobile communication devices: These devices are characterized by their mobile communication capabilities and primarily aim to provide voice and data communication. These terminals include: smartphones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, etc.
[0103] (2) Ultra-mobile personal computer devices: These devices fall under the category of personal computers, possessing computing and processing capabilities, and generally also have mobile internet access features. These terminals include PDAs, MIDs, and UMPCs, such as the iPad.
[0104] (3) Portable entertainment devices: These devices can display and play multimedia content. This category includes: audio and video players (such as iPods), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.
[0105] (4) Server: A device that provides computing services. The components of a server include a processor, hard disk, memory, system bus, etc. Servers are similar to general computer architectures, but because they need to provide highly reliable services, they have higher requirements in terms of processing power, stability, reliability, security, scalability, and manageability.
[0106] (5) Other electronic devices with data interaction functions.
[0107] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0108] The various embodiments in this specification are described in a related manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.
[0109] In particular, the device embodiment is basically similar to the method embodiment, so the description is relatively simple. For relevant details, please refer to the description of the method embodiment.
[0110] For ease of description, the above apparatus is described by dividing it into various functional units / modules. Of course, in implementing this invention, the functions of each unit / module can be implemented in one or more software and / or hardware.
[0111] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0112] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for analyzing malicious sample data packets, characterized in that it includes the following steps: acquiring multiple malicious sample data packets; filtering the multiple malicious sample data packets to obtain multiple target malicious sample data packets; classifying each target malicious sample data packet according to malicious sample feature information; for each type of target malicious sample data packet, packet grouping each target malicious sample data packet to obtain at least a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets contains at least two similar target malicious sample data packets; and extracting common features from each group of malicious sample data packets; wherein the malicious sample feature information includes: the virus family to which the malicious sample belongs; the step of classifying each target malicious sample data packet according to the malicious sample feature information includes: classifying each target malicious sample data packet into multiple categories according to the virus family to which the malicious sample belongs, with each category containing multiple target malicious sample data; before packet grouping each target malicious sample data packet, the method further includes: for each type of target malicious sample data packet, classifying each target malicious sample data packet to determine the packet type of the target malicious sample data packet; the step of classifying each target malicious sample data packet to determine the packet type of the target malicious sample data packet includes: for each virus family category, obtaining multiple malicious sample online data packets contained in that category, wherein the target malicious sample data packet contains the malicious sample online data packets; for each malicious sample online data packet, extracting the byte type of the malicious sample online data packet, wherein the byte type includes character type and/or binary type; determining whether the number of bytes of the character type exceeds a predetermined threshold; if so, identifying the malicious sample online data packet as a text-type message; if not, identifying the malicious sample online data packet as a binary message.
2. The malicious sample data packet analysis method according to claim 1, characterized in that, The common features include: immutable field fields, variable field fields, and / or constant field fields.
3. The malicious sample data packet analysis method according to claim 1, characterized in that, The step of classifying each target malicious sample data packet and determining the packet type of the target malicious sample data packet further includes: for each virus family category, obtaining multiple malicious sample instruction data packets contained in that category; the target malicious sample data packet contains: malicious sample instruction data packets; For each malicious sample instruction data packet, the byte type of the malicious sample instruction data packet is extracted; the byte type includes: character type and / or binary type; Determine whether the number of bytes in the character type exceeds a predetermined threshold; If so, the malicious sample instruction data packet is identified as a text-based message; If not, the malicious sample instruction data packet is identified as a binary message.
4. The malicious sample data packet analysis method according to claim 1 or 3, characterized in that, The malicious sample feature information also includes: communication data packet length and communication data packet transmission direction. The step of packet grouping each target malicious sample data packet to obtain at least the first group of target malicious sample data packets includes: after determining the packet type of the target malicious sample data packet, obtaining the preliminary grouping of the target malicious sample data packet. For each target malicious sample data packet, that target malicious sample data packet is designated as the first target malicious sample data packet; Based on the communication data packet length and the communication data packet transmission direction, the first target malicious sample data packet is compared with target malicious sample data packets of different message types through a local sequence comparison algorithm to determine whether there is a message category that matches the first target malicious sample data packet. If it exists, the first target malicious sample data packet is added to the preliminary group corresponding to the matched message category to form the first message group; If it does not exist, a new message packet is created, and the first target malicious sample data packet is added to the new message packet to form a second message packet.
5. The malicious sample data packet analysis method according to claim 4, characterized in that, For binary packets, when determining whether there is a packet category that matches the first target malicious sample data packet, the restrictions on whether the first target malicious sample data can be added to that type of packet category include: Length limit: The length distance between two messages shall not exceed the specified threshold; Content restrictions: The edit distance between two messages must not exceed the specified threshold; Format restrictions: Two messages of the same type have the same number and order of text fields, binary fields, and Unicode fields.
6. The malicious sample data packet analysis method according to claim 4, characterized in that, For text-type messages, when determining whether there is a message category that matches the first target malicious sample data packet, the limiting conditions for whether the first target malicious sample data can be added to that type of message category include: Fragment restriction: The message is divided into different fragments using newline characters as delimiters, and the difference in the number of message sequences does not exceed a specified threshold; Length limit: The length difference between two corresponding segments in a message shall not exceed a specified threshold; Content restrictions: The edit distance between two corresponding segments of two messages of the same type shall not exceed the specified threshold; String restriction: For a predetermined segment, a character set is divided by a special symbol. The number of common strings between corresponding segments of two messages of the same type meets the predetermined threshold condition.
7. The malicious sample data packet analysis method according to claim 1, characterized in that the step of packet grouping each target malicious sample data packet to obtain at least the first group of target malicious sample data packets further includes: when two target malicious sample data meet the restriction conditions for binary message classification or the restriction conditions for text-type message classification, determining that the two target malicious sample data have similar segments, and assigning the two target malicious sample data to the same message group; and the extraction of common features from each group of malicious sample data packets includes: extracting similar segments from the two target malicious sample data and identifying them as common features; or, when two or more target malicious sample data meet the restriction conditions for binary message classification or text-type message classification, determining that the two or more target malicious sample data have similar segments, extracting similar segments from the two or more target malicious sample data, and determining the segment with the smallest edit distance between the similar segments as the common feature.
8. A malicious sample data packet analysis device, characterized in that it includes: an acquisition module, used to acquire multiple malicious sample data packets; a filtering module, used to filter the multiple malicious sample data packets to obtain multiple target malicious sample data packets; an initial classification module, used to classify each target malicious sample data packet according to malicious sample feature information; a grouping module, used to perform packet grouping on each target malicious sample data packet for each type of target malicious sample data packet, so as to obtain at least a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets contains at least two similar target malicious sample data packets; and a feature extraction module, used to extract common features from each group of malicious sample data packets; wherein the malicious sample feature information includes: the virus family to which the malicious sample belongs; the step of classifying each target malicious sample data packet according to the malicious sample feature information includes: classifying each target malicious sample data packet into multiple categories according to the virus family to which the malicious sample belongs, with each category containing multiple target malicious sample data; before packet grouping each target malicious sample data packet, the method further includes: for each type of target malicious sample data packet, classifying each target malicious sample data packet to determine the packet type of the target malicious sample data packet; the step of classifying each target malicious sample data packet to determine the packet type of the target malicious sample data packet includes: for each virus family category, obtaining multiple malicious sample online data packets contained in that category, wherein the target malicious sample data packet contains the malicious sample online data packets; for each malicious sample online data packet, extracting the byte type of the malicious sample online data packet, wherein the byte type includes character type and/or binary type; determining whether the number of bytes of the character type exceeds a predetermined threshold; if so, identifying the malicious sample online data packet as a text-type message; if not, identifying the malicious sample online data packet as a binary message.
9. An electronic device, characterized in that, The electronic device includes: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is disposed inside the space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is used to supply power to various circuits or devices of the electronic device; the memory is used to store executable program code; the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, for executing the malicious sample data packet analysis method according to any one of claims 1 to 7.