Malware Detection Method Based on Multi-Granularity Network Traffic Feature Fusion

By constructing a multi-granularity network traffic heterogeneous graph and a multi-task learning model, the problem of insufficient accuracy in existing Android malware detection is solved, achieving high-precision malware identification and family discrimination, which is applicable to mobile devices.

CN121485979B · Active · Publication Date: 2026-06-30 · NORTHEAST DIANLI UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NORTHEAST DIANLI UNIVERSITY
Filing Date
2025-10-31
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing Android malware detection methods cannot effectively construct multi-granularity behavior representations, resulting in insufficient detection accuracy, inability to handle complex behavior strategies and variant samples, and high computational resource requirements, making them difficult to deploy in resource-constrained environments.

Method used

We construct a multi-granularity heterogeneous network traffic graph, integrate packet, flow, and session granular features, and utilize heterogeneous graph neural networks and multi-task learning models to generate a unified network traffic behavior vector, thereby achieving high-precision malware identification.

Benefits of technology

It improves the ability to model complex communication structures and potential malicious linkage patterns, enhances the detection accuracy and adaptability of malware, is suitable for lightweight scenarios such as mobile devices, and has non-intrusive detection characteristics.

✦ Generated by Eureka AI based on patent content.

Figure CN121485979B_ABST

Abstract

This invention relates to the fields of network security and software detection technology, specifically providing a malware detection method based on multi-granularity network traffic feature fusion. The method includes: capturing network traffic data of a target application, analyzing packet-level granularity features, flow-level granularity features, and session-level granularity features, constructing a heterogeneous graph of the target application, extracting deep semantic representations of various nodes in the heterogeneous graph, and modeling the interaction relationships between multi-granularity nodes. This process gradually completes the transition from session-level heterogeneous subgraph embedding learning to whole-graph feature fusion, thereby generating a unified network traffic behavior vector that can characterize the network behavior of the target application. A multi-task classification model is then constructed, and the classification model is trained for end-to-end malware identification. This invention, by constructing a multi-granularity heterogeneous graph based on network traffic, solves the problems of traditional solutions, such as lack of cross-granularity design, difficulty in depicting the structural relationships of network behavior, and neglect of the dynamic changes of behavior during multiple interactions.

Description

Technical Field

[0001] This invention belongs to the field of network security and software detection technology, and in particular relates to a malware detection method based on multi-granularity network traffic feature fusion.

Background Technology

[0002] With the rapid development of the mobile internet and the Android ecosystem, the number of mobile applications installed and running on smart terminals continues to grow. However, alongside this growth, the security threats facing the Android platform are becoming increasingly severe. Statistics show that the number of newly added malicious application samples globally has increased exponentially in recent years, with their harmful behaviors encompassing various types such as user privacy theft, malicious advertising, financial fraud, and remote control, causing serious economic losses and data security risks to users.

[0003] Against this backdrop, achieving accurate detection of Android malware has become a research hotspot in the field of cybersecurity. Compared to traditional static analysis-based detection methods (such as decompiling to extract APIs, permissions, and code features), dynamic behavior modeling methods have attracted widespread attention due to their greater robustness against obfuscation and variants. Among these methods, network traffic, as one of the most critical dynamic data carriers during application operation, carries multi-dimensional information such as the application's communication behavior, data interaction patterns, protocol usage habits, and access target structure. It can realistically reflect the application's operational behavior and potential intentions, making it a key data source in dynamic detection.

[0004] Compared to methods like system calls or API tracing, network traffic-based detection methods offer advantages such as no need to modify the APK, no reliance on root privileges, low data acquisition costs, and wide coverage, making them a crucial research direction in current mobile security detection. Especially noteworthy is that network behavior data can be passively collected and has a clear structure, providing a solid foundation for analysis, all while ensuring user privacy and without interfering with application execution logic. However, the key technical challenges remain: how to extract effective patterns from complex application traffic behavior, construct multi-granularity behavioral representations, and improve detection accuracy.

[0005] Currently, the main technical approaches for detecting Android malware based on network traffic are as follows:

[0006] First, there are machine learning methods based on statistical features. These methods typically use traffic feature extraction tools (such as CICFlowMeter) to convert network data into flow-level structured samples, extracting statistical indicators such as packet count, flow duration, uplink and downlink byte rates, and protocol ratios. These indicators are then used as input vectors for traditional machine learning models (such as SVM, Random Forest, and KNN) to classify malicious / benign behavior. Typical works include evaluation using the CICAndMal2017 dataset. While these methods are simple to implement and computationally inexpensive, they have significant drawbacks: they rely on manually designed statistical feature templates and cannot automatically learn the deep structure of application behavior; they ignore the contextual dependencies and interaction semantics between data packets; and their model generalization ability is limited, with insufficient detection capabilities for complex behavioral strategies and variant samples.

[0007] Second, deep learning methods based on time series modeling: These methods typically use packet sequences or flow sequences from network traffic as time series inputs to models such as LSTM, GRU, and Transformer to learn the temporal characteristics of behavioral changes. These methods improve the model's ability to perceive behavioral evolution patterns and alleviate the problem of manually constructing features to some extent. However, their main drawbacks are: most focus only on linear temporal structures, neglecting the structural dependencies between communication behaviors (such as frequent target access, same-domain access clustering, etc.); they fail to construct more expressive structured graph models to reflect the interaction logic between behaviors; and they also fail to consider the need for fusion modeling of behavioral patterns at different levels of granularity, resulting in limited expressive power of the final behavioral features.

[0008] Third, structural modeling methods based on static or single-granularity graph models: Some recent studies attempt to model network traffic as graph structures, for example using IP addresses as nodes and connections as edges, and extracting graph topology indicators as behavioral representations. However, these methods often rely on static graph construction and fail to consider the evolution of network behavior and multi-granularity coupling mechanisms. Furthermore, the semantic abstraction capabilities of nodes and edges in the graph are weak, making it difficult to accurately express the fine-grained characteristics and protocol structures between data packets in actual communication behavior. In addition, most graph structure models rely on large neural networks that require significant computational resources during training and inference, making them difficult to deploy in resource-constrained environments such as edge terminals or mobile devices.

[0009] In summary, existing technical solutions mostly focus on single-granularity feature representation, extracting only simple statistical information or sequence patterns, and lack a mechanism for modeling behavioral dependencies between different granularities. In real-world scenarios, malicious behavior often exhibits a complex cross-granularity evolution process: covert triggering at the packet level, abnormal flows at the session level, and overall behavioral patterns at the application level. Existing methods cannot construct a unified modeling framework that integrates multiple behavioral granularities, so they lack an understanding of behavioral context, which harms detection accuracy and generalization ability. While some methods introduce time-series modeling (such as LSTM and GRU), they remain at the level of linear input and fail to characterize the structural relationships in network behavior, such as communication graph features like multi-host coordinated access and phased command distribution. Without a structural graph with semantic nodes and multiple edge relationships, the model cannot explicitly express potential attack chain structures and behavioral coupling patterns, which limits its ability to detect advanced attack strategies. Furthermore, existing solutions typically model only static snapshots and ignore the dynamic changes in behavior across multiple interactions. For example, some malware employs delayed triggering and phased behavior breakdown to evade detection. If the trajectory and feature updates of behavior over time cannot be captured, the model is prone to misjudgment or missed detection, making it difficult to cope with covert attacks. In addition, most existing methods only support binary classification tasks and cannot simultaneously achieve fine-grained identification of malware types and families. Their detection functions are therefore limited and cannot support subsequent source tracing analysis and precise defense.

Summary of the Invention

[0010] In view of this, the present invention aims to provide a malware detection method based on multi-granularity network traffic feature fusion. By constructing a multi-granularity network traffic heterogeneous graph, fusing multi-session network behavior features, and combining an end-to-end multi-task learning model, it achieves high-precision malware identification and multi-category family discrimination for Android applications, thus solving the problem of insufficient detection accuracy in traditional malware detection methods.

[0011] To achieve the above objectives, the technical solution created by this invention is implemented as follows:

[0012] This invention provides a malware detection method based on multi-granularity network traffic feature fusion, comprising:

[0013] Collect network traffic data of the target application and extract packet granularity features, flow granularity features and session granularity features from the network traffic data;

[0014] Construct a heterogeneous graph, which includes: packet nodes, flow nodes, session nodes, and heterogeneous edges between packet nodes, flow nodes, and session nodes;

[0015] The feature vectors of packet nodes, flow nodes and session nodes are initialized based on packet granularity features, flow granularity features and session granularity features; the heterogeneous graph neural network model is used to perform semantic representation of the session-level heterogeneous subgraphs in the heterogeneous graph, generate session embedding vectors, and fuse multiple session embedding vectors of the same target application through temporal modeling and attention mechanism to generate a unified network traffic behavior vector of the target application.

[0016] A multi-task classification model is constructed, which takes a unified network traffic behavior vector as input to perform malicious / benign classification, malware type classification, and malware family classification.

[0017] Preferably, data packets with the same communication five-tuple are grouped into a flow, and flows within a single run of the target application or within a set time window are aggregated into a session.

[0018] Preferably, the packet granularity features include at least one of the following: payload, packet length, protocol type, transmission direction, and timestamp;

[0019] The flow granularity features include at least one of the following: flow duration, total number of bytes in the flow, average packet length in the flow, number of packets in the flow, byte rate of the flow, and number of forward / backward packets in the flow;

[0020] The session granularity features include at least one of the following: session duration, number of intra-session flows, and number of target domains.

[0021] Preferably, the heterogeneous graph $G$ is:

[0022] $G = (V, E)$;

[0023] $V = V_p \cup V_f \cup V_s$;

[0024] $E = E_{pf} \cup E_{fs}$;

[0025] where $V$ represents the set of nodes, $V_p$ represents the set of data packet nodes, $V_f$ represents the set of flow nodes, $V_s$ represents the set of session nodes, and $E$ represents the set of heterogeneous edges; the subscript $p$ indicates a data packet, $f$ indicates a flow, and $s$ indicates a session.

[0026] Preferably, feature vector initialization is performed on packet nodes, flow nodes, and session nodes based on packet granularity features, flow granularity features, and session granularity features, including:

[0027] The payload of the data packet is transformed into a fixed-length byte sequence or a fixed-size image, and then input into a visual Transformer model to extract high-level semantic features. These high-level semantic features are then used as feature vectors for the data packet nodes.

[0028] Use flow granularity features as feature vectors for flow nodes;

[0029] Use session granular features as the feature vectors of session nodes.

[0030] Preferably, initializing feature vectors for packet nodes, flow nodes, and session nodes based on packet granularity features, flow granularity features, and session granularity features further includes:

[0031] Trainable type embedding vectors are assigned to packet nodes, flow nodes, and session nodes respectively. The feature vector of each node is concatenated with the type embedding vector of its node type to obtain a concatenated vector. The concatenated vectors are then projected onto a semantic embedding space of the same dimension using a multilayer perceptron.

[0032] Preferably, a heterogeneous graph neural network model is used to perform semantic representation of the session-level heterogeneous subgraphs in the heterogeneous graph, generating session embedding vectors, including:

[0033] Define multiple meta-paths for the session-level heterogeneous subgraph. Meta-paths are used to describe the semantic relationships between different types of nodes.

[0034] For each node on any meta-path, the features of the neighboring nodes on that meta-path are aggregated through a graph attention mechanism to generate a specific embedding representation of that node under each meta-path;

[0035] A semantic-level attention mechanism is used to fuse the specific embedding representations of nodes in different meta-paths, and each node obtains a final embedding.

[0036] Aggregate the final embeddings of each node to generate the session embedding vector of the session-level heterogeneous subgraph.

[0037] Preferably, multiple session embedding vectors of the same target application are fused through temporal modeling and attention mechanisms to generate a unified network traffic behavior vector at the application level, including: determining a sequence of session embedding vectors based on the session-level heterogeneous subgraphs contained in the target application;

[0038] The session embedding vector sequence is input into a time-series network based on gated recurrent units to generate a hidden state sequence;

[0039] The attention intensity of each session-level heterogeneous subgraph to the global representation of the target application is calculated using an attention mechanism, and the unified network traffic behavior vector of the target application is calculated based on the attention intensity of each session-level heterogeneous subgraph to the global representation of the target application.
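
As a minimal sketch of the attention step described above, the following Python fragment weights a sequence of per-session hidden states and sums them into a single application-level vector. It assumes the hidden states have already been produced by the gated recurrent network (which is omitted here), and it assumes a dot-product score against a query vector followed by a softmax; the text does not specify the scoring function, so `attention_fuse` and `query` are hypothetical names chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(hidden_states, query):
    """Fuse per-session hidden states into one behavior vector.

    Assumption: the attention intensity of session i is softmax(query . h_i).
    In a trained model `query` would be a learned parameter.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    fused = [sum(w * state[d] for w, state in zip(weights, hidden_states))
             for d in range(dim)]
    return fused, weights

# Three sessions with 2-dimensional hidden states (illustrative values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
behavior_vector, attention = attention_fuse(hidden, query=[0.0, 0.0])
```

With an all-zero query every session receives the same weight, so the result is the plain mean of the hidden states; a non-zero query shifts weight toward the sessions whose hidden states align with it.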

[0040] Preferably, the multi-task classification model includes: a task module for binary classification of target applications as malicious software and benign software, a task module for type classification of malicious software, and a task module for family identification of malicious software.

[0041] Preferably, the loss function $L$ of the multi-task classification model is:

[0042] $L = \lambda_1 L_{bin} + \lambda_2 L_{type} + \lambda_3 L_{family}$;

[0043] where $L_{bin}$ represents the cross-entropy loss of the binary classification of malware and benign software, $L_{type}$ represents the loss of malware type classification, $L_{family}$ represents the loss of malware family identification, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of $L_{bin}$, $L_{type}$, and $L_{family}$, respectively.
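
The weighted combination of the three task losses can be illustrated for a single sample as follows. This is a sketch under stated assumptions: the text names cross-entropy only for the binary task, so using cross-entropy for the type and family tasks as well is an assumption, and the weight values in `lam` are illustrative placeholders, not values from the patent.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probs[label])

def multitask_loss(p_bin, y_bin, p_type, y_type, p_family, y_family,
                   lam=(1.0, 0.5, 0.5)):
    """Weighted sum of the binary, type, and family losses.

    Assumption: all three tasks use cross-entropy; `lam` holds the three
    weighting coefficients (illustrative values).
    """
    loss_bin = cross_entropy(p_bin, y_bin)
    loss_type = cross_entropy(p_type, y_type)
    loss_family = cross_entropy(p_family, y_family)
    return lam[0] * loss_bin + lam[1] * loss_type + lam[2] * loss_family

# One sample: predicted probabilities and true labels for each task.
loss = multitask_loss(
    p_bin=[0.2, 0.8], y_bin=1,
    p_type=[0.7, 0.2, 0.1], y_type=0,
    p_family=[0.25, 0.25, 0.5], y_family=2,
)
```

Because the three heads share one input vector, minimizing this single scalar trains the shared feature extractor against all three objectives at once.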

[0044] Compared with the prior art, the present invention can achieve the following beneficial effects:

[0045] This invention creatively maps three types of behavioral granularity information—packet, flow, and session—of network traffic generated during application operation to multiple types of nodes. It defines various heterogeneous edge relationships, including "packet-flow," "packet-session," and "flow-session," constructing a multi-granularity heterogeneous graph that reflects the structure and granularity of application communication behavior, overcoming the limitations of traditional single-granularity analysis methods. This multi-granularity heterogeneous graph structure not only includes independent features of each granularity but also explicitly expresses the multi-level semantic associations of "packet-flow-session" through heterogeneous edge relationships, constructing a complete representation system from micro-communication details to macro-behavioral patterns, providing a rich structural foundation for subsequent behavioral feature extraction.

[0046] This invention employs a heterogeneous graph neural network to perform structured modeling of heterogeneous graphs at the session granularity. It utilizes a meta-path aggregation strategy and a multi-head attention mechanism to achieve deep extraction of node semantics and interaction patterns, significantly enhancing the model's ability to model complex communication structures and potential malicious linkage patterns. The node information aggregation strategy based on the multi-head attention mechanism enables the model to adaptively focus on key connection patterns under different semantic relationships.

[0047] To address the common evasion strategies employed by malware, such as delayed triggering and behavior segmentation, this invention introduces a cross-session modeling network based on GRU and attention mechanisms. This network merges multiple session graph representations within the same application into a unified global network behavior feature vector, characterizing the application's behavioral evolution trajectory throughout its runtime lifecycle and improving the detection capability for distributed or phased malicious behaviors. This mechanism not only considers the temporal dependencies between sessions but also dynamically identifies key session segments that contribute most to the overall behavior judgment through attention weights, effectively solving the problem of insufficient long-term dependency modeling in traditional methods.

[0048] This invention integrates heterogeneous graph construction, feature learning, and multi-task recognition into a unified end-to-end training framework. Through joint optimization of multi-task loss functions, it achieves an end-to-end joint training mechanism, ensuring deep collaboration between feature extraction and task objectives. This design avoids the disconnect between feature design and classification tasks present in traditional pipelined approaches, ensuring that the learned feature representations have stronger discriminative power.

[0049] This invention models behavior entirely from passively collected network traffic data. It does not require root privileges, does not require decompiling APKs, and does not interfere with the normal operation of applications. Detection is therefore non-intrusive, which makes the invention applicable to lightweight scenarios such as mobile devices.

Description of the Drawings

[0050] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments and descriptions of the invention are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:

[0051] Figure 1 is a flowchart of a malware detection method based on multi-granularity network traffic feature fusion provided in an embodiment of the present invention.

Detailed Implementation

[0052] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only for explaining the invention and do not constitute a limitation thereof. Similar elements in different embodiments are referred to by associated similar element reference numerals. In the following embodiments, many details are described to facilitate a better understanding of the invention. However, those skilled in the art will readily recognize that some features may be omitted in different situations, or may be replaced by other elements, materials, or methods. In some cases, some operations related to the invention are not shown or described in the specification. This is to avoid obscuring the core parts of the invention with excessive description. For those skilled in the art, detailed description of these related operations is not necessary; they can fully understand the related operations based on the description in the specification and general technical knowledge in the art.

[0053] It should be noted that, unless otherwise specified, the embodiments and features described in this invention can be combined to form various implementations. Furthermore, the order of the steps or actions in the method description can be changed or adjusted in a manner readily apparent to those skilled in the art. Therefore, the various orders in the specification and drawings are merely for the clear description of a particular embodiment and do not imply a mandatory order, unless otherwise stated that a particular order must be followed.

[0054] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0055] Referring to Figure 1, one embodiment of the present invention provides a malware detection method based on multi-granularity network traffic feature fusion. This method constructs a multi-granularity network traffic heterogeneous graph structure, fuses multi-session network behavior features, and combines an end-to-end multi-task learning model to achieve high-precision malware identification and multi-category family discrimination for Android applications. Specifically, the method includes:

[0056] S1: Collect network traffic data of the target application and extract packet granularity features, flow granularity features and session granularity features from the network traffic data;

[0057] Construct a heterogeneous graph, which includes packet nodes, flow nodes, session nodes, and heterogeneous edges between packet nodes, flow nodes, and session nodes.

[0058] Specifically, S11: First, comprehensively collect the network traffic data generated while the target application runs. Network traffic capture tools such as Wireshark or tcpdump can be deployed on Android devices or emulators to monitor and capture, in real time, all network communication data generated during the application's entire runtime. During this process, all communication between the target application and external servers or terminals is captured, each transmitted data packet is recorded, and the resulting network traffic data is stored in a .pcap file. The .pcap file is then parsed to extract key fields for each data packet, including but not limited to: packet length, payload, protocol type, timestamp, direction (inbound / outbound), source IP address, destination IP address, source port, and destination port. The parsing results serve as the basis for subsequent multi-granularity feature extraction.

[0059] S12: Based on the parsing results obtained in S11, classify the data features: extract packet-level, flow-level, and session-level features from the network traffic data, and model features at the three granularities of packets, flows, and sessions. In packet modeling, original fields and derived statistics are selected from the parsed network traffic data to construct packet-level feature vectors. Typical packet-level features include, but are not limited to, payload, packet length, protocol type, transmission direction, timestamp, and the time difference between packets. One or more of these features are used to characterize each packet.

[0060] In flow modeling, data packets are grouped by the communication 5-tuple (source IP, destination IP, source port, destination port, protocol), and packets with the same 5-tuple are assembled into a flow. Statistics are then computed for each flow, including flow duration, total number of bytes in the flow, average packet length within the flow, number of packets within the flow, byte rate of the flow, and number of forward / backward packets within the flow; one or more of these features are used to characterize the flow.
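
The flow construction described above can be sketched in Python as follows. The packet dictionary layout (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `proto`, `ts`, `length`, `direction`) is an assumed representation of the fields parsed in S11, and `build_flows` is a hypothetical helper name, not part of the patent.

```python
from collections import defaultdict

def build_flows(packets):
    """Group packets by 5-tuple and compute per-flow statistics."""
    groups = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
               pkt["dst_port"], pkt["proto"])
        groups[key].append(pkt)

    flows = {}
    for key, pkts in groups.items():
        pkts = sorted(pkts, key=lambda p: p["ts"])
        duration = pkts[-1]["ts"] - pkts[0]["ts"]
        total_bytes = sum(p["length"] for p in pkts)
        flows[key] = {
            "duration": duration,
            "total_bytes": total_bytes,
            "packet_count": len(pkts),
            "mean_packet_length": total_bytes / len(pkts),
            # A single-packet flow has zero duration; report a rate of 0.
            "byte_rate": total_bytes / duration if duration > 0 else 0.0,
            "forward_packets": sum(1 for p in pkts if p["direction"] == "out"),
            "backward_packets": sum(1 for p in pkts if p["direction"] == "in"),
        }
    return flows

def make_packet(ts, length, dst_port, direction="out"):
    """Build an illustrative packet record (addresses are documentation IPs)."""
    return {"src_ip": "10.0.0.2", "dst_ip": "203.0.113.5", "src_port": 40000,
            "dst_port": dst_port, "proto": "TCP", "ts": ts,
            "length": length, "direction": direction}

packets = [make_packet(0.0, 100, 443), make_packet(2.0, 300, 443),
           make_packet(1.0, 60, 80)]
flows = build_flows(packets)
```

Note that with a strict 5-tuple key each flow is unidirectional, so one of the two direction counts is always zero. A bidirectional flow would require normalizing the key so that both directions map to the same entry; the text does not state which convention is intended.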

[0061] In session modeling, multiple flows generated during a single run of the target application or within a defined time window are aggregated into one session. Session granularity features include at least one of the following: session duration, number of intra-session flows, and number of target domains.
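
A minimal sketch of the time-window variant of session aggregation is shown below. It assumes each flow record carries `id`, `start`, `end`, and `domain` fields and that sessions are fixed, non-overlapping windows keyed on the flow start time; the window length of 60 seconds and the name `build_sessions` are illustrative assumptions only.

```python
def build_sessions(flows, window=60.0):
    """Aggregate flows into fixed time-window sessions and compute session features."""
    buckets = {}
    for flow in flows:
        # Assumption: a flow belongs to the window containing its start time.
        buckets.setdefault(int(flow["start"] // window), []).append(flow)

    sessions = []
    for index in sorted(buckets):
        members = buckets[index]
        sessions.append({
            "flow_ids": [f["id"] for f in members],
            "duration": max(f["end"] for f in members) - min(f["start"] for f in members),
            "flow_count": len(members),
            "domain_count": len({f["domain"] for f in members}),
        })
    return sessions

flows = [
    {"id": "f1", "start": 1.0, "end": 5.0, "domain": "a.example"},
    {"id": "f2", "start": 10.0, "end": 30.0, "domain": "b.example"},
    {"id": "f3", "start": 65.0, "end": 70.0, "domain": "a.example"},
]
sessions = build_sessions(flows, window=60.0)
```

Here the first two flows fall into one session and the third into another, and each session carries the three session-level features named in the text.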

[0062] S13: To effectively integrate multi-granularity network traffic characteristics, this embodiment of the invention uses multi-granularity features to construct a unified heterogeneous graph. Specifically, three types of nodes and the heterogeneous edges between them are first defined. The three types of nodes are packet nodes, flow nodes, and session nodes. Heterogeneous edges include those between packet nodes and flow nodes, and those between flow nodes and session nodes. Hierarchical and temporal relationships between granularities are captured through these structural connections. This yields an application-level heterogeneous graph $G$ for the target application:

[0063] $G = (V, E)$;

[0064] $V = V_p \cup V_f \cup V_s$;

[0065] $E = E_{pf} \cup E_{fs}$;

[0066] where $V$ represents the set of nodes, $V_p$ represents the set of data packet nodes, $V_f$ represents the set of flow nodes, $V_s$ represents the set of session nodes, and $E$ represents the set of heterogeneous edges; the subscript $p$ indicates a data packet, $f$ indicates a flow, and $s$ indicates a session. $E_{pf}$ represents the heterogeneous edges between packet nodes and flow nodes, and $E_{fs}$ represents the heterogeneous edges between flow nodes and session nodes.

[0067] Through the above graph structure modeling process, a heterogeneous graph containing multi-granularity feature nodes and hierarchical connections is formed, providing a structured input foundation for graph neural network representation learning.
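
One possible in-memory representation of this heterogeneous graph is sketched below, using plain dictionaries rather than a graph library. It assumes the packet-to-flow and flow-to-session memberships are already known from the earlier grouping steps; `build_hetero_graph` and `session_subgraph` are hypothetical helper names.

```python
def build_hetero_graph(packet_to_flow, flow_to_session):
    """Build typed node sets and typed edge lists from membership maps."""
    missing = set(packet_to_flow.values()) - set(flow_to_session)
    if missing:
        raise ValueError(f"flows without a session: {sorted(missing)}")
    nodes = {
        "packet": set(packet_to_flow),
        "flow": set(flow_to_session),
        "session": set(flow_to_session.values()),
    }
    edges = {
        ("packet", "flow"): sorted(packet_to_flow.items()),
        ("flow", "session"): sorted(flow_to_session.items()),
    }
    return nodes, edges

def session_subgraph(edges, session_id):
    """Return the node sets of the session-level subgraph for one session."""
    flows = {f for f, s in edges[("flow", "session")] if s == session_id}
    packets = {p for p, f in edges[("packet", "flow")] if f in flows}
    return {"packet": packets, "flow": flows, "session": {session_id}}

packet_to_flow = {"p1": "f1", "p2": "f1", "p3": "f2", "p4": "f3"}
flow_to_session = {"f1": "s1", "f2": "s1", "f3": "s2"}
nodes, edges = build_hetero_graph(packet_to_flow, flow_to_session)
sub = session_subgraph(edges, "s1")
```

Keeping edges keyed by (source type, target type) mirrors the two edge relations defined for this embodiment and makes it straightforward to extract the per-session subgraphs that the later representation learning step operates on.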

[0068] S2: Initialize the feature vectors of packet nodes, flow nodes and session nodes based on packet granularity features, flow granularity features and session granularity features; use a heterogeneous graph neural network model to perform semantic representation of the session-level heterogeneous subgraphs in the heterogeneous graph, generate session embedding vectors, and fuse multiple session embedding vectors of the same target application through temporal modeling and attention mechanism to generate a unified network traffic behavior vector of the target application.

[0069] Specifically, S21: First, the nodes in the heterogeneous graph are initialized with features to construct an input vector, thereby realizing a complete expression of the semantics of the heterogeneous graph.

[0070] Specifically, this embodiment reduces the size of the packet node features while preserving the semantic information of the payloads, which lowers the computational load of detecting the target application and allows the method to be applied in lightweight scenarios such as Android mobile devices. This overcomes the high demands that existing software detection models place on the operating device, maintaining high-precision detection while taking deployment efficiency and practical performance into account, and thus meets the current need in mobile security for efficient, fine-grained, and deployable malware detection solutions. To this end, the invention designs a packet semantic modeling method based on the Vision Transformer (ViT) model, together with a unified node embedding representation mechanism that maps all types of nodes to a unified semantic vector space, thereby improving the expressive power and learning effect of graph neural networks on heterogeneous graph structures. The payload of each data packet is first converted into a fixed-length byte sequence or a fixed-size image. The converted payload is then input into a pre-trained or fine-tuned Vision Transformer model to extract high-level semantic features, which are used as the feature vector $x_p$ of the data packet node:

[0071] $x_p = \mathrm{ViT}(\tilde{b}_p) \in \mathbb{R}^{d_p}$;

[0072] where $\tilde{b}_p$ denotes the payload of the data packet converted into a fixed-length byte sequence or image format, and $\mathbb{R}^{d_p}$ represents a multidimensional real vector space.
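
The payload conversion step can be sketched as follows; the Vision Transformer itself is not shown. The image side length of 16, zero padding of short payloads, truncation of long ones, and scaling to [0, 1] are all assumptions made for illustration, since the text only requires a fixed-length byte sequence or a fixed-size image.

```python
def payload_to_image(payload: bytes, side: int = 16):
    """Convert a raw payload into a side x side grayscale grid in [0, 1].

    Assumptions: payloads longer than side*side bytes are truncated and
    shorter ones are right-padded with zero bytes.
    """
    size = side * side
    fixed = payload[:size].ljust(size, b"\x00")
    return [[fixed[row * side + col] / 255.0 for col in range(side)]
            for row in range(side)]

# A 5-byte payload mapped onto a 4 x 4 grid (illustrative sizes).
image = payload_to_image(b"\xff\x00abc", side=4)
```

The resulting grid has a fixed shape regardless of payload length, which is what allows a patch-based image model to consume packets of arbitrary size.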

[0073] For the feature initialization of a flow node, this embodiment of the invention uses statistical features such as flow duration, total number of bytes in the flow, average packet length in the flow, number of packets in the flow, byte rate in the flow, and number of forward / backward data packets in the flow as its initialization features.

[0074] For the feature initialization of session nodes, this embodiment of the invention uses statistical features such as session duration, number of intra-session flows, and number of target domain names as its initialization features.

[0075] Furthermore, to eliminate the differences in semantic space and feature dimension among the three types of nodes (data packets, flows, and sessions), this embodiment of the invention designs a unified embedding representation mechanism for heterogeneous nodes that projects all nodes onto the same semantic embedding space. Specifically, a trainable type embedding vector is assigned to each of the three node types: packet nodes, flow nodes, and session nodes. For ease of representation, a node of any of the three types is written uniformly as $v$, and the trainable type embedding vector of its node type is written as $e_{t(v)}$. The type embedding vector of each node is concatenated with the node's original initial features, so that the heterogeneous graph neural network model used subsequently for deep representation learning on the heterogeneous subgraphs of each target application can produce high-semantic graph embedding vectors for downstream malicious behavior identification. With the initial feature vector of a node written uniformly as $x_v$, the concatenated vector $z_v$ is represented as:

[0076] $z_v = [\, x_v \,\|\, e_{t(v)} \,] \in \mathbb{R}^{d_x + d_e}$;

[0077] where $d_x$ represents the dimension of the initial feature vector of each node and $d_e$ represents the dimension of the embedding vector of each node type.

[0078] Furthermore, the concatenated vectors are projected into a semantic embedding space of the same dimension using a multilayer perceptron. Specifically, a two-layer multilayer perceptron (MLP) is constructed to perform a unified projection of the concatenated vector $\tilde{\mathbf{x}}_t$, mapping it into a semantic embedding space $\mathbb{R}^{d}$ of unified dimension, where $d$ denotes the embedding dimension of the final unified output. The concatenated vector after dimension unification, $\mathbf{h}_t$, can be expressed as:

[0079] $$\mathbf{h}_t = \sigma_2\big(\mathbf{W}_2\,\sigma_1(\mathbf{W}_1 \tilde{\mathbf{x}}_t + \mathbf{b}_1) + \mathbf{b}_2\big);$$

[0080] where $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, and $\mathbf{b}_2$ are all model parameters, and $\sigma_1$ and $\sigma_2$ both denote nonlinear activation functions.
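
The type-embedding concatenation and two-layer MLP projection described above can be sketched as follows. This is a minimal illustration with randomly initialised (untrained) weights; the dimensions, the choice of ReLU for both activations, and the assumption that all initial feature vectors have already been brought to a common length are not taken from the source.

```python
import random

random.seed(0)
D_X, D_E, D_HID, D_OUT = 4, 2, 8, 3   # illustrative sizes, not from the source


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


# One trainable type embedding per node type (randomly initialised here).
type_embedding = {t: [random.uniform(-0.5, 0.5) for _ in range(D_E)]
                  for t in ("packet", "flow", "session")}
W1, b1 = rand_matrix(D_HID, D_X + D_E), [0.0] * D_HID
W2, b2 = rand_matrix(D_OUT, D_HID), [0.0] * D_OUT


def unify(x, node_type):
    """Concatenate features with the type embedding, then apply a two-layer MLP."""
    x_tilde = x + type_embedding[node_type]        # list concatenation: [x || e_t]
    return relu(linear(W2, b2, relu(linear(W1, b1, x_tilde))))


h_flow = unify([0.2, 1.0, 0.5, 0.1], "flow")
```

Every node, whatever its type, leaves `unify` with the same output dimension, which is what allows a single graph attention layer to process packet, flow, and session nodes together.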

[0081] S22: Utilizing a Heterogeneous Graph Attention Network (HAN) model, deep representation learning is performed on the session-level heterogeneous subgraphs in each target application to generate high-semantic session embedding vectors that can be used for downstream malicious behavior identification. This fully integrates multi-granularity structural information at the packet, flow, and session levels of communication behavior to construct a heterogeneous graph representing behavior across different levels of granularity. The process of constructing the session embedding vectors specifically includes the following steps:

[0082] First, multiple meta-paths are defined for the session-level heterogeneous subgraph. A meta-path describes a semantic relationship between nodes of different types; defining the meta-paths thus fixes the graph structure definition and the meta-path configuration. Each session-level heterogeneous subgraph in the application-level heterogeneous graph comprises the set of nodes within that session and the heterogeneous edges between those nodes. To model the relationships between different node types, this embodiment of the invention employs a heterogeneous graph attention network (HAN) and introduces multiple meta-paths, with the set of meta-paths $\mathcal{M}$ expressed as:

[0083] $$\mathcal{M} = \{m_1, m_2, \ldots, m_K\};$$

[0084] where $m_k$ ($k = 1, \ldots, K$) denotes a specific meta-path, including but not limited to packet→flow→session and packet→session.
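
To make the notion of meta-path neighbors concrete, the sketch below derives neighbor sets from membership edges of a toy session subgraph. The node names, the dictionary-based edge representation, and the function `metapath_neighbors` are hypothetical; the one-hop meta-paths are included only for illustration alongside packet→flow→session.

```python
from collections import defaultdict

# Hypothetical toy subgraph: three packets, two flows, one session.
packet_to_flow = {"p1": "f1", "p2": "f1", "p3": "f2"}
flow_to_session = {"f1": "s1", "f2": "s1"}


def metapath_neighbors(first_hop, second_hop=None):
    """Map each end node to the start nodes that reach it along the meta-path."""
    neighbors = defaultdict(set)
    for start, mid in first_hop.items():
        # For a two-hop meta-path, follow the second edge type as well.
        end = second_hop[mid] if second_hop is not None else mid
        neighbors[end].add(start)
    return dict(neighbors)


meta_paths = {
    "packet->flow": metapath_neighbors(packet_to_flow),
    "flow->session": metapath_neighbors(flow_to_session),
    "packet->flow->session": metapath_neighbors(packet_to_flow, flow_to_session),
}
```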

[0085] For any node on each meta-path, a graph attention mechanism is used to aggregate the neighboring nodes of that node on that meta-path, generating a specific embedding representation of that node under each meta-path. Specifically, the attention weights of the node's neighboring nodes on each meta-path are first determined using the graph attention mechanism. The attention weight $\alpha_{ij}^{m}$ between a node and any one of its neighboring nodes can be expressed as:

[0086] $$\alpha_{ij}^{m} = \frac{\exp\!\big(\sigma(\mathbf{a}_m^{\top}[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j])\big)}{\sum_{k \in \mathcal{N}_i^{m}} \exp\!\big(\sigma(\mathbf{a}_m^{\top}[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_k])\big)};$$

[0087] where $i$ denotes the node currently being processed; $j$ denotes a neighboring node of the node currently being processed; $m$ denotes the identifier of the meta-path on which the node currently being processed lies; $\mathcal{N}_i^{m}$ denotes the set of neighboring nodes of the node currently being processed; $\mathbf{h}_i$ denotes the feature vector of the node currently being processed; $\mathbf{h}_j$ denotes the feature vector of the neighboring node; $\mathbf{W}$ denotes the node transformation matrix; $\Vert$ denotes the concatenation operation; $\sigma(\cdot)$ is a nonlinear activation function; $\exp(\cdot)$ is the exponential function; $\mathbf{a}_m$ denotes the attention weight vector; and $\mathbf{h}_k$ denotes the feature vector of the $k$-th neighboring node. Note that a neighboring node is a node connected to the node currently being processed by a heterogeneous edge.

[0088] Based on the attention weights calculated above, the features of the neighboring nodes are weighted and aggregated to generate the specific embedding representation $\mathbf{z}_i^{m}$ of the node under each meta-path:

[0089] $$\mathbf{z}_i^{m} = \sum_{j \in \mathcal{N}_i^{m}} \alpha_{ij}^{m}\,\mathbf{W}\mathbf{h}_j.$$
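
The node-level attention and weighted aggregation for a single node on a single meta-path can be sketched as below. This follows the usual graph-attention formulation and is an illustrative reading only: the use of LeakyReLU as the nonlinear activation, the absence of an outer activation after aggregation, and the function names are assumptions.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(v, slope=0.2):
    return v if v > 0 else slope * v


def node_attention(h_i, neighbor_feats, W, a):
    """Return (attention weights, aggregated embedding) for one node on one meta-path."""
    wh_i = matvec(W, h_i)
    wh_nb = [matvec(W, h_j) for h_j in neighbor_feats]
    # Unnormalised score: activation(a^T [W h_i || W h_j]) for each neighbor j.
    scores = [leaky_relu(sum(ak * v for ak, v in zip(a, wh_i + wh_j)))
              for wh_j in wh_nb]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]      # softmax over the neighbor set
    z = [sum(al * wh_j[k] for al, wh_j in zip(alpha, wh_nb))
         for k in range(len(wh_i))]
    return alpha, z


identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, z = node_attention([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]],
                          identity, a=[0.0, 0.0, 1.0, 0.0])
```

With two identical neighbors the weights are equal, so the aggregated embedding simply reproduces the shared neighbor feature.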

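[0090] To integrate the semantics of multiple meta-paths, a semantic-level attention mechanism is employed to fuse the specific embedding representations of a node under different meta-paths. A weighted combination of the node embeddings at different granularities is computed, so that each node obtains a final embedding $\mathbf{z}_i$:

[0091] $$\mathbf{z}_i = \sum_{m \in \mathcal{M}} \beta_m\,\mathbf{z}_i^{m};$$

[0092] where $\beta_m$ denotes the importance weight of meta-path $m$, $\beta_m = \dfrac{\exp\!\big(\mathbf{q}^{\top}\tanh(\mathbf{W}_m \mathbf{z}_i^{m})\big)}{\sum_{m' \in \mathcal{M}} \exp\!\big(\mathbf{q}^{\top}\tanh(\mathbf{W}_{m'} \mathbf{z}_i^{m'})\big)}$; $\mathbf{W}_m$ denotes the learnable weight matrix of meta-path $m$; $\tanh(\cdot)$ denotes the hyperbolic tangent activation function; and $\mathbf{q}$ denotes a learnable attention query vector.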
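
A minimal sketch of semantic-level fusion for one node follows. It scores each meta-path-specific embedding with a query vector applied to a tanh projection and normalises the scores with a softmax. Computing the weights per node (rather than averaging scores over all nodes, as some HAN implementations do) and the name `semantic_fusion` are assumptions of this example.

```python
import math


def semantic_fusion(z_by_path, W_by_path, q):
    """Fuse the meta-path-specific embeddings of one node; returns (beta, z)."""
    scores = {}
    for m, z_m in z_by_path.items():
        proj = [math.tanh(sum(w * v for w, v in zip(row, z_m)))
                for row in W_by_path[m]]
        scores[m] = sum(qk * pk for qk, pk in zip(q, proj))   # q^T tanh(W_m z^m)
    top = max(scores.values())
    exps = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(exps.values())
    beta = {m: e / total for m, e in exps.items()}            # softmax over meta-paths
    dim = len(next(iter(z_by_path.values())))
    z = [sum(beta[m] * z_by_path[m][k] for m in z_by_path) for k in range(dim)]
    return beta, z


identity = [[1.0, 0.0], [0.0, 1.0]]
beta, z = semantic_fusion(
    {"packet->session": [0.0, 1.0], "packet->flow->session": [0.0, 3.0]},
    {"packet->session": identity, "packet->flow->session": identity},
    q=[1.0, 0.0],
)
```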

[0093] Furthermore, a READOUT operation aggregates the final embeddings of all nodes to obtain the overall session embedding vector $\mathbf{g}$ of the session-level heterogeneous subgraph:

[0094] $$\mathbf{g} = \mathrm{READOUT}\big(\{\mathbf{z}_i\}\big).$$

[0095] The session embedding vector $\mathbf{g}$ is used to characterize network behavior at the session level.
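
The source does not state which READOUT function is used. As one common choice, the sketch below uses mean pooling over the final node embeddings; sum or max pooling would be equally plausible, so `readout_mean` should be read as an assumption.

```python
def readout_mean(node_embeddings):
    """Mean-pool final node embeddings into one session embedding vector."""
    if not node_embeddings:
        raise ValueError("a session subgraph needs at least one node")
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(z[k] for z in node_embeddings) / n for k in range(dim)]


session_vector = readout_mean([[1.0, 2.0], [3.0, 4.0]])
```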

[0096] S23: Based on the session embedding vectors of the multiple session-level heterogeneous subgraphs obtained in S22, a unified cross-graph fusion mechanism is designed to integrate them into a single representation reflecting the network behavior pattern of the entire application, which is then used for subsequent malware identification and classification. Considering the differences in semantic structure at multiple granularities within the application, this embodiment of the invention constructs a temporal modeling mechanism based on gated recurrent units (GRUs) and combines it with an attention mechanism to perform behavioral semantic aggregation guided by key sessions. This fuses the multiple session embedding vectors of the same target application into a unified network traffic behavior vector at the application level.

[0097] Specifically, for any target application (app), its $N$ session-level heterogeneous subgraphs are denoted $\{G_1, G_2, \ldots, G_N\}$, and its session embedding vector sequence is denoted $\{\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_N\}$. Each session embedding vector $\mathbf{g}_i$ integrates structural semantic information at three levels of granularity: data packets, flows, and sessions.

[0098] To identify the semantic contributions of different sessions to the overall behavior, a GRU-based temporal modeling approach is employed to capture the evolutionary features between graph vectors. Specifically, the session embedding vector sequence is input into a temporal network based on gated recurrent units to generate a hidden state sequence $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_N\}$, where any one hidden state is $\mathbf{u}_i = \mathrm{GRU}(\mathbf{g}_i, \mathbf{u}_{i-1})$.

[0099] To identify the session segments most representative of application behavior across multiple sessions, an attention mechanism is further introduced to calculate the attention intensity of each session-level heterogeneous subgraph with respect to the global representation of the target application. The attention intensity $\gamma_i$ of the $i$-th session with respect to the global representation is:

[0100] $$\gamma_i = \frac{\exp(\mathbf{W}_a \mathbf{u}_i)}{\sum_{j=1}^{N} \exp(\mathbf{W}_a \mathbf{u}_j)};$$

[0101] where $\mathbf{W}_a$ denotes a trainable transformation matrix.

[0102] Based on the attention intensity of each session-level heterogeneous subgraph with respect to the global representation of the target application, a weighted summation yields the unified network traffic behavior vector $\mathbf{v}_{\text{app}}$ of the target application:

[0103] $$\mathbf{v}_{\text{app}} = \sum_{i=1}^{N} \gamma_i\,\mathbf{u}_i.$$
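
The GRU-plus-attention fusion of session embedding vectors can be sketched as follows. The GRU cell uses one common gate convention, omits bias terms, and has random (untrained) weights; the attention transformation is treated as a single row vector that maps each hidden state to a scalar score, and the weighted sum is taken over hidden states. All of these are assumptions of this sketch rather than details confirmed by the source.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class GRUCell:
    """Minimal GRU cell without bias terms (weights are random, i.e. untrained)."""

    def __init__(self, d_in, d_hid):
        self.d_hid = d_hid
        # W_* multiply the input, U_* multiply the previous hidden state.
        self.Wz, self.Uz = rand_matrix(d_hid, d_in), rand_matrix(d_hid, d_hid)
        self.Wr, self.Ur = rand_matrix(d_hid, d_in), rand_matrix(d_hid, d_hid)
        self.Wc, self.Uc = rand_matrix(d_hid, d_in), rand_matrix(d_hid, d_hid)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]  # update gate
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wc, x), matvec(self.Uc, gated))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def fuse_sessions(session_vectors, cell, w_att):
    """Return (attention intensities, application-level behavior vector)."""
    h = [0.0] * cell.d_hid
    hidden = []
    for g in session_vectors:                 # sessions in chronological order
        h = cell.step(g, h)
        hidden.append(h)
    scores = [sum(w * v for w, v in zip(w_att, u)) for u in hidden]
    top = max(scores)                         # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    gamma = [e / total for e in exps]         # softmax over sessions
    fused = [sum(gm * u[k] for gm, u in zip(gamma, hidden))
             for k in range(cell.d_hid)]
    return gamma, fused


cell = GRUCell(d_in=3, d_hid=4)
sessions = [[0.1, 0.4, -0.2], [0.9, -0.3, 0.5], [0.0, 0.2, 0.7]]
gamma, v_app = fuse_sessions(sessions, cell, w_att=[0.5, -0.5, 0.25, 1.0])
```

The returned `gamma` values are what make the fused vector traceable: the sessions with the largest intensities are the ones that dominate the application-level representation.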

[0104] The unified network traffic behavior vector $\mathbf{v}_{\text{app}}$ characterizes the temporal behavioral features of the entire application at the level of multi-granularity network traffic. It combines local semantic aggregation with global dynamic modeling and serves as the input representation for the subsequent malicious behavior identification model. Compared with traditional methods that directly average or concatenate session graph vectors, this embodiment has the following advantages: the GRU model captures the evolution trend of behavior across the session sequence, improving sensitivity to temporal features; the attention mechanism dynamically identifies key behavioral segments, avoiding interference from redundant information; the resulting unified network traffic behavior vector is interpretable, since it can be traced back to the important sessions from which it originates; and multi-granularity information modeling is supported, combining the three-layer structure of data packets, flows, and sessions to preserve semantic integrity.

[0105] S3: Construct a multi-task classification model that takes the unified network traffic behavior vector as input and performs malicious / benign classification, malware type classification, and malware family classification. To achieve multi-task optimized, end-to-end malware recognition training, this embodiment of the invention constructs an end-to-end trainable malware recognition framework built on the multi-granularity unified network traffic behavior vector obtained in S2. Performing multiple detection tasks simultaneously (malicious / benign binary classification, multi-class malware type classification, and family identification) improves the model's generalization ability and recognition accuracy. By designing a joint loss function and a shared-specific decoupled structure, the feature generation modules (HAN and GRU) are incorporated into a unified training path, achieving joint optimization of feature extraction and the classification tasks.

[0106] Specifically, S31: First, a shared-decoupled neural network structure is constructed to perform unified mapping and task recognition on the multi-granularity heterogeneous graph representation results. The unified network traffic behavior vector $\mathbf{v}_{\text{app}}$ obtained in S2 is taken as input and mapped to a shared feature space. The mapping of the unified network traffic behavior vector to the shared feature space is:

[0107] $$\mathbf{h}_{\text{shared}} = f_{\text{shared}}(\mathbf{v}_{\text{app}};\, \theta_{\text{shared}});$$

[0108] where $f_{\text{shared}}$ denotes the shared feature encoding network, which consists of two fully connected (MLP) layers with ReLU activation; $\theta_{\text{shared}}$ denotes the training parameters of the shared feature encoding network; and $\mathbf{h}_{\text{shared}}$ denotes the representation in the shared feature space.

[0109] It should be noted that, in this embodiment of the invention, heterogeneous graph construction, unified network traffic behavior vector extraction, and the multi-task learning model are trained end-to-end as one architecture. Both the HAN network and the GRU network are included in the end-to-end training process. Specifically, the HAN network takes each session-level heterogeneous subgraph as input, extracts local structural semantics through a graph attention mechanism, and outputs session embedding vectors. The GRU network receives the sequence of all session embedding vectors of an application, performs temporal modeling and attention fusion, and generates the final application-level unified network traffic behavior vector. The parameters of these modules are jointly optimized with the multi-task classification model, so that feature generation and the discrimination tasks are trained together.

[0110] S32: Further, the representation of the unified network traffic behavior vector in the shared feature space is used as input to the multi-task classification model. To accommodate the differences in representation required by the different identification targets, a task-decoupled structure is constructed for the three core tasks. The multi-task classification model includes: a task module for binary classification of target applications as malicious or benign software, a task module for type classification of malicious software, and a task module for family identification of malicious software. Each task branch contains a task-specific fully connected layer (task-private layer) and a Softmax classifier to carry out its discrimination task.

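[0111] The task module for binary classification of malicious and benign software is expressed as:

[0112] $$\hat{\mathbf{y}}_{\text{bin}} = \mathrm{Softmax}(\mathbf{W}_{\text{bin}}\,\mathbf{h}_{\text{shared}} + \mathbf{b}_{\text{bin}});$$

[0113] where $\hat{\mathbf{y}}_{\text{bin}}$ denotes the predicted output of the task module that performs binary classification of malicious and benign software; $\mathbf{W}_{\text{bin}}$ and $\mathbf{b}_{\text{bin}}$ are the model weight matrix and the bias term, respectively; and $\mathbf{h}_{\text{shared}}$ is the representation of the unified network traffic behavior vector in the shared feature space.

[0114] The task module for classifying malware into types (such as Trojans, adware, and spyware) is expressed as:

[0115] $$\hat{\mathbf{y}}_{\text{type}} = \mathrm{Softmax}(\mathbf{W}_{\text{type}}\,\mathbf{h}_{\text{shared}} + \mathbf{b}_{\text{type}});$$

[0116] where $\hat{\mathbf{y}}_{\text{type}}$ denotes the predicted output of the task module that classifies malware types, and $\mathbf{W}_{\text{type}}$ and $\mathbf{b}_{\text{type}}$ are the model weight matrix and the bias term, respectively.

[0117] The task module for identifying malware families (such as DroidKungFu and FakeInstaller) is expressed as:

[0118] $$\hat{\mathbf{y}}_{\text{fam}} = \mathrm{Softmax}(\mathbf{W}_{\text{fam}}\,\mathbf{h}_{\text{shared}} + \mathbf{b}_{\text{fam}});$$

[0119] where $\hat{\mathbf{y}}_{\text{fam}}$ denotes the predicted output of the task module that identifies malware families, and $\mathbf{W}_{\text{fam}}$ and $\mathbf{b}_{\text{fam}}$ are the model weight matrix and the bias term, respectively.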

[0120] The three tasks mentioned above share input but make predictions independently to enhance the model's task discrimination capability.
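
The shared encoder followed by three independent task branches can be sketched as below. Weights are random (untrained), and the layer sizes and the numbers of malware types and families are hypothetical; each branch is modeled as one task-private linear layer followed by Softmax.

```python
import math
import random

random.seed(0)
D_IN, D_SHARED = 4, 6
N_TYPES, N_FAMILIES = 3, 5        # hypothetical label counts


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


# Shared encoder: two fully connected layers with ReLU.
shared = [(rand_matrix(D_SHARED, D_IN), [0.0] * D_SHARED),
          (rand_matrix(D_SHARED, D_SHARED), [0.0] * D_SHARED)]
# One task-private (W, b) pair per task.
heads = {name: (rand_matrix(n, D_SHARED), [0.0] * n)
         for name, n in (("binary", 2), ("type", N_TYPES), ("family", N_FAMILIES))}


def predict(v_app):
    """Map the application-level vector to three probability distributions."""
    h = v_app
    for W, b in shared:
        h = relu(linear(W, b, h))
    # All heads read the same shared representation but predict independently.
    return {name: softmax(linear(W, b, h)) for name, (W, b) in heads.items()}


predictions = predict([0.1, 0.2, 0.3, 0.4])
```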

[0121] S33: For the end-to-end trainable malware identification framework, this embodiment of the invention optimizes the performance of all tasks simultaneously by means of a weighted multi-task joint loss function. The losses of the sub-tasks are dynamically fused, and the loss function $\mathcal{L}$ of the multi-task classification model is:

[0122] $$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{bin}} + \lambda_2 \mathcal{L}_{\text{type}} + \lambda_3 \mathcal{L}_{\text{fam}};$$

[0123] where $\mathcal{L}_{\text{bin}}$ denotes the cross-entropy loss of the binary classification of malicious and benign software; $\mathcal{L}_{\text{type}}$ denotes the loss of malware type classification; $\mathcal{L}_{\text{fam}}$ denotes the loss of malware family identification; and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of $\mathcal{L}_{\text{bin}}$, $\mathcal{L}_{\text{type}}$, and $\mathcal{L}_{\text{fam}}$, respectively. Each of $\lambda_1$, $\lambda_2$, and $\lambda_3$ can be set as a fixed constant or as a dynamically adjustable parameter.
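
The weighted joint loss for one sample can be computed as in the sketch below. The source names cross-entropy only for the binary task; using cross-entropy for the type and family tasks as well, fixed weights, and the assumption that all three labels are available for the sample are choices made for this example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability of the true class (clamped to avoid log(0))."""
    return -math.log(max(probs[label], 1e-12))


def joint_loss(preds, labels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the binary, type, and family losses for one sample."""
    lam_bin, lam_type, lam_fam = weights
    l_bin = cross_entropy(preds["binary"], labels["binary"])
    l_type = cross_entropy(preds["type"], labels["type"])
    l_fam = cross_entropy(preds["family"], labels["family"])
    return lam_bin * l_bin + lam_type * l_type + lam_fam * l_fam


loss = joint_loss(
    preds={"binary": [0.5, 0.5], "type": [0.25, 0.25, 0.5], "family": [1.0, 0.0]},
    labels={"binary": 0, "type": 2, "family": 0},
    weights=(1.0, 2.0, 3.0),
)
```

In this example the binary and type terms each contribute a multiple of ln 2 and the family term contributes zero, so the total is 3 ln 2. Raising one weight relative to the others shifts training pressure toward that task.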

[0124] This loss function ensures collaborative optimization between tasks, effectively suppressing the dominance of a single task in the training process and improving the accuracy of multi-target recognition. Furthermore, the training process is end-to-end differentiable, supporting efficient training through standard backpropagation and optimization algorithms. During deployment, by receiving the unified network traffic behavior vector obtained from S2, parallel prediction of the three types of recognition results can be achieved, demonstrating good application adaptability and inference efficiency.

[0125] In summary, the above description is merely a preferred embodiment of this specification and is not intended to limit the scope of protection of this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of protection of this specification.

[0126] The systems, apparatuses, modules, or units described in one or more of the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, a computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.

[0127] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0128] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments.

Claims

1. A malware detection method based on multi-granularity network traffic feature fusion, characterized in that the method includes: collecting network traffic data of a target application, and extracting packet granularity features, flow granularity features, and session granularity features from the network traffic data; constructing a heterogeneous graph, which includes packet nodes, flow nodes, session nodes, and heterogeneous edges between the packet nodes, flow nodes, and session nodes, the heterogeneous graph $G$ being: $G = (V, E)$; $V = V_p \cup V_f \cup V_s$; $E \subseteq V \times V$; where $V$ denotes the set of nodes, $V_p$ denotes the set of data packet nodes, $V_f$ denotes the set of flow nodes, $V_s$ denotes the set of session nodes, $E$ denotes the set of heterogeneous edges, $p$ indicates a data packet, $f$ indicates a flow, and $s$ indicates a session; initializing the feature vectors of the packet nodes, flow nodes, and session nodes based on the packet granularity features, flow granularity features, and session granularity features, which includes: converting the payload of a packet into a fixed-length byte sequence or a fixed-size image, inputting it into a Vision Transformer model to extract high-level semantic features, and using the high-level semantic features as the feature vector of the packet node; using the flow granularity features as the feature vectors of the flow nodes; and using the session granularity features as the feature vectors of the session nodes; assigning trainable type embedding vectors to the packet nodes, flow nodes, and session nodes respectively, and concatenating each node's feature vector with its type embedding vector to obtain the concatenated vector $\tilde{\mathbf{x}}_t$, expressed as: $\tilde{\mathbf{x}}_t = [\mathbf{x}_t \,\Vert\, \mathbf{e}_t] \in \mathbb{R}^{d_x + d_e}$; where $d_x$ denotes the dimension of the initial feature vector of each node, $d_e$ denotes the dimension of the embedding vector of each node type, $\mathbf{e}_t$ denotes the trainable type embedding vector, $t$ is a unified notation for packet nodes, flow nodes, and session nodes, and $\mathbf{x}_t$ is a unified notation for the initial feature vectors of packet nodes, flow nodes, and session nodes; projecting the concatenated vectors into a semantic embedding space of the same dimension using a multilayer perceptron, which includes: performing a unified projection of the concatenated vector $\tilde{\mathbf{x}}_t$, mapping it into a semantic embedding space $\mathbb{R}^{d}$ of unified dimension, where $d$ denotes the embedding dimension of the final unified output, the concatenated vector after dimension unification $\mathbf{h}_t$ being expressed as: $\mathbf{h}_t = \sigma_2\big(\mathbf{W}_2\,\sigma_1(\mathbf{W}_1 \tilde{\mathbf{x}}_t + \mathbf{b}_1) + \mathbf{b}_2\big)$; where $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, and $\mathbf{b}_2$ are all model parameters, and $\sigma_1$ and $\sigma_2$ both denote nonlinear activation functions; using a heterogeneous graph neural network model to perform semantic representation of the session-level heterogeneous subgraphs in the heterogeneous graph and generate session embedding vectors, which includes: defining multiple meta-paths for the session-level heterogeneous subgraph, the meta-paths describing the semantic relationships between different types of nodes; for each node on any meta-path, aggregating the features of its neighboring nodes on that meta-path through a graph attention mechanism to generate a specific embedding representation of the node under each meta-path; fusing the specific embedding representations of a node under different meta-paths using a semantic-level attention mechanism, so that each node obtains a final embedding; and aggregating the final embeddings of the nodes to generate the session embedding vector of the session-level heterogeneous subgraph; fusing multiple session embedding vectors of the same target application through temporal modeling and attention mechanisms to generate a unified network traffic behavior vector for the target application; and constructing a multi-task classification model, and using the unified network traffic behavior vector as the input of the multi-task classification model to perform malicious / benign classification, malware type classification, and malware family classification.

2. The malware detection method based on multi-granularity network traffic feature fusion according to claim 1, characterized in that packets with the same communication five-tuple are grouped into flows, and flows within a single run of the target application or within a set time window are aggregated into sessions.

3. The malware detection method based on multi-granularity network traffic feature fusion according to claim 1, characterized in that the data packet granularity features include at least one of the following: payload, packet length, protocol type, transmission direction, and timestamp; the flow granularity features include at least one of the following: flow duration, total number of bytes in the flow, average packet length in the flow, number of packets in the flow, byte rate in the flow, and number of forward / backward data packets in the flow; and the session granularity features include at least one of the following: session duration, number of intra-session flows, and number of target domains.

4. The malware detection method based on multi-granularity network traffic feature fusion according to claim 1, characterized in that the step of fusing multiple session embedding vectors of the same target application through temporal modeling and attention mechanisms to generate a unified network traffic behavior vector at the application level includes: determining a sequence of session embedding vectors based on the session-level heterogeneous subgraphs contained in the target application; inputting the session embedding vector sequence into a temporal network based on gated recurrent units to generate a hidden state sequence; and calculating, using an attention mechanism, the attention intensity of each session-level heterogeneous subgraph with respect to the global representation of the target application, and calculating the unified network traffic behavior vector of the target application based on those attention intensities.

5. The malware detection method based on multi-granularity network traffic feature fusion according to claim 1, characterized in that the multi-task classification model includes: a task module for binary classification of target applications into malicious and benign software, a task module for type classification of malicious software, and a task module for family identification of malicious software.

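6. The malware detection method based on multi-granularity network traffic feature fusion according to claim 5, characterized in that the loss function $\mathcal{L}$ of the multi-task classification model is: $\mathcal{L} = \lambda_1 \mathcal{L}_{\text{bin}} + \lambda_2 \mathcal{L}_{\text{type}} + \lambda_3 \mathcal{L}_{\text{fam}}$; where $\mathcal{L}_{\text{bin}}$ denotes the cross-entropy loss of the binary classification of malicious and benign software, $\mathcal{L}_{\text{type}}$ denotes the loss of malware type classification, $\mathcal{L}_{\text{fam}}$ denotes the loss of malware family identification, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of $\mathcal{L}_{\text{bin}}$, $\mathcal{L}_{\text{type}}$, and $\mathcal{L}_{\text{fam}}$, respectively.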