Intelligent identification method for HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed
By removing empty payload messages and masking plaintext fields, a message cluster representation unit is constructed and self-supervised training is performed using the BERT model. This solves the problem of accurate classification of HTTP/3 encrypted traffic and achieves high-precision and low-cost traffic identification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTHEAST UNIV
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-26
AI Technical Summary
Existing traffic detection methods based on plaintext features are difficult to adapt to the fully encrypted communication of the HTTP/3 protocol, resulting in a decline in the classification performance of the model in different deployment environments. Furthermore, existing encrypted traffic classification methods lack robustness in HTTP/3 scenarios.
By removing empty payload messages and masking plaintext fields, a message cluster representation unit is constructed, and the BERT model is used for self-supervised pre-training and fine-tuning to achieve accurate classification of HTTP/3 encrypted traffic.
It achieves high-precision classification of HTTP/3 encrypted traffic, enhances the model's cross-environment generalization ability and recognition accuracy, reduces computing power costs, and adapts to parallel computing environments.
Figure CN122293379A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to an intelligent identification method for HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed, and belongs to the field of cyberspace security technology.
Background Technology
[0002] HTTP/3, as a next-generation HTTP protocol, implements fully encrypted network communication based on the QUIC transport layer protocol. Therefore, traditional traffic inspection and deep packet inspection (DPI) methods based on plaintext features are difficult to apply directly. With the widespread adoption of the HTTP/3 protocol in internet applications, intelligent identification technology for encrypted traffic faces new challenges.
[0003] On the one hand, the HTTP/3 protocol introduces traffic patterns different from traditional TCP streams, generating environment-dependent noise data. Furthermore, HTTP/3 includes zero-payload control messages; these messages do not carry application data but increase traffic volume and obscure truly meaningful application-layer interaction patterns. On the other hand, certain metadata transmitted in plaintext during the QUIC handshake exposes deployment environment-specific characteristics. If these plaintext environmental features are directly used for model training, the model is prone to overfitting to the characteristics of specific network environments or servers, thus memorizing deployment environment patterns instead of extracting general application traffic patterns. When the application deployment environment changes or the model is migrated to a new environment, classification performance deteriorates sharply.
[0004] Current research on encrypted traffic classification attempts to use traffic pattern features such as message length sequences and message arrival times for machine learning or deep learning classification. However, in HTTP/3 encrypted traffic scenarios, the inherent noise of the QUIC protocol and environmental differences leave these methods insufficiently robust. Therefore, there is an urgent need for a method that can accurately classify HTTP/3 encrypted traffic without decryption. Such a method should eliminate environmental interference and extract general traffic representations that reflect application behavior, so that the model generalizes across different deployment environments.
Summary of the Invention
[0005] To address the need for intelligent identification of encrypted traffic, this invention proposes an intelligent identification method for HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed. The method jointly considers plaintext information and statistical feature information while reducing invalid and redundant ciphertext input, so that it captures the more essential and accurate information in encrypted traffic and lowers the computing power required by the model. The solution covers traffic feature denoising, construction of message cluster representation units, and training of the encrypted traffic application classification model, ensuring the accuracy, robustness, and lightweight deployment of the application classification model.
[0006] To achieve the above objectives, the present invention provides the following technical solution:
[0007] A method for intelligent identification of HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed includes the following steps:
[0008] (1) Preprocess the raw HTTP/3 encrypted traffic: remove empty payload data packets whose payload length is 0, and mask the environment-related plaintext fields in the traffic to obtain a noise-reduced bidirectional traffic sequence;
[0009] (2) Construct the message cluster representation unit of the traffic based on bidirectional message analysis: divide the preprocessed traffic into multiple message clusters according to a fixed time window, extract the uplink and downlink message sub-clusters of each cluster, calculate the statistical characteristics of each sub-cluster, and extract the header and the head and tail ciphertext fragments of each message to form a bidirectional message cluster structure; then convert the message cluster structure into a token sequence by the byte pair encoding (BPE) method;
[0010] (3) Input the processed message cluster representation units into the redesigned BERT model, and pre-train the BERT model using the mask prediction task and the same-source message cluster prediction task; then fine-tune and test the pre-trained model on HTTP/3 traffic carrying traffic type labels, and output the optimized model.
[0011] Furthermore, step (1) specifically includes the following sub-steps:
[0012] (1.1) Collect raw HTTP/3 encrypted traffic data;
[0013] $F = \{p_1, p_2, \ldots, p_n\}$
[0014] where $F$ denotes a collected raw encrypted traffic flow, $p_i$ denotes the $i$-th data packet in the flow, and $n$ denotes the total number of data packets contained in the flow;
[0015] (1.2) Determine whether the encrypted payload length of each HTTP/3 data packet is greater than 0 bytes in order to identify empty payload packets, and filter out packets with a length of 0. These packets are ACK confirmation packets or keep-alive packets and do not carry actual application data.
[0016] $F' = \{\, p_i \in F \mid \mathrm{len}(c_i) > 0 \,\}$
[0017] where $F'$ denotes the set of valid data packets retained after removing empty payload data packets, $c_i$ denotes the encrypted application-layer payload byte sequence of data packet $p_i$, and $\mathrm{len}(\cdot)$ is the function that calculates the length of a byte sequence.
[0018] (1.3) Mask the plaintext fields in each message that may reveal environmental information, including the source and destination IP addresses, the source and destination ports, and the Server Name Indication (SNI) and digital certificate information in the TLS handshake.
[0019] Furthermore, step (2) specifically includes the following sub-steps:
[0020] (2.1) Divide the traffic sequence into message clusters according to a predetermined time window. Each message cluster corresponds to a time window of fixed duration T. The uplink sub-cluster of a message cluster contains all messages sent from the client to the server within the window, and the downlink sub-cluster contains all messages sent from the server to the client within the window.
[0021] (2.2) Calculate the statistical characteristics of each sub-cluster, including the number of messages in the sub-cluster, the total number of message bytes, the mean and standard deviation of message size, and the average inter-arrival time of messages. Then prepend the statistical characteristics to the messages of the sub-cluster.
[0022] (2.3) Replace each message in the sub-cluster with a representation consisting of its plaintext header and the first 8 bytes and the last 8 bytes of its encrypted payload;
[0023] $u_i = h_i \,\Vert\, c_i[0:8] \,\Vert\, c_i[-8:]$
[0024] where $u_i$ denotes the representation of the $i$-th data packet obtained after feature extraction and recombination, $h_i$ denotes the header of the raw data packet $p_i$, $\Vert$ denotes concatenation, and $[0:8]$ and $[-8:]$ are sequence slicing operations that extract the first 8 bytes and the last 8 bytes of the encrypted payload sequence $c_i$, respectively.
[0025] (2.4) Convert the message cluster representation unit into a token sequence using the vocabulary obtained by byte pair encoding (BPE).
[0026] Furthermore, step (3) specifically includes the following sub-steps:
[0027] (3.1) When inputting training data, four special tokens, [CLS], [SEP], [MASK] and [PAD], are used to process the input tokens. [CLS] is added at the head of each input and carries no semantics of its own; its output serves as the basis for the classification task during inference. [SEP] is added between the two sub-clusters as the separator between their feature units. [MASK] is used to randomly cover tokens at different positions during training. [PAD] is used to fill the input sequence so that the token inputs of the two sub-clusters are of equal length.
[0028] (3.2) A relative position encoding is introduced for each token, and two preset segment embeddings are used to distinguish the token embeddings of the two sub-clusters. The input vector of each token is the sum of its token embedding, its position encoding, and its segment encoding.
[0029] (3.3) The pre-training process sets up a mask prediction task and a same-source message cluster prediction task. The mask prediction task (MLM) randomly selects a certain proportion of tokens from the input token sequence and masks them; the processed sequence is fed into the BERT model, and the model is trained to recover the masked tokens. The same-source message cluster prediction task (SBP) inputs two message cluster token sequences into the BERT model and performs binary classification; the model is trained to determine whether the sequence pair comes from the same source.
[0030] (3.4) After pre-training, supervised fine-tuning is performed on the model. An HTTP/3 traffic dataset labeled with application categories is prepared. Message cluster representation extraction and token sequence encoding are applied to the labeled data in the same way as in pre-training, and the resulting token sequences are input into the pre-trained BERT model. A classification layer is added on top of the model. It takes as input the hidden vector that BERT outputs for the [CLS] token, and its output dimension equals the predetermined number of categories, representing the confidence that the input traffic belongs to each category. All model parameters are updated by gradient-based training on the training set so that the [CLS] vector is correctly mapped to the corresponding traffic category.
[0031] Compared with the prior art, the present invention has the following advantages and beneficial effects:
[0032] (1) Considering the impact of empty payload messages and message header information on model convergence, the data packets are preprocessed so that they better suit the downstream classification task.
[0033] (2) A novel traffic representation scheme is designed. Instead of feeding the full ciphertext of each message into the model, it combines statistical information with partial ciphertext, which reduces model cost and speeds up inference.
[0034] (3) The BERT neural network is used to process the high-dimensional feature data of encrypted traffic. The model supports parallel computing, is suitable for concurrent and high-speed network environments, converges quickly, and classifies encrypted traffic by application well, which addresses the difficulty of distinguishing applications with similar functions.
[0035] Overall, this invention mines encrypted traffic sequence information in depth through traffic feature preprocessing, the message cluster traffic representation scheme, and the BERT neural network architecture. It can achieve high-precision classification and identification of HTTP/3 protocol traffic, which is of great significance for strengthening network service supervision and protecting user privacy.
Attached Figure Description
[0036] Figure 1 is a flowchart of the method of the present invention.
[0037] Figure 2 is a diagram illustrating the construction of a message cluster representation unit based on the QUIC protocol in this invention.
[0038] Figure 3 is a diagram illustrating the pre-training method based on the BERT model in this invention.
[0039] Figure 4 is a diagram illustrating the fine-tuning method based on the BERT model in this invention.
Detailed Implementation
[0040] The following detailed description of the technical solutions provided by the present invention will be based on specific embodiments. It should be understood that the following specific embodiments are only used to illustrate the present invention and are not intended to limit the scope of the present invention.
[0041] As shown in the figure, this invention proposes an intelligent identification method for HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed. It consists of three parts: a traffic feature preprocessing module, a traffic representation unit construction module, and a BERT-based encrypted application classification module. The first part, the traffic feature preprocessing module, removes empty payload packets and masks header information in the traffic, preparing it for the downstream traffic classification task. The second part, the traffic representation unit construction module, aggregates the key statistical information of each message cluster into one-dimensional binary data, concatenates it with the ciphertext fragments taken from the head and tail of each message, and removes the invalid ciphertext, so that the unit represents both per-message information and the hidden relationships within the flow. The third part, the encrypted application classification model training module, uses BERT, a classifier with a Transformer architecture, to classify encrypted traffic by application. The model is first pre-trained in a self-supervised manner on the representation units from the second part, and is then fine-tuned on the traffic representations extracted in the second part for 10 types of business traffic and 100 types of pure QUIC protocol traffic, so that it can classify and identify encrypted application traffic.
[0042] Specifically, the present invention includes the following steps:
[0043] (1) Obtain the raw HTTP/3 encrypted traffic data and preprocess it. Remove all empty payload packets whose payload size is 0 (such as packets containing only ACK confirmations or keep-alive signals), filtering out noisy traffic that carries no valid application data. At the same time, mask the environment-related plaintext fields carried in the traffic, including but not limited to the source IP address, destination IP address, source port, destination port, Server Name Indication (SNI), and digital certificate information, by setting these fields to zero with a preset mask so that the model cannot memorize environmental features. After this filtering and masking, the denoised HTTP/3 bidirectional traffic sequence is obtained.
[0044] The specific process for this step is as follows:
[0045] (1.1) Collect raw HTTP/3 encrypted traffic data;
[0046] $F = \{p_1, p_2, \ldots, p_n\}$
[0047] where $F$ denotes a collected raw encrypted traffic flow, $p_i$ denotes the $i$-th data packet in the flow, and $n$ denotes the total number of data packets contained in the flow;
[0048] (1.2) Empty payload packets are identified by checking whether the encrypted payload length of each HTTP/3 data packet is greater than 0 bytes, and messages with a length of 0 are filtered out. These messages are usually ACK confirmation packets or keep-alive packets and do not carry actual application data, so removing them reduces the interference of invalid traffic with subsequent analysis.
[0049] $F' = \{\, p_i \in F \mid \mathrm{len}(c_i) > 0 \,\}$
[0050] where $F'$ denotes the set of valid data packets retained after removing empty payload data packets, $c_i$ denotes the encrypted application-layer payload byte sequence of data packet $p_i$, and $\mathrm{len}(\cdot)$ is the function that calculates the length of a byte sequence.
[0051] (1.3) Mask the plaintext fields in each message that may reveal environmental information, including the source and destination IP addresses, the source and destination ports, and information such as the Server Name Indication (SNI) and digital certificates in the TLS handshake. Masking sets these parts to zero while leaving the rest unchanged.
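Sub-steps (1.1) to (1.3) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Packet` record and all of its field names are hypothetical, packets are assumed to be already parsed from a capture, and certificate masking is omitted because the text does not say how certificate bytes are located.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical parsed packet record; every field name is illustrative."""
    timestamp: float            # arrival time in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    from_client: bool           # direction flag, kept because step (2) needs it
    header: bytes               # plaintext header bytes
    payload: bytes              # encrypted application-layer payload
    sni: Optional[str] = None   # only present on handshake packets


def drop_empty_payload(flow: List[Packet]) -> List[Packet]:
    """Step (1.2): keep only packets whose encrypted payload is non-empty."""
    return [p for p in flow if len(p.payload) > 0]


def mask_environment_fields(p: Packet) -> Packet:
    """Step (1.3): zero the environment-related fields, leave the rest unchanged."""
    return replace(p, src_ip="0.0.0.0", dst_ip="0.0.0.0",
                   src_port=0, dst_port=0, sni=None)


def preprocess(flow: List[Packet]) -> List[Packet]:
    """Filter first, then mask, yielding the denoised bidirectional sequence."""
    return [mask_environment_fields(p) for p in drop_empty_payload(flow)]
```

Keeping the direction flag outside the masked fields is a design choice of this sketch: the addresses are zeroed, so the later uplink/downlink split has to rely on a separately recorded direction.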
[0052] (2) Based on bidirectional message analysis, a message cluster representation unit for traffic is constructed. The cleaned traffic is divided into multiple message clusters according to a fixed time window. The message sub-clusters in the uplink and downlink directions of each cluster are extracted. The statistical characteristics of the sub-clusters are calculated and the header and end ciphertext fragments of the message are extracted to form a bidirectional message cluster structure. The message cluster structure is converted into a token sequence by the byte pair encoding (BPE) method.
[0053] The specific process for this step is as follows:
[0054] (2.1) The traffic is segmented using a fixed time window of length T. Each HTTP/3 flow is divided, in order of occurrence, into multiple consecutive time segments (message clusters), and each message cluster contains all data packets arriving within its time window T. When the remaining duration of a flow is shorter than the last time window, the last message cluster can be padded as appropriate.
[0055] (2.2) Subsequently, each message cluster is split according to the transmission direction of the messages to obtain an uplink sub-cluster and a downlink sub-cluster. The uplink sub-cluster contains all messages sent by the client to the server in that cluster, while the downlink sub-cluster contains all messages returned by the server to the client. Through this bidirectional separation, the original mixed bidirectional message sequence is transformed into a structured bi-sub-cluster representation.
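The windowing and direction split of sub-steps (2.1) and (2.2) might look like the sketch below. The `Pkt` tuple and the function name are hypothetical, windows are measured from the first packet of the flow, and windows that contain no packets are simply skipped; the text does not specify either of those last two choices.

```python
from typing import Dict, List, NamedTuple, Tuple


class Pkt(NamedTuple):
    """Hypothetical minimal packet view used only for this illustration."""
    timestamp: float
    from_client: bool
    size: int


def split_into_clusters(flow: List[Pkt], window: float) -> List[Tuple[List[Pkt], List[Pkt]]]:
    """Return one (uplink, downlink) pair of sub-clusters per non-empty window."""
    if not flow:
        return []
    ordered = sorted(flow, key=lambda p: p.timestamp)
    start = ordered[0].timestamp
    buckets: Dict[int, Tuple[List[Pkt], List[Pkt]]] = {}
    for p in ordered:
        idx = int((p.timestamp - start) // window)   # index of the window of length T
        up, down = buckets.setdefault(idx, ([], []))
        (up if p.from_client else down).append(p)    # split by transmission direction
    return [buckets[i] for i in sorted(buckets)]
```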
[0056] (2.3) Calculate the key statistical characteristics of each subcluster, including the number of messages within the subcluster, the total message size, the mean and standard deviation of message sizes, and the average time interval between message arrivals. These statistics reflect the overall communication interactions within the time window, and are combined and added to the beginning of each subcluster.
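As an illustration of the five statistics listed above, the following sketch computes them for one sub-cluster. The function name is hypothetical, and two details are assumptions not stated in the text: the population standard deviation is used, and an empty or single-message sub-cluster reports zeros where a value is undefined.

```python
import statistics
from typing import Sequence, Tuple


def subcluster_stats(sizes: Sequence[int],
                     times: Sequence[float]) -> Tuple[int, int, float, float, float]:
    """Return (count, total bytes, mean size, std of size, mean inter-arrival time)."""
    n = len(sizes)
    if n == 0:
        return (0, 0, 0.0, 0.0, 0.0)
    total = sum(sizes)
    mean = total / n
    std = statistics.pstdev(sizes)                       # population std (assumption)
    gaps = [b - a for a, b in zip(times, times[1:])]     # consecutive arrival gaps
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return (n, total, mean, std, mean_gap)
```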
[0057] (2.4) To capture finer-grained message content patterns, part of the byte data of each encrypted message in the sub-cluster is extracted for representation. Preferably, for each HTTP/3 message the present invention retains the message header together with the first 8 bytes and the last 8 bytes of the encrypted payload. The header, the payload prefix bytes, and the payload suffix bytes are concatenated to form the truncated representation of the message. This truncation strategy greatly reduces the data dimensionality while retaining key message information.
[0058] (2.5) Place the statistical characteristics of the uplink subcluster at the beginning of the uplink subcluster sequence, followed by each truncated message within that subcluster in sequence; similarly, place the statistical characteristics of the downlink subcluster at the beginning of the downlink subcluster sequence, followed by the downlink truncated message sequence. This yields the uplink subcluster representation sequence and the downlink subcluster representation sequence. Finally, treat the uplink and downlink sequences as a whole to represent the bidirectional characteristics of the current message cluster, thus obtaining a complete message cluster representation unit.
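The truncation and assembly described in sub-steps (2.4) and (2.5) can be sketched as below. All function names are hypothetical. The text does not say how the statistics are serialized, so packing them as two unsigned 32-bit integers and three 32-bit floats with `struct` is purely an assumption of this sketch.

```python
import struct
from typing import List, Tuple


def truncate_packet(header: bytes, payload: bytes, k: int = 8) -> bytes:
    """Header + first k + last k payload bytes (slices overlap if payload < 2k)."""
    return header + payload[:k] + payload[-k:]


def encode_stats(stats: Tuple[int, int, float, float, float]) -> bytes:
    """Serialize the five statistics to 20 bytes (layout is an assumption)."""
    n, total, mean, std, gap = stats
    return struct.pack("!IIfff", n, total, mean, std, gap)


def build_subcluster(stats: Tuple[int, int, float, float, float],
                     packets: List[Tuple[bytes, bytes]]) -> bytes:
    """Statistics first, then each truncated (header, payload) message in order."""
    return encode_stats(stats) + b"".join(truncate_packet(h, c) for h, c in packets)


def build_cluster_unit(up: bytes, down: bytes) -> Tuple[bytes, bytes]:
    """A message cluster representation unit: the uplink and downlink sequences."""
    return (up, down)
```

For a 1200-byte payload this keeps 16 payload bytes plus the header, which is where the reduction in input size comes from.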
[0059] (2.6) A vocabulary is pre-learned on a large amount of training traffic byte data using the BPE algorithm. Then, the byte fragment sequence in the message cluster representation unit is mapped to the index sequence in this vocabulary in turn, and the message cluster representation unit is converted into a token sequence.
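To make the BPE step concrete, here is a toy byte-level BPE trainer and encoder. It is a teaching sketch under stated assumptions, not the tokenizer of the invention: ids 0 to 255 stand for raw bytes, merged tokens get ids from 256 upward, and training stops once the most frequent pair occurs fewer than twice. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(seq: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: List[bytes], num_merges: int) -> List[Tuple[int, int]]:
    """Learn up to `num_merges` merges from byte sequences; merge k gets id 256 + k."""
    seqs = [list(b) for b in corpus]
    merges: List[Tuple[int, int]] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))          # count adjacent token pairs
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:                            # nothing worth merging any more
            break
        new_id = 256 + len(merges)
        merges.append(best)
        seqs = [merge_pair(s, best, new_id) for s in seqs]
    return merges


def encode(data: bytes, merges: List[Tuple[int, int]]) -> List[int]:
    """Map a byte fragment to token ids by replaying the merges in learned order."""
    seq = list(data)
    for k, pair in enumerate(merges):
        seq = merge_pair(seq, pair, 256 + k)
    return seq
```

In practice the vocabulary would be learned once on a large traffic corpus and then frozen, so that pre-training and fine-tuning see the same token ids.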
[0060] (3) Using BERT, a classifier based on the Transformer architecture, two pre-training tasks are designed: mask prediction and same-source sub-cluster feature unit judgment. The message cluster representation units formed in step (2) are input to pre-train the model. Labeled HTTP/3 traffic at two levels of granularity is then input to fine-tune and test the pre-trained model.
[0061] The specific process for this step is as follows:
[0062] (3.1) Add special tokens to the token sequence: add a classification token [CLS] at the beginning of the sequence, whose encoding vector is used to represent the aggregate features of the entire sequence; insert a separator token [SEP] between the uplink sub-cluster and downlink sub-cluster sequences to mark the sub-cluster boundary; during pre-training, temporarily replace the tokens that are to be masked and predicted with the [MASK] token; and for sequences shorter than the maximum sequence length L, pad the end with the [PAD] token until the length reaches L.
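A minimal sketch of this input layout follows. The numeric ids chosen for the four special tokens are hypothetical, ordinary token ids are assumed to start at 4, and truncating over-long inputs to L is an assumption, since the text only describes padding.

```python
from typing import List

# Hypothetical ids for the special tokens; ordinary tokens are assumed to start at 4.
PAD, CLS, SEP, MASK = 0, 1, 2, 3


def build_input(up: List[int], down: List[int], max_len: int) -> List[int]:
    """[CLS] uplink tokens [SEP] downlink tokens, padded with [PAD] to max_len."""
    seq = [CLS] + up + [SEP] + down
    seq = seq[:max_len]                       # truncation is an assumption
    return seq + [PAD] * (max_len - len(seq))


def attention_mask(seq: List[int]) -> List[int]:
    """1 for real tokens, 0 for padding, so padding can be ignored by attention."""
    return [0 if t == PAD else 1 for t in seq]
```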
[0063] (3.2) In the input layer, the token sequence is first processed by the embedding layer, including: token embedding, which converts each token index into a corresponding vector representation; position embedding, which adds the position encoding vector of each position in the sequence to the corresponding token vector to inject position information; and segment embedding, which is an identifier vector that can be used to distinguish uplink / downlink subclusters and is added to the corresponding vector.
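The three-way embedding sum can be illustrated in plain Python as follows. The tables are randomly initialized stand-ins for learned parameters, the position table is indexed by absolute position (one possible choice among several), and all names are hypothetical.

```python
import random
from typing import List


def make_table(rows: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Stand-in for a learned embedding table with small random values."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def segment_ids(tokens: List[int], sep_id: int) -> List[int]:
    """0 up to and including the first [SEP] (uplink side), 1 afterwards (downlink)."""
    ids, seg = [], 0
    for t in tokens:
        ids.append(seg)
        if t == sep_id:
            seg = 1
    return ids


def embed(tokens: List[int], tok_table: List[List[float]], pos_table: List[List[float]],
          seg_table: List[List[float]], sep_id: int) -> List[List[float]]:
    """Per position: token embedding + position embedding + segment embedding."""
    segs = segment_ids(tokens, sep_id)
    return [
        [a + b + c for a, b, c in zip(tok_table[t], pos_table[i], seg_table[s])]
        for i, (t, s) in enumerate(zip(tokens, segs))
    ]
```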
[0064] (3.3) A portion of the tokens in the input token sequence is randomly selected at a fixed ratio (15%) for masking. Most of the selected tokens are replaced with the special token [MASK], while a small portion are replaced with random other tokens or left unchanged. The processed sequence is then fed into the BERT model, which must predict the original tokens at the masked positions from the sequence context. Minimizing the prediction error drives the model to learn the relationships between different fields and byte segments in the message sequence.
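The masking procedure can be sketched as below. The 15% ratio comes from the text; the 80/10/10 split between [MASK], a random token, and the unchanged token is the split used in the original BERT recipe and is only assumed here, since the text says "most" and "a small portion" without numbers. The ignore label -100 and all names are likewise illustrative.

```python
import random
from typing import List, Set, Tuple

IGNORE = -100  # label for positions that do not contribute to the loss (convention)


def mask_tokens(tokens: List[int], vocab_size: int, mask_id: int,
                special_ids: Set[int], ordinary_start: int,
                rng: random.Random, ratio: float = 0.15) -> Tuple[List[int], List[int]]:
    """Return (model inputs, labels); labels hold the original token where masked."""
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in special_ids]
    k = max(1, round(len(candidates) * ratio)) if candidates else 0
    for i in rng.sample(candidates, k):
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:                                   # assumed share: replace with [MASK]
            inputs[i] = mask_id
        elif r < 0.9:                                 # assumed share: random ordinary token
            inputs[i] = rng.randrange(ordinary_start, vocab_size)
        # otherwise the token is left unchanged but still has to be predicted
    return inputs, labels
```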
[0065] (3.4) The model determines whether the two sequences of an uplink and downlink sub-cluster pair originate from the same communication flow. During training, several kinds of samples are generated: uplink and downlink sub-clusters from the same time window in their natural order; uplink and downlink sub-clusters from the same time window in reversed order; and a sub-cluster paired with a sub-cluster from a different time window. The model takes the hidden vector at [CLS] as input and performs binary classification, where y=0 indicates the same source and y=1 indicates different sources. The training objective is to push the output for true same-source pairs toward 0 and the output for different-source pairs toward 1. To judge correctly, the model must attend to how the two sequences correlate in content and temporal order, and it thereby learns higher-level temporal patterns and causal relationships in the traffic.
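One way to generate the three kinds of training pairs is sketched below, using the label convention from the text (0 for same-source, i.e. homologous, and 1 otherwise). The text does not state which label the reversed-order pair receives, so this sketch leaves it to the caller as `reversed_label`; the function and type names are hypothetical.

```python
import random
from typing import List, Tuple

Cluster = Tuple[List[int], List[int]]        # (uplink tokens, downlink tokens)
Sample = Tuple[List[int], List[int], int]    # (first sequence, second sequence, label)


def make_sbp_samples(clusters: List[Cluster], reversed_label: int,
                     rng: random.Random) -> List[Sample]:
    """Build natural-order, reversed-order, and cross-window pairs per cluster."""
    samples: List[Sample] = []
    for i, (up, down) in enumerate(clusters):
        samples.append((up, down, 0))                  # same window, natural order
        samples.append((down, up, reversed_label))     # same window, reversed order
        others = [j for j in range(len(clusters)) if j != i]
        if others:
            j = rng.choice(others)
            samples.append((up, clusters[j][1], 1))    # paired with another window
    return samples
```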
[0066] (3.5) During pre-training, mask prediction and same-source sub-cluster identification are performed jointly. Within a training batch, some samples can be used for mask prediction and others for same-source sub-cluster identification, or both tasks can be applied to the same input. The total loss of the model is usually set to the sum of the losses of the two tasks. Through this multi-task joint training, the model accounts for both local byte patterns and global sequence consistency when learning features.
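Written out, the joint objective could take the following form. The text only states that the total loss is the sum of the two task losses; the cross-entropy forms below are the usual choice for these tasks and are an assumption. Here $M$ is the set of masked positions, $\tilde{x}$ the masked input, $x_i$ the original token at position $i$, and $\hat{y}$ the predicted probability that a pair is from different sources ($y = 1$).

```latex
\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)

\mathcal{L}_{\mathrm{SBP}} = -\left[\, y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right]

\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{SBP}}
```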
[0067] (3.6) After pre-training, this invention applies the model to a practical encrypted traffic classification task through supervised fine-tuning. Specifically, an HTTP/3 traffic dataset labeled with application categories is prepared. The labeled data are processed with the same steps as in pre-training to extract message cluster representations and encode token sequences, and the resulting token sequences are input into the pre-trained BERT model. A classification layer is added on top of the model. It takes as input the hidden vector that BERT outputs for the [CLS] token, and its output dimension equals the predetermined number of categories, representing the confidence that the input traffic belongs to each category. All model parameters are updated by gradient-based training on the training set so that the [CLS] vector is correctly mapped to the corresponding traffic category. The pre-trained parameters are adjusted as well, but with an appropriately reduced learning rate, so that the general features learned during pre-training are retained while the model adapts to the specific task.
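The classification layer described here can be illustrated with a single linear map followed by a softmax. Using softmax to turn the outputs into confidences is an assumption of this sketch, as are the function names; the weights would be learned during fine-tuning rather than fixed by hand.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def classify(cls_vec: List[float], weights: List[List[float]],
             bias: List[float]) -> List[float]:
    """Map the [CLS] hidden vector to one confidence value per traffic category."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

The predicted category is the index of the largest confidence; the number of rows in `weights` equals the predetermined number of categories.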
[0068] It should be noted that the above embodiments are not intended to limit the scope of protection of the present invention. Equivalent transformations or substitutions made based on the above technical solutions all fall within the scope of protection of the claims of the present invention.
Claims
1. A method for intelligent identification of HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed, characterized in that it comprises the following steps:
(1) Preprocess the original HTTP/3 encrypted traffic, remove empty payload data packets with a payload length of 0, and mask the environment-related plaintext fields in the traffic to obtain a noise-reduced bidirectional traffic sequence;
(2) Based on bidirectional message analysis, construct a message cluster representation unit for the traffic: the preprocessed traffic is divided into multiple message clusters according to a fixed time window; the message sub-clusters in the uplink and downlink directions of each cluster are extracted; the statistical characteristics of the sub-clusters are calculated and the header and the head and tail ciphertext fragments of each message are extracted to form a bidirectional message cluster structure; and the message cluster structure is converted into a token sequence by the byte pair encoding (BPE) method;
(3) Input the processed message cluster representation unit into the redesigned BERT model, and pre-train the BERT model using the mask prediction task and the same-source message cluster prediction task; optimize and test the pre-trained model by inputting HTTP/3 protocol traffic with traffic type labels, and output the optimized model.
2. The method for intelligent identification of HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed according to claim 1, characterized in that step (1) specifically includes the following sub-steps:
(1.1) Collect raw HTTP/3 encrypted traffic data:
$F = \{p_1, p_2, \ldots, p_n\}$
where $F$ denotes a collected raw encrypted traffic flow, $p_i$ denotes the $i$-th data packet in the flow, and $n$ denotes the total number of data packets contained in the flow;
(1.2) Determine whether the encrypted payload length of each HTTP/3 data packet is greater than 0 bytes to identify empty payload packets, and filter out packets with a length of 0, these packets being ACK confirmation packets or keep-alive packets that do not carry actual application data:
$F' = \{\, p_i \in F \mid \mathrm{len}(c_i) > 0 \,\}$
where $F'$ denotes the set of valid data packets retained after removing empty payload data packets, $c_i$ denotes the encrypted application-layer payload byte sequence of data packet $p_i$, and $\mathrm{len}(\cdot)$ is the function that calculates the length of a byte sequence;
(1.3) Mask the plaintext fields in each message that may reveal environmental information, including the source and destination IP addresses, the source and destination ports, and the Server Name Indication (SNI) and digital certificate information in the TLS handshake process.
3. The method for intelligent identification of HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed according to claim 1, characterized in that step (2) specifically includes the following sub-steps:
(2.1) Divide the traffic sequence into message clusters according to a predetermined time window, each message cluster corresponding to a time window of fixed duration T; the uplink sub-cluster of a message cluster contains all messages sent from the client to the server within the window, and the downlink sub-cluster contains all messages sent from the server to the client within the window;
(2.2) Calculate the statistical characteristics, including the number of messages in the sub-cluster, the total number of message bytes, the mean and standard deviation of message size, and the average inter-arrival time of messages, and then prepend the statistical characteristics to the messages of the sub-cluster;
(2.3) Replace each message in the sub-cluster with a representation consisting of its plaintext header and the first 8 bytes and the last 8 bytes of its encrypted payload:
$u_i = h_i \,\Vert\, c_i[0:8] \,\Vert\, c_i[-8:]$
where $u_i$ denotes the representation of the $i$-th data packet obtained after feature extraction and recombination, $h_i$ denotes the header of the raw data packet $p_i$, $\Vert$ denotes concatenation, and $[0:8]$ and $[-8:]$ are sequence slicing operations that extract the first 8 bytes and the last 8 bytes of the encrypted payload sequence $c_i$, respectively;
(2.4) Convert the message cluster representation unit into a token sequence using the vocabulary obtained by byte pair encoding (BPE).
4. The method for intelligent identification of HTTP/3 encrypted traffic based on message cluster representation units with empty payloads removed according to claim 1, characterized in that step (3) specifically includes the following sub-steps:
(3.1) When inputting training data, four special tokens, [CLS], [SEP], [MASK] and [PAD], are used to process the input tokens; [CLS] is added at the head of each input, carries no semantics of its own, and serves as the basis for the classification task during inference; [SEP] is added between the two sub-clusters as the separator between their feature units; [MASK] is used to randomly cover tokens at different positions during training; [PAD] is used to fill the input sequence so that the token inputs of the two sub-clusters are of equal length;
(3.2) A relative position encoding is introduced for each token, and two preset segment embeddings are used to distinguish the token embeddings of the two sub-clusters; the input vector of each token is the sum of its token embedding, its position encoding, and its segment encoding;
(3.3) The pre-training process sets up a mask prediction task and a same-source message cluster prediction task; the mask prediction task (MLM) randomly selects a certain proportion of tokens from the input token sequence and masks them, the processed sequence is fed into the BERT model, and the model is trained to recover the masked tokens; the same-source message cluster prediction task (SBP) inputs two message cluster token sequences into the BERT model and performs binary classification, and the model is trained to determine whether the sequence pair comes from the same source;
(3.4) After pre-training, supervised fine-tuning is performed on the model; an HTTP/3 traffic dataset labeled with application categories is prepared; message cluster representation extraction and token sequence encoding are applied to the labeled data in the same way as in pre-training; the resulting token sequences are input into the pre-trained BERT model; a classification layer is added on top of the model, which takes as input the hidden vector that BERT outputs for the [CLS] token and whose output dimension equals the predetermined number of categories, representing the confidence that the input traffic belongs to each category; all model parameters are updated by gradient-based training on the training set so that the [CLS] vector is correctly mapped to the corresponding traffic category.