A cross-server early identification method for encrypted services based on unidirectional flow key state focusing

By using a key state focusing method based on unidirectional flow and training forward and backward flow features using a hidden Markov model, the stability and speed issues of cross-server encrypted service identification are solved, enabling early identification in real network environments.

CN122247749A · Pending · Publication Date: 2026-06-19 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: SOUTHEAST UNIV
Filing Date: 2026-05-01
Publication Date: 2026-06-19

Abstract

This invention discloses a cross-server early identification method for encrypted services based on unidirectional flow key state focusing, which identifies encrypted services of the same type across servers. First, the method groups captured internet traffic into five-tuple flows and filters out short flows. Based on the server IP address and port corresponding to each encrypted service, it aggregates five-tuple flows of the same service type, filters out service types with too few flows, and divides the remaining five-tuple flows into a forward flow set and a backward flow set. Second, it calculates the transport layer payload length sequence of each unidirectional flow, extracts feature vectors, and trains separate Hidden Markov Models (HMMs) on the feature vector sets of the forward and backward flows. Third, it performs statistical analysis on these feature vector sets and combines the results with the trained HMMs to mine the key sub-vectors that contribute most to the identification results, then optimizes the HMMs accordingly. Finally, it collects traffic from other servers, extracts features, and identifies their encrypted service types with the optimized HMMs.

Description

Technical Field

[0001] This invention relates to an early identification method for cross-server encrypted services based on unidirectional flow key state focusing, and belongs to the field of computer network security technology.

Background Technology

[0002] With the widespread deployment of TLS, HTTPS, QUIC, and other encrypted transport protocols in internet services, the amount of plaintext payload that can be directly parsed during communication is gradually decreasing. Traditional service identification methods that rely on port number mapping or deep packet inspection (DPI) face practical challenges. On the one hand, the fixed one-to-one correspondence between port numbers and service types is no longer maintained. Many network services operate on non-standard ports, making it difficult to accurately distinguish specific network services based solely on port numbers. On the other hand, DPI methods rely on parsing the plaintext fields, keywords, or message formats of the protocol. However, in end-to-end encrypted scenarios, most valid fields are transmitted encrypted, and intermediate nodes cannot directly read the protocol information, rendering traditional deep packet inspection methods ineffective.

[0003] To address the problem of encrypted service identification, existing technologies have proposed various solutions, which can be mainly divided into two categories: rule-based matching and classification model-based service identification methods. Among rule-based matching methods, existing invention patents such as "Server Port Service Identification Method and Apparatus" actively probe and obtain response messages from a specified port of the server to be detected. They extract preset field information, concatenate them into a string, and then calculate the hash value of the concatenated string and match it with a pre-collected JA3S fingerprint database to identify network services. Another method, "A Network Service Identification Method, Apparatus, Device, and Medium," records the first and second data packets sent by the server after the connection is established, and then formulates a rule base for the target network service based on clustering. These methods mostly rely on fixed-location messages, explicit handshake fields, or response content. However, with the deployment of fully encrypted protocols such as QUIC, the plaintext fields exposed by encrypted service traffic are decreasing.

[0004] In classification-based identification methods, existing invention patents such as "Port Service Identification Method, Device, Electronic Device, and Readable Storage Medium" learn the semantic information of service traffic through large language models to identify service types. "An Encrypted Tunnel Identification Method and System Based on Feature Manifold Coherence Scoring" designs a method combining traffic temporal features, variational autoencoders, and generative adversarial networks to identify specified encrypted tunnel services. These methods typically rely on the characteristics of data packets from both communicating parties. However, asymmetric routing is widespread in real-world network environments, where the forward flow sent by the client and the backward flow sent by the server often do not pass through the same collection node. Therefore, these methods cannot acquire the characteristics of bidirectional data packets in practical deployments and cannot identify encrypted network services in real-world network environments.

[0005] With the widespread adoption of microservice architecture, CDN acceleration, and load balancing technologies, the same network service typically runs in a distributed manner across multiple endpoints, regions, and even autonomous systems, rather than being fixed on a single server. For the same network service, different servers may introduce subtle traffic differences due to different proxies, middleware, and scheduling strategies. Existing identification methods often rely on fixed fingerprint databases and single-point sample training models, making it difficult to maintain stable performance when the deployment environment changes. Especially in real-world network environments, where it is known that a server is carrying a specific encrypted service, it is often necessary to further determine whether other unknown servers are also carrying the same service. This falls under the cross-server problem of identifying the same encrypted service. This problem requires the identification method to ignore individual node differences and capture stable traffic characteristics related to the service itself. Therefore, building a generalizable method for identifying the same encrypted service in a distributed deployment environment remains a weak point in existing methods.

[0006] In real-world network environments, especially in backbone networks, carrier metropolitan area networks, and large data center scenarios, asymmetric routing is prevalent. The forward and backward flows of the same session may traverse different forwarding paths, resulting in a single monitoring point often only obtaining a single-direction packet sequence, thus failing to capture the complete bidirectional session. Under this constraint, many identification methods relying on complete bidirectional characteristics or interaction behavior are difficult to apply directly. Therefore, when identifying encrypted services, how to extract stable features from limited time-series information and effectively suppress interference from abnormal link behavior, even when only observing unidirectional flows or incomplete bidirectional flows, has become a significant engineering problem that existing methods cannot avoid.

[0007] Furthermore, in core switching nodes, backbone network egress links, and large-scale bypass monitoring environments, the traffic volume is enormous and the number of concurrent connections is extremely high. Existing methods based on complex classification models rely on complete sessions, complex statistical values, or deep features, which not only places extremely high demands on caching and computational overhead but also makes it difficult to meet real-time detection and online requirements. Therefore, how to minimize the number of required packets and shorten the observation window while ensuring identification accuracy is key to deployment in real-world network environments.

[0008] In summary, existing methods for identifying encrypted services still face several key challenges in real-world network environments: (1) Existing methods rely on specific rule bases for fixed servers or collect traffic data to train identification models, making it difficult to achieve stable cross-server identification capabilities in distributed deployment scenarios; (2) Existing methods generally depend on complete bidirectional flows, failing to address the challenge of only unidirectional flows being visible due to asymmetric routing in real-world network environments; (3) Existing methods, based on long observation windows and complex features, struggle to achieve early and rapid identification in high-speed network scenarios. The cross-server encrypted service early identification method based on unidirectional flow key state focusing proposed in this invention can solve the challenge of identifying the same encrypted services deployed on different servers based on early traffic in real-world network environments.

Summary of the Invention

[0009] To address the aforementioned issues, this invention discloses an early identification method for cross-server encrypted services based on unidirectional flow key state focusing. This method can identify encrypted services of the same type deployed on different servers. First, the method collects internet traffic at a network node, groups it into five-tuple flows and filters out short flows, aggregates five-tuple flows with the same server IP address and port, and marks them as the same encrypted service type. It then filters out service types with too few flows and divides the remaining five-tuple flows into a forward flow set and a backward flow set based on the server IP address. Second, it calculates the transport layer payload length sequence of each unidirectional flow, extracts feature vectors, and trains separate Hidden Markov Models (HMMs) on the feature vector sets of the forward and backward flows. Third, it statistically analyzes each dimension of the feature vector sets of the forward and backward flows, combines the results with the trained HMMs, mines the key sub-vectors that yield the greatest gain for the identification result, and optimizes the HMMs accordingly. Finally, it collects traffic from the server to be identified, extracts features, and identifies its encrypted service type with the optimized HMMs.

[0010] To achieve the objectives of this invention, the technical solution is as follows: a method for early identification of cross-server encrypted services based on unidirectional flow key state focusing, comprising the following steps:

[0011] Step (1) Capture network traffic passing through the collection node. Group the captured packets into flows by five-tuple, and determine and record the server IP address and port of each five-tuple flow. Then detect and discard data packets with a transport layer payload length of 0, such as pure ACK packets. Discard five-tuple flows with fewer than 30 data packets; for flows with more than 30 data packets, retain only the first 30, yielding a clean set of encrypted service five-tuple flows with an equal number of data packets.

[0012] Step (2) From the five-tuple flows obtained in step (1), detect and remove flows with anomalies such as packet loss. Then mark five-tuple flows with the same server IP address and port as the same service type, and divide each five-tuple flow into a forward flow and a backward flow based on the server IP address. Count the forward flows and backward flows of each service type; if a service type has fewer than 200 unidirectional flows in a direction, discard that set of unidirectional flows.

[0013] Step (3) For the forward flow set and backward flow set obtained in step (2), the data packets are parsed layer by layer and the transport layer payload length sequence is calculated. After further processing, the feature vector is obtained, and the forward and backward hidden Markov models are trained respectively to obtain the identification model of the encrypted service.

[0014] Step (4) Statistical analysis is performed on the feature vector sets of the forward flow and the backward flow respectively. Combined with the completed forward and backward hidden Markov models obtained in step (3), the key feature sub-vectors that have the greatest gain on the recognition results are mined.

[0015] Step (5) In order to further improve the performance of the Hidden Markov Model, the trained Hidden Markov Model is optimized by combining the statistical analysis results of the feature vector set and the key feature sub-vectors mined in step (4).

[0016] Step (6) Collect the traffic of the server or network to be detected, and execute the processing steps (1), (2) and (3) in sequence to obtain the feature vector. Based on the optimized hidden Markov model, identify the encrypted service type of the extracted feature vector.

[0017] Furthermore, in step (1), the steps for obtaining a clean and equal-length set of encrypted service quintuple streams are as follows:

[0018] (1.1) Capture all passing network traffic at the designated collection node, extract the 5-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol) and calculate the HASH value as the unique identifier of a 5-tuple flow for grouping the flow, and then sort the data packets in each 5-tuple flow in chronological order.

[0019] (1.2) For the sorted 5-tuple stream, if the transport layer is TCP, record the destination IP address and destination port of the first SYN packet without the ACK flag as the server IP address and port; if the transport layer is UDP, record the destination IP address and destination port of the first data packet as the server IP address and port.

[0020] (1.3) For data packets with extremely low information content and a transport layer payload length of 0, in order to eliminate their interference with the recognition results, all 5-tuple streams are traversed and these data packets are discarded.

[0021] (1.4) Count the number of packets in each stream and discard short streams with fewer than 30 packets. If a 5-tuple stream has more than 30 packets and its transport layer is TCP, only the first 30 packets are kept. If its transport layer is UDP and no two consecutive packets are more than 15 seconds apart, only the first 30 packets are kept; if two consecutive packets are more than 15 seconds apart, the stream is split at that gap into two 5-tuple streams and step (1.4) is applied to each of them. The result is a clean set of 5-tuple streams of equal length.
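Steps (1.1), (1.3) and (1.4) can be sketched as follows. This is a minimal illustration and not the patented implementation: the `Packet` record, the function names, and the use of a direction-independent tuple as the flow key (in place of a hash such as FarmHash) are assumptions made for the example, and server endpoint detection from step (1.2) is omitted. Splitting a UDP stream at every gap longer than 15 seconds is used here as a simple stand-in for the repeated application of step (1.4).

```python
from collections import defaultdict
from typing import NamedTuple

MAX_PACKETS = 30      # packets kept per flow, from step (1.4)
UDP_GAP_SECONDS = 15  # UDP split threshold, from step (1.4)


class Packet(NamedTuple):  # hypothetical record type for this sketch
    ts: float          # capture timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str         # "TCP" or "UDP"
    payload_len: int   # transport layer payload length in bytes


def flow_key(p):
    """Direction-independent five-tuple key (assumption: both directions share one flow)."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def split_udp_on_gaps(packets):
    """Split a time-sorted UDP flow wherever the inter-packet gap exceeds 15 s."""
    segments, current = [], []
    for p in packets:
        if current and p.ts - current[-1].ts > UDP_GAP_SECONDS:
            segments.append(current)
            current = []
        current.append(p)
    if current:
        segments.append(current)
    return segments


def preprocess(packets):
    """Return equal-length flows: each is a list of exactly 30 payload-bearing packets."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)          # step (1.1): group by five-tuple
    result = []
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p.ts)          # step (1.1): chronological order
        pkts = [p for p in pkts if p.payload_len > 0]   # step (1.3)
        segments = split_udp_on_gaps(pkts) if key[2] == "UDP" else [pkts]
        for seg in segments:                   # step (1.4): drop short, truncate long
            if len(seg) >= MAX_PACKETS:
                result.append(seg[:MAX_PACKETS])
    return result
```

Using a sorted pair of endpoints as the key keeps client-to-server and server-to-client packets in the same flow, which the later forward/backward split relies on.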

[0022] Furthermore, in step (2), the steps for obtaining the forward flow set and backward flow set for each service type are as follows:

[0023] (2.1) For the set of pure five-tuple streams of equal length obtained in step (1.4), if its transport layer is TCP protocol, then check whether there is packet loss according to the TCP protocol field. If there is, discard the stream.

[0024] (2.2) Based on the server IP address and port recorded in step (1.2), aggregate the five-tuple flows with the same server IP address and port and mark them as the same service type. Then, divide the five-tuple flows into forward flows and backward flows based on the server IP address.

[0025] (2.3) In order to ensure the number of streams required for training, the number of forward and backward streams for each service type is counted, and the set of unidirectional streams with less than 200 streams is discarded.
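Steps (2.2) and (2.3) can be sketched as below. The tuple layout of a packet, the function names, and the choice to match on both server IP address and port when splitting directions are assumptions for illustration; the text itself only says the split is based on the server IP address.

```python
from collections import defaultdict

MIN_FLOWS = 200  # minimum unidirectional flows per service type, from step (2.3)


def split_directions(flow, server_ip, server_port):
    """Split one five-tuple flow into (forward, backward) packet lists.

    Each packet is a (src_ip, src_port, payload_len) tuple (hypothetical layout).
    Packets sent by the server endpoint form the backward flow.
    """
    forward = [p for p in flow if (p[0], p[1]) != (server_ip, server_port)]
    backward = [p for p in flow if (p[0], p[1]) == (server_ip, server_port)]
    return forward, backward


def build_training_sets(labelled_flows, min_flows=MIN_FLOWS):
    """labelled_flows: iterable of (server_ip, server_port, flow).

    Returns {service: {"forward": [...], "backward": [...]}}, keeping only the
    directions that have at least `min_flows` unidirectional flows.
    """
    sets = defaultdict(lambda: {"forward": [], "backward": []})
    for ip, port, flow in labelled_flows:
        fwd, bwd = split_directions(flow, ip, port)
        service = (ip, port)  # same server IP and port => same service type
        if fwd:
            sets[service]["forward"].append(fwd)
        if bwd:
            sets[service]["backward"].append(bwd)
    return {
        svc: {d: fl for d, fl in dirs.items() if len(fl) >= min_flows}
        for svc, dirs in sets.items()
    }
```

A service type whose forward set passes the threshold but whose backward set does not keeps only its forward set, matching the per-direction filtering described in step (2.3).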

[0026] Furthermore, in step (3), the steps for training the encrypted service identification model are as follows:

[0027] (3.1) For the forward flow set and backward flow set obtained in step (2.3), traverse all data packets of each unidirectional flow, decouple each layer protocol to calculate the transport layer payload length sequence, and obtain the transport layer payload length sequence set of the forward flow and backward flow.

[0028] (3.2) Each encryption service's forward and backward flows follow its communication rules; therefore, this method selects a Hidden Markov Model to learn the features of the encryption service. The length of the transport layer payload of a single data packet may fluctuate. For stability, this method divides the value space of the length sequence into 150 sub-intervals, each sub-interval being 10 bytes long. Each dimension of the transport layer payload length sequence is mapped to its corresponding sub-interval, thereby obtaining the feature vector set of the forward and backward flows.
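The interval mapping in step (3.2) amounts to integer division by the sub-interval width. The sketch below assumes the 150 sub-intervals cover payload lengths 0 to 1499 bytes and that any larger length is clipped into the last sub-interval; the clipping rule and the function name are assumptions, since the text does not say how out-of-range lengths are handled.

```python
BIN_WIDTH = 10   # bytes per sub-interval, from step (3.2)
NUM_BINS = 150   # number of sub-intervals, from step (3.2)


def quantize(lengths):
    """Map a transport layer payload length sequence to sub-interval indices."""
    # Lengths of 1500 bytes or more are clipped into the last bin (assumption).
    return [min(length // BIN_WIDTH, NUM_BINS - 1) for length in lengths]


feature_vector = quantize([517, 1460, 93, 24])
```

Two payloads that differ by a few bytes, for example 512 and 517, fall into the same sub-interval, which is the stabilizing effect the step aims for.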

[0029] (3.3) Based on the feature vector sets of the forward and backward flows in step (3.2), the Baum-Welch algorithm is used to train the parameters of the Hidden Markov Model. For a direction $d$ (forward or backward), let the observation sequence be $O = (o_1, o_2, \dots, o_T)$, the hidden state set be $S = \{s_1, s_2, \dots, s_N\}$, and the model parameters be $\lambda_d = (\pi, A, B)$. First, initialize $\pi$, $A$, and $B$, as shown in formulas (3-1), (3-2), and (3-3):

[0030] $$\pi = \{\pi_i\}, \quad \pi_i = P(q_1 = s_i) \tag{3-1}$$
$$A = \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i) \tag{3-2}$$
$$B = \{b_j(k)\}, \quad b_j(k) = P(o_t = v_k \mid q_t = s_j) \tag{3-3}$$

[0031] where $\pi_i$ is the probability that the initial hidden state is $s_i$, $a_{ij}$ is the probability of transitioning from hidden state $s_i$ to hidden state $s_j$, $b_j(k)$ is the probability that the feature vector value emitted in hidden state $s_j$ is $v_k$, and $q_t$ is the hidden state corresponding to the $t$-th dimension of the feature vector of an arbitrary flow;

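[0032] (3.4) Based on the initial hidden Markov model obtained in step (3.3), the forward probability $\alpha_t(i)$ and the backward probability $\beta_t(i)$ are calculated recursively, where $N$ is the number of hidden states, $T$ is the length of the observation sequence, and $b_j(o_t)$ is the probability of observing $o_t$ in hidden state $s_j$. The calculation process is shown in formulas (3-4), (3-5), (3-6), and (3-7):

[0033] $$\alpha_1(i) = \pi_i \, b_i(o_1) \tag{3-4}$$
$$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \Big] b_j(o_{t+1}) \tag{3-5}$$
$$\beta_T(i) = 1 \tag{3-6}$$
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j) \tag{3-7}$$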

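[0034] (3.5) Calculate the state posterior $\gamma_t(i)$ and the transition posterior $\xi_t(i, j)$ based on the forward and backward probabilities obtained in step (3.4). The calculation process is shown in formulas (3-8) and (3-9):

[0035] $$\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} \tag{3-8}$$
$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{k=1}^{N} \sum_{l=1}^{N} \alpha_t(k) \, a_{kl} \, b_l(o_{t+1}) \, \beta_{t+1}(l)} \tag{3-9}$$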

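[0036] (3.6) Then, based on all feature vectors, the model parameters are trained. With the superscript $(r)$ indexing the $R$ feature vectors in the training set, the training process is shown in equations (3-10), (3-11), and (3-12):

[0037] $$\pi_i = \frac{1}{R} \sum_{r=1}^{R} \gamma_1^{(r)}(i) \tag{3-10}$$
$$a_{ij} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T-1} \xi_t^{(r)}(i, j)}{\sum_{r=1}^{R} \sum_{t=1}^{T-1} \gamma_t^{(r)}(i)} \tag{3-11}$$
$$b_j(k) = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T} \mathbb{1}\big[o_t^{(r)} = v_k\big] \, \gamma_t^{(r)}(j)}{\sum_{r=1}^{R} \sum_{t=1}^{T} \gamma_t^{(r)}(j)} \tag{3-12}$$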

[0038] The process stops when the log-likelihood increment is less than a preset threshold τ or when the maximum number of iterations is reached, resulting in the forward discrete hidden Markov model and the backward discrete hidden Markov model, respectively.
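The training loop of steps (3.3) to (3.6) follows the standard Baum-Welch procedure for a discrete hidden Markov model, which can be sketched in plain Python as below. Several details are assumptions made for the example rather than taken from the text: random positive initialization, per-step scaling of the forward and backward variables for numerical stability, and a small `floor` added in the re-estimation step so that no probability becomes exactly zero. All function names are hypothetical.

```python
import math
import random


def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns (alpha, beta, scales, log-likelihood)."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    scales = [0.0] * T
    alpha[0] = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(T):
        if t > 0:
            alpha[t] = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                        for j in range(n)]
        scales[t] = sum(alpha[t])
        alpha[t] = [a / scales[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   / scales[t + 1] for i in range(n)]
    return alpha, beta, scales, sum(math.log(c) for c in scales)


def baum_welch_step(sequences, pi, A, B, floor=1e-6):
    """One re-estimation pass over all sequences; returns new (pi, A, B) and old log-likelihood."""
    n, m = len(pi), len(B[0])
    pi_acc = [0.0] * n
    a_num = [[0.0] * n for _ in range(n)]
    b_num = [[0.0] * m for _ in range(n)]
    total_ll = 0.0
    for obs in sequences:
        alpha, beta, scales, ll = forward_backward(obs, pi, A, B)
        total_ll += ll
        for t in range(len(obs)):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i]      # state posterior
                if t == 0:
                    pi_acc[i] += gamma
                b_num[i][obs[t]] += gamma
                if t < len(obs) - 1:
                    for j in range(n):                 # transition posterior
                        a_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / scales[t + 1])

    def normalise(row):
        total = sum(row) + floor * len(row)
        return [(x + floor) / total for x in row]

    return normalise(pi_acc), [normalise(r) for r in a_num], [normalise(r) for r in b_num], total_ll


def random_row(rng, size):
    row = [rng.random() + 0.1 for _ in range(size)]
    total = sum(row)
    return [x / total for x in row]


def train(sequences, n_states, n_symbols, tol=1e-4, max_iter=50, seed=0):
    """Iterate until the log-likelihood gain drops below `tol` or `max_iter` is reached."""
    rng = random.Random(seed)
    pi = random_row(rng, n_states)
    A = [random_row(rng, n_states) for _ in range(n_states)]
    B = [random_row(rng, n_symbols) for _ in range(n_states)]
    history = []
    for _ in range(max_iter):
        pi, A, B, ll = baum_welch_step(sequences, pi, A, B)
        history.append(ll)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return pi, A, B, history
```

One model would be trained per service type and per direction, with `n_symbols=150` to match the sub-interval count of step (3.2); the number of hidden states is not given in the text and has to be chosen separately.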

[0039] Furthermore, in step (4), the observation probabilities of each dimension of the features in the forward and backward flow feature sets are statistically analyzed, and combined with the trained forward and backward hidden Markov models, the continuous subsequences with higher observation probabilities and the greatest contribution to the recognition results are determined as key feature sub-vectors. The process of selecting key feature sub-vectors is shown in formulas (4-1) and (4-2):

[0040] (4-1) (4-2)

[0041] where the formulas use three quantities: the overall observation probability of a given observation value in a given dimension for a given service type and direction, the average observation probability of the same observation across the other service types, and a smoothing constant. This process yields the key feature sub-vectors.
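The exact selection formulas are not legible in this text, so the following is only one plausible reading of step (4), offered as a sketch. It assumes that the gain of a dimension is the smoothed log-ratio between the service's own observation probability and the mean probability over the other services, weighted by the service's own probability, and that the key sub-vector is the contiguous window with the largest summed gain. The function names, the fixed window width, and the default smoothing constant are all hypothetical.

```python
import math
from collections import Counter


def position_probs(vectors):
    """Per-dimension empirical observation probabilities for one service and direction."""
    count = len(vectors)
    return [
        {sym: c / count for sym, c in Counter(v[t] for v in vectors).items()}
        for t in range(len(vectors[0]))
    ]


def position_gain(target, others, eps=1e-3):
    """Gain of each dimension for the target service against the mean of the others."""
    gains = []
    for t, probs in enumerate(target):
        gain = 0.0
        for sym, p in probs.items():
            mean_other = sum(o[t].get(sym, 0.0) for o in others) / len(others)
            gain += p * math.log((p + eps) / (mean_other + eps))  # eps: smoothing constant
        gains.append(gain)
    return gains


def best_window(gains, width):
    """Start index of the contiguous window with the largest summed gain."""
    sums = [sum(gains[i:i + width]) for i in range(len(gains) - width + 1)]
    return max(range(len(sums)), key=sums.__getitem__)
```

Under these assumptions, dimensions where a service looks like every other service contribute a gain near zero, so the selected window concentrates on the part of the early packet sequence that is both frequent for the service and rare elsewhere.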

[0042] Furthermore, in step (5), the steps for optimizing the hidden Markov model are as follows:

[0043] (5.1) In the feature statistics obtained in step (4), find the observations whose global cumulative distribution probability is extremely low. If the hidden state corresponding to such a feature also has a very low distribution probability, the observation is extremely unlikely to occur in practice, which means the state may be noise caused by model overfitting; this hidden state is then discarded.

[0044] (5.2) The state transition probability distribution and observation probability of the hidden state are smoothed respectively, as shown in formulas (5-1) and (5-2):

[0045] (5-1) (5-2)

[0046] where the inputs are the state transition probability distribution and the observation probability distribution of each hidden state, and a smoothing coefficient controls the amount of smoothing. Then, the similarity between hidden states is calculated based on the KL divergence, as shown in formulas (5-3) and (5-4):

[0047] (5-3) (5-4)

[0048] where the KL divergence is computed between the smoothed state transition probability distributions of two hidden states and between their smoothed observation probability distributions. Based on the KL divergence values, hidden states whose state transition probability distributions and observation probability distributions are both highly similar are merged;

[0049] (5.3) Use the key feature sub-vectors obtained in step (4) to perform Viterbi segment re-evaluation on the Hidden Markov Model. Combine the processing results of steps (5.1) and (5.2) to retrain the forward discrete Hidden Markov Model and the backward discrete Hidden Markov Model based on the Baum-Welch algorithm to optimize the recognition performance of the Hidden Markov Model.
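The smoothing and similarity test of step (5.2) can be illustrated as follows. Because the formulas are not legible here, the sketch makes its own choices: additive smoothing with a coefficient `coeff`, a symmetric KL divergence (the sum of both directions), and a single `threshold` applied to both the transition and the observation distributions. These choices and all names are assumptions, not the patent's definitions.

```python
import math


def smooth(row, coeff=1e-3):
    """Additive smoothing so that every probability is strictly positive."""
    total = sum(row) + coeff * len(row)
    return [(x + coeff) / total for x in row]


def kl(p, q):
    """KL divergence D(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def similar_state_pairs(A, B, threshold=0.05):
    """Pairs (i, j) of hidden states whose smoothed transition rows and
    smoothed observation rows are both close under symmetric KL divergence."""
    sa = [smooth(r) for r in A]
    sb = [smooth(r) for r in B]
    pairs = []
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            d_trans = kl(sa[i], sa[j]) + kl(sa[j], sa[i])
            d_emit = kl(sb[i], sb[j]) + kl(sb[j], sb[i])
            if d_trans < threshold and d_emit < threshold:
                pairs.append((i, j))  # candidates for merging
    return pairs
```

Smoothing before the comparison matters because the KL divergence is undefined when one distribution assigns zero probability to a symbol that the other allows.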

[0050] Furthermore, in step (6), the step of identifying the type of traffic service to be detected is as follows:

[0051] (6.1) Capture the traffic of the server or network to be detected, and extract the feature vector of the original network traffic based on the processing steps (1), (2), (3.1) and (3.2);

[0052] (6.2) If only a unidirectional flow is captured, that is, only the feature vector of the forward flow or of the backward flow can be extracted, input it into the corresponding forward or backward hidden Markov model optimized in step (5.3). If a bidirectional flow is captured, input the feature vectors of the forward flow and the backward flow into their respective hidden Markov models optimized in step (5.3).

[0053] (6.3) Use the forward-backward algorithm to evaluate the probability that the stream to be detected belongs to each type of encryption service. Take the maximum probability output by the forward and backward hidden Markov models and determine the type of encryption service corresponding to the stream to be detected.
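Steps (6.2) and (6.3) can be sketched as a maximum-likelihood decision over the per-service, per-direction models. The example below scores each available direction with the scaled forward pass, which is sufficient for evaluating the sequence probability, and returns the service with the highest log-likelihood. The layout of the `models` dictionary and the function names are assumptions for illustration.

```python
import math


def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p


def identify(forward_vec, backward_vec, models):
    """Return (service, score) with the highest log-likelihood.

    models: {service: {"forward": (pi, A, B), "backward": (pi, A, B)}}
    Either vector may be None when that direction was not captured.
    """
    best, best_score = None, float("-inf")
    for service, dirs in models.items():
        for name, vec in (("forward", forward_vec), ("backward", backward_vec)):
            if vec is None or name not in dirs:
                continue
            score = log_likelihood(vec, *dirs[name])
            if score > best_score:
                best, best_score = service, score
    return best, best_score
```

Working in log space avoids numerical underflow for 30-symbol sequences, and a flow seen in only one direction is handled by passing `None` for the missing side.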

[0054] Compared with the prior art, the technical solution of the present invention has the following beneficial technical effects.

[0055] (1) This invention proposes an early identification method for cross-server encrypted services based on focusing on key states of unidirectional flow. Unlike methods that classify traffic for specific encrypted applications or specific servers, this invention does not rely on a single server IP address, port, or fixed deployment node. Instead, it models the temporal behavior of service-level unidirectional flow and extracts stable features that can characterize the essential communication mode of encrypted services. This enables the identification of the same encrypted service across different servers and deployment nodes, providing technical support for network security supervision, asset mapping, service profiling, and the discovery of unknown deployment nodes.

[0056] (2) This invention addresses the constraint of only one-way flow visibility in real network environments. In actual deployment environments such as backbone networks, carrier gateways, and enterprise egress gateways, due to the influence of asymmetric routing, traffic mirroring, and collection strategies, detection nodes often can only observe one side of the forward or backward flow. To address this issue, this invention establishes hidden Markov models for both the forward and backward flows, respectively characterizing the hidden state evolution laws in the client-to-server and server-to-client directions. This enables encrypted service identification to be completed even when only one-way flow is captured, improving the deployability and robustness of the method in real network environments.

[0057] (3) This invention balances cross-server identification accuracy and rapid identification capability for encrypted services, meeting the real-time online analysis requirements of high-speed networks. This invention requires only the first 30 payload data packets to construct an effective early observation sequence and perform service identification, avoiding waiting for the complete connection to end or long-term accumulation of flow statistics. This method significantly reduces detection latency and cache pressure, making it suitable for scenarios such as high-speed links, online monitoring, real-time alarms, and rapid discovery of encrypted services.

[0058] (4) This invention improves the stability and interpretability of encrypted service identification through a key state focusing mechanism. Existing methods based on statistical features or fingerprints usually directly use global sequence features for classification, which are easily affected by local noise. This invention further mines the key state sequences or key feature sub-vectors that contribute the most to service identification in the state space of the Hidden Markov Model, highlighting local time segments that appear stably and have strong distinguishing ability during service communication, reducing the interference of low-frequency noise states and non-key states on the identification results, thereby improving the cross-server identification accuracy and model interpretability.

Attached Figure Description

[0059] Figure 1 is a general framework diagram of the cross-server encrypted service early identification method based on unidirectional flow key state focusing;

[0060] Figure 2 is a flowchart of the cross-server encrypted service early identification method based on unidirectional flow key state focusing;

[0061] Figure 3 is a diagram of the encrypted service request and response process;

[0062] Figure 4 is a diagram of the hidden Markov model optimization.

Detailed Implementation

[0063] The technical solutions provided by the present invention will be described in detail below with reference to specific embodiments. It should be understood that the following specific embodiments are only used to illustrate the present invention and are not intended to limit the scope of the present invention.

[0064] Specific Implementation: The present invention provides an early identification method for cross-server encrypted services based on unidirectional flow key state focusing, the overall framework of which is shown in Figure 1. It specifically includes the following steps:

[0065] Step (1) Capture network traffic passing through the collection node. Group the captured packets into flows by five-tuple, and determine and record the server IP address and port of each five-tuple flow. Then detect and discard data packets with a transport layer payload length of 0, such as pure ACK packets. Discard five-tuple flows with fewer than 30 data packets; for flows with more than 30 data packets, retain only the first 30, yielding a clean set of encrypted service five-tuple flows with an equal number of data packets.

[0066] In one embodiment of the present invention, the steps for obtaining a clean and equal-length set of encrypted service quintuple streams are as follows:

[0067] (1.1) Capture all passing network traffic at the designated collection node, extract the 5-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol) and calculate its hash value as the unique identifier of a 5-tuple stream for grouping the streams, and then sort the data packets in each 5-tuple stream in chronological order; for example, concatenate the 5-tuple (192.168.1.1, 11111, 114.114.114.114, 53, UDP) into a string and use the FarmHash function to calculate the unique identifier;

[0068] (1.2) For the sorted 5-tuple stream, if the transport layer is TCP, record the destination IP address and destination port of the first SYN packet without the ACK flag as the server IP address and port; if the transport layer is UDP, record the destination IP address and destination port of the first data packet as the server IP address and port. For example, for the TCP 5-tuple stream (192.168.1.1, 22222, 198.18.0.78, 80, TCP), the first packet of the TCP three-way handshake is sent by the client and has the Acknowledgment bit of the Flags field equal to 0b0 and the SYN bit equal to 0b1, so the server IP address 198.18.0.78 and port 80 are recorded; for the UDP 5-tuple stream (192.168.1.1, 11111, 114.114.114.114, 53, UDP), the server IP address 114.114.114.114 and port 53 are recorded;

[0069] (1.3) Data packets with a transport layer payload length of 0 carry extremely little information. To eliminate their interference with the identification results, all five-tuple streams are traversed and these data packets are discarded. For example, a TCP packet is discarded if it is a pure ACK packet or a SYN packet, that is, the Acknowledgment bit or SYN bit of the Flags field is 0b1 and the packet carries no payload; a UDP packet is discarded if its Length field is 0x0008, that is, it contains only the UDP header.

[0070] (1.4) Count the number of packets in each stream and discard short streams with fewer than 30 packets. If a 5-tuple stream has more than 30 packets and its transport layer is TCP, only the first 30 packets are kept. If its transport layer is UDP and no two consecutive packets are more than 15 seconds apart, only the first 30 packets are kept; if two consecutive packets are more than 15 seconds apart, the stream is split at that gap into two 5-tuple streams and step (1.4) is applied to each of them. The result is a clean set of 5-tuple streams of equal length.
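The server endpoint rule of step (1.2) can be sketched with the two example streams given above. The dictionary layout of a packet and the function name are assumptions for illustration; a real capture pipeline would read the same fields from parsed TCP and UDP headers.

```python
def server_endpoint(packets):
    """Return the (ip, port) of the server for a time-sorted five-tuple stream.

    Each packet is a dict with keys "proto", "src", "dst" and, for TCP,
    "flags" (a set of flag names). Returns None if a TCP stream contains
    no initial SYN, for example when capture started mid-connection.
    """
    first = packets[0]
    if first["proto"] == "UDP":
        return first["dst"]          # destination of the first datagram
    for p in packets:
        flags = p.get("flags", set())
        if "SYN" in flags and "ACK" not in flags:
            return p["dst"]          # destination of the client's initial SYN
    return None


tcp_stream = [
    {"proto": "TCP", "src": ("192.168.1.1", 22222), "dst": ("198.18.0.78", 80), "flags": {"SYN"}},
    {"proto": "TCP", "src": ("198.18.0.78", 80), "dst": ("192.168.1.1", 22222), "flags": {"SYN", "ACK"}},
]
udp_stream = [
    {"proto": "UDP", "src": ("192.168.1.1", 11111), "dst": ("114.114.114.114", 53)},
]
```

This step has to run before the zero-payload filter of step (1.3), because the initial SYN packet carries no payload and would otherwise be removed.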

[0071] Step (2) From the five-tuple flows obtained in step (1), detect and remove flows with anomalies such as packet loss. Then mark five-tuple flows with the same server IP address and port as the same service type, and divide each five-tuple flow into a forward flow and a backward flow based on the server IP address. Count the forward flows and backward flows of each service type; if a service type has fewer than 200 unidirectional flows in a direction, discard that set of unidirectional flows.

[0072] In one embodiment of the present invention, the steps for obtaining the forward flow set and backward flow set for each service type are as follows:

[0073] (2.1) For the set of clean five-tuple streams of equal length obtained in step (1.4), if the transport layer is TCP, check for packet loss according to the TCP header fields and discard the stream if loss is found. For example, after sorting in chronological order, if a data packet with seq=1 carries a payload of 1000 bytes, the next expected sequence number is 1001, so a following data packet with seq=3001 indicates that the intermediate data was lost.

[0074] (2.2) Based on the server IP address and port recorded in step (1.2), aggregate the five-tuple flows with the same server IP address and port and mark them as the same service type. Then, divide the five-tuple flows into forward flows and backward flows based on the server IP address. For example, for the five-tuple flow (192.168.1.1, 11111, 114.114.114.114, 53, UDP), the packets with source IP address equal to 192.168.1.1 and source port equal to 11111 are divided into forward flows, and the packets with source IP address equal to 114.114.114.114 and source port equal to 53 are divided into backward flows.

[0075] (2.3) In order to ensure the number of streams required for training, the number of forward and backward streams for each service type is counted, and the set of unidirectional streams with less than 200 streams is discarded.
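The packet loss check of step (2.1) can be illustrated with the sequence number example above. The sketch assumes loss is detected as a jump in TCP sequence numbers within one direction, ignores 32-bit sequence number wraparound, and treats retransmissions (a sequence number below the expected one) as harmless; the function name and input layout are hypothetical.

```python
def has_sequence_gap(segments):
    """Return True if a TCP direction shows missing bytes.

    segments: (seq, payload_len) pairs of payload-bearing packets from one
    direction, in capture order.
    """
    expected = None
    for seq, length in segments:
        if expected is not None and seq > expected:
            return True                       # bytes between expected and seq are missing
        end = seq + length
        expected = end if expected is None else max(expected, end)
    return False


lossy = [(1, 1000), (3001, 500)]     # expected 1001, saw 3001
clean = [(1, 1000), (1001, 500)]
```

Each direction of a TCP connection has its own sequence space, so the check would be applied to the forward and backward packets separately.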

[0076] Step (3) For the forward flow set and backward flow set obtained in step (2), the data packets are parsed layer by layer and the transport layer payload length sequence is calculated. After further processing, the feature vector is obtained, and the forward and backward hidden Markov models are trained respectively to obtain the identification model of the encrypted service.

[0077] In one embodiment of the present invention, the steps for training the encrypted service identification model are as follows:

[0078] (3.1) For the forward flow set and backward flow set obtained in step (2.3), traverse all data packets of each unidirectional flow, parse each protocol layer to calculate the transport layer payload length, and obtain the sets of transport layer payload length sequences for the forward flows and the backward flows. Table 1 lists the transport layer payload length sequences extracted from some forward and backward flows.

[0079] Table 1: Transport layer payload length sequences of some forward and backward flows

[0080]
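The layer-by-layer parsing in step (3.1) can be illustrated for the common IPv4 case. This sketch assumes the input is a raw IPv4 packet starting at the IP header (link-layer header already removed); the function name `transport_payload_length` is hypothetical, and IPv6 and fragmentation are not handled.

```python
def transport_payload_length(ip_packet: bytes) -> int:
    """Return the TCP or UDP payload length of a raw IPv4 packet."""
    ip_header_len = (ip_packet[0] & 0x0F) * 4            # IHL field, in 32-bit words
    total_length = int.from_bytes(ip_packet[2:4], "big")  # IP header + everything after
    protocol = ip_packet[9]
    if protocol == 6:  # TCP: header length comes from the data offset field
        tcp_header_len = (ip_packet[ip_header_len + 12] >> 4) * 4
        return total_length - ip_header_len - tcp_header_len
    if protocol == 17:  # UDP: length field covers the 8-byte header plus payload
        udp_length = int.from_bytes(ip_packet[ip_header_len + 4:ip_header_len + 6], "big")
        return udp_length - 8
    raise ValueError(f"unsupported transport protocol number: {protocol}")
```

Applying this function to the first 30 packets of a unidirectional flow gives one transport layer payload length sequence.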

[0081] (3.2) The forward and backward flows of each encrypted service follow that service's communication rules; therefore, this method selects a Hidden Markov Model to learn the characteristics of the encrypted service. The transport layer payload length of a single data packet may fluctuate, so for stability this method divides the value space of the length sequence (0-1499 bytes) into 150 sub-intervals, each 10 bytes long. Each dimension of the transport layer payload length sequence is mapped to its corresponding sub-interval, yielding the feature vector sets of the forward and backward flows. Table 2 lists some of the feature vectors of the forward and backward flows.

[0082] Table 2: Feature vectors of some forward and backward flows

[0083]
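The quantization in step (3.2) amounts to integer division by the sub-interval width. The sketch below assumes that any length of 1500 bytes or more is clamped into the last sub-interval, which the text does not specify; the name `to_feature_vector` is hypothetical.

```python
from typing import List

BIN_WIDTH = 10   # bytes per sub-interval
NUM_BINS = 150   # covers payload lengths 0-1499


def to_feature_vector(lengths: List[int]) -> List[int]:
    """Map each payload length to the index of its 10-byte sub-interval."""
    # Assumption: lengths beyond 1499 bytes are clamped into the last sub-interval.
    return [min(length // BIN_WIDTH, NUM_BINS - 1) for length in lengths]
```

Lengths 510 through 519 all map to index 51, so small fluctuations in payload size produce the same observation symbol.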

[0084] (3.3) For the forward and backward flow feature sets from step (3.2), the Baum-Welch algorithm is used to train the hidden Markov model parameters. Figure 3 illustrates the request and response process for encrypted services. For each direction (forward or backward), define the observation sequence, the hidden state set, and the model parameters (π, A, B). First, initialize π, A, and B, as shown in formulas (3-1), (3-2), and (3-3):

[0085] (3-1) (3-2) (3-3)

[0086] where π gives the probability that the initial hidden state is a given state, A gives the probability of transitioning from one hidden state to another, and B gives the probability that the feature vector value at a given position takes a given value in a given hidden state; each dimension of the feature vector of an arbitrary flow corresponds to one hidden state;

[0087] (3.4) Based on the initial hidden Markov model obtained in step (3.3), the forward probability and the backward probability are calculated recursively, as shown in formulas (3-4), (3-5), (3-6), and (3-7):

[0088] (3-4) (3-5) (3-6) (3-7)
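For reference, the standard forward and backward recursions for a discrete hidden Markov model are sketched below. The notation is assumed here and may differ from the original formulas: $N$ hidden states, an observation sequence $o_1, \dots, o_T$, initial probabilities $\pi_i$, transition probabilities $a_{ij}$, observation probabilities $b_j(o)$, forward probability $\alpha_t(i)$, and backward probability $\beta_t(i)$.

```latex
\begin{aligned}
\alpha_1(i) &= \pi_i \, b_i(o_1), \\
\alpha_{t+1}(j) &= \Bigl[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Bigr] b_j(o_{t+1}), \qquad 1 \le t \le T-1, \\
\beta_T(i) &= 1, \\
\beta_t(i) &= \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad 1 \le t \le T-1.
\end{aligned}
```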

[0089] (3.5) Calculate the posterior state probability and the posterior transition probability from the forward and backward probabilities obtained in step (3.4), as shown in formulas (3-8) and (3-9):

[0090] (3-8) (3-9)
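For reference, the standard posterior quantities of the Baum-Welch algorithm are sketched below. The notation is assumed here and may differ from the original formulas: $\alpha_t(i)$ and $\beta_t(i)$ are the forward and backward probabilities, $a_{ij}$ and $b_j(o)$ are the transition and observation probabilities, $\gamma_t(i)$ is the posterior probability of being in state $i$ at position $t$, and $\xi_t(i,j)$ is the posterior probability of moving from state $i$ at position $t$ to state $j$ at position $t+1$.

```latex
\begin{aligned}
\gamma_t(i) &= \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}, \\
\xi_t(i,j) &= \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}
                  {\sum_{k=1}^{N} \sum_{l=1}^{N} \alpha_t(k)\, a_{kl}\, b_l(o_{t+1})\, \beta_{t+1}(l)}.
\end{aligned}
```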

[0091] (3.6) Then, based on all feature vectors, the model parameters are trained, and the training process is shown in equations (3-10), (3-11), and (3-12):

[0092] (3-10) (3-11) (3-12)
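For reference, the standard Baum-Welch re-estimation over a set of $R$ training feature vectors is sketched below. The notation is assumed here and may differ from the original formulas: $\gamma_t^{(r)}(i)$ and $\xi_t^{(r)}(i,j)$ are the posterior state and transition probabilities computed for the $r$-th feature vector $o_1^{(r)}, \dots, o_T^{(r)}$, and $v_k$ is the $k$-th observation value.

```latex
\begin{aligned}
\hat{\pi}_i &= \frac{1}{R} \sum_{r=1}^{R} \gamma_1^{(r)}(i), \\
\hat{a}_{ij} &= \frac{\sum_{r=1}^{R} \sum_{t=1}^{T-1} \xi_t^{(r)}(i,j)}
                     {\sum_{r=1}^{R} \sum_{t=1}^{T-1} \gamma_t^{(r)}(i)}, \\
\hat{b}_j(k) &= \frac{\sum_{r=1}^{R} \sum_{t \,:\, o_t^{(r)} = v_k} \gamma_t^{(r)}(j)}
                     {\sum_{r=1}^{R} \sum_{t=1}^{T} \gamma_t^{(r)}(j)}.
\end{aligned}
```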

[0093] The process stops when the log-likelihood increment is less than a preset threshold τ or when the maximum number of iterations is reached, resulting in the forward discrete hidden Markov model and the backward discrete hidden Markov model, respectively.

[0094] Step (4) performs statistical analysis on the feature vector sets of the forward and backward flows respectively, and combines them with the trained forward and backward hidden Markov models obtained in step (3) to mine the key feature sub-vectors that have the greatest gain on the recognition results; the process of selecting key feature sub-vectors is shown in formulas (4-1) and (4-2):

[0095] (4-1) (4-2)

[0096] where the quantities are, in order: the overall observation probability of a given observation value in a given dimension, for a given service type and direction; the average observation probability of that observation across the other service types; and a smoothing constant. This process yields the key feature sub-vectors.

[0097] In one embodiment of the present invention, the observation probabilities of each dimension of features in the feature sets of the forward and backward flows are statistically analyzed, and combined with the trained forward and backward Hidden Markov Models, the continuous subsequences with high observation probabilities and the greatest contribution to the recognition results are identified as key feature sub-vectors. For example, Table 3 shows the key feature sub-vectors of some services.

[0098] Table 3: Key Feature Subvectors for Some Services

[0099]
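Formulas (4-1) and (4-2) are not reproduced in this text, so the following is only one plausible reading of the selection process, not the patented formula. It assumes that the gain of each dimension is a smoothed log ratio between the service's own observation probability and the average over other service types, and that the key sub-vector is the fixed-width contiguous window with the largest total gain. The names `dimension_gains` and `best_window`, the default `eps`, and the fixed window width are all hypothetical.

```python
import math
from typing import List, Tuple


def dimension_gains(p_own: List[float], p_others_mean: List[float],
                    eps: float = 1e-6) -> List[float]:
    """Assumed gain per dimension: smoothed log ratio of observation probabilities."""
    return [math.log((p + eps) / (q + eps)) for p, q in zip(p_own, p_others_mean)]


def best_window(gains: List[float], width: int) -> Tuple[int, float]:
    """Return (start index, total gain) of the best contiguous window."""
    best_start = 0
    current = best_sum = sum(gains[:width])
    for start in range(1, len(gains) - width + 1):
        # Slide the window by one position: add the new element, drop the old one.
        current += gains[start + width - 1] - gains[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum
```

Under these assumptions, dimensions where a service is far more predictable than the others receive a large positive gain, and the selected window marks the positions the model should focus on.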

[0100] Step (5) In order to further improve the performance of the Hidden Markov Model, the trained Hidden Markov Model is optimized by combining the statistical analysis results of the feature vector set and the key feature sub-vectors mined in step (4).

[0101] In one embodiment of the present invention, the steps for optimizing the hidden Markov model are as follows:

[0102] (5.1) In the feature statistics from step (4), find observations whose global cumulative distribution probability is extremely low. If the hidden state corresponding to such an observation also has a very low distribution probability, the observation almost never occurs in practice and the state is likely noise caused by model overfitting, so the hidden state is discarded.

[0103] (5.2) The state transition probability distribution and observation probability of the hidden state are smoothed respectively, as shown in formulas (5-1) and (5-2):

[0104] (5-1) (5-2)

[0105] where the quantities are, for each hidden state, its state transition probability distribution and its observation probability distribution, together with the smoothing coefficient. Then the similarity of hidden states is calculated based on the KL divergence, as shown in formulas (5-3) and (5-4):

[0106]

[0107] where the quantities are, for each hidden state, its smoothed state transition probability distribution and its smoothed observation probability distribution. Based on the KL divergence values, hidden states whose state transition probability distributions and observation probability distributions are both highly similar are merged;
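The smoothing and similarity test in step (5.2) can be sketched as follows. Formulas (5-1) to (5-4) are not reproduced in this text, so the specific choices here are assumptions: smoothing mixes each distribution with the uniform distribution using a coefficient `lam`, similarity uses the symmetrized KL divergence, and two states count as mergeable when both divergences fall below a hypothetical `threshold`.

```python
import math
from typing import List


def smooth(dist: List[float], lam: float = 0.01) -> List[float]:
    """Assumed smoothing: mix the distribution with the uniform distribution."""
    n = len(dist)
    return [(1.0 - lam) * p + lam / n for p in dist]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """D_KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def states_similar(trans_i: List[float], trans_j: List[float],
                   emit_i: List[float], emit_j: List[float],
                   threshold: float = 0.05) -> bool:
    """True if both the transition and observation distributions are close."""
    ti, tj = smooth(trans_i), smooth(trans_j)
    ei, ej = smooth(emit_i), smooth(emit_j)
    trans_div = kl_divergence(ti, tj) + kl_divergence(tj, ti)
    emit_div = kl_divergence(ei, ej) + kl_divergence(ej, ei)
    return trans_div < threshold and emit_div < threshold
```

Smoothing before the comparison matters because the KL divergence is undefined when the second distribution assigns zero probability to a value that the first one can produce.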

[0108] (5.3) Use the key feature sub-vectors obtained in step (4) to perform Viterbi segment re-evaluation on the Hidden Markov Model. Then, combining the results of steps (5.1) and (5.2), retrain the forward discrete Hidden Markov Model and the backward discrete Hidden Markov Model with the Baum-Welch algorithm to optimize the recognition performance of the Hidden Markov Model. Figure 4 illustrates the optimization process of the Hidden Markov Model.

[0109] Step (6) Collect the traffic of the server or network to be detected, and execute the processing steps (1), (2) and (3) in sequence to obtain the feature vector. Based on the optimized hidden Markov model, identify the encrypted service type of the extracted feature vector.

[0110] In one embodiment of the present invention, the step of identifying the type of traffic service to be detected is as follows:

[0111] (6.1) Capture the traffic of the server or network to be detected, and extract the feature vector of the original network traffic based on the processing steps (1), (2), (3.1) and (3.2);

[0112] (6.2) If only a unidirectional flow is captured, that is, only the feature vector of either the forward flow or the backward flow can be extracted, then input it into the hidden Markov model of the corresponding direction optimized in step (5.3). If a bidirectional flow is captured, then input the feature vectors of the forward flow and the backward flow into the corresponding hidden Markov models optimized in step (5.3), respectively.

[0113] (6.3) Use the forward-backward algorithm to evaluate the probability that the stream to be detected belongs to each type of encrypted service. Take the maximum probability output by the forward and backward Hidden Markov Models to determine the encrypted service type of the stream to be detected. For example, Table 4 shows the highest probability matching result of a forward stream with server IP address 183.2.143.108 and port 443, and Table 5 shows the highest probability matching result of a backward stream with server IP address 202.119.26.90 and port 12345.

[0114] Table 4: Matching results with the highest probability for a forward flow

[0115]

[0116] Table 5: Matching results with the highest probability for a backward flow

[0117]
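The scoring in step (6.3) can be sketched as below. This is a minimal illustration under stated assumptions: each service type has one discrete HMM stored as a hypothetical `(pi, A, B)` tuple of plain lists, only the forward pass of the forward-backward algorithm is used (it is sufficient to obtain the sequence likelihood), and per-step scaling is applied to avoid numerical underflow.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical model layout: (initial probabilities, transition matrix, observation matrix)
HMM = Tuple[List[float], List[List[float]], List[List[float]]]


def log_likelihood(obs: Sequence[int], model: HMM) -> float:
    """Log-probability of a feature vector under a discrete HMM (scaled forward pass)."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]  # rescale to keep values in range
        log_prob += math.log(scale)
    return log_prob


def classify(obs: Sequence[int], models: Dict[str, HMM]) -> str:
    """Return the service type whose model assigns the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, models[name]))
```

For a unidirectional capture, `models` would hold only the models of the captured direction; for a bidirectional capture, the same function can be run once per direction and the larger score kept.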

[0118] It should be noted that the above embodiments are not intended to limit the scope of protection of the present invention. Equivalent transformations or substitutions made based on the above technical solutions all fall within the scope of protection of the claims of the present invention.

Claims

1. A method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing, characterized in that the method includes the following steps:
Step (1) Capture traffic transmitted on the Internet. Group the captured traffic into streams by five-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol), determine and record the server IP address and port of each five-tuple stream, discard data packets with a transport layer payload length of 0, such as ACK packets, and then filter out five-tuple streams with fewer than 30 data packets. If a five-tuple stream has more than 30 data packets, only the first 30 data packets are retained, yielding clean encrypted service five-tuple streams of equal length.
Step (2) Based on the five-tuple flows obtained in step (1), detect and remove flows with abnormal conditions such as packet loss, then mark the five-tuple flows with the same server IP address and port as the same service type, divide each five-tuple flow into a forward flow and a backward flow, count the number of unidirectional flows of each service type, and discard service types with fewer than 200 flows.
Step (3) For the forward flow set and backward flow set obtained in step (2), parse each protocol layer of the data packets and extract the transport layer payload length sequence. After further processing, obtain the feature vectors, and train hidden Markov models on the feature vector sets of the forward flows and the backward flows respectively to obtain the identification model of the encrypted service.
Step (4) Perform statistical analysis on the feature vector sets of the forward flows and the backward flows respectively and, combined with the trained hidden Markov models obtained in step (3), mine the key feature sub-vectors that contribute the most to the recognition.
Step (5) To further improve the performance of the recognition model, optimize the hidden Markov models trained in step (3) by combining the statistical analysis results of the feature vector sets with the key feature sub-vectors mined in step (4).
Step (6) For the traffic of the server or network to be detected, perform processing steps (1), (2) and (3) to obtain the feature vectors, and use the optimized hidden Markov models to identify the encrypted service type of the extracted feature vectors.

2. The method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing as described in claim 1, characterized in that step (1) specifically includes the following sub-steps:
(1.1) Collect all network traffic passing through the designated gateway, extract the five-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol) from every data packet, use the five-tuple as the key to calculate a hash value that serves as the unique identifier of a five-tuple flow, and sort the data packets in each five-tuple flow by arrival timestamp.
(1.2) For a sorted five-tuple stream whose transport layer is TCP, record the destination IP address and destination port of the first SYN packet without the ACK flag as the server IP address and port; for a five-tuple stream whose transport layer is UDP, record the destination IP address and destination port of the first data packet as the server IP address and port.
(1.3) Since data packets with a transport layer payload length of 0, such as ACK packets, carry very little information, all five-tuple streams are traversed and data packets with a transport layer payload length of 0 are discarded to eliminate their interference with the recognition results.
(1.4) Count the number of packets in each five-tuple stream and discard short streams with fewer than 30 packets. If a five-tuple stream has more than 30 packets and its transport layer is TCP, only the first 30 packets are kept. If its transport layer is UDP and no time interval between two consecutive packets exceeds 15 seconds, only the first 30 packets are kept. If a time interval between two consecutive packets exceeds 15 seconds, the stream is split into two five-tuple streams and step (1.4) is executed on each of them. Finally, a set of clean five-tuple streams of equal length is obtained.

3. The method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing according to claim 2, characterized in that step (2) specifically includes the following sub-steps:
(2.1) For the set of clean five-tuple streams of equal length obtained in step (1.4), if the transport layer of the service is TCP, check for packet loss according to the TCP header fields, and discard the five-tuple stream if loss is found.
(2.2) Based on the server IP address and port recorded in step (1.2), aggregate the five-tuple flows with the same server IP address and port and mark them as the same service type. Then divide the five-tuple flows into forward flows and backward flows based on the server IP address.
(2.3) To ensure that the number of streams for each service type is sufficient to support the training process, the number of forward streams and backward streams for each service type is counted, and any set of unidirectional streams containing fewer than 200 streams is discarded.

4. The method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing according to claim 3, characterized in that step (3) specifically includes the following sub-steps:
(3.1) For the forward flow set and backward flow set obtained in step (2.3), traverse all data packets of each unidirectional flow, parse each protocol layer and extract the transport layer payload, calculate the transport layer payload length sequence, and obtain the transport layer payload length sequence sets of the forward flows and of the backward flows respectively.
(3.2) The forward and backward flows of each specific encrypted service follow predefined communication rules and possess the hidden Markov property; therefore, this method selects the Hidden Markov Model as the encrypted service identification model. The transport layer payload length of a single data packet may fluctuate. To keep the transport layer payload length sequence stable as an observation sequence, the value space of the length sequence, 0-1499 bytes, is divided into 150 sub-intervals, each 10 bytes long. Each dimension of the transport layer payload length sequence is mapped to its corresponding sub-interval to obtain the feature vector sets of the forward flows and the backward flows.
(3.3) For the forward and backward flow feature sets from step (3.2), the Baum-Welch algorithm is used to train the hidden Markov model parameters. For each direction (forward or backward), define the observation sequence, the hidden state set, and the model parameters (π, A, B). First, initialize π, A, and B, as shown in formulas (3-1), (3-2), and (3-3), where π gives the probability that the initial hidden state is a given state, A gives the probability of transitioning from one hidden state to another, and B gives the probability that the feature vector value at a given position takes a given value in a given hidden state; each dimension of the feature vector of an arbitrary flow corresponds to one hidden state.
(3.4) Based on the initial hidden Markov model obtained in step (3.3), the forward probability and the backward probability are calculated recursively, as shown in formulas (3-4), (3-5), (3-6), and (3-7).
(3.5) Calculate the posterior state probability and the posterior transition probability from the forward and backward probabilities obtained in step (3.4), as shown in formulas (3-8) and (3-9).
(3.6) Then, based on all feature vectors, the model parameters are trained, as shown in formulas (3-10), (3-11), and (3-12). The process stops when the log-likelihood increment is less than a preset threshold τ or when the maximum number of iterations is reached, resulting in the forward discrete hidden Markov model and the backward discrete hidden Markov model, respectively.

5. The method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing according to claim 1, characterized in that, in step (4), the observation probability of each feature dimension is calculated for the feature sets of the forward and backward flows from step (3.2). Combined with the hidden Markov model, the continuous subsequence in the feature set with the highest observation probability and the greatest gain on the recognition result is determined as the key feature sub-vector. The process of selecting the key feature sub-vector is shown in formulas (4-1) and (4-2), where the quantities are, in order: the overall observation probability of a given observation value in a given dimension, for a given service type and direction; the average observation probability of that observation across the other service types; and a smoothing constant. This process yields the key feature sub-vectors.

6. The method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing according to claim 1, characterized in that step (5) specifically includes the following sub-steps:
(5.1) For observations with extremely low global cumulative distribution probability in the feature statistics of step (4), if the distribution probability of the corresponding hidden state is also very low, that is, the observation is almost never triggered in actual operation and is likely noise caused by model overfitting, then discard that hidden state.
(5.2) The state transition probability distribution and the observation probability distribution of each hidden state are smoothed separately, as shown in formulas (5-1) and (5-2), where the quantities are, for each hidden state, its state transition probability distribution and its observation probability distribution, together with the smoothing coefficient. Then the similarity of hidden states is calculated based on the KL divergence, as shown in formulas (5-3) and (5-4), where the quantities are, for each hidden state, its smoothed state transition probability distribution and its smoothed observation probability distribution. Based on the KL divergence values, hidden states whose state transition probability distributions and observation probability distributions are both highly similar are merged.
(5.3) Use the key feature sub-vectors from step (4) to perform Viterbi segment re-evaluation on the Hidden Markov Model. Then, combining the results of steps (5.1) and (5.2), perform at least one round of retraining on the forward discrete Hidden Markov Model and the backward discrete Hidden Markov Model obtained in step (3.3) with the Baum-Welch algorithm to optimize the recognition performance of the Hidden Markov Model.

7. The method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing according to claim 1, characterized in that step (6) specifically includes the following sub-steps:
(6.1) Capture traffic at the gateway of the server or network to be tested, and perform processing steps (1), (2), (3.1) and (3.2) on the original network traffic to obtain the feature vectors.
(6.2) If only a unidirectional flow is captured at the gateway, that is, only the feature vector of one of the forward flow and the backward flow is captured, then it is input into the hidden Markov model of the corresponding direction optimized in step (5.3). If a bidirectional flow is captured at the gateway, then the feature vectors of the forward flow and the backward flow are input into the corresponding hidden Markov models optimized in step (5.3), respectively.
(6.3) Evaluate the probability that the stream to be detected belongs to each type of encrypted service based on the forward-backward algorithm, take the maximum probability output by the forward Hidden Markov Model and the backward Hidden Markov Model, and determine the type of encrypted service corresponding to the stream to be detected.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing as described in any one of claims 1 to 7.

9. A computer-readable storage medium storing computer instructions thereon, characterized in that, when executed by a processor, the computer instructions implement the method for early identification of cross-server encrypted services based on unidirectional flow critical state focusing as described in any one of claims 1 to 7.