Big data based control system traffic anomaly detection method and system

By performing protocol parsing and feature extraction on real-time network traffic data of industrial control systems, and utilizing pre-trained models and legality verification, the problem of deep analysis of encrypted data features was solved, enabling accurate identification and anomaly detection of unexpected encrypted data blocks, thereby improving the security and stability of the system.

CN122247709A · Pending · Publication Date: 2026-06-19 · GUIZHOU NIO TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GUIZHOU NIO TECHNOLOGY CO LTD
Filing Date: 2026-04-01
Publication Date: 2026-06-19

Abstract

This invention relates to the field of traffic anomaly detection technology, specifically a method and system for detecting traffic anomalies in control systems based on big data. The method collects real-time network traffic data from industrial control systems and extracts function codes, source addresses, destination addresses, and port information from protocol headers. It then identifies whether encrypted data blocks exist in the transport layer payload based on a pre-defined plaintext protocol specification library. If encrypted data blocks are found, their traffic characteristics are extracted. These characteristics are input into a pre-trained baseline model to calculate the reconstruction error of the encrypted data blocks. If the reconstruction error exceeds a pre-defined error threshold, the encrypted data blocks are determined to be unexpected, and a traffic anomaly alarm is generated. This effectively improves the accuracy of traffic anomaly detection, providing strong support for the safe and stable operation of control systems.

Description

Technical Field

[0001] This invention relates to the field of traffic anomaly detection technology, specifically to a method and system for detecting traffic anomalies in control systems based on big data.

Background Technology

[0002] Industrial control systems involve frequent data exchanges between internal and external networks, resulting in complex and ever-changing network traffic patterns. To protect industrial control systems from security threats such as malicious attacks and data breaches, data encryption technology is widely used as a core defense measure. By encrypting critical data, it is possible to effectively prevent sensitive information from being stolen or tampered with during transmission, ensuring the stable operation of industrial production and data security.

[0003] However, while the widespread application of encryption technology enhances data security in the actual operation of industrial control systems, it also raises a series of new problems. Because the encryption process consumes significant computing resources, including processor power, memory space, and network bandwidth, it directly impacts system real-time performance, manifesting as a significant increase in network traffic. As encrypted data blocks are transmitted over the network, traffic characteristics change, making it difficult for traditional traffic detection methods to accurately identify encrypted data blocks, let alone effectively distinguish between normal encrypted data and potentially unexpected encrypted data (such as malicious encrypted traffic). Existing technologies, when detecting encrypted data blocks in industrial control systems, often lack in-depth analysis and accurate judgment of encrypted data characteristics, making it difficult to cope with complex and ever-changing network attack methods and unable to promptly detect and handle traffic anomalies caused by unexpected encrypted data blocks, thus posing a significant threat to the safe and stable operation of industrial control systems.

Summary of the Invention

(1) Technical problems to be solved

[0004] The purpose of this invention is to provide a method and system for detecting abnormal traffic in control systems based on big data, so as to solve the problem that industrial control systems lack in-depth analysis and accurate judgment of the characteristics of encrypted data, and cannot detect and handle abnormal traffic situations caused by unexpected encrypted data blocks in a timely manner.

(2) Technical solution

[0005] To achieve the above objectives, in one aspect, the present invention provides a method for detecting abnormal control system traffic based on big data, the method comprising: S1. Collect real-time network traffic data of the industrial control system. The network traffic data includes the transport layer payload and the corresponding protocol header information. Perform protocol parsing on the network traffic data, extract the function code, source address, destination address and port information in the protocol header information, and identify whether there are encrypted data blocks in the transport layer payload according to the preset plaintext protocol specification library.

[0006] S2. If an encrypted data block exists, extract the traffic characteristics of the encrypted data block, including packet length distribution, transmission frequency, entropy value, and session duration; input the traffic characteristics of the encrypted data block into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block;

[0007] S3. If the reconstruction error exceeds the preset error threshold, the encrypted data block is determined to be an unexpected encrypted data block, and the source device and destination device of the unexpected encrypted data block are associated according to the source address, destination address and port information in the protocol header information, and a traffic anomaly alarm is generated.

[0008] Furthermore, the method for identifying whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library includes: The expected structure of the transport layer payload is determined by matching the function code and port information in the protocol header with the predefined protocol types in the preset plaintext protocol specification library; the semantic content and encoding rules of preset fields in the transport layer payload are extracted through protocol reverse analysis.

[0009] If the semantic content conforms to the plaintext data format of the expected structure and the encoding rule is consistent with the standard encoding in the plaintext protocol specification library, then it is determined that there is no encrypted data block in the transport layer payload.

[0010] If the semantic content contains a non-standard character set or the encoding rules deviate from the standard encoding, and the entropy value calculated for the transport layer payload exceeds the preset plaintext data entropy value threshold, then it is determined that there is an encrypted data block in the transport layer payload, and further verification is made based on the source address and destination address in the protocol header information to determine whether the encrypted data block conforms to the legal encrypted communication strategy defined in the preset plaintext protocol specification library.

[0011] Furthermore, the method for determining that the entropy value calculated for the transport layer payload exceeds the preset plaintext data entropy threshold includes: The binary data of the transport layer payload is divided into multiple data blocks according to a preset block size, and the Shannon entropy value of the byte values is calculated for each data block; the average value and variance of the Shannon entropy values of the multiple data blocks are calculated. If the average value of the Shannon entropy exceeds a preset plaintext data entropy threshold and the variance of the Shannon entropy is lower than a preset variance threshold, then the entropy value of the transport layer payload is determined to exceed the preset plaintext data entropy threshold.

[0012] If the average value of the Shannon entropy does not exceed the preset plaintext data entropy threshold, but the difference between the average value of the Shannon entropy and the preset plaintext data entropy threshold is within the preset fuzzy difference range, then the high-frequency component energy ratio of the transport layer payload is extracted by frequency domain analysis; when the high-frequency component energy ratio exceeds the preset energy ratio threshold, it is determined that the entropy value of the transport layer payload exceeds the preset plaintext data entropy threshold.

[0013] Furthermore, the method for extracting the high-frequency component energy proportion of the transport layer payload through frequency domain analysis includes: The binary data of the transport layer payload is converted into a decimal numerical sequence and subjected to a fast Fourier transform to obtain the frequency domain energy distribution. The frequency range of the high-frequency component is determined according to the predefined plaintext data frequency domain characteristics in the preset plaintext protocol specification library, and the proportion of the frequency domain energy within the frequency range of the high-frequency component to the total energy of the frequency domain energy distribution is calculated and recorded as the high-frequency component energy ratio.

[0014] Furthermore, the method for further verifying whether the encrypted data block conforms to the legal encrypted communication strategy defined in the preset plaintext protocol specification library based on the source address and destination address in the protocol header information includes: Based on the source address and destination address in the protocol header information, a predefined whitelist of legal encrypted communication is queried from a preset plaintext protocol specification library. The whitelist of legal encrypted communication contains authorized combinations of source and destination addresses that are allowed to perform encrypted communication.

[0015] If the matching result of the source address and destination address exists in the legal encrypted communication whitelist, then the certificate fingerprint information of the encrypted data block is extracted and compared with the certificate fingerprint information pre-stored in the legal encrypted communication whitelist. When the certificate fingerprint information matches, then it is verified whether the communication timestamp of the encrypted data block is within the valid time window corresponding to the authorized combination. If the communication timestamp is within the valid time window, then it is determined that the encrypted data block conforms to the legal encrypted communication policy.

[0016] Further, the method of inputting the traffic characteristics of the encrypted data block into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block includes: Historical normal traffic data is acquired, which includes the transport layer payload and protocol header information of legitimate encrypted traffic and plaintext traffic; after preprocessing the historical normal traffic data, the traffic features of legitimate encrypted traffic and plaintext traffic are extracted and input into an autoencoder network for training, which includes an encoder and a decoder.

[0017] By minimizing the mean square error between the output flow characteristics and the input flow characteristics of the autoencoder network, the parameters of the encoder and decoder are optimized until the reconstruction error of the autoencoder network converges to a stable value; the encoder and decoder of the trained autoencoder network are used as pre-trained baseline models.
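As an illustration only, the training objective and the reconstruction error described above can be written as follows. The notation is assumed here and does not appear in the original text: $x_n \in \mathbb{R}^d$ is the traffic feature vector of the $n$-th historical sample, $f_\theta$ the encoder, $g_\phi$ the decoder, $N$ the number of training samples, and $\tau$ the preset error threshold used in step S3.

```latex
\begin{aligned}
\mathcal{L}(\theta,\phi) &= \frac{1}{N}\sum_{n=1}^{N} \frac{1}{d}\,\bigl\lVert x_n - g_\phi\bigl(f_\theta(x_n)\bigr) \bigr\rVert_2^2
  && \text{(mean square error minimized during training)} \\
e(x) &= \frac{1}{d}\,\bigl\lVert x - g_\phi\bigl(f_\theta(x)\bigr) \bigr\rVert_2^2
  && \text{(reconstruction error of one encrypted data block)} \\
e(x) &> \tau \;\Longrightarrow\; \text{the block is treated as an unexpected encrypted data block}
\end{aligned}
```

Because the model is trained only on historical normal traffic, feature vectors that resemble that traffic should reconstruct with a small $e(x)$, while unfamiliar patterns should reconstruct poorly.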

[0018] Furthermore, the method for generating traffic anomaly alarms includes: Based on the source address, destination address, and port information in the protocol header, a predefined device registry is queried to obtain the device types and security domains of the source and destination devices; based on the device types and security domains, it is determined whether a legitimate communication relationship between the source and destination devices exists in a preset access control policy.

[0019] If the legitimate communication relationship does not exist in the preset access control policy, a comprehensive anomaly score is calculated based on the traffic characteristics of the unexpected encrypted data block; if the comprehensive anomaly score exceeds the preset alarm threshold, a traffic anomaly alarm is generated that includes the source device identifier, the destination device identifier, and the comprehensive anomaly score level, and the encryption behavior pattern of the unexpected encrypted data block is analyzed.
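The registry lookup and policy check described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the registry contents, the policy format, the addresses, the function names, and the two-level score grading ("medium"/"high") are all hypothetical, the registry is keyed by address only (the text also mentions port information), and the comprehensive anomaly score is passed in rather than computed.

```python
from typing import Optional

# Hypothetical device registry: address -> identifier, device type, security domain.
DEVICE_REGISTRY = {
    "10.0.0.5": {"device_id": "EWS-01", "type": "engineering_workstation", "domain": "supervisory"},
    "10.0.0.20": {"device_id": "PLC-07", "type": "plc", "domain": "control"},
    "10.0.0.99": {"device_id": "HIST-01", "type": "historian", "domain": "dmz"},
}

# Hypothetical access control policy: allowed (source, destination) pairs,
# each side described by (device type, security domain).
ACCESS_POLICY = {
    (("engineering_workstation", "supervisory"), ("plc", "control")),
}


def has_legitimate_relationship(src_addr: str, dst_addr: str) -> bool:
    """Return True if the policy allows communication from src to dst."""
    src = DEVICE_REGISTRY.get(src_addr)
    dst = DEVICE_REGISTRY.get(dst_addr)
    if src is None or dst is None:
        return False  # unregistered devices never have a legitimate relationship
    key = ((src["type"], src["domain"]), (dst["type"], dst["domain"]))
    return key in ACCESS_POLICY


def build_alarm(src_addr: str, dst_addr: str, score: float,
                alarm_threshold: float) -> Optional[dict]:
    """Build a traffic anomaly alarm, or return None if no alarm is warranted."""
    if has_legitimate_relationship(src_addr, dst_addr):
        return None
    if score <= alarm_threshold:
        return None
    # Fall back to the raw address when the device is not registered.
    src_id = DEVICE_REGISTRY.get(src_addr, {}).get("device_id", src_addr)
    dst_id = DEVICE_REGISTRY.get(dst_addr, {}).get("device_id", dst_addr)
    # Assumed grading rule: at least twice the threshold counts as "high".
    level = "high" if score >= 2 * alarm_threshold else "medium"
    return {"source_device": src_id, "destination_device": dst_id,
            "score": score, "level": level}
```

For example, `build_alarm("10.0.0.99", "10.0.0.20", 2.5, 1.0)` yields an alarm naming `HIST-01` and `PLC-07`, because the historian-to-PLC pair is absent from the assumed policy and the score exceeds the threshold.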

[0020] Furthermore, the method for analyzing the encryption behavior patterns of unexpected encrypted data blocks includes: The communication behavior pattern features of unexpected encrypted data blocks are extracted, including the target port switching frequency, the variance of the data packet arrival time interval, and the number of session direction abrupt changes.

[0021] The communication behavior pattern features are input into a pre-trained behavior classification model to obtain the probability that the unexpected encrypted data block belongs to malicious encrypted traffic; the behavior classification model is trained based on the feature differences between historical attack traffic data and legitimate encrypted traffic data.

[0022] If the probability of the traffic being malicious exceeds a preset malicious determination threshold, an attack type tag is added to the traffic anomaly alarm, and the attack type is determined by matching the function code in the protocol header information with a preset attack feature library.
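The three communication behavior pattern features named above could be computed roughly as in the sketch below. The `Packet` record, the function name, and the exact feature definitions are assumptions made for illustration: port switching frequency is taken as destination-port changes per second over the session, and a "session direction abrupt change" is taken as a flip between outbound and inbound on consecutive packets. The behavior classification model itself is not specified in the text and is not shown.

```python
from statistics import pvariance
from typing import NamedTuple, Sequence


class Packet(NamedTuple):
    timestamp: float   # seconds since the start of capture
    dst_port: int      # destination (target) port
    outbound: bool     # True if sent by the session initiator


def behaviour_features(packets: Sequence[Packet]) -> dict:
    """Compute the three behavior features for one time-ordered session."""
    if len(packets) < 2:
        return {"port_switch_frequency": 0.0,
                "interarrival_variance": 0.0,
                "direction_changes": 0}
    pairs = list(zip(packets, packets[1:]))  # consecutive packet pairs
    port_switches = sum(a.dst_port != b.dst_port for a, b in pairs)
    direction_changes = sum(a.outbound != b.outbound for a, b in pairs)
    gaps = [b.timestamp - a.timestamp for a, b in pairs]
    duration = packets[-1].timestamp - packets[0].timestamp
    return {
        "port_switch_frequency": port_switches / duration if duration > 0 else 0.0,
        "interarrival_variance": pvariance(gaps),  # population variance of the gaps
        "direction_changes": direction_changes,
    }


session = [
    Packet(0.0, 502, True),
    Packet(1.0, 502, False),
    Packet(2.0, 4444, True),
    Packet(4.0, 4444, True),
]
features = behaviour_features(session)
```

The resulting dictionary would then be turned into a feature vector and passed to whatever classifier has been trained on historical attack and legitimate encrypted traffic.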

[0023] Furthermore, the method for calculating a comprehensive anomaly score based on the traffic characteristics of unexpected encrypted data blocks includes: Obtain the packet length distribution, transmission frequency, and session duration from the unexpected encrypted data block traffic characteristics, and calculate the distribution deviation, frequency deviation, and session deviation by comparing them with the corresponding historical normal traffic characteristics in the pre-trained baseline model.

[0024] The distribution deviation, frequency deviation, and session deviation are weighted and summed according to preset weighting coefficients to obtain a comprehensive anomaly score.
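A minimal sketch of the weighted sum follows. The text does not say how each deviation is measured, so the choices here are assumptions: the packet length distribution deviation is taken as the total variation distance between two normalized histograms, and the frequency and session deviations are taken as relative differences from the baseline value. The function names, bucket labels, and weights are hypothetical.

```python
def distribution_deviation(observed: dict, baseline: dict) -> float:
    """Total variation distance between two normalized histograms (assumed metric)."""
    keys = set(observed) | set(baseline)
    return 0.5 * sum(abs(observed.get(k, 0.0) - baseline.get(k, 0.0)) for k in keys)


def relative_deviation(observed: float, baseline: float) -> float:
    """Absolute difference relative to the baseline value (assumed metric)."""
    return abs(observed - baseline) / baseline if baseline else 0.0


def comprehensive_anomaly_score(deviations: dict, weights: dict) -> float:
    """Weighted sum of the per-dimension deviations."""
    return sum(weights[name] * deviations[name] for name in weights)


deviations = {
    "distribution": distribution_deviation({"small": 0.5, "large": 0.5}, {"small": 1.0}),
    "frequency": relative_deviation(30.0, 10.0),   # packets per second
    "session": relative_deviation(5.0, 10.0),      # session duration in seconds
}
weights = {"distribution": 0.5, "frequency": 0.3, "session": 0.2}  # illustrative weights
score = comprehensive_anomaly_score(deviations, weights)
```

With these illustrative inputs the three deviations are 0.5, 2.0 and 0.5, so the frequency term contributes most to the score despite its smaller weight.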

[0025] In another aspect, based on the same inventive concept, the present invention also provides a control system traffic anomaly detection system based on big data, the system comprising: a network traffic data analysis module, an encrypted data block analysis module, and a traffic anomaly alarm management module, wherein the modules are communicatively connected in sequence. The network traffic data analysis module is used to collect real-time network traffic data of the industrial control system. The network traffic data includes the transport layer payload and the corresponding protocol header information. The module performs protocol parsing on the network traffic data, extracts the function code, source address, destination address and port information from the protocol header information, and identifies whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library.

[0026] The encrypted data block analysis module is used to extract the traffic characteristics of the encrypted data block if it exists. The traffic characteristics include packet length distribution, transmission frequency, entropy value, and session duration. The traffic characteristics of the encrypted data block are then input into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block.

[0027] The traffic anomaly alarm management module is used to determine that the encrypted data block is an unexpected encrypted data block if the reconstruction error exceeds a preset error threshold, and to associate the source device and destination device of the unexpected encrypted data block with the source address, destination address and port information in the protocol header information to generate a traffic anomaly alarm.

(3) Beneficial effects

[0028] Compared with the prior art, the beneficial effects of the present invention are: 1. By collecting real-time network traffic data from industrial control systems and performing deep analysis and encryption identification of transport layer payloads according to a pre-defined plaintext protocol specification library, it is possible to accurately determine whether a data block is encrypted. By extracting the traffic characteristics of encrypted data blocks and inputting them into a pre-trained baseline model to calculate the reconstruction error, unexpected encrypted data blocks can be accurately identified. This effectively improves the accuracy of traffic anomaly detection, avoids false alarms and missed alarms, and provides strong support for the safe and stable operation of industrial control systems.

[0029] 2. Upon detecting unexpected encrypted data blocks, their basic traffic characteristics are extracted, and the legitimacy of the source and destination devices is verified. By calculating a comprehensive anomaly score, the deviation of traffic characteristics across multiple dimensions is weighted and summed to more comprehensively assess the anomaly level of unexpected encrypted data blocks, significantly enhancing the accuracy of traffic anomaly alerts.

[0030] 3. After generating an anomaly traffic alert, further analysis of the encryption behavior patterns of the unexpected encrypted data blocks is performed to extract their communication behavior pattern characteristics. A pre-trained behavior classification model is then used to assess the probability that the traffic belongs to malicious encrypted traffic. If determined to be malicious encrypted traffic, an attack type tag is automatically added, and the specific attack type is determined by matching the function code in the protocol header information against a pre-defined attack signature database. This in-depth threat attribution capability helps to quickly locate the source of the attack, take targeted defensive measures, and effectively curb the spread of security threats.

Attached Figure Description

[0031] Figure 1 is a flowchart of the big data-based control system traffic anomaly detection method of the present invention.

[0032] Figure 2 is a schematic diagram of the module composition of the control system traffic anomaly detection system based on big data according to the present invention.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0034] Example 1: As shown in Figure 1, this embodiment provides a method for detecting abnormal control system traffic based on big data. The method includes: S1. Collect real-time network traffic data of the industrial control system. The network traffic data includes the transport layer payload and the corresponding protocol header information. Perform protocol parsing on the network traffic data, extract the function code, source address, destination address and port information in the protocol header information, and identify whether there are encrypted data blocks in the transport layer payload according to the preset plaintext protocol specification library.

[0035] The method for identifying whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library includes: The expected structure of the transport layer payload is determined by matching the function code and port information in the protocol header with the predefined protocol types in the preset plaintext protocol specification library; the semantic content and encoding rules of preset fields in the transport layer payload are extracted through protocol reverse analysis.

[0036] If the semantic content conforms to the plaintext data format of the expected structure and the encoding rule is consistent with the standard encoding in the plaintext protocol specification library, then it is determined that there is no encrypted data block in the transport layer payload.

[0037] If the semantic content contains a non-standard character set or the encoding rules deviate from the standard encoding, and the entropy value calculated for the transport layer payload exceeds the preset plaintext data entropy value threshold, then it is determined that there is an encrypted data block in the transport layer payload, and further verification is made based on the source address and destination address in the protocol header information to determine whether the encrypted data block conforms to the legal encrypted communication strategy defined in the preset plaintext protocol specification library.

[0038] The method for determining that the entropy value calculated for the transport layer payload exceeds the preset plaintext data entropy threshold includes: The binary data of the transport layer payload is divided into multiple data blocks according to a preset block size, and the Shannon entropy value of the byte values of each data block is calculated; the average value and variance of the Shannon entropy values of the multiple data blocks are calculated.

[0039] If the average value of the Shannon entropy exceeds a preset plaintext data entropy threshold and the variance of the Shannon entropy is lower than a preset variance threshold, then the entropy value of the transport layer payload is determined to exceed the preset plaintext data entropy threshold.

[0040] If the average value of the Shannon entropy does not exceed the preset plaintext data entropy threshold, but the difference between the average value of the Shannon entropy and the preset plaintext data entropy threshold is within the preset fuzzy difference range, then the high-frequency component energy ratio of the transport layer payload is extracted through frequency domain analysis.

[0041] The method for extracting the high-frequency component energy ratio of the transport layer payload through frequency domain analysis includes: The binary data of the transport layer payload is converted into a decimal numerical sequence and subjected to a fast Fourier transform to obtain the frequency domain energy distribution. The frequency range of the high-frequency component is determined according to the predefined plaintext data frequency domain characteristics in the preset plaintext protocol specification library, and the proportion of the frequency domain energy within the frequency range of the high-frequency component to the total energy of the frequency domain energy distribution is calculated and recorded as the high-frequency component energy ratio.

[0042] When the high-frequency component energy ratio exceeds a preset energy ratio threshold, it is determined that the entropy value of the transport layer payload exceeds the preset plaintext data entropy threshold.

[0043] The method for further verifying whether the encrypted data block conforms to the legal encrypted communication strategy defined in the preset plaintext protocol specification library based on the source address and destination address in the protocol header information includes: Based on the source address and destination address in the protocol header information, a predefined whitelist of legal encrypted communication is queried from a preset plaintext protocol specification library. The whitelist of legal encrypted communication contains authorized combinations of source and destination addresses that are allowed to perform encrypted communication.

[0044] If the matching result of the source address and destination address exists in the legal encrypted communication whitelist, then the certificate fingerprint information of the encrypted data block is extracted and compared with the certificate fingerprint information pre-stored in the legal encrypted communication whitelist. When the certificate fingerprint information matches, then it is verified whether the communication timestamp of the encrypted data block is within the valid time window corresponding to the authorized combination. If the communication timestamp is within the valid time window, then it is determined that the encrypted data block conforms to the legal encrypted communication policy.

[0045] For example, network traffic in industrial control systems is continuously generated by a large number of heterogeneous devices, including programmable logic controllers, distributed control system master stations, engineering workstations, and historical data servers. Communication between these devices generally follows industry-specific protocols such as Modbus TCP, DNP3, or IEC 61850. In actual deployment scenarios, network traffic data is typically continuously collected by network probes installed at key nodes of the industrial network or by mirror ports of industrial firewalls. The collected data includes the complete transport layer payload and the corresponding protocol header information. The protocol header information contains four core fields: the function code, which identifies the type of operation requested by the data packet (e.g., function code 0x03 in the Modbus protocol represents a read holding register operation); the port information, which distinguishes the protocol type (e.g., Modbus TCP uses port 502 by default, and DNP3 uses port 20000 by default); and the source and destination addresses, which identify the devices initiating and receiving communication in the network, respectively. After parsing the collected network traffic data and extracting the four types of fields, the transport layer payload is then analyzed layer by layer according to a pre-defined plaintext protocol specification library to determine whether encrypted data blocks exist within the transport layer payload.
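To make the field extraction concrete, the sketch below parses a Modbus TCP frame using the standard 7-byte MBAP header (transaction identifier, protocol identifier, length, unit identifier) followed by the one-byte function code. It is an illustrative sketch only: the `ParsedHeader` type, the function name, and the example addresses are hypothetical, and the source and destination addresses are assumed to be supplied by whatever capture layer decoded the IP and TCP headers.

```python
import struct
from dataclasses import dataclass


@dataclass
class ParsedHeader:
    src_addr: str
    dst_addr: str
    dst_port: int
    function_code: int
    payload: bytes  # bytes that follow the function code


def parse_modbus_tcp(src_addr: str, dst_addr: str, dst_port: int,
                     tcp_payload: bytes) -> ParsedHeader:
    """Extract the function code from a Modbus TCP application data unit."""
    if len(tcp_payload) < 8:
        raise ValueError("payload too short for an MBAP header and function code")
    # Big-endian: transaction id (2), protocol id (2), length (2), unit id (1), function code (1)
    _transaction_id, protocol_id, _length, _unit_id, function_code = struct.unpack(
        ">HHHBB", tcp_payload[:8])
    if protocol_id != 0:
        raise ValueError("protocol identifier is not 0, so this is not Modbus")
    return ParsedHeader(src_addr, dst_addr, dst_port, function_code, tcp_payload[8:])


# Read holding registers request: start address 0x006B, quantity 3.
frame = bytes.fromhex("000100000006" "01" "03" "006b" "0003")
header = parse_modbus_tcp("10.0.0.5", "10.0.0.20", 502, frame)
```

Here `header.function_code` is `0x03` and `header.payload` holds the four bytes that the later structure check would inspect.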

[0046] The plaintext protocol specification library is the core reference benchmark of this method. It is pre-configured by the system administrator based on the actual network topology and covers the protocol type definitions, field structure specifications, standard encoding rules, legal encrypted communication whitelists, and corresponding certificate fingerprint information and valid time windows for common plaintext protocols in industrial control systems. When judging the transport layer payload, the predefined protocol type is first matched against the function code and port information in the protocol header to determine the expected structure of the transport layer payload. Taking the Modbus TCP protocol as an example, when the port information is 502 and the function code is 0x03, the plaintext protocol specification library identifies the payload as a read holding register request message. The expected structure is a fixed-length register start address field and register quantity field, the content of which should consist of decimal integer encoding and should not contain unprintable bytes. After determining the expected structure, the transport layer payload is parsed segment by segment according to the field offsets and field lengths defined in the plaintext protocol specification library through protocol reverse analysis. The semantic content of each preset field is extracted, and the encoding rules are verified to be consistent with the standard encoding in the plaintext protocol specification library.
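The lookup and segment-by-segment parsing described above might look like the following sketch. The library layout, the `FieldSpec` type, and the function name are assumptions made for illustration; the text does not define a storage format. The two 2-byte fields are decoded here as big-endian unsigned integers, which is how standard Modbus transmits register addresses and quantities.

```python
from typing import NamedTuple, Optional, Tuple


class FieldSpec(NamedTuple):
    name: str
    offset: int  # byte offset within the payload after the function code
    length: int  # field length in bytes


# Hypothetical specification library keyed by (port, function code).
SPEC_LIBRARY = {
    (502, 0x03): ("read_holding_registers_request", [
        FieldSpec("start_address", 0, 2),
        FieldSpec("register_quantity", 2, 2),
    ]),
}


def extract_fields(port: int, function_code: int,
                   payload: bytes) -> Optional[Tuple[str, Optional[dict]]]:
    """Match the expected structure, then slice the payload field by field."""
    entry = SPEC_LIBRARY.get((port, function_code))
    if entry is None:
        return None  # no predefined protocol type matches
    message_type, fields = entry
    expected_length = max(f.offset + f.length for f in fields)
    if len(payload) != expected_length:
        return message_type, None  # payload does not fit the expected structure
    values = {
        f.name: int.from_bytes(payload[f.offset:f.offset + f.length], "big")
        for f in fields
    }
    return message_type, values


result = extract_fields(502, 0x03, bytes.fromhex("006b0003"))
```

A `None` in place of the field values signals a structural mismatch, which is the case that would trigger the entropy calculation in the following paragraphs.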

[0047] If the extracted semantic content fully conforms to the plaintext data format of the expected structure, and the encoding rules are consistent with the standard encoding in the plaintext protocol specification library, then it is determined that there is no encrypted data block in the transport layer payload, and the data packet is processed as normal plaintext traffic and does not enter the subsequent detection stage.

[0048] If the extracted semantic content contains non-standard character sets (e.g., a large number of non-printable bytes with values exceeding 127 appearing in a field that should be ASCII encoded), or if the encoding rules deviate significantly from standard encoding, then the Shannon entropy value is further calculated on the transport layer payload to quantify the randomness of the payload content, thereby helping to determine whether the payload has been encrypted. The specific calculation method is as follows: the binary data of the transport layer payload is evenly divided into several data blocks according to a preset block size (e.g., 64 bytes per block). For each data block, the frequency of each byte value (ranging from 0 to 255) is counted, and the Shannon entropy value of the data block is calculated using the Shannon entropy formula $H = -\sum_{i=0}^{255} p_i \log_2 p_i$, where $p_i$ represents the frequency of occurrence of byte value $i$ in the current data block. After calculating the Shannon entropy value for each data block of the transport layer payload, the average Shannon entropy value $\bar{H}$ of all data blocks and the variance of the Shannon entropy $\sigma_H^2$ are calculated. If $\bar{H}$ exceeds the preset plaintext data entropy threshold and $\sigma_H^2$ is below the preset variance threshold, the entropy value of the transport layer payload is determined to exceed the plaintext data entropy threshold, indicating that the payload is highly random and contains encrypted data blocks. The variance constraint is added here because the byte distribution of normal encrypted data should be highly uniform across data blocks, and the differences between the Shannon entropy values of each data block should be small. If the variance is too large, it indicates that the transport layer payload may be a mixture of plaintext and ciphertext, rather than being entirely encrypted, requiring careful consideration to avoid misjudgment.
The preset plaintext data entropy threshold can be determined statistically based on the distribution of Shannon entropy values in historical normal plaintext traffic. Typically, the average Shannon entropy value of historical plaintext traffic plus twice the standard deviation is taken as the threshold to cover a reasonable fluctuation range for normal plaintext data.
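The block-wise entropy test and the statistical threshold described in this paragraph can be sketched as follows. The function names and the example threshold values are illustrative assumptions, and dropping an incomplete trailing block is a simplification chosen here so that all blocks are the same size.

```python
import math
from collections import Counter
from statistics import mean, pstdev, pvariance


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of the byte values in one block, in bits."""
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def block_entropies(payload: bytes, block_size: int = 64) -> list:
    """Entropy of each full block; an incomplete trailing block is ignored."""
    return [
        shannon_entropy(payload[i:i + block_size])
        for i in range(0, len(payload) - block_size + 1, block_size)
    ]


def entropy_exceeds(payload: bytes, entropy_threshold: float,
                    variance_threshold: float, block_size: int = 64) -> bool:
    """True if the mean block entropy is high and the block entropies are uniform."""
    entropies = block_entropies(payload, block_size)
    if not entropies:
        return False
    return (mean(entropies) > entropy_threshold
            and pvariance(entropies) < variance_threshold)


def plaintext_entropy_threshold(historical_entropies: list) -> float:
    """Mean plus twice the standard deviation of historical plaintext entropies."""
    return mean(historical_entropies) + 2 * pstdev(historical_entropies)


uniform_payload = bytes(range(256))                  # every block has 64 distinct bytes
mixed_payload = bytes(64) + bytes(range(64))         # one constant block, one varied block
looks_encrypted = entropy_exceeds(uniform_payload, 5.0, 0.01)
looks_mixed = entropy_exceeds(mixed_payload, 5.0, 0.01)
```

One design point worth noting: a block of 64 bytes can contain at most 64 distinct values, so its entropy cannot exceed $\log_2 64 = 6$ bits. Thresholds therefore have to be calibrated for the chosen block size rather than against the 8-bit maximum of an unbounded byte stream.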

[0049] Under specific operating modes, the Shannon entropy of the transport layer payload of some industrial protocols may fall between that of typical plaintext and typical ciphertext, so a clear judgment is difficult to make from the average Shannon entropy alone. Therefore, when the average Shannon entropy does not exceed the preset plaintext data entropy threshold, but the difference between the two falls within the preset fuzzy difference range (e.g., the absolute value of the difference is between 0.2 and 0.5 bits), frequency domain analysis is introduced as an auxiliary criterion. The frequency domain analysis is implemented as follows. The binary data of the transport layer payload is converted into a corresponding decimal numerical sequence, and a Fast Fourier Transform (FFT) is performed on this sequence to obtain the frequency domain energy distribution. The frequency range of the high-frequency components is determined from the predefined plaintext data frequency domain characteristics in the plaintext protocol specification library. Plaintext industrial protocol data, with its regularly repeating field structure, typically concentrates energy in the low-frequency band, whereas encrypted data, whose byte values are distributed uniformly and randomly, spreads energy more evenly across the whole band and therefore shows a significantly higher proportion of high-frequency energy. The ratio of the frequency domain energy within the high-frequency range to the total frequency domain energy is calculated and denoted as the high-frequency component energy percentage. When the high-frequency component energy percentage exceeds a preset energy percentage threshold, the entropy value of the transport layer payload is deemed to exceed the preset plaintext data entropy threshold, which confirms the existence of an encrypted data block, and processing proceeds to the subsequent verification stage.
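The high-frequency energy percentage described above can be illustrated with the sketch below. It uses a naive discrete Fourier transform from the standard library in place of an FFT (both give the same spectrum, the FFT is simply faster). The `cutoff_fraction` parameter and the removal of the mean before transforming are assumptions of this example; in the described method the high-frequency range comes from the plaintext protocol specification library.

```python
import cmath
import math


def high_freq_energy_ratio(payload: bytes, cutoff_fraction: float = 0.5) -> float:
    """Share of spectral energy above a cutoff for the byte sequence of `payload`.

    cutoff_fraction is a hypothetical parameter: the fraction of the one-sided
    spectrum (DC term excluded) that is treated as the low-frequency band.
    """
    n = len(payload)
    if n < 4:
        return 0.0
    mean = sum(payload) / n
    samples = [b - mean for b in payload]  # remove the DC offset (assumption)
    half = n // 2
    energies = []
    # Naive O(n^2) DFT over bins 1..half of the one-sided spectrum.
    for k in range(1, half + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        energies.append(abs(coeff) ** 2)
    total = sum(energies)
    if total == 0:
        return 0.0
    cutoff = int(len(energies) * cutoff_fraction)
    return sum(energies[cutoff:]) / total


# A slowly varying payload (one sine cycle) versus a rapidly alternating one.
smooth = bytes(int(128 + 100 * math.sin(2 * math.pi * t / 64)) for t in range(64))
alternating = b"\x00\xff" * 32
```

The smooth payload keeps almost all of its energy in the lowest bin, while the alternating payload concentrates it at the highest bin, so the two ratios fall near 0 and near 1 respectively.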

[0050] After confirming the presence of encrypted data blocks through entropy analysis or frequency domain analysis, the encrypted data blocks are not directly marked as abnormal. Instead, the system first queries a predefined whitelist of legitimate encrypted communication in the plaintext protocol specification library based on the source and destination addresses in the protocol header information to verify the legitimacy of the encrypted communication behavior. The whitelist of legitimate encrypted communication records all authorized combinations of source and destination addresses that can execute encrypted communication in the industrial control system, as well as the certificate fingerprint information and valid time window corresponding to each authorized combination. If the combination of source and destination addresses exists in the whitelist of legitimate encrypted communication, the certificate fingerprint information carried in the encrypted data block is further extracted and compared byte by byte with the certificate fingerprint information pre-stored in the whitelist of legitimate encrypted communication. When the certificate fingerprint information comparison result is consistent, the system continues to verify whether the communication timestamp of the encrypted data block is within the valid time window corresponding to the authorized combination. When the above three verifications—address combination verification, certificate fingerprint comparison, and timestamp verification—pass, the encrypted data block is determined to comply with the legitimate encrypted communication strategy and is processed as legitimate encrypted traffic, without proceeding to the subsequent anomaly detection stage. 
If the combination of the source address and destination address is not in the whitelist of legitimate encrypted communication, or if the certificate fingerprint information does not match, or if the communication timestamp is not within the valid time window, then the encrypted data block will enter the baseline model reconstruction error calculation stage and be further analyzed as a suspected unexpected encrypted data block.
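The three-step legitimacy check above can be sketched as follows. The `WhitelistEntry` structure, the dictionary keyed by address pair, and the sample addresses and fingerprint are hypothetical and serve only to show the order of the checks.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WhitelistEntry:
    cert_fingerprint: bytes
    valid_from: datetime
    valid_until: datetime


def is_legitimate_encrypted(
    whitelist: dict[tuple[str, str], WhitelistEntry],
    src: str,
    dst: str,
    fingerprint: bytes,
    timestamp: datetime,
) -> bool:
    """Return True only if all three checks pass, in the order given in the text."""
    entry = whitelist.get((src, dst))
    if entry is None:  # 1. address combination not authorized
        return False
    if fingerprint != entry.cert_fingerprint:  # 2. fingerprint mismatch
        return False
    # 3. timestamp must fall inside the valid time window
    return entry.valid_from <= timestamp <= entry.valid_until


# Hypothetical whitelist with a single authorized address pair.
whitelist = {
    ("10.0.0.5", "10.0.0.9"): WhitelistEntry(
        cert_fingerprint=bytes.fromhex("ab" * 32),
        valid_from=datetime(2026, 1, 1),
        valid_until=datetime(2026, 12, 31),
    )
}
```

A block for which this function returns False is not flagged as anomalous by itself; as described above, it is passed on to the baseline model for reconstruction error calculation.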

[0051] S2. If an encrypted data block exists, extract the traffic characteristics of the encrypted data block, including packet length distribution, transmission frequency, entropy value and session duration; input the traffic characteristics of the encrypted data block into the pre-trained baseline model to calculate the reconstruction error of the encrypted data block.

[0052] The method for inputting the traffic characteristics of the encrypted data block into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block includes: Historical normal traffic data is acquired, which includes the transport layer payload and protocol header information of legitimate encrypted traffic and plaintext traffic; after preprocessing the historical normal traffic data, the traffic features of legitimate encrypted traffic and plaintext traffic are extracted and input into an autoencoder network for training, which includes an encoder and a decoder.

[0053] The parameters of the encoder and decoder are optimized by minimizing the mean square error between the output flow features and the input flow features of the autoencoder network until the reconstruction error of the autoencoder network converges to a stable value; the encoder and decoder of the trained autoencoder network are used as the pre-trained baseline model.

[0054] For example, after entering the baseline model analysis stage, four types of traffic features are extracted from the encrypted data block: packet length distribution reflects the size characteristics of the encrypted payload, usually described by the statistical distribution of each packet length value, such as the mean and frequency distribution of each interval; transmission frequency represents the number of packets sent per unit time; entropy value is the average Shannon entropy value calculated in the previous steps; session duration represents the complete time span from connection establishment to connection closure. These four types of traffic features are used to construct a feature vector, which is then input into the pre-trained baseline model to calculate the reconstruction error of the encrypted data block.
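One way to assemble the four types of traffic features into a feature vector is sketched below. The `Packet` structure, the length bins, and the approximation of session duration by the span between the first and last packet are assumptions of this example, not details given in the text.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp: float  # seconds since the start of capture
    length: int       # payload length in bytes


def build_feature_vector(packets, mean_entropy, bins=(0, 64, 256, 1024)):
    """Return [mean length, share per length bin..., packets per second, entropy, duration]."""
    if not packets:
        raise ValueError("at least one packet is required")
    lengths = [p.length for p in packets]
    # Session duration approximated by first-to-last packet time (assumption).
    duration = packets[-1].timestamp - packets[0].timestamp
    mean_len = sum(lengths) / len(lengths)
    shares = []
    # Bin i covers [bins[i], bins[i+1]); the last bin is open-ended.
    for i, low in enumerate(bins):
        high = bins[i + 1] if i + 1 < len(bins) else float("inf")
        shares.append(sum(low <= n < high for n in lengths) / len(lengths))
    rate = len(packets) / duration if duration > 0 else 0.0
    return [mean_len, *shares, rate, mean_entropy, duration]


# Hypothetical five-packet session with an average block entropy of 7.5 bits.
pkts = [Packet(0.0, 60), Packet(1.0, 60), Packet(2.0, 100), Packet(3.0, 300), Packet(4.0, 2000)]
vec = build_feature_vector(pkts, mean_entropy=7.5)
```

In practice the features would likely be normalized before being fed to the model, since packet lengths and entropy values live on very different scales.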

[0055] The pre-trained baseline model is implemented using an autoencoder network architecture. The autoencoder network consists of an encoder and a decoder. The encoder maps the input traffic feature vector to a low-dimensional latent space representation, and the decoder then restores the latent space representation to an output vector of the same dimension as the input feature vector. During the training phase, historical normal traffic data (covering the transport layer payload and protocol header information of both legitimate encrypted and plaintext traffic) is used to extract corresponding traffic features as training samples. The parameters of the encoder and decoder are optimized by minimizing the mean squared error between the input and output feature vectors. The mean squared error is calculated as $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2$, where $x_i$ is the $i$-th component of the input traffic feature vector, $\hat{x}_i$ is the corresponding reconstructed value output by the autoencoder, and $n$ is the dimension of the feature vector. Training is complete when the reconstruction error of the autoencoder network on the training set converges to a stable value. The encoder and decoder of the trained autoencoder network are then saved as the pre-trained baseline model. For encrypted data blocks that conform to historical normal traffic patterns, the autoencoder network can accurately reconstruct the input feature vector with a small reconstruction error. However, for unexpected encrypted data blocks whose traffic patterns deviate from historical normal patterns, the reconstruction error of the autoencoder network will be significantly higher because their traffic pattern was not fully learned during training. This allows for the quantitative differentiation of unexpected encrypted data blocks.
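The reconstruction error can be illustrated with a deliberately small stand-in for the baseline model. The sketch below uses a linear autoencoder with tied weights whose single latent direction is set by hand; a real baseline model would be trained on historical traffic and would likely be nonlinear, so the weights and the sample vectors here are purely illustrative assumptions.

```python
def encode(x, weights):
    """Encoder: project the feature vector onto each latent direction."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def decode(z, weights):
    """Decoder (tied weights): map the latent code back to the input dimension."""
    n = len(weights[0])
    return [sum(z_j * weights[j][i] for j, z_j in enumerate(z)) for i in range(n)]


def reconstruction_error(x, weights):
    """Mean squared error between x and its reconstruction."""
    x_hat = decode(encode(x, weights), weights)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Hand-set unit direction standing in for trained parameters (assumption).
W = [[0.6, 0.8, 0.0]]

normal = [3.0, 4.0, 0.0]      # lies along the direction the model represents
unexpected = [3.0, 4.0, 5.0]  # deviates in a dimension the model cannot represent
```

The first vector is reconstructed almost exactly, while the second loses its third component and yields an error of about 8.33, which is the kind of gap a preset error threshold is meant to separate.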

[0056] S3. If the reconstruction error exceeds the preset error threshold, the encrypted data block is determined to be an unexpected encrypted data block, and the source device and destination device of the unexpected encrypted data block are associated according to the source address, destination address and port information in the protocol header information, and a traffic anomaly alarm is generated.

[0057] The method for generating traffic anomaly alarms includes: Based on the source address, destination address, and port information in the protocol header, a predefined device registry is queried to obtain the device types and security domains of the source and destination devices; based on the device types and security domains, it is determined whether a legitimate communication relationship between the source and destination devices exists in a preset access control policy.

[0058] If the legitimate communication relationship does not exist in the preset access control policy, a comprehensive anomaly score is calculated based on the traffic characteristics of the unexpected encrypted data block.

[0059] The method for calculating a comprehensive anomaly score based on the traffic characteristics of unexpected encrypted data blocks includes: Obtain the packet length distribution, transmission frequency, and session duration from the unexpected encrypted data block traffic characteristics, and calculate the distribution deviation, frequency deviation, and session deviation by comparing them with the corresponding historical normal traffic characteristics in the pre-trained baseline model.

[0060] The distribution deviation, frequency deviation, and session deviation are weighted and summed according to preset weighting coefficients to obtain a comprehensive anomaly score.

[0061] If the comprehensive anomaly score exceeds the preset alarm threshold, a traffic anomaly alarm is generated, which includes the source device identifier, the destination device identifier, and the comprehensive anomaly score level, and the encryption behavior pattern of the unexpected encrypted data blocks is analyzed.

[0062] The method for analyzing the encryption behavior patterns of unexpected encrypted data blocks includes: The communication behavior pattern features of unexpected encrypted data blocks are extracted, including the target port switching frequency, the variance of the data packet arrival time interval, and the number of session direction abrupt changes.

[0063] The communication behavior pattern features are input into a pre-trained behavior classification model to obtain the probability that the unexpected encrypted data block belongs to malicious encrypted traffic; the behavior classification model is trained based on the feature differences between historical attack traffic data and legitimate encrypted traffic data.

[0064] If the probability of the traffic being malicious exceeds a preset malicious determination threshold, an attack type tag is added to the traffic anomaly alarm, and the attack type is determined by matching the function code in the protocol header information with a preset attack feature library.

[0065] For example, if the reconstruction error of an encrypted data block calculated by the baseline model exceeds a preset error threshold, the encrypted data block is determined to be an unexpected encrypted data block. The preset error threshold is determined by statistically analyzing the reconstruction error distribution of historical normal traffic on the baseline model, taking the mean plus a certain number of standard deviations as the threshold, so that the false positive rate for normal traffic stays within an acceptable range. After an unexpected encrypted data block has been identified, a predefined device registry is queried, based on the source address, destination address, and port information in the protocol header, to obtain the device type (e.g., programmable logic controller, engineering station, historical data server) and security domain (e.g., control zone, monitoring zone, factory information zone) of the source and destination devices. It is then determined whether the communication relationship between the source and destination devices exists in the preset access control policy. The access control policy is pre-configured by the system administrator according to the security zoning principle of the industrial control system and defines the allowed communication relationships between different device types and security domains. For example, a programmable logic controller in the control zone is usually only allowed to communicate with the master station of the distributed control system in the same zone, and is not allowed to establish a session directly with a server in the factory information zone.
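The registry lookup and policy check can be sketched as follows. The registry contents, the addresses, the device type names, and the decision to treat unregistered devices as not allowed are all assumptions of this example; for brevity the lookup is keyed by address only, whereas the text also uses port information.

```python
from typing import NamedTuple


class Device(NamedTuple):
    device_type: str
    security_domain: str


# Hypothetical device registry keyed by address.
REGISTRY = {
    "10.0.1.10": Device("plc", "control"),
    "10.0.1.20": Device("dcs_master", "control"),
    "10.0.3.30": Device("server", "factory_info"),
}

# Hypothetical access control policy: allowed (source, destination) pairs.
POLICY = {
    (Device("plc", "control"), Device("dcs_master", "control")),
}


def communication_allowed(src_addr: str, dst_addr: str) -> bool:
    """Check whether the device pair behind two addresses is allowed by the policy."""
    src = REGISTRY.get(src_addr)
    dst = REGISTRY.get(dst_addr)
    if src is None or dst is None:
        return False  # unregistered device: treated as not allowed (assumption)
    return (src, dst) in POLICY
```

Here the controller may talk to the master station in its own zone, while a session from the controller to the factory information server is rejected and would proceed to anomaly scoring.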

[0066] If the communication relationship between the source and destination devices is not included in the access control policy, a comprehensive anomaly score is calculated from the traffic characteristics of the unexpected encrypted data block. The packet length distribution, transmission frequency, and session duration are extracted from those traffic characteristics and compared with the corresponding historical normal traffic characteristics in the pre-trained baseline model, which yields three sub-scores: distribution deviation, frequency deviation, and session deviation. All three deviations are calculated using the Z-score method. Taking the distribution deviation as an example, the formula is $D_{\text{dist}} = \frac{x - \mu}{\sigma}$, where $x$ is the statistical value of the packet length distribution of the current unexpected encrypted data block, and $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding feature in historical normal traffic. The frequency deviation $D_{\text{freq}}$ and the session deviation $D_{\text{sess}}$ are calculated in the same way. The three deviations are then weighted and summed according to the preset weighting coefficients $w_1$, $w_2$, and $w_3$ to obtain the comprehensive anomaly score $S = w_1 D_{\text{dist}} + w_2 D_{\text{freq}} + w_3 D_{\text{sess}}$. The weighting coefficients sum to 1, and their specific values are pre-configured by the system administrator based on the importance of each feature to business security.
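A worked example of the Z-score deviations and the weighted sum is sketched below. Taking the absolute value of each Z-score, so that deviations in either direction raise the score, is an assumption of this example, as are the sample statistics and weights.

```python
def z_deviation(value: float, mean: float, std: float) -> float:
    """Absolute Z-score of `value` against a historical mean and standard deviation."""
    if std == 0:
        return 0.0 if value == mean else float("inf")
    return abs(value - mean) / std


def comprehensive_score(deviations, weights):
    """Weighted sum of the deviations; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * d for w, d in zip(weights, deviations))


# Hypothetical current values and historical (mean, std) pairs.
d_dist = z_deviation(900.0, mean=300.0, std=200.0)  # packet length statistic
d_freq = z_deviation(50.0, mean=10.0, std=10.0)     # packets per second
d_sess = z_deviation(120.0, mean=60.0, std=60.0)    # session duration in seconds

score = comprehensive_score([d_dist, d_freq, d_sess], weights=[0.5, 0.3, 0.2])
```

The three deviations come out as 3, 4, and 1, so the score is 0.5 × 3 + 0.3 × 4 + 0.2 × 1 = 2.9, which would then be compared with the preset alarm threshold.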

[0067] If the comprehensive anomaly score exceeds the preset alarm threshold, a traffic anomaly alarm is generated, including the source device identifier, the destination device identifier, and the comprehensive anomaly score level. At the same time, an encryption behavior pattern analysis of the unexpected encrypted data block is triggered. This analysis extracts three types of communication behavior pattern features. The target port switching frequency records the number and frequency of destination port changes during the session; normal industrial control communication typically uses a fixed port, and frequent switching of the destination port is a typical characteristic of tunneling attacks or lateral movement attacks. The variance of the data packet arrival time interval reflects the stability of the packet sending rhythm; normal industrial control communication is polled periodically and therefore has a relatively small interval variance, while malicious traffic is often bursty or random and has a significantly larger one. The number of session direction abrupt changes counts how many times the communication direction reverses within a complete session; frequent abnormal reversals of bidirectional traffic may indicate a command and control channel. These three types of features form a behavior feature vector, which is input into a pre-trained behavior classification model that outputs the probability that the unexpected encrypted data block belongs to malicious encrypted traffic. The behavior classification model is trained on the feature differences between historical attack traffic data and legitimate encrypted traffic data: the behavior feature vectors of historical attack traffic serve as positive samples and those of legitimate encrypted traffic as negative samples, and supervised classification training lets the model learn the decision boundary between the two types of traffic. If the output probability exceeds a preset malicious determination threshold, an attack type label is added to the traffic anomaly alarm. The specific attack type is determined by matching the function code in the protocol header information against a preset attack feature library. Taking the Modbus TCP protocol as an example, if the function code is 0x06 (write single register) and the communication behavior pattern matches the parameter tampering attack feature defined in the attack feature library, the attack type is marked as a register parameter tampering attack, and the attack type label is written into the traffic anomaly alarm, providing a clear basis for subsequent emergency response and attack tracing.
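The three communication behavior pattern features can be extracted along the following lines. The `Pkt` structure, the expression of switching frequency as switches per second, and the sample sessions are assumptions of this example; the classification model itself is not shown.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Pkt:
    timestamp: float   # seconds since the start of the session
    target_port: int   # port on the destination device
    outbound: bool     # True if sent by the session initiator


def behavior_features(packets):
    """Return [port switches per second, inter-arrival variance, direction changes]."""
    pairs = list(zip(packets, packets[1:]))
    duration = packets[-1].timestamp - packets[0].timestamp
    port_switches = sum(a.target_port != b.target_port for a, b in pairs)
    gaps = [b.timestamp - a.timestamp for a, b in pairs]
    gap_variance = statistics.pvariance(gaps) if gaps else 0.0
    direction_changes = sum(a.outbound != b.outbound for a, b in pairs)
    switch_rate = port_switches / duration if duration > 0 else 0.0
    return [switch_rate, gap_variance, direction_changes]


# Hypothetical sessions: regular polling on one port versus erratic port hopping.
polling = [Pkt(t, 502, t % 2 == 0) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
erratic = [Pkt(0.0, 502, True), Pkt(0.5, 8080, True), Pkt(3.0, 4444, True), Pkt(4.0, 502, True)]
```

The polling session shows no port switching and zero interval variance, while the erratic session switches ports three times in four seconds with clearly uneven gaps; vectors like these would be the input to the behavior classification model.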

[0068] Example 2: Based on the same inventive concept, as shown in Figure 2, this embodiment also provides a control system traffic anomaly detection system based on big data. The system includes a network traffic data analysis module, an encrypted data block analysis module, and a traffic anomaly alarm management module, with the modules communicating with one another in sequence. The network traffic data analysis module is used to collect real-time network traffic data of the industrial control system. The network traffic data includes the transport layer payload and the corresponding protocol header information. The module performs protocol parsing on the network traffic data, extracts the function code, source address, destination address, and port information from the protocol header information, and identifies whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library.

[0069] The encrypted data block analysis module is used to extract the traffic characteristics of the encrypted data block if it exists. The traffic characteristics include packet length distribution, transmission frequency, entropy value and session duration. The traffic characteristics of the encrypted data block are input into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block.

[0070] The traffic anomaly alarm management module is used to determine that the encrypted data block is an unexpected encrypted data block if the reconstruction error exceeds a preset error threshold, and to associate the source device and destination device of the unexpected encrypted data block with the source address, destination address and port information in the protocol header information to generate a traffic anomaly alarm.

[0071] It should be noted that the specific methods by which each module performs operations in the system described in the above embodiments have been described in detail in the embodiments related to the method, and will not be elaborated here.

[0072] Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for detecting abnormal control system traffic based on big data, characterized in that the method includes: collecting real-time network traffic data from an industrial control system, the network traffic data including the transport layer payload and the corresponding protocol header information; performing protocol parsing on the network traffic data, extracting the function code, source address, destination address, and port information from the protocol header information, and identifying whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library; if an encrypted data block exists, extracting the traffic characteristics of the encrypted data block, the traffic characteristics including packet length distribution, transmission frequency, entropy value, and session duration, and inputting the traffic characteristics of the encrypted data block into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block; and if the reconstruction error exceeds a preset error threshold, determining that the encrypted data block is an unexpected encrypted data block, associating the source device and destination device of the unexpected encrypted data block according to the source address, destination address, and port information in the protocol header information, and generating a traffic anomaly alarm.

2. The method for detecting abnormal control system traffic based on big data according to claim 1, characterized in that, The method for identifying whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library includes: Based on the function code and port information in the protocol header, the expected structure of the transport layer payload is determined by matching the predefined protocol types in the preset plaintext protocol specification library; the semantic content and encoding rules of preset fields in the transport layer payload are extracted through protocol reverse analysis. If the semantic content conforms to the plaintext data format of the expected structure and the encoding rule is consistent with the standard encoding in the plaintext protocol specification library, then it is determined that there is no encrypted data block in the transport layer payload; If the semantic content contains a non-standard character set or the encoding rules deviate from the standard encoding, and the entropy value calculated for the transport layer payload exceeds the preset plaintext data entropy value threshold, then it is determined that there is an encrypted data block in the transport layer payload, and further verification is made based on the source address and destination address in the protocol header information to determine whether the encrypted data block conforms to the legal encrypted communication strategy defined in the preset plaintext protocol specification library.

3. The method for detecting abnormal control system traffic based on big data according to claim 2, characterized in that, the method for determining that the entropy value calculated for the transport layer payload exceeds the preset plaintext data entropy threshold includes: The binary data of the transport layer payload is divided into multiple data blocks according to a preset block size, and the Shannon entropy of the byte values is calculated for each data block; the average value and variance of the Shannon entropy values of the multiple data blocks are calculated. If the average Shannon entropy exceeds the preset plaintext data entropy threshold and the variance of the Shannon entropy is lower than a preset variance threshold, then the entropy value of the transport layer payload is determined to exceed the preset plaintext data entropy threshold. If the average Shannon entropy does not exceed the preset plaintext data entropy threshold, but the difference between the average Shannon entropy and the preset plaintext data entropy threshold is within the preset fuzzy difference range, then the high-frequency component energy ratio of the transport layer payload is extracted by frequency domain analysis; when the high-frequency component energy ratio exceeds the preset energy ratio threshold, it is determined that the entropy value of the transport layer payload exceeds the preset plaintext data entropy threshold.

4. The method for detecting abnormal control system traffic based on big data according to claim 3, characterized in that, The method for extracting the high-frequency component energy ratio of the transport layer payload through frequency domain analysis includes: The binary data of the transport layer payload is converted into a decimal numerical sequence and subjected to a fast Fourier transform to obtain the frequency domain energy distribution. The frequency range of the high-frequency component is determined according to the predefined plaintext data frequency domain characteristics in the preset plaintext protocol specification library, and the proportion of the frequency domain energy within the frequency range of the high-frequency component to the total energy of the frequency domain energy distribution is calculated and recorded as the high-frequency component energy ratio.

5. The method for detecting abnormal control system traffic based on big data according to claim 2, characterized in that, The method for further verifying whether the encrypted data block conforms to the legal encrypted communication strategy defined in the preset plaintext protocol specification library based on the source address and destination address in the protocol header information includes: Based on the source address and destination address in the protocol header information, a predefined whitelist of legal encrypted communication is queried from a pre-defined plaintext protocol specification library. The whitelist of legal encrypted communication contains authorized combinations of source and destination addresses that are allowed to perform encrypted communication. If the matching result of the source address and destination address exists in the legal encrypted communication whitelist, then the certificate fingerprint information of the encrypted data block is extracted and compared with the certificate fingerprint information pre-stored in the legal encrypted communication whitelist. When the certificate fingerprint information matches, then it is verified whether the communication timestamp of the encrypted data block is within the valid time window corresponding to the authorized combination. If the communication timestamp is within the valid time window, then it is determined that the encrypted data block conforms to the legal encrypted communication policy.

6. The method for detecting abnormal control system traffic based on big data according to claim 1, characterized in that, The method of inputting the traffic characteristics of the encrypted data block into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block includes: Historical normal traffic data is acquired, which includes the transport layer payload and protocol header information of legitimate encrypted traffic and plaintext traffic; after preprocessing the historical normal traffic data, the traffic features of legitimate encrypted traffic and plaintext traffic are extracted and input into an autoencoder network for training, which includes an encoder and a decoder. By minimizing the mean square error between the output flow characteristics and the input flow characteristics of the autoencoder network, the parameters of the encoder and decoder are optimized until the reconstruction error of the autoencoder network converges to a stable value; the encoder and decoder of the trained autoencoder network are used as pre-trained baseline models.

7. The method for detecting abnormal control system traffic based on big data according to claim 1, characterized in that, The method for generating traffic anomaly alarms includes: Based on the source address, destination address, and port information in the protocol header, a predefined device registry is queried to obtain the device types and security domains of the source and destination devices; based on the device types and security domains, it is determined whether a legitimate communication relationship between the source and destination devices exists in a preset access control policy; If the legitimate communication relationship does not exist in the preset access control policy, a comprehensive anomaly score is calculated based on the traffic characteristics of the unexpected encrypted data block; if the comprehensive anomaly score exceeds the preset alarm threshold, a traffic anomaly alarm is generated that includes the source device identifier, the destination device identifier, and the comprehensive anomaly score level, and the encryption behavior pattern of the unexpected encrypted data block is analyzed.

8. The method for detecting abnormal control system traffic based on big data according to claim 7, characterized in that, The method for analyzing the encryption behavior patterns of unexpected encrypted data blocks includes: Extract communication behavior pattern features of unexpected encrypted data blocks, including target port switching frequency, variance of data packet arrival time interval and number of session direction abrupt changes; The communication behavior pattern features are input into a pre-trained behavior classification model to obtain the probability that an unexpected encrypted data block belongs to malicious encrypted traffic; the behavior classification model is trained based on the feature differences between historical attack traffic data and legitimate encrypted traffic data; If the probability of the traffic being malicious exceeds a preset malicious determination threshold, an attack type tag is added to the traffic anomaly alarm, and the attack type is determined by matching the function code in the protocol header information with a preset attack feature library.

9. The method for detecting abnormal control system traffic based on big data according to claim 7, characterized in that, The method for calculating a comprehensive anomaly score based on the traffic characteristics of unexpected encrypted data blocks includes: Obtain the packet length distribution, transmission frequency, and session duration from the unexpected encrypted data block traffic characteristics, and calculate the distribution deviation, frequency deviation, and session deviation by comparing them with the corresponding historical normal traffic characteristics in the pre-trained baseline model. The distribution deviation, frequency deviation, and session deviation are weighted and summed according to preset weighting coefficients to obtain a comprehensive anomaly score.

10. A control system traffic anomaly detection system based on big data, characterized in that the system includes: a network traffic data analysis module, an encrypted data block analysis module, and a traffic anomaly alarm management module, with the modules communicating with one another in sequence. The network traffic data analysis module is used to collect real-time network traffic data of the industrial control system. The network traffic data includes the transport layer payload and the corresponding protocol header information. The module performs protocol parsing on the network traffic data, extracts the function code, source address, destination address, and port information from the protocol header information, and identifies whether there are encrypted data blocks in the transport layer payload according to a preset plaintext protocol specification library. The encrypted data block analysis module is used to extract the traffic characteristics of the encrypted data block if it exists. The traffic characteristics include packet length distribution, transmission frequency, entropy value, and session duration. The traffic characteristics of the encrypted data block are input into a pre-trained baseline model to calculate the reconstruction error of the encrypted data block. The traffic anomaly alarm management module is used to determine that the encrypted data block is an unexpected encrypted data block if the reconstruction error exceeds a preset error threshold, and to associate the source device and destination device of the unexpected encrypted data block according to the source address, destination address, and port information in the protocol header information to generate a traffic anomaly alarm.