Apparatus for collecting data sets for building machine learning systems in the field of networks

By constructing a collaborative ML module, evaluation module, and data acquisition module, the problems of insufficient generalization ability and opaque decision logic of machine learning models in the field of cybersecurity are solved. This enables the construction of high-quality datasets and closed-loop optimization of model training, thereby improving the accuracy and reliability of cybersecurity detection.

CN122204518A · Pending · Publication Date: 2026-06-12 · INST OF COMPUTING TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF COMPUTING TECH CHINESE ACAD OF SCI
Filing Date
2026-04-22
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing machine learning models in the field of cybersecurity suffer from insufficient generalization ability, opaque decision-making logic, data bias, and shortcut learning problems, making them difficult to operate effectively in actual deployments.

Method used

By constructing a data acquisition device that includes an ML module, an evaluation module, and a data acquisition module, we can perform label configuration, training set construction, and shortcut learning behavior detection for network traffic data, dynamically guide data acquisition strategies, and achieve closed-loop optimization of dataset construction and model training.

Benefits of technology

It improves the quality and reliability of training datasets for machine learning systems in the network domain, enhances the generalization ability and robustness of models, and ensures stable and reliable operation in complex real-world network environments.



Abstract

The application provides a collection device for constructing a data set of a machine learning system in a network field, and the device comprises: an ML module, which is used for pre-processing each network traffic data newly collected by a data collection module to configure a label for each network traffic data, constructing a training set from the newly collected network traffic data configured with the label and historical network traffic data configured with the label, and training a machine learning system in a network field based on the training set; an evaluation module, which is used for evaluating whether the machine learning system has a shortcut learning behavior, and if the machine learning system does not have the shortcut learning behavior, outputting the training set constructed in the current ML module, otherwise, formulating a data collection strategy based on the identified shortcut learning behavior; and a data collection module, which is used for collecting network traffic data based on a preset data collection strategy or the data collection strategy formulated by the evaluation module, and feeding back the collected network traffic data to the ML module.

Description

Technical Field

[0001] This invention relates to the field of network security technology, specifically to the field of network data processing and machine learning dataset construction, and more specifically, to a data acquisition device for constructing datasets for machine learning systems in the network field.

Background Technology

[0002] With the rapid development of artificial intelligence (AI) and machine learning (ML) technologies, their application in the network field is becoming increasingly profound, covering multiple aspects such as threat detection, malware analysis, and intrusion prevention systems, and has become a research hotspot for improving network automation and intelligence. While machine learning-based network security and operations solutions have achieved significant success, a key obstacle remains when deploying these developed machine learning models (such as DDoS detection, malware identification, and intrusion prevention) in real-world networks with different network behavior characteristics: models trained in controlled environments cannot maintain their original effectiveness in different real-world networks. This problem is often referred to as the generalization problem of machine learning models. The diversity and adversarial nature of real-world network environments prevent machine learning models from performing as expected in different deployment settings. For example, an intrusion detection model trained and tested based on specific environment data cannot be expected to remain effective in different environments (due to the differences in real-world network environments, attack behaviors, even benign behaviors, can vary significantly). Furthermore, even after deployment, maintaining the effectiveness of the model presents serious problems. For instance, network changes based on business expansion may lead to new benign behaviors being identified as attacks, which is why security vendors are reluctant to deploy in real-world networks.

[0003] Currently, although many models demonstrate excellent performance metrics such as high accuracy and high recall on closed test datasets, their data-driven internal decision-making logic remains opaque to cybersecurity researchers. This opacity makes it difficult for researchers to understand, attribute, and trust the judgment criteria of these models, severely hindering the large-scale deployment and effective operation of such systems in real-world production environments.

[0004] Specifically, the shortcomings of existing machine learning models in the field of computer networks are mainly reflected in the following aspects: (1) The model decision-making process is not interpretable: most high-performance machine learning models (especially deep learning models) lack inherent interpretability. On the one hand, security analysts cannot correct what the model has wrongly learned from erroneous data. On the other hand, a model trained on biased and insufficiently representative data struggles to cope with attacks on real networks. (2) Sensitivity to data bias and shortcut learning: the model may base its decisions on biased features unintentionally introduced during data collection rather than on the pattern of the attack behavior itself, so a model trained in a specific environment generalizes poorly to real environments. (3) Difficulty detecting and responding to out-of-distribution (OOD) attacks: when faced with new attack variants or attack scenarios (i.e., out-of-distribution samples) that are not fully covered by the training data, existing black-box models often cannot give reliable predictions, may even make wrong judgments with high confidence, and lack effective early warning mechanisms. (4) The model evaluation system is incomplete: current model evaluation relies too heavily on single indicators such as accuracy and lacks effective means to evaluate the rationality of a model's decision-making logic, its robustness, and its generalization ability in real network environments (such as detecting shortcut learning, data bias, and out-of-distribution problems). As a result, many models that "perform well" in laboratory environments fail to play their expected role in actual deployments. The adversarial nature of network security requires models to be reliable, auditable, and able to justify their actions clearly, but the "black box" nature of existing machine learning models conflicts sharply with this requirement. Ultimately, the network datasets used for training large network models are unreliable, scarce, and of low quality.

[0005] In summary, while machine learning technology is increasingly widely used in cybersecurity, it faces significant challenges in generalization during practical deployments. The core issues stem from the "black box" nature of machine learning models and the low quality of training data—models lack interpretability, leading to opaque decision-making processes; training data suffers from severe biases, shortcut learning, and distribution shifts, causing models to over-rely on non-robust features rather than essential classification or prediction attributes. Therefore, the cybersecurity field urgently needs a new paradigm that integrates domain knowledge, interpretable analysis, and dynamic data iteration to systematically improve training data quality and model generalization capabilities.

[0006] It should be noted that the background information presented here is only for illustrating relevant information about the present invention to aid in understanding the technical solution of the present invention, and does not imply that the relevant information is necessarily prior art. The relevant information was submitted and disclosed together with the present invention, and should not be considered prior art unless there is evidence that the relevant information was disclosed before the filing date of the present invention.

Summary of the Invention

[0007] Therefore, the purpose of this invention is to overcome the shortcomings of the prior art and provide a data acquisition device for building machine learning systems in the network field.

[0008] The objective of this invention is achieved through the following technical solution:

[0009] According to a first aspect of the present invention, a data acquisition device is provided for constructing a machine learning system in the network domain. The device includes an ML module, an evaluation module, and a data acquisition module. The ML module preprocesses each newly acquired network traffic data point to assign a label to each data point, constructs a training set using all newly acquired labeled network traffic data points and historically labeled network traffic data points, and trains a machine learning system in the network domain using an end-to-end approach based on the training set. The evaluation module evaluates whether the machine learning system exhibits shortcut learning behavior. If the machine learning system does not exhibit shortcut learning behavior, it outputs the training set constructed in the current ML module; otherwise, it formulates a data acquisition strategy based on the identified shortcut learning behavior. The data acquisition module collects network traffic data based on a preset data acquisition strategy or a data acquisition strategy formulated by the evaluation module, and feeds the collected network traffic data back to the ML module.
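The interaction of the three modules can be pictured as a loop. The following is a minimal sketch, not the claimed implementation: the callables `collect`, `train`, and `find_shortcuts`, the dictionary form of a strategy, and the `max_rounds` cap are all hypothetical names and assumptions introduced for illustration only.

```python
from typing import Callable, List, Tuple

Record = dict  # hypothetical representation of one labeled traffic record


def build_dataset(
    collect: Callable[[dict], List[Record]],
    train: Callable[[List[Record]], object],
    find_shortcuts: Callable[[object, List[Record]], List[str]],
    initial_strategy: dict,
    max_rounds: int = 10,  # assumption: the source does not state a round limit
) -> Tuple[List[Record], object]:
    """Repeat collect -> train -> evaluate until no shortcut is reported."""
    strategy = initial_strategy
    training_set: List[Record] = []  # accumulates historical labeled data
    model = None
    for _ in range(max_rounds):
        # Data acquisition module: collect under the current strategy.
        training_set = training_set + collect(strategy)
        # ML module: train on new plus historical labeled data.
        model = train(training_set)
        # Evaluation module: look for shortcut learning behavior.
        shortcuts = find_shortcuts(model, training_set)
        if not shortcuts:
            return training_set, model  # output the current training set
        # Otherwise formulate a new strategy from the detected shortcuts.
        strategy = {"target_features": shortcuts}
    return training_set, model
```

The first round runs under the preset strategy; every later round runs under a strategy derived from the evaluation result, which is the closed loop the paragraph describes.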

[0010] This solution can achieve at least the following beneficial technical effects:

[0011] This solution, through the collaborative work of the ML module, evaluation module, and data acquisition module, can automatically complete label configuration, training set construction, and machine learning system training while collecting network traffic data. Furthermore, the evaluation module detects shortcut learning behavior and dynamically guides data acquisition strategies, achieving closed-loop optimization of dataset construction and model training, and effectively improving the quality and reliability of training datasets for machine learning systems in the network domain.

[0012] Preferably, the evaluation module is configured to: use the machine learning system trained by the ML module to predict a label for each network traffic data in the training set; construct a decision tree system, wherein the decision tree system is trained with the network traffic data in the training set as input, the labels predicted by the decision tree system as output, and the labels predicted by the machine learning system trained by the ML module as supervision; obtain all decision paths of the decision tree system and the evaluation information of each decision path; and use a preset method to detect, based on the obtained evaluation information, whether each decision path exhibits shortcut learning behavior, wherein the evaluation information of each decision path includes at least the depth of the decision path, the features it depends on, and its sample coverage rate, where the depth is the number of node splits included in the decision path, and the coverage rate is the proportion of the number of samples covered by the decision path to the total number of samples in the training dataset used when training the decision tree system.
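To make the three pieces of evaluation information concrete, here is a small sketch that walks a hand-written tree and reports depth, dependent features, and coverage for each root-to-leaf path. The nested-dictionary tree format and the feature names in the usage note are assumptions for illustration; the source does not prescribe a tree representation.

```python
def decision_paths(tree, samples):
    """Return depth, features, and coverage for every root-to-leaf path.

    `tree` is a nested dict (assumed format): an internal node has the keys
    "feature", "threshold", "left", "right"; a leaf has the key "label".
    `samples` is a list of dicts mapping feature name to numeric value.
    """
    total = len(samples)
    results = []

    def walk(node, reached, features):
        if "label" in node:  # leaf: one complete decision path
            results.append({
                "label": node["label"],
                "depth": len(features),  # number of node splits on the path
                "features": sorted(set(features)),
                "coverage": len(reached) / total if total else 0.0,
            })
            return
        f, t = node["feature"], node["threshold"]
        walk(node["left"], [s for s in reached if s[f] <= t], features + [f])
        walk(node["right"], [s for s in reached if s[f] > t], features + [f])

    walk(tree, samples, [])
    return results
```

For a tree whose root splits on a hypothetical `ttl` feature and sends half of the samples straight to a leaf, that leaf's path is reported with depth 1, features `["ttl"]`, and coverage 0.5.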

[0013] This solution can achieve at least the following beneficial technical effects:

[0014] This solution simulates the decision-making logic of a machine learning system by constructing a decision tree system. Based on the depth, dependency features, and sample coverage of the decision path, it detects shortcut learning behavior. This allows the solution to locate shortcut problems caused by dataset defects at the model decision-making logic level, enabling the interpretation evaluation of the training dataset and improving the intuitiveness and accuracy of dataset problem identification.

[0015] Preferably, the evaluation module is configured to construct the decision tree system in the following manner: dividing the training dataset into a decision tree system dataset and an evaluation dataset according to a preset division ratio; training multiple decision tree systems with the network traffic data in the decision tree system dataset as input, the labels predicted by the decision tree system as output, and the labels predicted for that network traffic data by the machine learning system trained by the ML module as supervision; using each decision tree system to predict each network traffic data in the evaluation dataset, calculating the prediction similarity between each decision tree system and every other decision tree system based on the prediction results, and averaging these similarities for each decision tree system; and selecting the decision tree system with the highest average prediction similarity as the final decision tree system.
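The selection step can be sketched as follows. The source does not define "prediction similarity"; this sketch assumes it is the fraction of evaluation samples on which two trees predict the same label, and the function names are hypothetical.

```python
def agreement(a, b):
    """Fraction of evaluation samples on which two prediction lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_most_consistent(predictions):
    """Pick the tree whose predictions agree most, on average, with the others.

    `predictions[i]` is the list of labels tree i predicts on the evaluation
    dataset. At least two trees are assumed. Returns (index, mean similarity).
    """
    n = len(predictions)
    means = []
    for i in range(n):
        sims = [agreement(predictions[i], predictions[j])
                for j in range(n) if j != i]
        means.append(sum(sims) / len(sims))
    best = max(range(n), key=means.__getitem__)
    return best, means[best]
```

With three trees predicting `[1, 0, 1, 1]`, `[1, 0, 1, 0]`, and `[0, 0, 1, 0]`, the second tree agrees with each of the others on three of four samples and is selected with a mean similarity of 0.75.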

[0016] This solution can achieve at least the following beneficial technical effects:

[0017] This approach trains multiple decision tree systems by dividing the dataset and selects the optimal decision tree system based on the average prediction similarity between models. This allows the selection of the most stable and consistent decision tree for subsequent analysis, avoiding the bias caused by the randomness of a single decision tree and improving the reliability of shortcut learning behavior detection results.

[0018] Preferably, in the evaluation module, multiple decision tree systems are trained as follows: S1, a portion of network traffic data is randomly extracted from the decision tree system dataset, and all extracted network traffic data is divided into a decision tree system training dataset and a decision tree system test dataset; S2, a decision tree system is trained based on preset constraint parameters, using each network traffic data in the decision tree system training dataset as input, the label predicted by the decision tree system as output, and the label predicted by the machine learning system trained by the ML module as supervision, wherein the preset constraint parameters include at least the maximum depth and the minimum number of leaf node samples of the decision tree system; S3, the trained decision tree system is used to predict the label for each network traffic data in the decision tree system test dataset, and the prediction fidelity between the decision tree system and the machine learning system trained by the ML module is determined by comparing the labels that the two systems each predict for the network traffic data in the decision tree system test dataset; S4, the decision tree system test dataset is put back into the decision tree system dataset to construct a new decision tree system dataset, and steps S1 to S4 are performed based on the new decision tree system dataset; wherein, each time S1 to S4 complete a first preset number of loop iterations, the decision tree system with the highest prediction fidelity to the machine learning system trained by the ML module is selected from the newly constructed first preset number of decision tree systems; when the number of selected decision tree systems reaches a second preset number, the loop iteration stops, and all selected decision tree systems are used as the multiple decision tree systems finally trained.
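Steps S1 to S4 and the batch-wise selection can be sketched as below. This is an illustrative reading only: fidelity is assumed to be the label agreement rate, `fit_tree` and `model_predict` are hypothetical callables supplied by the caller, and the sampling and split fractions are placeholder values not taken from the source. The dataset is assumed to hold at least two records.

```python
import random


def fidelity(tree_labels, model_labels):
    """Share of samples on which the surrogate tree reproduces the model's label."""
    return sum(t == m for t, m in zip(tree_labels, model_labels)) / len(model_labels)


def train_candidate_trees(dataset, fit_tree, model_predict,
                          batch_size, n_selected,
                          sample_frac=0.8, test_frac=0.25, seed=0):
    """Return `n_selected` (fidelity, tree) pairs, one winner per batch.

    batch_size -> the "first preset number" of S1-S4 iterations per batch
    n_selected -> the "second preset number" of trees to keep in total
    """
    rng = random.Random(seed)
    selected = []
    while len(selected) < n_selected:
        batch = []
        for _ in range(batch_size):
            # S1: draw a random subset and split it into train and test parts.
            k = max(2, int(sample_frac * len(dataset)))
            subset = rng.sample(dataset, k)
            cut = max(1, int(test_frac * k))
            test, train = subset[:cut], subset[cut:]
            # S2: fit the tree, supervised by the ML system's predicted labels.
            tree = fit_tree(train, [model_predict(x) for x in train])
            # S3: fidelity between tree and ML system on the test part.
            score = fidelity([tree(x) for x in test],
                             [model_predict(x) for x in test])
            batch.append((score, tree))
            # S4: `dataset` is never modified, so the test part is implicitly
            # put back before the next draw.
        selected.append(max(batch, key=lambda pair: pair[0]))
    return selected
```

In practice `fit_tree` would also receive the constraint parameters named in S2 (maximum depth, minimum leaf samples); they are left to the caller here to keep the sketch short.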

[0019] This solution can achieve at least the following beneficial technical effects:

[0020] This approach iteratively constructs multiple decision tree systems through cyclic sampling, training, testing, and sample replacement, and selects high-fidelity decision trees in iterative batches. This enhances the decision tree's ability to learn complex sample features, improves the consistency between the surrogate model and the original machine learning system's prediction logic, and ensures the accuracy of subsequent interpretable analysis.

[0021] Preferably, in the evaluation module, each decision tree system obtained from the final training is pruned in the following manner: all decision paths of the decision tree system are obtained; the coverage of each decision path is obtained, wherein the coverage of a decision path is the ratio of the amount of network traffic data that the path covers in the training dataset of its decision tree system to the total amount of network traffic data in that training dataset; under the condition of satisfying preset constraints, the multiple decision paths with the highest coverage are selected from all decision paths, and the decision tree system is reconstructed based on all selected decision paths to obtain the pruned decision tree system.
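One possible way to read the coverage-based selection is a greedy search: rank the paths by coverage and keep the smallest top-ranked set that satisfies the constraints. The sketch below takes that reading as an assumption; the source does not say how the number of kept paths is chosen, and `meets_constraints` is a hypothetical caller-supplied predicate.

```python
def select_paths_by_coverage(paths, meets_constraints):
    """Keep the fewest highest-coverage paths that satisfy the constraints.

    `paths` is a list of dicts with a "coverage" key. `meets_constraints`
    receives the candidate list of kept paths and returns True or False.
    """
    ranked = sorted(paths, key=lambda p: p["coverage"], reverse=True)
    for k in range(1, len(ranked) + 1):
        kept = ranked[:k]
        if meets_constraints(kept):
            return kept
    return ranked  # no smaller set qualified, so nothing is pruned
```

Rebuilding a tree from the kept paths is a separate step and is not shown here.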

[0022] This solution can achieve at least the following beneficial technical effects:

[0023] This approach prunes the decision tree system based on decision path coverage, simplifying the model structure, reducing redundant path interference, improving the interpretability of the decision tree, and facilitating the rapid identification of core decision rules related to shortcut learning while retaining key decision rules.

[0024] Preferably, in the evaluation module, the preset constraints are that the pruned decision tree system must simultaneously meet the following conditions: the relative decrease in fidelity of the pruned decision tree system is less than or equal to 5%, or the absolute decrease in fidelity is less than or equal to 0.02; and the relative decrease in the total number of nodes of the pruned decision tree system is greater than or equal to 30%.
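The two constraints reduce to a short check. This sketch assumes the 0.02 threshold refers to the absolute drop in fidelity, which is how paragraph [0026] describes it ("relative / absolute decrease threshold"); the function and parameter names are hypothetical.

```python
def pruning_acceptable(fid_before, fid_after, nodes_before, nodes_after):
    """Check the fidelity and node-count constraints for a pruned tree."""
    abs_drop = fid_before - fid_after
    rel_drop = abs_drop / fid_before if fid_before else 0.0
    # Fidelity: relative drop <= 5% OR absolute drop <= 0.02.
    fidelity_ok = rel_drop <= 0.05 or abs_drop <= 0.02
    # Size: total node count must shrink by at least 30%.
    node_reduction = (nodes_before - nodes_after) / nodes_before
    return fidelity_ok and node_reduction >= 0.30
```

For example, pruning from 100 to 60 nodes while fidelity falls from 0.95 to 0.93 passes both conditions, whereas the same pruning with fidelity falling to 0.80 does not.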

[0025] This solution can achieve at least the following beneficial technical effects:

[0026] This scheme achieves precise control over the pruning effect of the decision tree system by setting dual constraints: a relative / absolute decrease threshold for fidelity and a decrease threshold for the total number of nodes. This strictly limits the decrease in model fidelity after pruning, ensuring that the pruned decision tree system can still simulate the prediction logic of the ML module training model with high fidelity. At the same time, it ensures that the model complexity is effectively reduced, achieving an optimal balance between fidelity and interpretability, and avoiding analytical distortion caused by over-pruning or interpretability difficulties caused by under-pruning.

[0027] Preferably, in the evaluation module, the preset method is as follows: perform preliminary screening on all decision paths, and mark as suspicious any decision path whose depth is less than or equal to a preset depth threshold and whose sample coverage is greater than a preset sample coverage threshold; perform semantic analysis on each suspicious decision path, checking whether the features on which the suspicious decision path depends include surface features that are semantically unrelated to the classification task performed by the machine learning system trained by the ML module. If they do, it is determined that the suspicious decision path exhibits shortcut learning behavior.
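The two-stage check can be sketched as below. The semantic analysis is stood in for by a manually curated set of task-unrelated feature names, which is an assumption; the source does not specify how semantic relevance is determined. The path dictionary keys and the feature names in the usage note are likewise hypothetical.

```python
def screen_paths(paths, max_depth, min_coverage, surface_features):
    """Flag shallow, high-coverage paths that rely on task-unrelated features.

    `paths` is a list of dicts with "depth", "coverage", and "features" keys.
    `surface_features` is a set of feature names judged semantically
    unrelated to the classification task (assumed to be curated by hand).
    """
    flagged = []
    for p in paths:
        # Stage 1: preliminary screening on depth and coverage thresholds.
        suspicious = p["depth"] <= max_depth and p["coverage"] > min_coverage
        if not suspicious:
            continue
        # Stage 2: stand-in for semantic analysis of the dependent features.
        unrelated = [f for f in p["features"] if f in surface_features]
        if unrelated:
            flagged.append({**p, "shortcut_features": unrelated})
    return flagged
```

A depth-1 path covering 60% of the samples through a hypothetical `src_ip` feature would be flagged, while an equally shallow path that depends only on a task-relevant feature would pass.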

[0028] This solution can achieve at least the following beneficial technical effects:

[0029] This solution clarifies the specific steps for identifying shortcut learning behavior based on decision paths. First, it extracts key indicators of the decision path and performs threshold screening. Then, it combines semantic analysis to determine feature correlation. This enables the accurate identification of suspicious shortcut rules caused by feature bias in the test dataset, achieving automated and accurate identification of model shortcut learning behavior. At the same time, it clarifies the direct correlation between shortcut learning behavior and dataset feature bias, providing a clear basis for subsequent targeted optimization of the dataset and elimination of shortcut features, effectively improving the accuracy of dataset defect diagnosis.

[0030] Preferably, in the evaluation module, the preset method includes: adding feature perturbation to the features on which the network traffic data covered by the suspicious decision path depends within a preset perturbation range; using a machine learning system trained with an ML module to predict the network traffic data after adding feature perturbation to obtain a prediction result; if the prediction result changes with the feature perturbation, it is determined that the suspicious decision path has shortcut learning behavior.
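A perturbation check of this kind might look like the following sketch. The multiplicative noise model, the `scale` and `trials` defaults, and the returned flip rate are assumptions chosen for illustration; the source states only that perturbation stays within a preset range and that a changed prediction indicates shortcut learning.

```python
import random


def prediction_flip_rate(samples, features, predict,
                         scale=0.1, trials=20, seed=0):
    """Perturb the given features and report how often the prediction changes.

    `samples` are dicts of numeric features covered by a suspicious path,
    `features` are the features that path depends on, and `predict` is the
    trained ML system's prediction function (hypothetical signature).
    """
    rng = random.Random(seed)
    flips = 0
    total = 0
    for s in samples:
        base = predict(s)
        for _ in range(trials):
            perturbed = dict(s)
            for f in features:
                # Scale each dependent feature by a random factor in
                # [1 - scale, 1 + scale].
                perturbed[f] = s[f] * (1 + rng.uniform(-scale, scale))
            flips += predict(perturbed) != base
            total += 1
    return flips / total if total else 0.0
```

A flip rate above zero means the prediction changes with the perturbation, which under the paragraph above marks the path as exhibiting shortcut learning behavior; a rate of zero suggests the model does not actually depend on those features for these samples.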

[0031] This solution can achieve at least the following beneficial technical effects:

[0032] This solution achieves robust secondary verification of suspicious shortcut rules by perturbing the dependent features of suspicious shortcut rules within a preset perturbation range and using the ML module to train a model to verify the changes in prediction results. It can effectively distinguish between real non-robust shortcut learning behavior caused by dataset feature bias and false suspected rules, avoiding the misjudgment problem caused by semantic analysis alone. This further improves the accuracy and reliability of shortcut learning behavior recognition and provides a more accurate basis for dataset optimization.

[0033] Preferably, the ML module is configured with a variety of machine learning systems in the field of computer networks with different architectures for selection. The various machine learning systems in the field of computer networks with different architectures include at least random forest systems, deep learning systems, and logistic regression systems. Each machine learning system is suitable for at least one network behavior classification task among DDoS detection, malware identification, intrusion prevention, and VPN traffic detection.

[0034] This solution can achieve at least the following beneficial technical effects:

[0035] This solution configures machine learning systems with various architectures, such as random forests, deep learning models, and logistic regression models, in the ML module, and adapts them to various network behavior classification tasks, such as DDoS detection and malware identification. This enables the evaluation system to adapt to multiple scenarios and tasks, meet the dataset evaluation needs of different network security scenarios, break through the limitations of single-model and single-task evaluation, and improve the versatility and practicality of the dataset evaluation system.

[0036] According to a second aspect of the present invention, a method for constructing a machine learning system in the field of computer networks is proposed. The method includes: acquiring a machine learning system in the field of computer networks to be trained; acquiring an acquisition device as described in the first aspect of the present invention; constructing a training dataset using the acquired acquisition device; and training the machine learning system in the field of computer networks to be trained until convergence using the constructed training dataset to obtain a final machine learning system in the field of computer networks.

[0037] This solution can achieve at least the following beneficial technical effects:

[0038] This approach uses the high-quality training dataset optimized above to train the machine learning system to convergence. This reduces the model's dependence on spurious features from the data source, significantly suppresses shortcut learning behavior, and enables the machine learning system to learn real and effective network traffic features. This effectively improves the generalization ability, robustness, and detection accuracy of the final network domain machine learning system, ensuring that it can maintain stable and reliable operation in complex real network environments.

[0039] Compared with the prior art, the advantages of the present invention are as follows:

[0040] This invention addresses the problems of opaque decision-making logic, blind data augmentation, and low accuracy of bias diagnosis in existing machine learning models in the network field. By constructing an iterative optimization closed loop driven by interpretable analysis, it adopts a high-fidelity proxy model to achieve accurate bias diagnosis and combines it with cross-platform proactive data collection. This effectively solves the pain points of black-box model decision-making logic being difficult to interpret, dataset quality being difficult to control, and model generalization ability being insufficient. It improves the accuracy and reliability of network security detection and provides technical support for a high-quality and robust network security protection system, which has strong practicality and application value.

Attached Figure Description

[0041] The embodiments of the present invention will be further described below with reference to the accompanying drawings, wherein:

[0042] Figure 1 is a schematic diagram of a data acquisition device for building a machine learning system in the network field according to an embodiment of the present invention;

[0043] Figure 2 is a flowchart of the process of constructing a decision tree system for the evaluation module according to an embodiment of the present invention;

[0044] Figure 3 is a schematic diagram of an evaluation method according to an embodiment of the present invention;

[0045] Figure 4 is a schematic diagram of the evaluation results according to an embodiment of the present invention.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the invention.

[0047] As mentioned in the background section, deploying trained machine learning models (such as DDoS detection, malware identification, and intrusion prevention) to real-world network environments with varying network behavior characteristics presents a key obstacle: models trained in controlled environments cannot maintain their original performance in different real-world networks—a problem known as the generalization problem of machine learning models. The diversity and adversarial nature of real-world network environments make it difficult for models to perform as expected. Models trained on data from specific environments cannot maintain effectiveness in other heterogeneous environments, and the distribution differences between attack behaviors and benign behaviors directly lead to a significant drop in model performance. Furthermore, models are difficult to maintain long-term after deployment; business expansion and changes in network structure may cause new benign behaviors to be misjudged as attacks. More significantly, existing models generally suffer from a "black box" problem; their internal decision-making logic is opaque, making it difficult for security researchers to understand, attribute, and trust the model's judgment criteria, and hindering manual review and accountability. Model reliability is significantly reduced when facing out-of-distribution (OOD) attacks, easily making incorrect predictions with high confidence and lacking effective early warning. Current model evaluation systems rely excessively on single indicators such as accuracy, lacking effective means to evaluate the rationality of decision-making, robustness, and generalization ability. This further leads to models that perform well in the laboratory failing to play a role in actual deployment.

[0048] The core cause of these problems lies in the unreliability, scarcity, and low quality of the network datasets used to train cybersecurity models. First, training data commonly suffers from data bias, shortcut learning, and distribution shifts, making models prone to relying on superficial, non-causal features for judgments rather than learning the patterns of attack behavior itself. Second, existing high-performance models, especially deep learning models, lack inherent interpretability. Security analysts cannot know the specific data features and reasoning chains upon which the model judges malicious behavior, making it impossible to correct erroneous learning and cope with complex attacks in real networks. Third, although data augmentation techniques are widely used in computer vision and natural language processing to improve data diversity, these methods are not entirely applicable to cybersecurity. Network data must strictly adhere to protocol standards and domain knowledge; blindly introducing noise, interpolation, and other traditional augmentation operations will generate unrealistic, semantically invalid, or even protocol-violation samples, failing to improve generalization ability and damaging model performance. Finally, traditional data augmentation can only passively expand the data scale and cannot address the problem of outdated training data caused by dynamic changes in the network environment and the continuous evolution of attack behavior, making it difficult to fundamentally solve the model's generalization defects.

[0049] In summary, while machine learning technology is increasingly widely used in cybersecurity, it faces severe challenges in generalization during practical deployment. The core issues stem from the "black box" nature of the models and the low quality of training data. The lack of interpretability leads to opaque decision-making processes, while training data suffers from biases, shortcut learning, and distribution shifts, causing models to over-rely on non-robust surface features rather than inherent attack attributes. Traditional data augmentation methods, failing to incorporate network domain knowledge and protocol semantics, not only struggle to improve data quality but may also generate invalid samples and fail to address the problem of training data becoming outdated as networks evolve. Therefore, the cybersecurity field urgently needs a new paradigm that integrates domain knowledge, interpretable analysis, and dynamic data iteration to systematically improve training data quality and model generalization capabilities, thereby promoting the reliable deployment and effective operation of machine learning models in real-world network environments.

[0050] To address the aforementioned issues, this invention proposes a dataset collection scheme for building machine learning systems in the network domain. This scheme aims to construct an enhanced machine learning pipeline by introducing an interpretability analysis step to diagnose model decision biases and dynamically guide the collection of high-quality data, thereby building a high-quality dataset for training machine learning systems with high generalization ability and high reliability in complex and dynamic real network environments.

[0051] According to one embodiment of the present invention, a dataset acquisition device for constructing machine learning systems in the network field is proposed. Referring to Figure 1, the device includes an ML module, an evaluation module, and a data acquisition module. To better explain the present invention, each module of the acquisition device is described in detail below with reference to specific embodiments.

[0052] I. ML Module

[0053] The ML module is used to preprocess each network traffic data newly acquired by the data acquisition module to assign a label to each network traffic data. A training set is constructed using all the newly acquired network traffic data with assigned labels and historical network traffic data with assigned labels. Based on the training set, an end-to-end machine learning system in the network domain is trained. In addition, the ML module is configured with a variety of machine learning systems in the computer network domain with different architectures to choose from. Among them, the various machine learning systems in the computer network domain with different architectures include at least random forest systems, deep learning systems, and logistic regression systems. Each machine learning system is suitable for at least one network behavior classification task among DDoS detection, malware identification, intrusion prevention, and VPN traffic detection.

[0054] According to an embodiment of the present invention, the operations performed by the ML module include: (1) configuring labels for the network traffic data newly collected by the data acquisition module, and preprocessing the labeled network traffic data (for example, data cleaning and data standardization); (2) selecting and initializing a machine learning system from a pre-configured set of machine learning systems in the field of computer networks; (3) training the initialized machine learning system using the preprocessed, labeled network traffic data. By training the machine learning system end-to-end on well-labeled and reliable network traffic data, the ML module enables the model to learn the real characteristics and patterns of network behavior, improves recognition accuracy and generalization ability in scenarios such as DDoS attacks, malware, intrusion behavior, and VPN traffic, and provides a stable and reliable benchmark model for the subsequent evaluation of dataset quality and detection of shortcut learning.
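As a non-limiting illustration, the three operations of the ML module can be sketched in Python as follows. The sketch assumes scikit-learn and NumPy; the names MODEL_REGISTRY, build_training_set, and train_ml_system are hypothetical, the multilayer perceptron merely stands in for a deep learning system, and dropping rows with missing values is only one possible form of data cleaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical registry of pre-configured architectures (names are illustrative).
MODEL_REGISTRY = {
    "random_forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "deep_learning": lambda: MLPClassifier(hidden_layer_sizes=(32, 16),
                                           max_iter=500, random_state=0),
}


def build_training_set(new_X, new_y, hist_X, hist_y):
    """Merge newly collected labeled flows with historical labeled flows."""
    X = np.vstack([hist_X, new_X])
    y = np.concatenate([hist_y, new_y])
    keep = ~np.isnan(X).any(axis=1)  # assumed cleaning step: drop incomplete rows
    return X[keep], y[keep]


def train_ml_system(X, y, architecture="random_forest"):
    """Select, initialize, and train one system; StandardScaler does the standardization."""
    model = make_pipeline(StandardScaler(), MODEL_REGISTRY[architecture]())
    return model.fit(X, y)
```

Keeping the architectures behind one registry lets the later evaluation steps treat any of them as an interchangeable black box that exposes only a predict method.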

[0055] II. Evaluation Module

[0056] The evaluation module is used to evaluate whether the machine learning system exhibits shortcut learning behavior. If the machine learning system does not exhibit shortcut learning behavior, the training set constructed in the current ML module is output; otherwise, a data collection strategy is formulated based on the identified shortcut learning behavior.

[0057] It should be understood that machine learning systems such as random forest systems, deep learning systems, and logistic regression systems are all black-box models. Their data-driven internal decision-making logic is not transparent to cybersecurity researchers. Therefore, it is impossible to analyze the model's true decision-making basis and feature dependencies. Furthermore, it makes it impossible to assess whether there are misleading features, feature bias, or false associations in the dataset, and it is difficult to determine whether the model exhibits shortcut learning behavior due to dataset defects.

[0058] According to one embodiment of the present invention, the evaluation module constructs a decision tree system to simulate the decision logic of the machine learning system trained by the ML module, so as to extract the decision path of the model, and realize a comprehensive evaluation of the machine learning system and the training dataset based on the analysis of the decision path. Specifically, the operations performed by the evaluation module include: using the machine learning system trained by the ML module to predict the predicted label of each network traffic data in the training set; constructing a decision tree system, wherein the decision tree system is obtained by supervised training with network traffic data in the training set as input, the predicted label of the decision tree system as output, and the predicted label of the machine learning system trained by the ML module as supervision; obtaining all decision paths of the decision tree system and the evaluation information of each decision path, and using a preset method to detect whether there is shortcut learning behavior in each decision path based on the obtained evaluation information, wherein the evaluation information of each decision path includes at least the depth of the decision path, the features it depends on, and the sample coverage, wherein the depth is the number of node splits included in the decision path, and the coverage is the proportion of the number of samples covered by the decision path to the total number of samples in the training dataset used when training the decision tree system.
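As a non-limiting illustration, the construction of the decision tree system and the extraction of its decision paths can be sketched as follows. The sketch assumes scikit-learn; fit_surrogate and extract_paths are hypothetical names, black_box is any object with a predict method, and the default constraint values are placeholders rather than values taken from this description.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_surrogate(black_box, X, max_depth=5, min_samples_leaf=5):
    """Train a decision tree supervised by the black-box model's predicted labels."""
    teacher_labels = black_box.predict(X)  # supervision is the prediction, not ground truth
    tree = DecisionTreeClassifier(max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf,
                                  random_state=0)
    return tree.fit(X, teacher_labels)


def extract_paths(tree, feature_names):
    """Return one record per root-to-leaf path: conditions, depth, features, coverage."""
    t = tree.tree_
    total = t.n_node_samples[0]  # samples used to train the decision tree
    paths = []

    def walk(node, conditions):
        if t.children_left[node] == -1:  # -1 marks a leaf in scikit-learn trees
            paths.append({
                "conditions": conditions,
                "depth": len(conditions),  # number of node splits on the path
                "features": sorted({c[0] for c in conditions}),
                "coverage": t.n_node_samples[node] / total,
                "predicted": tree.classes_[int(np.argmax(t.value[node][0]))],
            })
            return
        name = feature_names[t.feature[node]]
        thr = float(t.threshold[node])
        walk(t.children_left[node], conditions + [(name, "<=", thr)])
        walk(t.children_right[node], conditions + [(name, ">", thr)])

    walk(0, [])
    return paths
```

Because every training sample ends in exactly one leaf, the coverage values of all paths sum to 1, which gives a quick sanity check on the extraction.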

[0059] It should be noted that the prediction fidelity between the decision tree system and the machine learning system trained by the ML module, that is, the degree of consistency between the decision strategies of the two, is a key factor in whether the final evaluation result is accurate. According to one embodiment of the present invention, multiple decision tree systems are iteratively trained, each decision tree system is evaluated, and the decision tree system with the highest prediction fidelity with the machine learning system is selected as the evaluation basis. Specifically, the evaluation module constructs the final decision tree system in the following manner: the training dataset is divided into a decision tree system dataset and an evaluation dataset according to a preset partitioning ratio; multiple decision tree systems are trained using network traffic data from the decision tree system dataset as input, the labels predicted by the decision tree system as output, and the labels predicted for that network traffic data by the machine learning system trained by the ML module as supervision. The training includes the following steps:
S1, randomly extract a portion of network traffic data from the decision tree system dataset, and divide all extracted network traffic data into a decision tree system training dataset and a decision tree system test dataset;
S2, train a decision tree system based on preset constraint parameters, using each network traffic data in the decision tree system training dataset as input, the labels predicted by the decision tree system as output, and the labels predicted by the machine learning system trained by the ML module as supervision, wherein the preset constraint parameters include at least the maximum depth and the minimum number of leaf node samples of the decision tree system;
S3, use the trained decision tree system to predict a label for each network traffic data in the decision tree system test dataset, and determine the prediction fidelity between the decision tree system and the machine learning system trained by the ML module based on the labels that the two predict for each network traffic data in the decision tree system test dataset;
S4, put the decision tree system test dataset back into the decision tree system dataset to construct a new decision tree system dataset, and execute steps S1 to S4 again based on the new decision tree system dataset.
Each time S1 to S4 have completed a first preset number of loop iterations, the decision tree system with the highest prediction fidelity with the machine learning system trained by the ML module is selected from the first preset number of newly constructed decision tree systems. The loop iteration stops when the number of selected decision tree systems reaches a second preset number, and all selected decision tree systems are taken as the multiple decision tree systems finally trained. Each decision tree system is then used to predict each network traffic data in the evaluation dataset; based on the prediction results, the prediction similarity of each decision tree system with every other decision tree system is calculated and averaged; and the decision tree system with the highest average prediction similarity is taken as the final decision tree system.
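As a non-limiting illustration, steps S1 to S4 and the final selection by average prediction similarity can be sketched as follows. The sketch assumes scikit-learn; the function names, the default loop counts (n_inner for the first preset number, n_outer for the second), the sample size, and the 70/30 split are hypothetical placeholders. Step S4 is interpreted here as appending the test samples to the pool, which is an assumption about the intended behavior.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fidelity(tree, black_box, X):
    """Fraction of samples on which the tree and the black-box model agree."""
    return float(np.mean(tree.predict(X) == black_box.predict(X)))


def train_candidates(black_box, X_pool, n_inner=5, n_outer=3, sample_size=200,
                     max_depth=4, min_samples_leaf=5, seed=0):
    """Keep the highest-fidelity tree of each inner loop until n_outer trees are selected."""
    rng = np.random.default_rng(seed)
    pool, selected = X_pool, []
    for _ in range(n_outer):
        best_tree, best_fid = None, -1.0
        for _ in range(n_inner):
            # S1: random subset of the pool, split into training and test data
            idx = rng.choice(len(pool), size=min(sample_size, len(pool)), replace=False)
            X_tr, X_te = train_test_split(pool[idx], test_size=0.3,
                                          random_state=int(rng.integers(1 << 30)))
            # S2: constrained tree supervised by the black-box predictions
            tree = DecisionTreeClassifier(max_depth=max_depth,
                                          min_samples_leaf=min_samples_leaf,
                                          random_state=0)
            tree.fit(X_tr, black_box.predict(X_tr))
            # S3: prediction fidelity on the held-out test data
            fid = fidelity(tree, black_box, X_te)
            # S4: put the test data back into the pool (assumed to mean appending)
            pool = np.vstack([pool, X_te])
            if fid > best_fid:
                best_tree, best_fid = tree, fid
        selected.append(best_tree)
    return selected


def most_stable(trees, X_eval):
    """Pick the tree whose predictions agree most, on average, with every other tree."""
    preds = [t.predict(X_eval) for t in trees]
    avg = [np.mean([np.mean(p == q) for j, q in enumerate(preds) if j != i])
           for i, p in enumerate(preds)]
    return trees[int(np.argmax(avg))]
```

The function most_stable expects at least two trees, since the average similarity of a single tree with "all other" trees is undefined.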

[0060] According to one embodiment of the present invention, in order to improve the interpretability of the decision tree system, each finally trained decision tree system is pruned before the final decision tree system is selected, as follows: obtaining all decision paths of the decision tree system; obtaining the coverage of each decision path, wherein the coverage of a decision path is the ratio of the amount of network traffic data it covers in the training dataset used to train its decision tree system to the total amount of network traffic data in that training dataset; under preset constraints, selecting multiple decision paths with the highest coverage from all decision paths, and reconstructing the decision tree system based on all selected decision paths to obtain the pruned decision tree system. The preset constraints are that the pruned decision tree system must simultaneously meet the following conditions: the relative decrease in fidelity of the pruned decision tree system is less than or equal to 5%, or the relative decrease in fidelity is less than or equal to 0.02; the relative decrease in the total number of nodes of the pruned decision tree system is greater than or equal to 30%. It should be understood that the values in the above constraints are only illustrative and not exhaustive, and implementers can configure them according to their needs.
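As a non-limiting illustration, the constraint check and the coverage-ranked path selection can be sketched as follows. All names are hypothetical. Two assumptions are made: the 0.02 bound is read as an absolute fidelity decrease (the text states both bounds as relative), and evaluate is a caller-supplied function that rebuilds the tree from the kept paths and returns its fidelity and node count.

```python
def meets_pruning_constraints(fid_before, fid_after, nodes_before, nodes_after,
                              max_rel_drop=0.05, max_abs_drop=0.02,
                              min_node_reduction=0.30):
    """Check the illustrative constraints: small fidelity loss AND large node reduction."""
    drop = fid_before - fid_after
    fidelity_ok = (drop / fid_before <= max_rel_drop) or (drop <= max_abs_drop)
    nodes_ok = (nodes_before - nodes_after) / nodes_before >= min_node_reduction
    return fidelity_ok and nodes_ok


def select_top_k(paths, evaluate, fid_before, nodes_before):
    """Keep the highest-coverage paths, using the largest k that satisfies the constraints."""
    ranked = sorted(paths, key=lambda p: p["coverage"], reverse=True)
    for k in range(len(ranked), 0, -1):
        kept = ranked[:k]
        fid_after, nodes_after = evaluate(kept)  # hypothetical rebuild-and-measure step
        if meets_pruning_constraints(fid_before, fid_after, nodes_before, nodes_after):
            return kept
    return ranked  # no pruning level satisfies the constraints; keep all paths
```

Searching from the largest k downward keeps as many paths as the node-reduction condition allows, so the pruned tree gives up as little fidelity as possible.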

[0061] For example, Figure 2 illustrates the process by which the evaluation module constructs the decision tree model, which can be summarized as follows:
S301, initialize the optimal prediction dataset: the black-box model is used to predict the original training dataset, generating an initial set of "input-output" pairs, called the initial optimal prediction dataset. This step makes the decision knowledge of the black-box model explicit as a usable dataset.
S302, start the outer loop, which executes a preset number of times. Its purpose is to select the most stable model from multiple high-fidelity models to enhance the reliability of the final result.
S303, start the inner loop: in each outer loop iteration, an inner loop is started, which also executes a preset number of times. Its goal is to generate a diverse set of candidate decision trees through repeated sampling and training.
S304, uniform sampling and dataset partitioning: a subset of samples is drawn uniformly at random from the current optimal prediction dataset, and this subset is then divided into a training set and a test set.
S305, train a candidate decision tree: the training set obtained in the previous step is used to train a candidate decision tree model. This model acts as a "student," learning the decision-making strategies of the "teacher" (the machine learning model trained by the ML module).
S306, test the candidate decision tree on the test set to evaluate its fidelity.
S307, dataset augmentation: the test set is merged back into the current optimal prediction dataset D. Samples in the test set have typically already appeared in the training pool; reintroducing them is equivalent to weighting them again, which increases their feature weights and learning priority in subsequent model training. This allows the model to learn the feature patterns of difficult-to-classify and key samples more fully, while also enhancing the stability and robustness of the decision rules, further improving the reliability of subsequent path analysis and shortcut detection.
S308, end the inner loop and select the optimal tree: after the inner loop completes, the decision tree with the highest fidelity (i.e., consistency with the predictions of the machine learning system trained by the ML module) is selected from the generated candidate decision trees, with fidelity as the core metric.
S309, post-processing: Top-k pruning is performed on the selected decision tree. This pruning method retains the k branches that have the greatest impact on the model's decisions (ranked by the number of samples covered), significantly reducing model complexity while sacrificing minimal fidelity.
S310, end the outer loop and select the most stable model: after the outer loop completes, a set of pruned, high-quality decision tree explanations is obtained. The average agreement between each pair of these trees is calculated, that is, the probability that they make the same prediction for the same input. Finally, the decision tree with the highest average agreement is output as the final explanation model. This step ensures the stability of the selected explanation and avoids misleading results due to randomness.

[0062] According to an embodiment of the present invention, in the evaluation module, the preset method is as follows: All decision paths are initially screened, and decision paths that simultaneously satisfy a depth less than or equal to a preset depth threshold and a sample coverage rate greater than a preset sample coverage rate threshold are marked as suspicious decision paths; semantic analysis is performed on each suspicious decision path to check whether the features on which the suspicious decision path depends contain surface features that are semantically unrelated to the classification task performed by the machine learning system trained by the ML module; if so, the suspicious decision path is determined to exhibit shortcut learning behavior. The preset method further includes: within a preset perturbation range, adding feature perturbations to the features on which the network traffic data covered by the suspicious decision path depends; using the machine learning system trained by the ML module to predict the network traffic data after adding feature perturbations to obtain prediction results; if the prediction results change with the feature perturbations, the suspicious decision path is determined to exhibit shortcut learning behavior.
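As a non-limiting illustration, the preset method (initial screening, semantic analysis, and feature perturbation) can be sketched in plain Python as follows. The function names, the threshold defaults, and the representation of a path as a dictionary with depth, coverage, and features keys are hypothetical; the set of surface features would be supplied from domain knowledge.

```python
def screen_paths(paths, max_depth=2, min_coverage=0.3):
    """Initial screening: mark shallow paths that cover a large share of samples."""
    return [p for p in paths
            if p["depth"] <= max_depth and p["coverage"] > min_coverage]


def semantic_check(path, surface_features):
    """Return the features on the path that are semantically unrelated to the task."""
    return sorted(set(path["features"]) & set(surface_features))


def perturbation_check(predict, samples, feature, deltas):
    """Return True if perturbing one feature within the preset range changes a prediction."""
    base = [predict(s) for s in samples]
    for d in deltas:
        shifted = [dict(s, **{feature: s[feature] + d}) for s in samples]
        if [predict(s) for s in shifted] != base:
            return True
    return False
```

A path would be reported as exhibiting shortcut learning behavior when semantic_check returns a non-empty list or when perturbation_check returns True for a feature on that path.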

[0063] For example, Figure 3 illustrates the evaluation process of the evaluation module, which can be summarized as follows:
S501, input the high-fidelity proxy decision tree: the final decision tree system constructed above is obtained. This tree is a transparent approximation of the black-box model's decision logic.
S502, traverse the decision tree system and extract all decision paths: starting from the root node, each leaf node is reached using depth-first or breadth-first search, and each path from the root to a leaf is transformed into a decision rule in "IF-THEN" form. Each rule contains all splitting conditions (features and thresholds) on the path, as well as the predicted category and the number or proportion of samples covered by the leaf node.
S503, calculate the key metrics of the rules: for each extracted decision rule, its key evaluation metrics are calculated. These metrics, which are crucial for identifying shortcuts, include rule depth and rule coverage. Rule depth is the number of nodes (or splits) contained in the rule path; the shallower the depth, the simpler the rule may be. Rule coverage is the percentage of samples covered by the rule out of the total training set; the higher the coverage, the greater the contribution of the rule to the model's decisions.
S504, preliminary screening based on metrics: based on preset thresholds, all rules are screened, and rules that simultaneously meet the criteria of "extremely shallow depth" (depth at or below a preset depth threshold) and "extremely high coverage" (coverage above a preset coverage threshold) are marked as "suspicious shortcut rules." This step aims to quickly identify the most likely problematic decision paths from a large number of extracted rules.
S505, domain knowledge-assisted semantic analysis: a deeper analysis is performed on the selected suspicious rules. The features each rule relies on are examined to check whether they match the semantics of the task itself. For example, in a malware classification task, if a high-coverage rule relies heavily on superficial features such as "file size" or "creation timestamp" rather than semantic features such as code snippets or API call sequences, it is highly likely to be a shortcut rule.
S506, rule robustness verification (optional but recommended): for further verification, small perturbations can be applied to the features the rule relies on (e.g., changing feature values within a reasonable range), and the prediction results of the machine learning system are then observed to see whether they change. If the prediction results are extremely sensitive to perturbations of a feature, the model's decision may rely on a non-robust shortcut based on that feature.
S507, generate a detection report: the above analysis results are integrated to output a "List of Potential Shortcut Rules." This list details the complete path, coverage, and confidence level of each suspicious rule, together with the reasons why it is considered a shortcut (such as the features it relies on being inconsistent with domain knowledge), providing cybersecurity experts with a clear starting point for investigation.
Figure 4 shows a visualization of the detection report. The report can be a chart or a dashboard that integrates a model performance overview, interpretive insights, and risk assessment. It includes: Area 601, Report Header and Model Overview, a title bar and key metric card area labeled "Trust Report Header & Model Performance Overview," which displays the dataset, the type of machine learning system trained by the ML module, its prediction accuracy, and an overview of the final decision tree system; Area 602, Visualization of the Core Decision Paths of the Proxy Decision Tree System, a tree diagram labeled "High-Fidelity Proxy Decision Tree (Top-k Path)," which displays the decision path diagram of the final decision tree system and gives the probability of each node's occurrence in the decision tree system; Area 603, Detection Results and Feature Analysis, labeled "Potential False Correlation and Key Feature Analysis," which summarizes Area 602, identifies possible shortcut learning and false correlations, and analyzes and annotates the key nodes of the final decision tree system; and Area 604, Overall Trust Score, labeled "Overall Trust Score," which extracts and summarizes the content of Areas 601, 602, and 603.
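As a non-limiting illustration, the conversion of decision paths into "IF-THEN" rules and the assembly of the list of potential shortcut rules can be sketched as follows. The function names and the dictionary layout (conditions as feature, operator, threshold triples; flagged as a mapping from path index to reason) are hypothetical.

```python
def rule_to_text(path):
    """Render one decision path as an IF-THEN rule string."""
    cond = " AND ".join(f"{name} {op} {thr:.3g}"
                        for name, op, thr in path["conditions"])
    return f"IF {cond or 'TRUE'} THEN class={path['predicted']}"


def build_report(paths, flagged):
    """List flagged rules with path, depth, coverage, and reason, highest coverage first."""
    rows = []
    for i, p in enumerate(paths):
        if i in flagged:
            rows.append({
                "rule": rule_to_text(p),
                "depth": len(p["conditions"]),
                "coverage": p["coverage"],
                "reason": flagged[i],
            })
    return sorted(rows, key=lambda r: r["coverage"], reverse=True)
```

Sorting by coverage places the rules that influence the most samples at the top, which is where an analyst would likely begin the investigation.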

[0064] III. Data Acquisition Module

[0065] The data acquisition module collects network traffic data based on a preset data acquisition strategy or a data acquisition strategy formulated by the evaluation module, and feeds the collected network traffic data back to the ML module.
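As a non-limiting illustration, the closed loop formed by the three modules can be sketched in plain Python as follows. The four callables (collect, train, evaluate, make_strategy) are hypothetical stand-ins for the data acquisition module, the ML module, and the evaluation module, and the max_rounds bound is an added assumption so that the sketch always terminates.

```python
def build_dataset(collect, train, evaluate, make_strategy,
                  initial_strategy, max_rounds=10):
    """Repeat collect, train, evaluate until no shortcut learning behavior is found."""
    dataset, strategy = [], initial_strategy
    for round_no in range(1, max_rounds + 1):
        dataset = dataset + collect(strategy)   # data acquisition module
        model = train(dataset)                  # ML module: new plus historical data
        shortcuts = evaluate(model, dataset)    # evaluation module
        if not shortcuts:
            return dataset, round_no            # output the current training set
        strategy = make_strategy(shortcuts)     # targeted strategy for the next round
    return dataset, max_rounds                  # assumed fallback: stop after max_rounds
```

The first round uses the preset strategy; every later round uses a strategy derived from the shortcuts identified in the previous evaluation.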

[0066] According to one embodiment of the present invention, a method for constructing a machine learning system in the field of computer networks is proposed. The method includes: acquiring a machine learning system in the field of computer networks to be trained; obtaining the aforementioned dataset acquisition device for constructing machine learning systems in the network field; constructing a training dataset using the obtained acquisition device; and training the machine learning system to be trained to convergence using the constructed training dataset, to obtain the final machine learning system in the field of computer networks.

[0067] In summary, this invention addresses the technical pain points in existing machine learning systems for network domains, such as insufficient generalization ability, blind data augmentation, opaque model decision logic, and low accuracy of bias diagnosis. Through three core innovative designs, it achieves a dual improvement in dataset quality and model performance, bringing significant technological advancements and practical value. Specific beneficial effects are as follows: First, it constructs an iterative optimization closed loop driven by interpretability analysis, changing the traditional positioning of interpretability analysis as merely a post-hoc explanation tool, and transforming it into a core driving force guiding model and dataset optimization. Through a complete closed-loop process of "model evaluation → interpretability analysis → bias diagnosis → data strategy optimization → data collection → model retraining," with the rationality of model decision logic as the direct optimization target, it effectively alleviates the problem of insufficient model generalization ability caused by data bias and shortcut learning, ensuring that model decisions are more aligned with the needs of real network scenarios and improving the stability and reliability of model operation. Second, it achieves accurate bias diagnosis based on a high-fidelity proxy interpretation model, solving the problem that traditional techniques struggle to analyze the decision logic of black-box models and accurately locate model defects. This invention employs an interpretability framework to extract high-precision, easily understandable surrogate models (such as decision trees) from complex black-box models. 
By deeply analyzing the decision rules of the surrogate models, it can accurately identify the non-robust features (i.e., shortcut learning behaviors) upon which the model relies, clearly revealing the model's vulnerability to out-of-distribution (OOD) samples and unknown attack patterns. This provides a reliable basis for formulating targeted data collection strategies, completely avoiding the blindness of traditional data augmentation methods and improving the accuracy and efficiency of data optimization. Third, it utilizes cross-platform abstraction capabilities to achieve generalized data collection, breaking through the limitations of traditional passive data augmentation and innovatively adopting an active data collection mode. Leveraging the cross-environment data collection capabilities of programmable network environments, Kubernetes, and the netUnicorn platform, and guided by the precise results of interpretability analysis, it actively collects data in diverse real-world network environments that can compensate for the model's current deficiencies. This ensures that the data collected in each iteration has the highest "cost-effectiveness," efficiently compensating for the model's feature learning deficiencies, helping to generate high-quality datasets that can be replicated in production networks with heterogeneity, and significantly improving the model's generalization ability and its ability to identify unknown network threats. In summary, through the aforementioned technological innovations, this invention achieves the integrated fusion of dataset collection, model training, interpretability analysis, and iterative optimization. It effectively solves the core problems in existing technologies, such as low dataset quality, weak model generalization ability, and inaccurate bias diagnosis. 
It provides reliable technical support for the efficient construction of machine learning systems in the network field, has strong practicality and application value, and can be widely applied to various network security detection scenarios.

[0068] It should be noted that although the steps are described in a specific order above, it does not mean that the steps must be executed in the above specific order. In fact, some of these steps can be executed concurrently, or even in a different order, as long as the required function can be achieved.

[0069] The various embodiments of the present invention have been described above. These descriptions are exemplary and not exhaustive, and the invention is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A data acquisition device for building machine learning systems in the network field, characterized in that, The device includes an ML module, an evaluation module, and a data acquisition module, wherein: The ML module is used to preprocess each network traffic data newly acquired by the data acquisition module to configure a label for each network traffic data. A training set is constructed using all the newly acquired network traffic data with configured labels and historical network traffic data with configured labels. A machine learning system in the network domain is trained in an end-to-end manner based on the training set. The evaluation module is used to evaluate whether the machine learning system has shortcut learning behavior. If the machine learning system does not have shortcut learning behavior, the training set constructed in the current ML module is output; otherwise, a data collection strategy is formulated based on the identified shortcut learning behavior. The data acquisition module collects network traffic data based on a preset data acquisition strategy or a data acquisition strategy formulated by the evaluation module, and feeds the collected network traffic data back to the ML module.

2. The apparatus according to claim 1, characterized in that, The evaluation module is configured as follows: A machine learning system trained with an ML module predicts the predicted label for each network traffic data point in the training set. Construct a decision tree system, wherein the decision tree system is obtained by supervising the training with network traffic data in the training set as input, the labels predicted by the decision tree system as output, and the predicted labels of the machine learning system trained by the ML module as supervision. All decision paths of the decision tree system and the evaluation information of each decision path are obtained. A pre-defined method is used to detect whether there is shortcut learning behavior in each decision path based on the obtained evaluation information. The evaluation information of each decision path includes at least the depth of the decision path, the features it depends on, and the sample coverage. The depth is the number of node splits included in the decision path, and the coverage is the proportion of the number of samples covered by the decision path to the total number of samples in the training dataset used to train the decision tree system.

3. The apparatus according to claim 2, characterized in that, the evaluation module is configured to construct the decision tree system in the following manner: the training dataset is divided into a decision tree system dataset and an evaluation dataset according to a preset partitioning ratio; multiple decision tree systems are trained using network traffic data from the decision tree system dataset as input, the labels predicted by the decision tree system as output, and the labels predicted for that network traffic data by the machine learning system trained by the ML module as supervision; each decision tree system is used to predict each network traffic data in the evaluation dataset; based on the prediction results, the prediction similarity of each decision tree system with every other decision tree system is calculated and averaged; and the decision tree system with the highest average prediction similarity is selected as the final decision tree system.

4. The apparatus according to claim 3, characterized in that, in the evaluation module, the multiple decision tree systems are trained in the following manner: S1. Randomly extract a portion of network traffic data from the decision tree system dataset, and divide all extracted network traffic data into a decision tree system training dataset and a decision tree system test dataset. S2. Train a decision tree system based on preset constraint parameters, with each network traffic data in the decision tree system training dataset as input, the label predicted by the decision tree system as output, and the label predicted by the machine learning system trained by the ML module as supervision, wherein the preset constraint parameters include at least the maximum depth and the minimum number of leaf node samples of the decision tree system. S3. Use the trained decision tree system to predict a label for each network traffic data in the decision tree system test dataset, and determine the prediction fidelity between the decision tree system and the machine learning system trained by the ML module based on the labels that the two predict for each network traffic data in the decision tree system test dataset. S4. Put the decision tree system test dataset back into the decision tree system dataset to build a new decision tree system dataset, and perform steps S1 to S4 based on the new decision tree system dataset. Each time S1 to S4 complete a first preset number of iterations, the decision tree system with the highest prediction fidelity to the machine learning system trained by the ML module is selected from the first preset number of newly constructed decision tree systems. The iteration stops when the number of selected decision tree systems reaches a second preset number, and all selected decision tree systems are used as the final multiple decision tree systems obtained through training.

5. The apparatus according to claim 3, characterized in that, In the evaluation module, each decision tree system obtained from the final training is also pruned in the following manner: Obtain all decision paths in the decision tree system; Obtain the coverage of each decision path, where the coverage of a decision path is the ratio of the amount of network traffic data it covers in the training dataset of the decision tree system corresponding to its decision tree system to the total amount of network traffic data in the training dataset of that decision tree system. Under the premise of satisfying the preset constraints, select the decision paths with the highest coverage from all decision paths, and reconstruct the decision tree system based on all selected decision paths to obtain the pruned decision tree system.

6. The apparatus according to claim 5, characterized in that, in the evaluation module, the preset constraint is that the pruned decision tree system must simultaneously meet the following conditions: the relative decrease in fidelity of the pruned decision tree system is less than or equal to 5%, or the relative decrease in fidelity is less than or equal to 0.02; the total number of nodes in the pruned decision tree system decreases by a ratio greater than or equal to 30%.

7. The apparatus according to claim 2, characterized in that, In the evaluation module, the preset method is: All decision paths are initially screened, and decision paths that simultaneously meet the conditions of depth less than or equal to a preset depth threshold and sample coverage greater than a preset sample coverage threshold are marked as suspicious decision paths. Semantic analysis is performed on each suspicious decision path to check whether the features on which the suspicious decision path depends contain surface features that are semantically unrelated to the classification task performed by the machine learning system trained by the ML module. If they are, the suspicious decision path is determined to have shortcut learning behavior.

8. The apparatus according to claim 7, characterized in that, In the evaluation module, the preset method includes: Within a preset perturbation range, feature perturbations are added to the features upon which the network traffic data covered by the suspicious decision paths depend; A machine learning system trained with an ML module is used to predict network traffic data after feature perturbation is added to obtain prediction results; If the prediction result changes with feature perturbation, the suspicious decision path is determined to have shortcut learning behavior.

9. The apparatus according to claim 1, characterized in that, The ML module is configured with a variety of machine learning systems in the field of computer networks with different architectures to choose from. Among them, the various machine learning systems in the field of computer networks with different architectures include at least random forest systems, deep learning systems, and logistic regression systems. Each machine learning system is suitable for at least one network behavior classification task among DDoS detection, malware identification, intrusion prevention, and VPN traffic detection.

10. A method for constructing a machine learning system in the field of computer networks, characterized in that, the method includes: acquiring a machine learning system in the field of computer networks to be trained; obtaining the acquisition device as described in any one of claims 1 to 9; constructing a training dataset using the obtained acquisition device; and training the machine learning system in the field of computer networks to be trained to convergence using the constructed training dataset, so as to obtain the final machine learning system in the field of computer networks.