A data stream monitoring method, apparatus and electronic device
By constructing a decision tree model to automatically monitor data streams, the problem of low efficiency in manual monitoring is solved, and efficient and accurate data stream anomaly detection and real-time early warning are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- LENOVO (BEIJING) LTD
- Filing Date
- 2023-07-31
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, data stream monitoring mainly relies on manual periodic checks, resulting in high labor costs and low monitoring efficiency.
A decision tree model is constructed from a data stream training dataset using the information gain of the transmission features. The model automatically monitors the data stream and outputs a classification result indicating whether the data stream is anomalous.
It enables comprehensive automatic monitoring of data stream transmission status, improves monitoring accuracy, reduces labor costs, increases monitoring efficiency, and provides a real-time early warning mechanism.
Figure CN116975641B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and more specifically to a data stream monitoring method, apparatus, and electronic device.
Background Technology
[0002] Systems and websites across various industries generate massive amounts of data. Data processing and data quality management play a crucial role in every enterprise's data management system. Monitoring the quality of the entire data flow during the development and migration process is particularly important. Timely detection of anomalies in the data flow during transmission is a critical step for data developers.
[0003] Current data stream quality monitoring and early warning rely mainly on manual, periodic checks of the execution status of the data stream. This approach consumes considerable manpower and lowers the efficiency of data stream monitoring.
Summary of the Invention
[0004] In view of the above, this application provides the following technical solution:
[0005] A data stream monitoring method, comprising:
[0006] Obtain transmission characteristic information of the target data stream to be monitored, wherein the transmission characteristic information includes at least one transmission characteristic;
[0007] The transmission feature information is input into the target monitoring model to obtain the monitoring result of the target data stream. The target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of the data stream training dataset. The decision tree model is used to output the classification result of whether the data stream has an anomaly.
[0008] Optionally, the method further includes:
[0009] Identify the transmission characteristics that affect the data stream transmission effect;
[0010] The transmission characteristics of different data streams during a predetermined time period, and the corresponding monitoring results, are collected.
[0011] Based on the transmission characteristics and monitoring results of each data stream in each of the predetermined time periods, training samples corresponding to the data streams are generated.
[0012] The training samples corresponding to different data streams are combined to obtain the training dataset;
[0013] Based on the training dataset, a target monitoring model is generated.
[0014] Optionally, generating the target monitoring model based on the training dataset includes:
[0015] Based on the attribute parameters corresponding to each transmission feature of the training dataset, the training dataset is divided to obtain a discretized target training dataset.
[0016] Obtain the information entropy of each transmission feature in the target training dataset;
[0017] Based on the information entropy, determine the information gain of each of the transmission features;
[0018] Nodes are generated based on the information gain and the transmission characteristics, and a decision tree model is generated based on the nodes.
[0019] The decision tree model is determined as the target monitoring model.
[0020] Optionally, the step of generating nodes based on the information gain and the transmission features, and generating a decision tree model based on the nodes, includes:
[0021] The first transmission feature with the largest information gain is determined as the root node of the decision tree, and the target training dataset is divided based on the first transmission feature to obtain a first training data subset.
[0022] Obtain the information gain of other transmission features in the first training data subset besides the first transmission feature, and determine the second transmission feature of the next level corresponding to the root node based on the information gain of other transmission features. Iteratively execute the partitioning of the current data subset based on the transmission feature of the current node, and determine the information gain of the corresponding transmission feature in the partitioned data subset, until the nodes of each level of the decision tree are determined.
[0023] A decision tree model is generated based on the nodes.
[0024] Optionally, the step of dividing the training dataset based on the attribute parameters corresponding to each transmission feature of the training dataset to obtain a discretized target training dataset includes:
[0025] The samples are sorted according to the attribute parameters of each transmission feature in the training dataset to obtain ordered samples;
[0026] Calculate the class diameter of the ordered sample for a specific classification;
[0027] Based on the class diameter, determine the loss function for each specific class;
[0028] Determine the split point based on the loss function;
[0029] The training dataset is divided according to the split points to obtain a discretized target training dataset.
[0030] Optionally, obtaining the information entropy corresponding to the transmission features in the target training dataset includes:
[0031] Based on the number of samples corresponding to each monitoring result category in the target training dataset, the category information entropy is obtained, where the monitoring result categories include normal and abnormal.
[0032] The target training dataset is divided based on the attribute parameters of each transmission feature to obtain a subset corresponding to each transmission feature;
[0033] The information entropy corresponding to each subset of each transmission feature is calculated.
[0034] Optionally, determining the information gain for each of the transmission features based on the information entropy includes:
[0035] Based on the category information entropy and the information entropy corresponding to each subset of each transmission feature, the information gain for each transmission feature is determined.
[0036] Optionally, the method further includes:
[0037] Based on the monitoring results of the target data stream, a real-time monitoring data table is generated;
[0038] Based on the real-time monitoring data table, early warning information is generated.
[0039] A data stream monitoring device, comprising:
[0040] An acquisition unit is used to obtain transmission characteristic information of the target data stream to be monitored, wherein the transmission characteristic information includes at least one transmission characteristic;
[0041] The model processing unit is used to input the transmission feature information into the target monitoring model to obtain the monitoring result of the target data stream. The target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of the data stream training dataset. The decision tree model is used to output the classification result of whether the data stream has an anomaly.
[0042] An electronic device, comprising:
[0043] Memory, used to store applications and the data generated by the running of the applications;
[0044] A processor, used to execute the application to implement the data stream monitoring method as described in any of the above.
[0045] As can be seen from the above technical solution, this application discloses a data stream monitoring method, apparatus, and electronic device, comprising: obtaining transmission characteristic information of a target data stream to be monitored, the transmission characteristic information including at least one transmission feature; inputting the transmission characteristic information into a target monitoring model to obtain the monitoring result of the target data stream, wherein the target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of the data stream training dataset, and the decision tree model is used to output a classification result indicating whether the data stream is abnormal. This application improves the accuracy of monitoring the data stream transmission status by comprehensively and automatically monitoring the data stream through a decision tree model.
Attached Figure Description
[0046] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0047] Figure 1 is a flowchart of a data stream monitoring method provided in an embodiment of this application;
[0048] Figure 2 is a partial example of the training dataset used in the embodiments of this application;
[0049] Figure 3 is a schematic diagram of a data table for calculating the class diameter provided in an embodiment of this application;
[0050] Figure 4 is a schematic diagram of a partial decision attribute table of a discretized target training dataset provided in an embodiment of this application;
[0051] Figure 5 is a schematic diagram of the partitioning of a dataset based on the average flow rate of the data stream provided in an embodiment of this application;
[0052] Figure 6 is a schematic diagram of a decision tree model provided in an embodiment of this application;
[0053] Figure 7 is a schematic diagram of the structure of a data stream monitoring device provided in an embodiment of this application.
Detailed Implementation
[0054] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0055] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0056] This application provides a data stream monitoring method that uses a constructed decision tree model to monitor the transmission quality status of different data streams and to provide early warnings. Referring to Figure 1, which is a flowchart of a data stream monitoring method provided in an embodiment of this application, the method may include the following steps:
[0057] S101. Obtain the transmission characteristic information of the target data stream to be monitored.
[0058] S102. Input the transmission feature information into the target monitoring model to obtain the monitoring results of the target data stream.
[0059] The target data stream is the data stream that needs to be monitored in real time. It can be a data stream that matches the current application scenario, such as the data stream generated by an application system or website. The transmission characteristic information of the target data stream includes at least one transmission feature, which is a feature that can affect the data stream transmission effect; that is, a transmission feature characterizes a factor affecting the robustness of data stream transmission. For example, the transmission features can be the total amount of data, the average data flow rate, the resource allocation of the data stream job process, data complexity, and data compliance rate:
Total amount of data: This is a fundamental attribute of data stream transmission. The smaller the total amount of data, the lower the possibility of data stream congestion. As one of the inherent characteristics of the data, it has a significant impact on the robustness of data stream transmission.
Average data flow rate: The average data flow rate per unit time directly affects the execution time of the data stream. However, the average data flow rate alone is not an accurate basis for judging the robustness of the data stream; it needs to be evaluated together with the inherent attributes of the data.
Resource allocation of the data stream job process: The higher the resource allocation of the data stream per unit time, the higher the risk of data stream job congestion, the less stable the overall data stream transmission, and the worse the robustness.
Data complexity (determined by the number of its own business attributes): The number of data attributes is a primary indicator of data complexity and is also one of the characteristics of the data itself. The more complex the data, the greater its impact on the data transmission rate and the more significant its impact on the robustness of data stream transmission.
Data compliance rate: Data quality must be monitored during data stream transmission. If the data does not meet the standards, the entire data stream transmission is invalid. Data compliance is checked according to the rules governing each data attribute.
[0060] In this embodiment, a pre-established target monitoring model is used to process the real-time transmission characteristic information of the target data stream to obtain real-time monitoring results of the target data stream. The target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of the data stream training dataset. This decision tree model is used to output a classification result indicating whether the data stream exhibits anomalies.
[0061] To generate an accurate target monitoring model, historical data transmission information and corresponding data stream robustness from different data streams over several equal time periods can be collected to construct a corresponding training dataset. Deep learning is then applied to these training datasets to obtain the model. Correspondingly, this application embodiment also provides a method for generating a target monitoring model, which may include the following steps:
[0062] Identify the transmission characteristics that affect the data stream transmission effect;
[0063] The transmission characteristics of different data streams during a predetermined time period, and the corresponding monitoring results, are collected.
[0064] Based on the transmission characteristics and monitoring results of each data stream in each of the predetermined time periods, training samples corresponding to the data streams are generated.
[0065] The training samples corresponding to different data streams are combined to obtain the training dataset;
[0066] Based on the training dataset, a target monitoring model is generated.
[0067] Specifically, the different data streams can be of different types, and the predetermined time period can refer to several equal time periods, i.e., multiple time periods of equal duration. The historical transmission features of different data streams over several equal time periods are collected, along with the monitoring results (such as data stream robustness classification results) for those time periods. The historical transmission features may include: total data volume, average data flow rate, resource allocation ratio of the data stream job process, data stream complexity (number of its own business attributes), and data compliance rate. The transmission features of the different data streams and their corresponding robustness classification information are then arranged into the training dataset shown in Figure 2. Note that Figure 2 shows only a partial example of the training dataset used in the embodiments of this application.
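The collection and combination steps above can be sketched in Python as follows. This is a minimal illustration only: the names `Sample`, `build_samples`, `build_training_dataset`, and the feature identifiers in `FEATURES` are hypothetical and do not come from the application, and the labels "normal" and "abnormal" stand in for the robustness classification results.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical identifiers for the five transmission features named in the text.
FEATURES = (
    "total_volume",
    "avg_flow_rate",
    "resource_ratio",
    "complexity",
    "compliance_rate",
)


@dataclass
class Sample:
    stream_id: str
    period: int                 # index of the equal-length time period
    features: Dict[str, float]  # one value per transmission feature
    label: str                  # monitoring result, e.g. "normal" or "abnormal"


# One observation: (feature values, monitoring result) for one time period.
Period = Tuple[Dict[str, float], str]


def build_samples(stream_id: str, periods: List[Period]) -> List[Sample]:
    """Turn the per-period observations of one data stream into training samples."""
    samples = []
    for index, (features, label) in enumerate(periods):
        missing = [name for name in FEATURES if name not in features]
        if missing:
            raise ValueError(f"{stream_id} period {index}: missing {missing}")
        samples.append(Sample(stream_id, index, dict(features), label))
    return samples


def build_training_dataset(history: Dict[str, List[Period]]) -> List[Sample]:
    """Combine the samples of all data streams into one training dataset."""
    dataset: List[Sample] = []
    for stream_id, periods in history.items():
        dataset.extend(build_samples(stream_id, periods))
    return dataset
```

Keeping the stream identifier and period index on each sample is a design choice made here so that a later monitoring report can be traced back to its source; the application does not prescribe a storage layout.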
[0068] To optimize model training, this embodiment automatically and reasonably optimizes the attribute parameters corresponding to continuous transmission features based on the transmission characteristics of different data streams, thus discretizing continuous value attributes. Furthermore, it calculates the information gain of each transmission feature, constructs a decision tree model based on the information gain magnitude, and predicts the robustness of the current data stream transmission based on the decision tree to determine whether an anomaly has occurred. In one embodiment, generating a target monitoring model based on the training dataset includes: dividing the training dataset based on the attribute parameters corresponding to each transmission feature to obtain a discretized target training dataset; obtaining the information entropy of each transmission feature in the target training dataset; determining the information gain of each transmission feature based on the information entropy; generating nodes based on the information gain and transmission features; generating a decision tree model based on the nodes; and determining the decision tree model as the target monitoring model.
[0069] Furthermore, the step of dividing the training dataset based on the attribute parameters corresponding to each transmission feature of the training dataset to obtain a discretized target training dataset includes: sorting samples according to the attribute parameters of each transmission feature of the training dataset to obtain ordered samples; calculating the class diameter of a specific category of the ordered samples; determining the loss function for each specific category based on the class diameter; determining the split point based on the loss function; and dividing the training dataset according to the split point to obtain a discretized target training dataset.
[0070] Specifically, when partitioning the training dataset, an optimal partition of each transmission feature of the collected data streams can be obtained using the clustering maximum partitioning method, which generates the discretized target training dataset. Taking the average flow rate of the data stream as an example, the discretized training dataset is generated as follows. First, the training dataset is sorted by the attribute parameter of the transmission feature to obtain ordered samples, as shown in Table 1. The attribute parameters are the data values of the transmission feature. For example, the attribute parameter of the average flow rate of the data stream (data streams / second) can be 15, 25, 40, 46, etc., where 15 indicates an average flow rate of 15 data streams / second.
[0071] Table 1
[0072]
[0073] Then the class diameter is calculated. Assume the ordered samples are $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. The mean vector of a class is:
[0074]
[0075] Let D(i,j) represent the diameter of the class, then:
[0076]
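The two formulas referred to in paragraphs [0073] and [0075] are not reproduced in this text. A plausible reconstruction, assuming the method is the standard optimal partition of ordered samples (Fisher's method), is given here for a class $G = \{X_{(i)}, \ldots, X_{(j)}\}$ with $i \le j$. The symbol $\bar{X}_G$ is introduced for illustration only.

```latex
\bar{X}_G = \frac{1}{j-i+1} \sum_{t=i}^{j} X_{(t)},
\qquad
D(i,j) = \sum_{t=i}^{j} \left( X_{(t)} - \bar{X}_G \right)^{\top} \left( X_{(t)} - \bar{X}_G \right)
```

Under this assumption, $D(i,j)$ is the sum of squared deviations within the class, so a class containing a single sample has diameter 0.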
[0077] The data table obtained from the diameter calculation is shown in Figure 3.
[0078] Based on the class diameter, different classification loss functions are determined, where classification refers to the classification of the attribute parameter range. Specifically, the formulas for calculating different classification loss functions are as follows:
[0079]
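The loss function formula referred to in paragraph [0078] is not reproduced in this text. Assuming the standard formulation for partitioning $n$ ordered samples into $k$ contiguous classes, with the partition written $b(n,k)$ and class start indices $i_1, \ldots, i_k$ (both notations are assumptions), it would read:

```latex
L[b(n,k)] = \sum_{t=1}^{k} D(i_t,\, i_{t+1}-1),
\qquad 1 = i_1 < i_2 < \cdots < i_k \le n,\quad i_{k+1} = n+1
```

The optimal partition minimizes this loss and, under the same assumption, satisfies the recursion:

```latex
\min L[b(n,k)] = \min_{k \le j \le n} \left\{ \min L[b(j-1,\,k-1)] + D(j,n) \right\}
```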
[0080] Assuming a maximum of 4 classes, the optimal loss function values and split points for partitions into 2, 3, and 4 classes are shown in Table 2.
[0081] Table 2
[0082]
| Property | Number of classes | Loss function value | Split points |
| --- | --- | --- | --- |
| Average flow rate of data stream | 2 | 4129.7 | 10 |
| Average flow rate of data stream | 3 | 2188.8 | 6, 12 |
| Average flow rate of data stream | 4 | 1041.7 | 3, 10, 12 |
[0083] The more classes there are, the smaller the loss function value. The ratio of the average inter-class distance to the class diameter is also considered: the larger the ratio, the smaller the inter-class similarity and the larger the intra-class similarity. On this basis, dividing the average flow rate attribute of the data stream into 4 classes yields the best result.
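The split-point search described above can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the application's implementation: it assumes the class diameter is the within-class sum of squared deviations and the loss is the sum of class diameters, as in the standard optimal partition of ordered samples. The function names are hypothetical, and the returned split points are 0-based indices into the sorted samples, which may not match the indexing used in Table 2.

```python
from typing import List, Sequence, Tuple


def class_diameter(x: Sequence[float], i: int, j: int) -> float:
    """Sum of squared deviations of x[i..j] (inclusive, 0-based) from their mean."""
    segment = x[i:j + 1]
    mean = sum(segment) / len(segment)
    return sum((value - mean) ** 2 for value in segment)


def optimal_partition(values: Sequence[float], k: int) -> Tuple[float, List[int]]:
    """Split the sorted values into k contiguous classes with minimal total diameter.

    Returns (loss, starts), where starts holds the 0-based start index of
    classes 2..k in the sorted sample order.
    """
    x = sorted(values)
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")

    # Pre-compute the diameter of every contiguous segment x[i..j].
    diam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            diam[i][j] = class_diameter(x, i, j)

    inf = float("inf")
    # loss[c][m]: minimal loss when the first m samples form c classes.
    loss = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    loss[0][0] = 0.0
    for c in range(1, k + 1):
        for m in range(c, n + 1):
            # The last class is x[s..m-1]; the first s samples form c-1 classes.
            for s in range(c - 1, m):
                candidate = loss[c - 1][s] + diam[s][m - 1]
                if candidate < loss[c][m]:
                    loss[c][m] = candidate
                    cut[c][m] = s

    # Walk back through the recorded cuts to recover the class boundaries.
    starts: List[int] = []
    m = n
    for c in range(k, 1, -1):
        s = cut[c][m]
        starts.append(s)
        m = s
    starts.reverse()
    return loss[k][n], starts
```

Calling `optimal_partition` for k = 2, 3, 4 and comparing the resulting loss values mirrors the comparison summarized in Table 2. The loss can only stay the same or decrease as k grows, which is why the text combines it with a second criterion before settling on the number of classes.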
[0084] Similarly, the best classification results for the other attributes are shown in Table 3.
[0085] Table 3
[0086]
| Property | Number of classes | Loss function value | Split points |
| --- | --- | --- | --- |
| Data volume | 3 | 1.10532E+13 | 8, 13 |
| Data stream job process resource ratio | 3 | 0.063 | 12, 14 |
| Data complexity (number of attributes) | 3 | 13946.43 | 8, 12 |
| Data compliance rate | 2 | 0.0018 | 4 |
[0087] A partial decision attribute table of the discretized target training dataset is shown in Figure 4.
[0088] After obtaining the discretized target training dataset, the final decision tree model can be generated by calculating the information entropy and information gain of each transmission feature.
[0089] In one embodiment of this application, obtaining the information entropy corresponding to the transmission feature in the target training dataset includes: obtaining the category information entropy based on the number of samples corresponding to each monitoring result category in the target training dataset, wherein the monitoring result categories include normal and abnormal; dividing the target training dataset based on the attribute parameters of each transmission feature to obtain a subset corresponding to each transmission feature; and calculating the information entropy corresponding to each subset of each transmission feature.
[0090] Furthermore, determining the information gain for each transmission feature based on information entropy includes: determining the information gain for each transmission feature based on the category information entropy and the information entropy corresponding to each subset of each transmission feature.
[0091] After obtaining the information entropy and information gain of the transmission characteristics, nodes for constructing a decision tree can be generated based on the information gain and transmission characteristics, and then a decision tree model can be generated. This decision tree model is the target monitoring model in the embodiments of this application.
[0092] In one implementation, the step of generating nodes based on the information gain and the transmission features, and generating a decision tree model based on the nodes, includes:
[0093] The first transmission feature with the largest information gain is determined as the root node of the decision tree, and the target training dataset is divided based on the first transmission feature to obtain a first training data subset.
[0094] Obtain the information gain of other transmission features in the first training data subset besides the first transmission feature, and determine the second transmission feature of the next level corresponding to the root node based on the information gain of other transmission features. Iteratively execute the partitioning of the current data subset based on the transmission feature of the current node, and determine the information gain of the corresponding transmission feature in the partitioned data subset, until the nodes of each level of the decision tree are determined.
[0095] A decision tree model is generated based on the nodes.
[0096] Specifically, let the training sample set in the target training dataset be $D$, and let the proportion of samples of class $k$ in the current sample set be $p_k$ ($k = 1, 2, \ldots, |y|$). The information entropy of the sample set $D$ is:
[0097]
[0098] Assuming the transmission characteristic of the total amount of data in the data sample is denoted as 'a', then the information gain obtained by partitioning sample D using this transmission characteristic 'a' is:
[0099]
[0100] Where $V$ represents the number of possible values of the transmission feature $a$, and $D^v$ is the subset of samples in $D$ that take the $v$-th value of feature $a$, that is, the samples at the corresponding branch node.
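The entropy and information gain formulas referred to in paragraphs [0096] and [0098] are not reproduced in this text. Assuming the standard ID3 definitions with a base-2 logarithm (the base is an assumption), they would read:

```latex
\mathrm{Ent}(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k,
\qquad
\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Ent}(D^v)
```

Here $|D^v| / |D|$ weights each branch by its share of the samples, so a feature whose branches are nearly pure yields a large gain.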
[0101] Similarly, let b be the transmission feature of the average data flow rate in the target training dataset, c be the transmission feature of the resource ratio of the data flow job process, d be the transmission feature of the data flow complexity (number of its own business attributes), and e be the transmission feature of the data compliance rate. The information gain of b, c, d, and e can be calculated by combining the same method as above.
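The entropy and information gain calculation can be sketched in Python as follows. This is a minimal illustration that assumes the standard ID3 definitions with a base-2 logarithm; the function names and the representation of a sample as a dictionary of discretized feature values are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Information entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(
    rows: List[Dict[str, Hashable]],
    labels: Sequence[Hashable],
    feature: str,
) -> float:
    """Entropy of the whole set minus the weighted entropy of each branch."""
    # Group the labels by the discretized value that each sample takes.
    branches = defaultdict(list)
    for row, label in zip(rows, labels):
        branches[row[feature]].append(label)
    total = len(labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches.values())
    return entropy(labels) - weighted
```

The same `information_gain` call is repeated once per transmission feature (a through e in the text), and the feature with the largest value is chosen for the current node.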
[0102] For example, suppose the sample set contains 15 samples with $|y| = 2$ classes, where positive examples represent "good" (i.e., excellent) robustness and negative examples represent "bad" (i.e., poor) robustness. Positive examples account for $p_1 = 8/15$ and negative examples for $p_2 = 7/15$. The initial class information entropy can then be expressed as:
[0103]
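The formula for this example is not reproduced in this text. Assuming the standard entropy definition with a base-2 logarithm, substituting $p_1 = 8/15$ and $p_2 = 7/15$ gives:

```latex
\mathrm{Ent}(D) = -\left( \frac{8}{15}\log_2\frac{8}{15} + \frac{7}{15}\log_2\frac{7}{15} \right) \approx 0.997
```

The value is close to 1 because the two classes are almost evenly balanced.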
[0104] Taking the total amount of data (denoted as 'a') as the transmission characteristic as an example, the corresponding data is divided into three parts according to the range, and the subsets correspond to:
[0105] D1: (Set 0) D2: (Set 1) D3: (Set 2)
[0106] The information entropy of the three branch subsets D1, D2, and D3 is as follows:
[0107]
[0108]
[0109]
[0110] Correspondingly, the information gain of the transmission characteristic "total data amount" 'a' is:
[0111]
[0112] The information gains of the remaining transmission features were calculated in the same manner: b) average data flow rate; c) resource allocation ratio of the data stream job process; d) data stream complexity (number of its own business attributes); and e) data compliance rate. The results are shown in Table 4.
[0113] Table 4
[0114]
[0115] The calculation shows that the information gain of the average data flow rate b is the largest. This feature is therefore used as the transmission feature of the root node, and the corresponding partitioning into data subsets can be as shown in Figure 5. For example, for the branch in which the average data flow rate is less than 40, the corresponding data subset can include the samples with average data flow rates of 15 and 25.
[0116] Similarly, the information gain of the transmission features corresponding to each node at each level of the decision tree is calculated recursively. The information gains of the transmission features are compared, and the transmission feature of the splitting node of each branch is selected. Each branch is then split further until a complete decision tree model for data stream anomaly monitoring is generated. Post-pruning of the generated decision tree model reduces the risk of overfitting and enhances its generalization performance. The decision tree model is shown in Figure 6.
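The recursive construction described above can be sketched as an ID3-style builder. This is an illustration under assumptions, not the application's implementation: the nested-dictionary tree layout, the `default` fallback label, and all function names are hypothetical, entropy uses a base-2 logarithm, and the post-pruning step mentioned in the text is not included.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Union

Row = Dict[str, Hashable]
Tree = Union[Hashable, dict]  # a leaf label or an internal node


def entropy(labels: Sequence[Hashable]) -> float:
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain(rows: List[Row], labels: Sequence[Hashable], feature: str) -> float:
    branches: Dict[Hashable, list] = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[feature], []).append(label)
    total = len(labels)
    return entropy(labels) - sum(
        len(b) / total * entropy(b) for b in branches.values()
    )


def build_tree(rows: List[Row], labels: List[Hashable], features: List[str]) -> Tree:
    """Pick the feature with the largest information gain, split, and recurse."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure subset: leaf
    majority = Counter(labels).most_common(1)[0][0]
    if not features:
        return majority                       # no feature left: majority leaf
    best = max(features, key=lambda f: gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    node = {"feature": best, "default": majority, "children": {}}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining
        )
    return node
```

The first call selects the root node, and each recursive call recomputes the gains on the current data subset using only the features not yet used on that path, which corresponds to the iteration described in paragraph [0094].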
[0117] Once the decision tree model is obtained, it can serve as a target monitoring model for monitoring the status of data streams. Specifically, transmission characteristics of the current data stream over several equal time periods can be collected and fed into the constructed decision tree model to automatically predict and obtain the robustness information of the current data stream's transmission. For data streams exhibiting abnormal states, proactive warnings can be issued to developers. Developers can obtain real-time information on the transmission status of the current data stream, receive warnings for attributes with abnormal indicators, and be given adjustment suggestions.
[0118] The attribute parameters corresponding to the transmission characteristics of the collected data stream are compared with each layer of data nodes in turn. Finally, the robustness information of the current data stream is determined at the leaf node. If the robustness is poor, an early warning notification is issued to the developers. Otherwise, no early warning will be issued by default. The following is the robustness and early warning status of the data stream at a certain time period.
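The layer-by-layer comparison described above can be sketched as a walk from the root to a leaf. This is a minimal illustration: the nested-dictionary tree layout, the example tree `EXAMPLE_TREE`, the feature identifiers, and the labels "normal" and "abnormal" are all hypothetical and are not taken from Figure 6.

```python
from typing import Dict, Hashable, Union

Tree = Union[str, dict]

# Hypothetical tree over discretized feature values (bin indices).
EXAMPLE_TREE: Tree = {
    "feature": "avg_flow_rate",
    "default": "normal",
    "children": {
        0: "normal",
        1: {
            "feature": "compliance_rate",
            "default": "abnormal",
            "children": {0: "abnormal", 1: "normal"},
        },
    },
}


def classify(tree: Tree, sample: Dict[str, Hashable]) -> str:
    """Compare the sample with each level of nodes until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample[node["feature"]]
        # Fall back to the node's default label for values unseen in training.
        node = node["children"].get(value, node["default"])
    return node


def monitor(tree: Tree, sample: Dict[str, Hashable]) -> Dict[str, object]:
    """Return the monitoring result and whether a warning should be issued."""
    result = classify(tree, sample)
    return {"result": result, "warn": result == "abnormal"}
```

A warning is raised only when the leaf is "abnormal", which matches the default of issuing no warning otherwise.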
[0119] In order to enable the tracking and early warning of data stream monitoring results, one embodiment of this application further includes:
[0120] Based on the monitoring results of the target data stream, a real-time monitoring data table is generated;
[0121] Based on the real-time monitoring data table, early warning information is generated.
[0122] Specifically, the robustness of different data streams over several equal time periods is predicted using a decision tree model. The transmission attributes of the collected data streams over each time period and the corresponding robustness results are stored in a report data warehouse. The data stream's attributes and robustness status monitoring reports throughout the entire transmission process will be displayed in real time for developers to reference. This allows developers to track the data transmission status in real time and analyze and adjust parameters promptly for any abnormal situations.
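The monitoring table and warning generation can be sketched as follows. This is an illustration only: an in-memory list stands in for the report data warehouse, and `MonitorRow`, `record_result`, `warnings_from`, and the message format are hypothetical names and choices not described in the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MonitorRow:
    stream_id: str
    period: int
    features: Dict[str, int]  # discretized transmission features
    result: str               # "normal" or "abnormal"


def record_result(
    table: List[MonitorRow],
    stream_id: str,
    period: int,
    features: Dict[str, int],
    classify_fn: Callable[[Dict[str, int]], str],
) -> MonitorRow:
    """Classify one time period and append the outcome to the monitoring table."""
    row = MonitorRow(stream_id, period, dict(features), classify_fn(features))
    table.append(row)
    return row


def warnings_from(table: List[MonitorRow]) -> List[str]:
    """Generate one warning message per abnormal row in the table."""
    return [
        f"stream {row.stream_id}, period {row.period}: abnormal transmission"
        for row in table
        if row.result == "abnormal"
    ]
```

Passing the classifier in as `classify_fn` keeps the table logic independent of how the decision tree is represented; any function that maps a feature dictionary to a result label can be used.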
[0123] The data stream monitoring method provided in this application uses the transmission feature datasets of different data streams to automatically and reasonably optimize the segmentation of continuous-valued attribute parameters, thereby discretizing the continuous-valued attributes. It calculates the information gain of each transmission feature, constructs a decision tree model based on the magnitude of the information gain of each transmission feature, and predicts the robustness of the current data stream transmission with the decision tree to determine whether an anomaly has occurred. If an anomaly occurs during the current data stream transmission, an early warning is triggered automatically, and reports on the robustness of each time period during data stream execution are generated automatically for developers to consult and analyze. In this embodiment, the data stream can be monitored in real time using the target monitoring model, which removes the need for manual monitoring and significantly reduces labor costs and developer workload. Because every transmission feature is processed, the method monitors all data anomalies comprehensively rather than relying on a specific task, chance discovery, or sampling. The method also has high execution efficiency: it automatically performs optimal attribute segmentation and automatically predicts the stability and robustness of data stream transmission through the decision tree model. It further provides real-time status reports for the data stream transmission processes of different businesses, and timely warnings and reminders for abnormal transmission situations. With the monitoring status and prediction reports, developers can analyze abnormal states and adjust parameters or solutions in advance.
[0124] This application also provides a data stream monitoring device. Referring to Figure 7, the device may include:
[0125] The acquisition unit 201 is used to acquire transmission feature information of the target data stream to be monitored, wherein the transmission feature information includes at least one transmission feature.
[0126] The model processing unit 202 is used to input the transmission feature information into the target monitoring model to obtain the monitoring result of the target data stream. The target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of the data stream training dataset. The decision tree model is used to output the classification result of whether the data stream has an anomaly.
[0127] Optionally, the device further includes:
[0128] The feature determination unit is used to determine the transmission features that affect the data stream transmission effect.
[0129] The data acquisition unit is used to collect the transmission features of different data streams within a predetermined time period, as well as the corresponding monitoring results;
[0130] The first generation unit is used to generate training samples corresponding to the data stream based on the transmission features and monitoring results of each data stream in each predetermined time period.
[0131] The combination unit is used to combine training samples corresponding to different data streams to obtain a training dataset;
[0132] The second generation unit is used to generate a target monitoring model based on the training dataset.
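The sample generation and combination performed by these units can be sketched as follows. This is only an illustrative sketch: the `TrainingSample` type, the function names, the stream names and the `delay_ms` feature are hypothetical and do not come from this application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingSample:
    stream_id: str              # which data stream the sample came from
    period: int                 # index of the predetermined time period
    features: Dict[str, float]  # transmission features observed in that period
    result: str                 # monitoring result: "normal" or "abnormal"


def samples_for_stream(stream_id: str,
                       features_by_period: Dict[int, Dict[str, float]],
                       results_by_period: Dict[int, str]) -> List[TrainingSample]:
    """One sample per time period that has both features and a monitoring result."""
    return [
        TrainingSample(stream_id, p, features_by_period[p], results_by_period[p])
        for p in sorted(features_by_period)
        if p in results_by_period
    ]


def build_training_dataset(per_stream: Dict[str, List[TrainingSample]]) -> List[TrainingSample]:
    """Combine the samples of different data streams into one training dataset."""
    dataset: List[TrainingSample] = []
    for samples in per_stream.values():
        dataset.extend(samples)
    return dataset


# Hypothetical streams and feature values, for illustration only.
orders = samples_for_stream(
    "orders",
    {0: {"delay_ms": 12.0}, 1: {"delay_ms": 480.0}},
    {0: "normal", 1: "abnormal"},
)
billing = samples_for_stream("billing", {0: {"delay_ms": 9.0}}, {0: "normal", 1: "normal"})
dataset = build_training_dataset({"orders": orders, "billing": billing})
```

Periods that lack either features or a monitoring result are skipped here; that is a design choice of the sketch, not something the application specifies.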
[0133] Optionally, the second generation unit includes:
[0134] The division subunit is used to divide the training dataset based on the attribute parameters corresponding to each transmission feature of the training dataset, so as to obtain a discretized target training dataset.
[0135] The first acquisition subunit is used to obtain the information entropy of each transmission feature in the target training dataset;
[0136] The first determining subunit is used to determine the information gain of each of the transmission features based on the information entropy.
[0137] The first generation subunit is used to generate nodes based on the information gain and the transmission features, and to generate a decision tree model based on the nodes;
[0138] The second determining subunit is used to determine the decision tree model as the target monitoring model.
[0139] Optionally, the first generating subunit is specifically used for:
[0140] The first transmission feature with the largest information gain is determined as the root node of the decision tree, and the target training dataset is divided based on the first transmission feature to obtain a first training data subset.
[0141] Obtain the information gain of other transmission features in the first training data subset besides the first transmission feature, and determine the second transmission feature of the next level corresponding to the root node based on the information gain of other transmission features. Iteratively execute the partitioning of the current data subset based on the transmission feature of the current node, and determine the information gain of the corresponding transmission feature in the partitioned data subset, until the nodes of each level of the decision tree are determined.
[0142] A decision tree model is generated based on the nodes.
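The node generation procedure above follows the pattern of an ID3-style decision tree. The sketch below illustrates that pattern under stated assumptions: the feature names `delay` and `loss`, the sample values, the nested-dictionary tree representation and the majority-class fallback for unseen values are all hypothetical choices made for illustration, not details taken from this application.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


def build_tree(rows, labels, features):
    """Recursively pick the feature with the largest gain and split on it."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: a monitoring result category
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    subsets = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        subsets[row[best]][0].append(row)
        subsets[row[best]][1].append(label)
    children = {value: build_tree(r, l, remaining) for value, (r, l) in subsets.items()}
    # "default" is an assumption of this sketch: fall back to the majority class
    # when a feature value was never seen during training.
    return {"feature": best, "children": children, "default": majority}


def classify(tree, row):
    """Walk from the root to a leaf and return the predicted category."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


# Hypothetical, already-discretized training data.
rows = [
    {"delay": "high", "loss": "low"},
    {"delay": "high", "loss": "high"},
    {"delay": "low", "loss": "low"},
    {"delay": "low", "loss": "high"},
    {"delay": "low", "loss": "low"},
    {"delay": "high", "loss": "low"},
]
labels = ["abnormal", "abnormal", "normal", "abnormal", "normal", "abnormal"]
tree = build_tree(rows, labels, ["delay", "loss"])
```

In this toy dataset `delay` has the larger information gain and becomes the root node; the `delay = low` subset is then split on `loss`, which mirrors the level-by-level iteration described above.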
[0143] Optionally, the division subunit is specifically used for:
[0144] The samples are sorted according to the attribute parameters of each transmission feature in the training dataset to obtain ordered samples;
[0145] Calculate the class diameter of the ordered sample for a specific classification;
[0146] Based on the class diameter, determine the loss function for each specific class;
[0147] Determine the split points based on the loss function;
[0148] The training dataset is divided according to the split points to obtain a discretized target training dataset.
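The sorting, class diameter, loss function and split point steps resemble the classical optimal partitioning of ordered samples (often attributed to Fisher). The sketch below assumes that interpretation: it takes the class diameter to be the sum of squared deviations from the class mean, takes the loss to be the sum of class diameters, and places each split point midway between neighbouring classes. These definitions, the function names and the sample values are assumptions, since this section does not give the formulas.

```python
from bisect import bisect_right
from typing import List


def class_diameter(xs: List[float], i: int, j: int) -> float:
    """Sum of squared deviations of xs[i..j] (inclusive) from their mean."""
    seg = xs[i:j + 1]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)


def optimal_split_points(values: List[float], k: int) -> List[float]:
    """Split the sorted values into k contiguous classes with minimal total diameter.

    Returns the k - 1 thresholds, each midway between two adjacent classes.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    inf = float("inf")
    # loss[m][c]: minimal loss when the first m samples form c classes.
    loss = [[inf] * (k + 1) for _ in range(n + 1)]
    # start[m][c]: index where the last class begins in that optimal partition.
    start = [[0] * (k + 1) for _ in range(n + 1)]
    loss[0][0] = 0.0
    for c in range(1, k + 1):
        for m in range(c, n + 1):
            for j in range(c, m + 1):  # last class is xs[j-1 .. m-1]
                cand = loss[j - 1][c - 1] + class_diameter(xs, j - 1, m - 1)
                if cand < loss[m][c]:
                    loss[m][c] = cand
                    start[m][c] = j - 1
    thresholds = []
    m = n
    for c in range(k, 1, -1):  # backtrack from the last class to the second
        s = start[m][c]
        thresholds.append((xs[s - 1] + xs[s]) / 2)
        m = s
    return sorted(thresholds)


def discretize(value: float, thresholds: List[float]) -> int:
    """Map a continuous attribute value to the index of its class."""
    return bisect_right(thresholds, value)


# Hypothetical attribute values (for example, delays in milliseconds).
points = optimal_split_points([1, 2, 3, 10, 11, 12], k=2)
```

The number of classes `k` is passed in by the caller here; how it is chosen is not stated in this section.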
[0149] Optionally, the first acquisition subunit is specifically used for:
[0150] Based on the number of samples corresponding to each monitoring result category in the target training dataset, the category information entropy is obtained, where the monitoring result categories include normal and abnormal.
[0151] The target training dataset is divided based on the attribute parameters of each transmission feature to obtain a subset corresponding to each transmission feature;
[0152] The information entropy corresponding to each subset of each transmission feature is calculated.
[0153] Optionally, the first determining subunit is specifically used for:
[0154] Based on the category information entropy and the information entropy corresponding to each subset of each transmission feature, the information gain for each transmission feature is determined.
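As a worked illustration of the entropy and information gain calculations described for these two subunits, the sketch below uses the standard Shannon entropy with base-2 logarithms. The function names, the feature names `delay` and `size`, and the sample values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def category_entropy(labels):
    """Entropy of the monitoring result categories, from their sample counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def subset_labels(rows, labels, feature):
    """Divide the dataset by the value of `feature`; return the labels of each subset."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[feature]].append(label)
    return dict(subsets)


def information_gain(rows, labels, feature):
    """Category entropy minus the size-weighted entropy of the feature's subsets."""
    n = len(labels)
    subsets = subset_labels(rows, labels, feature)
    conditional = sum(len(s) / n * category_entropy(s) for s in subsets.values())
    return category_entropy(labels) - conditional


# Hypothetical discretized samples: `delay` separates the categories, `size` does not.
rows = [
    {"delay": "low", "size": "small"},
    {"delay": "low", "size": "large"},
    {"delay": "high", "size": "small"},
    {"delay": "high", "size": "large"},
]
labels = ["normal", "normal", "abnormal", "abnormal"]

gain_delay = information_gain(rows, labels, "delay")  # 1 bit
gain_size = information_gain(rows, labels, "size")    # 0 bits
```

With two normal and two abnormal samples the category entropy is 1 bit. Splitting on `delay` yields two pure subsets, so the full bit is gained, whereas splitting on `size` leaves both subsets evenly mixed and gains nothing.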
[0155] Optionally, the device further includes: a warning information generation unit for:
[0156] Based on the monitoring results of the target data stream, a real-time monitoring data table is generated;
[0157] Based on the real-time monitoring data table, early warning information is generated.
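A minimal sketch of how a real-time monitoring data table might drive early warnings is given below. The record layout, the field names and the warning message format are assumptions made for illustration; this section does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringRecord:
    stream_id: str  # data stream that was monitored
    period: str     # time period the result refers to
    result: str     # classification result: "normal" or "abnormal"


def build_warnings(table: List[MonitoringRecord]) -> List[str]:
    """Produce one warning message for every abnormal row of the monitoring table."""
    return [
        f"WARNING: data stream '{r.stream_id}' is abnormal in period {r.period}"
        for r in table
        if r.result == "abnormal"
    ]


# Hypothetical monitoring table.
table = [
    MonitoringRecord("orders", "10:00-10:05", "normal"),
    MonitoringRecord("orders", "10:05-10:10", "abnormal"),
    MonitoringRecord("billing", "10:05-10:10", "normal"),
]
warnings = build_warnings(table)
```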
[0158] It should be noted that, for the specific implementation of each unit and subunit in this embodiment, reference may be made to the corresponding content above, which is not repeated here.
[0159] In another embodiment of this application, a readable storage medium is also provided, on which a computer program is stored; when executed by a processor, the computer program implements the data stream monitoring method described in any of the above embodiments.
[0160] In another embodiment of this application, an electronic device is also provided, which may include:
[0161] Memory, used to store applications and the data generated by the running of the applications;
[0162] A processor, used to execute the application to implement the data stream monitoring method described in any of the above embodiments.
[0163] It should be noted that, for the specific implementation of the processor in this embodiment, reference may be made to the corresponding content above, which is not repeated here.
[0164] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For identical or similar parts, the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
[0165] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0166] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.
[0167] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A data stream monitoring method, comprising: obtaining transmission feature information of a target data stream to be monitored, wherein the transmission feature information includes at least one transmission feature; and inputting the transmission feature information into a target monitoring model to obtain a monitoring result of the target data stream, wherein the target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of a data stream training dataset, and the decision tree model is used to output a classification result of whether the data stream has an anomaly; wherein the method further comprises: determining the transmission features that affect the data stream transmission effect; collecting the transmission features of different data streams within predetermined time periods, and the corresponding monitoring results; generating training samples corresponding to the data streams based on the transmission features and monitoring results of each data stream in each of the predetermined time periods; combining the training samples corresponding to different data streams to obtain the training dataset; dividing the training dataset based on the attribute parameters corresponding to each transmission feature of the training dataset to obtain a discretized target training dataset; obtaining the information entropy of each transmission feature in the target training dataset; determining the information gain of each of the transmission features based on the information entropy; generating nodes based on the information gain and the transmission features, and generating a decision tree model based on the nodes; and determining the decision tree model as the target monitoring model.
2. The method according to claim 1, wherein generating nodes based on the information gain and the transmission features, and generating a decision tree model based on the nodes, comprises: determining the first transmission feature with the largest information gain as the root node of the decision tree, and dividing the target training dataset based on the first transmission feature to obtain a first training data subset; obtaining the information gain of the transmission features other than the first transmission feature in the first training data subset, and determining the second transmission feature of the next level corresponding to the root node based on the information gain of the other transmission features; iteratively dividing the current data subset based on the transmission feature of the current node and determining the information gain of the corresponding transmission features in the divided data subset, until the nodes of each level of the decision tree are determined; and generating a decision tree model based on the nodes.
3. The method according to claim 1, wherein dividing the training dataset based on the attribute parameters corresponding to each transmission feature of the training dataset to obtain a discretized target training dataset comprises: sorting the samples according to the attribute parameters of each transmission feature in the training dataset to obtain ordered samples; calculating the class diameter of the ordered samples for a specific classification; determining the loss function for each specific classification based on the class diameter; determining the split points based on the loss function; and dividing the training dataset according to the split points to obtain the discretized target training dataset.
4. The method according to claim 1, wherein obtaining the information entropy of each transmission feature in the target training dataset comprises: obtaining the category information entropy based on the number of samples corresponding to each monitoring result category in the target training dataset, wherein the monitoring result categories include normal and abnormal; dividing the target training dataset based on the attribute parameters of each transmission feature to obtain the subsets corresponding to each transmission feature; and calculating the information entropy corresponding to each subset of each transmission feature.
5. The method according to claim 4, wherein determining the information gain of each of the transmission features based on the information entropy comprises: determining the information gain of each transmission feature based on the category information entropy and the information entropy corresponding to each subset of each transmission feature.
6. The method according to claim 1, further comprising: generating a real-time monitoring data table based on the monitoring results of the target data stream; and generating early warning information based on the real-time monitoring data table.
7. A data stream monitoring device, comprising: an acquisition unit, used to obtain transmission feature information of a target data stream to be monitored, wherein the transmission feature information includes at least one transmission feature; and a model processing unit, used to input the transmission feature information into a target monitoring model to obtain a monitoring result of the target data stream, wherein the target monitoring model is a decision tree model determined based on the information gain of at least one transmission feature of a data stream training dataset, and the decision tree model is used to output a classification result of whether the data stream has an anomaly; wherein the data stream monitoring device is further used to: determine the transmission features that affect the data stream transmission effect; collect the transmission features of different data streams within predetermined time periods, and the corresponding monitoring results; generate training samples corresponding to the data streams based on the transmission features and monitoring results of each data stream in each of the predetermined time periods; combine the training samples corresponding to different data streams to obtain the training dataset; divide the training dataset based on the attribute parameters corresponding to each transmission feature of the training dataset to obtain a discretized target training dataset; obtain the information entropy of each transmission feature in the target training dataset; determine the information gain of each of the transmission features based on the information entropy; generate nodes based on the information gain and the transmission features, and generate a decision tree model based on the nodes; and determine the decision tree model as the target monitoring model.
8. An electronic device, comprising: a memory, used to store an application and the data generated by running the application; and a processor, used to execute the application to implement the data stream monitoring method according to any one of claims 1 to 6.