A tag system-based internet of things monitoring model construction method
By employing manual and automated labeling methods based on a labeling system, the high cost and low efficiency of data labeling for IoT monitoring model training were addressed. This resulted in efficient and accurate data labeling and model training, optimized the labeling system, and improved the entity extraction capability of the IoT monitoring model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- COMP APPL TECH INST OF CHINA NORTH IND GRP
- Filing Date
- 2023-10-08
- Publication Date
- 2026-06-23
AI Technical Summary
The training data annotation of existing IoT monitoring models is costly, inefficient, and of poor quality, heavily reliant on the expertise of annotators or the translation model, resulting in low data annotation efficiency and poor accuracy.
We established an initial labeling system for manual annotation, built a BERT model and trained it to obtain feature values, improved the labeling system, used the labeling system to automatically annotate business data, built a training sample set and optimized the labeling system, and used the zero-shot replay method for model training.
It has achieved automated generation of feature labels, improved the efficiency and accuracy of data annotation, reduced the error rate, optimized the labeling system, reduced the training sample set generation time, improved the entity extraction accuracy and training efficiency of IoT monitoring models, and avoided catastrophic forgetting.
Figure CN117332269B_ABST
Abstract
Description
TECHNICAL FIELD
[0001] The present application relates to the field of natural language processing, and in particular to a label system-based Internet of Things monitoring model construction method.
BACKGROUND
[0002] Structured data is a common data structure, commonly used for data exchange between servers and clients. In the Internet of Things monitoring alarm business, Internet of Things devices will generate a large amount of real-time data, such as real-time charging pile status, real-time traffic light status, and other information. Therefore, lightweight JSON format data is used to realize data transmission between servers and various networked devices. In order to enable monitoring personnel to quickly and accurately find key information, natural language processing technology is needed to extract key information and highlight it.
[0003] To achieve the above-mentioned needs, the commonly used method at present is to train a model with a labeled sample set, and then use the model to predict actual data. There are mainly two kinds of existing sample set construction methods, one is manual, which is labeled by annotators for all new data to form a sample set; the other is based on a translation model, which uses a part of manually labeled data as the source language, first translates the source language into English, Russian, German and other intermediate languages, and then translates the intermediate language back to the source language, thereby expanding one piece of data into multiple pieces of data to form a sample set.
[0004] The existing technology has two main defects. First, manual annotation of data is costly, inefficient, prone to errors, and dependent on the professionalism of annotators. Second, data expansion based on a translation model depends on the accuracy of the translation model, is prone to losing specific information, and performs poorly in highly specialized fields.
SUMMARY
[0005] In view of the above analysis, the present application aims to provide a label system-based Internet of Things monitoring model construction method to solve the problems of high cost, low efficiency, poor quality and serious dependence on the professionalism of annotators or translation models in existing training data annotation.
[0006] The present application provides a label system-based Internet of Things monitoring model construction method, which comprises the following steps:
[0007] An initial label system is established, and structured data sets are manually annotated based on the initial label system to construct a first training sample set;
[0008] A Bert model is constructed, and the first training sample set is used to train the Bert model to obtain a trained Bert model;
[0009] Business data is input into the trained BERT model to obtain corresponding feature values, and an improved label system is obtained based on the feature values;
[0010] Construct an IoT monitoring model, and use the IoT monitoring model to predict business data to obtain predicted business data;
[0011] The predicted business data is labeled using the improved labeling system to construct a second training sample set;
[0012] The trained IoT monitoring model is obtained by training the IoT monitoring model based on the second training sample set.
[0013] Furthermore, the method further includes the following steps:
[0014] For the newly added attributes, corresponding category names and aliases are added to the improved tag system to obtain an optimized tag system. Then, the predicted business data is labeled using the optimized tag system to construct a third training sample set.
[0015] An optimized IoT monitoring model is obtained by training the IoT monitoring model based on the third training sample set.
[0016] Furthermore, the step of training the BERT model based on the first training sample set to obtain the trained BERT model includes:
[0017] The first training sample set is divided into a first training set and a first validation set;
[0018] Set the batch size and training epoch threshold for training;
[0019] During each training round, samples from the first training set are input into the BERT model in batches of the set batch size. After the round, samples from the first validation set are input into the BERT model, and an evaluation score is calculated from the precision, recall, and F1 score.
[0020] Once the number of training rounds reaches the threshold, the BERT model from the round with the highest evaluation score is taken as the trained BERT model.
[0021] Furthermore, obtaining the improved labeling system based on the feature values includes:
[0022] Based on the feature values, the corresponding feature parameter names are obtained. These feature parameter names are aliases in the tag system. After the feature parameter names are verified, they are added to the alias library of the initial tag system to obtain the improved tag system.
[0023] Furthermore, the labeling system includes four levels: primary classification, secondary classification, tertiary classification, and quaternary classification, with the quaternary classification having a corresponding alias library.
[0024] Furthermore, the process of training the IoT monitoring model based on the second training sample set to obtain the trained IoT monitoring model includes:
[0025] During the first round of training, the samples in the second training set are input into the IoT monitoring model according to the pre-set batch size to obtain the corresponding aliases and labeled positions, and the average value of the feature vector of each alias is saved.
[0026] In subsequent training rounds, the sample features of each alias are obtained based on the average value of the feature vectors of each alias obtained in the previous round. The sample features are then input into the IoT monitoring model along with the second training set for training to obtain the corresponding aliases and labeled locations. The average value of the feature vectors of each alias is saved.
[0027] Once the training rounds reach the threshold, a well-trained IoT monitoring model is obtained.
[0028] Furthermore, the average value of the feature vector for each alias is obtained as follows:
[0029] $$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N_k} F(x_{k,i}), \quad \mu_k \in \mathbb{R}^d, \quad k \in C, \quad x_{k,i} \in D$$
[0030] where $\mu_k$ is the average of the feature vectors of all samples of the k-th alias, $N_k$ is the number of samples of that alias, $\mathbb{R}^d$ is the space of real vectors with d elements, d is the number of elements in the feature vector, $C$ is the set of all aliases, $F(\cdot)$ is the function that maps an input sample to the feature vector of its alias, $x_{k,i}$ is the i-th sample of the k-th alias, and $D$ is the current training dataset.
[0031] Furthermore, during the next round of training, the sample features are obtained as follows:
[0032] $$\tilde{F}_k = \mu_k + g \cdot r$$
[0033] where $\tilde{F}_k$ is the sample feature of the k-th alias, $\mu_k$ is the average feature vector of the k-th alias, $g$ is the standard Gaussian sampling noise, and $r$ is the uncertainty scale. The scale $r$ is determined from the feature vectors $F(x_{k,i})$ of the i-th sample of the k-th alias, the number of elements d in the feature vector, and the number of aliases $C_1$.
[0034] Furthermore, the formula for the loss function is:
[0035] $$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C_1} y_{ij} \log \frac{e^{z_{ij}}}{\sum_{c=1}^{C_1} e^{z_{ic}}}$$
[0036] where N is the total number of samples in the training set; $y_{ij}$ is an indicator variable that is 1 when sample i contains alias j and 0 otherwise; $z_{ij}$ is the unnormalized score of the feature vector of the i-th sample for the j-th alias; $\sum_{c=1}^{C_1} e^{z_{ic}}$ is the sum of the exponentiated scores of the i-th sample over all aliases; and $C_1$ is the number of aliases.
[0037] Furthermore, the BERT model has 12 Transformer layers with a hidden dimension of 768, and 12 heads for multi-head self-attention.
[0038] Compared with the prior art, the present invention can achieve at least one of the following beneficial effects:
[0039] 1. This invention trains the model with structured data to obtain corresponding feature values, and obtains a complete label system based on the feature values. Therefore, it can automatically generate a large number of feature labels required by business, realize the systematic management of feature labels, and reuse the label system for similar problems, thereby improving development efficiency.
[0040] 2. This invention utilizes a tagging system to automatically annotate a continuous stream of business data, thereby greatly improving the efficiency of data annotation, reducing the error rate, and effectively solving the problems of high cost, low efficiency, poor quality, and heavy reliance on the professionalism of annotation personnel or translation models in existing data annotation.
[0041] 3. For newly added attributes, the present invention allows for the addition, deletion, modification, and querying of the tag system through the front-end page. Therefore, the tag system can be conveniently and quickly optimized based on actual business data, thereby improving the quality of business data annotation.
[0042] 4. This invention utilizes a tagging system to automatically label large amounts of business data, greatly reducing the time required to generate training sample sets for IoT monitoring models, thereby improving development efficiency.
[0043] 5. This invention uses a labeled sample set to train the IoT monitoring model, thus improving the accuracy of the IoT monitoring model in extracting key information entities and the efficiency of training.
[0044] 6. This invention uses a zero-sample replay method to train the IoT monitoring model. It obtains the sample features of old samples through the feature vectors of old samples, and uses the sample features of old samples and new samples simultaneously in the next round of training. Therefore, it avoids catastrophic forgetting during incremental learning while saving data storage overhead.
[0045] In this invention, the above-described technical solutions can be combined with each other to achieve more preferred combinations. Other features and advantages of this invention will be set forth in the following description, and some advantages may become apparent from the description or be learned by practicing the invention. The objects and other advantages of this invention can be realized and obtained from what is particularly pointed out in the description and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0046] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Throughout the drawings, the same reference numerals denote the same parts.
[0047] Figure 1 is a flowchart of the method for constructing an IoT monitoring model based on a tag system according to an embodiment of the present invention;
[0048] Figure 2 is a schematic diagram of the tag system classification according to an embodiment of the present invention;
[0049] Figure 3 is a schematic diagram of the tag system as displayed on the front-end page according to an embodiment of the present invention;
[0050] Figure 4 shows the annotated business data in an embodiment of the present invention;
[0051] Figure 5 shows the data content contained in the Pcap package of the IoT dataset in an embodiment of the present invention.
DETAILED DESCRIPTION
[0052] Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form part of this application and are used together with the embodiments of the present invention to illustrate the principles of the present invention, but are not intended to limit the scope of the present invention.
[0053] One specific embodiment of the present invention discloses a method for constructing an Internet of Things (IoT) monitoring model based on a tag system. As shown in Figure 1, the method includes the following steps:
[0054] Step S1: Establish an initial labeling system, and manually label the structured dataset based on the initial labeling system to construct the first training sample set;
[0055] Step S2: Construct a BERT model and train it based on the first training sample set to obtain the trained BERT model;
[0056] Step S3: Input the business data into the trained BERT model to obtain the corresponding feature values, and obtain an improved labeling system based on the feature values;
[0057] Step S4: Construct an IoT monitoring model and use the IoT monitoring model to predict business data to obtain predicted business data;
[0058] Step S5: Use the improved labeling system to label the predicted business data and construct a second training sample set;
[0059] Step S6: Train the IoT monitoring model based on the second training sample set to obtain a trained IoT monitoring model.
[0060] Specifically, in step S1, the required labels are classified and an initial label system is established according to the actual business needs, and then the open-source structured dataset is manually labeled based on the initial label system.
[0061] Furthermore, the tagging system includes four levels: primary category, secondary category, tertiary category, and quaternary category. The quaternary category has a corresponding alias library. Specifically, as shown in Figure 2, the secondary category is a subclass of the primary category, the tertiary category is a specific product, and the quaternary category is an attribute. Each alias in the alias library of a quaternary category has corresponding primary to quaternary categories.
[0062] For example, in IoT monitoring and alarm services, it is necessary to record and statistically analyze a large amount of real-time data generated by IoT devices, such as the real-time status of charging piles and traffic lights. A four-level tagging system is established based on these requirements. For the alias "status," its primary category is "Transportation and Travel," the secondary category is "Electric Vehicle Charging Infrastructure Services," the tertiary category is "Brand A Car Charging Pile," and the quaternary category is "Device Status Information."
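The four-level structure with a per-attribute alias library can be pictured as a small nested data structure. The following is a minimal sketch only: the class names `Attribute` and `TagSystem` and their methods are hypothetical and do not come from the patent; the category names are the ones given in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Fourth-level category (an attribute) with its alias library."""
    name: str
    aliases: set[str] = field(default_factory=set)


@dataclass
class TagSystem:
    # Nested mapping: primary -> secondary -> tertiary -> {attribute name: Attribute}
    tree: dict = field(default_factory=dict)

    def add_alias(self, primary, secondary, tertiary, attribute, alias):
        """Register an alias under a full four-level path, creating levels as needed."""
        attrs = (self.tree.setdefault(primary, {})
                          .setdefault(secondary, {})
                          .setdefault(tertiary, {}))
        attrs.setdefault(attribute, Attribute(attribute)).aliases.add(alias)

    def lookup(self, alias):
        """Return every (primary, secondary, tertiary, attribute) path that owns the alias."""
        paths = []
        for primary, secondaries in self.tree.items():
            for secondary, tertiaries in secondaries.items():
                for tertiary, attrs in tertiaries.items():
                    for attr in attrs.values():
                        if alias in attr.aliases:
                            paths.append((primary, secondary, tertiary, attr.name))
        return paths


tags = TagSystem()
tags.add_alias("Transportation and Travel",
               "Electric Vehicle Charging Infrastructure Services",
               "Brand A Car Charging Pile",
               "Device Status Information",
               "status")
print(tags.lookup("status"))
```

Returning a list of paths from `lookup` is a design choice of this sketch: the same alias (for example "status") may plausibly appear under more than one product.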
[0063] Understandably, the total number of categories in the labeling system is not large, and can be established manually; however, there are many aliases for the four-level categories, which are difficult to establish completely manually, so a model is needed to automatically complete them.
[0064] For example, the Edge-IIoTset dataset of Internet of Things devices is selected as a structured dataset, and the dataset is manually labeled based on the established initial label system to construct the first training sample set.
[0065] Specifically, in step S2, the BERT model has 12 Transformer layers with a hidden dimension of 768, and the number of heads for multi-head self-attention is 12.
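As a quick sanity check on these hyperparameters, the sketch below records them in a small configuration object and derives the per-head dimension. The class `EncoderConfig` is hypothetical and framework-independent; it only illustrates that the hidden dimension must divide evenly across the attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    num_layers: int = 12     # Transformer layers, as stated in the embodiment
    hidden_size: int = 768   # hidden dimension, as stated in the embodiment
    num_heads: int = 12      # self-attention heads, as stated in the embodiment

    @property
    def head_dim(self) -> int:
        """Dimension handled by each attention head."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


config = EncoderConfig()
print(config.head_dim)  # 64
```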
[0066] Furthermore, the first training sample set is divided into a first training set and a first validation set;
[0067] Set the batch size and training epoch threshold for training;
[0068] During each training round, samples from the first training set are input into the BERT model in batches of the set batch size. After the round, samples from the first validation set are input into the BERT model, and an evaluation score is calculated from the precision, recall, and F1 score.
[0069] Once the number of training rounds reaches the threshold, the BERT model from the round with the highest evaluation score is taken as the trained BERT model.
[0070] For example, the first training sample set is divided into a first training set and a first validation set in a 4:1 ratio, with a batch size of 24 and 30 training rounds. In each training round, samples from the first training set are input into the BERT model in batches of 24 for one round of training. The samples are as shown in Figure 5. After one round of training, the samples from the first validation set are input into the BERT model, and its precision, recall, and F1 score are calculated. After 30 rounds of training, the BERT model corresponding to the round with the highest evaluation metric score is selected as the trained BERT model.
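The split, batching, and best-round selection described above can be sketched independently of any deep learning framework. In this sketch the callbacks `train_step`, `evaluate`, and `snapshot` are hypothetical placeholders for the real model code, and using the F1 score alone as the selection score is an assumption; the text only says the score is based on precision, recall, and F1.

```python
import random


def prf1(gold, pred):
    """Precision, recall, and F1 over two sets of labeled items."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle and split into training and validation sets (4:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def train_and_select(samples, train_step, evaluate, snapshot, epochs=30, batch_size=24):
    """Train for a fixed number of rounds and keep the snapshot with the best F1.

    train_step(batch), evaluate(val_set) -> (gold_set, pred_set), and snapshot()
    are hypothetical callbacks supplied by the caller.
    """
    train_set, val_set = split_samples(samples)
    best_f1, best_state, best_epoch = -1.0, None, None
    for epoch in range(1, epochs + 1):
        for batch in batches(train_set, batch_size):
            train_step(batch)
        gold, pred = evaluate(val_set)
        _, _, f1 = prf1(gold, pred)
        if f1 > best_f1:  # strict comparison keeps the earliest best round
            best_f1, best_state, best_epoch = f1, snapshot(), epoch
    return best_state, best_epoch, best_f1
```

Keeping a snapshot per best round, rather than the final weights, is what lets the round with the highest validation score be returned after all 30 rounds have run.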
[0071] Specifically, in step S3, a certain amount of business data (e.g., 500 data points) is randomly pulled from the actual IoT monitoring environment, and the business data is input into the trained BERT model to obtain the corresponding feature values, as shown in Table 1.
[0072] Furthermore, obtaining the improved labeling system based on the feature values includes:
[0073] Based on the feature values, the corresponding feature parameter names are obtained, as shown in Table 1. These feature parameter names are aliases in the tag system. After the feature parameter names are verified, they are added to the alias library of the initial tag system to obtain the improved tag system.
[0074] Table 1 Business Data Feature Values and Corresponding Feature Parameter Names
[0075]
[0076] Specifically, the business data is structured data, so the corresponding key value (feature parameter name) can be obtained through the feature value. The four-level categories corresponding to the feature parameter name are checked, and incorrect categories are corrected. Then, the feature parameter name is added to the corresponding four-level category alias library, and the feature value is highlighted.
[0077] For example, for structured data {"port": "port1"} and {"rport": "port2"}, the feature parameter names corresponding to the feature values "port1" and "port2" are "port" and "rport" respectively, and their corresponding fourth-level category is "charging port". Therefore, "port" and "rport" are added to the alias library for "charging port".
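The key lookup in this example can be sketched in a few lines. The function names below are hypothetical, the records are assumed to be flat JSON objects with scalar values, and the `feature_values` list stands in for the output of the trained BERT model; the `verified` set stands in for the manual check of parameter names.

```python
import json


def find_parameter_names(record, feature_values):
    """Return {feature value: key} for each feature value found in a flat JSON object."""
    wanted = set(feature_values)
    return {value: key for key, value in record.items() if value in wanted}


def extend_alias_library(alias_library, category, names, verified):
    """Add only the verified parameter names to the alias set of a fourth-level category."""
    accepted = [name for name in names if name in verified]
    alias_library.setdefault(category, set()).update(accepted)
    return accepted


records = [json.loads('{"port": "port1"}'), json.loads('{"rport": "port2"}')]
feature_values = ["port1", "port2"]  # stand-in for the trained model's output

names = []
for record in records:
    names.extend(find_parameter_names(record, feature_values).values())

alias_library = {"charging port": set()}
extend_alias_library(alias_library, "charging port", names, verified={"port", "rport"})
print(sorted(alias_library["charging port"]))  # ['port', 'rport']
```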
[0078] Specifically, in step S4, the IoT monitoring model is a UIE model built using the Paddle framework. For example, the parameters of the IoT monitoring model are shown in Table 2:
[0079] Table 2 Parameters of the IoT Monitoring Model
[0080]
[0081] The IoT monitoring model includes an encoder, a decoder, a structured extraction language (SEL), a schema, and rules. The encoder maps the input samples to feature vectors for each alias. The decoder maps the feature vectors to entities and relational structures according to the schema and rules, thereby performing entity extraction and relation extraction. The SEL transforms different extraction structures and targets into a unified output format. The extracted entities are aliases of the four-level classification of the label system, and the extracted relations are the start and end positions of the alias.
[0082] Business data is retrieved from the actual IoT monitoring environment and input into the IoT monitoring model for prediction, including word segmentation and part-of-speech tagging, to obtain predicted business data, thereby reducing the difficulty of subsequent data processing.
[0083] Specifically, in step S5, the predicted business data is labeled using the improved labeling system. As shown in Figure 4, the feature values of the business data are automatically labeled using the aliases in the labeling system, thereby constructing the second training sample set.
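One way to picture the automatic labeling is as a pass over each JSON record that marks the value of every key found in the alias library, together with its start and end positions in the serialized text. The sketch below assumes flat records and a hypothetical `alias_to_category` mapping; the label format is illustrative and is not the format shown in Figure 4. Spans are character offsets into the serialized record and include the surrounding quotes of string values.

```python
import json


def label_record(record, alias_to_category):
    """Serialize a flat JSON record and label each value whose key is a known alias.

    Returns the serialized text and a list of labels with offsets [start, end).
    """
    text = json.dumps(record, ensure_ascii=False)
    labels, cursor = [], 0
    for key, value in record.items():
        key_json = json.dumps(key, ensure_ascii=False)
        value_json = json.dumps(value, ensure_ascii=False)
        # json.dumps emits pairs in dict order, so searching forward from the
        # cursor finds this pair and not a look-alike inside an earlier value.
        pair_start = text.index(key_json + ": " + value_json, cursor)
        start = pair_start + len(key_json) + 2  # skip the key and the ": " separator
        end = start + len(value_json)
        cursor = end
        if key in alias_to_category:
            labels.append({"alias": key, "category": alias_to_category[key],
                           "start": start, "end": end})
    return text, labels


record = {"device": "pile-07", "status": "charging", "port": "port1"}
alias_to_category = {"status": "Device Status Information", "port": "charging port"}
text, labels = label_record(record, alias_to_category)
for label in labels:
    print(label["alias"], text[label["start"]:label["end"]])
```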
[0084] Specifically, in step S6, training the IoT monitoring model based on the second training sample set to obtain the trained IoT monitoring model includes:
[0085] During the first round of training, the samples in the second training set are input into the IoT monitoring model according to the pre-set batch size to obtain the corresponding aliases and labeled positions, and the average value of the feature vector of each alias is saved.
[0086] In subsequent training rounds, the sample features of each alias are obtained based on the average value of the feature vectors of each alias obtained in the previous round. The sample features and the samples in the second training set are input into the IoT monitoring model for training according to the set batch size to obtain the corresponding aliases and label positions. The average value of the feature vectors of each alias is saved.
[0087] Once the training rounds reach the threshold, a well-trained IoT monitoring model is obtained.
[0088] Furthermore, the average value of the feature vector for each alias is obtained as follows:
[0089] $$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N_k} F(x_{k,i}), \quad \mu_k \in \mathbb{R}^d, \quad k \in C, \quad x_{k,i} \in D$$
[0090] where $\mu_k$ is the average of the feature vectors of all samples of the k-th alias, $N_k$ is the number of samples of that alias, $\mathbb{R}^d$ is the space of real vectors with d elements, d is the number of elements in the feature vector, $C$ is the set of all aliases, $F(\cdot)$ is the function that maps an input sample to the feature vector of its alias, $x_{k,i}$ is the i-th sample of the k-th alias, and $D$ is the current training dataset.
[0091] Furthermore, the formula for the loss function is:
[0092] $$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C_1} y_{ij} \log \frac{e^{z_{ij}}}{\sum_{c=1}^{C_1} e^{z_{ic}}}$$
[0093] where N is the total number of samples in the training set; $y_{ij}$ is an indicator variable that is 1 when sample i contains alias j and 0 otherwise; $z_{ij}$ is the unnormalized score of the feature vector of the i-th sample for the j-th alias; $\sum_{c=1}^{C_1} e^{z_{ic}}$ is the sum of the exponentiated scores of the i-th sample over all aliases; and $C_1$ is the number of aliases.
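The symbol definitions for the loss (an indicator variable, unnormalized scores, and a sum over all aliases) match the standard softmax cross-entropy. Assuming that is the intended loss, a minimal pure-Python sketch looks as follows; the function name and the toy scores are illustrative only.

```python
import math


def alias_cross_entropy(scores, targets):
    """Mean softmax cross-entropy over samples.

    scores[i][j]  : unnormalized score of sample i for alias j
    targets[i][j] : 1 if sample i contains alias j, else 0
    """
    total = 0.0
    for row, target_row in zip(scores, targets):
        peak = max(row)  # subtract the max before exponentiating for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in row))
        total -= sum(y * (z - log_norm) for z, y in zip(row, target_row))
    return total / len(scores)


scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [[1, 0, 0], [0, 1, 0]]
print(alias_cross_entropy(scores, targets))
```

For the second sample all scores are equal, so its contribution is log 3, the loss of a uniform guess over three aliases.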
[0094] Furthermore, during the next round of training, the sample features are obtained as follows:
[0095] $$\tilde{F}_k = \mu_k + g \cdot r$$
[0096] where $\tilde{F}_k$ is the sample feature of the k-th alias, $\mu_k$ is the average feature vector of the k-th alias, $g$ is the standard Gaussian sampling noise, and $r$ is the uncertainty scale. The scale $r$ is determined from the feature vectors $F(x_{k,i})$ of the i-th sample of the k-th alias, the number of elements d in the feature vector, and the number of aliases $C_1$.
[0097] Understandably, when the features are reconstructed, $g$ follows the standard normal distribution $N(0, 1)$, so $g \cdot r$ follows the normal distribution $N(0, r^2)$, and $\mu_k + g \cdot r$ follows the normal distribution $N(\mu_k, r^2)$, where $\mu_k$ is the average feature vector of the k-th alias. In this way, the sample features are restored as faithfully as possible.
[0098] Based on the above method of preserving sample features, in the next round of training, the old sample features are augmented with Gaussian noise and input together with the new samples, thereby reducing the rate at which old sample information is forgotten; and only the feature vectors are stored, saving the overhead of data storage.
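The replay scheme described above (store one mean feature vector per alias, then regenerate old-sample features by adding scaled Gaussian noise) can be sketched in pure Python. This is an illustration under stated assumptions: the function names are hypothetical, the noise is drawn independently for each element of the vector, and the uncertainty scale `r` is passed in as a plain number rather than computed.

```python
import random


def alias_means(features, aliases):
    """Average the feature vectors of the samples belonging to each alias."""
    sums, counts = {}, {}
    for vector, alias in zip(features, aliases):
        if alias not in sums:
            sums[alias] = [0.0] * len(vector)
            counts[alias] = 0
        sums[alias] = [s + v for s, v in zip(sums[alias], vector)]
        counts[alias] += 1
    return {alias: [s / counts[alias] for s in total] for alias, total in sums.items()}


def replay_features(means, r, rng):
    """Draw one pseudo-feature per alias: the stored mean plus Gaussian noise scaled by r."""
    return {alias: [m + rng.gauss(0.0, 1.0) * r for m in mean]
            for alias, mean in means.items()}


features = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
aliases = ["status", "status", "port"]
means = alias_means(features, aliases)
print(means)  # {'status': [2.0, 3.0], 'port': [10.0, 0.0]}

rng = random.Random(0)  # seeded only to make the sketch repeatable
replayed = replay_features(means, r=0.1, rng=rng)
```

Only `means` needs to be kept between rounds, which is where the storage saving comes from: the old samples themselves are never stored.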
[0099] Furthermore, the method further includes the following steps:
[0100] For the newly added attributes, corresponding category names and aliases are added to the improved tag system to obtain an optimized tag system. Then, the predicted business data is labeled using the optimized tag system to construct a third training sample set.
[0101] An optimized IoT monitoring model is obtained by training the IoT monitoring model based on the third training sample set.
[0102] Understandably, when a new attribute is introduced, the existing trained IoT monitoring model has not been trained for that attribute, and therefore cannot accurately identify it. Labeling the training sample set with a tagging system before inputting it into the IoT monitoring model can significantly improve training efficiency and accuracy. However, the existing comprehensive tagging system does not include this attribute and cannot label it. Therefore, a simple method is needed to quickly and easily add the corresponding category name and alias to achieve an optimized tagging system.
[0103] As shown in Figure 3, the tag system established by this invention displays categories, subcategories, products, attributes, and aliases on the front-end page in a tree structure. Users can add, delete, modify, and query the tag system on this page. When an incorrect tag is found, it can be corrected or deleted. When a new tag is added, a corresponding category name and alias are added. Changes made on the page are synchronized to the tag system. Each level of tag and each alias supports modification and deletion.
[0104] Compared with existing technologies, the beneficial effects of the tag system-based IoT monitoring model construction method provided by this invention are as follows. First, the model is trained with structured data to obtain corresponding feature values, and an improved tag system is obtained based on these feature values. Therefore, a large number of feature tags required by the business can be generated automatically, feature tags are managed systematically, and the tag system can be reused for similar problems, improving development efficiency. Second, using the tag system to automatically annotate the continuous stream of business data greatly improves annotation efficiency, reduces the error rate, and effectively solves the problems of high cost, low efficiency, poor quality, and heavy reliance on the professionalism of annotators or translation models in existing data annotation methods. Third, for newly added attributes, the tag system can be added to, deleted from, modified, and queried through the front-end page, so the tag system can be optimized conveniently and quickly based on actual business data, improving the quality of business data labeling. Fourth, using the tag system to automatically label a large amount of business data greatly reduces the time required to generate training sample sets for the IoT monitoring model, improving development efficiency. Fifth, training the IoT monitoring model with labeled sample sets improves both the accuracy of key-information entity extraction and the efficiency of training. Sixth, the IoT monitoring model is trained with the zero-sample replay method: the sample features of old samples are obtained from the feature vectors of old samples, and the sample features of old samples are used together with new samples in the next round of training. Therefore, catastrophic forgetting during incremental learning is avoided while saving data storage overhead.
[0105] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware, and the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.
[0106] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for constructing an IoT monitoring model based on a tag system, characterized in that the method includes the following steps:
an initial labeling system is established, and a structured dataset is manually labeled based on the initial labeling system to construct a first training sample set;
a BERT model is constructed and trained based on the first training sample set to obtain a trained BERT model;
business data is input into the trained BERT model to obtain corresponding feature values, and an improved labeling system is obtained based on the feature values, which includes: obtaining the corresponding feature parameter names based on the feature values, the feature parameter names being aliases in the tag system; and, after verifying the feature parameter names, adding them to the alias library of the initial tag system to obtain the improved tag system;
an IoT monitoring model is constructed and used to predict business data to obtain predicted business data;
the predicted business data is labeled using the improved labeling system to construct a second training sample set;
the IoT monitoring model is trained based on the second training sample set to obtain a trained IoT monitoring model, which includes:
during the first round of training, the samples in the second training set are input into the IoT monitoring model according to the pre-set batch size to obtain the corresponding aliases and labeled positions, and the average value of the feature vector of each alias is saved;
in subsequent training rounds, the sample features of each alias are obtained based on the average value of the feature vectors of each alias obtained in the previous round, the sample features are input into the IoT monitoring model along with the second training set for training to obtain the corresponding aliases and labeled positions, and the average value of the feature vectors of each alias is saved;
once the number of training rounds reaches the threshold, the trained IoT monitoring model is obtained;
wherein the average value of the feature vector of each alias is obtained as follows:
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N_k} F(x_{k,i}), \quad \mu_k \in \mathbb{R}^d, \quad k \in C, \quad x_{k,i} \in D$$
where $\mu_k$ is the average of the feature vectors of all samples of the k-th alias, $N_k$ is the number of samples of that alias, $\mathbb{R}^d$ is the space of real vectors with d elements, d is the number of elements in the feature vector, $C$ is the set of all aliases, $F(\cdot)$ is the function that maps an input sample to the feature vector of its alias, $x_{k,i}$ is the i-th sample of the k-th alias, and $D$ is the current training dataset.
2. The method according to claim 1, characterized in that, The method further includes the following steps: For the newly added attributes, corresponding category names and aliases are added to the improved tag system to obtain an optimized tag system. Then, the predicted business data is labeled using the optimized tag system to construct a third training sample set. An optimized IoT monitoring model is obtained by training the IoT monitoring model based on the third training sample set.
3. The method according to claim 1, characterized in that, The process of training the BERT model based on the first training sample set to obtain the trained BERT model includes: The first training sample set is divided into a first training set and a first validation set; Set the batch size and training epoch threshold for training; During each training round, samples from the first training set are input into the Bert model for one round of training according to the set batch size. After one round of training, samples from the first validation set are input into the Bert model, and the evaluation index score is calculated based on the precision, recall and F1 score evaluation index. Once the training rounds reach the threshold, the BERT model corresponding to the round with the highest evaluation metric score is the well-trained BERT model.
4. The method according to claim 1, characterized in that, The labeling system includes four levels: primary classification, secondary classification, tertiary classification, and quaternary classification. The quaternary classification has a corresponding alias library.
5. The method according to claim 1, characterized in that, during the next round of training, the sample features are obtained as follows: $$\tilde{F}_k = \mu_k + g \cdot r$$ where $\tilde{F}_k$ is the sample feature of the k-th alias, $\mu_k$ is the average feature vector of the k-th alias, $g$ is the standard Gaussian sampling noise, and $r$ is the uncertainty scale. The scale $r$ is determined from the feature vectors $F(x_{k,i})$ of the i-th sample of the k-th alias, the number of elements d in the feature vector, and the number of aliases $C_1$.
6. The method according to claim 1, characterized in that the formula for the loss function is: $$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C_1} y_{ij} \log \frac{e^{z_{ij}}}{\sum_{c=1}^{C_1} e^{z_{ic}}}$$ where N is the total number of samples in the training set; $y_{ij}$ is an indicator variable that is 1 when sample i contains alias j and 0 otherwise; $z_{ij}$ is the unnormalized score of the feature vector of the i-th sample for the j-th alias; $\sum_{c=1}^{C_1} e^{z_{ic}}$ is the sum of the exponentiated scores of the i-th sample over all aliases; and $C_1$ is the number of aliases.
7. The method according to claim 1 or 3, characterized in that, The Bert model has 12 Transformer layers with 768 dimensions, and 12 heads for multi-head self-attention.