An IoT terminal network communication state monitoring method

By employing adaptive spatiotemporal feature alignment, multi-scale dilated convolution, and cross-attention mechanisms, the temporal offset and multi-scale problems in IoT terminal network communication status monitoring are solved, enabling more accurate status identification and environmental adaptive monitoring.

CN122226643A · Pending · Publication Date: 2026-06-16 · SHANDONG SIJI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: SHANDONG SIJI TECH CO LTD
Filing Date: 2026-03-20
Publication Date: 2026-06-16

Abstract

The application relates to the technical field of Internet of Things monitoring, and discloses an Internet of Things terminal network communication state monitoring method, which comprises network communication data acquisition and training data set construction, construction of a communication state monitoring model, definition of a loss function, communication state monitoring model training, and Internet of Things terminal network communication state monitoring. The application has the beneficial effect that a dynamic time warping mechanism is used to perform time sequence alignment on an original feature matrix, local time offset problems caused by network jitter are eliminated, subsequent feature extraction can be based on more regular time sequence data, and state misjudgment caused by time axis misalignment is avoided.

Description

Technical Field

[0001] This invention relates to the field of Internet of Things (IoT) monitoring technology, specifically to a method for monitoring the network communication status of IoT terminals.

Background Technology

[0002] Against the backdrop of large-scale deployment and application of the Internet of Things (IoT), various IoT terminals have widely penetrated key areas such as industrial control, smart homes, and environmental monitoring, becoming the core carrier connecting the physical world and the digital space. These terminals typically operate in complex and ever-changing network environments, relying on wireless communication links for data interaction. The stability of their network communication status directly determines the reliability and real-time performance of IoT services. However, due to the heterogeneous hardware platforms and limited computing resources of IoT terminals, coupled with the highly dynamic and uncertain network environment, problems such as network congestion and signal attenuation frequently occur, leading to data transmission delays, packet loss, and even service interruptions, causing serious losses to production and daily life.

[0003] Faced with high-concurrency, multi-source, and heterogeneous data streams generated by massive numbers of terminals, traditional monitoring methods are insufficient, necessitating the introduction of industrial big data technology systems for in-depth governance. By constructing a distributed big data storage architecture, it is possible to efficiently carry and persistently store historical full communication logs and real-time status snapshots, providing a solid data foundation for long-term trend analysis and fault backtracking. Based on this, utilizing advanced big data analytics engines to mine massive amounts of time-series data not only provides a macro-level insight into the overall network operation but also accurately pinpoints the micro-level causes of anomalies.

[0004] To address the aforementioned issues, several IoT status monitoring methods have emerged on the market. However, existing monitoring methods still suffer from the following problems:

[0005] 1. Conventional normalization or standardization methods only scale each feature dimension independently, which cannot correct the misalignment of the feature matrix on the time axis caused by network jitter. This results in time offset errors in the model input, affecting the accuracy of state discrimination.

[0006] 2. Conventional methods typically use convolutional kernels of fixed size to extract temporal features, which can only capture local changes at a single scale. They cannot simultaneously perceive multi-scale temporal dependencies such as rapid retransmission fluctuations and slow signal attenuation, making it difficult to comprehensively characterize the temporal evolution of different abnormal states.

[0007] 3. Conventional techniques often treat all time slots equally or use fixed attention weights when fusing features, without considering the differences in contribution of different time slots to the final state determination, and cannot effectively suppress the interference of historical noise on the current state determination, thus affecting the model's sensitivity to recent state changes.

[0008] 4. Conventional classification models do not fully utilize contextual information such as terminal type and time period when discriminating network states, and they have difficulty handling the overlapping distributions and fuzzy boundaries of different abnormal states in the feature space, resulting in insufficient classification robustness in complex dynamic environments.

Summary of the Invention

[0009] To address the shortcomings of existing technologies, this invention provides a method for monitoring the network communication status of IoT terminals.

[0010] To achieve the above objectives, the present invention employs the following technical solution:

[0011] A method for monitoring the network communication status of IoT terminals includes the following steps:

[0012] S1. Obtain network communication data from IoT terminals, and preprocess and label the network communication data to construct a training dataset;

[0013] S2. Construct a communication status monitoring model, which is used to output communication status classification results based on the input network communication data;

[0014] S3. Define the loss function of the communication status monitoring model based on the main task loss and the auxiliary regularization loss. The loss function is used to quantify the difference between the model output and the true label.

[0015] S4. Train the communication status monitoring model using the training dataset, update the model parameters by minimizing the loss function until the model converges, and obtain the trained communication status monitoring model.

[0016] S5. Input the real-time network communication data to be monitored into the trained communication status monitoring model to obtain the current communication status monitoring results, and perform corresponding early warning or control operations based on the monitoring results.

[0017] Furthermore, in S1, a lightweight data acquisition agent is embedded in each participating terminal. This agent continuously captures key performance indicators of the network layer and transport layer at fixed time intervals, specifically including packet loss rate, round-trip time, signal strength, TCP retransmission rate, and data packet reception throughput, totaling five features.

[0018] Furthermore, S2 specifically refers to:

[0019] S21. Adaptive spatiotemporal feature alignment and enhancement processing is adopted. First, the optimal alignment path between the temporal series and the standard reference sequence within the window is found through a dynamic time warping strategy. Then, the local statistical properties are used to enhance the contrast of the aligned data, and the aligned and enhanced normalized feature tensor is output.

[0020] S22. Multi-scale temporal dependencies are captured by parallel dilated convolutions with different dilation rates, and feature maps of different scales are adaptively fused according to the local variance of the alignment enhancement matrix to construct a temporal feature map rich in multi-scale information.

[0021] S23. Construct a lightweight but highly expressive communication state classification model. This model deeply integrates multi-scale features, time slot attention mechanism, and dynamic gating calibration based on environmental context to achieve accurate judgment of the current communication state.

[0022] Furthermore, S3 specifically refers to:

[0023] The main task loss uses the confidence level predicted by the model itself to dynamically adjust the penalty strength of the classification loss for each sample and calibrates the prediction of the confidence level.

[0024] The auxiliary regularization loss is calculated using a graph Laplacian regularization-based method.

[0025] The total loss function combines the main task classification loss with the auxiliary regularization loss through a weighted summation.
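As a reading aid, the weighted summation in [0025] and the graph Laplacian regularizer named in [0024] can be written in the standard textbook form below. The symbols are assumptions introduced here, not the patent's own notation: $\lambda$ is a weighting coefficient, $Z$ stacks the per-sample feature vectors $z_i$ row by row, and $W = (w_{ij})$ is a symmetric sample-similarity matrix with degree matrix $D$. The text does not say how the similarity graph is built, and it gives no closed form for the confidence-weighted main task loss, so $\mathcal{L}_{\text{main}}$ is left abstract.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{main}} + \lambda\,\mathcal{L}_{\text{reg}},
\qquad
\mathcal{L}_{\text{reg}} = \operatorname{tr}\!\left(Z^{\top} L Z\right)
= \tfrac{1}{2}\sum_{i,j} w_{ij}\,\lVert z_i - z_j \rVert_2^{2},
\qquad
L = D - W
```

Under this form the regularizer is small when samples that are strongly connected in the graph receive similar feature vectors.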

[0026] Furthermore, S21 specifically refers to,

[0027] Dynamic time warping alignment: by dynamic time warping, the original feature matrix $X$ of dimension $T \times F$ is aligned with the standard reference matrix $X_{\mathrm{ref}}$ of dimension $T \times F$ along the time dimension to eliminate local time offsets caused by network jitter, yielding an alignment matrix $X_a$ of dimension $T \times F$; $T$ denotes the size of the time window and $F$ denotes the number of features;

[0028] Local statistical property calculation: Calculate the local mean and local standard deviation of each feature in the alignment matrix within the current time window;

[0029] Saliency weight mask generation: A normalized exponential weighting method based on the local mean and standard deviation is used to generate a saliency weight for each time slot and feature, so as to amplify fluctuation points that deviate from the local mean;

[0030] Adaptive spatiotemporal feature enhancement: features are enhanced using the alignment matrix, the saliency weight mask, and learnable temporal convolution kernels. The alignment enhancement matrix is obtained by element-wise multiplication of the alignment matrix and the saliency mask, followed by addition of the local context features extracted through one-dimensional convolution.

[0031] Furthermore, S22 specifically refers to,

[0032] Multi-scale dilated convolution feature extraction: Three parallel dilated convolutional layers are applied to the alignment enhancement matrix, with different dilation rates set respectively, to generate three initial feature maps with different receptive fields;

[0033] Dynamic fusion weight generation: Calculate the dynamic fusion weight vector related to the input based on the local fluctuation characteristics of the alignment enhancement matrix;

[0034] Multi-scale feature weighted fusion: Using the generated dynamic fusion weight vector, the feature maps of three different scales are weighted and summed to obtain the temporal feature map.

[0035] Furthermore, S23 specifically refers to,

[0036] A decoupled cross-attention mechanism is applied to the temporal feature map. The decoupled cross-attention mechanism treats the time dimension and the feature dimension as two independent sequences. By calculating their interaction with a set of learnable state prototypes, an attention distribution that can characterize the similarity between the current sample and various typical state patterns is generated.

[0037] The dynamic gating mechanism based on context feature vectors adaptively fuses attention aggregation features and projection global pooling features. The dynamic gating mechanism can dynamically adjust the contribution of the two features according to the current environmental context, making the fused features more adaptable to different communication scenarios and improving the accuracy of state discrimination.

[0038] A multi-task classification head is adopted. On top of the fused features, the head branches into two tasks, which are dynamically weighted and jointly optimized by taking the homoscedastic uncertainty of each task into account.

[0039] Furthermore, the multi-task classification head, which branches into two tasks on top of the fused features and dynamically weights and jointly optimizes them according to the homoscedastic uncertainty of each task, is specifically as follows:

[0040] Main state classification task: The main state classification task is used to predict the specific state category of network communication within the current time window. It obtains the probability distribution of each category by inputting the fused feature vector into a linear classification layer and applying the Softmax function.

[0041] State confidence regression task: The state confidence regression task is used to estimate the confidence level of the model in the current classification result. It is obtained by concatenating the fused feature vector and the local standard deviation vector and inputting them into a linear layer, and then passing them through the Sigmoid function to obtain a confidence score between 0 and 1.
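The two heads in [0040] and [0041] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the fused feature width `D = 16`, the random parameter values, the sample inputs, and all function and key names are hypothetical. Only the structure comes from the text: a linear layer plus Softmax over the four state categories, and a linear layer plus Sigmoid over the fused vector concatenated with the local standard deviation vector.

```python
import math
import random
from typing import Dict, List, Tuple

STATES = ("normal", "mild congestion", "severe congestion", "signal degradation")


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def classify(fused: List[float], local_std: List[float],
             params: Dict[str, list]) -> Tuple[List[float], float]:
    """Return (class probabilities, confidence score in (0, 1))."""
    probs = softmax(linear(fused, params["w_cls"], params["b_cls"]))
    # The confidence head sees the fused vector concatenated with the local std vector.
    conf_in = fused + local_std
    confidence = sigmoid(linear(conf_in, params["w_conf"], params["b_conf"])[0])
    return probs, confidence


rng = random.Random(0)
D, F = 16, 5  # D is an assumed fused-feature width; F matches the five features
params = {
    "w_cls": [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in STATES],
    "b_cls": [0.0] * len(STATES),
    "w_conf": [[rng.uniform(-0.5, 0.5) for _ in range(D + F)]],
    "b_conf": [0.0],
}
fused = [rng.uniform(-1.0, 1.0) for _ in range(D)]  # hypothetical fused features
local_std = [0.02, 3.5, 1.2, 0.01, 0.4]             # hypothetical per-feature std
probs, confidence = classify(fused, local_std, params)
predicted = STATES[probs.index(max(probs))]
```

With untrained random parameters the predicted label is meaningless; the sketch only shows the shapes and the two output ranges.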

[0042] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0043] 1. This invention uses a dynamic time warping mechanism to align the original feature matrix in time, eliminating the problem of local time offset caused by network jitter, so that subsequent feature extraction can be based on more regular time series data, avoiding misjudgment of state due to misalignment of time axis.

[0044] 2. This invention employs multi-scale dilated convolution to extract temporal features of different receptive fields in parallel, and dynamically fuses feature maps of each scale according to the local fluctuation of the input data, enabling the model to adaptively select the most suitable combination of receptive fields based on the temporal characteristics of the network state, while capturing both short-term bursts and long-term trends.

[0045] 3. This invention adopts a slot-feature decoupled cross-attention mechanism, treating time and feature dimensions as independent sequences to interact with learnable state prototypes respectively, generating an attention distribution that characterizes the degree of matching between each slot and typical state patterns, and strengthening the importance of recent slots through time decay masking.

[0046] 4. This invention adopts a context-aware dynamic gating fusion mechanism, which uses environmental information such as terminal type and time period as the basis for dynamic weight generation, and adaptively fuses attention aggregation features and global pooling features, so that the fused features can be dynamically adjusted according to different communication scenarios, thereby improving the scenario adaptability of state discrimination.

Attached Figure Description

[0047] Figure 1 is a flowchart of the present invention;

[0048] Figure 2 is the multi-scale dilated convolution dynamic fusion weight distribution diagram of the present invention;

[0049] Figure 3 shows the model's fine-grained performance across the four types of network communication states.

Detailed Implementation

[0050] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be understood that after reading the teachings of this invention, those skilled in the art can make various alterations or modifications to the invention, and these equivalent forms also fall within the scope defined in this application.

[0051] S1. Obtain network communication data from IoT terminals, and preprocess and label the network communication data to construct a training dataset;

[0052] The data acquisition process needs to cover various communication scenarios of IoT terminals in actual operating environments, including normal operating conditions and abnormal states such as mild congestion, severe congestion, and signal degradation. The data acquisition work is deployed on typical IoT terminals, covering different hardware platforms and communication modules to reflect heterogeneity.

[0053] A lightweight data acquisition agent is embedded in each participating terminal. This agent continuously captures key performance indicators of the network layer and transport layer at fixed time intervals, including packet loss rate, round-trip time, signal strength, TCP retransmission rate, and data packet reception throughput, totaling five features.

[0054] During data acquisition, a time window is defined as 10 consecutive time slots, with each time slot corresponding to one sampling point, thus forming an original feature matrix with a dimension of 10×5.
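The windowing described in [0054] can be sketched as below. This is a minimal illustration: the feature names, the synthetic `stream`, and the choice of non-overlapping windows are assumptions, since the text states only that a window is 10 consecutive time slots with five features per slot.

```python
from typing import List, Sequence

# Hypothetical column names for the five collected indicators.
FEATURES = ("packet_loss_rate", "rtt", "signal_strength", "tcp_retx_rate", "rx_throughput")
WINDOW = 10  # time slots per window, as in the embodiment


def build_windows(samples: Sequence[Sequence[float]],
                  window: int = WINDOW) -> List[List[List[float]]]:
    """Split per-slot feature rows into non-overlapping window x 5 matrices."""
    for row in samples:
        if len(row) != len(FEATURES):
            raise ValueError("each sample must carry exactly five features")
    usable = len(samples) - len(samples) % window  # drop the incomplete tail
    return [[list(r) for r in samples[i:i + window]] for i in range(0, usable, window)]


# 25 synthetic sampling points yield two complete 10 x 5 windows.
stream = [[0.01 * t, 20.0 + t, -60.0, 0.0, 5.0] for t in range(25)]
windows = build_windows(stream)
```

A sliding window with overlap would be an equally plausible reading; only the 10×5 shape is fixed by the text.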

[0055] To ensure data diversity and representativeness, the data collection period covers different time periods throughout the day (such as peak and off-peak hours), and records the terminal type, time period of the day, and long-term variance of signal reception strength for each sample as contextual information. All collected raw data, along with its timestamps, terminal identifiers, and other metadata, are stored in the central database.

[0056] When constructing the training dataset, the massive amount of raw data collected needs to be cleaned and labeled. The cleaning step removes invalid data segments caused by terminal power outages or communication interruptions, and handles the few missing values through interpolation or elimination. Data labeling is then performed, with the labeling work completed by network domain experts using automated tools. The labeling categories are divided into four types: normal, mild congestion, severe congestion, and signal degradation.

[0057] Normal state refers to network indicators being stable and within the expected range; mild congestion is characterized by a slight increase in round-trip latency and packet loss rate, but throughput has not yet decreased significantly; severe congestion is accompanied by a high packet loss rate, a surge in TCP retransmission rate, and severely limited throughput; signal degradation is mainly characterized by signal strength consistently below the threshold, which may be accompanied by an increase in packet loss rate but no significant change in latency.

[0058] During annotation, manual analysis is conducted to determine the state category of each time window based on the changing trends of five features within that window and the context before and after the window (such as terminal type and time period). The category label is then appended to the sample using one-hot encoding. Furthermore, to construct a standard reference matrix for subsequent alignment processing, high-quality samples are selected from a large amount of historical normal communication data, and typical normal communication patterns are generated using clustering algorithms as a reference benchmark.

[0059] All labeled samples are divided into training, validation and test sets according to a certain ratio to ensure that the distribution of each category is balanced and that different terminal types and time periods are covered, thus providing a solid data foundation for model training.
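One way to obtain a class-balanced split as described in [0059] is a per-class (stratified) shuffle. The sketch below is illustrative only: the 70/15/15 ratio, the seed, and the synthetic label list are assumptions, because the text says only "a certain ratio". Stratifying by terminal type and time period as well would follow the same pattern with a composite key.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

STATES = ("normal", "mild congestion", "severe congestion", "signal degradation")


def stratified_split(labels: Sequence[str],
                     percents: Tuple[int, int, int] = (70, 15, 15),
                     seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Return train/validation/test index lists with per-class proportions preserved."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


labels = [state for state in STATES for _ in range(20)]  # 20 synthetic samples per class
train_idx, val_idx, test_idx = stratified_split(labels)
```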

[0060] S2. Construct a communication status monitoring model, which is used to output communication status classification results based on the input network communication data;

[0061] S21, Adaptive Spatiotemporal Feature Alignment and Enhancement Module

[0062] The network communication data collected by IoT terminals is affected by environmental noise, terminal heterogeneity, and network fluctuations, resulting in slight shifts in the time axis, scale differences between features, and local missing or abnormal spikes in the original feature matrix within different sampling windows. Conventional normalization or standardization methods only scale each feature dimension independently, which cannot correct the misalignment on the time axis or enhance the contextual association of features within local time windows.

[0063] This invention employs adaptive spatiotemporal feature alignment and enhancement processing. First, it uses a dynamic time warping strategy to find the optimal alignment path between the temporal series and the standard reference sequence within a window. Then, it utilizes local statistical properties to enhance the contrast of the aligned data, outputting an aligned and enhanced normalized feature tensor. The specific steps are as follows:

[0064] 1) Dynamic time warping and sequence alignment

[0065] By dynamic time warping, the original feature matrix $X$ of dimension $T \times F$ is aligned with the standard reference matrix $X_{\mathrm{ref}}$ of dimension $T \times F$ along the time dimension to eliminate local time offsets caused by network jitter, yielding an alignment matrix $X_a$ of dimension $T \times F$;

[0066] where $T$ denotes the size of the time window; for example, a value of 10 corresponds to 10 consecutive time slots;

[0067] $F$ denotes the number of features, with an example value of 5, corresponding to packet loss rate, round-trip time, signal strength, TCP retransmission rate, and data packet reception throughput, respectively.

[0068] In one implementation, the standard reference matrix $X_{\mathrm{ref}}$ is predefined by clustering a large amount of historical normal communication data and represents a typical communication pattern. Specifically, a large amount of historical normal communication data is collected and divided into fixed time windows of size $T$ to form a sample set in which each sample is a $T \times F$ matrix. The K-means clustering algorithm is then used to cluster these samples, and the resulting cluster center is used as the standard reference matrix, which represents a typical normal communication pattern.

[0069] In practice, the dynamic time warping alignment operation achieves time alignment by calculating the distance matrix between two sequences in the time dimension and finding the optimal path with the minimum cumulative distance.
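The distance-matrix and minimum-cumulative-distance procedure in [0069] corresponds to textbook dynamic time warping, sketched below for multivariate windows. Two points are assumptions rather than statements from the text: the per-slot distance is Euclidean, and the warped window is produced by averaging all input slots that the path maps onto each reference slot. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def dtw_path(x: Matrix, ref: Matrix) -> List[Tuple[int, int]]:
    """Return the minimum-cumulative-distance warping path as (x_slot, ref_slot) pairs."""
    n, m = len(x), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], ref[j - 1])  # Euclidean distance between two slots
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the last cell to the first along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def align_to_reference(x: Matrix, ref: Matrix) -> Matrix:
    """Warp x onto the reference time axis (assumed rule: average the matched slots)."""
    buckets: List[Matrix] = [[] for _ in ref]
    for i, j in dtw_path(x, ref):
        buckets[j].append(x[i])
    return [[sum(col) / len(rows) for col in zip(*rows)] for rows in buckets]


reference = [[float(t), 2.0 * t] for t in range(5)]
delayed = [reference[0]] + reference[:-1]  # same pattern, one slot late
aligned = align_to_reference(delayed, reference)
```

The output always has the same number of rows as the reference, so a 10×5 window aligned against a 10×5 reference stays 10×5.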

[0070] 2) Calculation of local statistical characteristics

[0071] For each feature, calculate the local mean and local standard deviation of the alignment matrix $X_a$ within the current time window (of size $T$). Specifically, for each feature $f$, the arithmetic mean of the $f$-th column of $X_a$ over all $T$ time slots is taken as the local mean $\mu_f$, and the sample standard deviation of that column is taken as the local standard deviation $\sigma_f$,

[0072] where $\mu_f$ denotes the local mean of the $f$-th feature within the current time window; it is a scalar that characterizes the average level of the feature within the window;

[0073] $\sigma_f$ denotes the local standard deviation of the $f$-th feature within the current time window; it is a scalar that characterizes the degree of fluctuation or dispersion of the feature within the window;

[0074] $f$ denotes the feature index, with values ranging from 1 to $F$;

[0075] $F$ denotes the total number of features, with an example value of 5: packet loss rate, round-trip time, signal strength, TCP retransmission rate, and packet reception throughput.

[0076] 3) Generation of saliency weight mask

[0077] A normalized exponential weighting method based on the local mean and standard deviation is used to generate a saliency weight for each time slot and feature, in order to amplify fluctuations that deviate from the local mean, expressed as:

[0078]

[0079] In the formula, $M_{t,f}$ denotes the element in row $t$ and column $f$ of the saliency weight mask matrix $M$: a normalized weight, between 0 and 1, used to highlight time points that deviate significantly from the local mean. The saliency weight mask matrix $M$ has dimension $T \times F$, where $T$ is the time window size and $F$ is the number of features;

[0080] $\exp(\cdot)$ denotes the natural exponential function;

[0081] $X_a(t, f)$ denotes the value of the $f$-th feature in the $t$-th time slot of the alignment matrix $X_a$, i.e., the element in row t and column f.
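The local statistics and the saliency mask can be sketched together as below. The exact weighting formula is not legible in this text, so the code uses one plausible reading and should be taken as an assumption: a Softmax over the time slots of each feature, applied to the absolute deviation from the local mean scaled by the local standard deviation. The stabilizer `eps`, the sample data, and the function names are likewise hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple

Matrix = List[List[float]]


def local_stats(aligned: Matrix) -> Tuple[List[float], List[float]]:
    """Per-feature local mean and sample standard deviation over the window."""
    columns = list(zip(*aligned))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def saliency_mask(aligned: Matrix, eps: float = 1e-6) -> Matrix:
    """Assumed form: softmax over time of |x - mean| / (std + eps), per feature."""
    mu, sigma = local_stats(aligned)
    t_len, f_len = len(aligned), len(mu)
    mask = [[0.0] * f_len for _ in range(t_len)]
    for f in range(f_len):
        scores = [abs(aligned[t][f] - mu[f]) / (sigma[f] + eps) for t in range(t_len)]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        for t in range(t_len):
            mask[t][f] = exps[t] / total
    return mask


# Ten slots, two features: a packet-loss spike at slot 7 and a flat second feature.
window = [[0.01, 20.0] for _ in range(10)]
window[7][0] = 0.30
mask = saliency_mask(window)
spike_slot = max(range(10), key=lambda t: mask[t][0])
```

Under this reading each feature's weights sum to 1 across the window, the spike slot receives the largest weight, and a perfectly flat feature gets uniform weights.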

[0082] 4) Adaptive spatiotemporal feature enhancement

[0083] Feature enhancement is achieved using the alignment matrix, the saliency mask, and learnable temporal convolution kernels. The alignment enhancement matrix is obtained by element-wise multiplication of the alignment matrix and the saliency mask, followed by addition of the local context features extracted through one-dimensional convolution. This process is represented as follows:

[0084] $X_e = X_a \odot M + W_c * X_a$

[0085] In the formula, $X_e$ denotes the alignment enhancement matrix, i.e., the feature matrix after alignment and enhancement. It preserves the temporal structure of the original data, highlights anomalous fluctuations through saliency weights, and incorporates local contextual information through convolution, making the features more discriminative. Its dimension is $T \times F$, the same as that of the alignment matrix $X_a$ and the saliency weight mask $M$;

[0086] $\odot$ denotes the Hadamard product, i.e., element-wise multiplication.

[0087] $*$ denotes the convolution operation. The convolution term performs a one-dimensional convolution independently along the time axis for each feature dimension. This one-dimensional convolution extracts the local contextual information of each feature's time series, which is equivalent to performing local feature extraction on each feature, capturing the correlation between adjacent time slots and enhancing the expressive power of the features.

[0088] $W_c$ denotes a set of one-dimensional convolution kernels, $F$ in total, each of size $3 \times 1$ (time length 3, feature dimension 1). They are trainable parameters that act on the corresponding feature dimensions to extract local contextual information for each feature time series. Padding (adding one zero at each end of the time axis) maintains the output time dimension $T$.
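The enhancement step described above (mask-weighted alignment matrix plus a per-feature one-dimensional convolution of length 3 with one zero padded at each end) can be sketched as follows. The kernel values and the sample matrices are hypothetical, since in the method the kernels are trained. The sketch applies the kernel without flipping it, as is conventional for convolution layers in deep learning, and it assumes the convolution acts on the alignment matrix.

```python
from typing import List

Matrix = List[List[float]]


def depthwise_conv1d(x: Matrix, kernels: List[List[float]]) -> Matrix:
    """Length-3 convolution along time, one kernel per feature, zero padding of 1."""
    t_len, f_len = len(x), len(x[0])
    out = [[0.0] * f_len for _ in range(t_len)]
    for f in range(f_len):
        for t in range(t_len):
            acc = 0.0
            for offset in (-1, 0, 1):
                s = t + offset
                if 0 <= s < t_len:  # slots outside the window count as zero
                    acc += kernels[f][offset + 1] * x[s][f]
            out[t][f] = acc
    return out


def enhance(aligned: Matrix, mask: Matrix, kernels: List[List[float]]) -> Matrix:
    """Element-wise product with the mask, plus the local context from the convolution."""
    context = depthwise_conv1d(aligned, kernels)
    return [[aligned[t][f] * mask[t][f] + context[t][f]
             for f in range(len(aligned[0]))]
            for t in range(len(aligned))]


aligned = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
ones_mask = [[1.0, 1.0] for _ in aligned]
zero_mask = [[0.0, 0.0] for _ in aligned]
identity = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # passes each slot through unchanged
box = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]       # sums each slot with its neighbours

doubled = enhance(aligned, ones_mask, identity)
context_only = enhance(aligned, zero_mask, box)
```

The two toy kernels make the roles of the terms visible: with an identity kernel and an all-ones mask every entry doubles, while a zero mask leaves only the convolution term.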

[0089] S22, Temporal Feature Map Construction Module

[0090] Network communication status (such as congestion and signal degradation) is often reflected in a combination of different time scales (such as short-term bursts and long-term trends). Conventional fixed-size convolutional kernels can only capture local changes at a single scale and cannot simultaneously perceive rapid retransmission fluctuations and slow signal attenuation.

[0091] This invention captures multi-scale temporal dependencies through parallel dilated convolutions with different dilation rates, and adaptively fuses feature maps of different scales based on the local variance of the alignment enhancement matrix, thereby constructing a temporal feature map rich in multi-scale information. The specific steps are as follows:

[0092] 1) Multi-scale dilated convolution feature extraction

[0093] Three parallel dilated convolutional layers are applied to the alignment enhancement matrix, with different dilation rates, to generate three initial feature maps with different receptive fields.

[0094] In practice, each dilated convolutional layer uses a one-dimensional convolution kernel of size 3 with a dilation rate $d \in \{1, 2, 3\}$; the number of input channels is $F$ (the number of features) and the number of output channels is 16. The receptive field is expanded by adjusting the dilation rate, and padding is used to maintain the output time dimension $T$. This yields three initial feature maps with different receptive fields, each of dimension $T \times 16$,

[0095] where $H_d$ denotes the feature map output by the dilated convolutional layer with dilation rate $d$; it has dimension $T \times 16$. The three feature maps have different receptive fields and are able to capture multi-scale temporal dependencies;

[0096] $d$ denotes the dilation rate, $d \in \{1, 2, 3\}$.

[0097] It should be noted that when $d = 1$, the receptive field spans 3 time slots and captures instantaneous temporal dependence; when $d = 2$, the receptive field spans 5 time slots and captures short-term temporal dependence; when $d = 3$, the receptive field spans 7 time slots and captures mid-term temporal dependence.
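The receptive fields of 3, 5, and 7 time slots follow from the usual relation for a dilated kernel, receptive field = 1 + (kernel size - 1) × dilation rate. The sketch below checks that relation and shows a plain dilated one-dimensional convolution with length-preserving zero padding. The single-channel impulse input and the all-ones kernel are toy values chosen for illustration; the method itself uses 16 trained output channels.

```python
from typing import List

Matrix = List[List[float]]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of time slots seen by one output of a single dilated convolution."""
    return 1 + (kernel_size - 1) * dilation


def dilated_conv1d(x: Matrix, weight: List[List[List[float]]],
                   bias: List[float], dilation: int) -> Matrix:
    """x is T x C_in, weight is C_out x C_in x K; zero padding keeps the length T."""
    t_len, c_in = len(x), len(x[0])
    k_size = len(weight[0][0])
    pad = dilation * (k_size - 1) // 2
    out = []
    for t in range(t_len):
        row = []
        for w_out, b in zip(weight, bias):
            acc = b
            for c in range(c_in):
                for k in range(k_size):
                    s = t - pad + k * dilation  # input slot read by tap k
                    if 0 <= s < t_len:
                        acc += w_out[c][k] * x[s][c]
            row.append(acc)
        out.append(row)
    return out


impulse = [[1.0 if t == 5 else 0.0] for t in range(10)]  # one channel, spike at slot 5
ones_kernel = [[[1.0, 1.0, 1.0]]]                         # C_out = 1, C_in = 1, K = 3
branches = {d: dilated_conv1d(impulse, ones_kernel, [0.0], d) for d in (1, 2, 3)}
fields = [receptive_field(3, d) for d in (1, 2, 3)]
hit_slots = [t for t, row in enumerate(branches[2]) if row[0] != 0.0]
```

For dilation rate 2 the spike at slot 5 reaches outputs 3, 5, and 7, which is the 5-slot span stated in the text.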

[0098] 2) Dynamic fusion weight generation

[0099] The dynamic fusion weight vector related to the input is calculated based on the local fluctuation characteristics of the alignment enhancement matrix. Specifically,

[0100] First, variance pooling is used to extract the degree of fluctuation of each feature dimension within the time window, represented as $v = \mathrm{VarPool}(X_e)$, where $X_e$ is the alignment enhancement matrix and $\mathrm{VarPool}(\cdot)$ denotes the variance pooling operation, which calculates the variance over time for each column (each feature) of the input matrix and outputs an $F$-dimensional vector $v$ ($F$ being the number of features) indicating the degree of fluctuation of each feature within the current window;

[0101] Then, after multilayer perceptron mapping and Softmax normalization, the fusion weights corresponding to the three scales are obtained, i.e., the dynamic fusion weight vector $\alpha$. Its dimension is 3, comprising $\alpha_1$, $\alpha_2$, and $\alpha_3$; the three weight coefficients correspond to the feature maps with dilation rates of 1, 2, and 3, respectively, and satisfy $\alpha_1 + \alpha_2 + \alpha_3 = 1$. The weights are computed as $\alpha = \mathrm{Softmax}(\mathrm{MLP}(v))$, where $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron with one hidden layer, which maps the $F$-dimensional input to a 3-dimensional vector of unnormalized scores, and $\mathrm{Softmax}(\cdot)$ denotes the Softmax function, which converts the 3-dimensional scores into probability weights.
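The weight generation described above (variance pooling, a multilayer perceptron with one hidden layer, then Softmax) can be sketched as follows. The hidden width of 8, the ReLU activation, the use of population variance, the random parameters, and the sample window are all assumptions, as the text fixes none of them.

```python
import math
import random
from statistics import pvariance
from typing import List

Matrix = List[List[float]]


def variance_pool(x: Matrix) -> List[float]:
    """Variance over time for each feature column (population variance assumed)."""
    return [pvariance(column) for column in zip(*x)]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_weights(x: Matrix, w1: Matrix, b1: List[float],
                   w2: Matrix, b2: List[float]) -> List[float]:
    """Map per-feature variances to three scale weights that sum to 1."""
    v = variance_pool(x)
    hidden = [max(0.0, sum(w * vi for w, vi in zip(row, v)) + b)  # assumed ReLU
              for row, b in zip(w1, b1)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
    return softmax(scores)


rng = random.Random(0)
F, HIDDEN, SCALES = 5, 8, 3  # HIDDEN is a hypothetical width
w1 = [[rng.uniform(-1.0, 1.0) for _ in range(F)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for _ in range(SCALES)]
b2 = [0.0] * SCALES

# Hypothetical 10 x 5 window: the first two features fluctuate, the rest are flat.
window = [[0.01 * (t % 3), 20.0 + t, -60.0, 0.0, 5.0] for t in range(10)]
alpha = fusion_weights(window, w1, b1, w2, b2)
pooled = variance_pool(window)
```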

[0102] 3) Multi-scale feature weighted fusion

[0103] Using the generated dynamic fusion weight vector, the feature maps at three different scales are weighted and summed to obtain the temporal feature map, represented as:

[0104] $H = \alpha_1 H_1 + \alpha_2 H_2 + \alpha_3 H_3$

[0105] In the formula, $H$ denotes the temporal feature map, with dimension $T \times C$, where $T$ is the time window size; $\alpha_1$, $\alpha_2$, $\alpha_3$ are the dynamic fusion weights; and $H_1$, $H_2$, $H_3$ are the feature maps output by the dilated convolutional layers with dilation rates 1, 2, and 3. $H$ integrates multi-scale information by combining the temporal features of different receptive fields. It can capture both short-term bursts and long-term trends at the same time, and reflects changes in the state of network communication more comprehensively.

[0106] $C$ denotes the number of feature channels, i.e., the feature dimension of each time slot after multi-scale fusion; its value is 16.
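The weighted summation of the three scale-specific feature maps is a per-element convex combination, sketched below. The constant-valued maps and the weight values are hypothetical, chosen so the result is easy to check by hand; only the 10×16 shape and the weighted sum come from the text.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def fuse(feature_maps: Sequence[Matrix], alpha: Sequence[float]) -> Matrix:
    """Weighted sum of same-shaped T x C feature maps, one weight per map."""
    if len(feature_maps) != len(alpha):
        raise ValueError("need exactly one weight per feature map")
    t_len, c_len = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(a * fm[t][c] for a, fm in zip(alpha, feature_maps))
             for c in range(c_len)]
            for t in range(t_len)]


T, C = 10, 16
h1 = [[1.0] * C for _ in range(T)]  # stand-in for the dilation-rate-1 map
h2 = [[2.0] * C for _ in range(T)]  # stand-in for the dilation-rate-2 map
h3 = [[4.0] * C for _ in range(T)]  # stand-in for the dilation-rate-3 map
fused = fuse([h1, h2, h3], [0.5, 0.25, 0.25])
```

Here every fused entry equals 0.5·1 + 0.25·2 + 0.25·4 = 2.0, and the fused map keeps the 10×16 shape of its inputs.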

[0107] In one embodiment, as shown in Figure 2, the multi-scale dilated convolution dynamic fusion weight distribution diagram displays, as stacked bar charts, the adaptive weight allocation produced by the proposed dynamic fusion mechanism under different network states. The horizontal axis is the index of the 24 samples, each sample corresponding to one time window, and the vertical axis is the fusion weight ratio, which is dimensionless. Each bar consists of three stacked segments, representing the weight coefficients for dilation rate 1 (receptive field of 3 time slots, used to capture short-term bursts), dilation rate 2 (receptive field of 5 time slots, used to capture short-term trends), and dilation rate 3 (receptive field of 7 time slots, used to capture medium-term trends), respectively; the three always sum to 1. The abbreviation at the top of each bar indicates the sample's state, one of normal, mild congestion, severe congestion, and signal degradation, with 6 samples per state. The weight data are generated from a Dirichlet distribution whose concentration parameters are set according to the state characteristics: in the normal state, weight tends to be allocated to dilation rate 1 to focus on short-term details; in mild congestion, more weight is allocated to dilation rate 2 to detect gradually increasing latency and packet loss; in severe congestion, the weight relies mainly on dilation rate 3 to capture the continuously deteriorating trend; in the signal degradation state, the weight distribution is relatively uniform, reflecting the mixed characteristics of long-term signal strength decay and sudden fluctuations.
As can be observed from the figure, samples in different states exhibit clearly different weight stacking patterns: the blue segment (dilation rate 1) has the highest proportion in normal samples, the orange segment (dilation rate 2) is prominent in mildly congested samples, the green segment (dilation rate 3) is dominant in severely congested samples, and the three segments are roughly equal in signal degradation samples. This invention extracts the degree of local fluctuation through variance pooling and generates dynamic weights via a multilayer perceptron, enabling the model to automatically select the most suitable combination of receptive fields based on the temporal fluctuation characteristics of the input samples, thereby characterizing the temporal evolution of different abnormal states more comprehensively.

[0108] S23, Status Classification Module

[0109] The environment in which IoT terminals operate is highly dynamic. The distribution of different types of network anomalies in the feature space may overlap and have fuzzy boundaries. In addition, different time slots contribute differently to the final state determination, and there may be complex nonlinear interactions between feature dimensions.

[0110] To accurately determine the current communication state, this invention constructs a lightweight yet highly expressive communication state classification model. This model deeply integrates multi-scale features, a time-slot attention mechanism, and dynamic gating calibration based on environmental context to achieve accurate determination of the current communication state. The specific steps are as follows:

[0111] 1) Slot-feature decoupling and cross-attention

[0112] To capture key discriminative information simultaneously across both time and feature dimensions, and to avoid the excessive parameter count and overfitting issues associated with traditional fully connected layers, this invention first applies a decoupled cross-attention mechanism to the temporal feature map. Specifically,

[0113] The decoupled cross-attention mechanism treats the time and feature dimensions as two independent sequences. By computing their interactions with a set of learnable state prototypes, it generates an attention distribution that characterizes the similarity between the current sample and various typical state patterns, expressed as:

[0114]

[0115] In the formula, the attention weight matrix has one row per time slot and one column per state prototype. Its element in the t-th row and k-th column is the attention weight between the t-th time slot and the k-th state prototype, which quantifies the degree of matching between the features of that time slot and the typical state pattern. The larger the weight, the higher the degree of matching.

[0116] The query projection matrix is a trainable parameter used to map the temporal feature map to the query space, thereby enhancing the expressive power of the model.

[0117] The state prototype matrix is a set of trainable parameters. Its k-th row is the prototype vector of the k-th class of potential typical communication patterns. In the cross-attention mechanism, the state prototype matrix serves as the key (or value) and interacts with the query (the temporal feature map); attention weights are obtained by calculating similarity, thereby measuring the degree of matching between each time slot of the current sample and each prototype pattern, which helps the model capture features related to typical state patterns;

[0118] The number of potential state prototypes should preferably be slightly larger than the expected number of actual categories, so as to cover potential mixed states or intermediate modes, but not so large as to introduce redundancy. The example value of 8 balances the model's expressive power against computational complexity.

[0119] The transpose symbol denotes the transpose of the term to which it is applied;

[0120] k is the index of the state prototype, ranging from 1 to the number of state prototypes;

[0121] t is the time slot index, ranging from 1 to the number of time slots in the window;

[0122] The key projection matrix is a trainable parameter used to map the state prototypes to the key space for interaction with the queries;

[0123] The time decay mask matrix is used to emphasize the importance of recent time slots and to suppress interference from historical noise. Its rows are indexed by the time slot t and its columns by the state prototype k, and its elements are constructed from the decay factor;

[0124] The decay factor is used to construct the time decay mask matrix. Through it, time slots closer to the current time (larger t) receive greater weight, which emphasizes the importance of recent time slots, suppresses the interference of historical noise, and makes the model pay more attention to recent changes in the communication state. An example value is 0.9.
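As a non-limiting illustration, the interaction between time slots and state prototypes could be sketched in Python as follows. Because the formula itself is not reproduced in this text, the composition used here is an assumption: scaled dot-product scores, a softmax over the prototypes, and a mask element of the form `gamma ** (T - t)` applied by element-wise multiplication. The names `W_q`, `W_k`, and the channel sizes are likewise hypothetical; the 10 time slots, 8 prototypes, and decay factor 0.9 follow the description.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_attention(F, P, W_q, W_k, gamma=0.9):
    """F: (T, D) temporal feature map; P: (K, D) state prototypes.
    Returns a (T, K) matrix of time-slot-to-prototype attention weights."""
    T, K = F.shape[0], P.shape[0]
    Q = F @ W_q                                # (T, d) queries
    keys = P @ W_k                             # (K, d) keys
    scores = Q @ keys.T / np.sqrt(Q.shape[1])  # assumed scaled dot product
    A = softmax(scores, axis=1)                # each slot distributes over prototypes
    # Assumed mask: gamma ** (T - t) for t = 1..T, identical in every column,
    # so the most recent slot keeps weight 1 and older slots are damped.
    decay = gamma ** (T - np.arange(1, T + 1))
    M = np.repeat(decay[:, None], K, axis=1)
    return A * M

rng = np.random.default_rng(0)
T, D, K, d = 10, 16, 8, 16                     # D and d are hypothetical sizes
F = rng.normal(size=(T, D))
P = rng.normal(size=(K, D))
W_q = rng.normal(size=(D, d)) * 0.1
W_k = rng.normal(size=(D, d)) * 0.1
A = prototype_attention(F, P, W_q, W_k)
print(A.shape, A.sum(axis=1).round(3))
```

Under these assumptions the row sums of the result equal the decay profile, which makes the emphasis on recent time slots directly visible.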

[0125] 2) Context-aware dynamic gating fusion

[0126] By using a dynamic gating mechanism based on context feature vectors, attention aggregation features and projection global pooling features are adaptively fused. The dynamic gating mechanism can dynamically adjust the contribution of the two features according to the current environmental context, making the fused features more adaptable to different communication scenarios and improving the accuracy of state discrimination.

[0127] Specifically, the context feature vector is input into the gating network to generate a gating vector. This gating vector controls the fusion ratio of the attention-aggregated features and the projected global pooling features, thus obtaining a fused feature vector. The attention-aggregated features are obtained by linearly projecting the weighted sum of the attention weight matrix and the state prototype matrix, while the projected global pooling features are obtained by global pooling and linear projection of the temporal feature map, as shown below:

[0128]

[0129] In the formula, the fused feature vector combines the attention-aggregated features and the projected global pooling features. By combining the time slot-prototype attention information with the original statistical features, it comprehensively characterizes the network communication status within the current time window;

[0130] The Sigmoid activation function maps its input to a value between 0 and 1, and its output serves as the gating vector that controls the fusion ratio of the two features.

[0131] The context feature vector includes the terminal type encoding, the cosine embedding of the time period of the day, and the long-term fluctuation variance of the received signal strength. These are normalized (e.g., by layer normalization) before being input into the gating network to ensure training stability.

[0132] The dimension of the fused feature vector is a hyperparameter that can be set according to the model complexity, for example, to 64.

[0133] The dimension of the context feature vector is determined by the number of context features and the encoding method. For example, the terminal type encoding might be a one-hot vector, the time-period cosine embedding is 2-dimensional, and the long-term fluctuation variance is 1-dimensional; the dimension of the context feature vector is the sum of these;

[0134] The weight matrix of the first layer of the gating network is a trainable parameter used to linearly map the context feature vector to the same dimension as the fused feature vector in order to generate the gating vector;

[0135] The bias vector of the first layer of the gating network is a trainable parameter;

[0136] The attention aggregation feature is obtained by weighted summation of the attention weight matrix and the state prototype matrix. It aggregates the matching information between each time slot and the typical state prototypes, captures the components of the sample related to the various typical patterns, and highlights the characteristics of key time slots;

[0137] The projection weight matrix of the attention aggregation feature is a trainable parameter used to linearly map the attention-aggregated feature to the dimension used for fusion;

[0138] The projection bias vector of the attention aggregation feature is a trainable parameter;

[0139] The projected global pooling feature is obtained from the global pooling feature through linear projection. It preserves the overall statistical information of the temporal feature map, serving as a supplement to the original features and enhancing the robustness of the fused features. The global pooling feature is obtained by concatenating the max pooling and the average pooling of the temporal feature map along the time dimension. The weight parameters and bias parameters of the linear projection are all learnable parameters.
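As a non-limiting illustration, the context-aware gating fusion could be sketched in Python as follows. The convex combination `gate * a + (1 - gate) * p`, the summation over time slots used to reduce the attention-weighted prototypes to a single vector, and the parameter names (`W_a`, `W_p`, `W_g`, and so on) are assumptions, since the formula is not reproduced in this text. The fused dimension of 64 follows the example value; the channel count and the context dimension are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def gated_fusion(H, A, P, c, params):
    """H: (T, D) temporal feature map, A: (T, K) attention weights,
    P: (K, D) state prototypes, c: (d_c,) context feature vector.
    Returns the fused feature vector of size d_z."""
    # Attention-aggregated feature: weight prototypes per slot, sum over slots (assumed).
    a = (A @ P).sum(axis=0)                                   # (D,)
    a_proj = params["W_a"] @ a + params["b_a"]                # (d_z,)
    # Global pooling: concatenate max and mean over time, then project.
    pooled = np.concatenate([H.max(axis=0), H.mean(axis=0)])  # (2D,)
    p_proj = params["W_p"] @ pooled + params["b_p"]           # (d_z,)
    # Gate computed from the normalized context vector.
    gate = sigmoid(params["W_g"] @ layer_norm(c) + params["b_g"])
    return gate * a_proj + (1.0 - gate) * p_proj

rng = np.random.default_rng(0)
T, D, K, d_c, d_z = 10, 16, 8, 7, 64    # D and d_c are hypothetical sizes
H = rng.normal(size=(T, D))
A = rng.random(size=(T, K))
P = rng.normal(size=(K, D))
c = rng.normal(size=d_c)
params = {
    "W_a": rng.normal(size=(d_z, D)) * 0.1, "b_a": np.zeros(d_z),
    "W_p": rng.normal(size=(d_z, 2 * D)) * 0.1, "b_p": np.zeros(d_z),
    "W_g": rng.normal(size=(d_z, d_c)) * 0.1, "b_g": np.zeros(d_z),
}
z = gated_fusion(H, A, P, c, params)
print(z.shape)
```

With this form, a gate close to 1 passes the attention-aggregated branch and a gate close to 0 passes the pooled branch, so the context decides, per dimension, which source dominates.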

[0140] 3) Multi-task classification head and uncertainty weighting

[0141] A multi-task classification head is adopted, which branches from the fused features into two tasks (a main state classification task and a state confidence regression task). The two tasks are dynamically weighted and jointly optimized by taking the homoscedastic uncertainty of the tasks into account.

[0142] 3.1) Main State Classification Task: The main state classification task is used to predict the specific state category of network communication within the current time window. It obtains the probability distribution over the categories by inputting the fused feature vector into a linear classification layer and applying the Softmax function:

[0143]

[0144] In the formula, the predicted state probability distribution vector has one element per state category; each element is the probability that the sample belongs to the corresponding class;

[0145] The number of network communication states is 4, corresponding to normal, mild congestion, severe congestion, and signal degradation, respectively.

[0146] The weight matrix of the classification head is a trainable parameter used to linearly map the fused feature vector to the category score space, whose dimension equals the number of states;

[0147] The bias vector of the classification head is a trainable parameter.

[0148] 3.2) State Confidence Regression Task: The state confidence regression task is used to estimate the model's confidence in the current classification result. It involves concatenating the fused feature vector with the local standard deviation vector and inputting the result into a linear layer. The result is then processed by the Sigmoid function to obtain a confidence score between 0 and 1, expressed as:

[0149]

[0150] In the formula, the prediction confidence score ranges from 0 to 1. A higher score indicates that the model is more confident in the classification result.

[0151] The concatenation operator denotes vector concatenation;

[0152] The logarithm is taken with the natural constant as its default base.

[0153] The local standard deviation vector reflects the degree of fluctuation of each feature within the current window. It is composed of the local standard deviations of the individual features, so its dimension equals the number of features;

[0154] A very small constant is added to prevent numerical errors caused by taking the logarithm of zero;

[0155] The weight matrix of the confidence regression head is a trainable parameter used to map the concatenated feature vector to a one-dimensional confidence score.

[0156] The bias term of the confidence regression head is a trainable parameter with dimension 1, used as the bias of the linear transformation.
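As a non-limiting illustration, the two branches of the multi-task head could be sketched in Python as follows. The placement of the logarithm (applied to the local standard deviation vector plus the small constant before concatenation) is inferred from the symbol definitions and is an assumption, as are the value `eps=1e-6` and all parameter names. The 4 states, the 5 features, and the fused dimension of 64 follow the description.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multitask_head(z, local_std, W_cls, b_cls, w_conf, b_conf, eps=1e-6):
    """z: (d_z,) fused feature vector; local_std: (n_features,) per-feature std.
    Returns (class probabilities, confidence score in (0, 1))."""
    # Main task: linear layer followed by Softmax over the state categories.
    probs = softmax(W_cls @ z + b_cls)
    # Confidence task: concatenate z with log(std + eps), then linear + Sigmoid.
    conf_in = np.concatenate([z, np.log(local_std + eps)])
    conf = float(sigmoid(w_conf @ conf_in + b_conf))
    return probs, conf

rng = np.random.default_rng(0)
d_z, n_features, n_states = 64, 5, 4
z = rng.normal(size=d_z)
local_std = np.array([0.0, 0.02, 1.5, 0.3, 0.1])   # a zero entry exercises eps
W_cls = rng.normal(size=(n_states, d_z)) * 0.1
b_cls = np.zeros(n_states)
w_conf = rng.normal(size=d_z + n_features) * 0.1
b_conf = 0.0
probs, conf = multitask_head(z, local_std, W_cls, b_cls, w_conf, b_conf)
print(probs.round(3), round(conf, 3))
```

The zero entry in `local_std` shows why the small constant is needed: without it the logarithm would return negative infinity and the confidence input would no longer be finite.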

[0157] S3. Define the loss function

[0158] To ensure that the model optimization process closely aligns with the actual needs of IoT network communication state monitoring, the loss function not only considers classification accuracy but also incorporates calibration of the prediction confidence, consistency constraints on the temporal feature map, and weighted penalties based on significant outliers, guiding the model to learn more robust and interpretable feature representations. The entire loss function consists of two main parts: the main task loss and the auxiliary regularization loss. The specific steps are as follows:

[0159] 1) Main task loss

[0160] The traditional cross-entropy loss function treats every training sample equally. However, in network state monitoring, the certainty of the true label of samples located at a class boundary or affected by high data noise is inherently low. Forcing the model to fit these samples with high confidence may lead to overfitting or distortion of the decision boundary.

[0161] The main task loss of this invention dynamically adjusts the penalty strength of the classification loss for each sample using the confidence level predicted by the model itself, and calibrates the prediction of the confidence level, expressed as:

[0162]

[0163] In the formula, the main task loss is a scalar that measures the difference between the model's prediction and the true label.

[0164] The batch size is the total number of samples in the training batch;

[0165] The sample index within the batch ranges from 1 to the batch size;

[0166] The prediction confidence score is the confidence score predicted for the indexed sample;

[0167] The confidence-weighting hyperparameter controls the degree of influence of the confidence on the loss; its example value is 1.

[0168] The indicator function takes the value 1 when the condition inside the parentheses is true, and 0 otherwise.

[0169] The one-hot encoding of the true label of each sample takes the value 1 for the category to which the sample belongs, and 0 otherwise.

[0170] The predicted probability that a sample belongs to a given category is the corresponding element of that sample's predicted state probability distribution vector;

[0171] The focal loss hyperparameter, with an example value of 2, reduces the weight of easily classified samples, allowing the model to focus on difficult-to-classify samples.

[0172] The first balancing weight coefficient balances the classification loss against the confidence regression loss. An example value is 0.5.

[0173] The mean square error function calculates the squared difference between two scalars.

[0174] The target confidence value of each sample ranges from 0 to 1 and is used to supervise the learning of the confidence regression task. It is constructed from the inherent complexity of the sample, so that the model predicts high confidence for simple samples and low confidence for complex or noisy samples, thereby improving the confidence calibration effect;

[0175] The first scaling hyperparameter, with an example value of 2, controls the influence of the number of abnormal fluctuation points.

[0176] The second scaling hyperparameter, with an example value of 1, controls the influence of the multi-scale information entropy.

[0177] Each element of a sample's saliency weight mask matrix, located by its time slot (row) and feature (column), reflects the significance of that feature in that time slot;

[0178] The dynamic fusion weight vector of each sample has a dimension of 3;

[0179] For the information entropy calculation function, the larger the entropy, the more the model integrates information from multiple scales, meaning that the different scales are of similar importance, the state pattern is more complex, and the certainty is correspondingly lower.

[0180] It should be noted that the mean of the saliency weight mask represents the average significance. The larger this value, the more abnormal fluctuation points there are within the sample, the more unstable the state, and the lower the certainty. Therefore, "1 minus this value" is taken as a positive indicator of certainty.
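As a non-limiting illustration, a confidence-weighted focal classification loss with a confidence regression term could be sketched in Python as follows. The exact expressions are not reproduced in this text, so both functions are assumptions: the target confidence is modelled here as a Sigmoid of `alpha * (1 - mean saliency) - beta * entropy`, and the classification term as the focal loss scaled by the predicted confidence raised to the power `eta`. The example values (`eta=1`, `gamma=2`, `lam1=0.5`, `alpha=2`, `beta=1`) follow the description; all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def entropy(w, eps=1e-12):
    """Shannon entropy of each row of a weight matrix."""
    return -np.sum(w * np.log(w + eps), axis=-1)

def target_confidence(saliency, fusion_w, alpha=2.0, beta=1.0):
    """saliency: (N, T, F) masks in [0, 1]; fusion_w: (N, 3) fusion weights.
    Assumed form: certainty rises with (1 - mean saliency), falls with entropy."""
    certainty = 1.0 - saliency.mean(axis=(1, 2))
    return sigmoid(alpha * certainty - beta * entropy(fusion_w))

def main_task_loss(probs, labels, conf, target_conf,
                   eta=1.0, gamma=2.0, lam1=0.5, eps=1e-12):
    """probs: (N, C) predicted distributions; labels: (N,) class indices."""
    n = probs.shape[0]
    p_true = probs[np.arange(n), labels]            # probability of the true class
    focal = -((1.0 - p_true) ** gamma) * np.log(p_true + eps)
    cls = np.mean(conf ** eta * focal)              # confidence-weighted focal term
    reg = np.mean((conf - target_conf) ** 2)        # confidence regression (MSE)
    return cls + lam1 * reg

probs = np.array([[0.9, 0.05, 0.03, 0.02], [0.3, 0.4, 0.2, 0.1]])
labels = np.array([0, 1])
conf = np.array([0.8, 0.5])
saliency = np.full((2, 10, 5), 0.2)
fusion_w = np.array([[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3]])
target = target_confidence(saliency, fusion_w)
loss = main_task_loss(probs, labels, conf, target)
print(target.round(3), round(float(loss), 4))
```

In this sketch the second sample, whose fusion weights are spread evenly over the three scales, receives the lower target confidence, which matches the stated intent that higher entropy implies lower certainty.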

[0181] 2) Auxiliary regularization loss

[0182] To improve the representation quality of temporal feature maps, making them smoother in the time dimension and better preserving the enhanced salient boundary information, the auxiliary regularization loss adopts a graph Laplacian regularization-based calculation method. This encourages adjacent and feature-similar points on the time axis to have similar representations, while suppressing noise interference, expressed as:

[0183]

[0184] In the formula, the auxiliary regularization loss is a scalar used to constrain the smoothness of the feature map in the time dimension. It encourages adjacent time slots with similar features to have similar representations in the high-level feature space, thereby suppressing noise fluctuations and enhancing the stability of features. At the same time, it preserves stable pattern clusters formed by strong similarity connections, thus improving classification robustness.

[0185] In the temporal adjacency graph, the edge weight between two time slots ranges from 0 to 1. A larger weight indicates that the two time slots need to be closer in the feature space;

[0186] A row of the alignment enhancement matrix is the alignment-enhanced feature vector of the corresponding time slot; the two rows that enter the edge weight are those of the two time slots being compared;

[0187] The second time slot index is distinct from t and ranges from 1 to the number of time slots in the window;

[0188] The L2 norm of the difference between two vectors is the Euclidean distance between them;

[0189] The feature-similarity kernel width hyperparameter controls the decay rate with respect to distance in the feature space. The larger the value, the less sensitive the weight is to feature differences. It is set according to the Euclidean distances between feature vectors in the training data, for example, as the median of the pairwise distances between feature vectors.

[0190] The temporal-proximity kernel width hyperparameter controls the decay rate with respect to time difference. The larger the value, the less sensitive the weight is to time differences. Its value is set according to the size of the time window;

[0191] A row of the temporal feature map is the multi-scale fused feature vector of the corresponding time slot; the two rows that enter the loss are those of the two time slots being compared.

[0192] In implementation, the temporal adjacency graph is a graph structure constructed from the alignment enhancement matrix, in which the nodes correspond to the time slots and the edge weights measure both the similarity of two time slots in the original feature space and their temporal proximity. The graph is used for Laplacian regularization, so that time slots that are similar in the original feature space are also close in the high-level representation, thereby guiding feature learning.

[0193] It should be noted that the squared-norm term represents the squared Euclidean distance between two time slots in the high-level feature space. Minimizing the auxiliary regularization loss ensures that time slots that are similar and temporally adjacent in the original feature space are represented as closely as possible in the high-level feature space, thereby enforcing the local smoothness of high-level temporal features, filtering out unstable fluctuations, and preserving the stable communication pattern cluster structure formed by strong similarity connections, thus improving the robustness of subsequent classification.
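As a non-limiting illustration, the graph Laplacian regularization could be sketched in Python as follows. The product of two Gaussian kernels (one over feature distance, one over time difference) is an assumed form of the edge weight that is consistent with the stated range of 0 to 1 and the two kernel widths; the normalization of the loss, the value `sigma_t=2.0`, the 16 feature channels, and the function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def temporal_edge_weights(X_aligned, sigma_f, sigma_t):
    """Edge weights of the temporal adjacency graph, built from the
    alignment enhancement matrix X_aligned of shape (T, F)."""
    T = X_aligned.shape[0]
    t = np.arange(T)
    feat = np.exp(-pairwise_sq_dists(X_aligned) / (2.0 * sigma_f ** 2))
    time = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * sigma_t ** 2))
    W = feat * time
    np.fill_diagonal(W, 0.0)          # the two slot indices are distinct
    return W

def laplacian_smoothness(H, W):
    """Sum over slot pairs of w_ts * ||h_t - h_s||^2 for feature map H (T, D)."""
    return float(np.sum(W * pairwise_sq_dists(H)))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))          # 10 time slots x 5 features
H = rng.normal(size=(10, 16))         # temporal feature map; 16 channels assumed
d = np.sqrt(pairwise_sq_dists(X))
sigma_f = np.median(d[np.triu_indices(10, k=1)])   # median pairwise distance
W = temporal_edge_weights(X, sigma_f=sigma_f, sigma_t=2.0)
loss = laplacian_smoothness(H, W)
print(round(loss, 3))
```

For a symmetric weight matrix this pairwise sum equals twice the trace of `H.T @ L @ H`, where `L` is the graph Laplacian (degree matrix minus `W`), which is why the penalty is described as Laplacian regularization.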

[0194] 3) Calculation of total loss function

[0195] The total loss function combines the main task classification loss and the auxiliary regularization loss through a weighted summation, constraining the model to learn more robust and discriminative temporal features while accurately classifying network communication states. It is expressed as:

[0196]

[0197] In the formula, the total loss function combines the main task classification loss and the auxiliary regularization loss to train the entire model end-to-end, enabling the model to learn smooth and discriminative temporal features while classifying accurately.

[0198] The second balancing weight coefficient is a hyperparameter that controls the strength of the regularization term, with an example value of 0.1.
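As a non-limiting illustration, the weighted summation described above can be written as follows. The symbol names are assumptions introduced here, since the original notation is not reproduced in this text; the example value 0.1 follows the description, and the numeric loss values in the worked line are hypothetical.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{main}} + \lambda_2 \, \mathcal{L}_{\text{aux}},
\qquad \lambda_2 = 0.1 \ \text{(example value)}

% Worked example with hypothetical loss values:
% \mathcal{L}_{\text{main}} = 0.42,\ \mathcal{L}_{\text{aux}} = 1.50
% \Rightarrow \mathcal{L}_{\text{total}} = 0.42 + 0.1 \times 1.50 = 0.57
```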

[0199] S4, Communication Status Monitoring Model Training

[0200] Based on the completed construction of the training dataset, the communication status monitoring model is trained end-to-end. The training process first loads the preprocessed training samples. Each sample contains an original feature matrix of "10 time slots × 5 features", the corresponding real state label (one of four classes), the context feature vector (including terminal type encoding, time period cosine embedding and long-term signal fluctuation variance), and a pre-calculated local standard deviation vector.

[0201] All trainable parameters of the model are randomly initialized. The optimizer is adaptive moment estimation (Adam), with the initial learning rate set to an empirical value, and a learning rate decay strategy is used to ensure stable convergence.

[0202] The loss function is the total loss function, which is a weighted combination of the main task loss and the auxiliary regularization loss. The main task loss includes the confidence-weighted classification loss and the mean squared error term of the confidence regression, while the auxiliary regularization loss constrains the local smoothness of the time series feature map through graph Laplacian regularization.

[0203] Training iterations are performed in batches, with each batch consisting of several randomly selected samples. Forward propagation is executed in the following order. First, the original feature matrix is aligned with the standard reference matrix using dynamic time warping. Next, local statistical properties are calculated and a saliency weight mask is generated. Adaptive spatiotemporal feature enhancement is then performed through a learnable one-dimensional convolution, yielding the alignment enhancement matrix. After that, a temporal feature map is generated through multi-scale dilated convolution and dynamic fusion weights. Finally, a fused feature vector is obtained through slot-feature decoupled cross-attention and context-aware dynamic gating fusion, and the predicted state probability distribution and confidence score are output. After the total loss is calculated, the gradient of each parameter is computed by the backpropagation algorithm, and the optimizer updates the parameters.

[0204] After each training cycle, model performance is evaluated on the validation set, primarily monitoring changes in classification accuracy and total loss. To prevent overfitting, an early stopping strategy is employed: if the validation set loss does not decrease for several consecutive cycles, training is stopped, and the model parameters at which the validation set performance is optimal are saved as the final model. Furthermore, a maximum number of iteration cycles is set during training to avoid infinite loops; for example, training is forcibly terminated when a preset limit is reached. Through these iterative optimizations, the model gradually learns feature representations and decision boundaries that can accurately determine the network communication status of IoT terminals.
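As a non-limiting illustration, the early stopping strategy with a maximum number of training cycles could be sketched in Python as follows. The callables `train_epoch` and `validate`, the patience of 3, and the simulated validation losses are hypothetical stand-ins; the description specifies only that training stops after several consecutive cycles without improvement and that the best-performing parameters are kept.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    """train_epoch() runs one training cycle and returns the current parameters;
    validate(params) returns the validation loss for those parameters."""
    best_loss, best_params, stale = float("inf"), None, 0
    history = []
    for _ in range(max_epochs):          # hard cap on the number of cycles
        params = train_epoch()
        val_loss = validate(params)
        history.append(val_loss)
        if val_loss < best_loss:         # improvement: remember these parameters
            best_loss, best_params, stale = val_loss, params, 0
        else:                            # no improvement: count consecutive misses
            stale += 1
            if stale >= patience:
                break
    return best_params, best_loss, history

# Simulated run: the "parameters" are just the cycle index.
simulated = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.95]
counter = iter(range(len(simulated)))
best_params, best_loss, history = train_with_early_stopping(
    train_epoch=lambda: next(counter),
    validate=lambda epoch_id: simulated[epoch_id],
    max_epochs=len(simulated),
    patience=3,
)
print(best_params, best_loss, len(history))   # prints: 2 0.7 6
```

In the simulated run the loss reaches its minimum in the third cycle and fails to improve for three further cycles, so training stops after six cycles and the third cycle's parameters are returned.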

[0205] S5, IoT Terminal Network Communication Status Monitoring

[0206] Once the communication status monitoring model has been trained and its parameters have been fixed, it can be deployed on IoT terminals or edge gateways to achieve real-time monitoring of the terminal network communication status.

[0207] During the actual monitoring phase, for each time window to be evaluated, the terminal acquisition agent also collects five features—packet loss rate, round-trip time, signal strength, TCP retransmission rate, and data packet reception throughput—in units of ten consecutive time slots, forming a 10×5 original feature matrix. Simultaneously, it acquires the current terminal type code, the current time period (converted to cosine embedding), and the long-term variance of signal strength over a past period, constructing a context feature vector.

[0208] The original feature matrix is input into the model and is first aligned, by dynamic time warping, with a pre-stored standard reference matrix to eliminate local time shifts caused by network jitter. Then, after local statistical property calculation, saliency weight mask generation, and one-dimensional convolution enhancement, an alignment enhancement matrix is obtained. Next, a temporal feature map is generated through multi-scale dilated convolution and dynamic fusion. After that, a fused feature vector is obtained through slot-feature decoupled cross-attention and context-aware dynamic gating fusion. Finally, the classification head outputs a four-dimensional probability distribution vector corresponding to the probabilities of the four states (normal, mild congestion, severe congestion, and signal degradation), while the confidence regression head outputs a confidence score between 0 and 1.
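As a non-limiting illustration, the dynamic time warping alignment against a stored reference could be sketched in Python as follows. The step pattern (match, insertion, deletion), the Euclidean slot-to-slot cost, and the rule of averaging all input slots matched to the same reference slot are assumptions; the description specifies only that the optimal alignment path is found and that the output has the same 10×5 layout. The ramp-shaped reference is a toy example.

```python
import numpy as np

def dtw_path(X, R):
    """Optimal warping path between rows of X (T, F) and rows of R (Tr, F)."""
    T, Tr = X.shape[0], R.shape[0]
    cost = np.linalg.norm(X[:, None, :] - R[None, :, :], axis=-1)
    acc = np.full((T + 1, Tr + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, Tr + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j, path = T, Tr, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def align_to_reference(X, R):
    """Resample X onto the reference time axis by averaging matched slots."""
    aligned = np.zeros_like(R, dtype=float)
    counts = np.zeros(R.shape[0])
    for i, j in dtw_path(X, R):
        aligned[j] += X[i]
        counts[j] += 1
    return aligned / counts[:, None]

R = np.arange(10, dtype=float)[:, None] * np.ones((1, 5))  # toy reference ramp
X = np.vstack([R[:1], R[:-1]])                             # same ramp, one slot late
aligned = align_to_reference(X, R)
print(np.abs(X - R).sum(), np.abs(aligned - R).sum())
```

In this toy case the input lags the reference by one slot; after alignment only the final slot still differs from the reference, which is the kind of local time offset the method aims to remove.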

[0209] The monitoring system selects the category with the highest probability as the network status judgment result for the current window and combines it with a confidence score to comprehensively evaluate the reliability of the judgment. If the confidence score is low or the status category is abnormal (such as severe congestion or signal degradation), the system can trigger an alarm or initiate adaptive adjustment strategies (such as switching communication links or adjusting the data transmission rate). The entire monitoring process is continuously performed in a sliding window manner, with the window sliding forward one time slot each time, thereby achieving continuous tracking and real-time early warning of network communication status.
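As a non-limiting illustration, the sliding-window decision logic could be sketched in Python as follows. The confidence threshold of 0.6, the stub model, the choice of which states count as alarm states, and the numeric slot values are hypothetical; the window of 10 time slots, the step of one slot, the four states, and the feature order (packet loss rate, round-trip time, signal strength, TCP retransmission rate, reception throughput) follow the description.

```python
from collections import deque

STATES = ("normal", "mild_congestion", "severe_congestion", "signal_degradation")
ALARM_STATES = {"severe_congestion", "signal_degradation"}

def decide(probs, confidence, min_confidence=0.6):
    """Pick the most probable state; raise an alarm when that state is
    abnormal or the confidence falls below a (hypothetical) threshold."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    state = STATES[idx]
    alarm = state in ALARM_STATES or confidence < min_confidence
    return state, alarm

def monitor(slot_stream, model, window=10):
    """Yield one decision per new time slot once the window is full."""
    buf = deque(maxlen=window)           # the oldest slot drops out automatically
    for slot in slot_stream:
        buf.append(slot)
        if len(buf) == window:
            probs, conf = model(list(buf))
            yield decide(probs, conf)

def stub_model(window):
    """Stand-in for the trained model: reacts to the mean packet loss rate."""
    mean_loss = sum(slot[0] for slot in window) / len(window)
    if mean_loss > 0.2:
        return [0.05, 0.10, 0.80, 0.05], 0.9
    return [0.85, 0.10, 0.03, 0.02], 0.9

healthy = (0.01, 40.0, -60.0, 0.0, 1.2)
congested = (0.9, 300.0, -85.0, 0.4, 0.1)
decisions = list(monitor([healthy] * 10 + [congested] * 4, stub_model))
print(decisions)
```

The bounded deque implements the one-slot step directly: each new slot evicts the oldest one, so every decision is based on the ten most recent slots.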

[0210] In one embodiment, the model's performance on the four network communication states (normal, mild congestion, severe congestion, and signal degradation) is analyzed at a fine-grained level. As shown in Figure 3, a grouped bar chart displays three metrics: precision, recall, and F1 score. The horizontal axis represents the four state categories, and the vertical axis represents the scores. Three bars are arranged side by side under each state category, in red, blue, and green for precision, recall, and F1 score, respectively. Precision is the proportion of samples predicted as a category that actually belong to that category; recall is the proportion of samples actually belonging to a category that are correctly predicted; and the F1 score is the harmonic mean of the two. The experimental data are statistics from the test set, which has a balanced distribution of samples across categories and covers different terminal types and time periods. The experimental results indicate that this technique discriminates the four states reliably, supporting the model's ability to capture complex interactions between features.
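As a non-limiting illustration, the three per-class metrics could be computed from predicted and true labels as in the following Python sketch. The label arrays are toy values chosen for illustration and are not experimental results; the function name is hypothetical.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=4):
    """Return per-class precision, recall, and F1 as three arrays."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                       # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)              # samples predicted as each class
    actual = cm.sum(axis=1)                 # samples truly in each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy labels: 0 normal, 1 mild congestion, 2 severe congestion, 3 signal degradation.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]
precision, recall, f1 = per_class_metrics(y_true, y_pred)
print(precision.round(3), recall.round(3), f1.round(3))
```

Classes with no predicted or no actual samples are assigned a score of 0 rather than raising a division error, which is a design choice of this sketch.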

[0211] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0212] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for monitoring the network communication status of IoT terminals, characterized in that it includes the following steps: S1. Obtain network communication data from IoT terminals, and preprocess and label the network communication data to construct a training dataset; S2. Construct a communication status monitoring model, which is used to output communication status classification results based on the input network communication data; S3. Define the loss function of the communication status monitoring model based on the main task loss and the auxiliary regularization loss, the loss function being used to quantify the difference between the model output and the true label; S4. Train the communication status monitoring model using the training dataset, update the model parameters by minimizing the loss function until the model converges, and obtain the trained communication status monitoring model; S5. Input the real-time network communication data to be monitored into the trained communication status monitoring model to obtain the current communication status monitoring results, and perform corresponding early warning or control operations based on the monitoring results.

2. The method for monitoring the network communication status of an IoT terminal according to claim 1, characterized in that, in S1, a lightweight data acquisition agent is embedded in each participating terminal, and this agent continuously captures key performance indicators of the network layer and transport layer at fixed time intervals, including packet loss rate, round-trip time, signal strength, TCP retransmission rate, and data packet reception throughput, totaling 5 features.

3. The method for monitoring the network communication status of an IoT terminal according to claim 1, characterized in that S2 specifically comprises: S21. Adaptive spatiotemporal feature alignment and enhancement processing is adopted: first, the optimal alignment path between the time series within the window and the standard reference sequence is found through a dynamic time warping strategy; then, the local statistical properties are used to enhance the contrast of the aligned data, and the aligned and enhanced normalized feature tensor is output; S22. Multi-scale temporal dependencies are captured by parallel dilated convolutions with different dilation rates, and feature maps of different scales are adaptively fused according to the local variance of the alignment enhancement matrix to construct a temporal feature map rich in multi-scale information; S23. A lightweight but highly expressive communication state classification model is constructed, which deeply integrates multi-scale features, a time slot attention mechanism, and dynamic gating calibration based on environmental context to achieve accurate judgment of the current communication state.

4. The method for monitoring the network communication status of an IoT terminal according to claim 1, characterized in that S3 specifically comprises: the main task loss uses the confidence level predicted by the model itself to dynamically adjust the penalty strength of the classification loss for each sample and calibrates the prediction of the confidence level; the auxiliary regularization loss is calculated using a graph Laplacian regularization-based method; and the total loss function combines the main task classification loss with the auxiliary regularization loss through a weighted summation.

5. The method for monitoring the network communication status of an IoT terminal according to claim 3, characterized in that S21 specifically comprises: Dynamic time warping alignment: the original feature matrix is aligned with the standard reference matrix along the time dimension by dynamic time warping, eliminating local time offsets caused by network jitter and resulting in the alignment matrix, where the matrix dimensions are given by the size of the time window and the number of features; Local statistical property calculation: the local mean and local standard deviation of each feature in the alignment matrix are calculated within the current time window; Saliency weight mask generation: a normalized exponential weighting method based on the local mean and local standard deviation generates a saliency weight for each time slot and feature, emphasizing fluctuation points that deviate from the local mean; Adaptive spatiotemporal feature enhancement: the alignment matrix, the saliency mask, and a learnable temporal convolution kernel are used for feature enhancement, and the alignment enhancement matrix is obtained by element-wise multiplication of the alignment matrix and the saliency mask, followed by addition of the local context features extracted by one-dimensional convolution.
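
The four sub-steps of S21 can be sketched in NumPy as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the DTW cost is the Euclidean distance between time slots, source rows mapped to the same reference slot are averaged, the normalized exponential weighting is a softmax over the time axis of the standardized absolute deviation, and the "learnable" kernel is passed in as a fixed array. All function names are hypothetical.

```python
import numpy as np


def dtw_align(x, ref):
    """Warp x (T_x, F) onto the time axis of ref (T_r, F) with classic DTW."""
    t_x, t_r = len(x), len(ref)
    cost = np.linalg.norm(x[:, None, :] - ref[None, :, :], axis=2)
    acc = np.full((t_x + 1, t_r + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t_x + 1):
        for j in range(1, t_r + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack the optimal path from the end of both sequences.
    i, j, path = t_x, t_r, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    # Average all x rows matched to each reference slot (assumed pooling rule).
    aligned = np.zeros_like(ref, dtype=float)
    counts = np.zeros(t_r)
    for xi, rj in path:
        aligned[rj] += x[xi]
        counts[rj] += 1
    return aligned / counts[:, None]


def saliency_mask(aligned, eps=1e-8):
    """Softmax over time of |x - mean| / std, computed per feature."""
    mu = aligned.mean(axis=0, keepdims=True)
    sigma = aligned.std(axis=0, keepdims=True)
    dev = np.abs(aligned - mu) / (sigma + eps)
    e = np.exp(dev - dev.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def enhance(aligned, kernel):
    """Alignment matrix * saliency mask + per-feature 1-D convolution context."""
    context = np.column_stack([
        np.convolve(aligned[:, f], kernel, mode="same")
        for f in range(aligned.shape[1])
    ])
    return aligned * saliency_mask(aligned) + context
```

For example, a window whose step change arrives two slots late is warped back onto the reference timing by `dtw_align`, after which `enhance` emphasizes the slots that deviate most from the per-feature mean.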

6. The method for monitoring the network communication status of an IoT terminal according to claim 3, characterized in that S22 specifically comprises: Multi-scale dilated convolution feature extraction: three parallel dilated convolutional layers, each with a different dilation rate, are applied to the alignment enhancement matrix to generate three initial feature maps with different receptive fields; Dynamic fusion weight generation: an input-dependent dynamic fusion weight vector is calculated from the local fluctuation characteristics of the alignment enhancement matrix; Multi-scale feature weighted fusion: the three feature maps of different scales are weighted by the dynamic fusion weight vector and summed to obtain the temporal feature map.
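
A minimal NumPy sketch of S22 is given below under explicit assumptions: the dilation rates 1, 2, and 4 are illustrative (the claim only requires three different rates), each convolution is applied per feature with zero padding and an odd-length kernel, and the fusion weights come from a softmax over a hypothetical linear projection (`proj_w`, `proj_b`) of the mean per-feature variance of the input.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Per-feature dilated 1-D cross-correlation with zero padding.

    x: (T, F); kernel: odd-length 1-D array; output keeps shape (T, F).
    """
    t = x.shape[0]
    half = (len(kernel) // 2) * dilation
    padded = np.pad(x, ((half, half), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for idx, weight in enumerate(kernel):
        start = idx * dilation
        out += weight * padded[start:start + t]
    return out


def fusion_weights(x, proj_w, proj_b):
    """Softmax weights driven by the input's local fluctuation (assumed form)."""
    local_var = x.var(axis=0).mean()       # scalar summary of the fluctuation
    logits = proj_w * local_var + proj_b   # one logit per scale
    e = np.exp(logits - logits.max())
    return e / e.sum()


def multi_scale_features(x, kernels, proj_w, proj_b, dilations=(1, 2, 4)):
    """Three parallel dilated convolutions fused by input-dependent weights."""
    maps = [dilated_conv1d(x, k, d) for k, d in zip(kernels, dilations)]
    weights = fusion_weights(x, proj_w, proj_b)
    fused = sum(w * m for w, m in zip(weights, maps))
    return fused, weights
```

Because the weights sum to one, the fused map is a convex combination of the three scales; with larger `proj_w` entries on the wide-dilation branch, windows with strong fluctuation lean more on the long-range features.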

7. The method for monitoring the network communication status of an IoT terminal according to claim 3, characterized in that S23 specifically comprises: a decoupled cross-attention mechanism is applied to the temporal feature map; the decoupled cross-attention mechanism treats the time dimension and the feature dimension as two independent sequences and, by calculating their interaction with a set of learnable state prototypes, generates an attention distribution that characterizes the similarity between the current sample and each typical state pattern; a dynamic gating mechanism based on the context feature vector adaptively fuses the attention-aggregated features with the projected global-pooling features, adjusting the contribution of the two according to the current environmental context so that the fused features adapt to different communication scenarios and the accuracy of state discrimination is improved; and a multi-task classification head is adopted, which branches from the fused features into two tasks that are dynamically weighted and jointly optimized by considering the homoscedastic uncertainty of the tasks.
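
The attention and gating steps of S23 can be sketched as follows. The claim does not fix shapes or formulas, so this is one possible reading with all details assumed: each branch uses scaled dot-product scores between its tokens and its own prototype set, the per-token distributions are averaged into a sample-level distribution, the two aggregated vectors are concatenated, and the gate is a sigmoid of a linear map of the context vector. The parameter names (`proto_time`, `proto_feat`, `gate_w`, `gate_b`) are hypothetical.

```python
import numpy as np


def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def prototype_attention(tokens, prototypes):
    """Attend from a token sequence (n, d) to K state prototypes (K, d).

    Returns the sample-level distribution over prototypes (K,) and the
    prototype mixture it induces (d,).
    """
    scores = tokens @ prototypes.T / np.sqrt(tokens.shape[1])
    dist = softmax(scores, axis=1).mean(axis=0)
    return dist, dist @ prototypes


def decoupled_attention(h, proto_time, proto_feat):
    """Treat time slots and feature channels of h (T, C) as separate sequences."""
    dist_t, agg_t = prototype_attention(h, proto_time)    # T tokens of size C
    dist_f, agg_f = prototype_attention(h.T, proto_feat)  # C tokens of size T
    return np.concatenate([agg_t, agg_f]), dist_t, dist_f


def gated_fusion(attn_feat, pooled_feat, context, gate_w, gate_b):
    """Context-driven gate blending attention features with pooled features."""
    gate = 1.0 / (1.0 + np.exp(-(gate_w @ context + gate_b)))
    return gate * attn_feat + (1.0 - gate) * pooled_feat
```

In this sketch `pooled_feat` would be a linear projection of `h.mean(axis=0)` to the same length as `attn_feat`; a gate near 1 favors the prototype-based view, while a gate near 0 falls back to the pooled summary.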

8. The method for monitoring the network communication status of an IoT terminal according to claim 7, characterized in that the multi-task classification head, which branches from the fused features into two tasks that are dynamically weighted and jointly optimized by considering the homoscedastic uncertainty of the tasks, specifically comprises: Main state classification task: this task predicts the specific state category of network communication within the current time window, obtaining the probability distribution over the categories by inputting the fused feature vector into a linear classification layer and applying the Softmax function; State confidence regression task: this task estimates the confidence of the model in the current classification result, obtained by concatenating the fused feature vector with the local standard deviation vector, inputting the result into a linear layer, and applying the Sigmoid function to yield a confidence score between 0 and 1.
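
The two branches of the head in claim 8 map directly onto a few lines of NumPy. The Softmax and Sigmoid branches follow the claim text; the joint weighting is an assumption, shown here as a commonly used log-variance form for homoscedastic uncertainty in which each task has a learnable scalar (`log_var_cls`, `log_var_conf`, both hypothetical names). The weight shapes are likewise illustrative.

```python
import numpy as np


def classify(z, w_cls, b_cls):
    """Main task: linear layer + Softmax over state categories."""
    logits = w_cls @ z + b_cls
    e = np.exp(logits - logits.max())
    return e / e.sum()


def confidence(z, local_std, w_conf, b_conf):
    """Auxiliary task: concat(fused features, local std) -> linear -> Sigmoid."""
    u = np.concatenate([z, local_std])
    return float(1.0 / (1.0 + np.exp(-(w_conf @ u + b_conf))))


def uncertainty_weighted_loss(loss_cls, loss_conf, log_var_cls, log_var_conf):
    """Assumed joint objective: exp(-s_k) * L_k + s_k summed over both tasks.

    A large log-variance s_k down-weights task k, while the additive s_k
    term keeps the optimizer from inflating it without bound.
    """
    return float(np.exp(-log_var_cls) * loss_cls + log_var_cls
                 + np.exp(-log_var_conf) * loss_conf + log_var_conf)
```

With both log-variances at zero the joint objective is the plain sum of the two task losses, which makes a natural initialization.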