A method, system, and application for clustering atmospheric particulate matter time series based on temporal multi-representation fusion.
By combining piecewise linear representation and piecewise aggregation approximation with K-nearest neighbor importance, an improved clustering method for atmospheric particulate matter time series data is proposed. This method solves the problems of clustering accuracy and efficiency for high-dimensional atmospheric particulate matter data, and achieves efficient clustering and anomaly detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG UNIV
- Filing Date
- 2024-01-09
- Publication Date
- 2026-06-30
AI Technical Summary
Existing time series clustering methods suffer from unsatisfactory clustering accuracy and low efficiency when dealing with high-dimensional, massive atmospheric particulate matter data, especially when dealing with regions with variable density distribution or uneven density, making it difficult to accurately identify cluster centers.
We employ a piecewise linear representation (PLR) and piecewise aggregate approximation (PAA) strategy, combining trend inflection points and data mean, to extract time series features from multi-representation fusion. We also improve the regional density calculation method by calculating the importance of K-nearest neighbors to identify cluster centers.
By reducing data dimensionality while retaining key time-series information, clustering efficiency is improved, and cluster centers are identified through regional density peaks to enhance clustering accuracy and effectively detect atmospheric anomalies.
Figure CN118051794B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of time series data mining technology, specifically relating to a clustering method, system, and application for atmospheric particulate matter time series sensor data.
Background Technology
[0002] With the arrival of the information age of "comprehensive perception and the Internet of Everything," data, as the carrier of information, has also grown rapidly, exhibiting characteristics such as "massive volume," "high dimensionality," and "heterogeneity." Time series data (referred to as time series) is a type of large data with temporal correlation, capable of accurately recording the development and changes of things over time, and is widely found in various fields of human social life, including industrial manufacturing, smart healthcare, e-commerce, and finance. Time series sensor data of atmospheric particulate matter (PM2.5, PM10, etc.) collected by sensors is referred to as atmospheric particulate matter time series. Time series data mining research is conducted on the above data to discover the objective laws and potential knowledge contained within it.
[0003] Time series clustering is an important research problem in the field of time series data mining. This study treats time series as a series of high-dimensional vectors. Based on the representation of temporal correlation, it uses similarity measurement as a means and unsupervised learning to cluster time series into different clusters according to their similarity. Simultaneously, it achieves maximum similarity of vectors within each cluster and minimum similarity of vectors between clusters, thereby effectively identifying temporal patterns and anomalous trends contained in the time series. Therefore, time series clustering research can not only reveal periodic data patterns in atmospheric particulate matter time series but also effectively detect anomalies in sensor data, possessing significant application value.
[0004] While time series clustering research has yielded some results, it still faces the following challenges:
[0005] (1) How to improve clustering accuracy. Existing time series clustering methods are mainly based on partitioning and density. Partition-based clustering methods are more sensitive to noisy data and may not achieve satisfactory clustering results. Density-based clustering methods can effectively discover clusters of arbitrary shapes by utilizing the density connectivity of clusters, but they still have the problem of not being able to accurately identify cluster centers for data with variable density distribution or uneven density distribution, resulting in unsatisfactory clustering accuracy.
[0006] (2) How to ensure clustering efficiency. Because time series data are often massive and high-dimensional, they pose a challenge to the rapid and effective execution of traditional time series clustering. Therefore, how to construct a reasonable time series representation, so that the represented features both retain the important temporal information in the original data and reduce the data dimensionality, thereby improving the efficiency of time series clustering, remains a great challenge.
Summary of the Invention
[0007] To address existing problems, this invention proposes a Temporal Multi-representation Fusion based Clustering (TMFC) method for atmospheric particulate matter, which improves traditional time series representation methods and regional density calculation methods, thereby improving clustering accuracy while ensuring clustering efficiency.
[0008] This invention, based on the Piecewise Linear Representation (PLR) and Piecewise Aggregate Approximation (PAA) strategies, combines key trend inflection points and data means in time series to extract major temporal features, achieving multi-representation fusion for time series representation. It reduces data dimensionality and improves clustering efficiency while preserving as much of the key temporal information as possible from the original atmospheric particulate matter sensor data. Furthermore, based on the importance of K-nearest neighbors for calculating regional density in time series, this invention improves the regional density calculation method, enabling more accurate identification of cluster centers and achieving higher-precision clustering for time series with variable or uneven density distributions.
[0009] Terminology Explanation:
[0010] 1. Piecewise Linear Representation (PLR): PLR is a time series representation strategy that uses a linear model to segment and represent time series, while reducing the data dimensionality. It is a relatively intuitive way of representing data features and is widely used in time series data mining.
[0011] 2. Piecewise Aggregate Approximation (PAA): PAA is a time series representation strategy that divides the time series into segments by average, and uses the mean of the segmented sequences to represent the original time series, while simultaneously reducing the data dimensionality.
[0012] 3. Symbolic Aggregate Approximation (SAX): SAX is a time series representation strategy. It first divides the time series into segments, then converts the mean of each segment into a character representation, ultimately resulting in a string sequence, thus reducing the data dimensionality. Because strings have a specific data storage structure and relatively mature algorithms, they can solve some real-world problems that are difficult to represent with actual data.
[0013] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0014] The first objective of this invention is to provide a time series clustering method for atmospheric particulate matter based on temporal multi-representation fusion.
[0015] A time-series clustering method for atmospheric particulate matter based on temporal multi-representation fusion includes:
[0016] Real-time monitoring of atmospheric particulate matter concentration data, including collection time, collection location, monitoring value, and pollution type, forms time series data;
[0017] Based on the piecewise linear representation (PLR) and piecewise aggregate approximation (PAA) strategies, corresponding temporal features are extracted from the given time series data, thereby realizing the multi-representation fusion of atmospheric particulate matter time series feature representation;
[0018] Based on the importance of K-nearest neighbors, the regional density of time series is calculated, thereby realizing the time series clustering of atmospheric particulate matter based on regional density.
[0019] As a further preferred option, the multi-representation fusion of the atmospheric particulate matter time series feature representation includes:
[0020] Piecewise linear representation of time series: identify all trend turning points (TP) in each time series, i.e., the data points that reflect changes in the data trend; sort the trend turning points in descending order of their importance weight (Trend Turning Point Importance Index, ζ) to determine the linear segmentation points of the time series $T_i$, and thereby obtain the piecewise linear representation of the time series;
[0021] Piecewise aggregate approximation of time series: the data means of the time series $T_i$ are used to form its piecewise aggregate approximation, which serves as another data feature of the time series in the fusion representation method;
[0022] By combining the data features obtained from the piecewise linear representation with the data features obtained from the piecewise aggregate approximation, the multi-representation fusion time series feature representation is completed.
[0023] Further preferably, the piecewise aggregate approximation of time series includes:
[0024] For a given time series, divide it into several equal segments and calculate the average of each segment; the set of segment averages is taken as the piecewise aggregate approximation of the time series $T_i$, where $1 \le i \le N$, $1 \le x \le X$.
[0025] As a further preferred option, the multi-representation fusion of the atmospheric particulate matter time series feature representation specifically includes:
[0026] Identify the trend inflection points TP of time series $T_i$: for time series $T_i = \{v_{1,i}, \ldots, v_{j,i}, \ldots, v_{M,i}\}$, $1 \le i \le N$, $1 \le j \le M$, where $N$ is the total number of time series and $M$ is the length of each time series; if the current point $v_{j,i}$ of $T_i$, compared with its preceding and following points ($v_{j-1,i}$, $v_{j+1,i}$), satisfies any one of the inequalities in equation (I), then $v_{j,i}$ is identified as a trend inflection point TP, yielding the set of trend turning points of $T_i$, $TP = \{TP_1, \ldots, TP_r, \ldots, TP_R\}$, $1 < R < M$, $1 \le r \le R$; the calculation formula is shown in equation (I):
[0027]
[0028] Calculate the importance weight of each trend inflection point TP: for time series $T_i$, the importance weight of the r-th trend turning point $TP_r$ is defined as $\zeta_r$ ($1 \le r \le R$); the calculation formula is shown in equation (II):
[0029]
[0030] where the formula uses the average value of all data points of the time series $T_i$; the trend turning points are sorted in descending order of importance weight, the optimal number of segments $C$ is determined, and $(C-1)$ trend turning points, i.e., $TP = \{TP_1, \ldots, TP_c, \ldots, TP_{C-1}\}$, $1 \le c \le C-1$, $1 \le C-1 \le R$, are selected as the time series segmentation points;
[0031] Calculate the slope of each segment after segmentation: connect the selected segmentation points of time series $T_i$ end to end and calculate the slope of each connecting line. The slope of the c-th segment is defined as $k_c$; the calculation formula is shown in equation (III):
[0032] $$k_c = \frac{v^{TP}_{c+1} - v^{TP}_{c}}{j^{TP}_{c+1} - j^{TP}_{c}} \quad \text{(III)}$$
[0033] where $v^{TP}_{c}$ is the value of the c-th inflection point; $v^{TP}_{c+1}$ is the value of the (c+1)-th inflection point; $j^{TP}_{c}$ is the index of the c-th inflection point; $j^{TP}_{c+1}$ is the index of the (c+1)-th inflection point. The slopes of all segments of $T_i$ are combined as one data feature of $T_i$ in the fusion representation method, denoted $T_{i\text{-}plr} = \{k_1, \ldots, k_c, \ldots, k_C\}$, $1 \le C-1 \le R$, which completes the piecewise linear representation of $T_i$.
[0034] As a further preferred option, the regional density peak time series clustering method for atmospheric particulate matter based on K-nearest neighbor importance includes:
[0035] Calculate the importance of the K nearest neighbors of time series $T_i$ to its regional density: search for the K nearest neighbors of $T_i$ to obtain the KNN distance of $T_i$ ($KDist_i$), i.e., the mean distance from $T_i$ to its K-nearest-neighbor time series; by assigning different weights to the K nearest neighbors of $T_i$, derive the importance of each of the K nearest neighbors to the regional density of $T_i$;
[0036] Calculate the effect weight of the regional density of each K-nearest-neighbor time series on the regional density of time series $T_i$;
[0037] Take the average effect weight of the K-nearest-neighbor time series of $T_i$ on its regional density as the final effect weight of the K nearest neighbors on the regional density of $T_i$;
[0038] Define a regional density calculation method based on K-nearest neighbor importance, in which the regional density of time series $T_i$ is denoted $\rho_i$;
[0039] After determining the cluster centers, the remaining non-center sequences are assigned. The assignment strategy is to assign each non-center sequence to the cluster containing the time series that is closest to it and has a higher regional density.
[0040] Further preferably, calculating the importance of the K nearest neighbors of time series $T_i$ to its regional density includes:
[0041] Calculate the KNN distance of each time series: calculate the distance $d_{pq}$ between any two time series ($1 \le p \le N$, $1 \le q \le N$), as shown in equation (IV); the K time series nearest to $T_i$ are defined as its K-nearest-neighbor time series $T_{\kappa,i}$, $1 \le \kappa \le K$, from which $KDist_i$ is obtained; the calculation formula is shown in equation (V):
[0042] $$d_{pq} = \sqrt{\sum_{x} \left(v_{x,p} - v_{x,q}\right)^2} \quad \text{(IV)}$$
[0043] $$KDist_i = \frac{1}{K} \sum_{\kappa=1}^{K} d_{\kappa i} \quad \text{(V)}$$
[0044] where $K$ is the number of nearest neighbors; $d_{pq}$ is the Euclidean distance between time series $T_p$ and $T_q$; $v_{x,p}$ and $v_{x,q}$ are the data points of $T_p$ and $T_q$, respectively; $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$;
[0045] Calculate the KNN density of each time series: the KNN density of time series $T_i$ is defined as shown in equation (VI):
[0046]
[0047] From this formula, the KNN density of time series $T_i$ decreases as its KNN distance increases;
[0048] Evaluate the importance of each K-nearest-neighbor time series to the regional density of $T_i$: the importance of the K-nearest-neighbor time series $T_{\kappa,i}$ to the regional density of $T_i$ is defined as $\delta_{\kappa,i}$; the calculation formula is shown in equation (VII):
[0049]
[0050] where $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$; the formula also uses the KNN density of $T_{\kappa,i}$.
[0051] Further preferably, calculating the effect weight of the regional density of the K-nearest-neighbor time series on the regional density of $T_i$ includes:
[0052] The effect weight of the regional density of the K-nearest-neighbor time series $T_{\kappa,i}$ on the regional density of $T_i$ is defined as $\omega_{\kappa,i}$; the calculation formula is shown in equation (VIII):
[0053]
[0054] where $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$; the formula also uses the KNN density of $T_{\kappa,i}$.
[0055] Further preferably, the average effect weight of the K-nearest-neighbor time series of $T_i$ on its regional density is taken as the final effect weight of the K nearest neighbors on the regional density of $T_i$; the calculation formula is shown in equation (IX):
[0056]
[0057] where $K$ is the number of nearest neighbors; $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$; $d_{max}$ is the maximum distance between $T_i$ and its K-nearest-neighbor time series; the formula also uses the KNN density of $T_{\kappa,i}$.
[0058] Further preferably, a regional density calculation method based on K-nearest neighbor importance is defined, in which the regional density of time series $T_i$ is denoted $\rho_i$; the calculation formula is shown in equation (X):
[0059]
[0060] where $K$ is the number of nearest neighbors; $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$; $d_{max}$ is the maximum distance between $T_i$ and its K-nearest-neighbor time series; the formula also uses the KNN density of $T_{\kappa,i}$ and the KNN density of $T_i$;
[0061] Using the above formula, the regional density of all time series is calculated and sorted in descending order according to the regional density value; then, based on the required number of clusters K, the top K points are selected as the final cluster centers.
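The selection step in [0061] can be sketched as follows. This is only an illustration, assuming the regional densities $\rho_i$ have already been computed by equation (X); the function name `select_centers`, the array `rho`, and the sample values are hypothetical and not taken from the patent.

```python
import numpy as np

def select_centers(rho, n_clusters):
    """Return the indices of the n_clusters series with the highest regional density."""
    rho = np.asarray(rho, dtype=float)
    # Sort indices by descending density; a stable sort keeps ties in index order.
    order = np.argsort(-rho, kind="stable")
    return order[:n_clusters]

# Hypothetical regional densities for five time series.
rho = [3.0, 5.0, 2.0, 4.0, 1.0]
centers = select_centers(rho, n_clusters=2)
```

Note that the source uses K both for the number of nearest neighbors and for the number of clusters; the sketch names the latter `n_clusters` to keep the two apart.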
[0062] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of a time series clustering method for atmospheric particulate matter based on time-series multi-representation fusion.
[0063] A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the steps of a time series clustering method for atmospheric particulate matter based on time-series multi-representation fusion.
[0064] The second objective of this invention is to provide a time series clustering system for atmospheric particulate matter based on temporal multi-representation fusion.
[0065] A time-series clustering system for atmospheric particulate matter based on temporal multi-representation fusion includes:
[0066] The time series representation module is configured to extract corresponding time series features from the given time series data based on the piecewise linear representation (PLR) and piecewise aggregate approximation (PAA) strategies, thereby realizing the multi-representation fusion of the atmospheric particulate matter time series feature representation;
[0067] The time series clustering module is configured to: calculate the density of each time series region based on the importance of K-nearest neighbors, identify cluster centers based on the peak density of the regions, and then complete the allocation of non-center time series according to the relationship between the remaining time series and its nearest cluster centers.
[0068] A third objective of this invention is to provide an application of the aforementioned atmospheric particulate matter time series clustering method based on temporal multi-representation fusion, namely the effective detection of atmospheric anomaly data, including:
[0069] Given an atmospheric particulate matter time series dataset $D = \{T_1, \ldots, T_l, \ldots, T_L\}$, $1 \le l \le L$;
[0070] The time series clustering method for atmospheric particulate matter based on temporal multi-representation fusion is used to complete the data clustering;
[0071] Assume the clustered dataset is partitioned as $D = \{K_1, \ldots, K_a, K_{out_1}, \ldots, K_{out_m}\}$, where $K_i$ is a cluster consisting of normal time series, $1 \le i \le a$, and $K_{out_j}$ is a cluster containing anomalous time series, $1 \le j \le m$; according to the clustering results, normal time series are clustered together with other normal time series, while anomalous time series are clustered into classes of their own, thus achieving effective identification of anomalous time series and ultimately detecting atmospheric anomaly data.
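The text does not state how the anomalous clusters $K_{out_j}$ are told apart from the normal clusters once clustering is done. One possible reading, offered purely as an assumption, is that anomalous clusters are the small ones. The sketch below flags clusters whose share of the dataset falls below a threshold; the function name `flag_anomalous_clusters`, the parameter `min_share`, and the sample labels are all hypothetical.

```python
from collections import Counter

def flag_anomalous_clusters(labels, min_share=0.1):
    """Return the cluster ids whose share of all series is below min_share.

    ASSUMPTION: small clusters are treated as anomalous; the source does not
    specify this rule.
    """
    counts = Counter(labels)
    total = len(labels)
    return sorted(c for c, n in counts.items() if n / total < min_share)

# Hypothetical clustering result: two large clusters and one tiny cluster.
labels = [0] * 12 + [1] * 9 + [2] * 1
anomalous = flag_anomalous_clusters(labels, min_share=0.1)
```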
[0072] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0073] 1. This invention is based on the Piecewise Linear Representation (PLR) and Piecewise Aggregate Approximation (PAA) strategies, combined with important trend turning points in time series, to construct a multi-representation fusion method for atmospheric particulate matter time series representation. This method effectively retains the key feature information of the original time series data while reducing data dimensionality, thereby ensuring clustering efficiency.
[0074] 2. This invention evaluates the importance of the K nearest neighbors of a time series to its regional density based on KNN distance and KNN density, thereby calculating the regional density of each time series. Cluster centers are identified based on the regional density peaks, and the remaining time series are assigned according to the relationship between them and their nearest cluster centers, thus achieving time series clustering. Density peak clustering based on regional density can more effectively reflect the regional distribution characteristics of time series and, to a certain extent, considers their distribution across the overall time series, resulting in better clustering accuracy.
Attached Figure Description
[0075] Figure 1 is a flowchart illustrating the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion proposed in this invention.
Detailed Implementation
[0076] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0077] Example 1
[0078] A time-series clustering method for atmospheric particulate matter based on temporal multi-representation fusion, as shown in Figure 1, includes:
[0079] Real-time monitoring of atmospheric particulate matter concentration data is conducted, including collection time, collection location, monitoring value, and pollution type, forming time-series data. Specifically, dust concentration sensors are used to monitor the concentrations of fine particulate matter (PM2.5) and inhalable particulate matter (PM10) in a certain province in real time, sampling data once per minute to form time-series data. Taking PM2.5 data as an example, the collected content includes collection time, collection location, PM2.5 monitoring value, and pollution type, forming a meteorological PM2.5 dataset for that province. The PM10 data collection content is similar, forming a meteorological PM10 dataset for that province.
[0080] Based on the piecewise linear representation (PLR) and piecewise aggregate approximation (PAA) strategies, corresponding temporal features are extracted from the given time series data, thereby realizing the multi-representation fusion of atmospheric particulate matter time series feature representation;
[0081] Based on the importance of K-nearest neighbors, the regional density of time series is calculated, thereby realizing the time series clustering of atmospheric particulate matter based on regional density.
[0082] Example 2
[0083] This example differs from the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion described in Example 1 as follows:
[0084] The multi-representation fusion of the atmospheric particulate matter time series feature representation includes:
[0085] Piecewise linear representation of time series: without loss of generality, assume the i-th time series is $T_i = \{vt_{1,i}, \ldots, vt_{j,i}, \ldots, vt_{M,i}\}$, where $vt_{j,i}$ denotes the j-th time point of $T_i$. Since all data points of $T_i$ are ordered sequentially in the time domain, the time symbol is omitted and $T_i$ is simplified to $T_i = \{v_{1,i}, \ldots, v_{j,i}, \ldots, v_{M,i}\}$ ($1 \le i \le N$, $1 \le j \le M$). To achieve the piecewise linear representation (PLR), identify all trend turning points (TPs) in each time series $T_i$, i.e., the data points that reflect changes in the data trend. The trend turning points are then sorted in descending order of their importance weight (Trend Turning Point Importance Index, ζ) to determine the linear segmentation points of $T_i$, from which the piecewise linear representation $T_{i\text{-}plr}$ is obtained;
[0086] Piecewise aggregate approximation of time series: the data means of time series $T_i$ are used to obtain the piecewise aggregate approximation of $T_i$, which serves as another data feature of $T_i$ in the fusion representation method;
[0087] The data features obtained from the piecewise linear representation are combined with the data features obtained from the piecewise aggregate approximation to complete the multi-representation fusion time series feature representation. Specifically, the two sets of features are concatenated, so that a given time series $T_i$ is represented as $T_i = \{T_{i\text{-}plr}, T_{i\text{-}paa}\}$.
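The concatenation described in [0087] can be illustrated with a minimal sketch, assuming the PLR slopes and PAA means of one series are already available as arrays. The function name `fuse` and the sample values are hypothetical.

```python
import numpy as np

def fuse(plr_features, paa_features):
    """Concatenate PLR slopes (length C) and PAA means (length X) into one vector."""
    plr = np.asarray(plr_features, dtype=float)
    paa = np.asarray(paa_features, dtype=float)
    return np.concatenate([plr, paa])

# Hypothetical features for one series: C = 3 slopes and X = 4 segment means.
fused = fuse([2.0, -1.0, 3.0], [11.0, 15.0, 32.0, 19.0])
```

The fused vector has length C + X, which is the dimensionality seen by the clustering step in place of the original series length M.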
[0088] The piecewise aggregate approximation of time series includes:
[0089] To obtain the piecewise aggregate approximation of a given time series $T_i$, divide it into $X$ equal segments, each of length $L$, and calculate the average value of each segment; for example, the average of the x-th segment is $V_{x,i} = \frac{1}{L}\sum_{j=(x-1)L+1}^{xL} v_{j,i}$, where $v_{(x-1)L+1,i}$ is the first data point of the x-th segment and $v_{xL,i}$ is its last data point. The set of segment averages $\{V_{1,i}, \ldots, V_{x,i}, \ldots, V_{X,i}\}$ is taken as the piecewise aggregate approximation $T_{i\text{-}paa}$ of $T_i$, $1 \le i \le N$, $1 \le x \le X$.
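A minimal sketch of the piecewise aggregate approximation in [0089] follows, assuming the series length is an exact multiple of the number of segments X. The function name `paa` and the sample readings are hypothetical.

```python
import numpy as np

def paa(series, num_segments):
    """Return the mean of each of num_segments equal-length segments."""
    values = np.asarray(series, dtype=float)
    if len(values) % num_segments != 0:
        raise ValueError("this sketch requires len(series) divisible by num_segments")
    # Reshape to (X, L): row x holds the L consecutive points of segment x.
    return values.reshape(num_segments, -1).mean(axis=1)

# Hypothetical PM2.5 readings: M = 8 points, X = 4 segments of length L = 2.
pm25 = [10, 12, 14, 16, 30, 34, 20, 18]
features = paa(pm25, num_segments=4)
```

Here the 8-point series is reduced to 4 values, one mean per segment.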
[0090] The multi-representation fusion of the atmospheric particulate matter time series feature representation specifically includes:
[0091] Identify the trend inflection points TP of time series $T_i$: for time series $T_i = \{v_{1,i}, \ldots, v_{j,i}, \ldots, v_{M,i}\}$ ($1 \le i \le N$, $1 \le j \le M$), without loss of generality, $N$ is the total number of time series and $M$ is the length of each time series; if the current point $v_{j,i}$ of $T_i$, compared with its preceding and following points ($v_{j-1,i}$, $v_{j+1,i}$), satisfies any one of the inequalities in equation (I), then $v_{j,i}$ is identified as a trend inflection point TP, yielding the set of trend turning points of $T_i$, $TP = \{TP_1, \ldots, TP_r, \ldots, TP_R\}$, $1 < R < M$, $1 \le r \le R$; the calculation formula is shown in equation (I):
[0092]
[0093] Calculate the importance weight of each trend inflection point TP: for time series $T_i$, the importance weight of the r-th trend turning point $TP_r$ is defined as $\zeta_r$ ($1 \le r \le R$); the calculation formula is shown in equation (II):
[0094]
[0095] where the formula uses the average value of all data points of the time series $T_i$. The trend inflection points are sorted in descending order of importance weight, and those with higher weights have higher priority. The optimal number of segments $C$ is determined for the dataset by considering the data characteristics, task requirements, and computational efficiency. Since the first and last data points of the time series also serve as segment endpoints, $(C-1)$ further trend inflection points must be selected, i.e., $TP = \{TP_1, \ldots, TP_c, \ldots, TP_{C-1}\}$, $1 \le c \le C-1$, $1 \le C-1 \le R$, which are used as the time series segmentation points;
[0096] Calculate the slope of each segment after segmentation: connect the selected segmentation points of time series $T_i$ end to end and calculate the slope of each connecting line. The slope of the c-th segment is defined as $k_c$; the calculation formula is shown in equation (III):
[0097] $$k_c = \frac{v^{TP}_{c+1} - v^{TP}_{c}}{j^{TP}_{c+1} - j^{TP}_{c}} \quad \text{(III)}$$
[0098] where $v^{TP}_{c}$ is the value of the c-th inflection point; $v^{TP}_{c+1}$ is the value of the (c+1)-th inflection point; $j^{TP}_{c}$ is the index of the c-th inflection point; $j^{TP}_{c+1}$ is the index of the (c+1)-th inflection point. The slopes of all segments of $T_i$ are combined as one data feature of $T_i$ in the fusion representation method, denoted $T_{i\text{-}plr} = \{k_1, \ldots, k_c, \ldots, k_C\}$, $1 \le C-1 \le R$, which completes the piecewise linear representation of $T_i$.
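The PLR steps in [0091] to [0098] can be sketched as below. Equations (I) and (II) are not reproduced in this text, so two stand-ins are used and should be read as assumptions only: a point counts as a turning point when it is a strict local maximum or minimum, and its importance weight is its absolute deviation from the series mean. The slope computation follows the description of equation (III). The function names and sample values are hypothetical.

```python
import numpy as np

def turning_points(values):
    """Indices of strict local extrema (ASSUMED stand-in for equation (I))."""
    v = np.asarray(values, dtype=float)
    found = []
    for j in range(1, len(v) - 1):
        is_peak = v[j] > v[j - 1] and v[j] > v[j + 1]
        is_valley = v[j] < v[j - 1] and v[j] < v[j + 1]
        if is_peak or is_valley:
            found.append(j)
    return found

def plr_slopes(values, num_segments):
    """Slopes of the lines joining the first point, the selected turning points, and the last point."""
    v = np.asarray(values, dtype=float)
    mean = v.mean()
    tps = turning_points(v)
    # ASSUMED stand-in for equation (II): weight = |value - series mean|.
    ranked = sorted(tps, key=lambda j: abs(v[j] - mean), reverse=True)
    chosen = sorted(ranked[: num_segments - 1])   # keep the (C-1) most important
    cuts = [0] + chosen + [len(v) - 1]            # endpoints act as segment points
    # Slope of each segment: value difference over index difference.
    return np.array([(v[b] - v[a]) / (b - a) for a, b in zip(cuts[:-1], cuts[1:])])

series = [1, 3, 5, 4, 3, 6, 9]
slopes = plr_slopes(series, num_segments=3)
```

If a series has fewer than (C-1) turning points, this sketch simply returns fewer slopes; how the patent handles that case is not stated.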
[0099] The regional density peak time series clustering method for atmospheric particulate matter based on K-nearest neighbor importance includes:
[0100] Calculate the importance of the K nearest neighbors of time series $T_i$ to its regional density: search for the K nearest neighbors of $T_i$ to obtain the KNN distance of $T_i$ ($KDist_i$), i.e., the mean distance from $T_i$ to its K-nearest-neighbor time series; by assigning different weights to the K nearest neighbors of $T_i$, derive the importance of each of the K nearest neighbors to the regional density of $T_i$;
[0101] Calculate the effect weight of the regional density of each K-nearest-neighbor time series on the regional density of time series $T_i$;
[0102] Based on the above steps, take the average effect weight of the K-nearest-neighbor time series of $T_i$ on its regional density as the final effect weight of the K nearest neighbors on the regional density of $T_i$;
[0103] Based on the above concepts and formulas, define a regional density calculation method based on K-nearest neighbor importance, in which the regional density of time series $T_i$ is denoted $\rho_i$;
[0104] After determining the cluster centers, the remaining non-center sequences are assigned. The assignment strategy is to assign each non-center sequence to the cluster containing the time series that is closest to it and has a higher regional density.
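The assignment strategy in [0104] can be sketched as follows, assuming a pairwise distance matrix, the regional densities, and the cluster centers are already available. Series are processed in descending order of density so that the nearest higher-density series always has a label by the time it is needed. The function name `assign_clusters` and the sample data are hypothetical.

```python
import numpy as np

def assign_clusters(dist, rho, centers):
    """Label each series with the cluster of its nearest higher-density series."""
    dist = np.asarray(dist, dtype=float)
    rho = np.asarray(rho, dtype=float)
    centers = np.asarray(centers)
    labels = np.full(len(rho), -1)
    labels[centers] = np.arange(len(centers))     # each center starts its own cluster
    order = np.argsort(-rho, kind="stable")       # descending regional density
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue
        higher = order[:pos]                      # series ranked above i by density
        if len(higher) == 0:                      # guard: fall back to nearest center
            nearest = centers[np.argmin(dist[i, centers])]
        else:
            nearest = higher[np.argmin(dist[i, higher])]
        labels[i] = labels[nearest]
    return labels

# Hypothetical 1-D positions standing in for fused feature vectors.
pos = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
dist = np.abs(pos[:, None] - pos[None, :])
rho = [3.0, 5.0, 2.0, 4.0, 1.0]
labels = assign_clusters(dist, rho, centers=[1, 3])
```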
[0105] The steps above constitute a complete regional density peak clustering method based on K-nearest neighbor importance, which is denoted NNI-DPC (Near Neighbour Importance-DPC).
[0106] Calculating the importance of the K nearest neighbors of time series $T_i$ to its regional density includes:
[0107] Calculate the KNN distance of each time series: calculate the distance $d_{pq}$ between any two time series ($1 \le p \le N$, $1 \le q \le N$), as shown in equation (IV); the K time series nearest to $T_i$ are defined as its K-nearest-neighbor time series $T_{\kappa,i}$, $1 \le \kappa \le K$, from which $KDist_i$ is obtained; the calculation formula is shown in equation (V):
[0108] $$d_{pq} = \sqrt{\sum_{x} \left(v_{x,p} - v_{x,q}\right)^2} \quad \text{(IV)}$$
[0109] $$KDist_i = \frac{1}{K} \sum_{\kappa=1}^{K} d_{\kappa i} \quad \text{(V)}$$
[0110] where $K$ is the number of nearest neighbors; $d_{pq}$ is the Euclidean distance between time series $T_p$ and $T_q$; $v_{x,p}$ and $v_{x,q}$ are the data points of $T_p$ and $T_q$, respectively; $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$;
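The KNN distance described in [0107] to [0110] (pairwise Euclidean distance, then the mean distance to the K nearest neighbors) can be sketched as follows. The function name `knn_distance` and the sample feature vectors are hypothetical; each row stands for the feature vector of one time series.

```python
import numpy as np

def knn_distance(features, k):
    """Return (pairwise distances, neighbor indices, mean distance to k nearest)."""
    X = np.asarray(features, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))       # Euclidean distance d_pq
    masked = dist.copy()
    np.fill_diagonal(masked, np.inf)              # a series is not its own neighbor
    neighbors = np.argsort(masked, axis=1)[:, :k] # indices of the k nearest series
    neighbor_dist = np.take_along_axis(masked, neighbors, axis=1)
    kdist = neighbor_dist.mean(axis=1)            # KDist_i: mean of the k distances
    return dist, neighbors, kdist

# Hypothetical 2-D feature vectors for four series; the last one is far away.
feats = [[0, 0], [0, 1], [0, 3], [10, 10]]
dist, neighbors, kdist = knn_distance(feats, k=2)
```

The isolated fourth series ends up with a much larger KNN distance than the other three, which is the property the KNN density in equation (VI) builds on.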
[0111] Calculate the KNN density of each time series: the KNN density of time series $T_i$ is defined as shown in equation (VI):
[0112]
[0113] From this formula, the KNN density of time series $T_i$ decreases as its KNN distance increases;
[0114] Evaluate the importance of each K-nearest-neighbor time series to the regional density of $T_i$: the importance of the K-nearest-neighbor time series $T_{\kappa,i}$ to the regional density of $T_i$ is defined as $\delta_{\kappa,i}$; the calculation formula is shown in equation (VII):
[0115]
[0116] where $d_{\kappa i}$ is the Euclidean distance between $T_i$ and its K-nearest-neighbor time series $T_{\kappa,i}$; the formula also uses the KNN density of $T_{\kappa,i}$. From the formula, the importance of the nearest-neighbor time series $T_{\kappa,i}$ to the regional density of $T_i$ increases as the KNN density of $T_{\kappa,i}$ increases and weakens as the distance between $T_i$ and $T_{\kappa,i}$ increases.
[0117] Calculating the effect weight of the regional density of the K-nearest-neighbor time series on the regional density of $T_i$ includes:
[0118] The effect weight of the regional density of the K-nearest-neighbor time series $T_{\kappa,i}$ on the regional density of $T_i$ is defined as $\omega_{\kappa,i}$; the calculation formula is shown in equation (VIII):
[0119]
[0120] where d_κi is the Euclidean distance between T_i and its K-nearest-neighbor time series T_{κ,i}, and the density term is the KNN density of T_{κ,i}. It follows from the formula that the effect weight of the regional density of T_{κ,i} on the regional density of T_i is positively correlated with the importance of T_{κ,i} to the regional density of T_i.
[0121] The average effect weight of the K-nearest-neighbor time series of T_i on its regional density is taken as the final effect weight of the K nearest neighbors on the regional density of T_i, as shown in equation (IX):
[0122]
[0123] where K is the number of nearest neighbors; d_κi is the Euclidean distance between T_i and its K-nearest-neighbor time series T_{κ,i}; d_max is the maximum distance between T_i and its K-nearest-neighbor time series; and the density term is the KNN density of T_{κ,i}.
[0124] Because a limited number of decimal places may cause a loss of precision and thereby affect clustering accuracy, the formula is adjusted to avoid the case in which the denominator approaches 0 because a value is too small, which would make the result infinite.
[0125] A regional density calculation method based on K-nearest neighbor importance is defined, in which the regional density of time series T_i is defined as ρ_i, as shown in equation (X):
[0126]
[0127] where K is the number of nearest neighbors; d_κi is the Euclidean distance between T_i and its K-nearest-neighbor time series T_{κ,i}; d_max is the maximum distance between T_i and its K-nearest-neighbor time series; and the two density terms are the KNN density of T_{κ,i} and the KNN density of T_i, respectively;
[0128] Using the above formula, the regional density of every time series is calculated and the time series are sorted in descending order of regional density; then, given the required number of clusters K, the top K time series are selected as the final cluster centers.
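The center selection described above, together with the assignment rule stated in the claims (each non-center series joins the cluster of its nearest series with a higher regional density), can be sketched as follows. This is an illustrative sketch only: the regional densities are taken as a precomputed input, because equation (X) is not reproduced here, and the names `cluster_by_density`, `density`, `dist`, and `n_clusters` are hypothetical.

```python
def cluster_by_density(density, dist, n_clusters):
    """Pick the n_clusters densest series as centers, then assign the rest.

    density: regional density of each series (precomputed, assumed input)
    dist:    full N x N matrix of pairwise distances
    Returns (centers, labels), where labels[i] is the cluster id of series i.
    """
    n = len(density)
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must be between 1 and the number of series")
    # Indices sorted by regional density, densest first.
    order = sorted(range(n), key=lambda i: density[i], reverse=True)
    centers = order[:n_clusters]
    labels = [-1] * n
    for cluster_id, center in enumerate(centers):
        labels[center] = cluster_id
    # Visit the remaining series in descending density, so every candidate
    # in order[:pos] is already labeled. Ties count as "higher" here.
    for pos in range(n_clusters, n):
        i = order[pos]
        nearest = min(order[:pos], key=lambda j: dist[i][j])
        labels[i] = labels[nearest]
    return centers, labels
```

Processing series in descending density means each one inherits its label from an already-labeled neighbor in a single pass, with no iteration needed.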
[0129] Table 1 is an introduction to the comparative models of the present invention; Table 2 is a comparison table of the clustering accuracy of the present invention; Table 3 is a comparison table of the clustering time of the present invention.
[0130] Table 1
[0131]
[0132]
[0133] Table 2
[0134]
[0135]
[0136] Table 3
[0137]
[0138]
[0139] As shown in Tables 2 and 3, the present invention was compared with six clustering models in terms of both clustering accuracy and clustering time, through comparative experiments on the Shandong Provincial Meteorological PM2.5 dataset (SDQXPM25), the Shandong Provincial Meteorological PM10 dataset (SDQXPM10), and 17 time-series benchmark datasets. Table 2 shows that the present invention has a significant advantage in clustering accuracy over the baseline methods on most datasets. Table 3 shows that, by effectively representing the features of the time series, the present invention achieves significantly better clustering time than the comparative models on most datasets, ensuring efficient clustering.
[0140] Example 3
[0141] A computer device includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps of the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion as described in Example 1 or 2.
[0142] Example 4
[0143] A computer-readable storage medium has a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion as described in Example 1 or 2.
[0144] Example 5
[0145] A time-series clustering system for atmospheric particulate matter based on temporal multi-representation fusion includes:
[0146] The time series representation module is configured to extract the corresponding time series features from the given time series data based on the piecewise linear representation (PLR) and piecewise aggregate approximation (PAA) strategies, thereby realizing a multi-representation fusion feature representation of atmospheric particulate matter time series;
[0147] The time series clustering module is configured to calculate the regional density of each time series based on K-nearest neighbor importance, identify cluster centers from the regional density peaks, and then assign the remaining non-center time series according to their relationship with the nearest cluster centers.
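The PAA part of the representation module follows the rule given in the claims: divide a series into equal segments and keep the mean of each. A minimal sketch is shown below; the function name `paa` is hypothetical, and the requirement that the series length be an exact multiple of the number of segments is a simplifying assumption of this sketch, not something the source states.

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation: mean of each equal-length segment.

    Assumption: len(series) is an exact multiple of n_segments.
    """
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")
    width = n // n_segments
    return [
        sum(series[s * width:(s + 1) * width]) / width
        for s in range(n_segments)
    ]
```

For example, `paa([1, 3, 5, 7, 2, 4], 3)` reduces six readings to the three segment means `[2.0, 6.0, 3.0]`, which is how the representation lowers dimensionality while keeping the level of each interval.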
[0148] Example 6
[0149] An application of the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion described in Example 1 or 2 to the effective detection of anomalous atmospheric data, including:
[0150] Given an atmospheric particulate matter time series dataset D = {T_1, ..., T_l, ..., T_L}, 1 ≤ l ≤ L;
[0151] The atmospheric particulate matter time series clustering method based on temporal multi-representation fusion is used to cluster the data.
[0152] Without loss of generality, assume that after clustering the dataset is partitioned as D = {K_1, ..., K_a, K_out1, ..., K_outm}, where K_i denotes a cluster consisting of normal time series, 1 ≤ i ≤ a, and K_outj denotes a cluster containing anomalous time series, 1 ≤ j ≤ m. In the clustering results, normal time series are clustered together, while anomalous time series are likewise clustered together. That is, the clusters formed by anomalous time series and those formed by normal time series differ clearly in distance under the similarity measure, which enables effective identification of anomalous time series and ultimately the detection of anomalous atmospheric data.
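The partition into normal and anomalous clusters can be illustrated with a short sketch. The source does not state how an anomalous cluster is recognized once clustering is complete; the rule used here, flagging clusters with fewer than `min_size` members, is purely an assumption for illustration, and the names `split_clusters` and `min_size` are hypothetical.

```python
from collections import defaultdict


def split_clusters(labels, min_size):
    """Group series indices by cluster label and separate small clusters.

    Assumption (not from the source): a cluster with fewer than min_size
    members is treated as a candidate anomalous cluster.
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    normal = {lab: m for lab, m in clusters.items() if len(m) >= min_size}
    anomalous = {lab: m for lab, m in clusters.items() if len(m) < min_size}
    return normal, anomalous
```

In practice the candidate clusters would still need review against domain knowledge, since a small cluster may also reflect a rare but legitimate pollution pattern.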
Claims
1. A time series clustering method for atmospheric particulate matter based on temporal multi-representation fusion, characterized in that it includes:
Monitoring atmospheric particulate matter concentration data in real time, including collection time, collection location, monitoring value, and pollution type, to form time series data;
Extracting the corresponding time series features from the given time series data based on piecewise linear representation and piecewise aggregate approximation strategies, thereby realizing a multi-representation fusion time series feature representation of atmospheric particulate matter;
Calculating the regional density of each time series based on K-nearest neighbor importance, thereby realizing clustering of atmospheric particulate matter time series based on regional density peaks;
The multi-representation fusion time series feature representation of atmospheric particulate matter includes:
Piecewise linear representation of the time series: identify all trend inflection points in each time series, i.e., the data points that reflect the trend of data change; sort the trend inflection points in descending order of importance weight to determine the linear segmentation points of each time series, and thus obtain the piecewise linear representation of the time series;
Piecewise aggregate approximation of the time series: use the data means of the time series to obtain its piecewise aggregate approximation, which serves as another data feature of the time series in the fusion representation method;
Combining the data features obtained from the piecewise linear representation with those obtained from the piecewise aggregate approximation to complete the multi-representation fusion time series feature representation;
The piecewise aggregate approximation includes: dividing a given time series into several equal segments and calculating the mean of each segment; the set of segment means serves as the piecewise aggregate approximation of the time series;
The regional density peak clustering method for atmospheric particulate matter time series based on K-nearest neighbor importance includes:
Calculating the importance of the K nearest neighbors of time series T_i to its regional density: search for the K nearest neighbors of T_i and derive the KNN distance of T_i, i.e., the mean distance from T_i to its K-nearest-neighbor time series; assign different weights to the K nearest neighbors of T_i, thus deriving the importance of each K nearest neighbor to the regional density of T_i;
Calculating the effect weight of the regional density of each K-nearest-neighbor time series on the regional density of T_i;
Taking the average effect weight of the K-nearest-neighbor time series of T_i on its regional density as the final effect weight of the K nearest neighbors on the regional density of T_i;
Defining a regional density calculation method based on K-nearest neighbor importance, in which the regional density of time series T_i is defined as ρ_i;
After the cluster centers are determined, assigning the remaining non-center sequences, where each non-center sequence is assigned to the cluster containing the time series that is closest to it and has a higher regional density.
2. The atmospheric particulate matter time series clustering method based on temporal multi-representation fusion according to claim 1, characterized in that the multi-representation fusion time series feature representation of atmospheric particulate matter specifically includes:
Identifying the trend turning points (TP) in each time series of the dataset, given the total number of time series and the length of each time series: each point of the time series is compared with the two points immediately before and after it; if its value satisfies any one of the inequalities in formula (I), the point is identified as a trend turning point TP, which yields the set of trend turning points of the time series;
Calculating the importance weight of each trend turning point (TP) according to formula (II), which uses the mean of all points of the time series;
Sorting the trend turning points in descending order of importance weight, determining the optimal number of segments, and selecting the corresponding highest-weighted trend turning points as the segmentation points of the time series;
Calculating the slope of each segment after segmentation: the selected segmentation points are connected end to end, and the slope of the c-th segment is calculated according to formula (III) from the values and the indices of the c-th and (c+1)-th turning points;
Combining the slopes of all segments into one data feature of the time series in the fusion representation method, which completes the piecewise linear representation of the time series.
3. The atmospheric particulate matter time series clustering method based on temporal multi-representation fusion according to claim 1, characterized in that calculating the importance of the K nearest neighbors of time series T_i to its regional density includes:
Calculating the KNN distance of each time series: compute the distance d_pq between any two time series T_p and T_q, as shown in equation (IV); the K time series nearest to T_i are defined as its K-nearest-neighbor time series T_{κ,i}, which yields the KNN distance KDist_i, as shown in equation (V); where K is the number of nearest neighbors; d_pq is the Euclidean distance between T_p and T_q; v_{x,p} and v_{x,q} are data points of T_p and T_q, respectively; and d_κi is the Euclidean distance between T_i and its K-nearest-neighbor time series T_{κ,i};
Calculating the KNN density of each time series: the KNN density of time series T_i is defined by equation (VI);
Evaluating the importance of each K-nearest-neighbor time series to the regional density of T_i: the importance of the K-nearest-neighbor time series T_{κ,i} to the regional density of T_i is defined as δ_{κ,i}, as shown in equation (VII), which depends on the Euclidean distance d_κi between T_i and T_{κ,i} and on the KNN density of T_{κ,i};
Calculating the effect weight of the regional density of the K-nearest-neighbor time series on the regional density of T_i includes: the effect weight of the regional density of T_{κ,i} on the regional density of T_i is defined as ω_{κ,i}, as shown in equation (VIII), which depends on the Euclidean distance d_κi between T_i and T_{κ,i} and on the KNN density of T_{κ,i}.
4. The atmospheric particulate matter time series clustering method based on temporal multi-representation fusion according to claim 1, characterized in that:
The average effect weight of the K-nearest-neighbor time series of T_i on its regional density is taken as the final effect weight of the K nearest neighbors on the regional density of T_i, as shown in equation (IX); where K is the number of nearest neighbors; d_κi is the Euclidean distance between T_i and its K-nearest-neighbor time series T_{κ,i}; d_max is the maximum distance between T_i and its K-nearest-neighbor time series; and the density term is the KNN density of T_{κ,i};
A regional density calculation method based on K-nearest neighbor importance is defined, in which the regional density of time series T_i is defined as ρ_i, as shown in equation (X); where K is the number of nearest neighbors; d_κi is the Euclidean distance between T_i and its K-nearest-neighbor time series T_{κ,i}; d_max is the maximum distance between T_i and its K-nearest-neighbor time series; and the two density terms are the KNN density of T_{κ,i} and the KNN density of T_i, respectively;
Using the above formula, the regional density of every time series is calculated and the time series are sorted in descending order of regional density; then, given the required number of clusters K, the top K time series are selected as the final cluster centers.
5. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion as described in any one of claims 1-4.
6. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion as described in any one of claims 1-4.
7. A time series clustering system for atmospheric particulate matter based on temporal multi-representation fusion, characterized in that it includes:
A time series representation module, configured to extract the corresponding time series features from the given time series data based on piecewise linear representation and piecewise aggregate approximation strategies, thereby realizing a multi-representation fusion time series feature representation of atmospheric particulate matter;
A time series clustering module, configured to calculate the regional density of each time series based on K-nearest neighbor importance, identify cluster centers from the regional density peaks, and then assign the remaining non-center time series according to their relationship with the nearest cluster centers;
The multi-representation fusion time series feature representation of atmospheric particulate matter includes:
Piecewise linear representation of the time series: identify all trend inflection points in each time series, i.e., the data points that reflect the trend of data change; sort the trend inflection points in descending order of importance weight to determine the linear segmentation points of each time series, and thus obtain the piecewise linear representation of the time series;
Piecewise aggregate approximation of the time series: use the data means of the time series to obtain its piecewise aggregate approximation, which serves as another data feature of the time series in the fusion representation method;
Combining the data features obtained from the piecewise linear representation with those obtained from the piecewise aggregate approximation to complete the multi-representation fusion time series feature representation;
The piecewise aggregate approximation includes: dividing a given time series into several equal segments and calculating the mean of each segment; the set of segment means serves as the piecewise aggregate approximation of the time series;
The regional density peak clustering method for atmospheric particulate matter time series based on K-nearest neighbor importance includes:
Calculating the importance of the K nearest neighbors of time series T_i to its regional density: search for the K nearest neighbors of T_i and derive the KNN distance of T_i, i.e., the mean distance from T_i to its K-nearest-neighbor time series; assign different weights to the K nearest neighbors of T_i, thus deriving the importance of each K nearest neighbor to the regional density of T_i;
Calculating the effect weight of the regional density of each K-nearest-neighbor time series on the regional density of T_i;
Taking the average effect weight of the K-nearest-neighbor time series of T_i on its regional density as the final effect weight of the K nearest neighbors on the regional density of T_i;
Defining a regional density calculation method based on K-nearest neighbor importance, in which the regional density of time series T_i is defined as ρ_i;
After the cluster centers are determined, assigning the remaining non-center sequences, where each non-center sequence is assigned to the cluster containing the time series that is closest to it and has a higher regional density.
8. An application of a time series clustering method for atmospheric particulate matter based on temporal multi-representation fusion, characterized in that it includes the effective detection of anomalous atmospheric data, including:
Given an atmospheric particulate matter time series dataset D = {T_1, ..., T_l, ..., T_L}, 1 ≤ l ≤ L;
Using the atmospheric particulate matter time series clustering method based on temporal multi-representation fusion as described in any one of claims 1-4 to cluster the data;
Assuming that after clustering the dataset is partitioned as D = {K_1, ..., K_a, K_out1, ..., K_outm}, where K_i denotes a cluster consisting of normal time series and K_outj denotes a cluster containing anomalous time series; in the clustering results, normal time series are clustered together, while anomalous time series are likewise clustered together, which enables effective identification of anomalous time series and ultimately the detection of anomalous atmospheric data.