A traffic flow data missing repair system evaluation and strategy design method
By constructing traffic flow data missing repair models that consider different spatiotemporal correlations, the problem of insufficient model applicability in existing technologies is solved, and a more comprehensive and accurate traffic flow data missing repair strategy is realized, which is applicable to a variety of application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING YUNJIE TECH CO LTD
- Filing Date
- 2023-02-26
- Publication Date
- 2026-06-26
AI Technical Summary
The existing methods for repairing missing traffic flow data lack performance evaluation and applicability research under different influencing factors, making it impossible to select appropriate models for different application scenarios and affecting the formulation of traffic flow data missing repair strategies.
We construct different traffic flow data missing repair models that consider spatiotemporal correlation and those that do not. By evaluating the performance of various models under different influencing factors, we formulate a general strategy for traffic flow data missing repair.
It provides a more comprehensive and accurate method for repairing missing traffic flow data, applicable to different application scenarios, and improves the repair effect and applicability of the model.
[Figure: CN116467545B abstract drawing]
Abstract
Description
Technical Field
[0001] This invention relates to the field of traffic engineering, specifically to a method for evaluating and designing strategies for a traffic flow data missing repair system.
Background Technology
[0002] Traffic flow data is often incomplete owing to factors such as severe weather, complex road environments, detector malfunctions, and communication failures, which causes subsequent modeling to fail or to perform significantly worse. Repairing missing traffic flow data is a crucial step in traffic data analysis and modeling, and a core component of intelligent transportation systems. Current research focuses on designing repair methods, while comprehensive evaluation of model performance and applicability under influencing factors such as missing scenario, missing proportion, missing duration, road type, and data collection interval is relatively lacking.
[0003] Traffic flow data missing data repair is a crucial step in traffic data preprocessing and a fundamental functional component of intelligent transportation systems. Accurate traffic flow data missing data repair provides reliable data support for traffic state estimation, prediction, and congestion analysis. Traffic flow data missing data repair has consistently been a key and hot research topic in the transportation field.
[0004] Traditional methods for repairing missing traffic flow data are mostly based on numerical calculation methods such as simple interpolation, including linear interpolation, quadratic interpolation, nearest neighbor interpolation, moving average, and inverse distance weighted interpolation.
[0005] Many different types of traffic flow missing data repair models have been proposed, ranging from simple naive models to complex deep learning-based models. However, in real-world traffic engineering applications, model accuracy and complexity should be considered comprehensively. Some simple models have poor performance but are easy to understand, implement, and deploy. Some complex models have high accuracy but usually require complex modeling work. Therefore, a balance between model performance and complexity should be struck in practical applications.
[0006] Models based on spatiotemporal correlation analysis and iterative interpolation mechanisms are simple to implement and can achieve good repair results. However, current research focuses on traffic flow repair algorithms and model design and lacks a comprehensive evaluation and comparison of model performance under different influencing factors. This makes it difficult to select suitable models for different application scenarios and hinders the development of corresponding traffic flow missing-data repair strategies for those scenarios.
[0007] This application aims to collect traffic flow data from different research road networks, construct traffic flow data missing repair models that consider spatiotemporal correlation and those that do not, and comprehensively evaluate the performance and applicability of various models under different influencing factors. Based on this, a general strategy for traffic flow data missing repair using the three types of models is formulated, providing a decision-making basis for the selection of traffic flow missing repair models and the formulation of universal traffic flow missing repair strategies.
Summary of the Invention
[0008] The purpose of this invention is to provide a method for evaluating and designing strategies for traffic flow data missing repair systems, in order to solve the problems in the prior art.
[0009] To achieve the above objectives, the present invention provides the following technical solution:
[0010] An evaluation and strategy design method for a traffic flow data missing repair system:
[0011] S1: Construct a research dataset using time series data of traffic flow at different cross-sections, evaluate the different degrees of data missing due to multiple factors, including discrete and continuous data missing, and analyze various traffic flow data missing repair models under different data missing conditions.
[0012] S2: Based on different modeling methods, two types of traffic flow data missing repair models were constructed, one without considering spatiotemporal correlation and the other considering spatiotemporal correlation. The missing traffic flow data was repaired and analyzed according to different data repair methods.
[0013] S3: Evaluate the missing data repair performance of traffic flow data missing repair models that do not consider spatiotemporal correlation and those that do consider spatiotemporal correlation. Compare and analyze the model evaluation indicators of different modeling methods under discrete missing data and continuous missing data. The comparison shows that the traffic flow data missing data repair model that considers spatiotemporal correlation repairs missing data more comprehensively and accurately than the traffic flow data missing data repair model that does not consider spatiotemporal correlation repair. Moreover, the repair effect of different models under discrete missing data is better than that of models under continuous missing data.
[0014] S4: By comprehensively comparing the performance of different traffic flow missing data repair models that consider spatiotemporal correlation under discrete and continuous missing data scenarios, the error indices of different road traffic flow missing data repair models under different missing data scenarios are analyzed. Combining the advantages of different traffic flow missing data repair models, a general strategy for traffic flow data missing data repair with good adaptability is formulated.
[0015] Further settings: Step S1 also includes the following steps:
[0016] S10: Based on the traffic flow time series data of the constructed research road network, datasets with discrete and continuous missing conditions are constructed by randomly and continuously removing traffic flow time series data.
[0017] S11: Based on the dataset, model missing road traffic flow information, considering the spatiotemporal correlation of traffic flow. The Pearson product-moment correlation coefficient method is used to measure the spatiotemporal correlation of road traffic flow from both temporal and spatial dimensions, according to the formula:
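[0018] $$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$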
[0019] In the above formula, r (r∈[-1,1]) is the Pearson product-moment correlation coefficient between the traffic flow time series x to be repaired and the associated traffic flow time series y;
[0020] If y is a historical time series of x at the same traffic monitoring station, then r is defined as the time correlation coefficient between the two.
[0021] If y is a time series of x associated with nearby traffic monitoring stations, then r is defined as the spatial correlation coefficient;
[0022] where $x_i$ is the i-th observation sample in time series x, $y_i$ is the i-th observation sample in time series y, n is the number of observation samples in time series x and y, and $\bar{x}$ and $\bar{y}$ are the average values of time series x and y, respectively;
[0023] If r = 0, it indicates that there is no linear correlation between x and y. If r > 0, it indicates that there is a positive correlation between x and y, and y increases continuously as x increases. If r < 0, it indicates that there is a negative correlation between x and y, and y decreases continuously as x increases. The larger the absolute value of r, the stronger the spatiotemporal correlation between x and y.
[0024] S12: From the series related to the missing traffic flow time series, select the traffic flow time series with the strongest spatiotemporal correlation and construct a spatiotemporal correlation dataset for training the traffic flow missing repair model. Specifically, calculate the temporal correlation coefficient, denoted $r_T$, between the missing time series and the traffic flow time series of adjacent days and of the corresponding days of adjacent weeks. Calculate the spatial correlation coefficient, denoted $r_S$, between the missing time series and the traffic flow time series of upstream and downstream road segments and adjacent lanes. Then select the two traffic flow time series with the highest $r_T$ and the highest $r_S$, respectively; together with the missing time series, they constitute the traffic flow spatiotemporal correlation dataset.
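Steps S11 and S12 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names, the use of `None` for missing samples, the choice to compute r only over positions observed in both series, and the sample values are all assumptions made for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation over positions observed (not None) in both series."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)

def most_correlated(target, candidates):
    """Return (name, r) for the candidate series with the highest r."""
    scored = [(name, pearson_r(target, series)) for name, series in candidates.items()]
    return max(scored, key=lambda item: item[1])

# Hypothetical series: the target has one missing sample.
target = [10, 20, None, 40, 50]
temporal = {"previous_day": [12, 22, 31, 41, 52], "previous_week": [50, 10, 30, 20, 40]}
spatial = {"upstream": [9, 19, 29, 38, 51], "adjacent_lane": [30, 30, 31, 29, 30]}

best_temporal = most_correlated(target, temporal)  # candidate with highest r_T
best_spatial = most_correlated(target, spatial)    # candidate with highest r_S
print(best_temporal[0], best_spatial[0])
```

The two selected series, together with the target series, would form the columns of the spatiotemporal correlation dataset described in S12.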
[0025] Further settings: Step S2 also includes the following steps:
[0026] S20: When constructing a missing data repair model that does not consider spatiotemporal correlation, the classic naive interpolation method and linear interpolation method are selected to repair the missing traffic flow data;
[0027] S21: When constructing a missing data repair model that considers spatiotemporal correlation, select different iterative imputation-based models based on Bayesian ridge regression, K-nearest neighbor regression, and random forest to repair missing traffic flow data.
[0028] Further settings: Step S20 specifically includes the following steps:
[0029] S20-1: Check the traffic flow time series data of the constructed study road network for missing values in sequence, denoting the traffic flow time series data as $N = (N_1, N_2, \ldots, N_g, \ldots, N_j)$;
[0030] S20-2: The classic naive imputation method repairs missing traffic flow data using the complete value preceding the missing value. Let the missing value be $N_g$ ($2 \le g \le j$); the preceding complete value $N_{g-1}$ is used to fill the missing value $N_g$, until all missing values are repaired. If $N_1$ is missing, $N_1$ takes the complete value nearest to $N_1$;
[0031] S20-3: The linear interpolation method computes a linear function from the two complete values nearest to each missing value and repairs the missing values in sequence. Let the missing value be $N_g$ ($1 \le g \le j$); find the two complete values $N_x(m_x, n_x)$ and $N_y(m_y, n_y)$ nearest to the missing value $N_g(m_g, n_g)$, where $m_g$ is the time and $n_g$ is the flow at time $m_g$. Substituting $N_x(m_x, n_x)$ and $N_y(m_y, n_y)$ gives the linear function:
[0032] $$n = n_x + \frac{n_y - n_x}{m_y - m_x}\,(m - m_x)$$
[0033] Substitute the time $m_g$ of the missing value into the linear function to repair the missing value $n_g$, and repeat the above steps until all missing values are repaired.
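The two repair methods of step S20 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the patent's code: missing values are represented by `None`, "the two nearest complete values" is read as the nearest complete value on each side of the gap, and a gap at either end of the series (where only one side exists) is filled with the single nearest complete value.

```python
def naive_fill(flows):
    """Fill each missing value with the preceding complete value (S20-2)."""
    filled = list(flows)
    first = next((v for v in filled if v is not None), None)
    if first is None:
        return filled  # nothing observed, nothing to copy
    last = first  # a missing N_1 takes the nearest complete value
    for g, value in enumerate(filled):
        if value is None:
            filled[g] = last
        else:
            last = value
    return filled

def linear_fill(times, flows):
    """Fill each missing value on the line through its neighbouring complete values (S20-3)."""
    filled = list(flows)
    known = [g for g, v in enumerate(flows) if v is not None]
    for g, value in enumerate(flows):
        if value is not None:
            continue
        left = max((k for k in known if k < g), default=None)
        right = min((k for k in known if k > g), default=None)
        if left is None or right is None:
            # Assumption: at the edges, copy the only available neighbour.
            nearest = right if left is None else left
            if nearest is not None:
                filled[g] = flows[nearest]
            continue
        slope = (flows[right] - flows[left]) / (times[right] - times[left])
        filled[g] = flows[left] + slope * (times[g] - times[left])
    return filled

# Hypothetical 5-minute samples (time in minutes, flow in vehicles per interval).
times = [0, 5, 10, 15, 20]
flows = [100, None, None, 130, None]
print(naive_fill(flows))
print(linear_fill(times, flows))
```

Neither function looks at other days or other detectors, which is what distinguishes these models from the ones that consider spatiotemporal correlation.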
[0034] Further settings: Step S21 specifically includes the following steps:
[0035] S21-1: Based on the traffic flow spatiotemporal correlation dataset, construct an i×j-dimensional traffic flow data matrix, where i is the number of rows in the matrix, representing the total number of samples on the observation day, and j is the number of columns in the matrix;
[0036] S21-2: Construct a missing data repair model based on nearest neighbor regression. The process includes constructing a traffic flow missing data state vector, constructing a traffic flow historical data state vector library, calculating the Euclidean distance between the traffic flow missing data state vector and the historical data state vector, selecting the correlation vector for traffic flow missing data repair, repairing the missing traffic flow data, and repeating the operation until all missing values of the traffic flow data are repaired.
[0037] S21-3: Construct a missing value repair model based on random forest. The process includes sorting the data matrix according to its features, sorting the columns of the data matrix from the fewest to the most missing values, with the first column having the fewest missing values and the last column having the most missing values. After sorting, fill all missing values in all columns with 0 values.
[0038] Based on the sorted data matrix, construct training and test sets, train a random forest regression model, and use the trained random forest regression model to update the matrix. Repeat the operation for multiple iterations, and use the random forest prediction from the last iteration as the missing data repair value.
[0039] S21-4: Construct a missing value repair model based on Bayesian ridge regression. The process includes sorting the data matrix according to its features, sorting the columns by the number of missing values from fewest to most, with the first column having the fewest missing values and the last column having the most missing values. After sorting, all missing values in all columns are filled with 0 values.
[0040] The training and test sets are divided according to the sorted data matrix. The Bayesian ridge regression model is optimized, fitted, and trained, and the matrix is updated using the optimized Bayesian ridge regression missing data repair model. The operation is repeated for multiple iterations, and the Bayesian ridge regression prediction from the last iteration is used as the missing data repair value.
[0041] S21-5: Select hyperparameters for the K-nearest neighbor regression model and the random forest regression model, including optimizing the number of neighbors in the K-nearest neighbor missing data repair model and the number of regression trees in the random forest missing data repair model.
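The iterative loop shared by the random forest and Bayesian ridge models in step S21 (order columns by missing count, fill with 0, then repeatedly predict each column from the others) can be sketched as below. This is a simplified illustration, not the patent's implementation. The function names are hypothetical, and `nearest_mean_fit` is a deliberately simple stand-in regressor so that the example runs with the standard library only; in practice a random forest or Bayesian ridge regressor would be plugged in through the same `fit` argument.

```python
def nearest_mean_fit(train_x, train_y, k=2):
    """Stand-in regressor (assumption): mean target of the k nearest training rows."""
    def predict(row):
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, x)), idx)
            for idx, x in enumerate(train_x)
        )
        nearest = [train_y[idx] for _, idx in ranked[:k]]
        return sum(nearest) / len(nearest)
    return predict

def iterative_impute(matrix, fit, n_iter=5):
    """matrix: list of rows, None marks a missing entry; fit(X, y) returns predict(row)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    missing = [[matrix[r][c] is None for c in range(n_cols)] for r in range(n_rows)]
    # Initial fill with 0, as described in the text.
    filled = [[0.0 if missing[r][c] else float(matrix[r][c]) for c in range(n_cols)]
              for r in range(n_rows)]
    # Visit columns from fewest to most missing values.
    order = sorted(range(n_cols), key=lambda c: sum(missing[r][c] for r in range(n_rows)))
    for _ in range(n_iter):
        for c in order:
            miss_rows = [r for r in range(n_rows) if missing[r][c]]
            obs_rows = [r for r in range(n_rows) if not missing[r][c]]
            if not miss_rows or not obs_rows:
                continue
            def others(r, col=c):
                return [filled[r][k] for k in range(n_cols) if k != col]
            predict = fit([others(r) for r in obs_rows], [filled[r][c] for r in obs_rows])
            for r in miss_rows:
                filled[r][c] = predict(others(r))  # overwrite the previous estimate
    return filled

# Hypothetical matrix: rows are time slots, columns are correlated series.
data = [[10, 11, 9], [20, None, 19], [30, 31, None], [40, 41, 39]]
for row in iterative_impute(data, nearest_mean_fit):
    print(row)
```

Only the entries that were originally missing are ever overwritten; the values from the final pass are the repair values.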
[0042] Further settings: Step S3 also includes the following steps:
[0043] S30: Different modeling methods are set up in discrete missing cases and continuous missing cases respectively. The evaluation indicators of the missing repair model without considering spatiotemporal correlation and the missing repair model considering spatiotemporal correlation are summed according to the discrete missing ratio and the continuous missing ratio respectively. The average value is calculated to obtain the average performance of different repair models under discrete missing ratio and continuous missing ratio. The average index of traffic flow missing repair is analyzed based on the different road type datasets under discrete missing cases and continuous missing cases on weekdays and non-weekdays.
[0044] S31: Based on the average index of traffic flow missing repair in discrete missing scenarios on weekdays and non-weekdays for different road type datasets, the model evaluation error considering spatiotemporal correlation is significantly smaller than the model evaluation error not considering spatiotemporal correlation, and the performance of the repair models built on weekday and non-weekday datasets is similar.
[0045] When the data is highly dispersed and traffic flow fluctuates greatly, models that consider spatiotemporal correlation perform better than those that do not. When data dispersion is smaller and consecutive values change less, missing data repair models that do not consider spatiotemporal correlation achieve repair effects similar to those that do.
[0046] S32: Based on the average index of traffic flow missing repair in the case of continuous missing traffic flow on weekdays and non-weekdays for different road type datasets, the model evaluation error considering spatiotemporal correlation is significantly smaller than the model evaluation error not considering spatiotemporal correlation, and the repair models built on weekday and non-weekday datasets have similar performance.
[0047] Under the same data missing duration, the smaller the traffic flow data collection interval and the higher the collection frequency, the larger the amount of missing data, and the higher the error of the model repair. The smaller the data dispersion, the better the performance of the repair model.
[0048] S33: The above evaluation of the models that consider spatiotemporal correlation and those that do not shows that a model considering spatiotemporal correlation captures traffic flow missing data patterns from both the temporal and spatial dimensions, which makes it more comprehensive and accurate in repairing missing data. At the same time, the repair effect of the models in discrete missing cases is better than in continuous missing cases.
[0049] Further steps: In step S4, the performance of different traffic flow missing data repair models considering spatiotemporal correlation is comprehensively compared under discrete and continuous missing data scenarios. The error indices of different road traffic flow missing data repair models under different missing data scenarios are analyzed. This also includes the following steps:
[0050] S40: The K-nearest neighbor regression missing data repair model, random forest missing data repair model, and Bayesian ridge regression missing data repair model are set up in discrete missing data cases and continuous missing data cases respectively. They are classified according to weekdays and non-weekdays. The error indices of different traffic flow missing data repair models are accumulated and summed, and then averaged to obtain the evaluation index reflecting the average performance of the traffic flow missing data repair models. A comprehensive comparison is then made.
[0051] S41: In the case of discrete missing data, both on weekdays and non-weekdays, the Bayesian Ridge Regression Missing Data Repair Model and the Random Forest Missing Data Repair Model outperform the K Nearest Neighbor Regression Missing Data Repair Model. In the weekday dataset, when the discrete missing data ratio is less than 10%, the Random Forest Missing Data Repair Model outperforms the Bayesian Ridge Regression Missing Data Repair Model. When the discrete missing data ratio is greater than 10%, the Bayesian Ridge Regression Missing Data Repair Model performs better. In the non-weekday dataset, the Bayesian Ridge Regression Missing Data Repair Model outperforms both the Random Forest Missing Data Repair Model and the K Nearest Neighbor Regression Missing Data Repair Model.
[0052] S42: In the case of consecutive missing values, the Bayesian Ridge Regression missing value repair model outperforms the Random Forest missing value repair model and the K-Nearest Neighbors Regression missing value repair model on both weekdays and non-weekdays. Among them, the K-Nearest Neighbors Regression missing value repair model performs the worst. In the weekday dataset, when the consecutive missing value duration is less than or equal to 3 hours, the K-Nearest Neighbors Regression missing value repair model is more convenient for big data computation and processing. When the consecutive missing value duration is greater than or equal to 6 hours, the Bayesian Ridge Regression missing value repair model outperforms the other two models.
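The averaging of error indices described in S40 can be sketched as follows. The text does not name the error indices in this passage, so the use of MAE and RMSE here is an assumption, as are the function names; the numbers in the example are placeholders for illustration and are not results from the patent.

```python
from math import sqrt

def mae(truth, repaired):
    """Mean absolute error between true and repaired values."""
    return sum(abs(t - r) for t, r in zip(truth, repaired)) / len(truth)

def rmse(truth, repaired):
    """Root mean square error between true and repaired values."""
    return sqrt(sum((t - r) ** 2 for t, r in zip(truth, repaired)) / len(truth))

def average_index(errors_by_scenario):
    """Sum one model's error index over all scenarios, then average."""
    return sum(errors_by_scenario.values()) / len(errors_by_scenario)

def rank_models(errors):
    """errors: {model: {scenario: error}}; model names, lowest average error first."""
    return sorted(errors, key=lambda model: average_index(errors[model]))

# Placeholder error values keyed by discrete missing ratio (not measured data).
errors = {
    "knn": {0.05: 3.0, 0.10: 4.0},
    "random_forest": {0.05: 2.0, 0.10: 3.5},
    "bayesian_ridge": {0.05: 2.5, 0.10: 2.5},
}
print(rank_models(errors))
```

The same averaging would be run separately for weekday and non-weekday datasets and for discrete and continuous missing cases before the models are compared.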
[0053] Further setup: In step S4, combining the advantages of different traffic flow missing data repair models, a general strategy for traffic flow data missing data repair with good adaptability is formulated, which also includes the following steps:
[0054] S4-1: Obtain traffic flow data for the road network under study and determine whether there are any missing traffic flow data;
[0055] S4-2: When it is determined that there are missing traffic flow data in the road network under study, the time of the missing traffic flow data is determined to be a weekday or a non-weekday;
[0056] S4-3: When traffic flow data is missing on non-weekdays, a Bayesian ridge regression missing data repair model is used to repair the missing data;
[0057] S4-4: When traffic flow data is missing on weekdays, determine the circumstances of the missing data. The circumstances of missing data include discrete missing data and continuous missing data.
[0058] S4-5: When the missing data is continuous, determine whether the duration of the missing data is greater than or equal to 3 hours. If the duration of the missing data is greater than or equal to 3 hours, use the Bayesian Ridge Regression Missing Data Repair Model to repair the missing data. If the duration of the missing data is less than 3 hours, use the K-Nearest Neighbor Regression Missing Data Repair Model to repair the missing data.
[0059] S4-6: When the missing data is discrete, determine whether the missing data ratio is greater than or equal to 10%. If the missing data ratio is greater than or equal to 10%, use the Bayesian Ridge Regression Missing Data Repair Model to repair the missing data. If the missing data ratio is less than 10%, use the Random Forest Missing Data Repair Model to repair the missing data.
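The decision flow of steps S4-2 to S4-6 can be written as a small selection function. This is a sketch: the function name, argument names, and returned labels are hypothetical, while the thresholds (3 hours for continuous gaps, 10% for discrete gaps) are those stated in S4-5 and S4-6.

```python
def choose_repair_model(is_weekday, missing_type=None, missing_ratio=None, missing_hours=None):
    """Pick a repair model following the S4-2 to S4-6 decision flow.

    missing_type is "discrete" or "continuous"; missing_ratio is a fraction
    (0.10 means 10%); missing_hours is the gap duration in hours.
    """
    if not is_weekday:
        return "bayesian_ridge"  # S4-3: non-weekday data
    if missing_type == "continuous":
        if missing_hours is None:
            raise ValueError("missing_hours is required for continuous gaps")
        # S4-5: 3 hours or longer uses Bayesian ridge, shorter gaps use KNN.
        return "bayesian_ridge" if missing_hours >= 3 else "knn"
    if missing_type == "discrete":
        if missing_ratio is None:
            raise ValueError("missing_ratio is required for discrete gaps")
        # S4-6: 10% or more uses Bayesian ridge, lower ratios use random forest.
        return "bayesian_ridge" if missing_ratio >= 0.10 else "random_forest"
    raise ValueError("missing_type must be 'discrete' or 'continuous'")

print(choose_repair_model(True, "discrete", missing_ratio=0.05))
print(choose_repair_model(True, "continuous", missing_hours=1.5))
print(choose_repair_model(False))
```

Step S4-1 (checking whether any data is missing at all) is assumed to have happened before this function is called.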
[0060] Compared with the prior art, the beneficial effects of the present invention are as follows: traffic flow data is collected from different research road networks; traffic flow data missing repair models that consider spatiotemporal correlation and models that do not are constructed; the performance and applicability of the various models are comprehensively evaluated under different influencing factors; and a general strategy for traffic flow data missing repair using the three types of models is formulated.
Attached Figure Description
[0061] To make the content of this invention easier to understand, the invention will be further described in detail below with reference to specific embodiments and accompanying drawings.
[0062] Figure 1 is a schematic diagram illustrating the steps of an evaluation and strategy design method for a traffic flow data missing repair system according to the present invention;
[0063] Figure 2 is a detailed schematic diagram of step S1 of the traffic flow data missing repair system evaluation and strategy design method of the present invention;
[0064] Figure 3 is a detailed schematic diagram of step S2 of the traffic flow data missing repair system evaluation and strategy design method of the present invention;
[0065] Figure 4 is a detailed schematic diagram of step S3 of the traffic flow data missing repair system evaluation and strategy design method of the present invention;
[0066] Figure 5 is a detailed schematic diagram of step S4 of the traffic flow data missing repair system evaluation and strategy design method of the present invention;
[0067] Figure 6 is a detailed schematic diagram of step S4 of the traffic flow data missing repair system evaluation and strategy design method of the present invention;
[0068] Figure 7 is a schematic diagram illustrating the specific implementation steps of the traffic flow data missing repair system evaluation and strategy design method of the present invention.
Detailed Implementation
[0069] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0070] Referring to Figures 1-7, in an embodiment of the present invention, as shown in Figure 1, an evaluation and strategy design method for a traffic flow data missing repair system comprises:
[0071] S1: Construct a research dataset using time series data of traffic flow at different cross-sections, evaluate the different degrees of data missing due to multiple factors, including discrete and continuous data missing, and analyze various traffic flow data missing repair models under different data missing conditions.
[0072] It should be further noted that, as shown in Figure 2, step S1 also includes the following steps:
[0073] S10: Based on the traffic flow time series data of the constructed research road network, datasets with discrete and continuous missing conditions are constructed by randomly and continuously removing traffic flow time series data.
[0074] S11: Based on the dataset, model missing road traffic flow information, considering the spatiotemporal correlation of traffic flow. The Pearson product-moment correlation coefficient method is used to measure the spatiotemporal correlation of road traffic flow from both temporal and spatial dimensions, according to the formula:
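[0075] $$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$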
[0076] In the above formula, r (r∈[-1,1]) is the Pearson product-moment correlation coefficient between the traffic flow time series x to be repaired and the associated traffic flow time series y;
[0077] If y is a historical time series of x at the same traffic monitoring station, then r is defined as the time correlation coefficient between the two.
[0078] If y is a time series of x associated with nearby traffic monitoring stations, then r is defined as the spatial correlation coefficient;
[0079] where $x_i$ is the i-th observation sample in time series x, $y_i$ is the i-th observation sample in time series y, n is the number of observation samples in time series x and y, and $\bar{x}$ and $\bar{y}$ are the average values of time series x and y, respectively;
[0080] If r = 0, it indicates that there is no linear correlation between x and y. If r > 0, it indicates that there is a positive correlation between x and y, and y increases continuously as x increases. If r < 0, it indicates that there is a negative correlation between x and y, and y decreases continuously as x increases. The larger the absolute value of r, the stronger the spatiotemporal correlation between x and y.
[0081] S12: From the series related to the missing traffic flow time series, select the traffic flow time series with the strongest spatiotemporal correlation and construct a spatiotemporal correlation dataset for training the traffic flow missing repair model. Specifically, calculate the temporal correlation coefficient, denoted $r_T$, between the missing time series and the traffic flow time series of adjacent days and of the corresponding days of adjacent weeks. Calculate the spatial correlation coefficient, denoted $r_S$, between the missing time series and the traffic flow time series of upstream and downstream road segments and adjacent lanes. Then select the two traffic flow time series with the highest $r_T$ and the highest $r_S$, respectively; together with the missing time series, they constitute the traffic flow spatiotemporal correlation dataset.
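The dataset construction in step S10 (removing samples at random for discrete missing data and removing a consecutive run for continuous missing data) can be sketched as follows. The function names, the `None` marker, the fixed random seed, and the example series are assumptions made for illustration.

```python
import random

def mask_discrete(series, ratio, seed=0):
    """Remove a given fraction of samples at random positions (discrete missing)."""
    rng = random.Random(seed)
    n_missing = round(len(series) * ratio)
    removed = set(rng.sample(range(len(series)), n_missing))
    return [None if i in removed else v for i, v in enumerate(series)]

def mask_continuous(series, duration_minutes, interval_minutes, seed=0):
    """Remove one consecutive run of samples covering the given duration."""
    rng = random.Random(seed)
    length = duration_minutes // interval_minutes  # samples lost during the gap
    if not 0 < length <= len(series):
        raise ValueError("gap must cover between 1 and len(series) samples")
    start = rng.randrange(len(series) - length + 1)
    return [None if start <= i < start + length else v for i, v in enumerate(series)]

# Hypothetical day of 5-minute flow counts (288 samples).
flows = [100 + (i % 12) for i in range(288)]
discrete = mask_discrete(flows, ratio=0.10)
continuous = mask_continuous(flows, duration_minutes=180, interval_minutes=5)
print(discrete.count(None), continuous.count(None))
```

Because the original values are kept, the removed positions provide the ground truth against which each repair model can later be scored.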
[0082] S2: Based on different modeling methods, two types of traffic flow data missing repair models were constructed, one without considering spatiotemporal correlation and the other considering spatiotemporal correlation. The missing traffic flow data was repaired and analyzed according to different data repair methods.
[0083] It should be further noted that, as shown in Figure 3, step S2 also includes the following steps:
[0084] S20: When constructing a missing data repair model that does not consider spatiotemporal correlation, the classic naive interpolation method and linear interpolation method are selected to repair the missing traffic flow data;
[0085] S21: When constructing a missing data repair model that considers spatiotemporal correlation, select different iterative imputation-based models based on Bayesian ridge regression, K-nearest neighbor regression, and random forest to repair missing traffic flow data.
[0086] Specifically, step S20 includes the following steps:
[0087] S20-1: Check the traffic flow time series data of the constructed study road network for missing values in sequence, denoting the traffic flow time series data as $N = (N_1, N_2, \ldots, N_g, \ldots, N_j)$;
[0088] S20-2: The classic naive imputation method repairs missing traffic flow data using the complete value preceding the missing value. Let the missing value be $N_g$ ($2 \le g \le j$); the preceding complete value $N_{g-1}$ is used to fill the missing value $N_g$, until all missing values are repaired. If $N_1$ is missing, $N_1$ takes the complete value nearest to $N_1$;
[0089] S20-3: The linear interpolation method computes a linear function from the two complete values nearest to each missing value and repairs the missing values in sequence. Let the missing value be $N_g$ ($1 \le g \le j$); find the two complete values $N_x(m_x, n_x)$ and $N_y(m_y, n_y)$ nearest to the missing value $N_g(m_g, n_g)$, where $m_g$ is the time and $n_g$ is the flow at time $m_g$. Substituting $N_x(m_x, n_x)$ and $N_y(m_y, n_y)$ gives the linear function:
[0090] $$n = n_x + \frac{n_y - n_x}{m_y - m_x}\,(m - m_x)$$
[0091] Substitute the time $m_g$ of the missing value into the linear function to repair the missing value $n_g$, and repeat the above steps until all missing values are repaired.
[0092] Specifically, step S21 includes the following steps:
[0093] S21-1: Based on the traffic flow spatiotemporal correlation dataset, construct an i×j-dimensional traffic flow data matrix, where i is the number of rows in the matrix, representing the total number of samples on the observation day, and j is the number of columns in the matrix;
[0094] S21-2: Constructing a missing data repair model based on nearest neighbor regression, the process includes:
[0095] A1: Let the column of data to be repaired be $X_g$ ($1 \le g \le j$), and let A be the i×j-dimensional traffic flow data matrix, where each column is a feature of matrix A, according to the formula:
[0096] $$A = (X_1, X_2, \ldots, X_g, \ldots, X_j)$$
[0097] A2: Construct the state vector for missing traffic flow data. Let $H_{mis}$ be the index of a row containing missing data in column $X_g$, and establish the missing-data state vector from the row vector with row index $H_{mis}$;
[0098] A3: Construct a traffic flow historical data state vector library. In the traffic flow data matrix A, establish the historical data state vector library from the rows other than the row with index $H_{mis}$;
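[0099] A4: Define the Euclidean distance between the state vector of missing traffic flow data and the i-th historical data state vector as $d_i$, calculated according to the formula:
[0100] $$d_i = \sqrt{\sum_{k \ne g}\left(a_{H_{mis},k} - a_{i,k}\right)^2}$$ where $a_{r,k}$ denotes the entry of the data matrix in row r and column k, and the sum runs over the columns other than the column $X_g$ being repaired;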
[0101] A5: Select the correlation vectors for traffic flow missing repair: sort the distances $d_i$ in ascending order and take the K state vectors closest in space as the missing data repair correlation vectors;
[0102] A6: Repair the missing traffic flow data: take the values of the K correlation vectors in column $X_g$, compute their weighted average, and use the weighted average as the missing repair value;
[0103] Repeat steps A4, A5, and A6 until all missing values in column $X_g$ have been repaired.
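Steps A1 to A6 can be sketched in Python as below. This is an illustrative sketch, not the patent's code. The text does not specify the weights used in the weighted average, so inverse-distance weighting is an assumption here, as are the function name, the `None` marker for missing values, and the requirement that the other columns of each row are complete.

```python
from math import sqrt

def knn_repair_column(matrix, g, k=3):
    """Repair missing entries (None) in column g from the K nearest complete rows."""
    repaired = [list(row) for row in matrix]
    other_cols = [c for c in range(len(matrix[0])) if c != g]
    missing_rows = [r for r, row in enumerate(matrix) if row[g] is None]
    # Historical state vector library: rows whose column-g value is observed.
    history_rows = [r for r, row in enumerate(matrix) if row[g] is not None]
    for h in missing_rows:
        state = [matrix[h][c] for c in other_cols]  # missing-data state vector
        distances = []
        for r in history_rows:
            d = sqrt(sum((s - matrix[r][c]) ** 2 for s, c in zip(state, other_cols)))
            distances.append((d, r))
        nearest = sorted(distances)[:k]  # K closest historical vectors
        if not nearest:
            continue  # no history available, leave the gap
        # Assumption: inverse-distance weights; the small constant avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        total = sum(w * matrix[r][g] for w, (_, r) in zip(weights, nearest))
        repaired[h][g] = total / sum(weights)
    return repaired

# Hypothetical matrix: rows are time slots, column 1 has one missing value.
data = [[10, 11, 9], [20, None, 19], [30, 31, 29], [40, 41, 39]]
print(knn_repair_column(data, g=1, k=2)[1])
```

The number of neighbors K is the hyperparameter that step S21-5 tunes for this model.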
[0104] S21-3: Construct a missing data repair model based on random forest, the process of which includes:
[0105] B1: Let the column of data to be repaired be $X_g$ ($1 \le g \le j$), and let B be the i×j-dimensional traffic flow data matrix, where each column is a feature of matrix B, according to the formula:
[0106]
[0107] B2: Order the features of the data matrix. Arrange the columns of the data matrix in ascending order of their number of missing values, so that the column with the fewest missing values is the first column and the column with the most missing values is the last column. After ordering, fill the missing values in all columns with 0 values.
[0108] B3: Construct the training and test sets from the ordered data matrix. Select the feature S with the smallest missing proportion and divide matrix B into 4 parts: the missing part of feature S; the non-missing part of feature S; the part of all variables other than S at the row indices where S is missing; and the part of all variables other than S at the row indices where S is not missing.
[0109] B4: Train the random forest regression model. The non-missing part of feature S and the corresponding part of the other variables form a new dataset D. From D, n rows are drawn by sampling with replacement as a training set, and a regression tree is grown on the sampled data; at each decision tree node, m features are selected without repetition and the Gini index is used to find the best splitting feature. The sampling is repeated N times to generate N regression trees, and the results of the N regression trees are fused to obtain the final repair model.
[0110] B5: Use the trained random forest regression model to update the matrix. With the part of the other variables at the row indices where S is missing as predictive features, predict the missing part of feature S and obtain the predicted values $D_i$, which replace the 0 values. Features S are taken in ascending order of missing proportion; every feature with missing values is predicted in turn, and the random forest predictions are filled into matrix B. This completes one iteration.
[0111] B6: Repeat steps B3, B4, and B5 until the loop iterates M times, and use the last random forest prediction value as the missing data repair value.
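The iterative loop of steps B2 to B6 can be sketched with a pluggable regressor. This is a simplified illustration under stated assumptions, not the patented model: the default regressor `bagged_stumps` is a hypothetical stand-in that bags depth-1 regression trees, splits on squared error rather than the Gini index named in B4, and considers all features at each node instead of a random subset of m features. All names are hypothetical, and `None` marks a missing value.

```python
import random
from typing import Callable, List, Optional

Row = List[float]


def fit_stump(xs: List[Row], ys: List[float]) -> Callable[[Row], float]:
    """Depth-1 regression tree: best (feature, threshold) by squared error."""
    best = None
    for f in range(len(xs[0])):
        for t in sorted({x[f] for x in xs})[:-1]:
            left = [y for x, y in zip(xs, ys) if x[f] <= t]
            right = [y for x, y in zip(xs, ys) if x[f] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, ml, mr)
    if best is None:  # no usable split: predict the sample mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, bf, bt, bl, br = best
    return lambda x: bl if x[bf] <= bt else br


def bagged_stumps(train_x: List[Row], train_y: List[float], test_x: List[Row],
                  n_trees: int = 25, seed: int = 0) -> List[float]:
    """B4 stand-in: N trees on bootstrap samples, fused by averaging."""
    rng = random.Random(seed)
    n = len(train_x)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        trees.append(fit_stump([train_x[i] for i in idx], [train_y[i] for i in idx]))
    return [sum(tree(x) for tree in trees) / n_trees for x in test_x]


def iterative_repair(matrix: List[List[Optional[float]]],
                     fit_predict=bagged_stumps, n_iter: int = 5) -> List[Row]:
    n_rows, n_cols = len(matrix), len(matrix[0])
    missing = [[v is None for v in row] for row in matrix]
    # B2: fill missing entries with 0 and order columns by missing count.
    filled = [[0.0 if v is None else float(v) for v in row] for row in matrix]
    order = sorted(range(n_cols), key=lambda c: sum(missing[r][c] for r in range(n_rows)))
    for _ in range(n_iter):  # B6: repeat M times
        for s in order:
            mis_rows = [r for r in range(n_rows) if missing[r][s]]
            obs_rows = [r for r in range(n_rows) if not missing[r][s]]
            if not mis_rows or not obs_rows:
                continue
            others = [c for c in range(n_cols) if c != s]
            # B3: split into observed/missing parts of S and of the other variables.
            x_obs = [[filled[r][c] for c in others] for r in obs_rows]
            y_obs = [filled[r][s] for r in obs_rows]
            x_mis = [[filled[r][c] for c in others] for r in mis_rows]
            # B4 and B5: train on the observed part, overwrite the placeholders.
            for r, pred in zip(mis_rows, fit_predict(x_obs, y_obs, x_mis)):
                filled[r][s] = pred
    return filled
```

Because the regressor is passed in as `fit_predict`, the same loop also describes the Bayesian ridge variant of S21-4: only the model trained inside the loop changes.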
[0112] S21-4: Constructing a missing data repair model based on Bayesian ridge regression, the process includes:
[0113] C1: Let C be the i×j dimensional traffic flow data matrix, where each column represents a feature of matrix C, according to the formula:
[0114]
[0115] C2: Order the features of the data matrix. Arrange the columns in ascending order of their number of missing values, so that the first column has the fewest missing values and the last column has the most. After ordering, fill all missing values in all columns with 0 values.
[0116] C3: Divide the training and test sets from the ordered data matrix. First select the feature S with the smallest missing rate and divide matrix C into 4 parts: the missing part of feature S; the non-missing part of feature S; the part of all variables other than S at the row indices where S is missing; and the part of all variables other than S at the row indices where S is not missing.
[0117] C4: Optimize and train the Bayesian model to fit the data. Based on Bayes' theorem, derive the distribution of parameter w in the basic formula of linear regression. According to Bayes' theorem:
[0118]
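[0119] With the non-missing part of the other variables as input and the non-missing part of feature S as output, the Bayesian model is optimized and fitted according to the formula:
[0120] $y = f(x) + \varepsilon = w^{T}x + \varepsilon$
[0121] where $w$ is the weight coefficient vector, $\varepsilon$ is the residual, and $\varepsilon \sim N(0, \sigma^{2})$;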
[0122] C5: Use the optimized Bayesian ridge regression missing data repair model to update the matrix. With the part of the other variables at the row indices where S is missing as input, predict the missing part of feature S and obtain the predicted values $D_i$, which replace the 0 values. Features S are taken in ascending order of missing proportion; every feature with missing values is predicted in turn, and the predicted values are filled into matrix C. This completes one iteration.
[0123] C6: Repeat steps C3, C4, and C5 until M iterations are completed, and use the last Bayesian regression prediction value as the missing data repair value.
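The Bayesian fit in step C4 can be made concrete with the standard conjugate Gaussian result. This is a textbook derivation offered as a sketch, not the patent's own formula. It assumes a zero-mean Gaussian prior on $w$ with precision $\lambda$ and stacks the training inputs row-wise into a matrix $X$; both symbols are introduced here and do not appear in the source.

```latex
\begin{aligned}
p(w \mid X, y) &\propto p(y \mid X, w)\, p(w), \\
p(y \mid X, w) &= \mathcal{N}\!\left(y \mid Xw,\ \sigma^{2} I\right), \qquad
p(w) = \mathcal{N}\!\left(w \mid 0,\ \lambda^{-1} I\right), \\
p(w \mid X, y) &= \mathcal{N}\!\left(w \mid \mu,\ \Sigma\right), \qquad
\Sigma = \left(\lambda I + \sigma^{-2} X^{T} X\right)^{-1}, \\
\mu &= \sigma^{-2}\, \Sigma X^{T} y
     = \left(X^{T} X + \lambda \sigma^{2} I\right)^{-1} X^{T} y .
\end{aligned}
```

The posterior mean $\mu$ has the form of a ridge regression solution with penalty $\lambda\sigma^{2}$. Bayesian ridge implementations commonly estimate $\sigma^{2}$ and $\lambda$ from the data instead of fixing them in advance.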
[0124] S21-5: Select hyperparameters for the K-nearest neighbor regression model and the random forest regression model, including optimizing the number of neighbors in the K-nearest neighbor missing data repair model and the number of regression trees in the random forest missing data repair model.
[0125] S3: Evaluate the missing data repair performance of the traffic flow data missing repair models that do not consider spatiotemporal correlation and of those that do. Compare and analyze the model evaluation indicators of the different modeling methods under discrete missing data and continuous missing data. The comparison shows that the models that consider spatiotemporal correlation repair missing data more comprehensively and accurately than the models that do not, and that the models repair discrete missing data better than continuous missing data.
[0126] Further, as shown in Figure 4, step S3 also includes the following steps:
[0127] S30: The different modeling methods are applied to the discrete missing case and the continuous missing case respectively. The evaluation indicators of the repair models that do not consider spatiotemporal correlation and of those that do are summed over the discrete missing ratios and over the continuous missing ratios respectively, and then averaged, giving the average performance of each repair model under discrete and continuous missing data. The average traffic flow missing repair indicators are then analyzed for the different road type datasets, on weekdays and non-weekdays, under discrete and continuous missing data.
[0128] S31: Based on the average index of traffic flow missing repair in discrete missing scenarios on weekdays and non-weekdays for different road type datasets, the model evaluation error considering spatiotemporal correlation is significantly smaller than the model evaluation error not considering spatiotemporal correlation, and the performance of the repair models built on weekday and non-weekday datasets is similar.
[0129] When the data has high dispersion and traffic flow fluctuates greatly, models that consider spatiotemporal correlation perform better than those that do not. When data dispersion is small and consecutive values change little, repair models that do not consider spatiotemporal correlation achieve repair effects similar to those that do.
[0130] S32: Based on the average index of traffic flow missing repair in the case of continuous missing traffic flow on weekdays and non-weekdays for different road type datasets, the model evaluation error considering spatiotemporal correlation is significantly smaller than the model evaluation error not considering spatiotemporal correlation, and the repair models built on weekday and non-weekday datasets have similar performance.
[0131] Under the same data missing duration, the smaller the traffic flow data collection interval and the higher the collection frequency, the larger the amount of missing data, and the higher the error of the model repair. The smaller the data dispersion, the better the performance of the repair model.
[0132] S33: The above evaluation of the models that consider spatiotemporal correlation and of those that do not shows that considering spatiotemporal correlation captures traffic flow missing data patterns in both the temporal and spatial dimensions, which makes these models more comprehensive and accurate in repairing missing data. In addition, the models repair discrete missing data better than continuous missing data.
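The averaging described in S30 can be sketched as follows. This section does not name the evaluation indicators, so mean absolute error (MAE) and root mean square error (RMSE) are used here purely as assumed examples. The nested dictionary layout and the function name `average_indicators` are also hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (true values, repaired values)


def mae(truth: Sequence[float], repaired: Sequence[float]) -> float:
    return sum(abs(t - r) for t, r in zip(truth, repaired)) / len(truth)


def rmse(truth: Sequence[float], repaired: Sequence[float]) -> float:
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(truth, repaired)) / len(truth))


def average_indicators(results: Dict[str, Dict[float, Pair]]) -> Dict[str, Dict[str, float]]:
    """results[model][missing_ratio] holds the true and repaired values at the
    positions that were removed on purpose. Each indicator is summed over the
    missing ratios and divided by their count."""
    summary = {}
    for model, by_ratio in results.items():
        maes = [mae(t, r) for t, r in by_ratio.values()]
        rmses = [rmse(t, r) for t, r in by_ratio.values()]
        summary[model] = {"MAE": sum(maes) / len(maes),
                          "RMSE": sum(rmses) / len(rmses)}
    return summary
```

Running the same function separately on weekday and non-weekday datasets, and on discrete and continuous missing cases, gives the per-scenario averages that S31 and S32 compare.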
[0133] S4: By comprehensively comparing the performance of different traffic flow missing data repair models that consider spatiotemporal correlation under discrete and continuous missing data scenarios, the error indices of different road traffic flow missing data repair models under different missing data scenarios are analyzed. Combining the advantages of different traffic flow missing data repair models, a general strategy for traffic flow data missing data repair with good adaptability is formulated.
[0134] Further, as shown in Figure 5, the part of step S4 that comprehensively compares the performance of the different traffic flow missing repair models considering spatiotemporal correlation under discrete and continuous missing cases, and analyzes the error indices of the different road traffic flow missing repair models under different missing cases, also includes the following steps:
[0135] S40: The K-nearest neighbor regression missing data repair model, the random forest missing data repair model, and the Bayesian ridge regression missing data repair model are applied to the discrete missing case and the continuous missing case respectively, and the results are classified by weekdays and non-weekdays. The error indices of each traffic flow missing data repair model are summed and then averaged to obtain evaluation indices that reflect the average performance of the models, and a comprehensive comparison is then made.
[0136] S41: In the case of discrete missing data, both on weekdays and non-weekdays, the Bayesian Ridge Regression Missing Data Repair Model and the Random Forest Missing Data Repair Model outperform the K Nearest Neighbor Regression Missing Data Repair Model. In the weekday dataset, when the discrete missing data ratio is less than 10%, the Random Forest Missing Data Repair Model outperforms the Bayesian Ridge Regression Missing Data Repair Model. When the discrete missing data ratio is greater than 10%, the Bayesian Ridge Regression Missing Data Repair Model performs better. In the non-weekday dataset, the Bayesian Ridge Regression Missing Data Repair Model outperforms both the Random Forest Missing Data Repair Model and the K Nearest Neighbor Regression Missing Data Repair Model.
[0137] S42: In the case of consecutive missing values, the Bayesian Ridge Regression missing value repair model outperforms the Random Forest missing value repair model and the K-Nearest Neighbors Regression missing value repair model on both weekdays and non-weekdays. Among them, the K-Nearest Neighbors Regression missing value repair model performs the worst. In the weekday dataset, when the consecutive missing value duration is less than or equal to 3 hours, the K-Nearest Neighbors Regression missing value repair model is more convenient for big data computation and processing. When the consecutive missing value duration is greater than or equal to 6 hours, the Bayesian Ridge Regression missing value repair model outperforms the other two models.
[0138] Further, as shown in Figure 6, the part of step S4 that combines the advantages of the different traffic flow missing data repair models to formulate a general, well-adapted strategy for traffic flow data missing repair also includes the following steps:
[0139] S4-1: Obtain traffic flow data for the road network under study and determine whether there are any missing traffic flow data;
[0140] S4-2: When it is determined that there are missing traffic flow data in the road network under study, the time of the missing traffic flow data is determined to be a weekday or a non-weekday;
[0141] S4-3: When traffic flow data is missing on non-weekdays, the Bayesian ridge regression missing data repair model is used to repair the missing data;
[0142] S4-4: When traffic flow data is missing on weekdays, determine the circumstances of the missing data. The circumstances of missing data include discrete missing data and continuous missing data.
[0143] S4-5: When the missing data is continuous, determine whether the duration of the missing data is greater than or equal to 3 hours. If the duration of the missing data is greater than or equal to 3 hours, use the Bayesian Ridge Regression Missing Data Repair Model to repair the missing data. If the duration of the missing data is less than 3 hours, use the K-Nearest Neighbor Regression Missing Data Repair Model to repair the missing data.
[0144] S4-6: When the missing data is discrete, determine whether the missing data ratio is greater than or equal to 10%. If the missing data ratio is greater than or equal to 10%, use the Bayesian Ridge Regression Missing Data Repair Model to repair the missing data. If the missing data ratio is less than 10%, use the Random Forest Missing Data Repair Model to repair the missing data.
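The decision rules of steps S4-2 to S4-6 can be written as a small selection function. This is a sketch of the stated thresholds only: the enum `Model`, the function `select_repair_model`, and its parameter names are hypothetical, and detecting whether data is missing at all (S4-1) is assumed to happen before the call.

```python
from enum import Enum


class Model(Enum):
    BAYESIAN_RIDGE = "Bayesian ridge regression"
    KNN = "K-nearest neighbor regression"
    RANDOM_FOREST = "random forest"


def select_repair_model(is_weekday: bool, is_continuous: bool,
                        missing_hours: float = 0.0,
                        missing_ratio: float = 0.0) -> Model:
    """Pick a repair model; missing_ratio is a fraction (0.10 means 10%)."""
    if not is_weekday:
        # S4-3: non-weekdays always use Bayesian ridge regression.
        return Model.BAYESIAN_RIDGE
    if is_continuous:
        # S4-5: continuous gaps of 3 hours or more use Bayesian ridge, else KNN.
        return Model.BAYESIAN_RIDGE if missing_hours >= 3 else Model.KNN
    # S4-6: discrete gaps at 10% or more use Bayesian ridge, else random forest.
    return Model.BAYESIAN_RIDGE if missing_ratio >= 0.10 else Model.RANDOM_FOREST
```

For instance, a weekday series with a continuous 2-hour gap maps to the K-nearest neighbor model, while a weekday series with 5% discrete missing data maps to the random forest model.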
[0145] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.
Claims
1. A method for evaluating and designing strategies for a traffic flow data missing repair system, characterized in that: S1: Construct a research dataset using time series data of traffic flow at different cross-sections, evaluate the different degrees of data missing due to multiple factors, including discrete and continuous data missing, and analyze various traffic flow data missing repair models under different data missing conditions. S2: Based on different modeling methods, two types of traffic flow data missing repair models were constructed, one without considering spatiotemporal correlation and the other considering spatiotemporal correlation. The missing traffic flow data was repaired and analyzed according to different data repair methods. S3: Evaluate the missing data repair performance of traffic flow data missing repair models that do not consider spatiotemporal correlation and those that do consider spatiotemporal correlation. Compare and analyze the model evaluation indicators of different modeling methods under discrete missing data and continuous missing data. The comparison shows that the traffic flow data missing data repair model that considers spatiotemporal correlation repairs missing data more comprehensively and accurately than the traffic flow data missing data repair model that does not consider spatiotemporal correlation repair. Moreover, the repair effect of different models under discrete missing data is better than that of models under continuous missing data. S4: By comprehensively comparing the performance of different traffic flow missing data repair models that consider spatiotemporal correlation under discrete and continuous missing data scenarios, the error indices of different road traffic flow missing data repair models under different missing data scenarios are analyzed. 
Combining the advantages of different traffic flow missing data repair models, a general strategy for traffic flow data missing data repair with good adaptability is formulated.
2. The method for evaluating and designing a traffic flow data missing repair system according to claim 1, characterized in that... Step S1 further includes the following steps: S10: Based on the traffic flow time series data of the constructed research road network, datasets with discrete and continuous missing conditions are constructed by removing traffic flow time series data at random positions and in continuous blocks. S11: Based on the dataset, model the missing road traffic flow information, considering the spatiotemporal correlation of traffic flow. The Pearson product-moment correlation coefficient is used to measure the spatiotemporal correlation of road traffic flow in both the temporal and spatial dimensions, according to the formula: $$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}$$ In the above formula, $r$ ($r \in [-1, 1]$) is the Pearson product-moment correlation coefficient between the traffic flow time series x to be repaired and the associated traffic flow time series y; if y is a historical time series of x at the same traffic monitoring station, then r is defined as the temporal correlation coefficient between the two, and if y is a time series from a nearby traffic monitoring station associated with x, then r is defined as the spatial correlation coefficient; where $x_i$ is the i-th observation sample in time series x, $y_i$ is the i-th observation sample in time series y, n is the number of observation samples in time series x and y, and $\bar{x}$ and $\bar{y}$ are the average values of time series x and y, respectively; if r = 0, there is no linear correlation between x and y; if r > 0, x and y are positively correlated and y tends to increase as x increases; if r < 0, x and y are negatively correlated and y tends to decrease as x increases. The larger the absolute value of r, the stronger the spatiotemporal correlation between x and y. 
S12: From the traffic flow time series associated with the missing traffic flow time series, select the set with the strongest spatiotemporal correlation and construct a spatiotemporal correlation dataset for training the traffic flow missing repair model. Specifically, calculate the temporal correlation coefficient $r_T$ between the missing time series and the traffic flow time series of adjacent days and of the corresponding days of adjacent weeks; calculate the spatial correlation coefficient $r_S$ between the missing time series and the traffic flow time series of upstream and downstream road segments and adjacent lanes; then select, for $r_T$ and for $r_S$ respectively, the two traffic flow time series with the highest correlation coefficient, which together with the missing time series constitute the traffic flow spatiotemporal correlation dataset.
3. The method for evaluating and designing a traffic flow data missing repair system according to claim 1, characterized in that... Step S2 further includes the following steps: S20: When constructing a missing data repair model that does not consider spatiotemporal correlation, the classic naive interpolation method and linear interpolation method are selected to repair the missing traffic flow data; S21: When constructing a missing data repair model that considers spatiotemporal correlation, select different iterative imputation-based models based on Bayesian ridge regression, K-nearest neighbor regression, and random forest to repair missing traffic flow data.
4. The method for evaluating and designing a traffic flow data missing repair system according to claim 3, characterized in that... Step S20 specifically includes the following steps: S20-1: Check the traffic flow time series data of the constructed study road network for missing values in sequence, where the traffic flow time series data is $N = (N_1, N_2, \ldots, N_g, \ldots, N_j)$; S20-2: The classic naive imputation method repairs missing traffic flow data with the complete value preceding the missing value. Let the missing value be $N_g$ ($2 \le g \le j$); the preceding complete value $N_{g-1}$ is used to fill the missing value $N_g$, and if $N_1$ is missing, $N_1$ takes the complete value nearest to $N_1$. This is repeated until all missing values are repaired; S20-3: The linear interpolation method fits a linear function through the two complete values nearest to each missing value, and missing values are repaired in sequence order. Let the missing value be $N_g$ ($1 \le g \le j$), written as the point $N_g(m_g, n_g)$, and find its two nearest complete values $N_x(m_x, n_x)$ and $N_y(m_y, n_y)$, where $m_g$ is the time and $n_g$ is the flow at time $m_g$. Substituting $N_x(m_x, n_x)$ and $N_y(m_y, n_y)$ gives the linear function: $$n = n_x + \frac{n_y - n_x}{m_y - m_x}\,(m - m_x)$$ Substituting the time $m_g$ of the missing value into the linear function repairs the missing value $n_g$. The above steps are repeated until all missing values are repaired.
5. The method for evaluating and designing a traffic flow data missing repair system according to claim 3, characterized in that... Step S21 specifically includes the following steps: S21-1: Based on the traffic flow spatiotemporal correlation dataset, construct an i×j-dimensional traffic flow data matrix, where i is the number of rows in the matrix, representing the total number of samples on the observation day, and j is the number of columns in the matrix; S21-2: Construct a missing data repair model based on nearest neighbor regression. The process includes constructing a traffic flow missing data state vector, constructing a traffic flow historical data state vector library, calculating the Euclidean distance between the traffic flow missing data state vector and the historical data state vector, selecting the correlation vector for traffic flow missing data repair, repairing the missing traffic flow data, and repeating the operation until all missing values of the traffic flow data are repaired. S21-3: Construct a missing value repair model based on random forest. The process includes sorting the data matrix according to its features, sorting the columns of the data matrix from the fewest to the most missing values, with the first column having the fewest missing values and the last column having the most missing values. After sorting, fill all missing values in all columns with 0 values. Based on the sorted data matrix, construct training and test sets, train a random forest regression model, use the trained random forest regression model to update the matrix, repeat the operation until multiple iterations, and use the last random forest prediction value as the missing data repair value. S21-4: Construct a missing value repair model based on Bayesian ridge regression. 
The process includes sorting the data matrix according to its features, sorting the columns by the number of missing values from fewest to most, with the first column having the fewest missing values and the last column having the most missing values. After sorting, all missing values in all columns are filled with 0 values. The training and test sets are divided according to the sorted data matrix. The Bayesian model is optimized and fitted and trained. The matrix is updated using the optimized Bayesian ridge regression missing data repair model. The operation is repeated until multiple iterations are performed. The last Bayesian regression prediction value is used as the missing data repair value. S21-5: Select hyperparameters for the K-nearest neighbor regression model and the random forest regression model, including optimizing the number of neighbors in the K-nearest neighbor missing data repair model and the number of regression trees in the random forest missing data repair model.
6. The method for evaluating and designing a traffic flow data missing repair system according to claim 1, characterized in that... Step S3 further includes the following steps: S30: Different modeling methods are set up in discrete missing cases and continuous missing cases respectively. The evaluation indicators of the missing repair model without considering spatiotemporal correlation and the missing repair model considering spatiotemporal correlation are summed according to the discrete missing ratio and the continuous missing ratio respectively. The average value is calculated to obtain the average performance of different repair models under discrete missing ratio and continuous missing ratio. The average index of traffic flow missing repair is analyzed based on the different road type datasets under discrete missing cases and continuous missing cases on weekdays and non-weekdays. S31: Based on the average index of traffic flow missing repair in discrete missing scenarios on weekdays and non-weekdays for different road type datasets, the model evaluation error considering spatiotemporal correlation is significantly smaller than the model evaluation error not considering spatiotemporal correlation, and the performance of the repair models built on weekday and non-weekday datasets is similar. The data has high dispersion and traffic flow fluctuates greatly. Models that consider spatiotemporal correlation are better than those that do not. The smaller the data dispersion, the smaller the change in the data before and after. Missing data repair models that do not consider spatiotemporal correlation have similar repair effects as those that do. 
S32: Based on the average index of traffic flow missing repair in the case of continuous missing traffic flow on weekdays and non-weekdays for different road type datasets, the model evaluation error considering spatiotemporal correlation is significantly smaller than the model evaluation error not considering spatiotemporal correlation, and the repair models built on weekday and non-weekday datasets have similar performance. Under the same data missing duration, the smaller the traffic flow data collection interval and the higher the collection frequency, the larger the amount of missing data, and the higher the error of the model repair. The smaller the data dispersion, the better the performance of the repair model. S33: The above evaluation of the model considering spatiotemporal correlation and the model not considering spatiotemporal correlation shows that considering spatiotemporal correlation obtains traffic flow missing data patterns from both spatiotemporal dimensions, making the model considering spatiotemporal correlation more comprehensive and accurate in repairing missing data. At the same time, the repair effect of the model in discrete missing cases is better than that of the model in continuous missing cases.
7. The method for evaluating and designing a traffic flow data missing repair system according to claim 1, characterized in that... In step S4, the performance of different traffic flow missing repair models considering spatiotemporal correlation is comprehensively compared under discrete and continuous missing scenarios. The error indices of different road traffic flow missing repair models under different missing scenarios are analyzed. The steps also include: S40: The K-nearest neighbor regression missing data repair model, random forest missing data repair model, and Bayesian ridge regression missing data repair model are set up in discrete missing data cases and continuous missing data cases respectively. They are classified according to weekdays and non-weekdays. The error indices of different traffic flow missing data repair models are accumulated and summed, and then averaged to obtain the evaluation index reflecting the average performance of the traffic flow missing data repair models. A comprehensive comparison is then made. S41: In the case of discrete missing data, both on weekdays and non-weekdays, the Bayesian Ridge Regression Missing Data Repair Model and the Random Forest Missing Data Repair Model outperform the K Nearest Neighbor Regression Missing Data Repair Model. In the weekday dataset, when the discrete missing data ratio is less than 10%, the Random Forest Missing Data Repair Model outperforms the Bayesian Ridge Regression Missing Data Repair Model. When the discrete missing data ratio is greater than 10%, the Bayesian Ridge Regression Missing Data Repair Model performs better. In the non-weekday dataset, the Bayesian Ridge Regression Missing Data Repair Model outperforms both the Random Forest Missing Data Repair Model and the K Nearest Neighbor Regression Missing Data Repair Model. 
S42: In the case of consecutive missing values, the Bayesian Ridge Regression missing value repair model outperforms the Random Forest missing value repair model and the K-Nearest Neighbors Regression missing value repair model on both weekdays and non-weekdays. Among them, the K-Nearest Neighbors Regression missing value repair model performs the worst. In the weekday dataset, when the consecutive missing value duration is less than or equal to 3 hours, the K-Nearest Neighbors Regression missing value repair model is more convenient for big data computation and processing. When the consecutive missing value duration is greater than or equal to 6 hours, the Bayesian Ridge Regression missing value repair model outperforms the other two models.
8. The method for evaluating and designing a traffic flow data missing repair system according to claim 1, characterized in that... In step S4, combining the advantages of different traffic flow missing data repair models, a general strategy for traffic flow missing data repair with good adaptability is formulated, which also includes the following steps: S4-1: Obtain traffic flow data for the road network under study and determine whether there are any missing traffic flow data; S4-2: When it is determined that there are missing traffic flow data in the road network under study, the time of the missing traffic flow data is determined to be a weekday or a non-weekday; S4-3: When traffic flow data is missing on non-working days, a Bayesian ridge regression missing data repair model is used to repair the missing data; S4-4: When traffic flow data is missing on weekdays, determine the circumstances of the missing data. The circumstances of missing data include discrete missing data and continuous missing data. S4-5: When the missing data is continuous, determine whether the duration of the missing data is greater than or equal to 3 hours. If the duration of the missing data is greater than or equal to 3 hours, use the Bayesian Ridge Regression Missing Data Repair Model to repair the missing data. If the duration of the missing data is less than 3 hours, use the K-Nearest Neighbor Regression Missing Data Repair Model to repair the missing data. S4-6: When the missing data is discrete, determine whether the missing data ratio is greater than or equal to 10%. If the missing data ratio is greater than or equal to 10%, use the Bayesian Ridge Regression Missing Data Repair Model to repair the missing data. If the missing data ratio is less than 10%, use the Random Forest Missing Data Repair Model to repair the missing data.