A fast preprocessing method for OBD remote emission monitoring data
By combining adaptive bucket sorting and columnar storage format, the quality problem of OBD remote emission monitoring data is solved, the data processing speed and quality are improved, multi-threaded processing is supported, specific research needs are met, and the efficiency of emission analysis is improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINESE RES ACAD OF ENVIRONMENTAL SCI
- Filing Date
- 2024-08-26
- Publication Date
- 2026-06-26
AI Technical Summary
Existing OBD remote emission monitoring data processing suffers from data quality issues, such as sensors not working at low exhaust temperatures, insufficient accuracy, and out-of-order transmission, loss, and retransmission of data, which make data processing inefficient and slow down the analysis of emission results.
An adaptive bucket sort algorithm is used to organize the data along the timeline after it is converted to a columnar storage format. Retransmitted frames are removed, trip segmentation is performed, abnormal data are identified and removed, data representativeness is evaluated, and specialized processing is applied for specific research needs.
It improves data reading speed and processing efficiency, ensures data quality, supports multi-threaded processing, enhances the reliability of data representativeness, and adapts to specific research needs.
Figure CN119066057B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of emission monitoring technology, and in particular to a rapid preprocessing method for OBD remote emission monitoring data.
Background Technology
[0002] Heavy-duty diesel vehicles, as a significant component of mobile road sources, are major sources of pollutants such as nitrogen oxides (NOx). Although new energy vehicles have made significant progress in the light-duty vehicle sector, the transition to new energy vehicles has been relatively slow in the heavy-duty vehicle sector, and heavy-duty diesel vehicles will continue to maintain a large fleet and stable annual production and sales volume. To control emissions, countries are continuously tightening emission standards, and exhaust after-treatment technologies are constantly being upgraded. However, engines that pass bench tests often exceed emission standards in real-world driving, making real-world vehicle emissions monitoring a necessary measure.
[0003] Currently, practical road emission monitoring methods mainly include PEMS testing, remote sensing testing, vehicle-following testing, tunnel testing, and remote OBD. PEMS testing is costly, time-consuming, and yields limited data, while non-contact methods such as remote sensing, vehicle-following testing, and tunnel testing cannot obtain engine data. Remote OBD relies on the vehicle's original OBD system and requires only the installation of an additional communication terminal, so modification costs are minimal. It can acquire a large amount of data, provides synchronized engine data, and allows long-term tracking and analysis of individual vehicles, making it a highly practical emission monitoring method. With the widespread application of remote OBD systems, the slow and inefficient processing of massive amounts of data has become a research bottleneck. OBD remote emission monitoring data includes not only synchronized engine data but also a comprehensive record of vehicle operating status, and it has significant data quality issues (such as sensors not working at low exhaust temperatures, insufficient accuracy, and out-of-order transmission, loss, and retransmission of data). This makes data preprocessing a necessary task. Therefore, optimizing the preprocessing process and improving processing speed are crucial for enhancing overall research efficiency.
Summary of the Invention
[0004] The purpose of this invention is to provide a rapid preprocessing method for OBD remote emission monitoring data, designed for large volumes of remote OBD data. It can efficiently identify and handle quality problems in the data (such as sensors not working at low exhaust temperatures, insufficient accuracy, and out-of-order transmission, loss, and retransmission of data), thereby speeding up the preliminary work required for emission result analysis.
[0005] To achieve the above objectives, the present invention adopts the following technical solution:
[0006] A rapid preprocessing method for OBD remote emission monitoring data includes the following steps:
[0007] Convert the original row-based data file into a column-based storage format;
[0008] An adaptive bucket sort algorithm is used to organize the timeline of the converted columnar storage data, and retransmitted frames are cleared. Trip segmentation is performed after the timeline is organized.
[0009] Perform abnormal data cleaning on the processed data, including identifying and removing standard invalid values, out-of-limit values, and unreasonable consecutive duplicate values;
[0010] Evaluate the mileage coverage and effective data ratio of the cleaned data to comprehensively assess the data representativeness.
[0011] Preferably, the columnar storage format adopts the Apache Parquet format to improve data reading speed and compression efficiency.
[0012] Preferably, the adaptive bucket sort algorithm includes:
[0013] Traverse the data to determine the timestamp range, and divide the timeline within the coverage area into multiple buckets;
[0014] Place the data frames into the corresponding buckets according to their timestamps;
[0015] Perform quicksort on the data frames within each bucket;
[0016] By merging the data frames within the bucket in sequence, a data table in chronological order is obtained.
[0017] Preferably, the abnormal data cleaning step includes:
[0018] Based on the standard invalid return value table, identify and remove standard invalid values;
[0019] Based on the effective upper and lower limit table, identify and remove unreasonable out-of-limit values;
[0020] An algorithm based on differential binarization is used to identify and remove unreasonable consecutive duplicate values.
[0021] Data from the engine-off state are excluded before cleaning.
[0022] Preferably, the data representativeness assessment step includes:
[0023] Calculate the mileage coverage of the cleaned data;
[0024] The representativeness of the data is calculated by multiplying the weighted mileage coverage by the effective data ratio.
[0025] Preferably, it also includes specific data processing steps for specific research needs, including:
[0026] Restoring the non-negativity of the NOx concentration downstream of the SCR;
[0027] The problem of GPS positioning jumps and drifts is corrected by using OBD vehicle speed comparison based on wheel rotation calculation.
[0028] Preferably, it includes: a storage format conversion module for converting the original row-based storage data file into a column-based storage format;
[0029] The timeline organization module uses an adaptive bucket sort algorithm to organize the timeline of the converted data. It removes retransmitted frames based on whether the timestamps of adjacent frames are the same, and performs trip segmentation based on large timestamp jumps and low SCR outlet temperatures.
[0030] The abnormal data cleaning module is used to clean outliers from the processed data.
[0031] The data evaluation module is used to evaluate the mileage coverage and effective data ratio of the cleaned data, and to comprehensively evaluate the representativeness of the data.
[0032] The specialized processing module is used to perform specialized processing on data for specific research needs.
[0033] This invention has at least the following beneficial effects:
[0034] Data file preparation: Convert the original row-based storage file into a column-based storage format (such as Apache Parquet) to improve data reading speed, reduce IO operations, support selective reading of column features, and optimize storage and compression efficiency.
[0035] Timeline organization: An adaptive bucket sort algorithm is adopted. By traversing the data to determine the timestamp range, the timeline within the range is divided into multiple buckets, which are sorted separately and then merged, effectively shortening the sorting time. Multi-threaded processing is supported to further improve speed. Retransmitted frames are cleared based on whether adjacent frame timestamps are identical, and trip segmentation is performed based on large timestamp jumps and low SCR outlet temperatures.
[0036] Abnormal data cleaning: Cleaning strategies are developed for invalid values, out-of-limit values, and unreasonable consecutive duplicate values. In particular, data from when the engine is not running is excluded before cleaning to reduce the workload of subsequent cleaning. After cleaning, the mileage coverage and effective data ratio of the data are evaluated to comprehensively assess the representativeness of the data.
[0037] Specialized processing: Specialized processing is applied to specific research needs, such as NOx emission analysis and GPS positioning correction. NOx concentration data below 0 are set to 0 to restore non-negativity, and GPS positioning jumps and drifts are corrected by comparison with the OBD vehicle speed, which is calculated from wheel rotation.
Attached Figure Description
[0038] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0039] Figure 1 is a schematic diagram of the adaptive bucket sorting process of the present invention;
[0040] Figure 2 is a schematic diagram of the trip segmentation process of the present invention;
[0041] Figure 3 is a schematic diagram of the three-step cleaning process for invalid values, out-of-limit values, and unreasonable consecutive duplicate values in this invention.
Detailed Implementation
[0042] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0043] Example 1, referring to Figures 1-3: a rapid preprocessing method for OBD remote emission monitoring data includes the following steps:
[0044] Convert the original row-based data file into a column-based storage format;
[0045] An adaptive bucket sort algorithm is used to organize the timeline of the converted data. Retransmitted frames are removed based on whether the timestamps of adjacent frames are the same. Trip segmentation is performed based on large timestamp jumps and low SCR outlet temperatures.
[0046] Perform abnormal data cleaning on the processed data, including identifying and removing standard invalid values, out-of-limit values, and unreasonable consecutive duplicate values;
[0047] Evaluate the mileage coverage and effective data ratio of the cleaned data to comprehensively assess the data representativeness.
[0048] Data file preparation:
[0049] The data file preparation section converts the original row-based storage file (usually a CSV file) containing remote OBD data exported from the database into a column-based storage format. Remote OBD data is transmitted to the monitoring end in real-time, frame by frame. The monitoring end typically uses a row-based relational database to accommodate the vertical accumulation of data. Row-based storage offers no advantage in the preprocessing and analysis of exported data; a comparison between row-based and column-based storage is shown in Table 1. Remote OBD data column features only have 20 items, typically 2 to 5 orders of magnitude fewer than the number of frames in a single file. Column-based storage reduces I / O operations. Most analysis scenarios do not require all remote OBD features; column-based storage also improves read speed when selecting specific features for reading. The different data formats of column features provide a basis for targeted compression, reducing file size and indirectly improving read efficiency by reducing the number of bytes required to read.
[0050] Table 1 Comparison of Row-based and Column-based Storage in Various Aspects
[0051]
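The difference between the two layouts can be illustrated with a minimal sketch. It uses only the Python standard library and is not a Parquet implementation; the feature names and the three sample frames are hypothetical. In practice a Parquet library would do the conversion (for example, pandas offers `DataFrame.to_parquet` and `read_parquet` with a `columns` argument).

```python
import csv
import io

# Hypothetical three-frame export; a real file has about 20 features per frame.
ROW_CSV = """timestamp,vehicle_speed,engine_speed,nox_downstream
1700000000,42.0,1350,35.5
1700000001,42.5,1360,36.0
1700000002,43.0,1375,34.0
"""

def rows_to_columns(csv_text):
    """Pivot row-oriented CSV text into a dict: feature name -> list of values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns

def read_features(columns, wanted):
    """Selective read: only the requested feature columns are touched."""
    return {name: columns[name] for name in wanted}

columns = rows_to_columns(ROW_CSV)
subset = read_features(columns, ["timestamp", "nox_downstream"])
print(subset["nox_downstream"])
```

Because each feature is stored contiguously, an analysis that needs only two of the features never has to parse the others, which is the I/O saving the paragraph above describes.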
[0052] Timeline organization:
[0053] Errors in remote OBD data fall into two categories: timeline errors and abnormal data. Timeline errors originate from anomalies during communication, such as packet loss, improper retransmission, and retransmission, which are the main causes of missing data frames, multiple data entries with the same timestamp, and out-of-order data, respectively. This section first resolves the out-of-order problem by sorting the data, then removes redundant frames based on adjacent frame timestamps to resolve the retransmission problem, then segments trips at long gaps in the timeline, and finally calculates and evaluates the mileage coverage of the data.
[0054] This section employs an adaptive bucket sort to organize the timeline. First, it iterates through the data to obtain the upper and lower bounds of the timestamps and defines each hour within these bounds as a bucket. Each data frame is then placed into its corresponding bucket according to its timestamp. Next, any quicksort implementation is used to sort the data frames in each bucket. Finally, the buckets are traversed in order, and their data frames are retrieved and concatenated to obtain a chronologically ordered data table. The algorithm is shown in Figure 1.
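The steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: frames are assumed to be dicts with an integer `timestamp` in seconds (a hypothetical layout), and Python's built-in `sorted` stands in for the quicksort named in the text.

```python
BUCKET_SECONDS = 3600  # one bucket per hour, as in the description

def adaptive_bucket_sort(frames, key=lambda f: f["timestamp"]):
    """Return the frames in ascending timestamp order using hourly buckets."""
    if not frames:
        return []
    # Pass 1: traverse the data to find the timestamp bounds.
    t_min = min(key(f) for f in frames)
    t_max = max(key(f) for f in frames)
    n_buckets = int((t_max - t_min) // BUCKET_SECONDS) + 1
    buckets = [[] for _ in range(n_buckets)]
    # Pass 2: place each frame into the bucket covering its hour.
    for f in frames:
        buckets[int((key(f) - t_min) // BUCKET_SECONDS)].append(f)
    # Sort inside each bucket, then concatenate the buckets in time order.
    ordered = []
    for bucket in buckets:
        ordered.extend(sorted(bucket, key=key))
    return ordered

frames = [{"timestamp": t} for t in [7300, 10, 3605, 3600, 10, 0, 7199]]
print([f["timestamp"] for f in adaptive_bucket_sort(frames)])
```

Each bucket is sorted independently of the others, so the per-bucket sorts could be handed to a pool of workers. Frames with identical timestamps are kept here; removing them is a separate step.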
[0055] As can be seen from the above embodiments:
[0056] Data file preparation: Convert the original row-oriented storage file (usually a CSV file) into a column-oriented storage format (such as Apache Parquet) to improve data reading speed, reduce IO operations, support selective reading, and optimize storage and compression efficiency.
[0057] Timeline organization: An adaptive bucket sort algorithm is adopted, which determines the timestamp range by traversing the data, divides the timeline into multiple buckets, sorts them separately, and then merges them, effectively shortening the sorting time. Multi-threaded processing is supported to further improve speed. Retransmitted frames are cleared based on whether adjacent frame timestamps are identical, and trip segmentation is performed based on large timestamp jumps and low SCR outlet temperatures.
[0058] Abnormal data cleaning: Cleaning strategies are developed for invalid values, out-of-limit values, and unreasonable consecutive duplicate values. In particular, data from when the engine is not running is excluded before cleaning to reduce the workload of subsequent cleaning. After cleaning, the mileage coverage and effective data ratio of the data are evaluated to comprehensively assess the representativeness of the data.
[0059] Specialized processing: Specialized processing is applied to specific research needs, such as NOx emission analysis and GPS positioning correction. This involves restoring non-negativity by setting NOx concentration data below 0 to zero, and using vehicle speed as a reference to correct GPS positioning jumps and drift issues.
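The two specialized steps can be sketched as below. This is a hedged illustration only: the field names (`lat`, `lon`, `speed_kmh`), the 30 km/h tolerance, and the use of the haversine distance are assumptions, and the sketch only flags suspicious GPS fixes. How a flagged fix is then corrected is not specified here.

```python
import math

def clamp_nox(values):
    """Restore non-negativity: negative downstream NOx readings become 0."""
    return [max(0.0, v) for v in values]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_gps_jumps(frames, tolerance_kmh=30.0):
    """Flag frames whose GPS-implied speed exceeds the OBD speed by more than
    tolerance_kmh (an assumed threshold)."""
    flags = [False] if frames else []
    for prev, cur in zip(frames, frames[1:]):
        dt = cur["timestamp"] - prev["timestamp"]
        if dt <= 0:
            flags.append(False)  # cannot derive a speed without elapsed time
            continue
        dist = haversine_m(prev["lat"], prev["lon"], cur["lat"], cur["lon"])
        gps_kmh = dist / dt * 3.6
        flags.append(gps_kmh - cur["speed_kmh"] > tolerance_kmh)
    return flags

track = [
    {"timestamp": 0, "lat": 30.0, "lon": 120.0, "speed_kmh": 36.0},
    {"timestamp": 1, "lat": 30.00009, "lon": 120.0, "speed_kmh": 36.0},
    {"timestamp": 2, "lat": 30.01009, "lon": 120.0, "speed_kmh": 36.0},  # roughly 1.1 km in 1 s
]
print(clamp_nox([-2.0, 0.0, 5.5]), flag_gps_jumps(track))
```

The OBD vehicle speed serves as the reference because it is derived from wheel rotation and is therefore unaffected by positioning errors.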
[0060] Example 2, referring to Figures 1-3: the columnar storage format adopts the Apache Parquet format to improve data reading speed and compression efficiency.
[0061] The adaptive bucket sort algorithm includes:
[0062] Traverse the data to determine the timestamp range, and divide the timeline within the coverage area into multiple buckets;
[0063] Place the data frames into the corresponding buckets according to their timestamps;
[0064] Perform quicksort on the data frames within each bucket;
[0065] By merging the data frames within the bucket in sequence, a data table in chronological order is obtained.
[0066] The abnormal data cleaning steps include:
[0067] Based on the standard invalid return value table, identify and remove standard invalid values;
[0068] Based on the effective upper and lower limit table, identify and remove unreasonable out-of-limit values;
[0069] An algorithm based on differential binarization is used to identify and remove unreasonable consecutive duplicate values.
[0070] Data from the engine-off state are excluded before cleaning.
[0071] The data representativeness assessment steps include:
[0072] Calculate the mileage coverage of the cleaned data;
[0073] The representativeness of the data is calculated by multiplying the weighted mileage coverage by the effective data ratio.
[0074] It also includes specialized data processing steps for specific research needs, including:
[0075] Restoring the non-negativity of the NOx concentration downstream of the SCR;
[0076] The problem of GPS positioning jumps and drifts is corrected by using OBD vehicle speed comparison based on wheel rotation calculation.
[0077] Includes: a storage format conversion module, used to convert raw row-based data files into column-based storage format;
[0078] The timeline organization module uses an adaptive bucket sort algorithm to organize the converted data into a timeline.
[0079] The abnormal data cleaning module is used to clean outliers from the processed data.
[0080] The data evaluation module is used to evaluate the mileage coverage and effective data ratio of the cleaned data, and to comprehensively evaluate the representativeness of the data.
[0081] The specialized processing module is used to perform specialized processing on data for specific research needs.
[0082] Theoretically, bucket sort has a time complexity of O(n), which is asymptotically better than the O(n log n) time complexity of general comparison sorting. In practice, based on assumptions that hold in most cases, adaptive bucket sort can indeed speed up the sorting of remote OBD data: in the worst case, based on the superadditivity of the function f(x) = x log x, the time taken by the adaptive bucket sort is less than... The standard quicksort algorithm takes c·n log n time, where c is a constant coefficient representing the time taken for a single operation. As long as the total number of frames in the data table is greater than 3600, which in practice corresponds to a time span of at least one hour, the adaptive bucket sort can speed up the sorting process. Furthermore, the adaptive bucket sort algorithm is highly thread-friendly and can gain further speed from multi-core CPUs in real-world applications.
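One way to read the worst-case argument is written out below. It is a sketch under stated assumptions, not the source's own expression, which is elided above: sorting a bucket of n_i frames is assumed to cost c·n_i·log n_i, the k buckets are assumed to partition the n frames, and the linear passes for finding the timestamp range and distributing the frames are left out.

```latex
\begin{align*}
T_{\text{bucket}}
  &= \sum_{i=1}^{k} c\, n_i \log n_i ,
  \qquad \sum_{i=1}^{k} n_i = n \\
  &\le \sum_{i=1}^{k} c\, n_i \log n
   = c\, n \log n
   = T_{\text{quick}} .
\end{align*}
```

The inequality uses only log n_i ≤ log n for every bucket, and it is strict as soon as at least two buckets are non-empty. This is the superadditivity of f(x) = x log x applied to the bucket sizes.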
[0083] For adjacent data frames arranged in ascending order, the timestamp of the previous data frame is subtracted from the timestamp of the next data frame to obtain the recording interval between the two frames. The intervals are then categorized according to the table below. For each unique timestamp, only a single data frame is retained, meaning retransmitted data frames are directly removed. Short gaps of less than ten seconds generally occur during engine startup, and their impact can be completely eliminated by subsequent cleaning; therefore, they are not processed in this step. Long gaps on the timeline may be due to communication errors or may be normal during parking; both are quite common. To eliminate the influence of factors such as cold starts, segmentation into single trips is essential. The trip segmentation referred to here essentially detects whether a significant long-term parking event occurred during the long gap. If so, the gap is set as a segmentation point, and the two segments before and after it are assigned to different trips. The specific process is shown in Figure 2.
[0084] This process uses two conditions to determine whether to segment: if the gap is too long, the vehicle's condition is considered to have changed significantly, and a segmentation point is marked; if the SCR outlet temperature at the starting point of the subsequent segment is within the normal air temperature range, the SCR is considered to have cooled down after a long period of parking, and a segmentation point is marked. It is worth noting that, after marking, the resulting trips can be filtered to remove those that are too short to have much analytical value. Finally, the mileage coverage of the data table is calculated using the following formula:
[0085] \[ \text{Coverage}_{\text{mileage}} = 1 - \frac{\sum_{i=1}^{n}\left(\text{Mileage}_{time\_abs\_end_i} - \text{Mileage}_{time\_abs\_start_i}\right)}{\text{Mileage}_{time\_end} - \text{Mileage}_{time\_start}} \]
[0086] Where n is the number of long-term gaps; time_abs_start_i is the timestamp of the data before the start of the i-th gap; time_abs_end_i is the timestamp of the data after the i-th gap ends; time_start and time_end are the timestamps of the earliest and latest rows of data, respectively; and Mileage_t is the mileage recorded at timestamp t. Since the number of missing data frames is unknown, mileage is used here in place of the frame count, and this coverage is used in the subsequent calculation of data representativeness.
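The retransmission removal, trip segmentation, and mileage coverage steps can be sketched together as follows. Several choices here are assumptions rather than values from the source: the frame field names, the one-hour `MAX_GAP_S` threshold, the 45 °C ceiling for "normal air temperature", and the reading of mileage coverage as one minus the share of total mileage that falls inside long gaps. Only the ten-second boundary between short and long gaps comes from the text.

```python
LONG_GAP_S = 10        # from the text: gaps under ten seconds are left to later cleaning
MAX_GAP_S = 3600       # assumed: a gap this long always splits the trip
AMBIENT_MAX_C = 45.0   # assumed upper edge of the normal air temperature range

def drop_retransmitted(frames):
    """Keep one frame per timestamp; frames must already be in ascending order."""
    kept = []
    for f in frames:
        if not kept or f["timestamp"] != kept[-1]["timestamp"]:
            kept.append(f)
    return kept

def split_trips(frames):
    """Split at long gaps that are very long or followed by a cooled-down SCR."""
    trips, current = [], []
    for f in frames:
        if current:
            gap = f["timestamp"] - current[-1]["timestamp"]
            cooled = f["scr_out_temp_c"] <= AMBIENT_MAX_C
            if gap >= LONG_GAP_S and (gap >= MAX_GAP_S or cooled):
                trips.append(current)
                current = []
        current.append(f)
    if current:
        trips.append(current)
    return trips

def mileage_coverage(frames):
    """Share of total mileage not inside long gaps; None if no mileage was driven."""
    if len(frames) < 2:
        return None
    total = frames[-1]["mileage_km"] - frames[0]["mileage_km"]
    if total <= 0:
        return None
    missing = sum(
        b["mileage_km"] - a["mileage_km"]
        for a, b in zip(frames, frames[1:])
        if b["timestamp"] - a["timestamp"] >= LONG_GAP_S
    )
    return 1.0 - missing / total
```

A typical call order would be `drop_retransmitted` on the sorted table, then `mileage_coverage` and `split_trips` on the result. Trips that are too short can be filtered from the list returned by `split_trips`.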
[0087] Abnormal data cleaning
[0088] Abnormal data from remote OBD originates from malfunctions within the OBD system, such as sensor failure, abnormal data processing, or improper software behavior. It appears in the data as invalid return values defined by standards, out-of-limit values that contradict real-world physical laws, and abnormal runs of consecutive identical values. This section addresses these three types of abnormal data with a three-step targeted cleaning process, as shown in Figure 3.
[0089] The cleaning criteria for each feature are shown in Table 2. The primary criterion for whether to apply consecutive-duplicate cleaning to a feature is its data precision: when precision is very low, consecutive repetition is entirely normal. It is worth noting that the SCR upstream and downstream temperatures, although of higher precision, are very highly correlated with engine speed; when the engine speed does not change abruptly, these temperatures tend to be stable, so unchanged values in adjacent frames are quite frequent. Therefore, consecutive-duplicate cleaning is applied only to vehicle speed, engine speed, intake air volume, engine fuel flow rate, and NOx concentration upstream and downstream of the SCR.
[0090] Table 2 Cleaning Basis for Each Feature
[0091]
[0092]
[0093] *: Features marked "Yes" in this column require consecutive-duplicate cleaning. These features have high data precision. If such a feature keeps returning the same value within the normal range for a relatively long period (more than 60 seconds in practice), the data segment can be considered an abnormal consecutive-duplicate segment and needs cleaning. As an exception, it is normal for engine fuel flow and vehicle speed to be 0 for extended periods; prolonged runs of 0 in these two features do not require cleaning. In the specific implementation, standard invalid value cleaning is based on an equality test:
[0094] entry.isInvalid = (entry[feature] == invalid[feature])
[0095] Here, invalid is a table storing the standard invalid return values, and isInvalid is a flag indicating whether an entry is a standard invalid return value. Out-of-limit cleaning is based on upper and lower bounds:
[0096] entry.isOutofRange = (entry[feature] < lowerbound[feature]) || (entry[feature] > upperbound[feature])
[0097] The lowerbound and upperbound tables store the effective lower and upper bounds of the features, respectively, while isOutofRange is a flag indicating whether an entry is out of bounds.
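The three cleaning flags can be sketched in Python as below. The table contents are hypothetical placeholders, not the values of Table 2, and the run detection is only one plausible reading of "differential binarization": adjacent values are differenced, the result is binarized into changed or unchanged, and unchanged runs spanning more than 60 seconds are flagged.

```python
INVALID = {"engine_speed": 65535.0}   # hypothetical standard invalid return value
LOWER = {"engine_speed": 0.0}         # hypothetical effective lower bound
UPPER = {"engine_speed": 4000.0}      # hypothetical effective upper bound
MAX_REPEAT_S = 60                     # from the text: more than 60 seconds
ZERO_IS_NORMAL = {"vehicle_speed", "fuel_flow"}  # long runs of 0 are normal here

def flag_invalid(values, feature):
    """Step 1: equality test against the standard invalid return value."""
    return [v == INVALID[feature] for v in values]

def flag_out_of_range(values, feature):
    """Step 2: test against the effective lower and upper bounds."""
    return [v < LOWER[feature] or v > UPPER[feature] for v in values]

def flag_repeats(timestamps, values, feature):
    """Step 3: binarize the first difference (1 = unchanged) and flag runs
    of unchanged values that span more than MAX_REPEAT_S seconds."""
    n = len(values)
    same = [0] + [1 if values[i] == values[i - 1] else 0 for i in range(1, n)]
    flags = [False] * n
    i = 1
    while i < n:
        if not same[i]:
            i += 1
            continue
        start = i - 1                 # first frame of the constant run
        while i < n and same[i]:
            i += 1
        end = i - 1                   # last frame of the constant run
        zero_ok = feature in ZERO_IS_NORMAL and values[start] == 0
        if timestamps[end] - timestamps[start] > MAX_REPEAT_S and not zero_ok:
            for j in range(start, end + 1):
                flags[j] = True
    return flags
```

Flagged entries would then be removed or masked per feature. Keeping the three flags separate makes it possible to report how much data each step discards.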
[0098] As can be seen from the above embodiments, the beneficial effects of the present invention are mainly reflected in the following aspects:
[0099] Significantly improves data retrieval speed: The application of columnar storage format increases data retrieval speed several times, laying a solid foundation for subsequent processing.
[0100] Optimize the timeline sorting process: The adaptive bucket sort algorithm significantly reduces sorting time and supports multi-threaded processing, further improving efficiency.
[0101] Improve the effectiveness of abnormal data cleaning: The three-step cleaning strategy, combined with engine-off status elimination, effectively removes various abnormal data and improves data quality.
[0102] Introducing data representativeness assessment: Data representativeness is comprehensively assessed by mileage coverage and effective data ratio to ensure the validity and reliability of preprocessing results.
[0103] Support for specialized data processing: Provide specialized processing solutions for specific research needs, enhancing the flexibility and practicality of preprocessing methods.
[0104] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and the description in the specification merely illustrate the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the claimed invention. The scope of protection of the invention is defined by the appended claims and their equivalents.
Claims
1. A rapid preprocessing method for OBD remote emission monitoring data, characterized in that it includes the following steps: converting the original row-based data file into a column-based storage format; using an adaptive bucket sort algorithm to organize the timeline of the converted columnar storage data, clearing retransmitted frames based on whether the timestamps of adjacent frames are the same, and performing trip segmentation based on large timestamp jumps and low SCR outlet temperature; performing abnormal data cleaning on the processed data, including identifying and removing standard invalid values, out-of-limit values, and unreasonable consecutive duplicate values; and evaluating the mileage coverage and effective data ratio of the cleaned data to comprehensively assess the representativeness of the data; wherein the adaptive bucket sort algorithm includes: traversing the data to determine the timestamp range, and dividing the timeline within the covered range into multiple buckets; placing the data frames into the corresponding buckets according to their timestamps; performing quicksort on the data frames within each bucket; and merging the data frames of the buckets in sequence to obtain a data table in chronological order; and wherein the data representativeness assessment step includes: calculating the mileage coverage of the cleaned data; and calculating the representativeness of the data by multiplying the weighted mileage coverage by the effective data ratio.
2. The rapid preprocessing method for OBD remote emission monitoring data according to claim 1, characterized in that the columnar storage format uses the Apache Parquet format to improve data reading speed and compression efficiency.
3. The rapid preprocessing method for OBD remote emission monitoring data according to claim 1, characterized in that the abnormal data cleaning step includes: identifying and removing standard invalid values based on the standard invalid return value table; identifying and removing unreasonable out-of-limit values based on the effective upper and lower limit table; and identifying and removing unreasonable consecutive duplicate values using an algorithm based on differential binarization; wherein data from the engine-off state are excluded before the three-step cleaning.
4. The rapid preprocessing method for OBD remote emission monitoring data according to claim 1, characterized in that it also includes specialized data processing steps for specific research needs, including: restoring the non-negativity of the NOx concentration downstream of the SCR; and correcting GPS positioning jumps and drifts by comparison with the OBD vehicle speed calculated from wheel rotation.
5. The rapid preprocessing method for OBD remote emission monitoring data according to claim 1, characterized in that it includes: a storage format conversion module, used to convert raw row-based data files into a column-based storage format; a timeline organization module, which uses an adaptive bucket sort algorithm to organize the timeline of the converted data, removes retransmitted frames based on whether the timestamps of adjacent frames are the same, and performs trip segmentation based on large timestamp jumps and low SCR outlet temperatures; an abnormal data cleaning module, used to clean abnormal data from the processed data; a data evaluation module, used to evaluate the mileage coverage and effective data ratio of the cleaned data and to comprehensively evaluate the representativeness of the data; and a specialized processing module, used to perform specialized processing on data for specific research needs.