A storage method and system based on automobile insurance data
By using entropy weight mapping and input-response binary graph technology, the problem of resource allocation mismatch in static fragmented storage technology is solved, realizing efficient and adaptive storage of car insurance data, improving the accuracy of data stream feature recognition and storage resource utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN YISHENGHENG TECH CO LTD
- Filing Date
- 2026-04-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing static sharding storage technology lacks the ability to deeply perceive the spatiotemporal characteristics, fluctuation complexity, and data storage value inherent in the data stream when processing auto insurance data. This leads to resource misallocation, making it impossible to efficiently persist high-value complex data or occupying valuable resources to store low-value redundant data.
By using entropy weight mapping based on operating conditions, heterogeneous time-series data streams are obtained, an input-response binary graph is constructed, wavelet packet decomposition and cross-correlation operations are performed, frequency domain complexity and time-series redundancy features are extracted, storage entropy weight coefficients are calculated using a data mapping model, a three-dimensional storage strategy space is constructed, and adaptive sharding and disk placement are achieved.
It accurately identifies the temporal lag relationships between data, quantifies the sparsity and encoding difficulty of data, adaptively selects the optimal storage path, optimizes the performance and resource utilization of the database cluster, and balances data writing pressure and storage resources.
Figure CN122240726A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data storage technology, specifically to a method and system for sharding and partitioning automotive insurance data.
Background Technology
[0002] With the rapid development of vehicle-to-everything (V2X) technology and usage-based insurance business, the data processing model of the modern auto insurance industry is undergoing profound changes. During the driving process, vehicles generate massive heterogeneous time-series data streams at high frequency through CAN bus and various on-board sensors. These data cover multiple dimensions of vehicle operating status information and are the core assets for building user driving behavior profiles.
[0003] To cope with the storage pressure of massive amounts of data, existing distributed database systems typically employ database sharding and table partitioning techniques to achieve horizontal scaling. Common database sharding and table partitioning strategies in current technologies mainly rely on static sharding keys, such as hash modulo sharding based on vehicle unique identifiers, or range sharding based on the timestamps generated by the data. This static rule-based storage method is highly efficient when processing structured, evenly distributed business data, effectively distributing storage pressure and improving system throughput.
[0004] However, in high-concurrency write scenarios involving massive amounts of vehicle-to-everything (V2X) data, the heterogeneous time-series data streams generated by vehicles exhibit strong time-varying and non-stationary characteristics. Under different driving conditions (e.g., smooth cruise versus emergency avoidance), the information density, signal fluctuation frequency, and correlation between various sensor data often differ significantly. Furthermore, the load status of the storage cluster itself (e.g., IOPS utilization) and the frequency of I / O operations on different data from the business side are also dynamically changing. Currently, mainstream static sharding storage technologies typically treat data streams as homogeneous binary objects, lacking a deep understanding of the spatiotemporal characteristics, fluctuation complexity, and data storage value (hotness / coldness) inherent in the data stream. This can lead to technical problems such as resource allocation mismatches in practical applications.
[0005] This resource allocation mismatch takes two forms in practice: on the one hand, a large amount of high-value, complex data containing rich driving details may fall into low-performance storage partitions, where it cannot be persisted efficiently or faces the risk of lossy compression; on the other hand, a large amount of low-value, redundant, and stable data may occupy valuable high-performance storage resources and bandwidth. How to achieve more refined and adaptive dynamic allocation of storage resources based on the inherent characteristics of the data stream and the real-time status of the system, while ensuring data integrity and system stability, is a technical problem that urgently needs to be solved in the field of large-scale automotive insurance data storage.
[0006] To address this, a method and system for sharding and partitioning automotive insurance data is proposed.
Summary of the Invention
[0007] The purpose of this invention is to provide a sharded and partitioned storage method and system for automotive insurance data, which achieves adaptive allocation of storage resources through operational condition entropy weight mapping. This includes acquiring heterogeneous time-series data streams containing business keys and operational condition loads, synchronously collecting storage cluster load and historical read / access frequencies; constructing an input-response binary graph based on data generation causal logic, performing wavelet packet decomposition and cross-correlation operations on nodes such as suspension travel and braking pressure, and extracting frequency domain complexity and time-series redundancy features; calculating storage entropy weight coefficients characterizing the incompressibility of data using a data mapping model; constructing a three-dimensional strategy space containing data value, system load, and query popularity, and adaptively activating the LSM-Tree lossless write or wavelet compression archiving engine based on the coordinate region.
[0008] To achieve the above objectives, the present invention provides the following technical solution: 1. A method for sharding and partitioning storage of automobile insurance data, characterized in that it includes: acquiring the heterogeneous data stream to be stored, including a business key consisting of a vehicle unique identifier, a policy index, and a timestamp, and working condition load data consisting of steering wheel angle, brake master cylinder pressure, three-axis acceleration, and suspension compression stroke, and simultaneously collecting the storage cluster IOPS load rate and historical read access frequency.
[0009] Business keys are used to lock vehicle entities and align data timing, and an input-response binary graph is constructed. Wavelet packet decomposition is performed on the suspension travel nodes to extract the frequency band energy distribution and calculate the signal complexity feature. The temporal redundancy feature between the steering wheel angle and acceleration nodes is calculated using a cross-correlation function, and the signal correlation feature between braking pressure and acceleration is calculated.
[0010] The signal complexity feature, temporal redundancy feature, and signal correlation feature are input into the data mapping model, which outputs the storage entropy weight coefficient.
[0011] The storage entropy weight coefficient, IOPS load rate, and access frequency are defined as the data value dimension, system load dimension, and access heat dimension, respectively. A three-dimensional storage strategy mapping space is constructed and coordinate points are determined. Based on the location of the coordinate points in the preset partition routing table, adaptive sharding and disk placement are performed on the heterogeneous data stream containing business keys.
[0012] Preferably, the vehicle gateway loads a pre-set DBC communication protocol file to perform bit-field parsing on the CAN bus broadcast messages, extracting physical values of steering wheel angle, brake master cylinder pressure, three-axis acceleration, and suspension compression stroke; simultaneously, the vehicle identification message is parsed to obtain the vehicle's unique identifier, the policy index is associated according to a pre-set mapping relationship, and a timestamp is added to each sampling point to generate a business key; using the timestamp of the data item with the highest sampling frequency as a benchmark, linear interpolation is performed on other low-frequency data items, and the business key and the interpolated operating load data are encapsulated to generate a time-aligned heterogeneous data frame; the number of disk read / write operations per second is read by a monitoring agent program deployed on the storage node, and the instantaneous traffic spikes are smoothed using an exponentially weighted moving average algorithm to calculate the IOPS load rate; the metadata monitoring service of the storage system is accessed, and the underlying read I / O request records for the data object within a set sliding time window are retrieved based on the vehicle's unique identifier, and the cumulative number of read I / O requests is counted as the historical read access frequency.
[0013] Preferably, the process of constructing the input-response binary graph includes: performing entity isolation on the heterogeneous data streams based on the vehicle's unique identifier in the business key; performing node encapsulation only within the data domain of the same vehicle entity, according to the order of data generation: encapsulating the original values of the steering wheel angle and brake master cylinder pressure into preceding active node data objects, encapsulating the three-axis acceleration and four-wheel suspension shock absorber compression stroke into subsequent passive node data objects, serializing them according to the timestamps in the business key, and storing them in the node storage area of the in-memory graph database; and initializing a temporal correlation sliding window in memory, with the window length set to cover a preset temporal correlation threshold. The preceding active node data objects in the real-time data stream are used as temporal index anchors, and the subsequent passive node data objects within the window whose timestamps lag behind the anchors are used as the target set to be matched. A temporal proximity retrieval algorithm is executed, traversing the target set and calculating the time difference between each object and the temporal index anchor. When the time difference is less than the preset temporal correlation threshold, a directed topological connection edge is established, and the time difference is written into the graph metadata as a temporal bias weight, generating the input-response binary graph.
[0014] Preferably, the calculation process of the signal complexity feature, temporal redundancy feature, and signal correlation strength feature includes: selecting the Daubechies wavelet as the basis function, performing multi-level wavelet packet decomposition on the time-series data stream of the suspension shock absorber compression stroke, and obtaining the reconstruction coefficients of each frequency band node in the last layer; calculating the sum of squares of the reconstruction coefficients of each frequency band node to obtain the energy value of the frequency band, and normalizing the energy values of all frequency bands into an energy probability distribution sequence; using the Shannon entropy formula to calculate the negative logarithmic weighted sum of the probability distribution sequence to obtain the power spectral entropy, which is used as the signal complexity feature; and constructing a sliding cross-correlation operator for the discrete time series. The steering wheel angle data sequence is set as the reference input sequence, and the triaxial acceleration data sequence is set as the response input sequence. Within a preset time sliding window, the cross-correlation coefficient sequence of the two is calculated, and the time displacement corresponding to the maximum value of the absolute value of the cross-correlation coefficient is identified. The time displacement is determined as the response lag time and used as a temporal redundancy feature. Normalized statistical correlation analysis is performed on the original value sequence of brake master cylinder pressure and the longitudinal acceleration component sequence in the triaxial acceleration within the same time window. The Pearson product moment correlation coefficient between the two sequences is calculated, and the correlation coefficient is defined as the braking response linearity and used as a signal correlation feature.
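The three features described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the Haar wavelet (the first-order Daubechies wavelet, db1) stands in for the unspecified Daubechies order; leaf-band energies are taken from the decomposition coefficients, which for an orthonormal wavelet carry the same energy as the reconstructed band signals; and all function names are hypothetical.

```python
import math

def haar_packet_energies(signal, levels):
    """Full wavelet packet tree with the Haar (db1) filter pair.
    Returns the energy (sum of squared coefficients) of each last-layer band.
    An odd trailing sample in a band is dropped (assumption for brevity)."""
    bands = [list(signal)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            starts = range(0, len(band) - 1, 2)
            next_bands.append([(band[i] + band[i + 1]) / math.sqrt(2) for i in starts])  # low-pass
            next_bands.append([(band[i] - band[i + 1]) / math.sqrt(2) for i in starts])  # high-pass
        bands = next_bands
    return [sum(c * c for c in band) for band in bands]

def spectral_entropy(energies):
    """Shannon entropy of the normalized band-energy distribution (natural log)."""
    total = sum(energies)
    if total == 0:
        return 0.0
    probs = [e / total for e in energies if e > 0]
    return -sum(p * math.log(p) for p in probs)

def pearson(x, y):
    """Pearson product-moment correlation; returns 0.0 for a constant sequence."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def response_lag(reference, response, max_lag):
    """Lag in samples (0..max_lag) where |normalized cross-correlation| peaks."""
    best_lag, best_abs = 0, -1.0
    for lag in range(max_lag + 1):
        r = pearson(reference[:len(reference) - lag], response[lag:])
        if abs(r) > best_abs:
            best_lag, best_abs = lag, abs(r)
    return best_lag
```

In this reading, `spectral_entropy(haar_packet_energies(suspension, levels))` would give the signal complexity feature, `response_lag(steering, acceleration, max_lag)` the temporal redundancy feature, and `pearson(brake_pressure, longitudinal_acceleration)` the signal correlation feature.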
[0015] Preferably, the data mapping model includes: a feature space projection layer configured with three parallel multilayer perceptron network branches, each branch containing an input layer, a hidden layer, and an output layer; the input layer receives complex signal features, temporal redundancy features, and signal correlation features as scalar inputs respectively; the hidden layer uses a hyperbolic tangent activation function for nonlinear transformation; and the output layer maps the transformed data to a high-dimensional Hilbert space, generating frequency domain feature vectors, temporal feature vectors, and correlation feature vectors respectively.
[0016] The tensor product correlation layer calculates the outer product of the frequency domain feature vector and the time-series feature vector to generate a two-dimensional feature matrix. It then calculates the outer product of the two-dimensional feature matrix and the correlated feature vector to construct a three-dimensional interactive feature tensor.
[0017] The manifold metric output layer collects three-dimensional interactive feature tensors from multiple consecutive time steps in a preset time buffer queue, constructs a tensor sequence sample set, calculates the second-order statistical covariance matrix of the sample set in the time dimension, maps the covariance matrix to the symmetric positive definite matrix manifold, calculates the Riemannian metric distance of the covariance matrix relative to the identity matrix, taken as the origin, in the Riemannian geometric space, and maps this distance to a storage entropy weight coefficient in the range 0 to 1 through a sigmoid nonlinear activation function.
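One possible reading of the tensor product and manifold metric layers can be written out as formulas. The notation is introduced here for illustration only: f_k, t_k, and c_k denote the frequency domain, temporal, and correlation feature vectors at time step k; K is the number of buffered time steps; ε is a small regularization term assumed here to keep the covariance matrix positive definite. The source does not say which Riemannian metric is used; the affine-invariant metric is assumed.

```latex
\begin{aligned}
\mathcal{T}_k &= \mathbf{f}_k \otimes \mathbf{t}_k \otimes \mathbf{c}_k,
  \qquad (\mathcal{T}_k)_{abc} = f_{k,a}\, t_{k,b}\, c_{k,c} \\
\mathbf{x}_k &= \operatorname{vec}(\mathcal{T}_k), \qquad
  \bar{\mathbf{x}} = \frac{1}{K}\sum_{k=1}^{K}\mathbf{x}_k \\
\mathbf{C} &= \frac{1}{K-1}\sum_{k=1}^{K}
  (\mathbf{x}_k-\bar{\mathbf{x}})(\mathbf{x}_k-\bar{\mathbf{x}})^{\top}
  + \varepsilon \mathbf{I} \\
d(\mathbf{C},\mathbf{I}) &= \lVert \log \mathbf{C} \rVert_F
  = \sqrt{\sum_i \ln^2 \lambda_i(\mathbf{C})} \\
w &= \sigma(d) = \frac{1}{1+e^{-d}}
\end{aligned}
```

Here λ_i(C) are the eigenvalues of C and w is the storage entropy weight coefficient. Because d is non-negative, a plain sigmoid applied to d yields values in [0.5, 1); an implementation would presumably shift or scale d first to use the full 0 to 1 range, which the source does not specify.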
[0018] Preferably, the process of constructing a three-dimensional storage strategy mapping space and determining coordinate points includes: obtaining the physical limit IOPS threshold of the storage cluster, calculating the ratio of the currently collected IOPS load rate to the physical limit IOPS threshold, and generating a system load saturation value between 0 and 1; calculating the maximum value of the global historical data access frequency of all data objects in the storage cluster, performing normalization processing on the historical data access frequency of the current data object and the maximum value, and generating an access heat quantile value between 0 and 1; and calling the storage entropy weight coefficient to instantiate a [database name missing] in memory. In a unit Euclidean cube space with a side length of 1, mutually orthogonal X-axis, Y-axis, and Z-axis are established. The storage entropy weight coefficient is mapped to the X-axis and defined as the data value coordinate; the system load saturation is mapped to the Y-axis and defined as the system load coordinate; and the access popularity quantile is mapped to the Z-axis and defined as the access popularity coordinate. Based on the data value coordinate, system load coordinate, and access popularity coordinate, these three coordinate values are directly used as the scalar components of the spatial basis vector to determine a unique spatial mapping point within the unit Euclidean cube space. The coordinates of the spatial mapping point are then used as three-dimensional coordinate points.
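The coordinate construction above amounts to two normalizations plus a pass-through of the entropy weight. The following is a minimal sketch; the function names, the clamping to [0, 1], and the handling of zero denominators are assumptions added for illustration.

```python
def clamp01(value):
    """Clamp a number into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))

def strategy_point(entropy_weight, iops_load, iops_limit, access_count, max_access_count):
    """Return (data value, system load, access heat) inside the unit cube.

    entropy_weight   -- storage entropy weight coefficient, already in [0, 1]
    iops_load        -- smoothed IOPS load of the cluster
    iops_limit       -- physical limit IOPS threshold of the cluster
    access_count     -- historical read access frequency of this data object
    max_access_count -- global maximum access frequency over all data objects
    """
    # Treat a missing limit as fully saturated (assumption, conservative choice).
    load_saturation = clamp01(iops_load / iops_limit) if iops_limit > 0 else 1.0
    # Treat an empty access history as cold data (assumption).
    heat = clamp01(access_count / max_access_count) if max_access_count > 0 else 0.0
    return (clamp01(entropy_weight), load_saturation, heat)
```

For example, an entropy weight of 0.8 on a cluster running at 4500 of 6000 IOPS, for an object read 30 times against a global maximum of 120, maps to the point (0.8, 0.75, 0.25).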
[0019] Preferably, the process of performing adaptive sharding and disk persistence includes: reading the high-order threshold set and low-order threshold set preset in the dynamic partition routing table, and comparing the value of each dimension of the three-dimensional coordinate point with the threshold sets; if all dimensions are higher than the high-order thresholds, the data is determined to fall into the core reserved area; if all dimensions are lower than the low-order thresholds, the data is determined to fall into the edge archive area; if neither condition is met, the data falls into the standard elastic subspace and is routed to the general transaction processing cluster. For data in the core reserved area, the log-structured merge-tree (LSM-tree) storage engine is called to append the data stream to the mutable buffer and trigger lossless columnar serialization, and the data is persisted as an immutable sorted string table file using run-length encoding. For data in the standard elastic subspace, a general relational database transaction engine based on a B+ tree structure is invoked, and the row-level lock manager is activated to ensure strong transactional consistency of the data; moderate lossless compression encoding is performed on the data page content, and an in-place update operation writes it to disk. For data in the edge archive area, a wavelet compression engine is invoked to perform multi-level discrete wavelet decomposition on the operating load data in the heterogeneous data stream, separating low-frequency approximation components from high-frequency detail components. The low-frequency approximation components are retained, the high-frequency detail components are truncated to zero, and only the resulting sparse data is stored.
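The routing rule above can be sketched as a three-way comparison. The threshold values and the region and engine labels below are hypothetical placeholders; the source only states that high-order and low-order threshold sets are preset in the routing table.

```python
# Hypothetical per-dimension thresholds: (data value, system load, access heat).
HIGH_THRESHOLDS = (0.7, 0.7, 0.7)
LOW_THRESHOLDS = (0.3, 0.3, 0.3)

# Hypothetical labels for the three storage paths named in the text.
ENGINE_BY_REGION = {
    "core_reserved": "lsm_tree_lossless",
    "standard_elastic": "bplus_tree_transactional",
    "edge_archive": "wavelet_lossy_archive",
}

def route(point, high=HIGH_THRESHOLDS, low=LOW_THRESHOLDS):
    """Map a 3-D strategy point to a storage region.

    All dimensions above the high thresholds -> core reserved area.
    All dimensions below the low thresholds  -> edge archive area.
    Anything else                            -> standard elastic subspace.
    """
    if all(p > h for p, h in zip(point, high)):
        return "core_reserved"
    if all(p < l for p, l in zip(point, low)):
        return "edge_archive"
    return "standard_elastic"
```

Note that the rule is strict in both directions: a point that is high on two axes but low on the third, such as (0.9, 0.1, 0.5), lands in the standard elastic subspace rather than in either extreme region.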
[0020] A sharded storage system based on automobile insurance data includes: acquiring heterogeneous data streams to be stored, including business keys consisting of vehicle unique identifier, policy index and timestamp, and operating load data consisting of steering wheel angle, brake master cylinder pressure, three-axis acceleration and suspension compression stroke, and synchronously collecting storage cluster IOPS load rate and historical read access frequency.
[0021] Business keys are used to lock vehicle entities and align data timing, and an input-response binary graph is constructed. Wavelet packet decomposition is performed on the suspension travel nodes to extract the frequency band energy distribution and calculate the signal complexity feature. The temporal redundancy feature between the steering wheel angle and acceleration nodes is calculated using a cross-correlation function, and the signal correlation feature between braking pressure and acceleration is calculated.
[0022] The signal complexity feature, temporal redundancy feature, and signal correlation feature are input into the data mapping model, which outputs the storage entropy weight coefficient.
[0023] The storage entropy weight coefficient, IOPS load rate, and access frequency are defined as the data value dimension, system load dimension, and access heat dimension, respectively. A three-dimensional storage strategy mapping space is constructed and coordinate points are determined. Based on the location of the coordinate points in the preset partition routing table, adaptive sharding and disk placement are performed on the heterogeneous data stream containing business keys.
[0024] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. By constructing an input-response binary data dependency graph based on data generation causal logic, it can effectively capture the inherent causal dependency of data streams in the generation process, avoiding the one-sided analysis of isolated data points by traditional methods. By using the time-series correlation sliding window and index anchor mechanism, it can accurately identify the time-series lag relationship between data, transforming the mechanical lag at the physical level into the topological weight at the data level, providing a rigorous data structure foundation for subsequent in-depth evaluation of the overall working condition complexity and correlation tightness of data streams, thereby improving the accuracy of feature identification of heterogeneous data streams under complex working conditions.
[0025] 2. By using a pre-set data flow dynamics feature mapping model and weighted Euclidean norm operation, the difference between different physical dimensions is eliminated by the standardization layer. The specific physical behavior of the vehicle is abstracted into general data coding characteristics (such as frequency domain uncertainty and timing asynchrony rate). This mechanism effectively establishes the mapping relationship between physical operating conditions and data storage value, enabling the system to objectively quantify the sparsity and coding difficulty of data from the perspective of information theory, and providing a quantifiable mathematical basis for distinguishing between high-value complex data and low-value redundant data.
[0026] 3. By constructing a three-dimensional storage strategy mapping space consisting of data value, system load, and access frequency dimensions, the system can adaptively select the optimal storage path based on the position of data flow coordinates in the dynamic partition routing table: for high-entropy data falling into the core retention area, a lossless columnar storage engine is activated to ensure data integrity and evidentiary validity; for low-entropy data falling into the edge archive area, a lossy wavelet compression engine is activated for downsampling processing to reduce storage costs. This multi-dimensional sharding mechanism effectively balances the contradiction between massive data writing pressure and limited storage resources, optimizing the overall performance and resource utilization of the database cluster.
Attached Figure Description
[0027] Figure 1 is a schematic diagram of the sharded and partitioned storage method for automobile insurance data according to the present invention.
[0028] Figure 2 is a schematic diagram of the process for generating the input-response binary graph according to the present invention.
[0029] Figure 3 is a schematic diagram of the sharded and partitioned storage system for automobile insurance data according to the present invention.
Detailed Implementation
[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0031] Referring to Figures 1 to 3, this invention provides a method and system for sharding and partitioning automotive insurance data, the technical solution of which is as follows:
Example 1
[0032] A method for sharding and partitioning storage based on automobile insurance data includes: acquiring heterogeneous time-series data streams as objects to be stored, including steering wheel angle, original value of brake master cylinder pressure, three-axis acceleration and compression stroke of four-wheel suspension shock absorbers, and synchronously collecting storage cluster IOPS load rate and historical read access frequency.
[0033] Based on the data generation logic, an input-response binary graph is constructed. Wavelet packet decomposition is performed on the suspension travel node to extract the frequency band energy distribution and calculate the power spectral entropy as the signal complexity feature. The response lag time between the steering wheel angle and acceleration nodes is calculated using the cross-correlation function as the temporal redundancy feature. The longitudinal transmission efficiency between braking pressure and acceleration is calculated as the signal correlation strength feature.
[0034] The storage entropy weight coefficient, IOPS load rate, and access frequency are defined as the data value dimension, system load dimension, and access popularity dimension, respectively. A three-dimensional storage strategy mapping space is constructed and coordinate points are determined. Adaptive sharding and disk persistence are performed based on the location of the coordinate points in the preset partition routing table: if it falls into the core reserved area, the lossless columnar storage engine is activated for full persistence; if it falls into the edge archiving area, the lossy wavelet compression engine is activated for downsampling archiving.
[0035] Furthermore, the heterogeneous data stream acquisition process includes: loading a pre-set DBC communication protocol file using the vehicle gateway; parsing the bit field of the CAN bus broadcast message to extract the physical values of steering wheel angle, brake master cylinder pressure, three-axis acceleration, and suspension compression stroke; synchronously parsing the vehicle identification message to obtain the vehicle's unique identifier; associating the policy index according to a pre-set mapping relationship; and adding a timestamp to each sampling point to generate a business key; using the timestamp of the data item with the highest sampling frequency as a benchmark, performing linear interpolation on other low-frequency data items; and encapsulating the business key and the interpolated operating load data to generate a time-aligned heterogeneous data frame; reading the number of disk read / write operations per second using a monitoring agent program deployed on the storage node; smoothing instantaneous traffic spikes using an exponentially weighted moving average algorithm to calculate the IOPS load rate; accessing the storage system's metadata monitoring service; retrieving the underlying read I / O request records for the data object within a set sliding time window based on the vehicle's unique identifier; and counting the cumulative number of read I / O requests as the historical read access frequency.
[0036] As the data access center at the edge, the vehicle gateway preloads a DBC database file describing the vehicle's CAN network communication matrix. This file defines in detail the mapping rules between each broadcast message frame ID on the CAN bus and the specific physical signals, including the start bit, bit length, and byte order of each signal in the data frame, as well as the scale factor and offset required to convert raw binary values into physical values. When the gateway listens to data frames on the CAN bus in real time, it matches the corresponding decoding rules according to the frame ID and obtains the raw value through bit-field extraction. The gateway then multiplies the raw value by the scale factor defined in the DBC file and adds the offset, thereby restoring the binary code stream to vehicle operating parameters with actual physical meaning, including steering wheel angle, electronic throttle opening feedback value, three-axis acceleration, four-wheel suspension shock absorber compression stroke, and original brake master cylinder pressure value. At the same time, the gateway extracts the vehicle's unique identifier (VIN) by parsing vehicle identification messages (such as OBD response frames or specific broadcast frames), and looks up the currently effective policy index in the preset vehicle-policy mapping table. Combined with the high-precision timestamp generated by the gateway's time synchronization module, the VIN, policy index, and timestamp are combined to generate a unique business key that identifies the ownership of the current data frame.
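The decoding step described above (bit-field extraction followed by scale and offset) can be sketched as follows. This is an illustrative fragment, not a DBC parser: it handles only little-endian (Intel byte order) signals, and the function name, parameters, and example signal layout are hypothetical.

```python
def decode_signal(frame, start_bit, bit_length, factor, offset, signed=False):
    """Extract one little-endian signal from a CAN payload and convert it
    to a physical value: physical = raw * factor + offset.

    frame      -- CAN data bytes (up to 8 for classic CAN)
    start_bit  -- position of the signal's least significant bit
    bit_length -- number of bits in the signal
    factor     -- scale factor from the DBC signal definition
    offset     -- offset from the DBC signal definition
    signed     -- interpret the raw field as two's complement
    """
    payload = int.from_bytes(frame, "little")
    raw = (payload >> start_bit) & ((1 << bit_length) - 1)
    if signed and raw >= 1 << (bit_length - 1):
        raw -= 1 << bit_length  # two's complement sign correction
    return raw * factor + offset
```

As a usage example under an assumed layout, a 16-bit steering angle signal at start bit 0 with factor 0.1 and offset -500 would be read with `decode_signal(frame, 0, 16, 0.1, -500.0)`. Big-endian (Motorola byte order) signals need a different bit-numbering scheme and are not covered by this sketch.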
[0037] Specifically, the pre-set mapping relationship is a key-value pair mapping table between the vehicle's unique identifier and the policy index, maintained by the system in a distributed cache cluster. This mapping table is updated asynchronously in real time by the core insurance business system through a message queue when generating new policies and endorsements, ensuring that the sharding system can always obtain the latest binding relationship between vehicles and policies, thereby guaranteeing the accuracy of business key construction.
[0038] Due to the inherent heterogeneity of sampling frequencies among sensors in different vehicle subsystems—for example, a triaxial accelerometer responsible for vehicle attitude monitoring typically uses high-frequency sampling to capture transient vibrations, while a steering wheel angle sensor responsible for driving intentions may use a relatively low sampling frequency—different data streams cannot be directly aligned on the time axis. To address this issue, the timestamp sequence of the data item with the highest sampling frequency (triaxial acceleration in this embodiment) is selected as the reference master clock. For other data items with lower sampling frequencies (such as steering wheel angle), the system detects whether there are missing values at the reference master clock's time point. For each missing data point, the system locates its two adjacent real sampling points and performs linear interpolation based on the time distance between the missing time point and these two real sampling points. This operation estimates the simulated value at the missing time by calculating the proportion of the time difference and assigning different weights to the two real values. Finally, a standardized data frame sequence with all fields strictly aligned on the time dimension is generated. The operating load data and the previously generated business key are structurally encapsulated to generate a heterogeneous data frame sequence containing complete business context information, eliminating timing errors and establishing entity boundaries for subsequent causal graph construction.
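The alignment step above can be sketched as linear interpolation of a low-rate signal onto the master clock. The function name is hypothetical, and holding the nearest real sample for master timestamps that fall outside the low-rate signal's range is an assumption the source does not address.

```python
from bisect import bisect_left

def align_to_master(master_ts, ts, values):
    """Resample (ts, values) onto master_ts by linear interpolation.

    master_ts -- timestamps of the highest-frequency signal (reference clock)
    ts        -- sorted timestamps of the low-frequency signal
    values    -- samples of the low-frequency signal, same length as ts
    """
    aligned = []
    for t in master_ts:
        i = bisect_left(ts, t)
        if i < len(ts) and ts[i] == t:
            aligned.append(values[i])       # a real sample exists at this time
        elif i == 0:
            aligned.append(values[0])       # before the first sample: hold (assumption)
        elif i == len(ts):
            aligned.append(values[-1])      # after the last sample: hold (assumption)
        else:
            t0, t1 = ts[i - 1], ts[i]
            w = (t - t0) / (t1 - t0)        # share of the gap already elapsed
            aligned.append(values[i - 1] * (1 - w) + values[i] * w)
    return aligned
```

The weight `w` is the proportion of the time difference mentioned in the text: the closer the missing time point is to a real sample, the more that sample contributes to the estimate.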
[0039] To accurately assess the real-time pressure on the storage cluster and eliminate interference from transient read / write spikes, a monitoring agent deployed on distributed storage nodes periodically reads the operating system's disk statistics file to obtain the number of read / write operations per second within the current time window. After obtaining the instantaneous IOPS value, it is not directly used as a load metric; instead, an exponentially weighted moving average algorithm is used for smoothing. The algorithm's logic is that the current effective load rate depends not only on the current instantaneous sample value but also on the historical load level retained from the previous moment. A smaller weight coefficient is assigned to the current instantaneous sample value, while a larger weight coefficient is assigned to the smoothed value from the previous moment. Through this weighted iterative approach, numerical spikes caused by individual sudden queries can be effectively smoothed out, thus obtaining an effective IOPS load rate value that reflects the steady-state pressure of the storage cluster, ensuring the stability of sharding decisions.
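The smoothing described above is the standard exponentially weighted moving average recurrence, s_t = α·x_t + (1 − α)·s_{t−1}, with a small α so that history dominates. A minimal sketch follows; the value α = 0.2 and the choice to seed the average with the first sample are assumptions.

```python
def ewma(samples, alpha=0.2):
    """Exponentially weighted moving average of instantaneous IOPS samples.

    alpha is the weight of the current sample; (1 - alpha) is the weight of
    the previous smoothed value. The first sample seeds the average.
    """
    smoothed = []
    previous = None
    for x in samples:
        previous = x if previous is None else alpha * x + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```

With samples of 100, 100, 1000, 100 IOPS and α = 0.2, the single spike of 1000 is recorded as 280 and decays to 244 on the next sample, so one burst query does not by itself push the load dimension to saturation.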
[0040] To quantify the access frequency of vehicle data at the business level, the metadata monitoring service of the storage system is accessed. This service records all I/O operation metadata at the file system or object storage gateway level in real time. Using the vehicle's unique identifier in the current business key as the object index, the underlying read request count within the most recent sliding time window (e.g., the past six months) is retrieved. This count is the historical data access frequency of the vehicle data. According to the principle of locality in computer storage, the higher the frequency, the more likely the data block is hot data that will be read again in the near future. Therefore, it should be given higher I/O priority in the tiered storage strategy.
[0041] By employing DBC-based underlying signal analysis and multi-frequency linear interpolation alignment technology, the problem of time axis misalignment caused by heterogeneous sampling frequencies of vehicle sensors is effectively solved. This provides a standardized time-series data foundation for the subsequent construction of high-precision causal dependency graphs. At the same time, the exponentially weighted moving average algorithm is used to smooth the storage cluster load, and combined with access frequency statistics based on log auditing, not only is the interference of instantaneous read / write spikes on sharding decisions effectively eliminated, but also the perception of full-dimensional data features from underlying physical conditions to upper-layer business activity is achieved. This significantly improves the preprocessing quality and feature confidence of heterogeneous data streams before they enter the database sharding and table partitioning strategy.
[0042] Furthermore, the process of constructing the input-response binary graph includes the following. Entity isolation is performed on the heterogeneous data streams based on the vehicle's unique identifier in the business key, and node encapsulation is performed only within the data domain of the same vehicle entity, according to the order of data generation: the original values of the steering wheel angle and brake master cylinder pressure are encapsulated into predecessor active node data objects, and the three-axis acceleration and four-wheel suspension shock absorber compression stroke are encapsulated into subsequent passive node data objects. These objects are serialized according to the timestamps in the business key and stored in the node storage area of the in-memory graph database, and the timing relationship is initialized in memory. A sliding window is used, with the window length set to cover a preset temporal correlation threshold. The predecessor active node data objects in the real-time data stream are used as temporal index anchors, and the subsequent passive node data objects within the window whose timestamps lag behind the anchors are used as the target set to be matched. A temporal proximity retrieval algorithm is executed, traversing the target set and calculating the time difference between each object and the temporal index anchor. When the time difference is less than the preset temporal correlation threshold, a directed topological connection edge is established, and the time difference is written as a temporal bias weight into the graph metadata to generate the input-response binary graph.
[0043] The system receives and processes a sequence of heterogeneous data frames, reads the business key contained in each data frame, and uses the vehicle unique identifier (VIN) in the business key as a hash bucket key to logically isolate the massive mixed data stream into several independent vehicle entity channels. This ensures that subsequent causal analysis is performed only within the physical context of the same vehicle. Within a specific vehicle entity channel, the system establishes the master-slave relationship of the data stream based on the signal transmission chain of the vehicle chassis dynamics: the original values of the steering wheel angle and brake master cylinder pressure, representing the driver's direct control intention, are defined as the excitation source of the system and encapsulated as predecessor active node data objects; correspondingly, the three-axis acceleration and four-wheel suspension shock absorber compression stroke, representing the vehicle's physical feedback, are defined as the response end of the system and encapsulated as subsequent passive node data objects. A globally unique memory address index is allocated to each encapsulated data object, and based on its high-precision timestamp, the object is serialized and stored in the node storage area of the in-memory graph database, thereby completing the initial mapping from the original physical signals to the graph node entities.
[0044] To capture causal dependencies in dynamic data streams, an independent temporal correlation sliding window is created for each vehicle entity channel. Considering the inherent physical lag in the mechanical system between receiving a command and generating a physical action, the length of this sliding window is designed to cover a preset signal response temporal correlation threshold. In this embodiment, with reference to automotive braking system industry standards, this threshold is set to 200 milliseconds. The threshold is based on the mechanical response cycle of the vehicle's actuators. Specifically, the system pre-calculates the longest physical delay from brake pedal engagement to longitudinal acceleration under standard load and adds a 10% redundancy as an upper limit, thereby ensuring effective filtering of spurious correlation signals caused by asynchronous sensor acquisition cycles. This value covers the typical engineering delay range from brake pedal engagement to brake master cylinder pressure build-up (approximately 50 milliseconds) and from caliper clamping to longitudinal deceleration of the vehicle body (approximately 100 milliseconds). Any response that lags by more than this duration usually originates from road surface excitation or other disturbances and no longer has a direct causal relationship with the control input. The incoming predecessor active node data objects and subsequent passive node data objects are injected into the sliding window in ascending order of timestamps to ensure that the data frames within the window retain complete temporal context information.
[0045] Whenever a new predecessor active node data object (in this embodiment, for example, brake master cylinder pressure data at a certain time T) enters the sliding window, it is marked as the current time-series index anchor point. The system then starts the retrieval algorithm and, within the window's cache range, selects all subsequent passive node data objects (e.g., triaxial acceleration data) whose timestamps strictly lag behind the anchor point (i.e., time greater than T). These objects form the target set to be matched, and the system iterates through each target object in the set, calculating the difference between its timestamp and the timestamp of the current time-series index anchor point.
[0046] The calculated time difference is logically compared with a preset temporal correlation threshold (200 milliseconds in this embodiment). Simultaneously, to eliminate spurious causal relationships, a physical direction consistency check is performed: for the brake master cylinder pressure node (input), it is checked whether its corresponding longitudinal acceleration node (response) exhibits a negative value (i.e., deceleration); for the steering wheel angle node (input), it is checked whether the sign direction of its corresponding lateral acceleration node (response) conforms to the vehicle dynamics steering model. Only when the time difference is less than the threshold and the physical direction satisfies the consistency constraint is it determined that the two data nodes satisfy the physical causal constraint. At this point, in the in-memory graph database, a unidirectional directed topological connection edge is instantiated between the brake pressure node (cause) and the acceleration node (effect). More importantly, this specific time difference is defined as a temporal bias weight and written into the graph metadata attribute field of this connection edge. Through this process, the transient physical response characteristics of the vehicle are solidified into the edge weight attribute in the graph database, thus completing the construction of the input-response binary metadata dependency graph.
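The anchor matching and consistency check described in the preceding paragraphs can be sketched as follows. This is an illustrative Python sketch only: the type names `ActiveNode` and `PassiveNode`, the function names, and the sign convention for steering (steering angle and lateral acceleration sharing a sign) are assumptions, since the source leaves the vehicle dynamics steering model unspecified.

```python
from collections import namedtuple

THRESHOLD_MS = 200  # temporal correlation threshold stated in this embodiment

# Predecessor active node (driver input) and subsequent passive node (vehicle response).
ActiveNode = namedtuple("ActiveNode", "node_id signal ts_ms value")
PassiveNode = namedtuple("PassiveNode", "node_id ts_ms ax ay")  # ax: longitudinal, ay: lateral

def direction_consistent(anchor, resp):
    """Physical direction consistency check between an input and a response."""
    if anchor.signal == "brake_pressure":
        return resp.ax < 0                       # braking must produce deceleration
    # Assumed convention: steering angle and lateral acceleration share a sign.
    return anchor.value * resp.ay > 0

def match_edges(anchor, window):
    """Return (source id, target id, lag in ms) for each response passing both checks."""
    edges = []
    for resp in window:
        lag = resp.ts_ms - anchor.ts_ms
        # The response must strictly lag the anchor and fall within the threshold.
        if 0 < lag < THRESHOLD_MS and direction_consistent(anchor, resp):
            edges.append((anchor.node_id, resp.node_id, lag))
    return edges

anchor = ActiveNode(10, "brake_pressure", 1000, 35.0)
window = [
    PassiveNode(1, 990, -1.0, 0.0),    # earlier than the anchor: ignored
    PassiveNode(2, 1080, -2.5, 0.0),   # 80 ms later and decelerating: edge created
    PassiveNode(3, 1150, 0.3, 0.0),    # accelerating: fails the direction check
    PassiveNode(4, 1250, -1.0, 0.0),   # 250 ms later: outside the threshold
]
print(match_edges(anchor, window))     # [(10, 2, 80)]
```

The third tuple element is the time difference that the paragraph stores on the edge as the temporal bias weight.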
[0047] In this embodiment, the memory graph database maintains a lightweight time-series topology data structure in the server memory heap. The specific construction method is as follows: Node storage area: implemented using a hash mapping table structure, with the globally unique ID of the data object as the key and the serialized binary data object as the value. This structure ensures that the insertion and query operations of nodes have a time complexity of O(1) in the scenario of massive high-frequency writing.
[0048] Time-series index chain: In order to support fast scanning of sliding windows, a bidirectional skip list is maintained for node objects of the same vehicle entity, which is strictly sorted by UTC timestamp. This allows the system to quickly locate the predecessor and successor nodes within a specific time range without having to scan the entire table.
[0049] Edge storage structure: An adjacency list is used. Each predecessor active node object maintains an outgoing edge list, which stores memory pointers to the subsequent passive nodes together with the calculated temporal bias weights.
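The three structures above (node storage area, time-series index chain, and edge storage) can be sketched together in Python. This is a simplified illustration: the class name `TimeSeriesGraph` and its methods are hypothetical, node IDs are assumed to be integers, and a sorted list maintained with `bisect` stands in for the bidirectional skip list, so insertion here is O(n) rather than the logarithmic cost a skip list would give.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict

class TimeSeriesGraph:
    def __init__(self):
        self.nodes = {}                        # node storage area: node_id -> serialized payload
        self.time_index = defaultdict(list)    # per vehicle: sorted [(ts_ms, node_id)], skip-list stand-in
        self.out_edges = defaultdict(list)     # adjacency list: active node_id -> [(passive node_id, lag_ms)]

    def add_node(self, vin, node_id, ts_ms, payload):
        self.nodes[node_id] = payload                      # O(1) average insert and lookup
        insort(self.time_index[vin], (ts_ms, node_id))     # keep the chain ordered by timestamp

    def add_edge(self, src_id, dst_id, lag_ms):
        self.out_edges[src_id].append((dst_id, lag_ms))    # lag_ms is the temporal bias weight

    def range_query(self, vin, start_ms, end_ms):
        """Node IDs of one vehicle with start_ms <= timestamp <= end_ms, in time order."""
        idx = self.time_index[vin]
        lo = bisect_left(idx, (start_ms, float("-inf")))
        hi = bisect_right(idx, (end_ms, float("inf")))
        return [node_id for _, node_id in idx[lo:hi]]

g = TimeSeriesGraph()
g.add_node("VIN1", 1, 100, b"brake")
g.add_node("VIN1", 2, 300, b"accel")
g.add_node("VIN1", 3, 200, b"accel")
g.add_edge(1, 3, 100)
print(g.range_query("VIN1", 150, 300))   # [3, 2]
```

The range query shows why a time-ordered index is kept per vehicle: the sliding window can locate its candidates without scanning every stored node.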
[0050] In a preferred embodiment, the process of constructing the input-response binary graph further includes a graph topology dynamic pruning step. A temporal deviation variance pool is established for each instantiated directed topological connection edge in the in-memory graph database, and the statistical variance of the N most recently calculated response lag times for that connection edge is recorded in real time, where N is a preset positive integer. The input-response binary graph is periodically scanned, and the statistical variance value is compared with a preset causal stability threshold. When the statistical variance value is greater than the causal stability threshold, it is determined that there is physical transmission decoupling between the node objects at both ends of the connection edge; a topology pruning operation is performed, the directed topological connection edge is physically deleted from the graph database, and the predecessor active node data object associated with the connection edge is marked as an isolated point. In response to the isolated point marking, the corresponding heterogeneous temporal data stream is routed to an abnormal data isolation area, and separate log audit storage is performed.
[0051] Specifically, a temporal deviation variance pool is established for each instantiated directed topological connection edge in the memory graph database. The statistical variance value of the response lag time calculated for the connection edge in the most recent N times (N is set to 50 in this embodiment) is recorded in real time. The input-response binary graph is scanned periodically (e.g., every 10 seconds). The statistical variance value of each edge is compared with a preset causal stability threshold. When the statistical variance value of a connection edge is detected to be strictly greater than the causal stability threshold, it is determined that there is physical transmission decoupling between the node objects at both ends of the connection edge (e.g., data jump caused by sensor loosening). At this time, a topology pruning operation is performed to physically delete the directed topological connection edge from the graph database. The predecessor active node data object associated with the connection edge is marked as an isolated point. In response to the isolated point marking, the subsequent feature extraction of the node is no longer performed. Instead, the corresponding heterogeneous temporal data stream is directly routed to the abnormal data isolation area and stored separately for manual verification or subsequent offline model training and repair. By performing dynamic pruning of the graph topology, spurious causal connections caused by sensor drift or mechanical aging can be eliminated in real time. The cleaning mechanism based on time-series variance effectively prevents noisy data from interfering with subsequent feature extraction, ensuring the accuracy of the storage entropy weight coefficient calculation, thereby improving the robustness of the storage strategy in long-term operating environments.
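The variance-based pruning can be sketched as below. This is a minimal Python illustration, not the embodiment's code: the names `EdgeLagPool` and `prune`, the use of population variance, and the threshold value in the usage lines are assumptions; the source gives N = 50 but does not state the variance estimator or the causal stability threshold.

```python
from collections import deque
from statistics import pvariance

class EdgeLagPool:
    """Keeps the N most recent response lag times of one connection edge."""
    def __init__(self, n=50):
        self.lags = deque(maxlen=n)       # the oldest lag is dropped automatically

    def record(self, lag_ms):
        self.lags.append(lag_ms)

    def variance(self):
        # Assumption: population variance; fewer than two samples counts as stable.
        return pvariance(self.lags) if len(self.lags) >= 2 else 0.0

def prune(edges, pools, stability_threshold):
    """Delete unstable edges in place and return the IDs of isolated active nodes.

    edges: dict mapping (src_id, dst_id) -> latest lag
    pools: dict mapping (src_id, dst_id) -> EdgeLagPool
    """
    isolated = set()
    for key in list(edges):
        if pools[key].variance() > stability_threshold:   # strictly greater, as in the text
            del edges[key]
            isolated.add(key[0])
    return isolated

edges = {(1, 2): 80, (3, 4): 90}
pools = {key: EdgeLagPool() for key in edges}
for lag in (80, 82, 81, 79):
    pools[(1, 2)].record(lag)      # stable lag: variance 1.25
for lag in (20, 180, 40, 190):
    pools[(3, 4)].record(lag)      # erratic lag, e.g. a loose sensor
print(prune(edges, pools, stability_threshold=100.0))     # {3}
```

In a fuller system the returned IDs would drive the routing of the affected stream to the abnormal data isolation area.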
[0052] By constructing an input-response binary metadata dependency graph, heterogeneous data is transformed from discrete temporal streams into a topological structure with causal logic. Utilizing a temporal correlation sliding window and index anchor mechanism, dynamic hysteresis features between signals can be accurately captured through timestamp comparison at the data level without constructing complex vehicle dynamics equations. The calculated time difference is solidified into temporal bias weights in edge attributes. This not only achieves an effective mapping from physical condition features to data structure features but also provides rigorous graph metadata support for subsequent quantification of the temporal redundancy and incompressibility of the data stream through cross-correlation analysis, thereby improving the accuracy of feature extraction.
[0053] Furthermore, key frames are extracted from the heterogeneous data stream at a preset sampling frequency, or the heterogeneous data stream is divided into batches based on a preset data block size. Frequency band energy distribution is extracted from the sampled frames or data blocks, and signal complexity features are calculated. The calculation process for the signal complexity features, temporal redundancy features, and signal correlation strength features includes the following. The Daubechies wavelet is selected as the basis function, multi-level wavelet packet decomposition is performed on the temporal data stream of the suspension shock absorber compression stroke, and the reconstruction coefficients of each frequency band node in the last layer are obtained. The sum of squares of the reconstruction coefficients of each frequency band node is calculated to obtain the energy value of the frequency band, and the energy values of all frequency bands are normalized into an energy probability distribution sequence. The Shannon entropy formula is used to calculate the negative logarithmic weighted sum of the probability distribution sequence, yielding the power spectral entropy, which is used as the signal complexity feature. A sliding cross-correlation operator for discrete time series is constructed, with the steering wheel angle data sequence as the reference input sequence and the triaxial acceleration data sequence as the response input sequence. The cross-correlation coefficient sequence of the two is calculated within a preset time sliding window, and the time displacement corresponding to the maximum absolute value of the cross-correlation coefficient is identified. This time displacement is determined as the response lag time and used as the temporal redundancy feature. Normalized statistical correlation analysis is performed on the original value sequence of brake master cylinder pressure and the longitudinal acceleration component sequence in the triaxial acceleration within the same time window. The Pearson product-moment correlation coefficient between the two sequences is calculated and defined as the braking response linearity, which is used as the signal correlation strength feature.
[0054] Considering the high-frequency concurrent write characteristics of heterogeneous data streams and the computational overhead of subsequent tensor operations, the system does not perform feature extraction individually for each frame of instantaneous data. Instead, it employs a windowed batch processing mechanism. Specifically, before performing feature calculations, a preset sampling frequency (e.g., 10Hz) or a preset data block size (e.g., 4KB or 100 consecutive sampling points) is set. Using a sliding window buffer, the incoming heterogeneous data stream is segmented or downsampled to extract key frame sequences or construct standardized data blocks. Subsequently, the calculations of the aforementioned signal complexity features, temporal redundancy features, and signal correlation strength features are all performed on the sampled frame sequence or data block as a whole unit. In other words, for all the original microscopic data within the same time window or data block, a unique set of frequency band energy distribution and storage entropy weight coefficients are uniformly calculated and output. This batch processing strategy significantly reduces the frequency of calls to higher-order algorithms such as Riemannian manifold metrics while preserving the macroscopic operating characteristics of the data stream (such as the overall driving intensity), thus achieving an optimal balance between feature extraction accuracy and system write throughput.
[0055] The digital signal processing module is invoked to perform multi-level frequency domain analysis on the time-series data stream of the suspension shock absorber compression stroke. In this embodiment, the Daubechies wavelet is selected as the basis function, utilizing its orthogonality and compact support to capture transient changes in mechanical vibration. A three-level wavelet packet decomposition is performed on the original data stream, recursively dividing the signal spectrum into eight independent sub-band nodes. For each sub-band node, its reconstruction coefficient sequence is extracted, and the sum of squares of all coefficients in the sequence is calculated. This result is defined as the energy value of that frequency band. Subsequently, the system normalizes the energy values of all sub-bands by calculating the proportion of each frequency band's energy to the total energy, thereby generating an energy probability distribution sequence. Finally, based on Shannon's information theory, the negative logarithmic weighted sum of this probability distribution sequence is calculated to obtain the power spectral entropy. This entropy value is directly defined as the signal complexity feature; the higher the value, the more complex and disordered the frequency components of the suspension vibration, and the greater the difficulty of data encoding and compression.
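The band-energy entropy can be sketched in pure Python as follows. Several choices here are assumptions, because the source does not specify them: the 4-tap Daubechies filter (db2), periodic boundary extension, and the natural logarithm. The sketch also sums the decomposition coefficients directly; for an orthogonal periodized transform these carry the same energy as the reconstructed sub-band signal. The function names are hypothetical.

```python
import math

S3, R2 = math.sqrt(3.0), math.sqrt(2.0)
# db2 low-pass analysis filter and its quadrature-mirror high-pass filter.
LO = [(1 + S3) / (4 * R2), (3 + S3) / (4 * R2), (3 - S3) / (4 * R2), (1 - S3) / (4 * R2)]
HI = [(-1) ** k * LO[3 - k] for k in range(4)]

def analyze(x, filt):
    """Circular filtering followed by downsampling by two."""
    n = len(x)
    return [sum(filt[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]

def wavelet_packet_energies(x, levels=3):
    """Energy of each of the 2**levels terminal nodes (natural order, not frequency order)."""
    nodes = [list(x)]
    for _ in range(levels):
        # Unlike a plain wavelet transform, every node is split again, not only the low band.
        nodes = [band for node in nodes for band in (analyze(node, LO), analyze(node, HI))]
    return [sum(c * c for c in node) for node in nodes]

def spectral_entropy(energies):
    """Shannon entropy of the normalized band-energy distribution."""
    total = sum(energies)
    if total == 0:
        return 0.0                       # a flat zero signal carries no band energy
    probs = [e / total for e in energies if e > 0]
    return -sum(p * math.log(p) for p in probs)

# Length should be divisible by 2**levels and leave at least 4 samples per node.
stroke = [math.sin(2 * math.pi * 5 * i / 64) + 0.3 * math.sin(2 * math.pi * 23 * i / 64)
          for i in range(64)]
energies = wavelet_packet_energies(stroke)
print(len(energies), spectral_entropy(energies))
```

With eight bands and the natural logarithm the entropy lies between 0 (all energy in one band) and ln 8 (energy spread evenly), matching the paragraph's reading that a higher value means a more disordered spectrum.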
[0056] To evaluate the synchronicity of the driver input signal and the vehicle body response signal on the time axis, a discrete-time series sliding cross-correlation calculator was constructed. The driver's steering wheel angle data sequence was set as the reference input signal, and the vehicle body's three-axis acceleration data sequence was set as the response input signal. Within a preset time sliding window, the calculator calculates the cross-correlation coefficient of the two signals at different time displacements by gradually shifting the time alignment points of the two signals. By iterating through the calculation results, the calculator identifies the moment when the absolute value of the cross-correlation coefficient reaches its maximum value, and the time displacement corresponding to that moment is determined as the response lag time. This time value is used as a temporal redundancy feature. The shorter the lag time, the closer the phase of the input and output waveforms are, the higher the temporal redundancy between the data, and the higher the efficiency of compression using techniques such as differential coding.
[0057] Within the preset time sliding window, the cross-correlation coefficient sequence of the two signals is calculated, and the maximum of the absolute value of the cross-correlation coefficient (i.e., the peak correlation coefficient) and its corresponding time displacement are identified. To compensate for the limitation that cross-correlation analysis is only applicable to linear systems, a linearity confidence judgment is then performed: the peak correlation coefficient is compared with a preset linearity threshold (e.g., 0.6). This threshold is intended to distinguish linear stable conditions from nonlinear abnormal conditions, and is selected based on the statistical distribution characteristics of the sensor data at the instability critical point in the vehicle dynamics simulation model. If the peak correlation coefficient is greater than the threshold, the vehicle dynamic response is in a linear range (such as smooth steering or braking); the time displacement is confirmed as an effective response lag time and used as the temporal redundancy feature. If the peak correlation coefficient is less than or equal to the threshold, the vehicle is considered to be in a highly nonlinear condition (such as a complex road surface, emergency avoidance, triggering of the anti-lock braking system, or vehicle instability). In this case the data has higher unpredictability and storage value, the linear time delay estimation is deemed invalid, and the temporal redundancy feature is directly set to zero to characterize the extremely low temporal redundancy and high incompressibility of the data in this state.
[0058] In this embodiment, the sliding cross-correlation unit for discrete time series is not a dedicated physical hardware circuit, but a logic processing module running in the memory of the vehicle gateway or edge computing node. Its construction and operation logic are as follows. The unit allocates two fixed-length first-in-first-out circular queues in memory, used to cache the reference input sequence (i.e., steering wheel angle data) and the response input sequence (i.e., triaxial acceleration data), respectively. The length of these two queues is set to cover a complete analysis time window (e.g., 100 sampling points) to ensure that the system can always look back over complete waveform data within the most recent time period. As new data is written in real time, the oldest data is automatically overwritten, thus keeping the window dynamically updated. The unit keeps the reference input sequence (steering angle) stationary, defining it as the baseline template. Subsequently, within a preset time delay search range (e.g., shifting forward and backward by 20 sampling points), the unit shifts the response input sequence (acceleration) step by step. In each displacement step, the unit multiplies the corresponding data points of the overlapping parts of the two sequences and sums the products. This process logically simulates sliding the acceleration waveform left and right on the time axis to find the displacement at which its overlap with the steering angle waveform is the highest. After a sliding scan, the unit generates a series of correlation coefficient values, which form a correlation curve. The unit traverses the curve and identifies the peak point with the largest absolute value. This peak point marks the displacement at which the steering action and the vehicle acceleration response waveforms are most similar. The displacement index value corresponding to this peak point (i.e., the number of sampling points by which the response is shifted) is read and converted into a physical time unit (milliseconds) using the sampling frequency. This final time value is taken by the system as the response lag time, which quantifies the mechanical transmission delay between the vehicle receiving the driving command and producing a physical action.
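The sliding scan and the linearity confidence check described earlier can be sketched together in Python. The sketch rests on assumptions: both sequences have equal length, the coefficient is normalized by the energies of the mean-removed sequences so that it is comparable with a threshold such as 0.6, and a positive shift means the response lags the reference. The function names are hypothetical.

```python
import math

def response_lag(reference, response, sample_rate_hz, max_shift=20):
    """Return (lag in ms, peak coefficient) from a sliding cross-correlation scan."""
    n = len(reference)
    ref_mean, resp_mean = sum(reference) / n, sum(response) / n
    ref = [v - ref_mean for v in reference]
    resp = [v - resp_mean for v in response]
    scale = math.sqrt(sum(v * v for v in ref) * sum(v * v for v in resp))
    if scale == 0:
        return 0.0, 0.0                          # a flat signal has no usable correlation
    best_shift, best_coeff = 0, 0.0
    for shift in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -shift), min(n, n - shift)   # overlapping part only
        coeff = sum(ref[i] * resp[i + shift] for i in range(lo, hi)) / scale
        if abs(coeff) > abs(best_coeff):
            best_shift, best_coeff = shift, coeff
    return best_shift * 1000.0 / sample_rate_hz, best_coeff

def temporal_redundancy(lag_ms, peak, linearity_threshold=0.6):
    """Keep the lag only when the peak correlation passes the linearity check."""
    return lag_ms if abs(peak) > linearity_threshold else 0.0

# A steering pulse and an acceleration response delayed by 5 samples at 50 Hz.
steering = [math.exp(-((i - 40) / 4.0) ** 2) for i in range(100)]
accel = [0.0] * 5 + steering[:95]
lag_ms, peak = response_lag(steering, accel, sample_rate_hz=50)
print(lag_ms, temporal_redundancy(lag_ms, peak))   # 100.0 100.0
```

Five sampling points at 50 Hz convert to 100 milliseconds, which is the index-to-time conversion the paragraph describes.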
[0059] The preset time sliding window used for feature calculation in this step is functionally different from the window used in the aforementioned graph construction. In this embodiment, in order to accurately capture the transient delay features between driving input and vehicle response, the length of the preset time sliding window is set to 100 sampling points, which corresponds to two seconds of physical time at a sampling rate of 50Hz. This length is chosen based on the Nyquist sampling theorem and the vehicle dynamic response constant, so that the complete response waveform is included while avoiding the excessive computational overhead of irrelevant historical data.
[0060] Next, the signal correlation strength feature is calculated. The original value sequence of the brake master cylinder pressure and the longitudinal acceleration component sequence in the triaxial acceleration are selected within the same time window, and normalized statistical correlation analysis is performed. Specifically, the Pearson product-moment correlation coefficient between the two sequences is calculated, which is their covariance divided by the product of their respective standard deviations. This correlation coefficient is defined as the braking response linearity and used as the signal correlation strength feature. This index is dimensionless and can objectively measure the degree of linear matching between the braking input and the vehicle deceleration response. If the calculated linearity is close to one, it indicates that the braking system is operating in the ideal linear region; if the value is low, it indicates that there may be nonlinear interference. This feature directly reflects the physical complexity of the operating condition.
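The braking response linearity can be computed with a few lines of Python. The function name `pearson` and the sample values are illustrative. One assumption is worth stating: with a signed longitudinal acceleration, rising brake pressure produces a more negative acceleration and the raw coefficient approaches -1, so the sketch correlates pressure against deceleration (the negated acceleration) to obtain a value near one as the paragraph describes.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation: covariance over the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0                      # assumption: a constant sequence yields zero linearity
    return cov / (sx * sy)              # the common 1/n factors cancel

pressure = [0.0, 10.0, 20.0, 30.0, 40.0]            # brake master cylinder pressure
accel_x = [0.0, -1.0, -2.1, -2.9, -4.0]             # signed longitudinal acceleration
deceleration = [-a for a in accel_x]
print(pearson(pressure, deceleration))              # close to 1: near-linear braking
```

Because the coefficient is a ratio of quantities in the same combined units, it is dimensionless, as the paragraph notes.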
[0061] By employing an algorithm combining wavelet packet decomposition and Shannon information entropy, the disorder of suspension data is quantified from a microscopic perspective of frequency domain energy distribution, transforming the physical complexity of mechanical vibration into a coded entropy value in information theory. By combining time delay estimation using cross-correlation functions with response linearity calculation based on Pearson correlation coefficients, a full-dimensional feature system encompassing time-domain phase, frequency-domain energy, and amplitude-domain linearity is constructed. This not only overcomes the shortcomings of traditional statistical indicators in their sensitivity to non-stationary signals but also provides high-confidence input features with clear physical meaning for subsequent data mapping models, thereby effectively improving the accuracy of assessing the compression potential of heterogeneous data streams.
[0062] Furthermore, the data mapping model includes: a feature space projection layer configured with three parallel multilayer perceptron network branches, each branch containing an input layer, a hidden layer, and an output layer; the input layer receives the signal frequency domain complexity, temporal redundancy, and signal correlation strength as scalar inputs respectively; the hidden layer uses the hyperbolic tangent activation function for nonlinear transformation; and the output layer maps the transformed data to a high-dimensional Hilbert space, generating frequency domain feature vectors, temporal feature vectors, and correlation feature vectors respectively.
[0063] The tensor product correlation layer calculates the outer product of the frequency domain feature vector and the time-series feature vector to generate a two-dimensional feature matrix. It then calculates the outer product of the two-dimensional feature matrix and the correlated feature vector to construct a three-dimensional interactive feature tensor.
[0064] The manifold metric output layer collects three-dimensional interactive feature tensors from multiple consecutive time steps in a preset time buffer queue, constructs a tensor sequence sample set, calculates the second-order statistical covariance matrix of the sample set in the time dimension, maps the covariance matrix to the symmetric positive definite matrix manifold, calculates the Riemann metric distance of the covariance matrix relative to the identity matrix, taken as the origin, in the Riemann geometric space, and maps the Riemann metric distance to a storage entropy weight coefficient with a numerical range between 0 and 1 through a sigmoid nonlinear activation function.
[0065] The feature space projection layer is configured with three parallel and structurally independent multilayer perceptron network branches, each employing a three-layer fully connected feedforward neural network architecture. Specifically, each branch contains an input node, two cascaded hidden layers, and an output layer. The first hidden layer has 16 neurons, and the second hidden layer has 8 neurons. Both hidden layers use the hyperbolic tangent function as the nonlinear activation function to map the input signal to a numerical range of -1 to 1, preserving the positive and negative fluctuations of the data. The output layer has 4 neurons and does not use an activation function, directly outputting a four-dimensional linear feature vector. In other words, the one-dimensional physical feature scalar received by the input layer is mapped layer by layer through the 16-dimensional first hidden layer and the 8-dimensional second hidden layer, and is ultimately transformed into a four-dimensional feature vector. Each layer is configured with a bias term with an initial value of 0.01 to provide a linear offset. During model initialization, a Xavier uniform distribution initialization strategy is used to assign values to the weight matrices of each layer so that the variance of the signal remains consistent across the layers of the network. The three branches receive the signal frequency domain complexity, temporal redundancy, and signal correlation strength as scalar inputs, respectively. After nonlinear transformation by the hidden layers, the output layer maps the transformed data to the Hilbert space, which is 4-dimensional in this embodiment. Through this process, the original three scalar physical values are transformed into three dense 4x1 feature vectors: a frequency domain feature vector, a temporal feature vector, and a correlation feature vector. Before the model is put into use, the weight matrices need to be pre-trained offline. The specific training method is as follows: a historical dataset containing at least five typical driving conditions, such as stable, aggressive, and congested driving conditions, is selected as the sample source, and the storage priority evaluated by the expert system is used as the supervision label. After the driving condition features are input into the model, the residual between the output value and the supervision label is calculated, and the weight parameters are corrected from the output layer back to the projection layer using the backpropagation mechanism until the model's classification accuracy for known driving conditions exceeds 95%, thereby obtaining weight matrices with initial physical perception capability.
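The forward pass of one projection branch can be sketched in pure Python. This shows only the 1-16-8-4 structure, the hyperbolic tangent hidden layers, the linear output, the 0.01 bias, and Xavier uniform initialization; the class name `Branch`, the fixed random seed, and the omission of the offline training loop are assumptions of the sketch, so the weights here are untrained.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Weight matrix with entries drawn from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

class Branch:
    SIZES = [1, 16, 8, 4]     # scalar input, two hidden layers, 4-dimensional output

    def __init__(self, seed=0):
        rng = random.Random(seed)            # seed is illustrative, for reproducibility only
        pairs = list(zip(self.SIZES, self.SIZES[1:]))
        self.weights = [xavier_uniform(i, o, rng) for i, o in pairs]
        self.biases = [[0.01] * o for _, o in pairs]

    def forward(self, scalar):
        h = [scalar]
        last = len(self.weights) - 1
        for layer, (w, b) in enumerate(zip(self.weights, self.biases)):
            z = [sum(wij * hj for wij, hj in zip(row, h)) + bi for row, bi in zip(w, b)]
            # tanh on the hidden layers, no activation on the output layer
            h = z if layer == last else [math.tanh(v) for v in z]
        return h

# One branch per physical feature; the input values are made up for illustration.
freq_vec = Branch(seed=1).forward(1.7)    # power spectral entropy
time_vec = Branch(seed=2).forward(0.1)    # response lag time
corr_vec = Branch(seed=3).forward(0.92)   # braking response linearity
print(len(freq_vec), len(time_vec), len(corr_vec))   # 4 4 4
```

Each branch is independent, so the three scalars are projected without sharing weights, which is what "structurally independent" implies in the paragraph.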
[0066] The tensor product correlation layer receives the three feature vectors generated above and constructs a multi-dimensional feature interaction structure by expanding the dimensions step by step. First, it performs a first-level tensor product operation, combining the 4-dimensional frequency domain feature vector with the 4-dimensional temporal feature vector using the outer product. Specifically, it multiplies each element of the frequency domain vector by each element of the temporal vector, generating a two-dimensional feature matrix with 4 rows and 4 columns. Each element in this matrix represents the coupling strength between a specific frequency component and a specific time delay component. Subsequently, it performs a second-level tensor product operation, expanding the two-dimensional feature matrix with the 4-dimensional correlation feature vector into three dimensions. Specifically, it multiplies each element of the two-dimensional matrix by each element of the correlation vector, constructing a three-dimensional interactive feature tensor whose length, width, and height are each 4. This tensor consists of 64 numerical elements, one for every combination of a frequency domain component, a temporal component, and a correlation component, providing rich high-order statistical features for the subsequent manifold measurement.
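The two-level tensor product reduces to a triple product of vector elements, as the following Python sketch shows. The function names `outer3` and `flatten` and the sample vectors are illustrative assumptions.

```python
def outer3(freq_vec, time_vec, corr_vec):
    """Three-dimensional tensor with T[i][j][k] = freq_vec[i] * time_vec[j] * corr_vec[k]."""
    # The first-level product freq_vec[i] * time_vec[j] is the 4x4 feature matrix;
    # multiplying by corr_vec[k] adds the third dimension.
    return [[[fi * tj * ck for ck in corr_vec] for tj in time_vec] for fi in freq_vec]

def flatten(tensor):
    """Row-major flattening of the tensor into a single vector."""
    return [v for plane in tensor for row in plane for v in row]

f = [1.0, 2.0, 3.0, 4.0]
t = [1.0, 0.0, 0.0, 0.0]
c = [0.0, 1.0, 0.0, 0.0]
tensor = outer3(f, t, c)
print(tensor[2][0][1], len(flatten(tensor)))   # 3.0 64
```

Flattening yields the 64-dimensional vector that the manifold metric output layer collects at each time step.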
[0067] The manifold metric output layer employs a sliding-window statistical analysis method to construct Riemannian manifold features. First, a statistical buffer queue of 100 time steps is maintained in memory, collecting the three-dimensional interactive feature tensors from the tensor product correlation layer in real time. Each tensor is flattened into a 64-dimensional vector, so that the queue forms a statistical matrix containing 100 samples. Since the number of samples is greater than the feature dimension, the statistical matrix maintains good rank properties. Next, the second-order covariance matrix of this statistical matrix is calculated. To prevent singularity or non-invertibility caused by excessive data correlation, a diagonal loading regularization operation is performed: an identity matrix with the same dimension as the covariance matrix is constructed and multiplied by a small regularization coefficient (set to 10⁻⁶, i.e., one part per million, in this embodiment). The result is then superimposed on the main diagonal of the covariance matrix, forcing it to satisfy the symmetric positive definite property, so that the matrix does not degenerate into a singular matrix in the subsequent Riemannian manifold calculations. Subsequently, the generalized distance of the regularized covariance matrix in the Riemannian geometric space is calculated using the log-Euclidean metric. The specific calculation steps are as follows. The first step is to perform a matrix logarithm operation on the regularized covariance matrix; utilizing the isomorphism property of the tangent space of the Riemannian manifold, the matrix is losslessly mapped from the curved manifold space to the flat tangent space, generating a logarithmic covariance matrix in Euclidean space. The second step is to calculate the Frobenius norm of this logarithmic covariance matrix, i.e., the arithmetic square root of the sum of the squares of all elements in the matrix. This norm is defined as the Riemann metric distance relative to the reference origin (the identity matrix). Finally, the Riemann metric distance is mapped to a storage entropy weight coefficient in the range 0 to 1 using a sigmoid nonlinear activation function. The center offset of the activation function is set to 0.5, and the scaling factor is set to 10. This nonlinearly compresses the geometric distance in the Riemann space to a standard weight range, thereby quantifying the complexity of the operating conditions.
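In formula form, the steps above can be summarized as follows. The symbols are notation introduced here for illustration: C is the covariance of the 100 × 64 statistical matrix, ε the regularization coefficient, λ_i the eigenvalues of the regularized matrix, d the Riemann metric distance, c the center offset, k the scaling factor, and w the storage entropy weight coefficient. The exact form of the sigmoid argument is an assumption, since the text gives only the parameter values.

```latex
\begin{aligned}
C_{\varepsilon} &= C + \varepsilon I_{64}, \qquad \varepsilon = 10^{-6},\\
C_{\varepsilon} &= U \,\mathrm{diag}(\lambda_1,\dots,\lambda_{64})\, U^{\top}, \qquad \lambda_i > 0,\\
L &= \log C_{\varepsilon} = U \,\mathrm{diag}(\ln\lambda_1,\dots,\ln\lambda_{64})\, U^{\top},\\
d &= \lVert L - \log I_{64} \rVert_F = \lVert L \rVert_F
   = \sqrt{\sum_{i,j} L_{ij}^{2}} = \sqrt{\sum_{i=1}^{64} (\ln\lambda_i)^{2}},\\
w &= \frac{1}{1 + e^{-k\,(d - c)}}, \qquad c = 0.5,\quad k = 10.
\end{aligned}
```

Because the logarithm of the identity matrix is the zero matrix, the distance to the reference origin reduces to the Frobenius norm of the logarithmic covariance matrix itself.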
[0068] In a preferred embodiment, the step of inputting signal frequency domain complexity features, temporal redundancy features, and signal correlation strength features into the data mapping model further includes performing an online feedback calibration step: deploying a data retrieval counter in the edge archiving area to statistically analyze the data retrieval rate of compressed and archived data that has been decompressed and read by the business system within a preset monitoring period; generating a negative feedback gradient signal when the retrieval rate of data with a specific feature combination is found to be strictly greater than a preset heat escape threshold; and propagating the negative feedback gradient signal back to the data flow dynamics feature mapping model, using an online stochastic gradient descent algorithm to fine-tune and update the synaptic weight matrix of the multilayer perceptron in the feature space projection layer, so as to increase the output value of the storage entropy weight coefficient corresponding to the specific feature combination in the next mapping calculation.
[0069] First, a data retrieval counter is deployed at the data read interface of the edge archiving area (i.e., the cold storage medium used to store wavelet compressed data). Although this area is designed to store low-value historical data, business systems may still initiate read requests due to audit compliance, backlog case reopening, or other needs. A rolling monitoring cycle (e.g., 24 hours) is set to count in real time the number of times each type of data block with similar physical characteristic combinations (i.e., similar frequency domain complexity, temporal redundancy, and signal correlation strength) is decompressed and read. Based on this, the data retrieval rate is calculated, which is the percentage of data read in this cycle relative to the total archived amount of data with this type of characteristics, thereby quantifying the actual access popularity of cold data.
[0070] Secondly, the real-time calculated data retrieval rate is compared with a preset heat escape threshold (set to 5% in this embodiment). The logic for setting this threshold is based on the average audit frequency of historical policy data by the business side. If the proportion of data read requests under a specific working condition deviates significantly from the statistical average within a single monitoring period, heat escape is determined to have occurred. This 5% value can be periodically adjusted according to the seasonal fluctuations of insurance claims business to ensure the sensitivity of the model's online feedback. If a certain type of data was originally judged by the model as a low-entropy value and thus routed to the edge archiving area, but its actual retrieval rate is strictly greater than the preset threshold, it is determined that the data has experienced heat escape. This means that the model's value judgment of this type of working condition has deviated, misjudging the actual hot data as cold data. For specific feature combinations detected as escape, a specific error correction target and gradient update logic are constructed. First, a corrected target storage entropy weight coefficient is defined. This target value is set as the current model output coefficient plus a preset penalty step size (e.g., 0.2), with an upper limit not exceeding 1. This forces the model to output a higher entropy value when encountering this type of feature again. Second, a mean squared error loss function is constructed, calculating the square of the difference between the current output coefficient and the target storage entropy weight coefficient. Subsequently, based on the chain rule, the partial derivative of this loss function with respect to the multilayer perceptron weight matrix in the feature space projection layer is calculated to obtain the gradient vector. 
The logic for constructing the weights is based on the deviation between the business retrieval rate and the preset heat escape threshold. When an abnormal increase in the retrieval rate is detected, the loss function automatically increases the penalty coefficient, so that the calculated gradient vector drives the weights of the projection layer in the direction that increases the output entropy weight value. This mathematically achieves automatic correction of cold-data misjudgments. Finally, an online stochastic gradient descent update is performed, subtracting the product of the gradient vector and the online learning rate from the original weight matrix. During this update, the system extracts the feature frames that triggered the judgment from the real-time data stream and encapsulates them with the corresponding historical access popularity tags to construct temporary training sample pairs. To prevent model oscillation caused by a single abnormal fluctuation, and to prevent a single abnormal sample from destroying the existing knowledge structure, only the weights of the projection layer are fine-tuned, and the online learning rate is strictly limited to a very small value (set to 10⁻⁵ in this embodiment), so that the model retains the stability of its original physical condition recognition while adapting to new business popularity. Furthermore, local updates are performed only on the specific perceptron branch that generated the error. After the weight update, when the model again receives a data stream with the same or similar physical characteristics, its output storage entropy weight coefficient will increase significantly. 
This prompts such data to migrate from the edge archiving area to the standard elastic subspace or core reserved area in future routing decisions, thereby reducing subsequent decompression overhead. This achieves self-correction and dynamic evolution of the storage strategy in response to business changes. A self-correction closed loop for the storage strategy is constructed through online feedback calibration based on the retrieval rate. This mechanism effectively corrects misjudgments of hot and cold data, realizes the dynamic evolution of the sharding strategy with business changes, and reduces decompression overhead caused by erroneous archiving.
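The heat escape check and the construction of the correction target can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the chain-rule step through the multilayer perceptron is omitted, so only the gradient of the loss with respect to the model output is shown. The 5% threshold, the 0.2 penalty step, and the 10⁻⁵ learning rate are the values given in this embodiment.

```python
HEAT_ESCAPE_THRESHOLD = 0.05  # 5% retrieval rate
PENALTY_STEP = 0.2            # preset penalty step size
ONLINE_LR = 1e-5              # online learning rate


def retrieval_rate(reads_in_cycle, archived_total):
    """Share of archived data of one feature class that was read in the cycle."""
    return reads_in_cycle / archived_total if archived_total else 0.0


def calibration_signal(current_coeff, reads_in_cycle, archived_total):
    """Return (target, loss, dloss_doutput), or None when no heat escape occurred."""
    rate = retrieval_rate(reads_in_cycle, archived_total)
    if rate <= HEAT_ESCAPE_THRESHOLD:  # escape requires a strictly greater rate
        return None
    target = min(1.0, current_coeff + PENALTY_STEP)  # capped at 1
    error = current_coeff - target
    loss = error ** 2                  # squared error between output and target
    grad_output = 2.0 * error          # negative, so descent raises the output
    return target, loss, grad_output


def sgd_step(weights, grad_weights, lr=ONLINE_LR):
    """One online SGD update: subtract learning rate times gradient."""
    return [w - lr * g for w, g in zip(weights, grad_weights)]
```

For example, a feature class with an output coefficient of 0.15 and 80 reads out of 1000 archived blocks has a retrieval rate of 8%, so it triggers a correction toward a target of 0.35, whereas a rate of exactly 5% does not trigger one.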
[0071] By employing a multimodal tensor manifold metric architecture, a feature space projection layer maps low-dimensional physical scalar parameters to dense eigenvectors in a high-dimensional Hilbert space, enhancing the expressive power of the original signal in the feature space while preserving nonlinear fluctuation characteristics. A three-dimensional interactive tensor is constructed through vector outer product operations in the tensor product correlation layer, automatically capturing high-order coupling relationships between frequency, time, and amplitude domain features, providing richer feature interaction information than simple linear superposition. The Riemann metric distance in the Riemann tangent space is calculated using the manifold metric output layer. This method utilizes non-Euclidean geometry to process complex manifold data, enabling a quantitative assessment of the complexity and incompressibility of data stream operations, and providing a mathematical basis for the selection of adaptive storage engines.
[0072] Furthermore, the process of constructing the three-dimensional storage strategy mapping space and determining coordinate points includes: obtaining the physical limit IOPS threshold of the storage cluster; calculating the ratio of the currently collected IOPS load rate to the physical limit IOPS threshold to generate a system load saturation value between 0 and 1; calculating the maximum value of the global historical data access frequency of all data objects in the storage cluster; normalizing the historical data access frequency of the current data object by this maximum value to generate an access popularity quantile value between 0 and 1; retrieving the storage entropy weight coefficient; and instantiating in memory a unit Euclidean cube space with a side length of 1, in which mutually orthogonal X-, Y-, and Z-axes are established. The storage entropy weight coefficient is mapped to the X-axis and defined as the data value coordinate; the system load saturation is mapped to the Y-axis and defined as the system load coordinate; and the access popularity quantile is mapped to the Z-axis and defined as the access popularity coordinate. Based on the data value coordinate, system load coordinate, and access popularity coordinate, a unique spatial mapping point is determined within the unit Euclidean cube space, using the three coordinate values as scalar components along the spatial basis vectors. The coordinates of this spatial mapping point are used as the three-dimensional coordinate point.
[0073] First, the hardware physical limit parameters defined in the storage cluster configuration file are read to obtain the physical limit IOPS threshold (in this embodiment, it is set to the theoretical maximum read and write capacity of the storage cluster). Then, the monitoring probe is called to collect the current real-time IOPS load rate. In order to eliminate the difference in units, the ratio of the real-time collected value to the physical limit threshold is calculated to generate a system load saturation value that is strictly distributed in the closed interval between 0 and 1. This value directly reflects the current congestion level of the storage system. The closer the value is to 1, the closer the system load is to the edge of collapse, and a more conservative write strategy needs to be adopted.
[0074] The business key in the data frame header is read, and the vehicle unique identifier (VIN) is extracted. Using this VIN as the index key, the metadata statistics database is accessed to obtain the actual access frequency of the vehicle in the historical period (e.g., 500 times). At the same time, the maximum global historical read access frequency of all vehicles in the network is obtained. The Min-Max normalization algorithm is used to map the access frequency of the current vehicle to the global statistical range. This result is defined as the access popularity quantile, which is used to quantify the relative popularity of the specific vehicle data at the business level.
[0075] The system receives the storage entropy weight coefficient output by the data mapping model. Since this coefficient has already been processed by the sigmoid activation function at the model output layer, its value is normalized to between 0 and 1, and it is used directly as a standardized parameter characterizing the intrinsic value of the data. A virtual unit Euclidean cube space with a side length of 1 is instantiated as a data structure in memory. In this space, three mutually orthogonal coordinate axes are established: the X-axis is defined as the data value dimension, the Y-axis as the system load dimension, and the Z-axis as the access popularity dimension. The three calculated standardized parameters are mapped to the corresponding axes: the storage entropy weight coefficient to the X-axis, the system load saturation to the Y-axis, and the access popularity quantile to the Z-axis. Based on these three coordinate components, a unique geometric point, namely the policy routing anchor point, is determined in the unit Euclidean cube space. The spatial coordinates of this anchor point are the finally determined three-dimensional coordinate point, representing the comprehensive state of the current data flow in the three dimensions of value, load, and popularity and providing an accurate geometric index for subsequent partitioning and disk placement.
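The construction of the policy routing anchor point can be sketched as follows. The function names and the sample IOPS and frequency figures are hypothetical. The sketch also assumes that the lower bound of the Min-Max normalization is 0, so that the access frequency is simply divided by the global maximum.

```python
def clamp01(value):
    """Keep a value inside the closed interval [0, 1]."""
    return min(max(value, 0.0), 1.0)


def load_saturation(current_iops, limit_iops):
    """Ratio of the collected IOPS load to the physical limit threshold."""
    return clamp01(current_iops / limit_iops)


def access_popularity(access_freq, global_max_freq):
    """Access frequency normalized by the global maximum (minimum assumed 0)."""
    if global_max_freq <= 0:
        return 0.0
    return clamp01(access_freq / global_max_freq)


def routing_anchor(entropy_weight, current_iops, limit_iops,
                   access_freq, global_max_freq):
    """Return the (X, Y, Z) point: data value, system load, access popularity."""
    return (
        clamp01(entropy_weight),
        load_saturation(current_iops, limit_iops),
        access_popularity(access_freq, global_max_freq),
    )


# Illustrative inputs: 45,000 of 100,000 IOPS, 500 reads against a maximum of 2,000.
anchor = routing_anchor(0.48, 45_000, 100_000, 500, 2_000)
```

With these inputs the anchor point is (0.48, 0.45, 0.25); each component is dimensionless, which is what allows the three quantities to share one coordinate space.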
[0076] By constructing a three-dimensional storage strategy mapping space composed of data value dimension, system load dimension, and access frequency dimension, and using physical limit ratio and Min-Max normalization algorithm, the entropy weights, IOPS load, and access frequency with vastly different physical dimensions and orders of magnitude are uniformly mapped to a standardized unit Euclidean cube. This spatial vector synthesis mechanism effectively eliminates heterogeneous interference between multi-dimensional indicators, transforming the complex storage resource allocation problem into a precise coordinate positioning problem in geometric space. Thus, based on a comprehensive consideration of the intrinsic value of data, real-time hardware pressure, and business access preferences, it provides a quantitative and globally-oriented decision anchor for subsequent database sharding and table partitioning strategy selection.
[0077] Furthermore, the process of performing adaptive sharding and disk persistence includes: reading the high-order threshold set and low-order threshold set preset in the dynamic partition routing table, and comparing the value of each dimension of the three-dimensional coordinate point with the threshold sets; if all dimensions are higher than the high-order thresholds, determining that the data falls into the core reserved area; if all dimensions are lower than the low-order thresholds, determining that the data falls into the edge archive area; if neither condition is met, determining that the data falls into the standard elastic subspace and routing it to the general transaction processing cluster; for data in the core reserved area, calling the log-structured merge-tree storage engine to append the data stream to the variable buffer, triggering a lossless columnar serialization operation, and using run-length encoding to persist the data as an immutable sorted string table file; for data in the standard elastic subspace, calling a general relational database transaction engine based on a B+ tree structure, activating a row-level lock manager to ensure strong transactional consistency of the data, performing moderate lossless compression encoding on the data page content, and performing an in-place update operation to write to disk; for data in the edge archive area, calling a wavelet compression engine to perform multi-level discrete wavelet decomposition on the load data in the heterogeneous data stream, separating low-frequency approximate components and high-frequency detail components, retaining the low-frequency approximate components, performing a zero-setting truncation operation on the high-frequency detail components, and storing only the processed sparse data.
[0078] The sharding strategy manager first reads two sets of key thresholds pre-set in the dynamic partitioning routing table: a high-value threshold set defining high-value boundaries and a low-value threshold set defining low-value boundaries. It then obtains the three-dimensional coordinate points determined in the previous steps and compares the values of the data value, system load, and query popularity of these points with the aforementioned threshold sets one by one. If all dimensions of the coordinate point are strictly higher than the corresponding components in the high-value threshold set, the data stream is determined to be of high value, high load, and high popularity, and is routed to the core reserved area. If all dimensions of the coordinate point are strictly lower than the corresponding components in the low-value threshold set, the data stream is determined to be of low value, low load, and cold data type, and is routed to the edge archive area. If neither of these two extreme conditions is met, for example, some dimensions are high while others are low, the system determines that the data stream falls into the standard elastic subspace.
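The three-way routing decision can be sketched as a short function. The zone labels and the function name are hypothetical; the default thresholds of 0.8 and 0.2 are the values used in Example 2, applied here to every dimension as an assumption.

```python
def route(point, high=(0.8, 0.8, 0.8), low=(0.2, 0.2, 0.2)):
    """Classify a (value, load, popularity) point into a storage zone."""
    # Core reserved area: every dimension strictly above its high threshold.
    if all(p > h for p, h in zip(point, high)):
        return "core_reserved"
    # Edge archive area: every dimension strictly below its low threshold.
    if all(p < l for p, l in zip(point, low)):
        return "edge_archive"
    # Anything else, including mixed high and low dimensions.
    return "standard_elastic"
```

For the coordinate point (0.48, 0.45, 0.60) from Example 2, `route` returns `"standard_elastic"`, and a mixed point such as (0.9, 0.1, 0.5) lands in the same zone because neither extreme condition holds for all three dimensions.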
[0079] For data determined to fall into the core reserved area, an LSM-Tree write optimization process is executed to push the data to disk. The log structure merging tree storage engine is activated, and the data stream is first appended to a variable buffer in memory to achieve microsecond-level ultra-fast write response, effectively avoiding performance bottlenecks caused by random disk I / O. The memory usage of the buffer is monitored in real time, and when a preset level is reached, an immutable serialization operation is triggered. During this process, lossless compression is performed on columnar data blocks. Specifically, run-length encoding is used to compress repeated status bits, or differential encoding is used to compress continuously increasing timestamps. Finally, the data is persisted as a sorted string table file on disk. Once generated, this file cannot be modified and only supports background appending and merging, thereby maximizing the system's write throughput when processing high-volume data.
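The two lossless column encodings mentioned above can be illustrated with a minimal sketch. The function names and sample values are hypothetical, and a production columnar format would add headers and bit packing that are not shown here.

```python
def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def delta_encode(timestamps):
    """Store the first timestamp followed by successive differences."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]


def delta_decode(deltas):
    """Rebuild the original timestamps by cumulative summation."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


status_bits = [1, 1, 1, 0, 0, 1]
stamps = [1000, 1010, 1020, 1035]
```

Here `run_length_encode(status_bits)` gives `[(1, 3), (0, 2), (1, 1)]` and `delta_encode(stamps)` gives `[1000, 10, 10, 15]`; both encodings are exactly reversible, which is what makes them suitable for the core reserved area.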
[0080] For data that falls into the standard elastic subspace, the RDBMS transaction optimization disk write process is executed. For this type of data, which usually involves policy status updates and complex business logic, the business key in the data frame is extracted as the database primary key, and the data is routed to the general relational database transaction engine. In order to ensure data consistency in a concurrent environment, the row-level lock manager is activated according to the business key. Before the data is written to the disk, the storage engine only performs moderate lossless compression encoding (e.g., using the Zstd algorithm) on the load data part. Then, an in-situ update operation is performed, using the B+ tree index to accurately locate the physical page of the corresponding vehicle unique identifier and directly overwrite the old data, ensuring that business queries can always obtain the latest data status.
[0081] In this embodiment, moderate lossless compression encoding refers to an encoding strategy designed to balance transaction processing latency and storage space utilization. Specifically, unlike the deep compression algorithms used in the edge archive area, which have high computational overhead and a high compression ratio (e.g., greater than 5:1), moderate lossless compression encoding is configured to prioritize the real-time write throughput of the database. Technically, it preferably uses compression levels 1 to 5 of the Zstandard (Zstd) algorithm, or compression levels 1 to 3 of the Zlib algorithm. Under this configuration, the compression engine keeps the compression time of a single data page at the microsecond level (e.g., less than 50 μs per page) to avoid blocking row-level locking transactions in the database, while achieving a data compression ratio of 1.5:1 to 4:1.
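A low-level Zlib configuration of this kind can be sketched with the Python standard library. The helper names and the sample page content (including the VIN-like string) are hypothetical, and the sketch does not measure the per-page compression time or enforce the stated ratio range.

```python
import zlib


def compress_page(page: bytes, level: int = 1) -> bytes:
    """Compress one data page with a low Zlib level (1 to 3)."""
    if not 1 <= level <= 3:
        raise ValueError("moderate compression uses Zlib levels 1 to 3")
    return zlib.compress(page, level)


def compression_ratio(page: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size."""
    return len(page) / len(compressed)


# Hypothetical page content, repeated only to give the compressor something to work on.
page = b"VIN=LSVAA4182E2123456;status=ACTIVE;premium=1280;" * 100
packed = compress_page(page, level=1)
restored = zlib.decompress(packed)
```

The sample page is far more repetitive than real policy data, so its ratio is not representative of the 1.5:1 to 4:1 range; the point of the sketch is only that the round trip is lossless and that the level is capped.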
[0082] For data determined to fall into the edge archiving area, a key-value separation and wavelet sparsification disk writing process is executed. In order to retain retrieval capability while compressing, a key-value separation storage strategy is implemented. For the business key part, its structured storage format is maintained or lossless compression is performed using dictionary encoding, and a sparse index is established. For the working condition load data part, the wavelet compression engine is activated, and multi-level discrete wavelet decomposition is performed only on time-series signals such as steering, braking, and suspension. The original signal is separated into low-frequency approximate components carrying macro trend information and high-frequency detail components carrying noise. The low-frequency approximate components are completely preserved, and the high-frequency detail components are truncated to zero. Finally, the lossless business key index block and the lossy load data block that has undergone sparse processing are repackaged and written to the cold storage medium, realizing high compression ratio data archiving.
[0083] When performing the zero-setting truncation operation on the high-frequency detail components, the preset noise baseline is not a fixed empirical value but an adaptive threshold calculated dynamically from the universal threshold formula. The specific calculation logic is as follows: first, the noise standard deviation is estimated using the median absolute deviation of the high-frequency detail coefficients; then, the noise standard deviation is multiplied by the square root of twice the natural logarithm of the signal sequence length to obtain the preset noise baseline. The threshold calculated in this way removes Gaussian white noise interference to the greatest extent and retains only statistically significant abrupt detail features, thereby preserving as much of the effective signal as possible while ensuring the data compression ratio.
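The noise baseline calculation can be sketched as follows. Two points are assumptions not stated in the text: the median absolute deviation is divided by 0.6745, the usual consistency constant for Gaussian noise, and the baseline is applied as a hard threshold that zeroes coefficients whose magnitude does not exceed it. The function names and coefficient values are hypothetical.

```python
import math
import statistics


def universal_threshold(detail_coeffs, signal_length):
    """Noise baseline: sigma * sqrt(2 * ln(N)), with sigma estimated from the MAD."""
    med = statistics.median(detail_coeffs)
    mad = statistics.median([abs(c - med) for c in detail_coeffs])
    sigma = mad / 0.6745  # assumed Gaussian consistency constant
    return sigma * math.sqrt(2.0 * math.log(signal_length))


def truncate_details(detail_coeffs, threshold):
    """Zero every detail coefficient whose magnitude does not exceed the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in detail_coeffs]


# Illustrative detail coefficients from a signal of length 16: noise plus one spike.
details = [0.1, -0.2, 0.15, -0.1, 5.0, 0.05, -0.15, 0.2]
baseline = universal_threshold(details, signal_length=16)
sparse = truncate_details(details, baseline)
```

In this example only the spike of 5.0 survives, so the stored detail vector is almost entirely zeros and compresses well, while the abrupt feature is preserved.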
[0084] By dividing the storage strategy space into a core reserved area, a standard elastic subspace, and an edge archive area, precise adaptation between the underlying storage engine and data flow characteristics is achieved. The sequential write capability of the LSM-Tree engine ensures high-throughput writing of core data, the RDBMS engine establishes strong transactional consistency for regular data, and wavelet compression technology reduces storage costs for edge data. This differentiated disk persistence mechanism effectively improves the overall carrying capacity of the database cluster for massive amounts of heterogeneous auto insurance data while balancing write performance, data consistency, and storage space overhead.
[0085] By constructing a binary input-response metadata dependency graph based on causal logic, temporal topological relationships were established between heterogeneous data streams. Cross-correlation analysis was used to accurately quantify the dynamic hysteresis characteristics between physical signals. Combined with a data mapping model, the frequency domain energy distribution and temporal coupling characteristics were transformed into storage entropy weight coefficients that characterize the incompressibility of data, establishing a quantitative link between physical characteristics and storage value. A three-dimensional storage strategy mapping space was used to achieve differentiated engine selection. Under the premise of ensuring the throughput of high-value data writes and the consistency of core business transactions, low-value edge data can be compressed in a targeted manner, thereby achieving an effective balance between storage cost, system performance and data fidelity, and improving the comprehensive carrying capacity of the database cluster for massive amounts of auto insurance data.
Example 2
[0086] Using the vehicle gateway via the CAN bus interface, the system captures real-time driving data streams of a vehicle in a congested urban area. During this process, the gateway first parses the Vehicle Identification Number (VIN) message and associates it with the locally configured policy index; combined with the UTC timestamp generated by the timing module, this constructs a unique business key identifying the current data frame, thus establishing the data's entity identity and timing reference. The collected preceding excitation data shows the driver frequently alternating between the brake and accelerator pedals: the brake master cylinder pressure fluctuates frequently within the low-to-medium range of 0 to 30 bar, while the steering wheel angle remains within a small range of ±15 degrees for fine adjustment. Simultaneously, the collected subsequent response data shows that the longitudinal component of the three-axis acceleration alternates within ±0.3 g following the pedal action. Because the urban road surface is relatively smooth, the compression stroke of the four-wheel suspension shock absorbers mainly exhibits low-frequency fluctuations following the vehicle's pitch, without any severe high-frequency impacts. In addition, the database status collected synchronously by the system shows that the current IOPS load rate of the storage cluster is 45%, indicating that the system is under medium load; the historical read I/O count of the vehicle data is in the top 40% of all data objects in the network, indicating moderate access popularity.
[0087] The system maintains a timing-related sliding window in memory to capture delays between signals. When the system detects a rising edge signal in the brake master cylinder pressure, it immediately marks it as a timing index anchor point and searches for the corresponding response signal within the window. The search result shows that a falling edge appears in the longitudinal acceleration at a time lag of approximately 180 milliseconds. Since this 180-millisecond lag is less than the preset timing correlation threshold (200 milliseconds), the system determines that there is a valid physical causal dependency between the two signals. It then establishes a directed topological connection edge and writes 180 milliseconds as a timing bias weight into the graph metadata, completing the topological description of this braking behavior.
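The lag matching described above can be sketched as follows. The function names and the dictionary layout of an edge are hypothetical, timestamps are assumed to be integer milliseconds, and the sketch filters candidates directly by the 200-millisecond threshold instead of modelling the slightly longer sliding window.

```python
CORRELATION_THRESHOLD_MS = 200  # preset timing correlation threshold


def match_response(anchor_ts, response_ts_list, threshold_ms=CORRELATION_THRESHOLD_MS):
    """Return (response_ts, lag) for the earliest response lagging the anchor
    by less than the threshold, or None if there is no such response."""
    candidates = [ts for ts in response_ts_list if 0 <= ts - anchor_ts < threshold_ms]
    if not candidates:
        return None
    ts = min(candidates)
    return ts, ts - anchor_ts


def build_edges(anchor_ts_list, response_ts_list):
    """Create one directed edge per matched anchor, weighted by the time lag."""
    edges = []
    for anchor_ts in anchor_ts_list:
        match = match_response(anchor_ts, response_ts_list)
        if match is not None:
            edges.append({"source": anchor_ts, "target": match[0], "lag_ms": match[1]})
    return edges
```

A braking rising edge at 1000 ms with a deceleration event at 1180 ms yields one edge with `lag_ms` equal to 180, while a response at 1250 ms exceeds the threshold and produces no edge.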
[0088] The system performs wavelet packet decomposition on the suspension travel nodes in the spectrum. The extraction results show that the frequency band energy is mainly concentrated in the low-frequency band, with fewer high-frequency noise components. The calculated power spectral entropy value is at a moderate level (e.g., 0.55). Meanwhile, the response lag time calculated using the cross-correlation function is stable at around 180 milliseconds, indicating that the mechanical response is within the normal linear range. In addition, the Pearson product-moment correlation coefficient between the brake master cylinder pressure sequence and the longitudinal acceleration sequence is calculated. The obtained braking response linearity is close to one (e.g., 0.96), indicating that the braking system is operating in the linear region and has not triggered nonlinear adjustment mechanisms such as ABS.
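Two of the features in this step can be sketched in a few lines. The normalization of the entropy by the logarithm of the number of bands and the use of the absolute correlation as the linearity measure are assumptions; the paragraph above does not state either. The function names and sample values are hypothetical.

```python
import math


def power_spectral_entropy(band_energies):
    """Shannon entropy of the normalized band energies, scaled to [0, 1]."""
    total = sum(band_energies)
    n = len(band_energies)
    if total <= 0 or n < 2:
        return 0.0
    probs = [e / total for e in band_energies if e > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)  # assumed normalization by log(number of bands)


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Illustrative series: rising brake pressure (bar) and falling acceleration (g).
pressure = [0.0, 10.0, 20.0, 30.0]
accel = [0.0, -0.1, -0.2, -0.3]
linearity = abs(pearson(pressure, accel))
```

Energy concentrated in a single band gives an entropy of 0 and evenly spread energy gives 1, so a moderate value such as 0.55 indicates a partly concentrated spectrum. The perfectly linear sample series gives a linearity of 1.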
[0089] The system inputs the extracted physical features into a data mapping model. Internally, the model uses a log-Euclidean metric (LEM) to perform fast manifold operations on the feature covariance matrix. Due to the input data's moderate frequency domain complexity, good temporal synchronization, and high amplitude domain linearity, the model outputs a storage entropy weighting coefficient of 0.48. This indicates that the data segment has some compression potential, but excessive compression is not advisable to preserve business details. Based on this, the system constructs a three-dimensional storage strategy mapping space: mapping data value coordinates to 0.48, system load coordinates to 0.45, and access popularity coordinates to 0.60. Based on these three components, the system determines a unique strategy coordinate point in the three-dimensional space.
[0090] The system reads the preset partition routing table and compares the above coordinate points with the high-threshold set (0.8) and the low-threshold set (0.2). The setting logic of the high-threshold and low-threshold is dynamically adjusted based on the remaining capacity of the storage cluster and the importance level of the business. When the cluster storage space is sufficient, the system will automatically lower the high-threshold to accommodate more high-fidelity data; while when the storage pressure increases, it will raise the low-threshold to force more low-frequency data into the compressed archive area, thereby achieving efficient dynamic allocation of storage resources at the macro level. Since the three components of the coordinate point are not simultaneously greater than the high-threshold and not simultaneously less than the low-threshold, the system determines that the data stream falls into the standard elastic subspace. For the data in this area, the system extracts the business key in the data frame as the database primary key and routes it to the general relational database cluster. In order to ensure the accuracy of UBI premium calculation, the system activates the row-level lock manager to enable ACID transactions and ensure strong data consistency under concurrent updates. Before writing data to disk, the storage engine uses the Zstd algorithm to perform moderate lossless compression on the data pages, and then performs an in-place update operation to overwrite the data into the corresponding B+ leaf nodes.
[0091] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for storing automobile insurance data based on database and table sharding, characterized in that it includes: acquiring heterogeneous data streams to be stored, including business keys consisting of a vehicle unique identifier, a policy index and a timestamp, and operating load data consisting of steering wheel angle, brake master cylinder pressure, three-axis acceleration and suspension compression stroke, and simultaneously collecting the storage cluster IOPS load rate and the historical read access frequency; using the business keys to lock vehicle entities and align data timing, and constructing an input-response binary graph; performing wavelet packet decomposition on the suspension travel nodes to extract the frequency band energy distribution and calculate signal complexity features; using cross-correlation functions to calculate the temporal redundancy features between the steering wheel angle and acceleration nodes, and calculating the signal correlation features between braking pressure and acceleration; inputting the signal complexity features, the temporal redundancy features, and the signal correlation features into the data mapping model, and outputting the storage entropy weight coefficient; defining the storage entropy weight coefficient, the IOPS load rate, and the access frequency as the data value dimension, the system load dimension, and the access heat dimension, respectively, constructing a three-dimensional storage strategy mapping space and determining coordinate points; and, based on the location of the coordinate points in the preset partition routing table, performing adaptive sharding and disk placement on the heterogeneous data stream containing business keys.
2. The method for sharding and partitioning automobile insurance data according to claim 1, characterized in that the heterogeneous data stream acquisition process includes:
loading a preset DBC communication protocol file using an on-board gateway, parsing the bit fields of the CAN bus broadcast messages, and extracting the physical values of steering wheel angle, brake master cylinder pressure, three-axis acceleration and suspension compression stroke;
synchronously parsing the vehicle identification message to obtain the vehicle unique identifier, looking up the policy index through a preset mapping relationship, and adding a timestamp to each sampling point to generate the business key;
using the timestamps of the data item with the highest sampling frequency as the benchmark, performing linear interpolation on the other, lower-frequency data items, and encapsulating the business key together with the interpolated operating load data to generate a time-aligned heterogeneous data frame;
reading the number of disk read/write operations per second through a monitoring agent deployed on the storage node, and smoothing instantaneous traffic spikes with an exponentially weighted moving average algorithm to calculate the IOPS load rate;
accessing the metadata monitoring service of the storage system, retrieving, based on the vehicle unique identifier, the underlying read I/O request records for the data object within a set sliding time window, and counting the cumulative number of read I/O requests as the historical read access frequency.
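The time alignment and IOPS smoothing steps of claim 2 can be illustrated with a minimal sketch. The function names (`interpolate_to`, `ewma`), the smoothing factor `alpha`, the clamping behaviour outside the sampled range, and the sample values are all assumptions made for illustration; the claim does not specify them.

```python
from bisect import bisect_left


def interpolate_to(reference_ts, ts, values):
    """Linearly interpolate (ts, values) onto reference_ts.

    ts must be strictly increasing. Points outside the sampled range are
    clamped to the nearest sample (an assumption, not stated in the claim).
    """
    out = []
    for t in reference_ts:
        i = bisect_left(ts, t)
        if i == 0:
            out.append(values[0])
        elif i == len(ts):
            out.append(values[-1])
        else:
            t0, t1 = ts[i - 1], ts[i]
            w = (t - t0) / (t1 - t0)
            out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out


def ewma(samples, alpha=0.3):
    """Exponentially weighted moving average; alpha is a hypothetical factor."""
    smoothed, state = [], None
    for x in samples:
        state = x if state is None else alpha * x + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


# Hypothetical data: acceleration timestamps (ms) are the highest-frequency
# benchmark; steering wheel angle is sampled at half that rate.
accel_ts = [0, 10, 20, 30, 40]
steer_ts = [0, 20, 40]
steer_deg = [0.0, 4.0, 2.0]
aligned_steer = interpolate_to(accel_ts, steer_ts, steer_deg)

# A single spike of 5000 IOPS is damped rather than passed through.
smoothed_iops = ewma([1000, 1000, 5000, 1000], alpha=0.5)
```

With these inputs the steering series is resampled onto all five benchmark timestamps, and the IOPS spike contributes only part of its magnitude to the smoothed value.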
3. The method for sharding and partitioning automobile insurance data according to claim 1, characterized in that the process of constructing the input-response binary graph includes:
performing entity isolation on the heterogeneous data stream based on the vehicle unique identifier in the business key;
performing node encapsulation only within the data domain of the same vehicle entity, in the order of data generation: encapsulating the original values of the steering wheel angle and the brake master cylinder pressure into preceding active node data objects, encapsulating the three-axis acceleration and the suspension compression stroke into subsequent passive node data objects, serializing the objects according to the timestamps in the business key, and storing them in the node storage area of an in-memory graph database;
initializing a time-series correlation sliding window in memory, and setting the window length to be greater than a preset time-series correlation threshold;
using a preceding active node data object as the time-series index anchor point, and taking the subsequent passive node data objects whose timestamps lag behind the anchor point within the window as the target set to be matched;
executing a time-series proximity retrieval algorithm to traverse the target set to be matched and calculate the time difference between each object and the time-series index anchor point;
when the time difference is less than the preset time-series correlation threshold, establishing a directed topological connection edge and writing the time difference into the graph metadata as a temporal bias weight, thereby generating the input-response binary graph.
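The edge-building logic of claim 3 can be sketched as follows. This is a minimal illustration, not the claimed in-memory graph database: nodes are plain tuples, the function name `build_input_response_graph` is hypothetical, timestamps are assumed to be integer milliseconds, and "lagging" is read as strictly later than the anchor.

```python
from bisect import bisect_right
from collections import defaultdict


def build_input_response_graph(active, passive, window, threshold):
    """Link each active node to lagging passive nodes of the same vehicle.

    active / passive: lists of (vehicle_id, timestamp_ms, payload).
    Returns directed edges (vehicle_id, active_ts, passive_ts, time_diff),
    where time_diff plays the role of the temporal bias weight.
    """
    if window <= threshold:
        raise ValueError("window length must exceed the correlation threshold")

    # Entity isolation: passive nodes are grouped per vehicle and time-sorted.
    stamps_by_vehicle = defaultdict(list)
    for vehicle_id, ts, _payload in passive:
        stamps_by_vehicle[vehicle_id].append(ts)
    for stamps in stamps_by_vehicle.values():
        stamps.sort()

    edges = []
    for vehicle_id, anchor_ts, _payload in sorted(active, key=lambda n: (n[0], n[1])):
        stamps = stamps_by_vehicle.get(vehicle_id, [])
        start = bisect_right(stamps, anchor_ts)           # strictly after anchor
        end = bisect_right(stamps, anchor_ts + window)    # still inside window
        for passive_ts in stamps[start:end]:
            diff = passive_ts - anchor_ts
            if diff < threshold:
                edges.append((vehicle_id, anchor_ts, passive_ts, diff))
    return edges


# Hypothetical sample: two vehicles, one steering input each.
active_nodes = [("V1", 1000, {"steer": 12.0}), ("V2", 1000, {"steer": -3.0})]
passive_nodes = [
    ("V1", 1020, {"acc": 0.4}),   # 20 ms after the V1 anchor: linked
    ("V1", 1200, {"acc": 0.1}),   # inside the window, beyond the threshold
    ("V2", 1010, {"acc": 0.2}),   # belongs to V2 only
]
edges = build_input_response_graph(active_nodes, passive_nodes,
                                   window=300, threshold=100)
```

Because the window is longer than the threshold, the window bounds only the search range, while the threshold decides which candidates actually receive an edge.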
4. The method for sharding and partitioning automobile insurance data according to claim 1, characterized in that the calculation process for the signal complexity feature, the temporal redundancy feature and the signal correlation feature includes:
selecting the Daubechies wavelet as the basis function, performing multi-level wavelet packet decomposition on the time-series data stream of the suspension compression stroke, and obtaining the reconstruction coefficients of each frequency band node in the last layer;
calculating the sum of squares of the reconstruction coefficients of each frequency band node to obtain the energy value of that frequency band, and normalizing the energy values of all frequency bands into an energy probability distribution sequence;
using the Shannon entropy formula to calculate the negative logarithmic weighted sum of the probability distribution sequence to obtain the power spectral entropy, which is used as the signal complexity feature;
constructing a sliding cross-correlation operator for the discrete time series, setting the steering wheel angle data sequence as the reference input sequence and the three-axis acceleration data sequence as the response input sequence, calculating the cross-correlation coefficient sequence of the two within a preset sliding time window, identifying the time displacement at which the absolute value of the cross-correlation coefficient is largest, and taking this time displacement as the response lag time, which is used as the temporal redundancy feature;
performing normalized statistical correlation analysis, within the same time window, on the original value sequence of the brake master cylinder pressure and the longitudinal acceleration component sequence of the three-axis acceleration, calculating the Pearson product-moment correlation coefficient between the two sequences, and defining this coefficient as the braking response linearity, which is used as the signal correlation feature.
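The three features of claim 4 can be sketched in plain Python. Several choices here are assumptions: the Haar wavelet (the first-order Daubechies wavelet) stands in for whichever Daubechies order is intended; band energy is computed from the leaf coefficients, which for an orthogonal wavelet equals the energy of the reconstructed band; only non-negative lags are searched; and all function names are hypothetical.

```python
import math


def haar_packet_leaves(signal, levels):
    """Full wavelet packet tree with Haar filters; returns last-layer nodes.

    len(signal) should be divisible by 2**levels.
    """
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            pairs = list(zip(node[0::2], node[1::2]))
            next_nodes.append([(a + b) / math.sqrt(2) for a, b in pairs])  # low band
            next_nodes.append([(a - b) / math.sqrt(2) for a, b in pairs])  # high band
        nodes = next_nodes
    return nodes


def power_spectral_entropy(signal, levels=3):
    """Shannon entropy of the normalized band energies (signal complexity)."""
    energies = [sum(c * c for c in leaf) for leaf in haar_packet_leaves(signal, levels)]
    total = sum(energies)
    if total == 0:
        return 0.0
    return -sum((e / total) * math.log(e / total) for e in energies if e > 0)


def pearson(x, y):
    """Pearson product-moment correlation; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def response_lag(reference, response, max_lag):
    """Lag (in samples) that maximizes |cross-correlation coefficient|."""
    best_lag, best_abs = 0, -1.0
    for lag in range(max_lag + 1):
        r = pearson(reference[:len(reference) - lag], response[lag:])
        if abs(r) > best_abs:
            best_lag, best_abs = lag, abs(r)
    return best_lag


# Hypothetical data: the acceleration repeats the steering pulse 2 samples later.
steering = [0, 0, 1, 3, 2, 0, 0, 0, 0, 0]
lateral_acc = [0, 0, 0, 0, 1, 3, 2, 0, 0, 0]
lag = response_lag(steering, lateral_acc, max_lag=4)

impulse_entropy = power_spectral_entropy([1, 0, 0, 0, 0, 0, 0, 0], levels=3)
flat_entropy = power_spectral_entropy([1] * 8, levels=3)
brake_linearity = pearson([0, 10, 20, 30], [0.0, -1.0, -2.0, -3.0])
```

An impulse spreads its energy evenly over all eight bands and reaches the maximum entropy ln 8, while a constant signal concentrates its energy in one band and has zero entropy. The braking example yields a coefficient close to -1 because deceleration grows linearly with pressure.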
5. The method for sharding and partitioning automobile insurance data according to claim 1, characterized in that the data mapping model includes:
a feature space projection layer, configured with three parallel multilayer perceptron network branches, each of which contains an input layer, a hidden layer and an output layer, wherein the input layers receive the signal complexity feature, the temporal redundancy feature and the signal correlation feature, respectively, as scalar inputs, the hidden layer uses the hyperbolic tangent activation function for nonlinear transformation, and the output layer maps the transformed data to a high-dimensional Hilbert space, generating a frequency domain feature vector, a temporal feature vector and a correlation feature vector, respectively;
a tensor product correlation layer, which calculates the outer product of the frequency domain feature vector and the temporal feature vector to generate a two-dimensional feature matrix, and calculates the outer product of the two-dimensional feature matrix and the correlation feature vector to construct a three-dimensional interactive feature tensor;
a manifold metric output layer, which collects the three-dimensional interactive feature tensors of multiple consecutive time steps in a preset time buffer queue to construct a tensor sequence sample set, calculates the second-order statistical covariance matrix of the sample set along the time dimension, maps the covariance matrix onto the manifold of symmetric positive definite matrices, calculates, in the Riemannian geometric space of the covariance matrix, the Riemannian metric distance from the covariance matrix to the identity matrix taken as the origin, and maps the Riemannian metric distance to a storage entropy weight coefficient in the range between 0 and 1 through a sigmoid nonlinear activation function.
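The three layers of claim 5 can be traced end to end in a small sketch. Everything numeric here is an assumption: the branch weights are random and untrained, the vector size is 2, a small `eps` is added to the diagonal so the covariance is positive definite, the affine-invariant metric d(C, I) = sqrt(sum of squared log-eigenvalues) is used as the Riemannian distance, and eigenvalues come from a plain Jacobi iteration so that no external library is needed.

```python
import math
import random


def make_branch(rng, hidden=3, out=2):
    """Hypothetical untrained weights for one scalar-input perceptron branch."""
    w_h = [rng.uniform(-1, 1) for _ in range(hidden)]
    b_h = [rng.uniform(-1, 1) for _ in range(hidden)]
    w_o = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(out)]
    return w_h, b_h, w_o


def branch_forward(x, branch):
    w_h, b_h, w_o = branch
    hidden = [math.tanh(w * x + b) for w, b in zip(w_h, b_h)]  # tanh hidden layer
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_o]


def interaction_tensor(freq_vec, time_vec, corr_vec):
    """(freq outer time) outer corr, flattened row-major."""
    return [f * t * c for f in freq_vec for t in time_vec for c in corr_vec]


def covariance(samples, eps=1e-6):
    """Sample covariance over time steps; eps keeps it positive definite."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += eps
    return cov


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):   # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def distance_to_identity(spd):
    """Affine-invariant distance from an SPD matrix to the identity."""
    return math.sqrt(sum(math.log(max(lam, 1e-12)) ** 2
                         for lam in symmetric_eigenvalues(spd)))


def storage_entropy_weight(feature_steps, branches):
    tensors = [interaction_tensor(*(branch_forward(x, b)
                                    for x, b in zip(step, branches)))
               for step in feature_steps]
    d = distance_to_identity(covariance(tensors))
    return 1.0 / (1.0 + math.exp(-d))   # sigmoid


rng = random.Random(0)
branches = [make_branch(rng) for _ in range(3)]
# Hypothetical buffer of 12 time steps: (complexity, lag, correlation).
steps = [(0.1 * k, 0.02 * k, math.cos(k)) for k in range(12)]
weight = storage_entropy_weight(steps, branches)
```

Since a distance is never negative, a plain sigmoid applied to it returns values of at least 0.5; how the claimed model spreads the coefficient over the lower half of the 0 to 1 range (for example by centering or scaling the distance) is not stated in the claim and is left open here.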
6. The method for sharding and partitioning automobile insurance data according to claim 1, characterized in that the process of constructing the three-dimensional storage strategy mapping space and determining the coordinate point includes:
obtaining the physical limit IOPS threshold of the storage cluster, calculating the ratio of the currently collected IOPS load rate to the physical limit IOPS threshold, and generating a system load saturation value between 0 and 1;
calculating the maximum value of the global historical read access frequency, normalizing the historical read access frequency of the current vehicle by this maximum value, and generating an access heat quantile between 0 and 1;
retrieving the storage entropy weight coefficient, instantiating in memory a unit Euclidean cube space with a side length of 1, and establishing mutually orthogonal X, Y and Z axes;
mapping the storage entropy weight coefficient to the X axis to define the data value coordinate, mapping the system load saturation to the Y axis to define the system load coordinate, and mapping the access heat quantile to the Z axis to define the access heat coordinate;
using the data value coordinate, the system load coordinate and the access heat coordinate as the scalar components along the spatial basis vectors, determining a unique spatial mapping point within the unit Euclidean cube space, and using the coordinates of the spatial mapping point as the three-dimensional coordinate point.
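The normalization in claim 6 amounts to three divisions and a clamp. The sketch below assumes hypothetical names (`StrategyPoint`, `to_strategy_point`) and clamps each coordinate to the unit interval so that the point always lies inside the unit cube; the clamping is an added safeguard, not part of the claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyPoint:
    data_value: float    # X axis: storage entropy weight coefficient
    system_load: float   # Y axis: system load saturation
    access_heat: float   # Z axis: access heat quantile


def clamp01(value):
    return max(0.0, min(1.0, value))


def to_strategy_point(entropy_weight, current_iops, iops_limit,
                      read_count, max_read_count):
    """Map the three raw quantities to a point in the unit cube."""
    load = clamp01(current_iops / iops_limit) if iops_limit > 0 else 0.0
    heat = clamp01(read_count / max_read_count) if max_read_count > 0 else 0.0
    return StrategyPoint(clamp01(entropy_weight), load, heat)


# Hypothetical readings: 4500 of 10000 IOPS in use, 30 reads vs. a global max of 120.
point = to_strategy_point(0.8, 4500, 10000, 30, 120)
```

For these hypothetical readings the point has a system load coordinate of 0.45 and an access heat coordinate of 0.25.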
7. The method for sharding and partitioning automobile insurance data according to claim 1, characterized in that the process of adaptive sharding and disk persistence includes:
reading the high-order threshold set and the low-order threshold set preset in the dynamic partition routing table;
comparing the value of each dimension of the three-dimensional coordinate point with the threshold sets: if all dimensions are higher than the high-order thresholds, determining that the data falls into the core reserved area; if all dimensions are lower than the low-order thresholds, determining that the data falls into the edge archive area; if neither condition is met, determining that the data falls into the standard elastic subspace and routing it to the general transaction processing cluster;
for data in the core reserved area, calling a log-structured merge-tree storage engine to append the data stream to the mutable buffer and trigger a lossless columnar serialization operation, which uses run-length encoding to persist the data as an immutable sorted string table file;
for data in the standard elastic subspace, invoking a general relational database transaction engine based on a B+ tree structure, activating the row-level lock manager, performing moderate lossless compression encoding on the data page content, and performing an in-place update operation to write to disk;
for data in the edge archive area, invoking a wavelet compression engine to perform multi-level discrete wavelet decomposition on the operating load data in the heterogeneous data stream, separating the low-frequency approximation components from the high-frequency detail components, retaining the low-frequency approximation components, truncating the high-frequency detail components to zero, and storing only the resulting sparse data.
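The routing decision and the lossy archive path of claim 7 can be sketched as below. The threshold values, the `Zone` labels and the function names are hypothetical, and the archive step uses a Haar approximation as a stand-in for whatever wavelet the compression engine would use; the storage engines themselves are not modelled.

```python
import math
from enum import Enum


class Zone(Enum):
    CORE_RESERVED = "lsm_tree_columnar_rle"      # lossless, append-only path
    STANDARD_ELASTIC = "btree_row_store"         # general transaction cluster
    EDGE_ARCHIVE = "wavelet_lossy_archive"       # keep low-frequency part only


def route(point, high, low):
    """point, high, low: (data_value, system_load, access_heat) triples."""
    if all(p > h for p, h in zip(point, high)):
        return Zone.CORE_RESERVED
    if all(p < l for p, l in zip(point, low)):
        return Zone.EDGE_ARCHIVE
    return Zone.STANDARD_ELASTIC


def keep_low_frequency(signal, levels):
    """Multi-level Haar decomposition that retains only the approximation.

    Dropping the detail coefficients is equivalent to truncating them to zero.
    len(signal) should be divisible by 2**levels.
    """
    approx = list(signal)
    for _ in range(levels):
        approx = [(a + b) / math.sqrt(2)
                  for a, b in zip(approx[0::2], approx[1::2])]
    return approx


# Hypothetical routing table thresholds.
HIGH = (0.7, 0.7, 0.7)
LOW = (0.3, 0.3, 0.3)

hot = route((0.9, 0.9, 0.9), HIGH, LOW)
cold = route((0.1, 0.2, 0.1), HIGH, LOW)
mixed = route((0.9, 0.2, 0.5), HIGH, LOW)
archived = keep_low_frequency([1.0, 1.0, 3.0, 3.0], levels=2)
```

A point that is high in one dimension and low in another matches neither all-dimension test and therefore lands in the standard elastic subspace, which makes that zone the default. In the archive example, four samples collapse to a single approximation coefficient.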
8. A system for sharding and partitioning automobile insurance data for storage, characterized in that it comprises:
a data acquisition module, which acquires a heterogeneous data stream to be stored, the stream comprising a business key consisting of a vehicle unique identifier, a policy index and a timestamp, and operating load data consisting of steering wheel angle, brake master cylinder pressure, three-axis acceleration and suspension compression stroke, and which simultaneously collects the IOPS load rate of the storage cluster and the historical read access frequency;
a feature calculation module, which uses the business key to lock the vehicle entity and align the data timing, constructs an input-response binary graph, performs wavelet packet decomposition on the suspension stroke nodes to extract the frequency band energy distribution and calculate a signal complexity feature, uses a cross-correlation function to calculate a temporal redundancy feature between the steering wheel angle nodes and the acceleration nodes, and calculates a signal correlation feature between brake pressure and acceleration;
a model mapping module, which inputs the signal complexity feature, the temporal redundancy feature and the signal correlation feature into a data mapping model and outputs a storage entropy weight coefficient;
an adaptive sharding module, which defines the storage entropy weight coefficient, the IOPS load rate and the access frequency as a data value dimension, a system load dimension and an access heat dimension, respectively, constructs a three-dimensional storage strategy mapping space and determines a coordinate point, and, based on the region of a preset partition routing table in which the coordinate point falls, performs adaptive sharding and disk persistence on the heterogeneous data stream containing the business key.