Geographic information surveying and mapping data management method and system based on multi-source spatio-temporal features
By optimizing the spatiotemporal data index structure and query pattern learning, the problems of imbalance and storage redundancy in traditional spatiotemporal data indexes are solved, achieving high efficiency in data management and a reduction in query response time.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANDONG ZHENGTU INFORMATION POLYTRON TECH INC
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional spatiotemporal data indexes are prone to imbalance, storage redundancy, and high query latency, making it difficult to meet the performance requirements of real-time analysis and interactive applications.
By constructing a geographic information mapping data management method based on multi-source spatiotemporal features, the method calculates the comprehensive insertion cost to select child nodes, evaluates the cost of spatiotemporal variability to formulate a splitting strategy, and combines differential numerical storage and query pattern learning and prefetching functions to optimize the index structure and data management.
Construct a balanced spatiotemporal index structure to compress data volume, reduce storage hardware burden, shorten query response time, and improve data management and retrieval performance.
Figure CN122240650A_ABST
Description
Technical Field
[0001] This invention relates to the field of data management technology. More specifically, this invention relates to a method and system for managing geographic information mapping data based on multi-source spatiotemporal characteristics.
Background Technology
[0002] Data generated by the geographic information surveying and mapping industry, which has both spatiotemporal attributes, not only represents the spatial location and shape of geographic entities at a specific moment, but also records the process of geographic entities changing over time, exhibiting characteristics such as huge data volume, complex dimensions, and frequent updates.
[0003] Currently, various spatiotemporal data indexes can group and index spatiotemporal objects by constructing hierarchical bounding boxes in multidimensional space, thereby accelerating operations such as spatiotemporal range queries and proximity queries.
[0004] However, in terms of data insertion, traditional index structures typically employ a greedy strategy based on minimum spatial overlap or minimum area increment to select the insertion path. These methods consider only the locally best result of a single insertion, which easily leads to an unbalanced index tree structure and frequent node reorganization, resulting in performance degradation over long-term operation. Regarding node splitting and data storage, most existing splitting algorithms fail to fully integrate the time dimension and the spatiotemporal distribution uniformity of data objects within nodes, easily causing overlap between split nodes in both dimensions and thus reducing query pruning efficiency. In addition, existing data storage methods often contain redundancy and lack compression mechanisms for local data features within nodes, resulting in wasted storage resources. In terms of query execution, traditional query processing relies on determining the geometric intersection between the query range and index nodes, which easily triggers excessive random disk access, especially when handling regular or repetitive query patterns. It fails to use historical query information to predict and prefetch data, leading to high query latency and difficulty meeting the performance requirements of real-time analysis and interactive applications.
Summary of the Invention
[0005] To address the technical problems of traditional spatiotemporal data indexes being prone to imbalance, storage redundancy, and high query latency, this invention proposes a geographic information mapping data management method and system based on multi-source spatiotemporal features. This method can construct a more balanced spatiotemporal index structure and compress data volume by combining a differential numerical storage mechanism. At the same time, it utilizes query pattern learning and pre-fetching functions to pre-load data, thereby shortening query response time and improving overall retrieval performance.
[0006] In a first aspect, the present invention provides a geographic information mapping data management method based on multi-source spatiotemporal features, comprising: acquiring geographic information mapping data objects containing spatiotemporal coordinate attributes; when inserting a data object into a non-leaf parent node, calculating a comprehensive insertion cost, and selecting the child node with the smallest comprehensive insertion cost to insert the data object; when the number of data objects in a child node exceeds a capacity threshold, selecting the optimal splitting strategy by evaluating the spatiotemporal variability cost of candidate splitting strategies; for each new leaf node generated by the split, determining a reference object based on the spatiotemporal distribution of the internal data objects, and selecting a differential numerical precision level for the new leaf node based on the smallest spatiotemporal variability cost calculated during the split, wherein each data object stored thereafter is stored as the processed difference value of the data object relative to the reference object; receiving a spatiotemporal range query request and extracting a query pattern feature vector; when traversing the index tree, matching the query pattern feature vector against the historical query pattern cluster centroids stored in the index nodes, and, if a match is successful, prefetching subsequent node data based on the probabilistic adjacency pointer associated with the best matching centroid; and, after the query is completed, asynchronously recording the actual path and pattern features of the query, and updating, in batches through a background thread, the query pattern cluster centroids and the transition probabilities of the adjacency pointers of the nodes hit on the path.
[0007] This invention constructs a more balanced multidimensional index tree structure and reduces boundary overlap between nodes by calculating the comprehensive insertion cost to select child nodes during data insertion, and by evaluating the spatiotemporal variability cost to determine the splitting strategy when nodes exceed capacity. By determining the reference object for the new nodes generated by splitting, assigning a differential numerical precision level according to the variability cost, and storing difference values, the overall data volume can be effectively compressed, reducing the read and write burden on storage hardware. In addition, by extracting query pattern features, matching them with historical query patterns, and using probabilistic adjacency pointers to prefetch the data of subsequent nodes, the waiting time during data retrieval can be significantly reduced, improving the management efficiency and query response performance of geographic information mapping data.
[0008] Preferably, when inserting a data object into a non-leaf parent node, calculating a comprehensive insertion cost and selecting the child node with the smallest comprehensive insertion cost to insert the data object includes: for a child node of a non-leaf parent node, the node instability metric of that child node is determined by the arithmetic mean of a preset number of the child node's most recent historical split cost values recorded by the parent node; the comprehensive insertion cost of the data object to be inserted for each child node is a weighted linear combination of the normalized node instability metric, the normalized time span expansion caused by the data object, the normalized spatial expansion, and the normalized current space occupancy rate; the data object is inserted into the child node with the smallest comprehensive insertion cost value.
[0009] This invention calculates the comprehensive insertion cost by combining the node instability metric, the time span expansion caused by the data object, the spatial expansion, and the current space occupancy rate. It comprehensively considers the historical changes of the node, the degree of spatiotemporal boundary expansion brought by new data, and the current spatial utilization status of the node, thus alleviating the imbalance of the index tree structure caused by a single indicator judgment and making the distribution of surveying data objects among the nodes more uniform.
[0010] Preferably, the step of selecting the optimal splitting strategy by evaluating the spatiotemporal variability cost of candidate splitting strategies includes: calculating the spatial overlap area between the minimum bounding rectangles of the two new nodes generated by the split; calculating the normalized time span overlap between the time intervals covered by the two new nodes; calculating the spatiotemporal distribution variance of all data objects within each of the two new nodes; and computing a weighted sum of the normalized spatial overlap area, the normalized time span overlap, and the normalized sum of the spatiotemporal distribution variances of the two new nodes to obtain the spatiotemporal variability cost of the current splitting strategy.
[0011] This invention obtains the cost of spatiotemporal variability by calculating the spatial overlap area, temporal span overlap, and spatiotemporal distribution variance of new nodes. During the node splitting process, it takes into account both the geometric boundary intersection of the split nodes and the dispersion of the internal data distribution, making the new nodes generated by the split more distinguishable in both spatial and temporal dimensions. This helps to reduce the invalid paths that need to be traversed during query operations and improves the efficiency of retrieval pruning.
[0012] Preferably, determining a reference object based on the spatiotemporal distribution of internal data objects includes: normalizing the spatial and temporal coordinates of the object set assigned to the new leaf node, and then calculating the spatiotemporal centroid and average change frequency of the object set; among objects whose change frequency is not higher than the average change frequency, calculating the weighted spatiotemporal distance between each object and the spatiotemporal centroid; and selecting the object with the smallest weighted spatiotemporal distance value as the reference object of the new leaf node.
[0013] This invention calculates the spatiotemporal centroid and average change frequency of the object set, and selects the object with the smallest weighted spatiotemporal distance from the objects with low change frequency as the reference object. This helps to place the selected reference object at the spatiotemporal center of the data cluster and make it relatively stable, thereby keeping the subsequently generated difference values within a small range, enhancing the data compression effect and saving storage resources.
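The reference-object selection described in the two paragraphs above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dictionary keys (`x`, `y`, `t`, `freq`), the weights `w_space` and `w_time`, and the exact form of the weighted distance are assumptions made for the example.

```python
import math

def select_reference(objects, w_space=0.5, w_time=0.5):
    """Return the index of the reference object.

    objects: list of dicts with keys x, y, t, freq (hypothetical layout).
    """
    def scaled(key):
        # min-max normalize one coordinate across the object set
        vals = [o[key] for o in objects]
        lo, hi = min(vals), max(vals)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in vals]

    xs, ys, ts = scaled("x"), scaled("y"), scaled("t")
    n = len(objects)
    cx, cy, ct = sum(xs) / n, sum(ys) / n, sum(ts) / n   # spatiotemporal centroid
    mean_freq = sum(o["freq"] for o in objects) / n      # average change frequency

    best, best_dist = None, math.inf
    for i, o in enumerate(objects):
        if o["freq"] > mean_freq:
            continue  # only objects at or below the average change frequency qualify
        d = math.sqrt(
            w_space * ((xs[i] - cx) ** 2 + (ys[i] - cy) ** 2)
            + w_time * (ts[i] - ct) ** 2
        )
        if d < best_dist:
            best, best_dist = i, d
    return best
```

Because the minimum change frequency can never exceed the mean, at least one object always passes the filter for a non-empty set.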
[0014] Preferably, the minimum spatiotemporal variability cost calculated during splitting is used to select a differential numerical precision level for the new leaf node. Subsequent data objects store the processed differential values relative to the reference object. This includes: pre-setting multiple differential numerical precision levels and corresponding spatiotemporal variability cost threshold ranges; comparing the minimum spatiotemporal variability cost calculated during splitting with the preset threshold ranges to set the differential numerical precision level corresponding to the cost range for the new node; and storing the spatiotemporal coordinates of subsequent data objects in the new node as differential values relative to the reference object's spatiotemporal coordinates, processed according to the set precision level.
[0015] Preferably, when traversing the index tree, the query pattern feature vector is matched with the historical query pattern cluster centroids stored in the index node. If a match is successful, the data of subsequent nodes is prefetched based on the probabilistic adjacency pointer associated with the best matching centroid. This includes: combining the spatiotemporal coordinates of the center point of the query window, the spatial span of the query window, and the time span after performing minimum-maximum normalization processing to form the query pattern feature vector; reading the historical query pattern cluster centroid vectors stored in the current index node; calculating the similarity between the query pattern feature vector and each centroid vector based on Euclidean distance; if the maximum similarity is greater than a preset matching threshold, the probabilistic adjacency pointer associated with the centroid vector with the highest similarity is activated to prefetch the data of the target node.
[0016] Preferably, the asynchronous recording of the actual path and pattern features of this query, and the batch updating of the query pattern cluster centroids and adjacency pointers of the hit nodes on the path by a background thread, includes: the background thread periodically collecting query logs; for each node on the query path and subsequent access nodes, obtaining the pattern feature vector of this query; finding the query pattern cluster centroid with the highest similarity to the pattern feature vector in the current node, and updating the corresponding centroid using the exponential moving average method; updating the access frequency counter from the current node to each adjacent node, and incrementing the counter value corresponding to the actual accessed subsequent node by one; and calculating and updating the transition probability of the probabilistic adjacency pointers pointing to each adjacent node associated with the centroid with the highest similarity based on the updated access frequency counter.
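A minimal Python sketch of the background update step described above follows. The node layout (`centroids`, `counters`, `probs`), the smoothing factor `beta`, and the choice to keep one counter table per centroid are assumptions for illustration; the text does not specify these details.

```python
import math

def update_node(node, query_vec, next_id, beta=0.1):
    """Apply one logged query to a node's pattern statistics, in place."""
    cents = node["centroids"]
    # The nearest centroid has the highest similarity under a Euclidean measure.
    k = min(range(len(cents)), key=lambda i: math.dist(cents[i], query_vec))
    # Exponential moving average pulls the centroid toward the new pattern.
    cents[k] = [(1 - beta) * c + beta * q for c, q in zip(cents[k], query_vec)]
    # Count the successor node that was actually accessed.
    counts = node["counters"][k]
    counts[next_id] = counts.get(next_id, 0) + 1
    # Transition probabilities are the normalized access counts.
    total = sum(counts.values())
    node["probs"][k] = {nid: n / total for nid, n in counts.items()}
    return k

def apply_batch(nodes, log):
    """log: iterable of (node_id, query_vec, next_node_id) entries from the query log."""
    for node_id, query_vec, next_id in log:
        update_node(nodes[node_id], query_vec, next_id)
```

Running `apply_batch` from a periodic background task keeps the update off the query path, which matches the asynchronous design described in the text.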
[0017] Preferably, the method further includes: generating a time granularity code for the geographic information mapping data object, wherein the time granularity code is formed by normalizing and combining multiple time components decomposed from the timestamp of the data object; when inserting a data object into a non-leaf parent node or when node overflow triggers a split, the time granularity code is used as a pre-grouping basis for assessing spatiotemporal variability, guiding data with the same frequency to be preferentially routed to the same tree branch or divided into the same new leaf node.
[0018] This invention utilizes time-granularity coding to pre-extract features and guide grouping of geographic information mapping data objects, prompting data objects with similar update frequencies to be clustered and stored on the branch structure of the index tree. This prevents excessive expansion of node boundaries and frequent reorganization of the index structure caused by mixing high-frequency and low-frequency update data, thus maintaining the long-term stable operation of the data management system.
[0019] Preferably, the calculation of the similarity between the query pattern feature vector and each centroid vector based on Euclidean distance, and the matching prefetching, includes: calculating the Euclidean distance between the query pattern feature vector and each historical query pattern cluster centroid, and converting the Euclidean distance into a similarity; finding the maximum similarity and comparing it with a preset matching threshold; if the maximum similarity is greater than the preset matching threshold, searching the probabilistic adjacency pointer list associated with the best matching centroid, selecting the node with the highest probability for data prefetching and loading it into the memory cache in advance.
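The matching and prefetch decision described above can be illustrated with the following Python sketch. Several details are assumptions: the feature vector uses the window center plus the width, height, and duration of the window, normalization is taken relative to a known dataset extent, and distance is converted to similarity with `1 / (1 + d)`. The threshold value is likewise hypothetical.

```python
import math

def feature_vector(q, extent):
    """q and extent are (xmin, ymin, tmin, xmax, ymax, tmax) boxes."""
    spans = [extent[i + 3] - extent[i] for i in range(3)]
    centre = [((q[i] + q[i + 3]) / 2 - extent[i]) / spans[i] for i in range(3)]
    size = [(q[i + 3] - q[i]) / spans[i] for i in range(3)]
    return centre + size

def similarity(a, b):
    # Assumed conversion: distance 0 maps to similarity 1, larger distances approach 0.
    return 1.0 / (1.0 + math.dist(a, b))

def pick_prefetch(query_vec, centroids, pointers, threshold=0.8):
    """Return the node id to prefetch, or None when no centroid matches.

    centroids: list of vectors; pointers: one {node_id: probability} dict per centroid.
    """
    if not centroids:
        return None
    sims = [similarity(query_vec, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] <= threshold or not pointers[best]:
        return None
    # Follow the adjacency pointer with the highest transition probability.
    return max(pointers[best], key=pointers[best].get)
```

The returned node id would then be handed to whatever cache loader the system uses; that loader is outside the scope of this sketch.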
[0020] In a second aspect, the present invention provides a geographic information mapping data management system based on multi-source spatiotemporal features, including a processor and a memory, wherein the memory stores computer program instructions, and when the computer program instructions are executed by the processor, the above-mentioned geographic information mapping data management method based on multi-source spatiotemporal features is implemented.
[0021] With the above technical solution, the above-mentioned geographic information mapping data management method based on multi-source spatiotemporal characteristics is implemented as a computer program and stored in a memory so that it can be loaded and executed by a processor. A terminal device can thus be built from the memory and processor for convenient use.
[0022] The beneficial effects of this invention are as follows: This invention constructs a more balanced spatiotemporal index structure by comprehensively considering the historical splitting cost of nodes, spatiotemporal increments, and space occupancy rates during the data insertion stage, and evaluating the spatiotemporal variability cost when nodes overflow and split. This reduces the overlap of spatiotemporal boundaries between nodes and improves the rationality of geographic information mapping data organization.
[0023] Furthermore, by combining the differential numerical storage mechanism determined based on the spatiotemporal distribution characteristics of the data, the data volume is compressed and the storage overhead and disk read/write burden are reduced while ensuring data accuracy. By utilizing the query pattern feature extraction and prefetching functions contained in the index nodes, subsequent node data can be preloaded based on historical query records, shortening the response waiting time for spatiotemporal range queries and improving the overall management and retrieval performance of geographic information mapping data.
Attached Figure Description
[0024] Figure 1 is a flowchart of the geographic information mapping data management method based on multi-source spatiotemporal features in this invention; Figure 2 is a schematic diagram of the comprehensive insertion cost model in this invention; Figure 3 is a schematic diagram comparing the relative performance indicators of the four evaluation schemes in this invention.
Detailed Implementation
[0025] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are some embodiments of the present invention, but not all embodiments.
[0026] This invention discloses a geographic information mapping data management method based on multi-source spatiotemporal features. Referring to Figure 1, the method includes steps S1-S3: S1. Obtain geographic information mapping data objects containing spatiotemporal coordinate attributes; when inserting a data object into a non-leaf parent node, calculate a comprehensive insertion cost and select the child node with the smallest comprehensive insertion cost to insert the data object.
[0027] In an optional embodiment, a time-granularity code is generated for the data object. Specifically, the Coordinated Universal Time (UTC) timestamp of the geographic information mapping data object is first decomposed into multiple time components such as year, month, day, hour, minute, and second; then each time component is normalized to a preset integer range of 0 to 1023; finally, Morton coding, i.e., the Z-order space-filling curve, is used to interleave the normalized multidimensional time components into a one-dimensional time-granularity code.
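As a rough illustration of the encoding just described, the following Python sketch normalizes six time components to 10-bit integers and interleaves their bits. The year window (1970 to 2100) and the component order used for interleaving are assumptions; the text only fixes the 0 to 1023 target range and the use of Morton coding.

```python
from datetime import datetime, timezone

BITS = 10  # each component is normalized to the integer range 0..1023

# (minimum, maximum) of each raw time component; the year window is an assumption
RANGES = {
    "year": (1970, 2100),
    "month": (1, 12),
    "day": (1, 31),
    "hour": (0, 23),
    "minute": (0, 59),
    "second": (0, 59),
}

def normalize(value, lo, hi):
    """Linearly scale value from [lo, hi] to an integer in [0, 1023]."""
    value = min(max(value, lo), hi)
    return round((value - lo) * ((1 << BITS) - 1) / (hi - lo))

def time_granularity_code(ts):
    """Interleave the bits of the six normalized components (Morton / Z-order)."""
    ts = ts.astimezone(timezone.utc)
    parts = [normalize(getattr(ts, name), lo, hi) for name, (lo, hi) in RANGES.items()]
    code = 0
    for bit in range(BITS - 1, -1, -1):   # most significant bit first
        for p in parts:                   # year bit, month bit, ..., second bit
            code = (code << 1) | ((p >> bit) & 1)
    return code
```

The result is a 60-bit integer. Timestamps that share the high-order bits of every component share a code prefix, which is what makes the code usable as a pre-grouping key.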
[0028] For the insertion process into a non-leaf parent node, all child nodes are traversed, and a comprehensive insertion cost is calculated for each child node. The node instability metric is calculated using the Exponentially Weighted Moving Average (EWMA) algorithm, as follows:
$$I_t = \alpha V_{\text{last}} + (1 - \alpha) I_{t-1}$$
[0029] where $I_t$ is the current instability metric; $\alpha$ is the smoothing coefficient; $V_{\text{last}}$ is the historical split cost value recorded during the most recent split of the child node; and $I_{t-1}$ is the instability metric value calculated in the previous iteration.
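A single EWMA step for the instability metric might look like the Python sketch below. It assumes the conventional EWMA form in which the smoothing coefficient weights the newest split cost and its complement weights the previous metric; the default coefficient of 0.3 is a hypothetical value.

```python
def update_instability(prev_metric, split_cost, alpha=0.3):
    """One EWMA step: blend the newest split cost into the running metric."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * split_cost + (1.0 - alpha) * prev_metric

# Replay three successive split costs for one child node.
metric = 0.0
for cost in [0.8, 0.2, 0.5]:
    metric = update_instability(metric, cost, alpha=0.5)
```

A larger coefficient makes the metric react faster to the latest split, while a smaller one keeps more of the node's history.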
[0030] The time span expansion is the length by which the child node's time-dimension minimum bounding rectangle (TMBR) must expand to include the time point of the new data object. The spatial expansion is obtained by calling the Union and Area functions of a geometry library such as GEOS to calculate the area increase of the child node's spatial minimum bounding rectangle (MBR) after it is merged with the spatial extent of the new data object. The current space occupancy rate is the number of existing data objects in the child node divided by the node capacity. The current instability metric $I_t$, the spatial expansion $\Delta S$, the time span expansion $\Delta T$, and the space occupancy rate $R$ are each normalized, and the comprehensive insertion cost is then calculated as follows:
$$\mathrm{Cost} = w_1 \tilde{I}_t + w_2 \Delta\tilde{S} + w_3 \Delta\tilde{T} + w_4 \tilde{R}$$
[0031] where $\mathrm{Cost}$ is the comprehensive insertion cost; $w_1$ is the first preset weighting coefficient; $\tilde{I}_t$ is the normalized current instability metric; $w_2$ is the second preset weighting coefficient; $\Delta\tilde{S}$ is the normalized spatial expansion; $w_3$ is the third preset weighting coefficient; $\Delta\tilde{T}$ is the normalized time span expansion; $w_4$ is the fourth preset weighting coefficient; and $\tilde{R}$ is the normalized space occupancy rate.
[0032] The child node with the smallest comprehensive insertion cost is selected as the insertion target.
[0033] In one embodiment, when inserting a data object into a non-leaf parent node, a comprehensive insertion cost is calculated, and the child node with the lowest comprehensive insertion cost is selected to insert the data object. Specifically, for a child node of a non-leaf parent node, the node instability metric of that child node is determined by the arithmetic mean of the historical split cost values of the child node's most recent splits recorded by the parent node; the comprehensive insertion cost of the data object to be inserted for each child node is a weighted linear combination of the normalized node instability metric, the normalized time span expansion caused by the data object, the normalized spatial expansion, and the normalized current space occupancy rate; the data object is inserted into the child node with the smallest comprehensive insertion cost value.
[0034] When inserting a data object O into a non-leaf parent node P, an optimal target child node must be selected for O. For each candidate child node $C_i$ under P, the instability metric $I_i$ of $C_i$ is calculated. The instability metric is maintained by the parent node P, which records the spatiotemporal variability cost values of the most recent $K$ splits of $C_i$; in this invention $K$ is set to 5. The calculation method is as follows:
$$I_i = \frac{1}{k} \sum_{j=1}^{k} V_j$$
[0035] where $I_i$ is the instability metric of candidate child node $C_i$; $k$ is the number of splits recorded; and $V_j$ is the $j$-th historical split cost value.
[0036] If fewer than the preset number $K$ of historical splits have been recorded, the mean is calculated over the actual number of recorded splits. The time span expansion and spatial expansion caused by inserting object O into each candidate child node $C_i$ are then calculated. The time span expansion is calculated as follows:
$$\Delta T_i = T_i^{\text{new}} - T_i^{\text{old}}$$
[0037] where $\Delta T_i$ is the time span expansion, i.e., the length by which the minimum bounding rectangle of candidate child node $C_i$ must expand in the time dimension after the new data object O is inserted into $C_i$; $T_i^{\text{new}}$ is the new time span of $C_i$ after it contains the new data object O; and $T_i^{\text{old}}$ is the old time span of $C_i$ before insertion.
[0038] The spatial expansion is calculated as follows:
$$\Delta S_i = S_i^{\text{new}} - S_i^{\text{old}}$$
[0039] where $\Delta S_i$ is the spatial expansion, i.e., the increase in geometric area of the spatial minimum bounding rectangle of candidate child node $C_i$ after the new data object O is inserted into $C_i$; $S_i^{\text{new}}$ is the new spatial area of $C_i$ after it contains the new data object O; and $S_i^{\text{old}}$ is the old spatial area of $C_i$ before insertion.
[0040] At the same time, the current space occupancy rate of candidate child node $C_i$ is calculated as follows:
$$R_i = \frac{n_i}{M}$$
[0041] where $R_i$ is the current space occupancy rate; $n_i$ is the number of existing data objects in candidate child node $C_i$; and $M$ is the node capacity.
[0042] The instability metric $I_i$, time span expansion $\Delta T_i$, spatial expansion $\Delta S_i$, and current space occupancy rate $R_i$ of all candidate child nodes are each subjected to min-max normalization, which scales the indicator values into the interval [0, 1] and eliminates the influence of differing dimensions. The comprehensive insertion cost is then calculated through a weighted linear combination, as follows:
$$\mathrm{Cost}_i = w_1 \tilde{I}_i + w_2 \Delta\tilde{S}_i + w_3 \Delta\tilde{T}_i + w_4 \tilde{R}_i$$
[0043] where $\mathrm{Cost}_i$ is the comprehensive insertion cost of candidate child node $C_i$; $w_1$ is the first preset weighting coefficient; $\tilde{I}_i$ is the normalized instability metric; $w_2$ is the second preset weighting coefficient; $\Delta\tilde{S}_i$ is the normalized spatial expansion; $w_3$ is the third preset weighting coefficient; $\Delta\tilde{T}_i$ is the normalized time span expansion; $w_4$ is the fourth preset weighting coefficient; and $\tilde{R}_i$ is the normalized current space occupancy rate.
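[0044] The weighting coefficients are preset based on data characteristics; for example, the first through fourth weighting coefficients $w_1$, $w_2$, $w_3$, and $w_4$ are 0.4, 0.2, 0.2, and 0.2, respectively, satisfying $w_1 + w_2 + w_3 + w_4 = 1$. The child node that minimizes the comprehensive insertion cost is selected as the insertion target.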
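The child-selection procedure of this embodiment can be sketched end to end in Python. The sketch makes several simplifying assumptions: the data object is a single point `(x, y, t)`, bounding rectangles are axis-aligned tuples, a child with no recorded splits has an instability of 0, and the `Child` class and function names are hypothetical. The weight order is instability, spatial expansion, time span expansion, occupancy.

```python
from dataclasses import dataclass, field

K = 5  # number of recent splits averaged (value given in the text)

@dataclass
class Child:
    mbr: tuple        # (xmin, ymin, xmax, ymax)
    t_range: tuple    # (t_start, t_end)
    count: int        # data objects currently stored
    capacity: int
    split_costs: list = field(default_factory=list)

def instability(child):
    """Arithmetic mean of up to K most recent split costs (0.0 if none recorded)."""
    recent = child.split_costs[-K:]
    return sum(recent) / len(recent) if recent else 0.0

def area(m):
    return (m[2] - m[0]) * (m[3] - m[1])

def raw_metrics(child, x, y, t):
    xmin, ymin, xmax, ymax = child.mbr
    grown = (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))
    t0, t1 = child.t_range
    d_space = area(grown) - area(child.mbr)           # spatial expansion
    d_time = (max(t1, t) - min(t0, t)) - (t1 - t0)    # time span expansion
    return [instability(child), d_space, d_time, child.count / child.capacity]

def min_max(column):
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

def choose_child(children, x, y, t, weights=(0.4, 0.2, 0.2, 0.2)):
    """Return (index of cheapest child, list of comprehensive insertion costs)."""
    rows = [raw_metrics(c, x, y, t) for c in children]
    cols = [min_max(list(col)) for col in zip(*rows)]  # normalize each indicator
    costs = [sum(w * col[i] for w, col in zip(weights, cols))
             for i in range(len(children))]
    best = min(range(len(children)), key=costs.__getitem__)
    return best, costs
```

Normalizing across the candidate set means the costs are only comparable among siblings of the same parent, which is all the selection step requires.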
[0045] In an optional embodiment, the calculation of the comprehensive insertion cost also includes the temporal granularity encoding matching degree between the data object to be inserted and the historical data within the candidate nodes, which also needs to be normalized. Specifically, during the data insertion stage, the temporal granularity encoding of the object to be inserted is extracted, and the extracted temporal granularity encoding is compared with the dominant granularity encoding of the data within each candidate child node. The obtained difference value is included as a penalty term in the calculation of the comprehensive insertion cost, thereby guiding data with the same frequency to be routed to the same tree branch first. During the node overflow triggering split stage, the temporal granularity encoding is used as the pre-grouping basis for assessing spatiotemporal variability, and data objects with the same or similar encodings are preferentially assigned to the same new leaf node, thereby achieving tight clustering of data with the same frequency in the index tree structure and avoiding the drastic expansion of node time boundaries and frequent reconstruction of the index tree caused by the mixing of high-frequency and low-frequency data.
[0046] In another embodiment, the weighted linear combination is implemented using a single-layer feedforward network model. This model is a single-neuron network without hidden layers or nonlinear activation functions, consisting of an input layer and an output layer, with the input layer directly connected to the single neuron in the output layer. The input is a four-dimensional feature vector consisting of the normalized node instability metric, time span expansion, spatial expansion, and current space occupancy rate produced by the data object to be inserted for a child node. This four-dimensional feature vector satisfies the following relationship:
$$\mathbf{x} = [\tilde{I}, \Delta\tilde{T}, \Delta\tilde{S}, \tilde{R}]^{T}$$
[0047] where $\mathbf{x}$ is the four-dimensional feature vector; $\tilde{I}$ is the normalized node instability metric; $\Delta\tilde{T}$ is the normalized time span expansion; $\Delta\tilde{S}$ is the normalized spatial expansion; and $\tilde{R}$ is the normalized current space occupancy rate.
[0048] The preset weight vector satisfies the following relation:
$$\mathbf{w} = [w_1, w_3, w_2, w_4]^{T}$$
[0049] where $\mathbf{w}$ is the preset weight vector; $w_1$ is the first preset weighting coefficient; $w_3$ is the third preset weighting coefficient; $w_2$ is the second preset weighting coefficient; and $w_4$ is the fourth preset weighting coefficient.
[0050] The output is a scalar value, namely the comprehensive insertion cost, calculated as the dot product of the four-dimensional feature vector and the preset weight vector:
$$\mathrm{Cost} = \mathbf{w}^{T} \mathbf{x}$$
[0051] where $\mathrm{Cost}$ is the comprehensive insertion cost; $\mathbf{w}^{T}$ is the transposed preset weight vector; and $\mathbf{x}$ is the four-dimensional feature vector.
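The single-neuron form is just a dot product, and the only subtlety is keeping the weight order aligned with the feature order. The short Python sketch below shows this with made-up feature values; the function name and the numbers are illustrative assumptions.

```python
def insertion_cost(features, weights):
    """Single linear neuron: no hidden layer, no activation, output = dot product."""
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have equal length")
    return sum(w * x for w, x in zip(weights, features))

w1, w2, w3, w4 = 0.4, 0.2, 0.2, 0.2
# Feature order: instability, time span expansion, spatial expansion, occupancy
features = [0.5, 0.1, 0.3, 0.8]
# The weight vector follows the feature order, so w3 (time) precedes w2 (space)
weights = [w1, w3, w2, w4]
cost = insertion_cost(features, weights)
```

With this ordering the result equals the scalar weighted sum in which the second coefficient multiplies the spatial expansion and the third multiplies the time span expansion.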
[0052] The comprehensive insertion cost model is shown in Figure 2.
[0053] S2. When the number of child node data objects exceeds the capacity threshold, the optimal splitting strategy is selected by evaluating the spatiotemporal variability cost of the candidate splitting strategies. For the new leaf node generated by the split, a reference object is determined based on the spatiotemporal distribution of the internal data objects, and a differential numerical precision level is selected for the new leaf node based on the spatiotemporal variability cost with the smallest value calculated during the split. The data objects stored thereafter all store the processed differential values of the data objects relative to the reference object.
[0054] In an optional embodiment, when the number of data objects in a node exceeds the capacity threshold, a splitting algorithm similar to that of the R*-tree is used: all data objects within the node are sorted along each spatiotemporal dimension axis, and each axis is divided into a finite number of candidate partition combinations based on preset candidate splitting points. For each partition combination, i.e., each candidate splitting strategy, a spatiotemporal variability cost is calculated.
[0055] Specifically, the following metrics are first calculated: the overlap area of the spatial minimum bounding rectangles (MBR) of the two new nodes after splitting, obtained by calling the intersection function of the JTS Topology Suite (JTS); the overlap length of the temporal minimum bounding rectangles (TMBR); and the sum of the spatiotemporal distribution variances of the data objects within the two new nodes, where the spatiotemporal distribution variance equals the sum of the spatial coordinate variances plus the timestamp variance. The three raw metrics are then min-max normalized to obtain dimensionless values, and a weighted sum of these values gives the spatiotemporal variability cost. The splitting strategy that minimizes the spatiotemporal variability cost is executed, and the minimum cost is recorded as the historical split cost in the parent node's entry for the node being split. Within each newly generated leaf node, objects whose change frequency attribute exceeds a preset threshold are excluded. The Partitioning Around Medoids (PAM) algorithm is then applied to the remaining objects to determine the object at the geometric center in the spatiotemporal dimensions, which serves as the reference object. Based on the minimum spatiotemporal variability cost obtained during the split, a piecewise mapping function determines the precision level of the difference values; for example, a 32-bit floating-point number is selected when the cost is less than threshold T1, a 16-bit integer is selected when the cost lies between threshold T1 and threshold T2, and an 8-bit integer is selected when the cost exceeds threshold T2. Each subsequent data object is stored as its original spatiotemporal coordinates minus the spatiotemporal coordinates of the reference object, linearly represented at the selected precision level.
[0056] In one implementation, the optimal splitting strategy is selected by evaluating the spatiotemporal variability cost of the candidate splitting strategies. Specifically, the spatial overlap area between the minimum bounding rectangles of the two new nodes generated by the split is calculated; the normalized time span overlap between the time intervals covered by the two new nodes is calculated; the spatiotemporal distribution variance of all data objects within each of the two new nodes is calculated separately; and the normalized spatial overlap area, the normalized time span overlap, and the normalized sum of the spatiotemporal distribution variances of the two new nodes are combined in a weighted sum to obtain the spatiotemporal variability cost of the current splitting strategy.
[0057] A split is initiated when the number of data objects in a node N exceeds the capacity threshold. Suppose one candidate splitting strategy divides node N into new node $N_1$ and new node $N_2$. The spatial overlap area is calculated as:
$$O_s = \mathrm{Area}\big(\mathrm{MBR}(N_1) \cap \mathrm{MBR}(N_2)\big)$$
[0058] where $O_s$ is the spatial overlap area; $\mathrm{Area}(\cdot)$ is the area calculation function; $\mathrm{MBR}(N_1)$ is the minimum bounding rectangle of new node $N_1$; and $\mathrm{MBR}(N_2)$ is the minimum bounding rectangle of new node $N_2$.
[0059] The time overlap value is calculated as:
$$O_t = \mathrm{Len}\big(T(N_1) \cap T(N_2)\big)$$
[0060] where $O_t$ is the time overlap value; $\mathrm{Len}(\cdot)$ is the length calculation function; $T(N_1)$ is the time interval of new node $N_1$; and $T(N_2)$ is the time interval of new node $N_2$.
[0061] To perform normalization, the time overlap value $O_t$ is divided by the total time span $T_{total}$.
[0062] The sum of the spatiotemporal distribution variances within the two new nodes is calculated as:
$$V_{sum} = V(N_1) + V(N_2)$$
[0063] where $V_{sum}$ is the sum of the spatiotemporal distribution variances; $V(N_1)$ is the spatiotemporal distribution variance of new node $N_1$; and $V(N_2)$ is the spatiotemporal distribution variance of new node $N_2$.
[0064] The spatiotemporal distribution variance of a new node is calculated as:
$$V(N_i) = \sigma_x^2(N_i) + \sigma_y^2(N_i) + \sigma_t^2(N_i)$$
[0065] where $V(N_i)$ is the spatiotemporal distribution variance of candidate new node $N_i$; $\sigma_x^2(N_i)$ is the variance of the abscissas of the data objects within $N_i$; $\sigma_y^2(N_i)$ is the variance of the ordinates of the data objects within $N_i$; and $\sigma_t^2(N_i)$ is the variance of the time coordinates of the data objects within $N_i$.
[0066] The spatial overlap areas $O_s$, the time overlap values $O_t$, and the sums of the spatiotemporal distribution variances $V_{sum}$ generated for all candidate splitting strategies are normalized to obtain the normalized spatial overlap area $\hat{O}_s$, the normalized time overlap value $\hat{O}_t$, and the normalized sum of the spatiotemporal distribution variances $\hat{V}_{sum}$.
[0067] The spatiotemporal variability cost of the splitting strategy is calculated as follows:
$$C_{stv} = w_1 \hat{O}_s + w_2 \hat{O}_t + w_3 \hat{V}_{sum}$$
[0068] where $C_{stv}$ is the spatiotemporal variability cost; $w_1$ is the first weighting coefficient; $\hat{O}_s$ is the normalized spatial overlap area; $w_2$ is the second weighting coefficient; $\hat{O}_t$ is the normalized time overlap value; $w_3$ is the third weighting coefficient; and $\hat{V}_{sum}$ is the normalized sum of the spatiotemporal distribution variances. In this invention, the first weighting coefficient $w_1$ is 0.4, the second weighting coefficient $w_2$ is 0.3, and the third weighting coefficient $w_3$ is 0.3.
[0069] All candidate splitting strategies are traversed, the strategy with the minimum spatiotemporal variability cost is taken as the optimal splitting strategy, and that minimum cost value is recorded in the parent node.
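The split selection described in the preceding paragraphs can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the representation of objects as (x, y, t) tuples, the use of axis-aligned bounds in place of the JTS intersection call, the use of population variance, and the weights 0.4 / 0.3 / 0.3 (one reading of the coefficients in the text) are all assumptions.

```python
from statistics import pvariance

# Assumed weights for spatial overlap, time overlap, and variance sum.
W_SPACE, W_TIME, W_VAR = 0.4, 0.3, 0.3


def bounds(objs, axis):
    """Return (min, max) of one coordinate over a list of (x, y, t) tuples."""
    vals = [o[axis] for o in objs]
    return min(vals), max(vals)


def overlap_1d(a, b):
    """Length of the intersection of two closed intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def raw_metrics(left, right):
    """Return (spatial overlap area, time overlap length, variance sum)."""
    area = (overlap_1d(bounds(left, 0), bounds(right, 0))
            * overlap_1d(bounds(left, 1), bounds(right, 1)))
    t_len = overlap_1d(bounds(left, 2), bounds(right, 2))
    var_sum = sum(pvariance([o[k] for o in grp])
                  for grp in (left, right) for k in range(3))
    return area, t_len, var_sum


def min_max(values):
    """Min-max normalize a sequence; a constant sequence maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def best_split(candidates):
    """Return (index, cost) of the candidate with the lowest weighted cost.

    candidates is a list of (left_objects, right_objects) pairs.
    """
    metrics = [raw_metrics(left, right) for left, right in candidates]
    cols = [min_max(col) for col in zip(*metrics)]
    costs = [W_SPACE * s + W_TIME * t + W_VAR * v for s, t, v in zip(*cols)]
    idx = min(range(len(costs)), key=costs.__getitem__)
    return idx, costs[idx]
```

The returned cost is the value that would be recorded in the parent node as the historical splitting cost. Normalization is performed across the candidate set, so the cost of a strategy is only meaningful relative to the other candidates evaluated in the same split.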
[0070] In one implementation, for each newly generated leaf node, a reference object is determined based on the spatiotemporal distribution and frequency of change of the internal data objects. Specifically, after normalizing the spatial and temporal coordinates of the set of objects assigned to the new leaf node, the spatiotemporal centroid and average frequency of change of the set of objects are calculated. Among the objects whose frequency of change is not higher than the average frequency of change, the weighted spatiotemporal distance between each object and the spatiotemporal centroid is calculated. The object with the smallest weighted spatiotemporal distance is selected as the reference object of the new leaf node.
[0071] When a new leaf node $L$ is created, the object set S allocated to it is processed: the spatial abscissa $x$, spatial ordinate $y$, and time coordinate $t$ of all objects in S are min-max normalized and mapped to the interval $[0, 1]$. Within the resulting three-dimensional cube, the centroid of the normalized coordinates and the average change frequency of all objects are calculated. The centroid is calculated as follows:
$$P_c = (\bar{x}, \bar{y}, \bar{t})$$
[0072] where $P_c$ is the centroid of the normalized coordinates; $\bar{x}$ is the normalized average abscissa; $\bar{y}$ is the normalized average ordinate; and $\bar{t}$ is the normalized average time coordinate.
[0073] A candidate subset $S_c$ is selected from the object set S. The candidate subset $S_c$ contains all objects whose change frequency is less than or equal to the average change frequency of S, i.e., objects that are relatively stable.
[0074] For each object $o_i$ in the candidate subset $S_c$, the weighted spatiotemporal distance $d_i$ between the object and the centroid $P_c$ is calculated. The spatial weight $w_s$ and the time weight $w_t$ depend on the application scenario; for example, for moving-object data the spatial weight $w_s$ can be set to 0.7 and the time weight $w_t$ to 0.3. The candidate subset is traversed, and the object $o^*$ with the minimum weighted spatiotemporal distance is chosen as the reference object of the current leaf node $L$.
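The reference object selection can be illustrated with the short sketch below. The text does not reproduce the exact form of the weighted spatiotemporal distance, so the form used here, the square root of the spatially weighted squared x and y offsets plus the time-weighted squared t offset, is an assumption, as are the function names and the tuple layout.

```python
import math


def min_max_columns(points):
    """Min-max normalize each coordinate column of (x, y, t) tuples to [0, 1]."""
    out_cols = []
    for col in zip(*points):
        lo, hi = min(col), max(col)
        span = hi - lo
        out_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return list(zip(*out_cols))


def pick_reference(points, freqs, w_space=0.7, w_time=0.3):
    """Return the index of the stable object closest to the normalized centroid.

    Assumed distance: sqrt(w_space * (dx^2 + dy^2) + w_time * dt^2).
    """
    norm = min_max_columns(points)
    n = len(norm)
    cx, cy, ct = (sum(p[k] for p in norm) / n for k in range(3))
    mean_freq = sum(freqs) / n

    def dist(p):
        return math.sqrt(w_space * ((p[0] - cx) ** 2 + (p[1] - cy) ** 2)
                         + w_time * (p[2] - ct) ** 2)

    # Only objects whose change frequency does not exceed the average qualify.
    stable = [i for i in range(n) if freqs[i] <= mean_freq]
    return min(stable, key=lambda i: dist(norm[i]))
```

Because at least one object always has a change frequency at or below the mean, the candidate subset is never empty. An object that sits closest to the centroid but changes frequently is skipped in favor of the nearest stable one.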
[0075] In one implementation, a differential numerical precision level is selected for a node based on the minimum spatiotemporal variability cost calculated at the time of splitting, and every data object stored thereafter is stored as its processed differential values relative to the reference object. Specifically, multiple differential numerical precision levels and corresponding spatiotemporal variability cost threshold ranges are preset. The minimum spatiotemporal variability cost calculated at the time of splitting is compared with the preset threshold ranges, and the new node is assigned the differential numerical precision level corresponding to the range in which the cost falls. Data objects subsequently stored in the new node have their spatiotemporal coordinates stored as differential values relative to the reference object's spatiotemporal coordinates, processed according to the set precision level.
[0076] When the spatiotemporal variability cost is below threshold T1 (level 1), the corresponding difference value is stored as a 32-bit floating-point number; when the cost lies between threshold T1 and threshold T2 (level 2), the difference value is stored as a 16-bit signed integer and must be multiplied by a scaling factor before rounding; when the cost exceeds threshold T2 (level 3), the difference values are stored as 8-bit signed integers with a scaling factor of 100. When a new leaf node is created, the minimum spatiotemporal variability cost produced by the split is used to select the precision. For example, since a cost of 0.12 falls within the level-1 range, the corresponding node is set to level 1.
[0077] Subsequently, when a new data object $o_{new}$ is inserted into the current node, the difference values between the new data object and the reference object $o_{ref}$ are calculated.
[0078] The abscissa difference value is calculated as follows:
$$\Delta x = x_{new} - x_{ref}$$
[0079] where $\Delta x$ is the abscissa difference value; $x_{new}$ is the spatial abscissa component of the new data object; and $x_{ref}$ is the spatial abscissa component of the reference object.
[0080] The ordinate difference value is calculated as follows:
$$\Delta y = y_{new} - y_{ref}$$
[0081] where $\Delta y$ is the ordinate difference value; $y_{new}$ is the spatial ordinate component of the new data object; and $y_{ref}$ is the spatial ordinate component of the reference object.
[0082] The time coordinate difference value is calculated as follows:
$$\Delta t = t_{new} - t_{ref}$$
[0083] where $\Delta t$ is the time coordinate difference value; $t_{new}$ is the time coordinate component of the new data object; and $t_{ref}$ is the time coordinate component of the reference object.
[0084] Because the node is set to level 1, the three difference values above are stored directly as 32-bit floating-point numbers. If instead the spatiotemporal variability cost were 0.25, the node would be set to level 2 and the stored data converted to 16-bit integer form; that is, each of $\Delta x$, $\Delta y$, and $\Delta t$ would be multiplied by the scaling factor and rounded.
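A minimal sketch of the precision selection and differential encoding follows. The threshold values T1 = 0.2 and T2 = 0.5 and the 16-bit scaling factor of 1000 are hypothetical, chosen only so that the costs 0.12 and 0.25 from the example fall into levels 1 and 2; the scaling factor of 100 for 8-bit storage is taken from the text. Clamping out-of-range values is also an assumption, since the text does not say how overflow is handled.

```python
import struct

T1, T2 = 0.2, 0.5              # hypothetical cost thresholds
SCALE_16, SCALE_8 = 1000, 100  # 16-bit factor is hypothetical; 100 is from the text


def precision_level(cost, t1=T1, t2=T2):
    """Map the minimum split cost to level 1 (float32), 2 (int16) or 3 (int8)."""
    if cost < t1:
        return 1
    if cost < t2:
        return 2
    return 3


def clamp_round(value, scale, bits):
    """Scale, round, and clamp to the signed integer range of `bits` bits."""
    limit = 2 ** (bits - 1)
    return max(-limit, min(limit - 1, round(value * scale)))


def encode(obj, ref, level):
    """Return the stored differences of obj = (x, y, t) relative to ref."""
    deltas = [o - r for o, r in zip(obj, ref)]
    if level == 1:
        # Round-trip through IEEE 754 single precision to mimic float32 storage.
        return list(struct.unpack("<3f", struct.pack("<3f", *deltas)))
    if level == 2:
        return [clamp_round(d, SCALE_16, 16) for d in deltas]
    return [clamp_round(d, SCALE_8, 8) for d in deltas]


def decode(stored, ref, level):
    """Reconstruct approximate coordinates from stored differences."""
    scale = {1: 1, 2: SCALE_16, 3: SCALE_8}[level]
    return [r + s / scale for s, r in zip(stored, ref)]
```

With these assumed factors, a level-2 node resolves differences to 0.001 coordinate units within roughly ±32.7 units of the reference object, while a level-3 node resolves to 0.01 units within roughly ±1.27 units. The choice of scaling factor therefore trades range against resolution.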
[0085] S3. Receive spatiotemporal range query requests and extract query pattern feature vectors. When traversing the index tree, match the query pattern feature vectors with the historical query pattern cluster centroids stored in the index nodes. If a match is successful, prefetch subsequent node data based on the probabilistic adjacency pointers associated with the best matching centroid. After the query is completed, asynchronously record the actual path and pattern features of this query. Update the transition probabilities of the query pattern cluster centroids and adjacency pointers of the nodes hit on the path in batches through a background thread.
[0086] In an optional embodiment, for a spatiotemporal range query request, the following components are extracted: the x-coordinate $x_c$ and y-coordinate $y_c$ of the center point of the spatial query window, the width and height of the window, and the center time $t_c$ and duration of the time query interval. Min-max normalization is performed on all of these feature components, forming a six-dimensional query pattern feature vector $F$. During the index tree traversal, when a non-leaf node is reached, the spatial distance functions of the scientific computing library scipy are used to calculate the Euclidean distance between $F$ and each of the K historical query pattern cluster centroids stored in the current node, i.e., centroids $C_1$ to $C_K$, and the best-matching centroid $C^*$ with the minimum Euclidean distance is found. Each centroid $C_k$ is associated with a probabilistic adjacency pointer, which is a hash table or dictionary structure whose key is a child node identifier and whose value is the transition probability of visiting the corresponding child node after matching centroid $C_k$ at the current node. Based on the probability distribution associated with the best-matching centroid $C^*$, the child node with the highest probability is selected as the target child node, and a prefetch request for the target child node's data block is submitted to the storage medium via an independent thread pool or an asynchronous input/output model such as asyncio. After the query is completed, the actual access path is recorded asynchronously. Through an independent background update thread, for each non-leaf node N on the path, the internal cluster centroid and transition probabilities are updated using incremental update rules. The centroid update uses an update formula with a learning rate, calculated as follows:
$$C^*_{new} = (1 - \alpha)\, C^*_{old} + \alpha F$$
[0087] where $C^*_{new}$ is the updated best-matching centroid; $\alpha$ is the learning rate; $C^*_{old}$ is the best-matching centroid before the update; and $F$ is the six-dimensional query pattern feature vector.
[0088] For the probabilistic adjacency pointer associated with the best-matching centroid $C^*$, if the actually visited child node is target child node $c_j$, the visit count associated with $c_j$ is first incremented by one, and the transition probabilities of all child nodes under the current centroid are then recalculated as follows:
$$P_{new}(c_j) = \frac{n(c_j)}{\sum_{m} n(c_m)}$$
[0089] where $P_{new}(c_j)$ is the new transition probability; $n(c_j)$ is the visit count of target child node $c_j$; and $\sum_{m} n(c_m)$ is the total visit count of all child nodes under the current centroid.
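The two incremental update rules can be sketched in a few lines. The function names, the dictionary used for visit counts, and the default learning rate of 0.1 are illustrative assumptions; the text does not state a learning rate value.

```python
def update_centroid(centroid, feature, alpha=0.1):
    """Move the matched centroid toward the query feature vector.

    Implements new = (1 - alpha) * old + alpha * feature, component-wise.
    """
    return [(1 - alpha) * c + alpha * f for c, f in zip(centroid, feature)]


def record_visit(counts, child_id):
    """Increment the visit count of child_id and return fresh probabilities.

    counts maps child node identifiers to visit counts and is updated in place.
    """
    counts[child_id] = counts.get(child_id, 0) + 1
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()}
```

A small learning rate makes a centroid drift slowly and resist one-off queries, while a larger one tracks workload shifts faster at the cost of stability. Because the probabilities are recomputed from raw counts, they always sum to one for each centroid.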
[0090] In one implementation, when traversing the index tree, the query pattern feature vector is matched with the historical query pattern cluster centroids stored in the index node. If a match is successful, subsequent node data is prefetched based on the probabilistic adjacency pointer associated with the best matching centroid. Specifically, the spatiotemporal coordinates of the query window's center point, the spatial span of the query window, and the time span are combined into a query pattern feature vector after undergoing min-max normalization. The historical query pattern cluster centroid vectors stored in the current index node are read. The similarity between the query pattern feature vector and each centroid vector based on Euclidean distance is calculated. If the maximum similarity is greater than a preset matching threshold, the probabilistic adjacency pointer associated with the centroid vector with the highest similarity is activated to prefetch the target node's data.
[0091] The spatiotemporal range query Q includes the abscissa of the spatial center point, the ordinate of the spatial center point, the time center point, the horizontal spatial span, the vertical spatial span, and the time span, i.e., $Q = (x_c, y_c, t_c, s_x, s_y, s_t)$. Using the spatiotemporal range of the global or parent node as the reference, these six components are min-max normalized to form a six-dimensional query pattern feature vector $F$. When the index tree traversal reaches a non-leaf node N, the K historical query pattern cluster centroids $C_1, \dots, C_K$ are read from its internal storage. For each centroid $C_k$, the Euclidean distance between that centroid and the feature vector $F$ is calculated as follows:
$$d_k = \lVert F - C_k \rVert_2 = \sqrt{\sum_{j=1}^{6} \left(F_j - C_{k,j}\right)^2}$$
[0092] where $d_k$ is the Euclidean distance; $F$ is the six-dimensional query pattern feature vector; and $C_k$ is a historical query pattern cluster centroid.
[0093] Convert distance to similarity, i.e.:
[0094] where $s_k$ is the similarity, which in this invention is restricted to a fixed range of values, and $d_k$ is the Euclidean distance.
[0095] The maximum similarity $s_{max} = \max_k s_k$ and the corresponding best-matching centroid $C^*$ are then found, where $s_{max}$ is the maximum similarity, $\max$ is the maximum function, and $s_k$ is the similarity between the feature vector and the k-th centroid. The maximum similarity $s_{max}$ is compared with the preset matching threshold $\theta$. If $s_{max}$ is greater than $\theta$, the match is successful, and the probabilistic adjacency pointer list associated with the best-matching centroid $C^*$ is looked up. This list contains the identifiers and transition probabilities of the subsequent nodes most likely to be visited, starting from the current node N, by queries that historically belonged to the pattern of the best-matching centroid. The one or two nodes with the highest probabilities are selected for data prefetching and preloaded into the memory cache.
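The matching and prefetch selection step can be sketched as below. Several details are assumptions rather than statements of the patented method: the distance-to-similarity conversion is taken to be 1 / (1 + d), which the text does not confirm; spans are normalized by dividing by the extent length; the threshold 0.8 is hypothetical; and the function names and data layout are illustrative. The standard-library `math.dist` is used here in place of the scipy distance routine mentioned in the text.

```python
import math


def feature_vector(query, extent):
    """Normalize the six query components against a reference extent.

    query  = (xc, yc, tc, sx, sy, st)
    extent = ((x_min, x_max), (y_min, y_max), (t_min, t_max))
    """
    centres, spans = query[:3], query[3:]
    norm_c = [(c - lo) / (hi - lo) for c, (lo, hi) in zip(centres, extent)]
    norm_s = [s / (hi - lo) for s, (lo, hi) in zip(spans, extent)]
    return norm_c + norm_s


def match_and_prefetch(feature, centroids, pointers, threshold=0.8, top_n=2):
    """Return ids of child nodes to prefetch, or [] if no centroid matches.

    centroids is a list of six-dimensional vectors; pointers[k] maps child
    node identifiers to transition probabilities for centroid k.
    """
    # Assumed conversion: similarity = 1 / (1 + Euclidean distance).
    sims = [1.0 / (1.0 + math.dist(feature, c)) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] <= threshold:
        return []
    ranked = sorted(pointers[best].items(), key=lambda kv: kv[1], reverse=True)
    return [child for child, _ in ranked[:top_n]]
```

The returned identifiers would then be handed to whatever asynchronous loader the system uses, for example a thread pool, so that the traversal itself is not blocked while the blocks are read.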
[0096] In one implementation, after the query is completed, the actual path and pattern features of this query are asynchronously recorded. A background thread then batch updates the centroids of the query pattern clusters and the transition probabilities of the adjacency pointers for the nodes hit along the path. Specifically, the background thread periodically collects query logs. For each node on the query path and its subsequent visited nodes, it obtains the pattern feature vector of this query. It finds the centroid of the query pattern cluster with the highest similarity to the pattern feature vector in the current node and updates the corresponding centroid using an exponential moving average method. It updates the access frequency counters from the current node to each adjacent node, incrementing the counter value corresponding to the actually visited subsequent node. Based on the updated access frequency counters, it calculates and updates the transition probabilities of the probabilistic adjacency pointers associated with the centroid with the highest similarity, pointing to each adjacent node.
[0097] Assume the actual path of a query runs from node $N_a$ to node $N_b$ to node $N_c$ and onward. When the query proceeds from node $N_a$ to node $N_b$, the following update is performed: the normalized feature vector $F$ of this query is obtained; within node $N_a$, the historical query pattern cluster centroid $C^*$ most similar to $F$ is found; and, using learning rate $\alpha$, the matched centroid is updated by the exponential moving average method, calculated as follows:
$$C^*_{new} = (1 - \alpha)\, C^*_{old} + \alpha F$$
[0098] where $C^*_{new}$ is the updated best-matching centroid; $\alpha$ is the learning rate; $C^*_{old}$ is the best-matching centroid before the update; and $F$ is the normalized feature vector.
[0099] The adjacency pointer information associated with the historical query pattern cluster centroid $C^*$ is then updated. This information is a mapping structure whose key is an adjacent node identifier and whose value is an access frequency. The entry whose key is node $N_b$ is found and its frequency counter is incremented by 1; if this is the first time node $N_b$ appears as the successor of node $N_a$ under this pattern, a new entry with a frequency of 1 is created. The transition probabilities are then recalculated from the updated frequency counters. For each adjacent node $N_j$ associated with the historical query pattern cluster centroid $C^*$, the new transition probability is calculated as follows:
$$P_{new}(N_j) = \frac{f(N_j)}{\sum_{m} f(N_m)}$$
[0100] where $P_{new}(N_j)$ is the new transition probability; $f(N_j)$ is the access frequency of adjacent node $N_j$; and $\sum_{m} f(N_m)$ is the sum of the access frequencies of all adjacent nodes associated with the historical query pattern cluster centroid $C^*$.
[0101] The updated probability values are stored in the probabilistic adjacency pointers of node $N_a$ and are used to perform more accurate prefetching for subsequent similar queries.
[0102] The experimental dataset uses the GeoLife GPS trajectory dataset, containing approximately 18,000 trajectories and 24 million spatiotemporal data points. The query set consists of 1,000 randomly generated spatiotemporal range queries, with the spatial range accounting for 0.1% to 5% of the total area and the temporal range accounting for 1% to 10% of the total time span. Four schemes were established for ablation comparison. Scheme 1 uses the baseline 3DR-tree algorithm; Scheme 2 removes the differential value extraction and query prefetching modules of this invention, employing only an optimized insertion and splitting strategy; Scheme 3 adds a differential value compression module to Scheme 2; and Scheme 4 is the complete implementation of this application.
[0103] The specific experimental data are as follows: Scheme 1: Index building time 3.2 hours, index size 8.5GB, average query response time 125 milliseconds, average node accesses per query 150. Scheme 2: Index building time 3.5 hours, index size 8.2GB, average query response time 98 milliseconds, average node accesses 115. Scheme 3: Index building time 3.6 hours, index size 3.9GB, average query response time 75 milliseconds, average node accesses 116. Scheme 4: Index building time 3.8 hours, index size 4.1GB, average query response time 42 milliseconds, average node accesses 65. The complete implementation of this invention, Scheme 4, exhibits the best query performance. Although the index building time and space overhead are slightly increased compared to Scheme 3, compared to the benchmark 3DR-tree algorithm, the complete scheme reduces query response time by 66.4% and compresses index storage space by 51.8%. Each ablation version also verified the rationality of different modules. The optimized insertion and splitting strategies, differential numerical compression, and query-aware prefetching mechanism all contributed to the overall performance.
[0104] The performance improvement of Scheme 2 over Scheme 1 stems from the insertion and splitting strategy that accounts for node instability and spatiotemporal variability cost, which reduces node overlap and thus the number of invalid paths traversed during queries, resulting in a 23.3% decrease in node access count. Scheme 3, building on Scheme 2, introduces differential numerical compression based on reference objects, reducing the index size by 52.4% and lowering disk I/O overhead, which further shortens query time by 23.5%. Scheme 4 introduces the query pattern learning and prefetching mechanism. By analyzing historical queries, it predicts and preloads data, reducing query waiting time and the actual number of logical node accesses, and further reducing query time by 44%. This demonstrates the advantage of query pattern learning and prefetching in handling repetitive or similar query loads. A comprehensive performance comparison of the schemes is shown in Figure 3.
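The percentages quoted in the two preceding paragraphs can be rechecked directly from the reported measurements. The short script below only restates those figures; the dictionary layout and the helper name are illustrative.

```python
def reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Reported measurements per scheme: index size, response time, node accesses.
schemes = {
    1: {"size_gb": 8.5, "query_ms": 125, "accesses": 150},
    2: {"size_gb": 8.2, "query_ms": 98, "accesses": 115},
    3: {"size_gb": 3.9, "query_ms": 75, "accesses": 116},
    4: {"size_gb": 4.1, "query_ms": 42, "accesses": 65},
}

checks = {
    "scheme 4 vs 1, query time": reduction(125, 42),
    "scheme 4 vs 1, index size": reduction(8.5, 4.1),
    "scheme 2 vs 1, node accesses": reduction(150, 115),
    "scheme 3 vs 2, index size": reduction(8.2, 3.9),
    "scheme 3 vs 2, query time": reduction(98, 75),
    "scheme 4 vs 3, query time": reduction(75, 42),
}

for label, pct in checks.items():
    print(f"{label}: {pct:.1f}%")
```

Each value agrees with the stated figure to one decimal place, so the reported reductions are consistent with the raw measurements.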
[0105] This invention also discloses a geographic information mapping data management system based on multi-source spatiotemporal features, including a processor and a memory. The memory stores computer program instructions, which, when executed by the processor, implement the geographic information mapping data management method based on multi-source spatiotemporal features according to this invention.
[0106] The system also includes other components well known to those skilled in the art, such as communication buses and communication interfaces, the settings and functions of which are known in the art and will not be described in detail here.
[0107] It should be noted that those skilled in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of this invention. Therefore, the scope of protection of this patent should be determined by the appended claims.
Claims
1. A geographic information mapping data management method based on multi-source spatiotemporal characteristics, characterized in that the method comprises: obtaining geographic information mapping data objects containing spatiotemporal coordinate attributes; when inserting a data object into a non-leaf parent node, calculating a comprehensive insertion cost and selecting the child node with the smallest comprehensive insertion cost to insert the data object; when the number of data objects in a child node exceeds the capacity threshold, selecting the optimal splitting strategy by evaluating the spatiotemporal variability cost of candidate splitting strategies; for each new leaf node generated by the split, determining a reference object based on the spatiotemporal distribution of the internal data objects, and selecting a differential numerical precision level for the new leaf node based on the minimum spatiotemporal variability cost calculated during the split, wherein data objects stored thereafter are stored as processed differential values relative to the reference object; receiving spatiotemporal range query requests and extracting query pattern feature vectors; when traversing the index tree, matching the query pattern feature vector with the historical query pattern cluster centroids stored in the index node, and, if a match is successful, prefetching subsequent node data based on the probabilistic adjacency pointer associated with the best-matching centroid; and after the query is completed, asynchronously recording the actual path and pattern features of the query, and updating, in batches by a background thread, the query pattern cluster centroids and the adjacency pointer transition probabilities of the nodes hit on the path.
2. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, when inserting a data object into a non-leaf parent node, calculating a comprehensive insertion cost and selecting the child node with the smallest comprehensive insertion cost to insert the data object includes: for a child node of a non-leaf parent node, determining the node instability metric of that child node as the arithmetic mean of the child node's most recent historical split cost values recorded by the parent node; the comprehensive insertion cost of the data object to be inserted, for each child node, being a weighted linear combination of the node instability metric, the time span expansion caused by the data object, the spatial expansion, and the current space occupancy rate, each after normalization; and inserting the data object into the child node with the smallest comprehensive insertion cost value.
3. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, The step of selecting the optimal splitting strategy by evaluating the spatiotemporal variability cost of candidate splitting strategies includes: calculating the spatial overlap area between the minimum bounding rectangles of the two new nodes generated by the split; calculating the normalized time span overlap between the time intervals covered by the two new nodes; calculating the spatiotemporal distribution variance of all data objects within the two new nodes respectively; and weighted summing the sum of the normalized spatial overlap area, the normalized time span overlap, and the normalized spatiotemporal distribution variance of the two new nodes to obtain the spatiotemporal variability cost of the current splitting strategy.
4. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, The step of determining a reference object based on the spatiotemporal distribution of internal data objects includes: normalizing the spatial and temporal coordinates of the object set assigned to the new leaf node, calculating the spatiotemporal centroid and average change frequency of the object set; calculating the weighted spatiotemporal distance between each object and the spatiotemporal centroid among objects whose change frequency is not higher than the average change frequency; and selecting the object with the smallest weighted spatiotemporal distance value as the reference object of the new leaf node.
5. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, The minimum spatiotemporal variability cost calculated during splitting is used to select a differential numerical precision level for the new leaf node. Subsequent data objects are stored with their processed differential values relative to the reference object. This includes: pre-setting multiple differential numerical precision levels and corresponding spatiotemporal variability cost threshold ranges; comparing the minimum spatiotemporal variability cost calculated during splitting with the pre-set threshold ranges to set the differential numerical precision level corresponding to the cost range for the new node; and storing the spatiotemporal coordinates of subsequent data objects in the new node as differential values relative to the spatiotemporal coordinates of the reference object, processed according to the set precision level.
6. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, When traversing the index tree, the query pattern feature vector is matched with the historical query pattern cluster centroids stored in the index node. If a match is successful, the data of subsequent nodes is prefetched based on the probabilistic adjacency pointer associated with the best matching centroid. This includes: combining the spatiotemporal coordinates of the center point of the query window, the spatial span of the query window, and the time span after performing minimum-maximum normalization processing to form the query pattern feature vector; reading the historical query pattern cluster centroid vectors stored in the current index node; calculating the similarity between the query pattern feature vector and each centroid vector based on Euclidean distance; and if the maximum similarity is greater than a preset matching threshold, activating the probabilistic adjacency pointer associated with the centroid vector with the highest similarity to prefetch the data of the target node.
7. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, The asynchronous recording of the actual path and pattern features of this query, and the batch updating of the query pattern cluster centroids and transition probabilities of the adjacent pointers of the hit nodes on the path by a background thread, includes: the background thread periodically collecting query logs, obtaining the pattern feature vector of this query for each node and subsequent access node on the query path; finding the query pattern cluster centroid with the highest similarity to the pattern feature vector in the current node, and updating the corresponding centroid using the exponential moving average method; updating the access frequency counter from the current node to each adjacent node, and incrementing the counter value corresponding to the actual accessed subsequent node by one; and calculating and updating the transition probabilities of the probabilistic adjacent pointers pointing to each adjacent node associated with the centroid with the highest similarity based on the updated access frequency counter.
8. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 1, characterized in that, The method further includes: generating a time granularity code for the geographic information mapping data object, wherein the time granularity code is formed by normalizing and combining multiple time components decomposed from the timestamp of the data object; when inserting a data object into a non-leaf parent node or when node overflow triggers a split, the time granularity code is used as a pre-grouping basis for assessing spatiotemporal variability, guiding data with the same frequency to be preferentially routed to the same tree branch or divided into the same new leaf node.
9. The geographic information mapping data management method based on multi-source spatiotemporal features according to claim 6, characterized in that, The calculation of the similarity between the query pattern feature vector and each centroid vector based on Euclidean distance, and the matching prefetching, includes: calculating the Euclidean distance between the query pattern feature vector and each historical query pattern cluster centroid, and converting the Euclidean distance into a similarity; finding the maximum similarity and comparing it with a preset matching threshold; if the maximum similarity is greater than the preset matching threshold, searching the probabilistic adjacency pointer list associated with the best matching centroid, selecting the node with the highest probability for data prefetching and loading it into the memory cache in advance.
10. A geographic information mapping data management system based on multi-source spatiotemporal features, characterized in that the system comprises: a processor and a memory, wherein the memory stores computer program instructions that, when executed by the processor, implement the geographic information mapping data management method based on multi-source spatiotemporal features according to any one of claims 1-9.