A highly scalable concurrent learning index system
By optimizing the learning index structure and using machine learning to predict key-value storage locations, combined with Bloom filters and key-linear models, the problems of high storage overhead and low concurrency in the learning index were solved, resulting in a highly efficient and scalable index system that improves data access and update performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAMEN UNIV
- Filing Date
- 2023-11-17
- Publication Date
- 2026-06-12
Figure CN117493349B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data indexing structures, and more particularly to an efficient and scalable concurrent learning indexing system.
Background Technology
[0002] With the explosive growth of data, the storage and access performance of in-memory systems has become crucial. Traditional index structures such as B+ trees, hash trees, and tries have been widely used for data processing tasks. However, these classic index structures are limited in the query performance they can deliver. To improve performance, applying machine learning to index structures has become a popular approach and has opened up a new area of indexing research. The idea behind a learned index is to use a trained model as an approximate index: the model is trained on the keys and positions of records and is then used to predict the position of a given key. Compared with traditional index structures such as B+ trees, learned indexes perform better on queries. However, because training the model takes a long time, learned indexes may perform poorly when the index is updated. Research on learned index structures is therefore necessary to achieve better-performing, multi-functional indexes in in-memory systems.
[0003] Insertion strategies in a learning index are generally divided into two types: incremental cache insertion and in-place insertion. Incremental cache insertion can be further subdivided into index-level, node-level, and key-value-pair-level caching. At the index level, all insertions are maintained in a single cache; at the node level, a cache is maintained for each leaf node to store newly inserted data; and at the key-value-pair level, a cache is maintained for each key-value pair in each leaf node. When a cache is full, these strategies merge the cached data with the data in the learning index and retrain to build a new learning index. This approach pays additional storage overhead to support insertion. In-place insertion instead reserves gaps in each node to accommodate insertions (e.g., each key-value pair is allotted two positions). For each insertion, if the target position is empty, the key-value pair is inserted directly; otherwise, either existing key-value pairs are moved to make room for the new one, or a new node is created, the target position is replaced with a pointer to the new node, and the conflicting key-value pairs are written to the new node. In-place insertion tends to move a large number of key-value pairs away from their original positions, which reduces the index's concurrency.
To address the decline in insertion performance caused by the high training cost of the models in learned indexes, existing research can be divided into three categories: (1) In single-threaded scenarios, a data structure with a free array on each node is combined with a cost model that controls node splitting or model retraining (Ding J, Minhas UF, Yu J, et al. ALEX: an updatable adaptive learned index. In SIGMOD, 2020; Hadian A, Heinis T. MADEX: Learning-augmented Algorithmic Index Structures. In VLDB, 2020; Wu J, Zhang Y, Chen S, et al. Updatable learned index with precise positions. In arXiv preprint arXiv, 2021); (2) In single-threaded scenarios, a multi-layer storage design is adopted, with the top layer used for inserting the latest key-value pairs. When a storage array is full, it is merged into the adjacent next layer (Ferragina P, Vinciguerra G. The PGM-index: a fully-dynamic compressed learned index with provable worst-case); (3) In multi-threaded scenarios, traditional data structures (such as B+ trees, MassTree, etc.) are used as incremental index caches. When a new key-value pair is not in the existing array, it is inserted into the cache. A background thread then asynchronously merges the key-value pairs in the cache and retrains to build a new learning index (Tang C, Wang Y, Dong Z, et al. XIndex: a scalable learned index for multicore data storage. In PPoPP, 2020; Wang Y, Tang C, Wang Z, et al. SIndex: a scalable learned index for string keys. In APSys, 2020; Li P, Hua Y, Jia J, et al. FINEdex: a fine-grained learned index scheme for scalable and concurrent memory systems. In VLDB, 2021).
[0004] However, experiments and observations have revealed that while existing research considers model training algorithms, storage structures, and concurrency when designing index structures, it neglects actual storage overhead. Storing new data directly with these methods leads to several problems: (1) the algorithm builds the model directly from the maximum error value, but the actual error at most data positions is much smaller than the maximum error value, so the model's query performance is not fully exploited; (2) when tree-structured caches are used to store new data under different models, the models cannot be retrained in time as the data volume grows, resulting in performance degradation; (3) overly fine-grained cache structures easily waste space; (4) using gapped arrays to absorb newly inserted data significantly reduces the concurrency of the index. Therefore, how to optimize model building in learning index algorithms and design index structures that improve concurrency and reduce wasted space remains a challenging and important problem in the field of learning indexes.
Summary of the Invention
[0005] The purpose of this invention is to solve the problems of excessive storage overhead and limited concurrency in existing learning indexes. It proposes a concurrent learning index that uses machine learning to predict the storage location of keys and achieves scalability and high efficiency through a fine-grained buffer structure and a non-blocking update method.
[0006] The technical solution adopted by this invention to solve its technical problem is: to provide an efficient and scalable concurrent learning index system, comprising:
[0007] The learning index layer is built and trained using key-value pair data, and the learning index model is periodically updated using newly written key-value pair data.
[0008] The second-stage storage layer stores key-value pair data used to train the learning index model, as well as caches newly written key-value pair data.
[0009] The tree-structured index layer generates and updates index nodes based on newly written key-value pairs, and periodically returns index nodes to the learning index layer to update the learning index model.
[0010] Preferably, the learning index layer constructs a learning index model for fuzzy localization and a key linear model for precise localization, and the construction steps are as follows:
[0011] The initial key-value pair data is sorted and grouped to form multiple key arrays. Each key array creates a storage node. A key linear model is constructed for each key array to find the storage location of the key-value pair data in the key array, achieving precise location.
[0012] Select the maximum and minimum values of the keys in each key array to form a key index array;
[0013] Based on the key index array, a learning index model is built and trained using machine learning to obtain a trained learning index model, which is used to search the key array to achieve fuzzy positioning.
[0014] The system periodically receives index nodes from the tree-structured index layer and updates the trained learning index model.
[0015] Preferably, the periodic reception of index nodes in the tree-like index layer to update the trained learning index model specifically involves:
[0016] The learning index layer periodically receives index nodes returned by the tree index layer, merges the index nodes with the original key index array of the learning index layer in an orderly manner to form a new array, and uses the new array to reconstruct and train a new learning index model, replacing the original learning index model.
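As a rough illustration of the ordered merge described above, the following Python sketch replaces every key-index entry that points to an index node with that node's own entries, keeps the result sorted by key, and then refits a model over the merged array. The `Entry` and `IndexNode` classes, the function names, and the single interpolation line standing in for the learning index model are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexNode:
    # Entries written by the tree index layer after storage-node splits.
    entries: List["Entry"] = field(default_factory=list)


@dataclass
class Entry:
    min_key: int
    max_key: int
    target: object  # a storage-node id (a str here) or an IndexNode


def flatten(entries):
    """Expand entries that point to index nodes into storage-node entries."""
    out = []
    for entry in entries:
        if isinstance(entry.target, IndexNode):
            out.extend(flatten(entry.target.entries))
        else:
            out.append(entry)
    return out


def rebuild_key_index(key_index):
    """Merge returned index nodes into the key index array, keeping key order."""
    return sorted(flatten(key_index), key=lambda e: e.min_key)


def fit_interpolation(key_index):
    """Stand-in for retraining: one line through the first and last min_key."""
    first, last = key_index[0].min_key, key_index[-1].min_key
    slope = (len(key_index) - 1) / (last - first) if last != first else 0.0
    return slope, -slope * first  # slot ~= slope * key + intercept
```

In a full system the new array and model would then be swapped in for the old ones; the sketch stops at producing them.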
[0017] Preferably, the second-stage storage layer includes:
[0018] The first-stage storage is used to cache newly written key-value pair data; the first-stage storage consists of a root array and a subarray, the subarray stores key-value pair data, and the root array stores the maximum and minimum values of the subarray keys;
[0019] The second stage of storage is used to store key-value pair data that have been trained on the learned index model and the corresponding key linear model;
[0020] A Bloom filter is used to determine whether key-value pairs exist in the cache stored in the first stage.
[0021] Preferably, the tree index layer generates and updates index nodes based on the newly written key values, as follows:
[0022] Construct an empty array to serve as the index node;
[0023] When the cache of a storage node in the second-stage storage layer is full, the key-value pairs of the first-stage and second-stage storage in that storage node are merged and grouped in an orderly manner to obtain two key arrays, and two storage nodes corresponding to the two key arrays are created.
[0024] Construct a key linear model for each of the two key arrays;
[0025] Select the maximum and minimum values of the keys in the two key arrays respectively, and write them into the index nodes.
[0026] Preferably, the size of the index nodes in the tree index layer is limited. When an index node reaches the limit size, one of the storage nodes in the index node is replaced with an index pointing to a new index node. The storage node is first stored in the new index node, and then the new storage node is written to the new index node.
[0027] Preferably, the concurrent learning index system is used to implement write or delete functions, as follows:
[0028] The learning index layer uses the learning index model to find the key array containing key-value pair data;
[0029] The second-stage storage layer finds the Bloom filter in the corresponding storage node and modifies the bits corresponding to the key in the Bloom filter. The content of these bits indicates whether the key-value pair data is recorded in the Bloom filter.
[0030] The key-value pair data is stored in the first stage of storage at this storage location; wherein, the key-value pair data written carries a flag bit 1, and the key-value pair data deleted carries a flag bit 0.
[0031] Preferably, the concurrent learning index system is used to implement query or update functions, as follows:
[0032] The learning index layer uses the learning index model to find the key array containing key-value pair data;
[0033] The second-stage storage layer uses the Bloom filter in the corresponding storage node to determine whether the key value exists in the first-stage storage.
[0034] If the Bloom filter determines that the key-value pair data exists, the system first looks for the key-value pair data in the first-stage storage. If it is found, the system returns the query result or updates the data. If it is not found, the system enters the second-stage storage, uses the key linear model for precise positioning, finds the storage location of the key-value pair data in the key array, and queries or updates the key-value pair data.
[0035] If the Bloom filter determines that the key-value pair does not exist, the system directly enters the second-stage storage, uses the key linear model for precise positioning, finds the storage location of the key-value pair data in the key array, and queries or updates the key-value pair data.
[0036] Preferably, the learning index layer uses a learning index model to find the key array where the key-value pair data is located. Specifically, it finds the storage node through the learning index model; if the storage node points to the key index array, it finds the key array corresponding to the storage node; if the storage node points to the index node, it continues to search for storage nodes using the index node; if the found storage node still points to the index node, it continues to search for storage nodes using the index node until a storage node pointing to the key index array is found, and then the key array corresponding to the storage node is found.
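The lookup loop in this paragraph can be sketched in Python as follows. This is only an illustration: the `StorageNode` and `IndexNode` classes, the `min_keys` array, and the use of binary search inside an index node are assumptions for the example, and the slot that the learning index model returns is simply passed in as `slot`.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class StorageNode:
    keys: List[int]  # the key array held by this storage node


@dataclass
class IndexNode:
    min_keys: List[int] = field(default_factory=list)
    children: List[Union["IndexNode", StorageNode]] = field(default_factory=list)


def descend(slot, key):
    """Follow index nodes until a storage node (and its key array) is reached."""
    node = slot
    while isinstance(node, IndexNode):
        # Choose the child whose minimum key is the largest one <= key.
        i = max(bisect_right(node.min_keys, key) - 1, 0)
        node = node.children[i]
    return node
```

If the slot already refers to a storage node, the loop body never runs and the node is returned as is, which matches the first case in the paragraph.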
[0037] Preferably, the learning index layer, the second-stage storage layer, and the tree index layer are all equipped with index concurrency modules to ensure the normal implementation of multi-threaded concurrent index operations. Specifically, the learning index layer sets an RCU lock to prevent threads from accessing the learning index model during updates; in the tree index layer, a mutex lock is set on index nodes that are not yet full; in the second-stage storage layer, a mutex lock is set on the first-stage storage; the mutex lock only allows one thread to access the storage node.
[0038] The present invention has the following beneficial effects:
[0039] (1) This invention proposes an efficient and scalable concurrent learning indexing technology that implements key-value pair data storage through a fine-grained storage structure, while reducing thread access conflicts, thereby improving the concurrency of the index;
[0040] (2) The present invention introduces a learning index layer, which can speed up the index update and retraining process, alleviate the performance degradation caused by excessive data volume and long training time, and improve the access performance of the index.
[0041] (3) The tree index layer proposed in this invention makes full use of the advantages of tree index, thereby greatly reducing the cost of frequent retraining and updating of the learning index;
[0042] (4) The present invention adopts a second-stage storage layer, which utilizes the advantages of Bloom filters to improve the read and write performance of the index and ensure the scalability of the index;
[0043] (5) The hybrid learning index structure proposed in this invention reduces the frequency of index retraining and solves the problem of insufficient memory utilization caused by fine granularity, thus improving the overall performance of the index in the memory system.
[0044] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments, but the present invention is not limited to the embodiments.
Attached Figure Description
[0045] Figure 1 is a schematic diagram of an indexing system according to an embodiment of the present invention;
[0046] Figure 2 is a schematic diagram of the learning index layer in an embodiment of the present invention;
[0047] Figure 3 is a schematic diagram of the second-stage storage layer according to an embodiment of the present invention;
[0048] Figure 4 is a schematic diagram of a tree index layer according to an embodiment of the present invention;
[0049] Figure 5 shows the experimental results of different indexes' performance on the YCSB dataset;
[0050] Figure 6 shows the experimental results of different indexes under multi-threaded concurrent insertion operations;
[0051] Figure 7 shows the experimental results of different indexes under frequent read and write operations with skewed ranges;
[0052] Figure 8 shows the experimental results of different indexes' space utilization under insertion operations.
Detailed Implementation
[0053] Figure 1 is a schematic diagram of an indexing system according to an embodiment of the present invention. The system comprises:
[0054] The learning index layer is built and trained using key-value pair data, and the learning index model is periodically updated using newly written key-value pair data.
[0055] The second stage storage layer stores key-value pair data that has been trained on the index model, as well as caches newly written key-value pair data.
[0056] The tree-structured index layer generates and updates index nodes based on newly written key-value pairs, and periodically returns index nodes to the learning index layer to update the learning index model.
[0057] Figure 2 is a schematic diagram of the learning index layer according to an embodiment of the present invention. The steps for constructing the learning index model in the learning index layer are as follows:
[0058] The initial key-value pair data is sorted and grouped to form multiple key arrays. Each key array creates a storage node. A key linear model is constructed for each key array to find the storage location of the key-value pair data in the key array, achieving precise location. Specifically, in this embodiment, the key linear model is fitted to each key array using the least squares method.
[0059] Select the maximum and minimum values of the keys in each key array to form a key index array;
[0060] Based on the key index array, a learning index model is constructed and trained using machine learning to obtain a trained learning index model, which is used to search the key array and achieve fuzzy positioning. Specifically, this embodiment uses a linear interpolation method to fit the learning index model. The training process is as follows: calculate the deviation between the fitting result of each key value in the model and the actual storage location. When the error value exceeds the maximum error value, the linear model is trained and its slope and intercept are stored. Calculate the deviation of each key value in different primary learning models, perform error analysis, and recursively construct multi-level models until they are unified into a single model, which serves as the learning index model. Specifically, count the proportion of key values with smaller actual error values in the multi-level models to the total number of key values. When the proportion exceeds a threshold, set the smaller actual error value as the prediction error value of the model. A greedy algorithm is used to recursively construct linear models and unify them into a single learning index model, where the greedy algorithm fits the relationship between key values and array storage locations.
[0061] The system periodically receives index nodes from the tree index layer, updates the trained learning index model, merges the key array with the index nodes and groups them to form a new key array, and retrains the learning index model. Specifically, a background thread is built to asynchronously obtain the index nodes of the tree index layer that are full and merge them with the trained key index array. The learning model is reconstructed in a recursive manner, and then the old learning index and key index array are replaced using the RCU locking mechanism to achieve asynchronous and non-blocking updates of the learning index.
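To make the error-bounded fitting in the construction steps more concrete, here is a minimal single-level Python sketch. It assumes a greedy rule that extends a segment for as long as the line interpolated between the segment's two end keys predicts every position in the segment to within `max_error`; the patent does not spell out its exact rule, and the recursive multi-level construction is omitted. The function names and the `effective_error` helper, which mirrors the proportion-versus-threshold test for adopting a smaller error value, are hypothetical.

```python
from bisect import bisect_right


def fits(keys, start, end, max_error):
    """True if the line through the end points predicts every position in
    keys[start..end] to within max_error slots."""
    slope = (end - start) / (keys[end] - keys[start])
    return all(
        abs(start + slope * (keys[i] - keys[start]) - i) <= max_error
        for i in range(start, end + 1)
    )


def greedy_segments(keys, max_error):
    """Greedily grow each linear segment until the error bound would break.

    keys must be sorted and distinct. Returns (first_key, slope, intercept).
    """
    segments, start = [], 0
    while start < len(keys):
        end = start
        while end + 1 < len(keys) and fits(keys, start, end + 1, max_error):
            end += 1
        slope = (end - start) / (keys[end] - keys[start]) if end > start else 0.0
        segments.append((keys[start], slope, start - slope * keys[start]))
        start = end + 1
    return segments


def predict(segments, key):
    """Pick the segment whose first key is <= key and evaluate its line."""
    firsts = [s[0] for s in segments]
    i = max(bisect_right(firsts, key) - 1, 0)
    _, slope, intercept = segments[i]
    return round(slope * key + intercept)


def effective_error(errors, max_error, smaller_error, threshold):
    """Adopt the smaller bound when enough keys already satisfy it."""
    share = sum(e <= smaller_error for e in errors) / len(errors)
    return smaller_error if share > threshold else max_error
```

The sketch rechecks the whole segment on every extension, which is quadratic in the segment length; it is written for readability rather than speed.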
[0062] Figure 3 is a schematic diagram of the second-stage storage layer according to an embodiment of the present invention. The layer includes:
[0063] A Bloom filter is used to determine whether key-value pair data exists in the first-stage storage.
[0064] The first-stage storage is used to cache newly written key-value pair data; the first-stage storage consists of a root array and a subarray, the subarray stores key-value pair data, and the root array stores the minimum value of the subarray;
[0065] The second stage of storage is used to store the trained key-value pair array and the corresponding key linear model, where the key linear model is represented as pos = key * k + b, k is the slope, b is the intercept, key is the key value, and pos represents the storage location.
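The key linear model pos = key * k + b can be illustrated with a short Python sketch that fits k and b by least squares (the method this embodiment names for the key linear model) and then corrects the prediction with a bounded local search. The function names and the way the search window is derived from the largest fitting error are assumptions for the example.

```python
import math
from bisect import bisect_left


def fit_key_linear_model(keys):
    """Least-squares fit of position against key; returns (k, b)."""
    n = len(keys)
    mean_key, mean_pos = sum(keys) / n, (n - 1) / 2
    var = sum((key - mean_key) ** 2 for key in keys)
    if var == 0:
        return 0.0, mean_pos
    cov = sum((key - mean_key) * (pos - mean_pos) for pos, key in enumerate(keys))
    k = cov / var
    return k, mean_pos - k * mean_key


def error_bound(keys, k, b):
    """Largest fitting error in slots, plus one slot of slack for rounding."""
    return math.ceil(max(abs(key * k + b - pos) for pos, key in enumerate(keys))) + 1


def lookup(keys, k, b, err, key):
    """Predict pos = key * k + b, then binary-search only within +/- err."""
    guess = round(key * k + b)
    lo = min(max(0, guess - err), len(keys))
    hi = max(lo, min(len(keys), guess + err + 1))
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None
```

For example, after `k, b = fit_key_linear_model(keys)` and `err = error_bound(keys, k, b)`, calling `lookup(keys, k, b, err, key)` returns the position of `key` in the sorted key array, or `None` if the key is absent.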
[0066] Specifically, the steps for implementing write or delete functions in the second-stage storage layer are as follows:
[0067] The storage node for the key-value pair to be written or deleted is located through the learning index model in the learning index layer and the index nodes in the tree index layer.
[0068] The hash functions in the Bloom filter of the storage node are used to calculate the positions of the key in the Bloom filter, and the corresponding bits in the Bloom filter are modified. The content of these bits indicates whether the key-value data is recorded in the Bloom filter. At the same time, the key-value data is written to the first-stage storage, carrying a flag bit of 1 for write operations and a flag bit of 0 for delete operations. Figure 3 shows the results of writing 23 and deleting 5.
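A minimal Python sketch of this write and delete path is given below. The Bloom filter sizing, the SHA-256-based hash positions, and the class and constant names are illustrative assumptions; the sketch also assumes that a delete sets the Bloom filter bits just as a write does, because the delete record itself lives in the first-stage storage.

```python
import hashlib

FLAG_WRITE, FLAG_DELETE = 1, 0


class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


class StorageNode:
    def __init__(self):
        self.bloom = BloomFilter()
        self.first_stage = []  # buffered (key, value, flag) records

    def write(self, key, value):
        self.bloom.add(key)
        self.first_stage.append((key, value, FLAG_WRITE))

    def delete(self, key):
        self.bloom.add(key)  # the delete record is buffered too
        self.first_stage.append((key, None, FLAG_DELETE))
```

Writing 23 and deleting 5, as in Figure 3, would leave two records in `first_stage`: one with flag 1 and one with flag 0.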
[0069] Specifically, the steps for implementing query or update functions in the second-stage storage layer are as follows:
[0070] The storage node for the key-value pair data to be queried or updated is located through the learning index model in the learning index layer and the index nodes in the tree index layer.
[0071] The Bloom filter in the storage node is used to determine whether the key-value pair exists in the first-stage storage. If the Bloom filter reports that it exists, the first-stage storage is searched first and, if the key-value pair is found there, it is queried or updated directly in the cache; if it is not found there, the operation enters the second-stage storage and queries or updates the trained key-value pair array through the key linear model. If the Bloom filter reports that it does not exist, the operation enters the second-stage storage directly and queries or updates the trained key-value pair array through the key linear model.
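The query path can be sketched in Python as follows. The sketch is illustrative only: `TinyBloom` is a toy filter over integer keys, a plain binary search stands in for the key linear model in the second-stage storage, and treating a buffered delete record (flag 0) as "key absent" is an assumption about how the flag is interpreted.

```python
from bisect import bisect_left

FLAG_WRITE, FLAG_DELETE = 1, 0


class TinyBloom:
    """Minimal Bloom filter over integer keys (illustrative sizing only)."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        return [hash((seed, key)) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class StorageNode:
    def __init__(self, trained_pairs):
        self.keys = [k for k, _ in trained_pairs]  # second-stage, sorted
        self.values = [v for _, v in trained_pairs]
        self.first_stage = []                      # (key, value, flag) records
        self.bloom = TinyBloom()

    def buffer(self, key, value, flag):
        self.bloom.add(key)
        self.first_stage.append((key, value, flag))

    def query(self, key):
        if self.bloom.might_contain(key):
            for k, v, flag in reversed(self.first_stage):  # newest record wins
                if k == key:
                    return v if flag == FLAG_WRITE else None
        # Bloom miss or false positive: fall through to second-stage storage.
        i = bisect_left(self.keys, key)  # stand-in for the key linear model
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```

A Bloom filter can report false positives but not false negatives, so the fall-through after an unsuccessful first-stage scan is what keeps the lookup correct.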
[0072] Figure 4 is a schematic diagram of a tree index layer according to an embodiment of the present invention. The tree index layer generates and updates index nodes based on newly written key values. The steps are as follows:
[0073] S401, construct an empty array as the index node;
[0074] S402, when the cache of a certain storage node in the second stage storage layer is full, the key-value pairs of the first stage storage and the second stage storage in the storage node are merged and grouped in an orderly manner to obtain two key arrays, and two storage nodes corresponding to the two key arrays are created.
[0075] S403, construct a key linear model for each of the two key arrays;
[0076] S404: Select the maximum and minimum values of the keys in the two key arrays respectively and write them into the index node;
[0077] S405: When the cache of a storage node in the second-stage storage layer is full, repeat S402 to S405.
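Steps S402 to S404 can be sketched in Python as below. This is a simplified illustration under several assumptions: buffered records are replayed in arrival order with flag 1 meaning write and flag 0 meaning delete, the merged data is cut at the midpoint, at least two key-value pairs survive the merge, and each storage node is represented by a plain dictionary. None of these details are fixed by the patent text.

```python
def fit_line(keys):
    """Least-squares key linear model; returns (slope, intercept)."""
    n = len(keys)
    mean_key, mean_pos = sum(keys) / n, (n - 1) / 2
    var = sum((key - mean_key) ** 2 for key in keys)
    if var == 0:
        return 0.0, mean_pos
    slope = sum((key - mean_key) * (pos - mean_pos) for pos, key in enumerate(keys)) / var
    return slope, mean_pos - slope * mean_key


def split_storage_node(first_stage, second_stage):
    """Merge both stages in key order and split them into two storage nodes.

    first_stage: list of (key, value, flag); second_stage: sorted (key, value).
    """
    merged = dict(second_stage)
    for key, value, flag in first_stage:  # replay the buffer in arrival order
        if flag == 1:
            merged[key] = value
        else:
            merged.pop(key, None)
    pairs = sorted(merged.items())
    mid = len(pairs) // 2
    nodes = []
    for half in (pairs[:mid], pairs[mid:]):
        keys = [key for key, _ in half]
        nodes.append({"pairs": half, "model": fit_line(keys),
                      "min_key": keys[0], "max_key": keys[-1]})
    return nodes


def write_to_index_node(index_node, nodes):
    """S404: record each new node's minimum and maximum key in the index node."""
    for node in nodes:
        index_node.append((node["min_key"], node["max_key"], node))
    return index_node
```

Here the index node from S401 is modeled as an initially empty Python list that `write_to_index_node` appends to.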
[0078] The size of index nodes in the tree index layer is limited. When an index node reaches the limit, one of the storage nodes in the index node is replaced with an index pointing to a new index node. The storage node is first stored in the new index node, and then the new storage node is written to the new index node.
[0079] Specifically, when the cache of a storage node is full, the storage node splits into two new storage nodes. As shown in the figure, if storage node 2 is full, it will split into storage node 3 and storage node 4. The key6 and key7 pointing to storage node 3 and storage node 4 will replace the key5 pointing to storage node 2.
[0080] Specifically, when the array of index nodes is full, the index nodes are updated. As shown in the figure, the index node containing key1 that points to storage node 1 is full, so it is split into two index nodes. Storage node 1 is split into storage node 5 and storage node 6. The key1 of the original index node points to the new index node, and the key1 and key2 of the new index node point to storage node 5 and storage node 6 respectively.
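The two cases in the preceding paragraphs (a split under an index node that still has room, and a split under an index node that is already full) can be sketched in Python as follows. The `IndexNode` class, the `capacity` field, and the `(min_key, max_key, child)` entry layout are assumptions for this illustration, and the capacity is assumed to be at least 2.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    capacity: int
    entries: list = field(default_factory=list)  # (min_key, max_key, child)


def replace_after_split(node, slot, left, right):
    """Install the two entries produced by splitting node.entries[slot].

    If the index node has room, the two new keys replace the old key in place.
    If it is full, the slot is redirected to a new child index node that
    holds the two new entries.
    """
    if len(node.entries) < node.capacity:
        node.entries[slot:slot + 1] = [left, right]
    else:
        child = IndexNode(node.capacity, [left, right])
        node.entries[slot] = (left[0], right[1], child)
    return node
```

With a capacity of 3, for instance, the first split grows the node from two entries to three, and the next split creates a child index node instead of growing the full node further.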
[0081] Specifically, the learning index layer, the second-stage storage layer, and the tree-structured index layer all have index concurrency modules to ensure the normal implementation of multi-threaded concurrent index operations.
[0082] In the learning index layer, the index is only used for querying, and background threads are combined with RCU (Read-Copy-Update) locks to achieve non-blocking concurrent updates;
[0083] In the tree-structured index layer, a mutex lock is set for index nodes that are not full. When a thread accesses such a node, it acquires the lock, and other threads that want to access the node must wait. When the accessing thread finishes its operation, it releases the mutex lock. For full index nodes, no mutex lock is set, allowing non-blocking concurrent access.
[0084] In the second-stage storage layer, a mutex lock is used for accessing the cached storage index, and the same lock gives the holding thread exclusive access to read and modify the Bloom filter. The trained key-value pair array is accessed read-only, so non-blocking concurrent access to it is possible. When storage nodes are split, thread-safe access under concurrent conditions is achieved through the mutex locks in the tree-structured index layer.
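The division of labor between the non-blocking model update and the mutex-protected buffer can be sketched in Python. Python has no RCU primitive, so this example only imitates the read-copy-update idea by building a new immutable snapshot off to the side and publishing it with a single reference assignment; the class names `IndexHandle` and `BufferedNode` are hypothetical.

```python
import threading


class IndexHandle:
    """Readers take the current snapshot without locking; an updater builds a
    replacement separately and publishes it in one reference assignment."""

    def __init__(self, snapshot):
        self._snapshot = snapshot              # treated as immutable
        self._writer_lock = threading.Lock()   # serializes updaters only

    def read(self):
        return self._snapshot

    def publish(self, build_new):
        with self._writer_lock:
            new_snapshot = build_new(self._snapshot)  # copy and update
            self._snapshot = new_snapshot             # readers now see the new one


class BufferedNode:
    """First-stage storage guarded by a per-node mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self.buffer = []

    def insert(self, key, value):
        with self._lock:  # only one thread modifies this node at a time
            self.buffer.append((key, value))
```

A reader that obtained the old snapshot before `publish` finished keeps using it safely, which is the property the RCU mechanism is used for here; reclaiming old snapshots is left to Python's garbage collector.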
[0085] To test the performance of the indexing system of this invention, six datasets from the YCSB test suite and the Lognormal synthetic dataset from the SOSD test suite were used for test experiments. The indexing schemes used in the experiments included: (1) Learned index + Delta-Buffer, i.e. LI + Δ, which consists of a learned index and an incremental index. The learned index is an RMI recursive model, and the incremental index is a Masstree, which is used to cache all written data. In the experiment, a two-stage RMI implementation was used; (2) XIndex, which divides the key-value pair data into multiple groups. Each group uses a cache for writing data. The least squares method is used to linearly fit the key-value pair data of each group, and a two-stage merging method is used to realize the asynchronous update of the index, effectively handling concurrent write operations; (3) Finedex, which adopts a fine-grained caching method and proposes a step-by-step merging and retraining strategy to realize non-blocking index updates, reducing conflicts between threads and improving index concurrency; (4) SpaceIndex, which represents the combination of the modules of this invention. For each concurrent index, CRUD operations were tested using an existing public dataset. The training error value of the learning model used in the experiment was set to 32. Except for the multi-threaded concurrent experiment, all other experiments used 24 threads to run all schemes to evaluate concurrent performance. The factors that affect index performance considered in the test experiments mainly include mixed operations, number of threads, operations with skewed ranges, and space utilization.
[0086] Figure 5 shows the experimental results of different indexes' performance on the YCSB dataset. Six test datasets were used to evaluate the throughput of this invention and the comparison indexes under different mixed operations. YCSB_A, YCSB_B, YCSB_C, YCSB_D, YCSB_E, and YCSB_F represent update-heavy, read-mostly, read-only, read-latest (reading recently inserted data), short-range-lookup, and read-modify-write workloads, respectively. Both the number of loaded records and the number of operations were 10,000,000. To better demonstrate update performance, the data was loaded so that the number of keys used to build the learning index and the number of inserted keys were in a 1:1 ratio. As shown in Figure 5, compared to LI+Δ, XIndex, and Finedex, SpaceIndex improves throughput by an average of 26.9%, 32.3%, and 14.6%, respectively. In some benchmarks, however, Finedex performs better: on the read-only and read-latest datasets, it benefits from building its learned index directly on key-value pairs and from experiencing fewer thread conflicts during multi-threaded reads.
[0087] Figure 6 shows the experimental results of different indexes under multi-threaded concurrent insertion. This test used the Lognormal synthetic dataset. The initial learning index data volume and the number of insertion operations were set to a 1:1 ratio, both being 10,000,000. The test evaluated the insertion performance of this invention and the comparison indexes under different thread counts. As shown in Figure 6, the experiment measured the throughput of index insertion with 1, 8, 16, 24, 32, and 40 threads. Compared with LI+Δ, XIndex, and Finedex, SpaceIndex improved throughput by 139%, 36.9%, and 22.7%, respectively. When the number of threads is small, Finedex performs better because its fine-grained caching strategy results in fewer conflicts, whereas SpaceIndex experiences more thread conflicts, which have a greater impact on performance. When the number of threads is large, SpaceIndex performs better because its insertion operation does not need to access the trained key-value pair array, while Finedex still does, and this array access becomes a key factor limiting performance.
[0088] Figure 7 shows the experimental results of different indexes under frequent read/write operations with skewed ranges. This test used a dataset with a read/write ratio of 1:1, loaded so that the number of keys used to build the learning index and the number of inserted keys were in a 1:1 ratio. The test evaluates the throughput of this invention and the comparison indexes under skewed access. In Figure 7, the horizontal axis represents the percentage of accessed key-value pairs that fall within the key range of the initial data. Because the SpaceIndex index structure updates through node splitting, which keeps key-value pairs stored at the lowest level, it avoids repeated queries across data storage layers and allows faster index rebuilding, giving SpaceIndex a performance advantage. Compared to LI+Δ, XIndex, and Finedex, SpaceIndex improves performance by 17.4%, 20.5%, and 23.7%, respectively.
[0089] Figure 8 shows the experimental results of different indexes' space utilization under insertion operations. This test used the Lognormal synthetic dataset to evaluate the space utilization of this invention and the comparison indexes under different amounts of inserted data. In Figure 8, the insertion factor refers to the ratio of the number of insertions to the number of trained learning models. Compared with LI+Δ, XIndex, and Finedex, SpaceIndex improves space utilization by an average of 7.6%, 7.6%, and 38.0%, respectively. LI+Δ and XIndex use MassTree as a cache, so their space utilization is close to 67.0%, while Finedex uses a key-value-level fine-grained cache, which is prone to poor space utilization. The node merging and splitting operations of SpaceIndex make its space utilization more efficient.
[0090] This invention proposes a highly efficient and scalable concurrent learning index system based on the strong correlation between data keys and storage locations. Compared to existing concurrent learning index schemes, this invention solves the problems of wasted index storage space and performance degradation caused by untimely learning index updates. This invention optimizes the design of learning index retraining and buffer structures. By reducing the frequency of retraining, efficient learning index updates are achieved. Simultaneously, the efficiency of allocated space utilization is optimized, improving concurrent access performance. Through these innovations, this invention effectively enhances the performance of learning indexes, enabling efficient data access and updates.
[0091] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An efficient and scalable concurrent learning indexing system, characterized in that it comprises: a learning index layer, which builds and trains a learning index model using key-value pair data and periodically updates the learning index model using newly written key-value pair data; a second-stage storage layer, which stores the key-value pair data used to train the learning index model and caches newly written key-value pair data; and a tree-structured index layer, which generates and updates index nodes based on newly written key-value pair data and periodically returns the index nodes to the learning index layer to update the learning index model; wherein the learning index layer constructs a learning index model for fuzzy positioning and a key linear model for precise positioning, with the following construction steps: the initial key-value pair data is sorted and grouped to form multiple key arrays, and a storage node is created for each key array; a key linear model is constructed for each key array to find the storage location of key-value pair data within that key array, achieving precise positioning; the maximum and minimum key values of each key array are selected to form a key index array; based on the key index array, a learning index model is built and trained using machine learning to obtain a trained learning index model, which is used to search for the key array, achieving fuzzy positioning; and index nodes are periodically received from the tree-structured index layer and the trained learning index model is updated; wherein the periodic reception of index nodes from the tree-structured index layer and the updating of the trained learning index model are specifically as follows: the learning index layer periodically receives the index nodes returned by the tree-structured index layer, merges the index nodes with the original key index array of the learning index layer in sorted order to form a new array, and uses the new array to rebuild and train a new learning index model, which replaces the original learning index model.
2. The efficient and scalable concurrent learning indexing system according to claim 1, characterized in that the second-stage storage layer includes: first-stage storage, used to cache newly written key-value pair data, the first-stage storage consisting of a root array and a subarray, the subarray storing key-value pair data and the root array storing the maximum and minimum values of the subarray keys; second-stage storage, used to store the key-value pair data on which the learning index model and the corresponding key linear model have been trained; and a Bloom filter, used to determine whether key-value pair data exists in the cache of the first-stage storage.
3. The efficient and scalable concurrent learning indexing system according to claim 2, characterized in that the tree-structured index layer generates and updates index nodes based on the newly written key-value pair data according to the following steps: an empty array is constructed to serve as the index node; when the cache of a storage node in the second-stage storage layer is full, the key-value pair data in the first-stage storage and the second-stage storage of that storage node is merged in sorted order and grouped to obtain two key arrays, and two storage nodes corresponding to the two key arrays are created; a key linear model is constructed for each of the two key arrays; and the maximum and minimum key values of each of the two key arrays are selected and written into the index node.
4. The efficient and scalable concurrent learning indexing system according to claim 3, characterized in that the size of an index node in the tree-structured index layer is limited; when an index node reaches the limit, one of the storage nodes in the index node is replaced with an index pointing to a new index node, the replaced storage node is first stored in the new index node, and the new storage node is then written to the new index node.
5. The efficient and scalable concurrent learning indexing system according to claim 4, characterized in that the concurrent learning indexing system is used to implement write or delete functions, specifically as follows: the learning index layer uses the learning index model to find the key array containing the key-value pair data; the second-stage storage layer finds the Bloom filter in the corresponding storage node and modifies the bit corresponding to the key in the Bloom filter, the content of the bit indicating whether the key-value pair data exists in the Bloom filter; and the key-value pair data is stored in the first-stage storage of that storage node; wherein written key-value pair data carries a flag bit of 1, and deleted key-value pair data carries a flag bit of 0.
6. The efficient and scalable concurrent learning indexing system according to claim 5, characterized in that the concurrent learning indexing system is used to implement query or update functions, specifically as follows: the learning index layer uses the learning index model to find the key array containing the key-value pair data; the second-stage storage layer uses the Bloom filter in the corresponding storage node to determine whether the key-value pair data exists in the first-stage storage; if the Bloom filter determines that the key-value pair data exists, the first-stage storage is queried first, and if the key-value pair data is found there, the query result is returned or the data is updated, and if it is not found, the second-stage storage is entered, the key linear model is used to precisely locate the key-value pair data in the key array, and the key-value pair data is then queried or updated; if the Bloom filter determines that the key-value pair data does not exist, the second-stage storage is entered directly, the key linear model is used to precisely locate the storage location of the key-value pair data in the key array, and the key-value pair data is then queried or updated.
7. The efficient and scalable concurrent learning indexing system according to claim 5 or 6, characterized in that the learning index layer using the learning index model to find the key array containing the key-value pair data is specifically as follows: a storage node is found through the learning index model; if the storage node points to the key index array, the key array corresponding to the storage node is found; if the storage node points to an index node, the search for a storage node continues using that index node; if the storage node found still points to an index node, the search continues using that index node until a storage node pointing to the key index array is found, and the key array corresponding to that storage node is then found.
8. The efficient and scalable concurrent learning indexing system according to claim 1, characterized in that the learning index layer, the second-stage storage layer, and the tree-structured index layer each have an index concurrency module to ensure the normal implementation of multi-threaded concurrent index operations; specifically, the learning index layer sets an RCU lock to prevent threads from accessing the learning index model while it is being updated; in the tree-structured index layer, a mutex lock is set on index nodes that are not yet full; in the second-stage storage layer, a mutex lock is set in the first-stage storage; and the mutex lock allows only one thread to access the storage node.
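As a non-limiting illustration of the storage-node behaviour recited in claims 2, 3, 5, and 6, the write, delete, query, and split paths might be sketched as follows. This is a single-threaded sketch under stated assumptions, not the claimed implementation: the class names `BloomFilter` and `StorageNode`, the method names, and the parameters `num_bits`, `num_hashes`, and `cache_capacity` are hypothetical; a dictionary stands in for the root array and subarray of the first-stage storage; a binary search stands in for the key linear model; and the locks of claim 8 are omitted.

```python
import bisect
import hashlib

class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

class StorageNode:
    def __init__(self, pairs, cache_capacity=4):
        pairs = sorted(pairs)
        self.keys = [k for k, _ in pairs]    # second-stage storage (trained data)
        self.values = [v for _, v in pairs]
        self.cache = {}                      # first-stage storage: key -> (flag, value)
        self.cache_capacity = cache_capacity
        self.bloom = BloomFilter()

    def write(self, key, value):
        self.bloom.add(key)
        self.cache[key] = (1, value)         # flag bit 1 marks a write

    def delete(self, key):
        self.bloom.add(key)
        self.cache[key] = (0, None)          # flag bit 0 marks a delete

    def query(self, key):
        # Check the first-stage storage only when the Bloom filter says "maybe".
        if self.bloom.might_contain(key) and key in self.cache:
            flag, value = self.cache[key]
            return value if flag == 1 else None
        # Otherwise fall through to the second-stage storage.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def is_full(self):
        return len(self.cache) >= self.cache_capacity

    def split(self):
        """Merge both stages in key order and regroup into two storage nodes."""
        merged = dict(zip(self.keys, self.values))
        for key, (flag, value) in self.cache.items():
            if flag == 1:
                merged[key] = value
            else:
                merged.pop(key, None)
        items = sorted(merged.items())
        mid = len(items) // 2
        nodes = [StorageNode(half, self.cache_capacity)
                 for half in (items[:mid], items[mid:]) if half]
        # (minimum key, maximum key) entries to be written into the index node.
        return nodes, [(n.keys[0], n.keys[-1]) for n in nodes]
```

A caller would check `is_full()` after each write or delete and, when it returns true, replace the node with the two nodes returned by `split()` and record the returned minimum and maximum keys in the index node.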