Large-scale vector retrieval method, system, device and storage medium

By recursively building a hierarchical index from the bottom up at a balanced partition granularity, keeping the top-level index resident in memory, storing partition data on external storage, and accessing the partitions of each level in parallel, the invention reduces cross-node access and vector access costs in large-scale vector retrieval, improving query performance and system scalability.

CN122240627A · Pending · Publication Date: 2026-06-19 · UNIV OF SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
UNIV OF SCI & TECH OF CHINA
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In large-scale vector retrieval scenarios, existing technologies struggle to balance cross-node access latency and vector access costs under the constraint of target recall rate, leading to a decline in system performance.

Method used

By recursively building a hierarchical index from bottom to top, a balanced partition granularity is determined. The top-level graph index resides in memory, while partition data is stored in persistent storage. During queries, candidate partitions are accessed in parallel layer by layer, reducing cross-node access overhead and vector read amplification.

Benefits of technology

It improves the query performance and system throughput of large-scale vector retrieval services, and enhances system scalability and resource utilization.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240627A_ABST

Abstract

This invention discloses a large-scale vector retrieval method and a corresponding system, device, and storage medium. In each of these: a balanced partition granularity is determined based on the target recall rate; recursive clustering and hierarchical index construction are then performed on the vector data at that granularity, so that the top-level graph index fits within the single-machine memory budget and resides in memory, while the partitions of each layer are stored in persistent storage; during querying, the number of candidate vectors is determined layer by layer starting from the top level, and the corresponding partitions are retrieved in parallel until the retrieval results are output. The invention balances target recall, query latency, and system throughput in large-scale vector retrieval scenarios, reduces cross-node access rounds and read amplification, and improves system scalability and resource utilization.

Description

Technical Field

[0001] This invention relates to the field of vector retrieval technology, and in particular to a large-scale vector retrieval method, system, device and storage medium.

Background Technology

[0002] Vector retrieval is a key technology in applications such as search, recommendation, advertising, image retrieval, and retrieval-augmented generation. Recall, a metric of retrieval accuracy, characterizes the extent to which search results cover the true nearest neighbors; for example, Recall@K is the fraction of the true K nearest neighbors that appear among the first K returned results. As vector datasets grow from tens of millions to billions of vectors or more, single-machine memory often cannot hold the complete index and raw vector data. Related systems therefore typically combine distributed deployment with external storage to support large-scale vector retrieval services.
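For concreteness, Recall@K as defined above can be computed as follows. This is a minimal Python sketch with illustrative function names; it assumes each list holds vector ids and that K does not exceed the length of the ground-truth list.

```python
def recall_at_k(retrieved_ids, true_neighbor_ids, k):
    """Recall@K: share of the true K nearest neighbours found among the first K results."""
    if k <= 0:
        raise ValueError("k must be positive")
    true_top_k = set(true_neighbor_ids[:k])
    found = true_top_k.intersection(retrieved_ids[:k])
    return len(found) / k


def mean_recall_at_k(results, ground_truth, k):
    """Average Recall@K over a batch of queries (one result list per query)."""
    if not results or len(results) != len(ground_truth):
        raise ValueError("need matching, non-empty batches")
    return sum(recall_at_k(r, g, k) for r, g in zip(results, ground_truth)) / len(results)
```

For example, `recall_at_k([3, 7, 1, 9], [3, 1, 4, 5], 4)` returns `0.5`, since two of the four true neighbors (3 and 1) are among the four results.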

[0003] Existing technologies mainly adopt two types of solutions. The first type is based on graph indexes. A graph index establishes adjacency relationships between vectors, so that a query progressively visits candidate vectors along the graph structure to approximate the nearest-neighbor results. In a single-machine environment, this typically yields high retrieval accuracy and good query performance. However, when a graph index is directly partitioned and deployed across multiple nodes, a large number of cross-node edges and remote accesses easily arise. Because the query process has strong data dependencies and sequential access characteristics, the next hop can often be determined only after the current candidates have been evaluated. It is therefore difficult to reduce the additional cross-node access overhead through simple prefetching or coarse-grained parallelism, and query latency increases.

[0004] The second type of solution is based on clustering: vector data is clustered, and cluster centers are used to partition the data or route query requests, reducing the number of cross-node accesses. However, a cluster center only approximately represents the vectors in its partition, and vectors near partition boundaries in particular are prone to representation errors. When the partition granularity is coarse, the query process often has to probe more partitions to reach the target recall rate, which increases vector reads, distance calculations, and I/O (input/output) overhead and degrades system performance.

[0005] Therefore, in large-scale vector retrieval scenarios, existing technologies still face the following technical problem: how to balance cross-node access latency against vector access cost under the constraint of the target recall rate, so as to improve the scalability and query performance of large-scale vector retrieval services.

[0006] In view of this, the present invention is hereby proposed.

Summary of the Invention

[0007] The purpose of this invention is to provide a large-scale vector retrieval method, system, device, and storage medium that can reduce cross-node access overhead and vector read amplification in distributed vector search under the constraint of target recall rate, thereby improving the query performance, system throughput, and scalability of large-scale vector retrieval services.

[0008] The objective of this invention is achieved through the following technical solution: A large-scale vector retrieval method, comprising: obtaining a vector dataset, a target recall rate, and a single-machine memory budget; under the constraint of the target recall rate, determining the balanced partition granularity for constructing the hierarchical index corresponding to the vector dataset; constructing the hierarchical index in a bottom-up recursive manner: determining whether the vector data of the current layer meets the single-machine memory budget; if so, building the current layer's vector data into the top-level graph index and clustering the current layer's vector data at the balanced partition granularity to generate multiple partitions; if not, clustering the current layer's vector data at the balanced partition granularity to generate multiple partitions and the centroid of each partition, taking the partition centroids as the vector data of the next layer up, and continuing to determine whether the single-machine memory budget is met, until the top-level graph index is constructed; deploying the top-level graph index in the memory of the compute nodes and writing the partitions of each layer to persistent storage, where, when the current layer is the leaf layer, its vector data is the vector dataset itself; and searching the top-level graph index with an input query vector for multiple candidate vector data, retrieving the next-level partitions in parallel based on those candidates, searching the retrieved partitions for multiple candidate vector data, and repeating this process until the leaf layer is reached, where the search results are obtained.

[0009] A large-scale vector retrieval system for implementing the aforementioned method includes: an information acquisition unit, used to acquire the vector dataset, the target recall rate, and the single-machine memory budget; a balanced partition granularity determination unit, used to determine, under the constraint of the target recall rate, the balanced partition granularity for constructing the hierarchical index corresponding to the vector dataset; a hierarchical index building unit, used to build the hierarchical index in a bottom-up recursive manner: it determines whether the vector data of the current layer meets the single-machine memory budget; if so, it builds the current layer's vector data into the top-level graph index and clusters the current layer's vector data at the balanced partition granularity to generate multiple partitions; if not, it clusters the current layer's vector data at the balanced partition granularity to generate multiple partitions and the centroid of each partition, takes the partition centroids as the vector data of the next layer up, and continues to determine whether the single-machine memory budget is met, until the top-level graph index is built; the top-level graph index is deployed in the memory of the compute node, and the partitions of each layer are written to the persistent storage medium; when the current layer is the leaf layer, its vector data is the vector dataset itself; and a query unit, used to search the top-level graph index with the input query vector for multiple candidate vector data, retrieve the next-level partitions in parallel based on those candidates, search the retrieved partitions for multiple candidate vector data, and repeat this process until the leaf layer is reached, where the retrieval results are obtained.

[0010] A processing device includes: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the aforementioned method.

[0011] A readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned method.

[0012] As can be seen from the technical solution provided by the present invention, the granularity of the balanced partition is determined based on the target recall rate; then, recursive clustering and hierarchical index construction are performed on the vector data according to the balanced partition granularity, so that the top-level graph index meets the single-machine memory budget and resides in memory, while the partitions of each layer are stored in persistent storage media; during querying, the number of candidate vectors is determined layer by layer from the top level and the corresponding partitions are obtained in parallel until the search results are output. The present invention can balance target recall rate, query latency and system throughput in large-scale vector retrieval scenarios, reduce cross-node access rounds and read amplification, and improve system scalability and resource utilization.

Attached Figure Description

[0013] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0014] Figure 1 is a flowchart of a large-scale vector retrieval method provided in an embodiment of the present invention.

[0015] Figure 2 is a schematic diagram illustrating the determination of the balanced partition granularity according to an embodiment of the present invention.

[0016] Figure 3 is a schematic diagram illustrating the determination of the balanced partition granularity and the construction of the hierarchical index according to an embodiment of the present invention.

[0017] Figure 4 is a schematic diagram of the hierarchical index structure and the parallel drill-down process in the query phase according to an embodiment of the present invention.

[0018] Figure 5 is a schematic diagram of a large-scale vector retrieval system provided in an embodiment of the present invention.

[0019] Figure 6 is a schematic diagram of a processing device provided in an embodiment of the present invention.

Detailed Implementation

[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.

[0021] First, the following explanations are provided for the terms that may be used in this article: The terms "comprising," "including," "containing," "having," or other similar semantic descriptions should be interpreted as non-exclusive inclusion. For example, including a technical feature element (such as raw material, component, ingredient, carrier, dosage form, material, size, part, component, mechanism, device, step, process, method, reaction conditions, processing conditions, parameter, algorithm, signal, data, product or article of manufacture, etc.) should be interpreted as including not only the expressly listed technical feature element, but also other technical feature elements that are not expressly listed and are well-known in the art.

[0022] The following provides a detailed description of the large-scale vector retrieval method, system, device, and storage medium provided by this invention. Contents not described in detail in the embodiments of this invention are prior art known to those skilled in the art. Where specific conditions are not specified in the embodiments, conventional conditions or conditions recommended by the manufacturer shall apply. Reagents or instruments used in the embodiments whose manufacturers are not specified are conventional commercially available products.

[0023] Example 1: As shown in Figure 1, a flowchart of a large-scale vector retrieval method provided in an embodiment of the present invention, the method mainly includes the following steps.

Step 1: Information acquisition.

[0024] In this embodiment of the invention, the information obtained mainly includes the vector dataset, the target recall rate, and the single-machine memory budget.

[0025] In this embodiment of the invention, the vector dataset can be image vectors, text vectors, web page document vectors, multimodal vectors, or other high-dimensional feature vectors.

[0026] Step 2: Determine the granularity of the balanced partition.

[0027] In this embodiment of the invention, the balanced partition granularity for constructing the hierarchical index corresponding to the vector dataset is determined under the constraint of the target recall rate. This mainly involves determining the balanced partition granularity of the leaf layer, as follows: sample vectors of a preset size are extracted from the vector dataset, and multiple candidate partition densities are set; for each candidate partition density, the sample vectors are clustered to obtain the centroids of the corresponding partitions, and a centroid index is constructed; under the constraint of the target recall rate, the corresponding cost metric is measured using the centroid index of each candidate partition density; and the optimal candidate partition density is selected based on the cost metric and used as the balanced partition granularity of the leaf layer.
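The measurement step can be sketched for a single candidate partition density as below. This is an illustrative Python sketch, not the patented implementation: it assumes plain k-means, uses brute-force distance ranking in place of the centroid index, takes the average vector access volume as the cost metric, and treats the query set and the `avg_access_at_target_recall` and `kmeans` names as hypothetical. Cross-node traversal is not modeled; the number of probed partitions is returned alongside the access volume.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, assignment


def avg_access_at_target_recall(sample, queries, density, target_recall, k=10, seed=0):
    """Cost of one candidate density: probe partitions nearest-centroid first and return
    (average vectors accessed, partitions probed) at the smallest probe count whose
    mean Recall@k reaches target_recall."""
    if not queries or not 0 < target_recall <= 1 or not 0 < k <= len(sample):
        raise ValueError("need queries, 0 < target_recall <= 1 and 0 < k <= len(sample)")
    num_parts = min(len(sample), max(1, round(density * len(sample))))
    centroids, assignment = kmeans(sample, num_parts, seed=seed)
    members = [[] for _ in range(num_parts)]
    for vid, part in enumerate(assignment):
        members[part].append(vid)

    def rank(q, ids):  # ids sorted by distance to q, ties broken by id
        return sorted(ids, key=lambda v: (math.dist(q, sample[v]), v))

    truth = [set(rank(q, range(len(sample)))[:k]) for q in queries]
    # Brute-force centroid ordering stands in for the centroid index.
    orders = [sorted(range(num_parts), key=lambda c: math.dist(q, centroids[c])) for q in queries]
    for nprobe in range(1, num_parts + 1):
        recall_sum, access_sum = 0.0, 0
        for q, gt, order in zip(queries, truth, orders):
            scanned = [v for c in order[:nprobe] for v in members[c]]
            recall_sum += len(gt.intersection(rank(q, scanned)[:k])) / k
            access_sum += len(scanned)
        if recall_sum / len(queries) >= target_recall:
            return access_sum / len(queries), nprobe
    raise AssertionError("unreachable: probing every partition yields recall 1.0")
```

Running it once per candidate density yields the (density, cost) pairs from which the balanced partition granularity is chosen.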

[0028] In this embodiment of the invention, partition density is used to quantify the partition granularity. Partition density is the ratio of the number of partitions in the current layer to the number of vectors in the current layer. The larger the partition density, the finer the partition; the smaller the partition density, the coarser the partition.
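Because partition density is defined as partitions per vector, a density fixes both the partition count and the average partition size of a layer. A minimal sketch; the function name and the rounding rule are assumptions:

```python
def partitions_for_density(num_vectors, density):
    """Partition count and average partition size implied by a partition density."""
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    # density = number of partitions / number of vectors in the layer
    num_partitions = max(1, round(density * num_vectors))
    return num_partitions, num_vectors / num_partitions
```

For 1,000,000 vectors at a density of 0.001, this gives 1,000 partitions of about 1,000 vectors each; lowering the density to 0.0001 gives coarser partitions, 100 of about 10,000 vectors.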

[0029] In this embodiment of the invention, the cost metric includes the average vector access volume when the target recall rate is met, or may additionally include one or more other metrics that reflect system cost, such as cross-node traversal volume, average partition access volume, average distance calculation volume, and input/output (I/O) count. Taking the average vector access volume as an example, a candidate partition density that reduces cross-node traversal cost without significantly worsening the average vector access volume can be selected as the balanced partition granularity.

[0030] In this embodiment of the invention, the clustering method can be the k-means algorithm, or other clustering algorithms can be selected. This invention does not impose any specific restrictions.

[0031] Preferably, when clustering the vector data of the current layer, for the boundary vector of each partition, the distance between it and the center of the adjacent partition is determined. If the distance to the center of an adjacent partition is less than a preset value, the corresponding boundary vector is copied to the corresponding adjacent partition.
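The boundary-vector copying rule can be sketched as follows, assuming Euclidean distance, an absolute distance threshold `max_dist` for the "preset value", and a check against every other partition center rather than an explicit list of adjacent partitions; all names are hypothetical. Each vector keeps its nearest partition as its home and is also copied into any other partition whose center lies within `max_dist`.

```python
import math


def assign_with_replication(vectors, centroids, max_dist):
    """Return, per partition, the ids of the vectors it stores (home members plus copies)."""
    partitions = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        dists = [math.dist(v, c) for c in centroids]
        home = min(range(len(centroids)), key=dists.__getitem__)
        partitions[home].append(vid)
        for j, d in enumerate(dists):
            if j != home and d < max_dist:
                partitions[j].append(vid)  # boundary vector copied into a nearby partition
    return partitions
```

Each copy costs extra storage and reads, which is the trade-off this threshold controls.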

[0032] Step 3: Construct a hierarchical index using a bottom-up recursive approach.

[0033] The main steps are as follows: (1) Determine whether the current layer vector data meets the single-machine memory budget.

[0034] (2) If so, the current layer vector data is constructed into a top-level graph index, and the current layer vector data is clustered based on the balanced partition granularity to generate multiple partitions.

[0035] This embodiment does not limit the specific algorithm for constructing the top-level graph index. For example, Vamana (Vamana graph algorithm), HNSW (Hierarchical Navigable Small World Graph), or other graph indexing algorithms suitable for approximate nearest neighbor search can be used.
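The paragraph leaves the graph algorithm open. As an illustrative stand-in only (not Vamana or HNSW, which construct their graphs far more efficiently and with better navigability), the sketch below builds an exact k-nearest-neighbor graph over the top-layer vectors and answers queries with a best-first beam search; the function names, `degree`, `beam_width`, and the entry point are assumptions.

```python
import heapq
import math


def build_knn_graph(vectors, degree):
    """Exact k-NN graph, O(n^2): an illustrative stand-in for Vamana or HNSW construction."""
    graph = []
    for i, v in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: math.dist(v, vectors[j]))
        graph.append(others[:degree])
    return graph


def beam_search(query, vectors, graph, entry, beam_width, k):
    """Best-first graph search that keeps the beam_width closest nodes seen so far."""
    def dist(i):
        return math.dist(query, vectors[i])

    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap of nodes still to expand
    beam = [(-dist(entry), entry)]      # max-heap (negated distances) of best nodes
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(beam) >= beam_width and d > -beam[0][0]:
            break  # nearest unexpanded node is farther than the worst node in a full beam
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(beam) < beam_width or dn < -beam[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(beam, (-dn, nb))
                if len(beam) > beam_width:
                    heapq.heappop(beam)  # evict the current worst node
    return [i for _, i in sorted((-nd, i) for nd, i in beam)][:k]
```

On 20 points placed on a line at x = 0..19 with `degree=4`, a search for x = 13.2 starting from node 0 returns `[13, 14, 12]`.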

[0036] (3) If not, cluster the current layer's vector data at the balanced partition granularity to generate multiple partitions and the centroid of each partition, use the partition centroids as the vector data of the next layer up, and return to process (1) to determine again whether the single-machine memory budget is met, until the top-level graph index is constructed. When the current layer is the leaf layer, its vector data is the vector dataset itself.
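Processes (1) to (3) can be sketched as a loop that clusters and shrinks the current layer until it fits. This Python sketch makes several simplifying assumptions: the memory-budget check is reduced to a maximum vector count `max_top_vectors`, every layer uses the same density, clustering is plain k-means, empty clusters are dropped, and both the extra clustering of the top layer described in process (2) and the graph construction itself are omitted. The names `build_hierarchy` and `kmeans` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, assignment


def build_hierarchy(vectors, density, max_top_vectors, seed=0):
    """Bottom-up recursive construction.

    Returns (layers, top). layers[0] is the leaf layer; each layer is a list of
    partitions, each a list of (item_id, vector) pairs. In the leaf layer item_id is
    the vector id; in higher layers it is the id of the lower-layer partition that
    the centroid represents. `top` holds the vectors that would stay in memory.
    """
    if not 0 < density < 1 or max_top_vectors < 1:
        raise ValueError("need 0 < density < 1 and max_top_vectors >= 1")
    layers = []
    current = list(enumerate(vectors))  # leaf layer: the vector dataset itself
    while len(current) > max_top_vectors:  # stand-in for the memory-budget check
        k = min(len(current) - 1, max(1, round(density * len(current))))
        centroids, assignment = kmeans([v for _, v in current], k, seed=seed)
        groups = {}
        for (item_id, vec), part in zip(current, assignment):
            groups.setdefault(part, []).append((item_id, vec))
        kept = sorted(groups)  # drop partitions that ended up empty
        layers.append([groups[j] for j in kept])  # would be written to persistent storage
        # Partition p of this layer is represented by one centroid in the layer above.
        current = [(p, centroids[j]) for p, j in enumerate(kept)]
    return layers, current
```

Item ids double as the inter-layer mapping: an item in layer i > 0 names the partition of layer i - 1 that its centroid represents, and the returned `top` list is what would be built into the top-level graph index.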

[0037] In this embodiment of the invention, when constructing a hierarchical index using a bottom-up recursive approach, if the current layer is not a leaf layer, the balanced partitioning granularity can be determined using any of the following methods: Method 1: All layers adopt a uniform balanced partitioning granularity; in this method, all layers (including leaf layers and non-leaf layers) directly adopt the balanced partitioning granularity of the leaf layer determined in step 2 above.

[0038] Method 2: Each layer determines its corresponding granularity of balanced partitioning. In this method, non-leaf layers need to redetermine their corresponding granularity of balanced partitioning. Specifically, when the current layer is not a leaf layer, the vector data of the current layer is used as the sample vector, and the granularity of balanced partitioning of the current layer is determined in the same way as the granularity of balanced partitioning of the leaf layer (the method described in step 2 above). In this method, both process (2) and process (3) above need to redetermine the granularity of balanced partitioning of the current layer.

[0039] Through the above process, a multi-layer index structure spanning from the top layer down to the bottom (leaf) layer is obtained, in which each vector of an upper layer (a centroid) corresponds one-to-one to a partition of the layer below. The top-level graph index is then deployed in the memory of the compute nodes, and the partitions of each layer are written to persistent storage. Specifically, each partition of each layer is saved to persistent storage as an independent object, data block, or storage unit. The data stored in each partition includes vector data, vector identifiers, partition identifiers, and auxiliary metadata related to query processing; the partition identifiers are generated during construction of the hierarchical index. For each partition in a lower layer, a corresponding centroid is generated, and these centroids are used as the vector data of the upper layer while retaining the partition identifiers of the lower-layer partitions they represent. In the upper layer, these vector data are clustered again to form multiple partitions, each containing multiple vector data. A mapping relationship can therefore be established between the vector data of the upper layer (i.e., the centroids of the lower-layer partitions) and the partitions of the lower layer.

[0040] For example, if the lower layer includes 100 partitions, 100 centroids corresponding one-to-one with those partitions can be generated and used as the vector data of the upper layer. If clustering at the balanced partition granularity then yields 10 upper-layer partitions and 10 corresponding upper-layer centroids, each upper-layer partition contains several of the 100 input vectors. Since each input vector corresponds one-to-one with a lower-layer partition, each upper-layer partition in turn corresponds to multiple lower-layer partitions, forming an inter-layer mapping. During a query, after the compute node obtains a candidate centroid from the upper-layer search, it can first locate the upper-layer partition to which that centroid belongs, and then determine the lower-layer sub-partitions to access from the partition identifiers associated with the input vectors contained in that upper-layer partition.
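A partition record and the inter-layer lookup described in this example might look like the following. The field names, the `lower_partitions_of` helper, and the consecutive grouping of centroids are assumptions for illustration; only the kinds of stored data (vectors, vector identifiers, partition identifiers, metadata) come from the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartitionRecord:
    """One partition stored as an independent object (field names are hypothetical)."""
    layer: int
    partition_id: int
    vector_ids: list
    vectors: list
    # Non-leaf layers only: child_partition_ids[i] is the lower-layer partition
    # represented by vectors[i], which is that partition's centroid.
    child_partition_ids: Optional[list] = None
    metadata: dict = field(default_factory=dict)


def lower_partitions_of(record):
    """Inter-layer mapping: lower-layer partitions reachable from an upper-layer partition."""
    if record.child_partition_ids is None:
        raise ValueError("leaf partitions have no lower layer")
    return list(record.child_partition_ids)


# The 100 -> 10 example: 100 lower-layer centroids grouped into 10 upper-layer partitions.
# Consecutive grouping is used only for illustration; real grouping comes from clustering.
upper_layer = [
    PartitionRecord(
        layer=1,
        partition_id=p,
        vector_ids=list(range(10 * p, 10 * p + 10)),
        vectors=[(float(c),) for c in range(10 * p, 10 * p + 10)],  # placeholder centroids
        child_partition_ids=list(range(10 * p, 10 * p + 10)),
    )
    for p in range(10)
]
```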

[0041] Step 4: Execute the query.

[0042] In this embodiment of the invention, the input query vector is used to search the top-level graph index for multiple candidate vector data, each of which is the centroid of a partition in the layer below. The corresponding lower-layer partitions are retrieved in parallel based on these candidates, and multiple candidate vector data are searched within them. This process is repeated until the leaf layer is reached, where the retrieval results are obtained.

[0043] In this embodiment of the invention, during query execution, a number of candidate vector data determined by the preset query search budget is searched at each layer. The candidate vector data found at each layer are centroids of partitions in the layer below, so those partitions are retrieved in parallel through the centroid-to-partition mapping. Finally, the candidates at the leaf layer are searched and sorted, and the top K vector data are returned as the search results.
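The layer-by-layer drill-down can be sketched as below, assuming a layer layout in which each non-leaf item id names a partition of the layer beneath it and `top` holds the in-memory centroids of the highest stored layer. The sketch is hypothetical throughout: brute-force ranking replaces the top-level graph index, `fetch_partition` simulates a read from persistent storage, a single uniform `budget` is used at every non-leaf layer, and a thread pool stands in for parallel partition fetches.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fetch_partition(layers, level, partition_id):
    """Stand-in for reading one partition object from persistent storage."""
    return layers[level][partition_id]


def nearest(query, items, n):
    """The n (item_id, vector) pairs closest to query, ties broken by id."""
    return sorted(items, key=lambda it: (math.dist(query, it[1]), it[0]))[:n]


def layered_search(query, top, layers, budget, k, max_workers=8):
    """Drill down from the in-memory top layer to the leaf layer (layers[0])."""
    candidates = nearest(query, top, budget)  # brute-force stand-in for the graph index
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in range(len(layers) - 1, -1, -1):
            pids = [pid for pid, _ in candidates]
            fetched = pool.map(partial(fetch_partition, layers, level), pids)
            # Merge partitions, de-duplicating replicated boundary vectors by id.
            merged = {item_id: vec for part in fetched for item_id, vec in part}
            candidates = nearest(query, merged.items(), k if level == 0 else budget)
    return [item_id for item_id, _ in candidates]
```

Per-layer budgets could be supported by passing a list indexed by level instead of a single integer.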

[0044] In this embodiment of the invention, the query search budget is set according to the target recall rate, system resources, latency requirements, and business scenario; wherein, all layers are set to use a unified query search budget, or are set to use different query search budgets.

[0045] The main principle of the above-mentioned solution provided by the embodiments of the present invention is as follows: In a large-scale vector retrieval scenario, a balanced partition granularity is determined under the constraint of target recall rate, and a multi-layer index is recursively constructed based on the granularity, so that the top-level index resides in memory and the lower-level partitions are stored externally. During the query, candidate selection and partition access are performed in parallel, taking into account both cross-node access cost and vector access cost, thereby improving the system's query performance, throughput and scalability.

[0046] To more clearly demonstrate the technical solution and its effects provided by the present invention, the method provided by the embodiments of the present invention will be described in detail below with reference to specific examples.

[0047] I. Determining the granularity of the balanced partition.

[0048] In this embodiment of the invention, the balanced partition granularity can be a preset empirical parameter or an operating point obtained through a cost evaluation process. This invention uses the latter, as shown in Figure 2.

[0049] In this embodiment of the invention, partition granularity is used to characterize the coarseness of the current layer vector division. A finer partition granularity typically means a larger number of partitions and fewer vectors per partition; a coarser partition granularity typically means a smaller number of partitions and more vectors per partition. For ease of quantitative description, this embodiment uses candidate partition density to characterize partition granularity, where candidate partition density can be defined as the ratio of the number of current layer partitions to the number of current layer vectors. A higher candidate partition density indicates finer partitioning; a lower candidate partition density indicates coarser partitioning.

[0050] In this embodiment of the invention, to select an appropriate partition granularity under the constraint of target recall, sample vectors of a preset size can be extracted from the dataset of vectors to be indexed. The size can be flexibly set according to the dataset size, hardware resources, and evaluation time requirements; this invention does not impose any limitations on this. Subsequently, for multiple candidate partition densities, clustering (e.g., k-means clustering) is performed on the sample vectors, and a centroid index is constructed based on the centroids obtained from the clustering. Then, the cost metrics corresponding to different candidate partition densities are measured under the target recall.

[0051] In this embodiment of the invention, the cost metric may include the average vector access count when the target recall rate is met, or other metrics that reflect system cost, such as cross-node traversal count, average partition access count, average distance calculation count, and input/output (I/O) count. In some implementations, the average vector access count can be used to characterize system throughput-related overhead, and the cross-node traversal count can be used to characterize system latency-related overhead.

[0052] Specifically, when the candidate partition density is high, the partitions are finer and the cluster centers represent the vectors within their partitions well, so queries can usually locate the relevant partitions accurately. In this case the vector access volume is usually low, but the cross-node traversal volume may be large. As the candidate partition density decreases, the cross-node traversal volume decreases, but the number of vectors per partition increases and the cluster centers represent boundary vectors and locally complex distributions less well. Once the density drops below a certain threshold, the system must probe more partitions to restore the target recall rate, which causes the vector access volume to rise rapidly.

[0053] Here, taking average vector access volume and cross-node traversal volume as the cost metrics, both can be measured for multiple candidate partition densities while the target recall rate is met, and the balanced partition granularity is then determined from the measurements. For example, the average vector access volume at a fine candidate partition density can be used as a baseline. When the increase in average vector access volume at a given candidate partition density relative to the baseline does not exceed a preset threshold, and its cross-node traversal volume is relatively lower, the vector access cost at that density can be considered not to have significantly deteriorated. Further, as the partitions coarsen with decreasing candidate partition density, if the increase in average vector access volume between adjacent candidate partition densities exceeds a preset threshold, the vector access cost can be determined to have entered a significantly increasing range. Preferably, the candidate partition density immediately before this range is determined as the balanced partition granularity. This avoids both the significant read amplification caused by overly coarse partitions and the excessive cross-node access caused by overly fine partitions.
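The selection rule can be sketched as a single scan from fine to coarse. In this sketch both "increases" are taken as relative increases, the two preset thresholds are hypothetical parameters with illustrative defaults, and cross-node traversal is assumed to fall as density decreases (as described above), so it is not checked explicitly.

```python
def select_balanced_density(measurements, baseline_threshold=0.2, step_threshold=0.2):
    """measurements: (density, avg_vector_access) pairs ordered from finest to coarsest,
    each measured at the target recall. Returns the density just before the access
    cost enters the significantly increasing range."""
    if not measurements or measurements[0][1] <= 0:
        raise ValueError("need at least one measurement with positive access volume")
    baseline = measurements[0][1]
    chosen, prev_access = measurements[0]
    for density, access in measurements[1:]:
        over_baseline = (access - baseline) / baseline
        over_previous = (access - prev_access) / prev_access
        if over_baseline > baseline_threshold or over_previous > step_threshold:
            break  # vector access cost has started to rise significantly
        chosen, prev_access = density, access
    return chosen
```

With made-up measurements `[(0.01, 1000), (0.005, 1050), (0.002, 1100), (0.001, 1500), (0.0005, 3000)]` and the default thresholds, the jump from 1,100 to 1,500 accessed vectors marks the start of the steep range, so 0.002 is returned.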

[0054] Preferably, the measurement and comparison of multiple candidate partition densities can use stepwise scanning, binary search, segmented search, or other search methods. Here, "search" refers to testing multiple candidate partition densities in a preset order and measuring the corresponding cost metrics under the constraint of the target recall rate. For example, a baseline can be established with fine-grained partitions, and testing can then move gradually toward coarser partitions; when a significant increase in vector access volume is detected, the candidate granularity before that growth point is determined as the balanced partition granularity. This invention does not limit the specific search strategy.
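When each measurement is expensive, the binary-search option can be sketched as below. It assumes the average vector access volume never decreases as partitions coarsen, so the acceptable densities form a prefix of the ordered list, and it applies only the relative-to-baseline criterion; the adjacent-step criterion needs consecutive measurements and suits the stepwise scan better. Names and the default threshold are hypothetical.

```python
def coarsest_acceptable_density(densities, measure, threshold=0.2):
    """Binary search over candidate densities ordered finest -> coarsest.

    measure(d) returns the average vector access at the target recall for density d.
    """
    if not densities:
        raise ValueError("no candidate densities")
    baseline = measure(densities[0])
    lo, hi = 0, len(densities) - 1  # invariant: densities[lo] is acceptable
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if (measure(densities[mid]) - baseline) / baseline <= threshold:
            lo = mid        # acceptable: the answer is mid or coarser
        else:
            hi = mid - 1    # too costly: the answer is finer than mid
    return densities[lo]
```

Given the same five made-up measurements as a lookup table, it returns 0.002 after three calls to `measure` instead of five.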

[0055] The balanced partition granularity obtained through the above methods is the balanced partition granularity of the leaf layer. The method for determining the balanced partition granularity of other layers can be set according to actual needs. In some implementations, different layers can share the same or approximately the same balanced partition granularity. For example, in the aforementioned method one, other layers directly adopt the balanced partition granularity of the leaf layer. In other implementations, the balanced partition granularity can also be determined separately for different layers to adapt to the data scale and distribution characteristics of different levels. For example, in the aforementioned method two, except for the leaf layer, the balanced partition granularity of other layers is determined separately during the hierarchical index construction process.

[0056] Through the above methods, the present invention can determine a partition operating point that balances latency and throughput under the constraint of the target recall rate, providing a foundation for constructing the recursive hierarchical index.

[0057] II. Recursive construction of hierarchical indexes.

[0058] In this embodiment of the invention, the hierarchical index is constructed bottom-up and recursively, rather than by first fixing a tree structure and then splitting it top-down into leaves.

[0059] First, determine whether the current layer vector data meets the single-machine memory budget. The single-machine memory budget here includes not only the memory space occupied by the index structure itself, but also factors such as candidate cache, graph adjacency structure, metadata, and runtime overhead. This invention does not impose specific limitations in this regard.
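A rough footprint estimate for the budget check might look like the following. The sketch assumes float32 vectors, a fixed graph out-degree with 4-byte neighbor ids, and a flat percentage allowance for candidate cache, metadata, and runtime overhead; all parameter names and defaults are hypothetical.

```python
def estimated_layer_bytes(num_vectors, dim, bytes_per_component=4, graph_degree=32,
                          bytes_per_neighbor_id=4, overhead_percent=20):
    """Approximate memory needed to hold one layer as an in-memory graph index."""
    vector_bytes = num_vectors * dim * bytes_per_component             # raw vectors
    graph_bytes = num_vectors * graph_degree * bytes_per_neighbor_id   # adjacency lists
    # Flat allowance for candidate cache, metadata and runtime overhead.
    return (vector_bytes + graph_bytes) * (100 + overhead_percent) // 100


def meets_memory_budget(num_vectors, dim, budget_bytes, **params):
    return estimated_layer_bytes(num_vectors, dim, **params) <= budget_bytes
```

For 1,000,000 vectors of dimension 128 with the defaults, this gives (512,000,000 + 128,000,000) × 1.2 = 768,000,000 bytes, about 732 MiB, which fits a 1 GiB budget but not a 512 MB one.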

[0060] If the current layer vector data meets the single-machine memory budget, it means that the current layer can reside in the memory of the computing node as the top-level index. In this case, the top-level graph index can be directly constructed based on the current layer vector data. The top-level graph index can be a graph structure index, an approximate nearest neighbor graph index, or other index structures suitable for performing fast nearest neighbor search in memory; this invention does not limit this.

[0061] If the current layer vector data does not meet the single-machine memory budget, then clustering is performed on the current layer vectors according to the previously determined balanced partitioning granularity (this case corresponds to Method 1 above), generating multiple partitions and their centroids. The centroids of the partitions are used to represent the overall features of the corresponding partitions and serve as the input vector for the next layer. In this way, the current layer vector data is compressed into a smaller set of centroids after clustering.

[0062] In this embodiment of the invention, k-means, improved k-means, hierarchical clustering, or other clustering algorithms suitable for high-dimensional vector processing can be used to generate partitions and centroids. To improve the clustering effect, random initialization, k-means++ initialization, or initialization based on prior data distribution information can be used. This invention does not limit the specific clustering algorithm and initialization method.
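To make the clustering step concrete, the following is a minimal sketch of k-means with k-means++ initialization, one of the options named above. It uses plain Python lists for readability. A production system would use a vectorized or library implementation. The function names are illustrative only.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++: later seeds are drawn with probability proportional to D(x)^2."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # every point coincides with an existing seed
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights)[0]))
    return centroids

def kmeans(points, k, iters=20, seed=0):
    """Returns (centroids, assignment), where assignment[i] is the partition of point i."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            # An empty partition keeps its old centroid.
            new.append([sum(c) / len(members) for c in zip(*members)] if members else centroids[j])
        if new == centroids:
            break  # converged
        centroids = new
    assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
    return centroids, assign
```

Each resulting centroid is what this paragraph and the preceding one describe as the input vector for the next layer.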

[0063] Preferably, to mitigate recall loss caused by partitioning errors in vectors near partition boundaries, a copying process can be performed on the boundary vectors. Specifically, during clustering, boundary vectors that are close to the centers of multiple adjacent partitions can be identified, and these boundary vectors can be copied to adjacent candidate partitions. In this way, even if the query is directed to an adjacent partition during subsequent queries, these boundary vectors can still be accessed, thereby reducing the accuracy loss caused by cluster boundaries.
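The boundary-vector copying can be sketched as follows, using the preset distance threshold described in claim 4. Each vector is stored in its nearest partition. It is also copied into any other partition whose centroid lies within the threshold. `max_copy_dist` and the cap `max_copies` are hypothetical parameters, and the cap is an added assumption to bound storage growth.

```python
import math

def assign_with_boundary_copies(vectors, centroids, max_copy_dist, max_copies=2):
    """Returns, per partition, the ids of the vectors stored in it (including copies)."""
    partitions = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        ranked = sorted(range(len(centroids)), key=lambda j: math.dist(v, centroids[j]))
        partitions[ranked[0]].append(vid)  # home partition: nearest centroid
        copies = 0
        for j in ranked[1:]:
            if copies >= max_copies or math.dist(v, centroids[j]) >= max_copy_dist:
                break  # ranked is sorted, so every later centroid is farther still
            partitions[j].append(vid)  # boundary copy into an adjacent partition
            copies += 1
    return partitions
```

For example, with centroids at 0 and 10 and a threshold of 6, a vector at 4.9 is stored in both partitions, while vectors at 1 and 9 are stored only once.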

[0064] After generating the current layer partition, the partition can be written to persistent storage medium, and the corresponding centroid can be used as the input for the next layer. The same process continues. That is, it continues to determine whether the input for the next layer meets the single-machine memory budget; if it does not, clustering is performed again at the balanced partition granularity to generate higher-level centroids; if it does, the layer is used as the top-level index.

[0065] By repeatedly performing the above process, a multi-layered index structure from top to bottom is eventually formed. In this structure, there is a mapping relationship between the centroid of the upper layer and the partition of the lower layer. The top-level index is responsible for quickly determining the general search direction, the lower-level partitions are responsible for gradually refining the candidate range, and the leaf layers correspond to the final vector data or the finest-grained partitions.
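The bottom-up recursion of paragraphs [0059] to [0065] can be outlined as below. The sketch rests on two simplifying assumptions. First, the single-machine memory budget is expressed as a maximum vector count. Second, `partition_stub` (sort by the first coordinate, then chunk) stands in for real clustering such as k-means. The names are hypothetical. Partition p of one layer corresponds to centroid p of the layer above, which provides the inter-layer mapping.

```python
def mean(vs):
    return [sum(c) / len(vs) for c in zip(*vs)]

def partition_stub(vectors, partition_size):
    """Stand-in for clustering: sort by first coordinate, then cut fixed-size chunks."""
    order = sorted(range(len(vectors)), key=lambda i: vectors[i][0])
    return [order[s:s + partition_size] for s in range(0, len(order), partition_size)]

def build_hierarchy(vectors, memory_budget, partition_size):
    """Returns (top_level_vectors, layers); layers[0] is the leaf layer."""
    if partition_size < 2:
        raise ValueError("partition_size must be >= 2 so each layer shrinks")
    layers = []
    current = vectors  # leaf layer: the vector dataset itself
    while len(current) > memory_budget:
        groups = partition_stub(current, partition_size)
        # These partitions would be written to persistent storage.
        layers.append([{"pid": p, "members": g} for p, g in enumerate(groups)])
        # One centroid per partition becomes the next layer's input.
        current = [mean([current[i] for i in g]) for g in groups]
    return current, layers  # `current` fits the budget: build the top-level graph index on it
```

With 1,000 vectors, a budget of 50 and partitions of 10, the leaf layer yields 100 partitions. The next layer yields 10 partitions, and the 10 remaining centroids form the in-memory top level.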

[0066] Figure 3 illustrates the overall process of determining the balanced partition granularity and recursively constructing the hierarchical index, covering both Method 1 and Method 2.

[0067] Compared to top-down pre-defined hierarchical structures, the bottom-up recursive construction method of this invention has better adaptability. Since each layer is built around the same end-to-end recall target and partitioning granularity principle, it avoids recall losses in lower layers due to excessive compression of upper layers, thus better enabling stable retrieval performance in large-scale data scenarios.

[0068] III. Partitioned storage and inter-layer mapping.

[0069] In this embodiment of the invention, except for the top-level graph index which can be deployed in the memory of the computing node, each layer of partitions can be stored in a persistent storage medium, which is carried, managed, or accessed by one or more storage nodes. The persistent storage medium can be a solid-state drive, distributed object storage, distributed block storage, or other storage systems suitable for large-scale persistent data storage.

[0070] In some implementations, each partition can be stored as an independent object, an independent data block, or an independent storage unit. The partition can store the vector data it contains, vector identifiers, partition identifiers, and auxiliary metadata related to query processing. Partition identifiers can be generated by the system during the layer construction process and used to establish the mapping relationship between the upper-layer centroid and the lower-layer partitions.

[0071] Preferably, the partition identifier can be further used for distributed placement. For example, the partition identifier can be mapped to different storage nodes using a hash function.
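A partition record of the kind described in [0070], together with hash-based placement, might look like the sketch below. The `Partition` class, the identifier format such as `"L0-000001"`, and `storage_node_for` are hypothetical. A cryptographic digest is used rather than Python's built-in `hash()`, because the built-in hash of a string is randomized per process and would not give a stable placement across compute nodes.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Partition:
    partition_id: str   # generated at build time, e.g. layer + index
    vector_ids: list    # identifiers of the vectors (or next-layer partitions)
    vectors: list       # the vector data itself
    metadata: dict = field(default_factory=dict)  # auxiliary query-processing metadata

def storage_node_for(partition_id, num_nodes):
    """Map a partition identifier to a storage node with a stable hash."""
    digest = hashlib.sha256(partition_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Simple modulo placement remaps most partitions when the number of storage nodes changes. If nodes are added or removed often, consistent hashing would limit that movement.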

[0072] In this embodiment of the invention, the distance calculation can be performed by selecting an appropriate vector distance metric scheme based on actual conditions or experience, such as Euclidean distance, cosine similarity / distance, etc., and the invention does not impose any restrictions.
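For reference, the two metrics named above can be written as follows (function names are illustrative). One property is worth noting when choosing between them. For unit-normalized vectors, the squared Euclidean distance equals 2 − 2·cos(θ), so ranking by either metric gives the same nearest neighbours.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 - cosine similarity; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)  # n-argument hypot needs Python 3.8+
    if norm == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm
```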

[0073] In this embodiment of the invention, since there is a clear mapping relationship between the top-level index and the lower-level partition, the corresponding partition of the next level can be located directly based on the candidate centroid of the current level during the query process, without having to search again in the global scope, thereby reducing the overhead of switching between levels.

[0074] IV. Parallel drill-down at each level during the query phase.

[0075] After the hierarchical index is built, it can be used for query processing. As shown in Figure 4, in this embodiment of the invention the query phase is executed as a top-down, layer-by-layer parallel drill-down. In Figure 4, "visited vector" refers to the number of candidate vectors obtained at each search level L; L_N is the top layer, L_0 is the leaf layer, and L_1 to L_(N-1) are the intermediate layers. The solid arrows illustrate the search path, which starts from the top-level graph index, while the dashed arrows in the middle indicate that the search process of the intermediate layers has been omitted.

[0076] For any received query vector, the computing node first performs an approximate nearest neighbor search in the top-level graph index to obtain the identifiers of the top m most relevant candidate vectors in the current layer (each of which is the centroid of a partition in the next layer). Here, m is a preset query search budget that controls the number of candidate vectors expanded at each layer. The value of m can be set according to the target recall rate, system resources, latency requirements, and business scenarios, and this invention does not limit it.

[0077] After the identifiers of the top m candidate vectors in the current layer are obtained, the m corresponding partitions of the next layer can be fetched in a batch and in parallel using the partition identifiers mapped to these candidate vector identifiers. Unlike tree structures that drill down along a single path, this embodiment of the invention allows multiple candidate partitions to be expanded simultaneously at each layer, thereby improving coverage of the distribution of the true nearest neighbors in high-dimensional space.

[0078] Specifically, the storage node locates the next-level partition based on the partition identifier, reads the corresponding partition data from the persistent storage medium, performs distance calculation and candidate sorting within the partition locally, obtains local candidate results, and sends them to the computing node. The local candidate results include candidate vector identifiers and corresponding distances.
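The storage-node side of this step can be sketched as an exact scan of one partition that returns only the local top-k as (distance, identifier) pairs. Only this small result, rather than the raw vectors, then crosses the network. The dict layout and `local_search` name are assumptions.

```python
import heapq
import math

def local_search(partition, query, k):
    """Score every vector in one partition and return the local top-k, ascending."""
    scored = ((math.dist(query, vec), vid)
              for vid, vec in zip(partition["vector_ids"], partition["vectors"]))
    return heapq.nsmallest(k, scored)  # [(distance, vector_id), ...]
```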

[0079] After receiving the local candidate results, the computing nodes merge and filter them to obtain the candidate vector data for the next layer. The computing and storage nodes continue to repeat the above process until the leaf layer, where distance calculation and candidate sorting are finally performed, and the top K vector data are returned as the retrieval results. In some implementations, different layers can use a unified query search budget, that is, each layer performs parallel expansion according to the same number of candidate vectors. The advantage of this is that it can simplify parameter configuration and system deployment complexity, and avoid tuning parameters separately for different layers. In other implementations, different search budgets can be used for different layers to adapt to the data scale and fidelity requirements of different layers. This invention does not limit this.
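The compute-node side of the drill-down, with a unified search budget m for non-leaf layers, is outlined below. The following are assumptions for illustration: a brute-force scan of the top level stands in for the graph index search, a local dict stands in for remote storage nodes, and a thread pool stands in for parallel network requests. At non-leaf layers, a candidate's identifier is taken to be the identifier of the next-layer partition it represents.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

def local_search(partition, query, k):
    """Storage-node side: exact scan of one partition, local top-k as (distance, id)."""
    scored = ((math.dist(query, v), vid)
              for vid, v in zip(partition["vector_ids"], partition["vectors"]))
    return heapq.nsmallest(k, scored)

def drill_down(query, top_level, storage, num_lower_layers, m, top_k, pool):
    """Top-down, layer-by-layer parallel drill-down; returns [(distance, id), ...]."""
    if num_lower_layers < 1:
        raise ValueError("expects at least one layer of partitions below the top level")
    # top_level: in-memory list of (partition_id, centroid).
    candidates = [pid for _, pid in
                  heapq.nsmallest(m, ((math.dist(query, c), pid) for pid, c in top_level))]
    for depth in range(num_lower_layers):
        is_leaf = depth == num_lower_layers - 1
        k = top_k if is_leaf else m
        # One batched round per layer: all candidate partitions are searched in parallel.
        local = pool.map(lambda pid: local_search(storage[pid], query, k), candidates)
        merged = heapq.nsmallest(k, (hit for hits in local for hit in hits))  # merge and filter
        if is_leaf:
            return merged
        candidates = [vid for _, vid in merged]  # centroid id == next-layer partition id
```

It would be called as, for example, `with ThreadPoolExecutor(max_workers=8) as pool: drill_down(q, top, storage, 2, m=8, top_k=10, pool=pool)`. Per-layer budgets could be supported by passing a list of m values instead of one.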

[0080] The query process in this embodiment of the invention has the following characteristics: First, the query path no longer depends on a large number of cross-node graph traversals, but is constrained by the layer height; Second, multiple partitions of each layer can be accessed in batches in parallel, so the total number of network rounds for querying is mainly determined by the number of layers; Third, as the candidates converge layer by layer, the query range gradually shrinks, which can control the system cost while maintaining the target recall rate.
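The claim that network rounds are governed mainly by the number of layers can be checked with simple arithmetic. The sketch below rests on idealised assumptions: each layer shrinks by the partition size, the budget is a vector count, and every lower layer expands exactly m equal-sized partitions. The function names are hypothetical.

```python
def count_lower_layers(n_vectors, memory_budget, partition_size):
    """Layers of partitions below the in-memory top level (one centroid per partition)."""
    if partition_size < 2:
        raise ValueError("partition_size must be >= 2")
    layers, n = 0, n_vectors
    while n > memory_budget:
        n = -(-n // partition_size)  # ceiling division
        layers += 1
    return layers

def estimate_query_cost(num_lower_layers, m, avg_partition_size):
    """(network rounds, partitions read, vectors scored) per query, idealised."""
    rounds = num_lower_layers  # one batched parallel round per layer, independent of m
    partitions = num_lower_layers * m
    return rounds, partitions, partitions * avg_partition_size
```

Under these assumptions, one billion vectors with 100-vector partitions and a one-million-vector budget give two storage layers. A query then needs two batched rounds whatever the value of m, while m only scales the number of partitions and vectors read.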

[0081] The above-mentioned solution provided by the embodiments of the present invention determines the granularity of balanced partitioning and recursively constructs a hierarchical index, so that the top-level graph index resides in memory and the lower-level partitions are written to persistent storage media. During the query, a layer-by-layer candidate selection and partition parallel access mechanism is adopted, thereby taking into account both cross-node access cost and vector access cost. Through the solution of the present invention, cross-node traversal can be effectively reduced, vector read amplification can be reduced, and the query performance, system throughput and scalability of large-scale vector retrieval services can be improved.

[0082] Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by using software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, mobile hard drive, etc.), including several instructions to cause a computer device (such as a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0083] Example 2. This invention also provides a large-scale vector retrieval system, which is mainly used to implement the methods provided in the foregoing embodiments. As shown in Figure 5, the system mainly includes: The information acquisition unit is used to acquire vector datasets, target recall rate, and single-machine memory budget. The balanced partition granularity determination unit is used to determine the balanced partition granularity when constructing the hierarchical index corresponding to the vector dataset under the constraint of the target recall rate. The hierarchical index building unit is used to build a hierarchical index in a bottom-up recursive manner: it determines whether the current layer vector data meets the single-machine memory budget. If so, it builds the current layer vector data into a top-level graph index and clusters the current layer vector data based on balanced partitioning granularity to generate multiple partitions; if not, it clusters the current layer vector data according to balanced partitioning granularity to generate multiple partitions and centroids of each partition, and uses the centroids of the partitions as the vector data of the next layer, continuing to determine whether the single-machine memory budget is met, until the top-level graph index is built, the top-level graph index is deployed in the memory of the compute node, and each layer partition is written to persistent storage medium; when the current layer is the leaf layer, the current layer vector data is the vector dataset; The query unit is used to search for multiple candidate vector data in the top-level graph index using the input query vector, and then perform parallel retrieval of the next-level partition based on the searched candidate vector data, and search for multiple candidate vector data in the next-level partition, and repeat this process until the leaf layer is reached, and the retrieval results are obtained in the leaf layer.

[0084] Since the main technical details of this system have been described in detail in previous embodiments, they will not be repeated here.

[0085] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional modules is used as an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the system can be divided into different functional modules to complete all or part of the functions described above.

[0086] Example 3. The present invention also provides a processing device which, as shown in Figure 6, mainly includes: one or more processors; a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method provided in the foregoing embodiments.

[0087] Furthermore, the processing device also includes at least one input device and at least one output device; in the processing device, the processor, memory, input device, and output device are connected via a bus.

[0088] In this embodiment of the invention, the specific types of the memory, input device, and output device are not limited; for example: Input devices can be touchscreens, image acquisition devices, physical buttons, or mice, etc. The output device can be a display terminal; The memory can be random access memory (RAM) or non-volatile memory, such as disk storage.

[0089] Example 4. The present invention also provides a readable storage medium storing a computer program that, when executed by a processor, implements the method provided in the foregoing embodiments.

[0090] In this embodiment of the invention, the readable storage medium is a computer-readable storage medium and can be disposed in the aforementioned processing device, for example, as a memory in the processing device. Furthermore, the readable storage medium can also be any medium capable of storing program code, such as a USB flash drive, portable hard drive, read-only memory (ROM), magnetic disk, or optical disk.

[0091] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims. The information disclosed in the background section is intended only to enhance the understanding of the overall background technology of the present invention and should not be construed as an admission or implication in any way that such information constitutes prior art known to those skilled in the art.

Claims

1. A large-scale vector retrieval method, characterized by comprising: obtaining a vector dataset, a target recall rate, and a single-machine memory budget; under the constraint of the target recall rate, determining the balanced partition granularity for constructing the hierarchical index corresponding to the vector dataset; constructing the hierarchical index in a bottom-up recursive manner: determining whether the current layer's vector data meets the single-machine memory budget; if so, constructing the top-level graph index from the current layer's vector data, and clustering the current layer's vector data based on the balanced partition granularity to generate multiple partitions; if not, clustering the current layer's vector data according to the balanced partition granularity to generate multiple partitions and the centroid of each partition, using the centroids of the partitions as the vector data of the next layer, and continuing to determine whether the single-machine memory budget is met until the top-level graph index is constructed; deploying the top-level graph index in the memory of the compute nodes, and writing each layer's partitions to persistent storage, wherein, when the current layer is the leaf layer, the current layer's vector data is the vector dataset; and, using an input query vector, searching for multiple candidate vector data in the top-level graph index, then obtaining the next-level partitions in parallel based on the searched candidate vector data, searching for multiple candidate vector data in the next-level partitions, and repeating this process until the leaf layer is reached, where the search results are obtained.

2. The large-scale vector retrieval method according to claim 1, characterized in that, Determining the balanced partition granularity when constructing the hierarchical index corresponding to the vector dataset under the constraint of the target recall rate refers to determining the balanced partition granularity of the leaf layer. The method is as follows: Extract sample vectors of a preset size from the vector dataset, and set multiple candidate partition densities accordingly; For each candidate partition density, perform clustering on the extracted sample vectors to obtain the corresponding partition and the centroid of the partition, and construct a centroid index; Under the constraint of the target recall rate, measure the corresponding cost index in combination with the centroid index of each candidate partition density; Determine the optimal candidate partition density based on the cost index, and use it as the balanced partition granularity of the leaf layer. When constructing a hierarchical index using a bottom-up recursive approach, if the current layer is not a leaf layer, the balanced partition granularity of the leaf layer is used directly; or, if the current layer is not a leaf layer, the vector data of the current layer is used as the sample vector, and the balanced partition granularity of the current layer is determined in the same way as the balanced partition granularity of the leaf layer.

3. The large-scale vector retrieval method according to claim 2, characterized in that the cost metric includes the average number of vector accesses required when the target recall rate is met, or includes one or more of: cross-node traversals, average number of partition accesses, average number of distance calculations, number of input/output operations, and other metrics that reflect system cost.

4. The large-scale vector retrieval method according to claim 1, characterized in that, When clustering the vector data of the current layer, for the boundary vector of each partition, determine its distance from the center of the adjacent partition. If the distance from the center of an adjacent partition is less than a preset value, then the corresponding boundary vector is copied to the corresponding adjacent partition.

5. The large-scale vector retrieval method according to claim 1, characterized in that, The step of writing each partition to the persistent storage medium includes: Each partition in each layer is saved as an independent object, independent data block, or independent storage unit to the persistent storage medium; the data stored inside each partition includes: vector data, vector identifier, partition identifier, and auxiliary metadata related to query processing; among them, the partition identifier is generated during the construction of the hierarchical index and is used to establish the mapping relationship between the vector data of the upper layer and the partition of the lower layer.

6. The large-scale vector retrieval method according to claim 1 or 5, characterized by further comprising: during the query process, searching for a corresponding number of candidate vector data at each layer according to a preset query search budget, wherein the candidate vector data of each layer are the centroids of next-layer partitions; obtaining the next-layer partitions in parallel through the mapping relationship between centroids and partitions; and finally performing the search and sorting at the leaf layer and taking the top K vector data as the search results.

7. A large-scale vector retrieval method according to claim 6, characterized in that, The query search budget is set based on the target recall rate, system resources, latency requirements, and business scenarios; wherein, all layers are set to use a unified query search budget, or are set to use different query search budgets.

8. A large-scale vector retrieval system, characterized in that it implements the method according to any one of claims 1 to 7 and comprises: The information acquisition unit is used to acquire vector datasets, target recall rate, and single-machine memory budget. The balanced partition granularity determination unit is used to determine the balanced partition granularity when constructing the hierarchical index corresponding to the vector dataset under the constraint of the target recall rate. The hierarchical index building unit is used to build a hierarchical index in a bottom-up recursive manner: it determines whether the current layer vector data meets the single-machine memory budget. If so, it builds the current layer vector data into a top-level graph index and clusters the current layer vector data based on balanced partitioning granularity to generate multiple partitions; if not, it clusters the current layer vector data according to balanced partitioning granularity to generate multiple partitions and centroids of each partition, and uses the centroids of the partitions as the vector data of the next layer, continuing to determine whether the single-machine memory budget is met, until the top-level graph index is built, the top-level graph index is deployed in the memory of the compute node, and each layer partition is written to persistent storage medium; when the current layer is the leaf layer, the current layer vector data is the vector dataset; The query unit is used to search for multiple candidate vector data in the top-level graph index using the input query vector, and then perform parallel retrieval of the next-level partition based on the searched candidate vector data, and search for multiple candidate vector data in the next-level partition, and repeat this process until the leaf layer is reached, and the retrieval results are obtained in the leaf layer.

9. A processing device, characterized by comprising: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1 to 7.

10. A readable storage medium storing a computer program, characterized in that, When a computer program is executed by a processor, it implements the method as described in any one of claims 1 to 7.