Data indexing method, apparatus, and electronic device

By hashing and addressing key-value pairs, a key index is created, solving the problem of large memory usage in existing technologies and achieving more efficient query and disk I/O performance.

CN113157689B · Active · Publication Date: 2026-06-30 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date: 2020-01-22
Publication Date: 2026-06-30


Abstract

This invention provides a data indexing method, apparatus, electronic device, and storage medium. The method includes: hashing the keys in key-value pair data to obtain a key hash; determining the data address of the key-value pair data in storage space; and establishing a key index corresponding to the key-value pair data based on the key hash and the data address. The key index is used to respond to query requests for the key-value pair data. This invention reduces the memory space occupied by the index and improves query efficiency.

Description

Technical Field

[0001] This invention relates to data processing technology, and more particularly to a data indexing method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Key-value pairs are a common form of data storage where there is a corresponding relationship between keys and values. After storing key-value pair data, the corresponding value can be retrieved using the key. An index is a structure that sorts the stored key-value pair data. Its main purpose is to locate key-value pair data, thereby improving query response efficiency.

[0003] In related technical solutions, key-value pair data is typically stored using a sorted string table (SSTable) within a Log Structured Merge tree (LSM) architecture, and an index is created for the key-value pair data. However, this method indexes data blocks, and when the value volume in the key-value pair data is large, the number of blocks is large, and the memory space occupied by the block index is also large.

Summary of the Invention

[0004] This invention provides a data indexing method, apparatus, electronic device, and storage medium that can reduce the memory space occupied by indexing key-value pairs of data.

[0005] The technical solution of this invention is implemented as follows:

[0006] This invention provides a data indexing method, including:

[0007] Hash the keys in key-value pairs to obtain the key hash;

[0008] Determine the data address of the key-value pair in the storage space;

[0009] Based on the key hash and the data address, establish a key index corresponding to the key-value pair data;

[0010] The key index is used to respond to query requests for the key-value pair data.

[0011] This invention provides a data indexing method, including:

[0012] Receive query requests including the target key;

[0013] The target key is hashed to obtain the target key hash;

[0014] Find the key index that matches the target key hash, and

[0015] Based on the data address in the found key index, determine the corresponding key-value pair data to respond to the query request.
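The query steps above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the 2-byte hash taken from SHA-256, the in-memory `dict` used as the key index, the list positions used as data addresses, and the names `key_hash`, `records`, and `query` are all assumptions made for the example. Because different keys can share a hash, the sketch compares the stored key before returning a value.

```python
import hashlib

def key_hash(key: bytes, length: int = 2) -> int:
    """Map a key to a fixed-length hash (assumption: truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(key).digest()[:length], "big")

# Hypothetical storage space: each record holds the key and value together.
records = [(b"user:1", b"alice"), (b"user:2", b"bob")]

# Key index: key hash -> positions of matching records (the "data addresses").
index = {}
for address, (key, _value) in enumerate(records):
    index.setdefault(key_hash(key), []).append(address)

def query(target_key: bytes):
    """Hash the target key, find matching index entries, then read the record."""
    for address in index.get(key_hash(target_key), []):
        stored_key, value = records[address]
        if stored_key == target_key:  # guard against hash collisions
            return value
    return None
```

Calling `query(b"user:2")` follows the index entry to the second record; a key that was never stored yields `None`, even if its hash happens to coincide with a stored one.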

[0016] This invention provides a data indexing device, comprising:

[0017] The first hash processing module is used to hash the keys in key-value pair data to obtain the key hash;

[0018] The address determination module is used to determine the data address of the key-value pair data in the storage space;

[0019] The establishment module is used to create a key index corresponding to the key-value pair data based on the key hash and the data address.

[0020] The key index is used to respond to query requests for the key-value pair data.

[0021] This invention provides a data indexing device, comprising:

[0022] The receiving module is used to receive query requests including the target key;

[0023] The second hash processing module is used to perform hash processing on the target key to obtain the target key hash;

[0024] The lookup module is used to find the key index that matches the target key hash, and

[0025] Based on the data address in the found key index, determine the corresponding key-value pair data to respond to the query request.

[0026] This invention provides an electronic device, comprising:

[0027] Memory, used to store executable instructions;

[0028] The processor, when executing executable instructions stored in the memory, implements the data indexing method provided in the embodiments of the present invention.

[0029] This invention provides a storage medium storing executable instructions that, when executed by a processor, implement the data indexing method provided in this invention.

[0030] The embodiments of the present invention have the following beneficial effects:

[0031] The embodiments of the present invention obtain a key hash by hashing the keys in key-value pair data and determine the data address of the key-value pair data in the storage space. A key index is then established based on the key hash and the data address. Since the key hash is small in size, the memory space occupied by the index is greatly reduced.

Attached Figure Description

[0032] Figure 1 is a schematic diagram of an optional architecture of the data indexing system provided in an embodiment of the present invention;

[0033] Figure 2A is a schematic diagram of an optional architecture of the electronic device provided in an embodiment of the present invention;

[0034] Figure 2B is a schematic diagram of an optional architecture of the electronic device provided in an embodiment of the present invention;

[0035] Figure 3 is a schematic diagram of an optional architecture of the data indexing device provided in an embodiment of the present invention;

[0036] Figure 4A is an optional flowchart of the data indexing method provided in an embodiment of the present invention;

[0037] Figure 4B is an optional flowchart of the data indexing method provided in an embodiment of the present invention;

[0038] Figure 5A is an optional flowchart of the data indexing method provided in an embodiment of the present invention;

[0039] Figure 5B is an optional flowchart of the data indexing method provided in an embodiment of the present invention;

[0040] Figure 5C is an optional flowchart of the data indexing method provided in an embodiment of the present invention;

[0041] Figure 6 is an optional index diagram provided in an embodiment of the present invention.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on the present invention. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0043] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0044] In the following description, the terms "first" and "second" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the invention described herein can be implemented in an order other than that illustrated or described herein.

[0045] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to limit the invention.

[0046] In the implementation of this application, the collection and processing of relevant data should strictly comply with the requirements of relevant laws and regulations, obtain the informed consent or separate consent of the personal information subject, and carry out subsequent data use and processing within the scope of laws and regulations and the authorization of the personal information subject.

[0047] Before providing a further detailed description of the embodiments of the present invention, the nouns and terms involved in the embodiments of the present invention will be explained, and the nouns and terms involved in the embodiments of the present invention shall be interpreted as follows.

[0048] 1) Key-value data: A common form of data storage where there is a correspondence between keys and values, and the corresponding value is obtained by searching through the key.

[0049] 2) Hash processing: Transform an input of arbitrary length into a fixed-length output through a hash algorithm. This output is the hash value.

[0050] 3) Hash collision: refers to the situation where two different inputs are transformed into the same hash value after being processed by a hash algorithm.

[0051] 4) Data Address: The data address can be the actual storage address of the data, the address offset of the data, or other forms of address. The address offset is the distance between the actual storage address of the data and the base address. The base address is used to reduce the range of the address offset so that the offset can be represented by a shorter data encoding address length.

[0052] 5) Index: A structure for sorting data. The main purpose of an index is to speed up the retrieval of data, that is, to find data that meets the restrictions as quickly as possible.

[0053] 6) Byte Alignment: Memory space is divided into bytes. Byte alignment refers to arranging data in space according to certain rules, rather than storing them sequentially one after another. The arrangement rule is expressed by the alignment unit. For example, if the data address encoding length is 2 bytes, its encoding range is 2^16 = 65536 bytes, that is, 64 kilobytes (KB). If the data to be byte-aligned is 256 KB, the alignment unit is 256 / 64 = 4 bytes.

[0054] 7) Log Structured Merge Tree (LSM) architecture: A mainstream data organization method that improves write performance by transforming random writes to sequential writes on disk. LSM architecture is used in a variety of databases.

[0055] 8) Sorted String Table (SSTable): A data indexing method under the LSM architecture. An SSTable consists of a series of data blocks, and the data blocks are located by creating a block index.

[0056] In related technologies, key-value pair data is typically indexed using SSTables under the LSM architecture. However, this method indexes data blocks, requiring the entire data block to be loaded from disk for access, resulting in low disk I/O efficiency, especially noticeable with solid-state drives (SSDs). Furthermore, when the values in key-value pairs are large, the number of data blocks is high, leading to a significant increase in memory usage for the block indexes. For example, SSTables usually limit the size of data blocks, such as a maximum block size of 64 kilobytes. If a key-value pair reaches 64 kilobytes, an SSTable will create a block containing only that key-value pair and a block index for that block. Since block indexes themselves occupy a large amount of memory, the memory usage of all block indexes increases dramatically as the number of large key-value pairs increases.

[0057] This invention provides a data indexing method, apparatus, electronic device, and storage medium that can reduce the memory space occupied by the index and improve disk I/O efficiency. The following describes an exemplary application of the electronic device provided in this invention.

[0058] Referring to Figure 1, Figure 1 is an optional architecture diagram of the data indexing system 100 provided in an embodiment of the present invention. To support a data indexing application, the terminal device 400 (terminal device 400-1 and terminal device 400-2 are shown as examples) connects to the server 200 through the network 300, and the server 200 connects to the database 500. The network 300 can be a wide area network, a local area network, or a combination of the two.

[0059] Server 200 is used to obtain key-value pair data and hash the keys in the key-value pair data to obtain key hashes. The key-value pair data can be entered by the user through terminal device 400 or obtained through other means. The key-value pair data is stored in database 500, and the data address of the key-value pair data in database 500 is determined. Based on the key hash and data address, a key index corresponding to the key-value pair data is established, and the key index is stored in the memory of server 200. Terminal device 400 is used to receive a query request including a target key and send the target key to server 200. Server 200 is also used to hash the target key to obtain a target key hash. The key index that matches the target key hash is searched, and the corresponding key-value pair data is determined based on the data address in the searched key index. The value in the key-value pair data is then sent to terminal device 400. Terminal device 400 is also used to display the queried value on graphical interface 410 (graphical interfaces 410-1 and 410-2 are shown as examples).

[0060] The following continues to describe exemplary applications of the electronic devices provided in the embodiments of the present invention. The electronic devices can be implemented as various types of terminal devices such as laptops, tablets, desktop computers, set-top boxes, and mobile devices (e.g., mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable gaming devices), or as servers. The following description uses an electronic device as a server as an example.

[0061] Referring to Figure 2A, Figure 2A is a schematic diagram of the architecture of the electronic device 600 provided in an embodiment of the present invention (for example, the server 200 or the terminal device 400 shown in Figure 1). The electronic device 600 shown in Figure 2A includes at least one processor 610, a memory 650, at least one network interface 620, and a user interface 630. The various components in the electronic device 600 are coupled together via a bus system 640. It is understood that the bus system 640 is used to implement communication between these components. In addition to a data bus, the bus system 640 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 640 in Figure 2A.

[0062] The processor 610 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0063] User interface 630 includes one or more output devices 631 that enable the presentation of media content, including one or more speakers and/or one or more visual displays. User interface 630 also includes one or more input devices 632, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.

[0064] The memory 650 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state storage, hard disk drives, optical disk drives, etc. The memory 650 may optionally include one or more storage devices physically located away from the processor 610.

[0065] The memory 650 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 650 described in this embodiment is intended to include any suitable type of memory.

[0066] In some embodiments, memory 650 is capable of storing data to support various operations, examples of which include programs, modules, and data structures or subsets or supersets thereof, as illustrated below.

[0067] Operating system 651 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, driver layer, etc., for implementing various basic business functions and handling hardware-based tasks;

[0068] The network communication module 652 is used to reach other computing devices via one or more (wired or wireless) network interfaces 620, exemplary network interfaces 620 including: Bluetooth, WiFi, and Universal Serial Bus (USB), etc.

[0069] Presentation module 653 enables the presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 631 associated with user interface 630 (e.g., a display screen, a speaker, etc.).

[0070] The input processing module 654 is used to detect and translate one or more user inputs or interactions from one or more input devices 632.

[0071] In some embodiments, the data indexing device provided in this invention can be implemented in software. Figure 2A shows a data indexing device 6551 stored in memory 650. This device can be software in the form of programs and plug-ins, and includes the following software modules: a first hash processing module 65511, an address determination module 65512, and an establishment module 65513. These modules are logical, and can therefore be arbitrarily combined or further divided according to the functions they implement. The functions of each module will be described below.

[0072] In some embodiments, Figure 2B shows a data indexing device 6552 stored in memory 650. This device can be software in the form of programs and plug-ins, and includes the following software modules: a receiving module 65521, a second hash processing module 65522, and a lookup module 65523. These modules are logical, and can therefore be arbitrarily combined or further divided according to the functions they implement. Apart from the data indexing device 6552 shown, the rest of Figure 2B can be the same as Figure 2A. The functions of each module will be explained below.

[0073] In other embodiments, the data indexing device provided in the embodiments of the present invention can be implemented in hardware. As an example, the data indexing device provided in the embodiments of the present invention can be a processor in the form of a hardware decoding processor, which is programmed to execute the data indexing method provided in the embodiments of the present invention. For example, the processor in the form of a hardware decoding processor can be one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.

[0074] The electronic device that performs the data indexing method described above can be of various types. For example, the data indexing method provided in this embodiment of the invention can be executed by the server described above, by a terminal device (e.g., the terminal device 400-1 or terminal device 400-2 shown in Figure 1), or jointly by the server and the terminal device.

[0075] The process of implementing a data indexing method in an electronic device by means of an embedded data indexing device 6551 will now be described in conjunction with the exemplary applications and structures of the electronic device described above.

[0076] Referring to Figure 3 and Figure 4A, Figure 3 is a schematic diagram of the architecture of the data indexing device 6551 provided in an embodiment of the present invention, illustrating the process of establishing an index through a series of modules. Figure 4A is a flowchart of the data indexing method provided in an embodiment of the present invention. The steps shown in Figure 4A are explained below in conjunction with Figure 3.

[0077] In step 101, the keys in the key-value pair data are hashed to obtain the key hash.

[0078] As an example, referring to Figure 3, in the first hash processing module 65511, the key in the key-value pair data is hashed. To facilitate differentiation, the resulting hash value is named the key hash. The length of the key hash can be set according to the actual application scenario, such as 2 bytes.

[0079] In step 102, the data address of the key-value pair data in the storage space is determined.

[0080] Key-value pair data is stored in a designated storage space, and its data address within that space is determined. This data address indicates where the key-value pair data is stored, facilitating retrieval of the corresponding data during subsequent queries. It is worth noting that the data address can be either the actual storage address of the key-value pair data or an address offset. Furthermore, in this embodiment of the invention, the key and the value in the key-value pair data are stored together, rather than storing only the value and using the key as the index, as in traditional solutions.

[0081] In step 103, a key index corresponding to the key-value pair data is established based on the key hash and data address; wherein, the key index is used to respond to query requests for key-value pair data.

[0082] Here, a key index is established based on the key hash and data address corresponding to the key-value pair data. Since the capacity of the key hash is usually smaller than that of the key (for example, some keys can have a capacity of tens of bytes), the key index established in this embodiment of the invention occupies less memory space than storing the key information itself directly in the index. The established key index provides a query channel for the key-value pair data, that is, it is used to respond to query requests for the key-value pair data. In addition, the key index may also include a value capacity, which is used to represent the size of the value in the key-value pair data. With the key index established and stored, if it is necessary to query the key-value pair data, it can be searched through the key hash, and after finding the corresponding key index, the corresponding key-value pair data can be accessed according to the data address in the key index.
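The index-building steps can be illustrated with a short sketch. It is only an assumption-laden example: the fixed entry layout (2-byte key hash, 4-byte data address, 4-byte value capacity), the truncated SHA-256 hash, and the names `ENTRY`, `key_hash16`, and `build_key_index` are hypothetical and are not taken from the source. Record framing such as a stored key length is omitted for brevity.

```python
import hashlib
import struct

# Assumed entry layout: 2-byte key hash, 4-byte data address, 4-byte value size.
ENTRY = struct.Struct("<HII")

def key_hash16(key: bytes) -> int:
    """2-byte key hash (assumption: first two bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key).digest()[:2], "big")

def build_key_index(pairs):
    """Append each key and value to storage and return (storage, packed index)."""
    storage = bytearray()
    index = bytearray()
    for key, value in pairs:
        address = len(storage)   # data address = offset of the record
        storage += key + value   # key and value are stored together
        index += ENTRY.pack(key_hash16(key), address, len(value))
    return bytes(storage), bytes(index)

pairs = [
    (b"order:2020-01-22:000001", b"paid"),
    (b"order:2020-01-22:000002", b"shipped"),
]
storage, index = build_key_index(pairs)
entries = [ENTRY.unpack_from(index, i) for i in range(0, len(index), ENTRY.size)]
```

Under this layout each index entry takes `ENTRY.size` bytes regardless of key length, whereas the keys in the example are 23 bytes each, which is the kind of saving the paragraph above describes.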

[0083] As can be seen from the above exemplary implementation of Figure 4A, the embodiments of the present invention establish a key index corresponding to key-value pair data based on the key hash and the data address. Since the key index stores a key hash that occupies a small amount of space, the memory requirement is greatly reduced, that is, the memory space occupied by the key index is reduced.

[0084] In some embodiments, referring to Figure 4B, Figure 4B is an optional flowchart of the data indexing method provided in an embodiment of the present invention. Based on Figure 4A, before step 101, in step 201, the key-value pair data can be sorted according to the key in the key-value pair data.

[0085] As an example, referring to Figure 3, in the data sorting module, all key-value pairs can be sorted either in ascending order of keys or in descending order of keys. For example, if the key-value pairs include "100-value1", "56-value2", and "101-value3", where "100", "56", and "101" represent keys, then sorting them in ascending order of keys will result in the sequence: "56-value2"-"100-value1"-"101-value3".

[0086] In step 202, the sorted key-value pairs are stored in the partitioned sub-storage spaces; wherein the number of key-value pairs stored in the sub-storage spaces does not exceed the upper limit.

[0087] As an example, referring to Figure 3, in the storage module, sub-storage spaces can be divided from the main storage space based on the number of key-value pairs. For example, if the maximum number of key-value pairs a sub-storage space can store is set to 128, and there are 200 key-value pairs to be stored, then two sub-storage spaces will be created. Of course, depending on the actual application scenario, the maximum number can be set to other values, not limited to 128. Then, the sorted key-value pairs are stored sequentially into the sub-storage spaces. When the maximum number of key-value pairs in a sub-storage space is reached, the remaining unstored key-value pairs are stored in the next sub-storage space, until all key-value pairs are stored. Simultaneously, the data addresses of the key-value pairs are determined during storage.
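Steps 201 and 202 can be sketched as a sort followed by chunking. The function name `partition_sorted` and the use of integer keys (so that 56 sorts before 100, as in the example above) are assumptions made for illustration only.

```python
def partition_sorted(pairs, max_per_space=128):
    """Sort pairs by key, then fill sub-storage spaces up to the upper limit."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        ordered[i:i + max_per_space]
        for i in range(0, len(ordered), max_per_space)
    ]

# 200 key-value pairs supplied in descending key order.
pairs = [(k, f"value{k}") for k in range(200, 0, -1)]
spaces = partition_sorted(pairs)
```

With an upper limit of 128, the 200 pairs land in two sub-storage spaces: the first is full and the second holds the remaining 72.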

[0088] In some embodiments, the above-mentioned hashing of the keys in the key-value pair data can be achieved in the following way: mapping the keys in the key-value pair data to key hashes of a set hash length; wherein the hash collision probability in the sub-storage space is the ratio between the upper limit on the number of key-value pairs in the sub-storage space and the number of encoding types corresponding to the set hash length.

[0089] In this embodiment of the invention, the keys in key-value pairs can be mapped to key hashes of a set hash length according to a hash algorithm. By adjusting the set hash length and the upper limit of data in the sub-storage space, the hash collision probability within the sub-storage space can be controlled to ensure it does not exceed a collision probability threshold. The collision probability threshold can be set according to the actual application scenario. For example, if the hash length is set to 2 bytes, the corresponding number of encoding types (equivalent to the encoding range) is 2^16. If the maximum number of data items in a sub-storage space is 128, then the probability of a hash collision within that sub-storage space is 128 / 2^16 ≈ 0.195%. By using the above method, the occurrence of hash collisions within sub-storage spaces can be minimized.
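The arithmetic in this paragraph can be checked with a few lines. The helper names `collision_ratio` and `max_pairs_for` are hypothetical. The value computed is the ratio as the text defines it, which is roughly the chance that one looked-up hash also matches some other entry in a full sub-storage space; it is not a birthday-style estimate that any two entries collide.

```python
def collision_ratio(max_pairs: int, hash_length_bytes: int) -> float:
    """Ratio of the per-space upper limit to the number of distinct hash codes."""
    return max_pairs / (2 ** (8 * hash_length_bytes))

def max_pairs_for(threshold: float, hash_length_bytes: int) -> int:
    """Largest upper limit whose ratio stays within the given threshold."""
    return int(threshold * 2 ** (8 * hash_length_bytes))

ratio = collision_ratio(128, 2)    # 128 / 2**16
limit = max_pairs_for(0.002, 2)    # upper limit for a 0.2% threshold
```

The second helper shows the tuning direction the text mentions: for a fixed hash length, a stricter threshold lowers the number of pairs allowed per sub-storage space.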

[0090] In step 203, the largest or smallest key in the sub-storage space is determined as the identifier key, and the data address of the key-value pair data containing the identifier key is determined as the identifier data address; wherein, the identifier key is used to represent the key range of the sub-storage space.

[0091] As an example, referring to Figure 3, in the identifier determination module, identifiers are selected for the divided sub-storage spaces. Specifically, the largest or smallest key within the sub-storage space is determined as the identifier key used for indexing. This identifier key represents the key range of the sub-storage space. For example, if the keys are sorted in ascending order in step 201, and the identifier key is set to be the largest key within the sub-storage space, then the key range of sub-storage space A is (identifier key of sub-storage space B, identifier key of sub-storage space A], that is, greater than the identifier key of sub-storage space B and up to and including the identifier key of sub-storage space A. Here, sub-storage space B is the sub-storage space preceding sub-storage space A. That is, in step 202, the sorted key-value pairs are first stored in sub-storage space B, and after the maximum number of items in sub-storage space B is reached, they are then stored in sub-storage space A. After selecting the identifier key for the sub-storage space, the data address of the key-value pair containing the identifier key is determined as the identifier data address of the corresponding sub-storage space.

[0092] In step 204, a spatial index for the sub-storage space is established based on the identifier key and the identifier data address.

[0093] As an example, referring to Figure 3, in the spatial index building module, a spatial index for the sub-storage space is built based on the identifier key and the identifier data address. This spatial index is equivalent to the parent index of the key index and is used together with the key index to respond to query requests for key-value pair data.
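Steps 203 and 204 can be pictured with a small sketch. It assumes each sub-storage space is already a list of (key, data address) tuples sorted by key; the function name `build_spatial_index`, the `use_largest` flag, and the sample addresses are hypothetical.

```python
def build_spatial_index(spaces, use_largest=True):
    """Return one (identifier key, identifier data address) entry per sub-space.

    Each sub-space is a list of (key, data_address) tuples sorted by key.
    """
    spatial_index = []
    for space in spaces:
        key, address = space[-1] if use_largest else space[0]
        spatial_index.append((key, address))
    return spatial_index

spaces = [
    [(56, 0), (100, 16), (101, 32)],    # sub-storage space B
    [(150, 48), (154, 64), (187, 80)],  # sub-storage space A
]
spatial_index = build_spatial_index(spaces)
```

Here the spatial index holds one small entry per sub-storage space, so it stays compact even when each space contains many key-value pairs.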

[0094] In some embodiments, after step 204, the method further includes: sorting the spatial index according to the identifier key in the spatial index to obtain a spatial index sequence; wherein the spatial index sequence is used to respond to a lookup operation on the spatial index.

[0095] After establishing the spatial indexes of the sub-storage spaces, all spatial indexes can be sorted according to the identifier keys in the spatial indexes to obtain a spatial index sequence. The sorting process can be performed in ascending or descending order of the identifier keys. The resulting sorted spatial index sequence can be used to respond to ordered lookup operations on the spatial indexes, improving search efficiency.
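An ordered lookup over the spatial index sequence could look like the following sketch, which uses binary search from Python's standard `bisect` module. It assumes ascending order and that each identifier key is the largest key of its sub-storage space; the sample keys, addresses, and the name `find_sub_space` are hypothetical.

```python
import bisect

# Spatial index sequence sorted by identifier key (largest key of each sub-space).
identifier_keys = [101, 187, 260]
identifier_addresses = [32, 80, 128]

def find_sub_space(target_key):
    """Return (position, identifier data address) of the covering sub-space."""
    pos = bisect.bisect_left(identifier_keys, target_key)
    if pos == len(identifier_keys):
        return None  # target is larger than every stored key
    return pos, identifier_addresses[pos]
```

`bisect_left` returns the first sub-storage space whose identifier key is not smaller than the target, which matches a key range that excludes the previous identifier key and includes the current one.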

[0096] In some embodiments, after step 103, the method further includes: sorting the key index according to the key hash to obtain a key index sequence; wherein the key index sequence is used to respond to a lookup operation on the key index.

[0097] Similarly, all key indexes within the storage space can be sorted based on the key hash within the key index to obtain a key index sequence. The sorting can be done in ascending or descending order of key hashes. It's worth noting that this key hash-based sorting is not traditional bucket sorting; it directly sorts based on the key hash value, facilitating subsequent ordered lookup processing. For example, if the key hash of key index 1 is 187, key index 2 is 154, and key index 3 is 150, sorting them in ascending order of key hashes yields the key index sequence: key index 3 - key index 2 - key index 1. This sorted key index sequence can be used to respond to ordered lookup operations on the key indexes within the storage space, improving search efficiency. Based on the partitioning of sub-storage spaces, all key indexes within each sub-storage space can be sorted to obtain the corresponding key index sequence for that sub-storage space.
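The example with key hashes 187, 154, and 150 can be reproduced as a sketch. The data addresses and the helper name `find_addresses` are made up for illustration; the lookup returns every entry with a matching hash, since more than one key may share a hash.

```python
import bisect

# (key hash, data address) entries for key index 1, 2 and 3 from the example.
key_indexes = [(187, 0), (154, 16), (150, 32)]
key_index_sequence = sorted(key_indexes)  # ascending by key hash
hashes = [h for h, _ in key_index_sequence]

def find_addresses(target_hash):
    """Binary-search the sequence; return every address whose hash matches."""
    lo = bisect.bisect_left(hashes, target_hash)
    hi = bisect.bisect_right(hashes, target_hash)
    return [addr for _, addr in key_index_sequence[lo:hi]]
```

After sorting, the sequence runs key index 3, key index 2, key index 1, and each lookup takes a logarithmic number of comparisons instead of a full scan.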

[0098] In Figure 4B, step 102 shown in Figure 4A can be updated to step 205. In step 205, the relative data address of the key-value pair data is obtained by subtracting the identifier data address of the sub-storage space where the key-value pair data is located from the data address of the key-value pair data.

[0099] As an example, referring to Figure 3, in the address determination module 65512, based on the determined identifier data address of the sub-storage space, the data address of each key-value pair within the sub-storage space can be updated according to the identifier data address. Specifically, the identifier data address of the sub-storage space containing the key-value pair is subtracted from the data address of the key-value pair to obtain the relative data address of the key-value pair. A key index is then built based on this relative data address. Compared to the original data address of the key-value pair, determining the relative data address allows for a shorter data address encoding length to describe the storage of the key-value pair.
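The effect of step 205 on encoding length can be seen in a short sketch. It assumes the identifier key is the smallest key of the sub-storage space, so the identifier data address is the lowest address and every relative address is non-negative; the sample addresses and the name `to_relative` are hypothetical.

```python
import struct

def to_relative(addresses, identifier_address):
    """Subtract the sub-space's identifier data address from each data address."""
    return [a - identifier_address for a in addresses]

# Hypothetical absolute addresses inside one sub-storage space. The identifier
# key is assumed to be the smallest key, so its record has the lowest address.
addresses = [1_048_576, 1_048_640, 1_049_000]
identifier_address = addresses[0]

relative = to_relative(addresses, identifier_address)
packed = struct.pack("<3H", *relative)  # each offset now fits in 2 bytes
restored = [identifier_address + r for r in struct.unpack("<3H", packed)]
```

The absolute addresses here need more than 2 bytes each, while the relative ones fit a 2-byte field and can be turned back into absolute addresses by adding the identifier data address kept in the spatial index.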

[0100] In some embodiments, the above-mentioned storage of sorted key-value pairs into the partitioned sub-storage space can be achieved by: determining the storage capacity of the partitioned sub-storage space; determining the alignment unit based on the storage capacity and the data address encoding length; writing the sorted key-value pairs into the partitioned sub-storage space, and during the writing process, performing byte alignment processing on the written key-value pairs based on the alignment unit to obtain the data address of the written key-value pairs.

[0101] For the partitioned sub-storage space, the storage capacity can be determined, and the alignment unit can be determined based on the storage capacity and the set data address encoding length. The data address encoding length is typically 2 bytes, but other byte lengths are also possible. When determining the alignment unit, the corresponding encoding range is first determined based on the data address encoding length. For example, when the data address encoding length is 2 bytes, its encoding range is 2^16 = 65536 bytes = 64 KB. Then, the storage capacity is divided by the encoding range to obtain the alignment unit. For example, when the storage capacity is 256 KB and the data address encoding length is 2 bytes, the alignment unit is 256 KB / 64 KB = 4 bytes.
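The calculation can be sketched as below. The function names are hypothetical. Rounding the quotient up, and storing the aligned offset divided by the alignment unit in the address field, are assumptions of this sketch rather than details stated in the text.

```python
def alignment_unit(storage_capacity: int, address_length_bytes: int = 2) -> int:
    """Alignment unit = storage capacity / encoding range (rounded up, at least 1)."""
    encoding_range = 1 << (8 * address_length_bytes)  # 2 bytes -> 65536
    return max(1, -(-storage_capacity // encoding_range))  # ceiling division


def align_up(offset: int, unit: int) -> int:
    """Round an offset up to the next multiple of the alignment unit."""
    return (offset + unit - 1) // unit * unit


unit = alignment_unit(256 * 1024)        # 256 KB capacity, 2-byte addresses
aligned = align_up(1001, unit)           # next write position on a 4-byte boundary
encoded = aligned // unit                # assumed value kept in the 2-byte field
```

With a 4-byte alignment unit, every offset inside the 256 KB space maps to a value below 65536, so it can be represented in 2 bytes.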

[0102] After obtaining the alignment unit, the sorted key-value pairs are written to the partitioned sub-storage spaces. During the writing process, the key-value pairs are byte-aligned according to the alignment unit to obtain the data address of the written key-value pairs. It's worth noting that the data address obtained here can be the original data address of the key-value pairs within the sub-storage space or a relative data address. This method ensures that the obtained data address effectively represents the written key-value pairs. In some embodiments, determining the storage capacity of the partitioned sub-storage spaces can be achieved by performing any of the following processes to obtain the storage capacity: determining the stored capacity of the previous sub-storage space and predicting the storage capacity of the partitioned sub-storage spaces based on the stored capacity; determining the key-value pairs to be written to the partitioned sub-storage spaces and setting the data capacity of the key-value pairs to be written as the storage capacity of the partitioned sub-storage spaces.

[0103] This invention provides two methods for determining the storage capacity of a partitioned sub-storage space. The first method involves identifying the previous sub-storage space, which has already stored key-value pairs of data. Then, based on the stored capacity of the previous sub-storage space, the storage capacity of the partitioned sub-storage space is predicted. Here, the stored capacity of the previous sub-storage space can be directly used as the storage capacity of the partitioned sub-storage space. However, since data capacity is not fully controllable (that is, the data capacity stored in two adjacent sub-storage spaces may not be similar), a predetermined additional value can be added to the stored capacity of the previous sub-storage space to obtain the storage capacity of the partitioned sub-storage space. For example, if the stored capacity of the previous sub-storage space is 2 megabytes (MB), adding an additional value of 1 MB yields a storage capacity of 3 MB for the partitioned sub-storage space. By reading the stored capacity of the previous sub-storage space, the storage capacity can be obtained relatively quickly, improving storage efficiency. On this basis, the data capacity of all key-value pairs to be written into the sub-storage space can be read. When the data capacity exceeds the predicted storage capacity, the data address encoding length is increased, for example from 2 bytes to 4 bytes, so that the subsequent data address can effectively represent the key-value pairs to be written.
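The first method can be sketched as follows. The function names are hypothetical; the 1 MB additional value and the widening from 2 to 4 bytes simply mirror the examples given above.

```python
MB = 1024 * 1024


def predict_capacity(previous_stored: int, additional: int = 1 * MB) -> int:
    """Predict the capacity of the new sub-storage space from the previous one."""
    return previous_stored + additional


def choose_address_length(actual_size: int, predicted_capacity: int,
                          address_length_bytes: int = 2) -> int:
    """Widen the address encoding when the real data exceeds the prediction."""
    if actual_size > predicted_capacity:
        return 4  # example from the text: increase 2 bytes to 4 bytes
    return address_length_bytes


predicted = predict_capacity(2 * MB)                        # 3 MB
length_ok = choose_address_length(2 * MB, predicted)        # prediction holds
length_wide = choose_address_length(4 * MB, predicted)      # prediction exceeded
```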

[0104] The second method is to directly obtain the data capacity of all key-value pairs in the partitioned sub-storage space to be written, and then determine this data capacity as the storage capacity of the partitioned sub-storage space. This method provides a more accurate storage capacity, but it is slower. Depending on the actual application scenario, any of the methods mentioned above can be used to obtain the storage capacity.

[0105] In Figure 4B, step 103 shown in Figure 4A can be updated to step 206, in which the key hash, relative data address, and value capacity are combined into a key index corresponding to the key-value pair data. The value capacity represents the size of the value in the key-value pair data, and the spatial index and the key index are used to respond to query requests for the key-value pair data.

[0106] As an example, referring to Figure 3, in the creation module 65513, for each key-value pair, the key hash, relative data address, and value capacity are combined to form a key index. The value capacity represents the size of the value in the key-value pair, so that the value can be loaded according to the value capacity during queries. The established spatial index and key index work together to respond to query requests for the key-value pair data.

[0107] As can be seen from the above exemplary implementation of Figure 4B, the embodiments of the present invention form a two-level index architecture by establishing spatial indexes and key indexes. When a query request is received, the sub-storage space is searched first, and then the key-value pair data is searched, which effectively improves the query efficiency of key-value pair data. At the same time, by setting the hash length and the upper limit on the number of key-value pairs per sub-storage space, the occurrence of hash collisions in the sub-storage space can be effectively reduced.

[0108] The process of implementing a data indexing method in an electronic device by means of an embedded data indexing device 6552 will now be described in conjunction with the exemplary applications and structures of the electronic device described above.

[0109] Referring to Figure 5A, Figure 5A is an optional flowchart of the data indexing method provided in this embodiment of the invention, which will be explained in conjunction with the steps shown in Figure 5A.

[0110] In step 301, a query request including the target key is received.

[0111] Here, the electronic device receives a query request that includes a key. For ease of distinction, the key in the query request is named the target key.

[0112] In step 302, the target key is hashed to obtain the target key hash.

[0113] Since the key index includes a key hash, when performing a query based on a query request, the target key is hashed to obtain the target key hash.

[0114] In step 303, the key index that matches the target key hash is found, and the corresponding key-value pair data is determined based on the data address in the found key index.

[0115] Here, the key index whose key hash is the same as the target key hash is searched. For example, the key indexes can be traversed and matched according to the target key hash until a key index with the same key hash is found. Then, based on the data address in the found key index, the actual storage address of the corresponding key-value pair data is determined, and the key-value pair data is accessed according to the actual storage address to respond to the query request.

[0116] As can be seen from the above exemplary implementation of Figure 5A, the embodiments of the present invention establish a key index for each key-value pair, find the corresponding key index according to the target key hash, and thus respond to the query request, thereby improving query efficiency.

[0117] In some embodiments, referring to Figure 5B, Figure 5B is an optional flowchart of the data indexing method provided in an embodiment of the present invention. Based on Figure 5A, after step 303, in step 401, the key in the key-value pair data corresponding to the found key index can be determined as the comparison key.

[0118] Hash processing transforms an input of arbitrary length into a fixed-length output using a hash algorithm, yielding a hash value. However, different inputs may produce the same hash value after hash processing, resulting in a hash collision. Unlike traditional solutions that use the key as an index to store the value in key-value pairs, this invention stores both the key and the value, and establishes a key index based on the key hash. Therefore, this invention overcomes the hash collision problem by storing the key-value pair data. Specifically, when a key index matching the target key hash is found, the key in the key-value pair data corresponding to the found key index is determined as the comparison key.

[0119] In step 402, when the comparison key is the same as the target key, the value capacity in the found key index is determined.

[0120] Here, the comparison key is compared with the target key in the query request. If the comparison key is different from the target key, the key index corresponding to that comparison key is skipped; if the comparison key is the same as the target key, the value capacity in the key index corresponding to that comparison key is determined. It is worth noting that, because of hash collisions, at least two key indexes whose key hash matches the target key hash may be found before step 401. In this case, if the comparison key corresponding to any found key index is the same as the target key, the query is considered successful, and the value capacity in that key index is further determined; if the comparison keys corresponding to all found key indexes are different from the target key, a query failure message is returned.

[0121] In step 403, the value in the key-value pair data corresponding to the found key index is loaded according to the value capacity to respond to the query request.

[0122] Here, the electronic device loads the value from the key-value pair data corresponding to the key index found in the storage space according to the determined value capacity, in response to the query request, for example, by presenting the value on the graphical interface of the terminal device so that the user can be informed.

[0123] As can be seen from the above exemplary implementation of Figure 5B, the embodiments of the present invention effectively avoid incorrect query results caused by hash collisions by comparing the key corresponding to the found key index with the target key. That is, in the embodiments of the present invention, there is no possibility of key collision, which improves the accuracy of the query.
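The key comparison of steps 401 to 403 can be sketched as below. Everything here is illustrative: `toy_hash` is a deliberately weak hash chosen so that collisions are easy to produce, the `store` dictionary stands in for the storage space, and the linear scan over key indexes is used only for brevity.

```python
from typing import Dict, List, Optional, Tuple


def toy_hash(key: bytes) -> int:
    """Deliberately weak 1-byte hash so that collisions are easy to trigger."""
    return sum(key) % 256


def lookup(store: Dict[int, Tuple[bytes, bytes]],
           key_indexes: List[Tuple[int, int]],
           target_key: bytes) -> Optional[bytes]:
    """Return the value for target_key, or None when the query fails."""
    target_hash = toy_hash(target_key)
    for key_hash, address in key_indexes:
        if key_hash != target_hash:
            continue
        comparison_key, value = store[address]  # one access to the stored pair
        if comparison_key == target_key:        # the stored key settles collisions
            return value
    return None


# b"ab" and b"ba" collide under toy_hash (both sum to 195).
store = {0: (b"ab", b"value1"), 1: (b"ba", b"value2")}
key_indexes = [(toy_hash(k), addr) for addr, (k, _) in store.items()]
```

A lookup for `b"ba"` first meets the colliding entry for `b"ab"`, skips it after the key comparison, and then returns `b"value2"`; a key that collides but was never stored yields `None`.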

[0124] In some embodiments, referring to Figure 5C, Figure 5C is an optional flowchart of the data indexing method provided in an embodiment of the present invention. Step 303 shown in Figure 5A can be implemented through steps 501 to 509, which will be explained in conjunction with each step.

[0125] In step 501, the spatial index located in the middle position within the spatial index sequence is determined as the comparison spatial index.

[0126] Based on the established spatial index sequence, an ordered search can be performed on the spatial index sequence according to the target key in the query request to obtain the target spatial index. The range of keys represented by the identifier key of the target spatial index includes the target key. It is worth noting that this embodiment of the invention does not limit the method of ordered search processing; for example, ordered search processing methods may include binary search, interpolation search, and Fibonacci search. For ease of understanding, the process of obtaining the target spatial index is explained using a binary search method.

[0127] During a binary search, the spatial index sequence is first obtained, and the spatial indices in this sequence have been sorted in a specific order. Then, the spatial index located in the middle position within the spatial index sequence is determined as the comparison spatial index, and a binary search is performed on the sub-storage space.

[0128] In step 502, the spatial index sequence is divided into a first spatial index sequence and a second spatial index sequence according to the comparison spatial index.

[0129] The spatial index sequence is divided into a first spatial index sequence and a second spatial index sequence based on the comparison spatial index. In the first spatial index sequence, the identifier key of the spatial index is less than the identifier key of the comparison spatial index, and in the second spatial index sequence, the identifier key of the spatial index is greater than the identifier key of the comparison spatial index.

[0130] In step 503, when the target key falls within the key range represented by the identifier key of the comparison spatial index, the comparison spatial index is determined as the target spatial index, and the sub-storage space corresponding to the target spatial index is determined. The key index located in the middle position within the key index sequence of the sub-storage space is determined as the comparison key index.

[0131] The key range represented by the identifier key of the comparison spatial index is compared with the target key. When the target key falls within that key range, the comparison spatial index is determined as the target spatial index. Further, the sub-storage space corresponding to the target spatial index is determined, and based on the target key hash, an ordered search process is performed on the key index sequence corresponding to that sub-storage space to obtain the key index whose key hash matches the target key hash. It is worth noting that the ordered search process for the key index sequence is not limited in this embodiment of the invention; for ease of understanding, a binary search is used for illustration.

[0132] Furthermore, even without partitioning the storage space into sub-storage spaces, ordered search processing of the key index sequence can still be performed. Specifically, the key index located in the middle position within the key index sequence of the storage space is determined as the comparison key index, thereby enabling a binary search of the key index.

[0133] In step 504, when the target key does not fall within the key range represented by the identifier key of the comparison spatial index and the target key is greater than the identifier key of the comparison spatial index, the spatial index located in the middle position in the second spatial index sequence is determined as the new comparison spatial index.

[0134] Here, when the target key does not fall within the key range represented by the identifier key of the comparison spatial index, and the target key is greater than the identifier key of the comparison spatial index, the spatial index located in the middle position in the second spatial index sequence is determined as the new comparison spatial index. Based on the new comparison spatial index, the second spatial index sequence is divided, and the binary search continues.

[0135] In step 505, when the target key does not fall within the key range represented by the identifier key of the comparison spatial index and the target key is smaller than the identifier key of the comparison spatial index, the spatial index located in the middle position in the first spatial index sequence is determined as the new comparison spatial index.

[0136] When the target key does not fall within the key range represented by the identifier key of the comparison spatial index, and the target key is smaller than the identifier key of the comparison spatial index, the spatial index located in the middle position within the first spatial index sequence is determined as the new comparison spatial index. Based on the new comparison spatial index, the first spatial index sequence is then divided, and the binary search continues. It is worth noting that if, after the binary search over the sub-storage spaces is completed, a sub-storage space containing the target key is still not found, a search failure message is returned.
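Steps 501 to 505 can be sketched as a binary search over the identifier keys. This sketch assumes that the identifier key of each sub-storage space is its largest key, so the key range of a sub-storage space runs from just above the previous identifier key up to its own identifier key; `find_sub_space` is a hypothetical name.

```python
from typing import List, Optional


def find_sub_space(identifier_keys: List[int], target_key: int) -> Optional[int]:
    """Return the position of the sub-storage space whose key range may hold target_key.

    identifier_keys is the ascending spatial index sequence (largest key per space).
    """
    lo, hi = 0, len(identifier_keys) - 1
    result = None
    while lo <= hi:
        mid = (lo + hi) // 2                    # comparison spatial index
        if target_key <= identifier_keys[mid]:
            result = mid                        # candidate; keep searching the first half
            hi = mid - 1
        else:
            lo = mid + 1                        # continue in the second half
    return result                               # None means the search failed


identifier_keys = [19, 42, 77]
```

For example, a target key of 20 resolves to the second sub-storage space (position 1), while a target key of 78 exceeds every identifier key and the search fails. A target key below the smallest stored key still resolves to position 0 and is rejected later, during the key index search.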

[0137] In step 506, the key index sequence is divided into a first key index sequence and a second key index sequence based on the comparison key index.

[0138] When a sub-storage space is located, its key index sequence is obtained. The key indices in this sequence correspond to key-value pairs within the sub-storage space, and these key indices are sorted in a specific order. The key index located in the middle of the key index sequence is designated as the comparison key index. Based on this comparison key index, the key index sequence is divided into a first key index sequence and a second key index sequence. The key hash of the key index in the first key index sequence is less than the key hash of the comparison key index, while the key hash of the key index in the second key index sequence is greater than the key hash of the comparison key index.

[0139] In step 507, when the target key hash is the same as the key hash in the comparison key index, the comparison key index is determined to be the key index that matches the target key hash.

[0140] The target key hash is compared with the key hash in the comparison key index. When the target key hash and the key hash in the comparison key index are the same, the comparison key index is determined as the key index that matches the target key hash.

[0141] In step 508, when the target key hash is less than the key hash in the comparison key index, the key index located in the middle position in the first key index sequence is determined as the new comparison key index.

[0142] When the target key hash is less than the key hash in the comparison key index, the key index in the middle position of the first key index sequence is determined as the new comparison key index. The first key index sequence is then divided according to the new comparison key index, so that the binary search can continue.

[0143] In step 509, when the target key hash is greater than the key hash in the comparison key index, the key index located in the middle position in the second key index sequence is determined as the new comparison key index.

[0144] When the target key hash is greater than the key hash in the comparison key index, the key index in the middle position of the second key index sequence is determined as the new comparison key index. The second key index sequence is then divided according to this new comparison key index, and the binary search continues. It's worth noting that if no key index with the same key hash as the target key hash is found after the binary search of the key indexes is completed, a search failure message is returned.
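Steps 506 to 509 amount to a binary search on the sorted key hashes. The sketch below uses Python's standard `bisect` module instead of spelling out the halving, and returns every position with an equal hash so that colliding key indexes can be checked afterwards; `find_key_indexes` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right
from typing import List


def find_key_indexes(sorted_hashes: List[int], target_hash: int) -> List[int]:
    """Return positions in the key index sequence whose key hash equals target_hash."""
    start = bisect_left(sorted_hashes, target_hash)   # first position >= target
    end = bisect_right(sorted_hashes, target_hash)    # first position > target
    return list(range(start, end))                    # empty list: search failure


sorted_hashes = [150, 154, 154, 187]  # ascending key hashes of one sub-storage space
```

Here `find_key_indexes(sorted_hashes, 154)` yields two candidate positions, which a subsequent key comparison would disambiguate.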

[0145] In step 510, the corresponding key-value pair data is determined based on the data address in the key index that matches the target key hash, in response to the query request.

[0146] When a key index that matches the target key hash is determined, the corresponding key-value pair data is accessed based on the data address in that key index to respond to the query request.

[0147] In some embodiments, the above-mentioned determination of the corresponding key-value pair data based on the data address in the found key index can be achieved in the following manner: determine the relative data address in the found key index, and determine the identifier data address of the sub-storage space where the found key index is located; perform a summation process on the relative data address and the identifier data address, and determine the key-value pair data used to respond to the query request based on the data address obtained from the summation process.

[0148] If the found key index includes a relative data address, determine the identifier data address of the sub-storage space where the found key index is located, sum the relative data address and the identifier data address, and determine the key-value pair data used to respond to the query request based on the data address obtained from the summation. In this way, the applicability of accessing key-value pair data in different situations is improved.

[0149] As can be seen from the above exemplary implementation of Figure 5C, the embodiments of the present invention determine the sub-storage space and key index that meet the query request through ordered search processing, thereby obtaining the corresponding key-value pair data. Compared with the traversal search method, this improves query efficiency and speeds up the response to query requests.

[0150] The following will describe an exemplary application of the embodiments of the present invention in a practical application scenario.

[0151] First, all key-value pairs are sorted according to their keys. Based on the sorted key-value pairs, sub-storage spaces are created, and the sorted key-value pairs are stored in these sub-storage spaces. The number of key-value pairs in each sub-storage space does not exceed a data limit, such as 128. For each sub-storage space, a corresponding spatial index is created. The spatial index includes an identifier key for that sub-storage space, which can be the largest or smallest key. The spatial index also includes a base offset, which is equivalent to the identifier data address corresponding to the identifier key mentioned above. During storage, the key-value pairs can be byte-aligned according to a set data address encoding length. For example, with a data address encoding length of 2 bytes, for each key-value pair in a sub-storage space, the identifier data address can be subtracted from its original data address to obtain a relative data address of 2 bytes. Furthermore, all spatial indices can be sorted to obtain a spatial index sequence.

[0152] Next, the key index at the next level of the spatial index is built. Specifically, each key in the sub-storage space is hashed to obtain a key hash of a set length; for clarity, a hash length of 2 bytes is used as an example. Based on the obtained key hash, a key index is built, storing the following information: 2 bytes of key hash + 2 bytes of relative data address + 2 bits of value capacity. The value capacity represents the size of the value in the key-value pair; with 2 bits, it can describe a value occupying up to 4 pages. Furthermore, all the obtained key indexes can be sorted to obtain a key index sequence.
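One possible in-memory layout for this key index is sketched below. The layout is an assumption: the 2-byte hash and 2-byte relative address of each entry are packed with `struct`, and the 2-bit value capacities are packed four per byte in a separate array, storing the page count minus one so that 1 to 4 pages fit into 2 bits. The function names are hypothetical.

```python
import struct
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (key hash, relative data address, value pages 1..4)


def pack_key_indexes(entries: List[Entry]) -> Tuple[bytes, bytes]:
    """Pack entries into a fixed-width array plus a 2-bit-per-entry capacity array."""
    fixed = bytearray()
    caps = bytearray((len(entries) + 3) // 4)          # four 2-bit fields per byte
    for i, (key_hash, rel_addr, pages) in enumerate(entries):
        fixed += struct.pack("<HH", key_hash, rel_addr)  # 2 + 2 bytes
        caps[i // 4] |= (pages - 1) << (2 * (i % 4))
    return bytes(fixed), bytes(caps)


def unpack_key_index(fixed: bytes, caps: bytes, i: int) -> Entry:
    """Read back the i-th key index."""
    key_hash, rel_addr = struct.unpack_from("<HH", fixed, 4 * i)
    pages = ((caps[i // 4] >> (2 * (i % 4))) & 0b11) + 1
    return key_hash, rel_addr, pages


entries = [(150, 0, 1), (154, 16, 2), (187, 40, 4), (9, 3, 3), (65535, 65535, 1)]
fixed, caps = pack_key_indexes(entries)

# For a full sub-storage space of 128 entries the average is 4.25 bytes per key.
full_fixed, full_caps = pack_key_indexes([(i, i, 1) for i in range(128)])
bytes_per_key = (len(full_fixed) + len(full_caps)) / 128
```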

[0153] An embodiment of the present invention provides the index diagram shown in Figure 6. Figure 6 shows a sub-storage space in which three key-value pairs are stored in ascending order of their keys. Key-value pair 1 has a key of 10 and a value of value1; key-value pair 2 has a key of 13 and a value of value2; and key-value pair 3 has a key of 19 and a value of value3. The data addresses shown in Figure 6 can be actual storage addresses or address offsets, depending on the specific storage and addressing methods. Generally speaking, data address 3 > data address 2 > data address 1. For ease of explanation, the case where the identifier key is the largest key in the sub-storage space is taken as an example. The identifier key in the spatial index of the sub-storage space shown in Figure 6 is the key in key-value pair data 3, which is 19, and the corresponding identifier data address is data address 3. When creating the key index, for key-value pair data 1, its corresponding key index includes the hash value of the key with a value of 10, the relative data address obtained by subtracting data address 3 from data address 1, and the capacity of value1; for key-value pair data 2, its corresponding key index includes the hash value of the key with a value of 13, the relative data address obtained by subtracting data address 3 from data address 2, and the capacity of value2; for key-value pair data 3, its corresponding key index includes the hash value of the key with a value of 19, the relative data address obtained by subtracting data address 3 from data address 3 (which is 0), and the capacity of value3. It is worth noting that the key-value pair data stored in the sub-storage space is itself stored in an ordered manner, the key indexes corresponding to the key-value pair data can be sorted according to their key hashes, and the spatial indexes corresponding to the sub-storage spaces can be sorted according to their identifier keys.

[0154] After establishing the spatial index and key index, query requests can be received and corresponding queries can be performed. During a query, firstly, a binary search is performed on the spatial index sequence based on the target key in the query request to obtain the corresponding sub-storage space. Then, within that sub-storage space, a binary search is performed on the key index sequence based on the target key hash to obtain the corresponding key index. Of course, binary search may fail; in such cases, a query failure message is returned.

[0155] When the key index is obtained through binary search, the corresponding key-value pair is retrieved based on the relative data address in the key index. It is worth noting that this embodiment of the invention stores both the key and the value in the key-value pair data, instead of storing only the value and using the key as an index as in the traditional method. Therefore, when the key-value pair data is found, this embodiment also compares the key in the key-value pair data with the target key. Only if the two keys match is the search successful, and the value in the key-value pair data is returned. With 128 key-value pairs in the sub-storage space and a hash length of 2 bytes, the probability of a hash collision is 128 / 65536 = 0.195%. If a hash collision occurs, the cost is one I/O request, i.e., one additional access to the key-value pair data. This low-probability cost is within an acceptable range.

[0156] The beneficial effects of the data indexing method provided in this embodiment of the invention will be explained from the perspectives of time cost and space cost.

[0157] Regarding the time cost, since binary search is used during the search, the time cost of the search is O(log n), where n is the total number of objects to be searched. This improves the search efficiency compared to the traditional method.

[0158] Regarding space cost, since every 128 keys correspond to one sub-storage space, and the spatial index of each sub-storage space does not exceed 64 bytes, the spatial index cost amortized over each key is 64 / 128 = 0.5 bytes. Additionally, each key corresponds to a key index, and the size of the key index is 4.25 bytes (2 bytes of key hash + 2 bytes of relative data address + 2 bits of value capacity). Therefore, in this embodiment of the invention, the required index size for each key is 4.75 bytes, far smaller than the index capacity of current mainstream query engines (usually above 24 bytes), approximately 20% of the mainstream index size. Through this embodiment of the invention, for 10 billion key-value pairs, the index only requires approximately 45 gigabytes (GB) of memory.
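The arithmetic behind these figures can be reproduced with a few lines. The variable names are illustrative, and the 64-byte figure is read here as the upper bound on the spatial index of one sub-storage space.

```python
KEYS_PER_SUB_SPACE = 128
SPATIAL_INDEX_BYTES = 64           # upper bound per sub-storage space
KEY_INDEX_BYTES = 2 + 2 + 2 / 8    # key hash + relative address + 2-bit value capacity
MAINSTREAM_BYTES_PER_KEY = 24      # lower bound quoted for mainstream engines

spatial_per_key = SPATIAL_INDEX_BYTES / KEYS_PER_SUB_SPACE   # 0.5 bytes
index_per_key = KEY_INDEX_BYTES + spatial_per_key            # 4.75 bytes
ratio = index_per_key / MAINSTREAM_BYTES_PER_KEY             # about 0.2
total_bytes = index_per_key * 10_000_000_000                 # 10 billion keys

print(index_per_key, round(ratio, 3), total_bytes / 2**30)
```

The total is 47.5 billion bytes, or about 44.2 GiB, which is consistent with the figure of approximately 45 GB.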

[0159] Through experimental verification by the inventors, in application scenarios with a value capacity > 4000 bytes, the solution provided by this invention improves memory efficiency by more than 5 times compared to the SSTable format commonly used in current LSM architectures (LevelDB, RocksDB, HBase, and Cassandra, etc.). Specific test data is as follows. The C language version of SSTable in LevelDB is used as a reference.

[0160] In disk I/O scenarios, the data volume is required to be much larger than the operating system's memory. Therefore, cache hits are essentially invalidated during random testing, making it an effective way to test disk I/O efficiency. The test machine used here includes two 20-core CPUs, 192 GB of RAM, four 3.6 terabyte (TB) Non-Volatile Memory Express (NVMe) hard drives, one 480 GB SSD, and two 10G Ethernet ports. Only one NVMe hard drive was used during the testing process.

[0161] In random query scenarios with value capacity > 4000 bytes, 80 threads were used to stress test various indexes. From a memory perspective, compared to SSTable, the solution provided in this embodiment of the invention improves memory efficiency by 400%; from a performance perspective, compared to SSTable, the solution provided in this embodiment of the invention improves the query per second (QPS) by 52%; and from the perspective of response speed, i.e., the time taken to reach 99.9% response speed, the solution provided in this embodiment of the invention improves by 147% compared to SSTable. Detailed test data are as follows.

[0162] Test scenario:

[0163]

[0164] Performance data:

[0165]

[0166] From a performance perspective, the solution provided in this embodiment of the invention has a significant performance improvement over SSTable.

[0167] From a memory perspective, the memory occupied by the solution provided in this embodiment of the invention is directly proportional to the number of key-value pairs. Since SSTable only compiles a first-level index for data blocks, with the same number of keys, when the value capacity increases, the number of first-level indexes will increase relatively, and the memory occupied by the index will continue to increase until the value capacity is greater than or equal to the data block size.

[0168] In the full memory scenario, the test system's memory is greater than the actual test data. Under these circumstances, random queries will generally hit the system cache, reflecting CPU performance more. To improve test stability, two memory tests were performed, and the results of the second test were used. The first memory test was used to force data to be flushed to the system's page cache.

[0169] From a performance perspective, in a random query test using 40 threads, the QPS of the solution provided in this embodiment exceeds 17 million, while SSTable only achieves 3.5 million. In addition, a sequential scan test using 20 threads was conducted. In this test, the QPS of the solution provided in this embodiment exceeds 180 million, while SSTable's is around 20 million.

[0170] In terms of functional characteristics, the solution provided by the embodiments of the present invention can support the following characteristics:

[0171] 1) Random query: After receiving a query request including the target key, it can find the corresponding sub-storage space and the key-value pair data in the sub-storage space through ordered search processing to respond to the query request.

[0172] 2) Range Query: In this embodiment of the invention, a key index sequence and a spatial index sequence are constructed. Furthermore, the key-value pairs within the sub-storage space are themselves ordered. If a range query request is received, the corresponding query result can be obtained based on the ordered relationship of the key-value pairs. For example, a sub-storage space stores ordered key-value pairs "10-value1", "13-value2", and "19-value3", where "10", "13", and "19" all represent keys. If the range query request is to query key-value pairs with keys greater than 12, and the key-value pair "10-value1" is accessed during the query, the next key-value pair (i.e., the key with the larger key) can be accessed based on the ordered relationship between the key-value pairs, and it is determined whether the key in the next key-value pair is greater than 12. Since the key in "13-value2" is greater than 12, "13-value2" and all key-value pairs after "13-value2" are returned as the query result.

[0173] 3) Variable key size: Since the key index stores the key hash rather than the key, the embodiments of the present invention impose no requirement on the size of the key itself; that is, the key size is variable.

[0174] 4) No collision possible with keys: Although key hashes may collide, this embodiment of the invention avoids this problem by comparing keys, thus obtaining accurate query results.

[0175] 5) Create a memory index for each key.

[0176] It is worth noting that for SSTable, the in-memory index is a block index, while the first-level index is mixed with the value and compiled in the data block. That is, SSTable does not support creating an in-memory index for each key.
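The range query described in item 2) above can be sketched as follows, using the same example data. The function name is hypothetical, and `bisect_right` is used to jump straight to the first key greater than the bound, instead of stepping from "10-value1" to its successor as in the text.

```python
from bisect import bisect_right
from typing import List, Tuple


def range_query_greater_than(pairs: List[Tuple[int, str]], bound: int) -> List[Tuple[int, str]]:
    """Return all key-value pairs whose key is greater than bound.

    pairs must already be sorted in ascending key order, as in a sub-storage space.
    """
    keys = [key for key, _ in pairs]
    first = bisect_right(keys, bound)   # position of the first key > bound
    return pairs[first:]                # that pair and every pair after it


pairs = [(10, "value1"), (13, "value2"), (19, "value3")]
result = range_query_greater_than(pairs, 12)
print(result)  # [(13, 'value2'), (19, 'value3')]
```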

[0177] In addition, embodiments of the present invention also provide the following access index table (showing access indexes obtained through the scheme provided by embodiments of the present invention):

[0178]

[0179] During testing with key-value pairs with a value size of 5000 bytes, the solution provided in this embodiment of the invention reads on average only 16.27 blocks × 512 bytes per block ≈ 8330 bytes of data per query, approximately two pages. Because this embodiment indexes each key-value pair and loads only the value of the specified value size at a time, the amount of data loaded is relatively small. In contrast, SSTable, being a block-based index, must load the entire data block for each query, resulting in a larger data load and wasted resources.
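The arithmetic behind the figure above can be checked with a short calculation. The block size and the average block count come from the text; the 4096-byte page size is an assumption made here only to illustrate "approximately two pages".

```python
BLOCK_SIZE = 512              # bytes per block, from the text
AVG_BLOCKS_PER_QUERY = 16.27  # average blocks read per query, from the text
PAGE_SIZE = 4096              # assumption: 4 KiB pages

avg_bytes = AVG_BLOCKS_PER_QUERY * BLOCK_SIZE
pages = avg_bytes / PAGE_SIZE

print(round(avg_bytes))  # 8330
print(round(pages, 2))   # 2.03
```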

[0180] The following continues to describe an exemplary structure in which the data indexing device 6551 provided in the embodiments of the present invention is implemented as software modules. In some embodiments, as shown in Figure 2A, the software modules of the data indexing device 6551 stored in the memory 650 may include: a first hash processing module 65511, used to perform hash processing on the keys in the key-value pair data to obtain a key hash; an address determination module 65512, used to determine the data address of the key-value pair data in the storage space; and a creation module 65513, used to create a key index corresponding to the key-value pair data based on the key hash and the data address; wherein the key index is used to respond to query requests for the key-value pair data.

[0181] In some embodiments, the data indexing device 6551 further includes: a data sorting module, configured to sort the key-value pair data according to the keys in the key-value pair data; a storage module, configured to store the sorted key-value pair data into a divided sub-storage space; wherein the number of key-value pair data stored in the sub-storage space does not exceed the upper limit; an identifier determination module, configured to determine the largest or smallest key in the sub-storage space as the identifier key, and determine the data address of the key-value pair data containing the identifier key as the identifier data address; wherein the identifier key is used to represent the key range of the sub-storage space; and a spatial index building module, configured to build a spatial index of the sub-storage space according to the identifier key and the identifier data address; wherein the spatial index and the key index are used to respond to query requests for key-value pair data.
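The sorting, partitioning, and spatial index steps described above might look like the following sketch. It assumes that the smallest key of each sub-storage space is chosen as the identifier key and that a record occupies `len(key) + len(value)` bytes; the record layout and the names `SpatialIndex` and `build_spatial_indexes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpatialIndex:
    identifier_key: bytes    # smallest key in the sub-storage space (assumed choice)
    identifier_address: int  # data address of the pair holding that key


def build_spatial_indexes(pairs, upper_limit):
    """Sort pairs by key, group them into sub-storage spaces holding at most
    upper_limit pairs, and record one spatial index per sub-storage space."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    indexes = []
    address = 0
    for position, (key, value) in enumerate(ordered):
        if position % upper_limit == 0:
            # First pair of a new sub-storage space: it carries the identifier key.
            indexes.append(SpatialIndex(key, address))
        # Hypothetical record layout: key bytes followed by value bytes.
        address += len(key) + len(value)
    return indexes


pairs = [(b"k3", b"ccc"), (b"k1", b"a"), (b"k2", b"bb"), (b"k4", b"dddd"), (b"k5", b"e")]
spatial = build_spatial_indexes(pairs, upper_limit=2)
for entry in spatial:
    print(entry)
```

With an upper limit of 2, the five pairs fall into three sub-storage spaces whose identifier keys are `k1`, `k3`, and `k5`.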

[0182] In some embodiments, the address determination module 65512 is further configured to: subtract the identifier data address of the sub-storage space where the key-value pair data is located from the data address of the key-value pair data to obtain the relative data address of the key-value pair data;

[0183] The creation module 65513 is further configured to: combine the key hash, the relative data address, and the value capacity into the key index corresponding to the key-value pair data; wherein the value capacity is used to represent the size of the value in the key-value pair data.
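A key index entry built from these three fields could be sketched as below. The hash function (SHA-256 truncated to 16 bits), the field names, and the example addresses are illustrative assumptions, not values taken from the patent. The sketch also assumes the identifier data address is the lowest address in the sub-storage space, so the relative address is non-negative.

```python
import hashlib
from typing import NamedTuple


class KeyIndex(NamedTuple):
    key_hash: int          # fixed-length hash of the key
    relative_address: int  # data address minus identifier data address
    value_size: int        # the "value capacity": size of the value in bytes


def hash_key(key: bytes, hash_bits: int = 16) -> int:
    """Map a key of any size to a hash of hash_bits bits (illustrative choice)."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % (1 << hash_bits)


def make_key_index(key, value, data_address, identifier_address):
    relative = data_address - identifier_address
    return KeyIndex(hash_key(key), relative, len(value))


entry = make_key_index(b"user:42", b"x" * 5000,
                       data_address=70_000, identifier_address=65_536)
print(entry.relative_address, entry.value_size)  # 4464 5000
```

Storing the relative address instead of the absolute one keeps the address field short, since it only needs to span a single sub-storage space.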

[0184] In some embodiments, the storage module is further configured to: determine the storage capacity of the divided sub-storage space; determine the alignment unit based on the storage capacity and the data address encoding length; write the sorted key-value pair data to the divided sub-storage space, and during the writing process, perform byte alignment processing on the written key-value pair data according to the alignment unit to obtain the data address of the written key-value pair data.
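One way to read the alignment step is sketched here. It assumes the alignment unit is the smallest power of two for which every aligned offset in the sub-storage space, counted in units, fits in the data address encoding length; this rule and the names `alignment_unit`, `align_up`, and `write_aligned` are assumptions for illustration, not the patent's stated formula.

```python
def alignment_unit(storage_capacity: int, address_bits: int) -> int:
    """Smallest power-of-two unit such that capacity / unit <= 2**address_bits."""
    unit = 1
    while storage_capacity > unit * (1 << address_bits):
        unit *= 2
    return unit


def align_up(offset: int, unit: int) -> int:
    """Round offset up to the next multiple of unit."""
    return (offset + unit - 1) // unit * unit


def write_aligned(record_sizes, unit):
    """Return the aligned data address assigned to each written record."""
    addresses, offset = [], 0
    for size in record_sizes:
        offset = align_up(offset, unit)
        addresses.append(offset)
        offset += size
    return addresses


unit = alignment_unit(storage_capacity=1 << 20, address_bits=16)  # 1 MiB, 16-bit addresses
addresses = write_aligned([5000, 37, 900], unit)
encoded = [address // unit for address in addresses]
print(unit, addresses, encoded)  # 16 [0, 5008, 5056] [0, 313, 316]
```

Under this assumption, a 16-bit address field can cover a 1 MiB sub-storage space at the cost of up to 15 bytes of padding per record.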

[0185] In some embodiments, the storage module is further configured to: perform any one of the following processes to obtain the storage capacity of the partitioned sub-storage space: determine the stored capacity of the previous sub-storage space of the partitioned sub-storage space, and predict the storage capacity of the partitioned sub-storage space based on the stored capacity; determine the key-value pair data to be written to the partitioned sub-storage space, and determine the data capacity of the key-value pair data to be written as the storage capacity of the partitioned sub-storage space.

[0186] In some embodiments, the data indexing device 6551 further includes a spatial index sorting module, configured to sort the spatial index according to the identifier key in the spatial index to obtain a spatial index sequence; wherein the spatial index sequence is used to respond to a lookup operation on the spatial index.

[0187] In some embodiments, the first hash processing module 65511 is further configured to: map the keys in the key-value pair data to key hashes of a set hash length; wherein the hash collision probability in the sub-storage space is the ratio between the upper limit on the number of key-value pairs stored in the sub-storage space and the number of encoding types corresponding to the set hash length.
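The ratio described above can be worked through numerically. The sketch assumes that the number of encoding types for a hash of `hash_bits` bits is `2 ** hash_bits`; the upper limit of 256 and the 16-bit hash length are illustrative values, not figures from the patent.

```python
from fractions import Fraction


def collision_probability(upper_limit: int, hash_bits: int) -> Fraction:
    """Upper limit on pairs per sub-storage space divided by the number of
    distinct hash encodings (assumed to be 2**hash_bits)."""
    return Fraction(upper_limit, 2 ** hash_bits)


p = collision_probability(upper_limit=256, hash_bits=16)
print(p, float(p))  # 1/256 0.00390625
```

Lowering the upper limit or lengthening the hash both shrink the ratio, which is the trade-off between index size and collision cost.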

[0188] In some embodiments, the data indexing device 6551 further includes a key index sorting module, configured to sort the key index according to the key hash to obtain a key index sequence; wherein the key index sequence is used to respond to a lookup operation on the key index.
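A sorted key index sequence supports binary search on the key hash, as in the following sketch. Entries are modeled as hypothetical `(key_hash, relative_address, value_size)` tuples with made-up values, and all entries sharing the target hash are returned so that a later key comparison can pick the right one.

```python
from bisect import bisect_left

# Hypothetical key index entries: (key_hash, relative_address, value_size).
key_indexes = [(0x9A3C, 128, 40), (0x01F2, 0, 64), (0x5B10, 64, 52), (0x5B10, 192, 17)]

# Sort by key hash to obtain the key index sequence.
key_index_sequence = sorted(key_indexes, key=lambda entry: entry[0])
hashes = [entry[0] for entry in key_index_sequence]


def find_by_hash(target_hash):
    """Binary-search the sorted sequence and collect every entry with target_hash."""
    position = bisect_left(hashes, target_hash)
    matches = []
    while position < len(hashes) and hashes[position] == target_hash:
        matches.append(key_index_sequence[position])
        position += 1
    return matches


print(find_by_hash(0x5B10))  # two entries share this hash
print(find_by_hash(0x1234))  # []
```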

[0189] The following continues to describe an exemplary structure in which the data indexing device 6552 provided in the embodiments of the present invention is implemented as software modules. In some embodiments, as shown in Figure 2B, the software modules of the data indexing device 6552 stored in the memory 650 may include: a receiving module 65521, for receiving a query request including a target key; a second hash processing module 65522, for performing hash processing on the target key to obtain a target key hash; and a lookup module 65523, for looking up a key index that matches the target key hash, and determining the corresponding key-value pair data based on the data address in the found key index to respond to the query request.

[0190] In some embodiments, the lookup module 65523 is further configured to: perform an ordered lookup process on the key index sequence according to the target key hash to obtain a key index that matches the target key hash;

[0191] The data indexing device 6552 further includes: a spatial index lookup module, used to perform ordered lookup processing on the spatial index sequence according to the target key to obtain the target spatial index, so as to find the key index that matches the target key hash in the key index sequence of the sub-storage space corresponding to the target spatial index; wherein the key range represented by the identifier key of the target spatial index includes the target key.
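The ordered lookup on the spatial index sequence can be sketched as a binary search over identifier keys. The sketch assumes integer keys and that each identifier key is the smallest key of its sub-storage space, so the target space is the last one whose identifier key does not exceed the target key; the sample sequence is made up.

```python
from bisect import bisect_right

# Hypothetical spatial index sequence: (identifier_key, identifier_data_address),
# sorted by identifier key. Each identifier key is assumed to be the smallest
# key stored in its sub-storage space.
spatial_index_sequence = [(10, 0), (40, 4096), (75, 8192)]
identifier_keys = [key for key, _ in spatial_index_sequence]


def find_sub_storage_space(target_key):
    """Return the spatial index whose key range may contain target_key, or None."""
    position = bisect_right(identifier_keys, target_key) - 1
    if position < 0:
        return None  # target_key is smaller than every stored key
    return spatial_index_sequence[position]


print(find_sub_storage_space(42))  # (40, 4096)
print(find_sub_storage_space(5))   # None
```

Only the key index sequence of the returned sub-storage space then needs to be searched for the target key hash.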

[0192] In some embodiments, the lookup module 65523 is further configured to: determine the relative data address in the found key index, and determine the identifier data address of the sub-storage space where the found key index is located; perform a summation process on the relative data address and the identifier data address, and determine the key-value pair data used to respond to the query request based on the data address obtained from the summation process.

[0193] In some embodiments, the data indexing device 6552 further includes: a comparison key determination module, configured to determine the key in the key-value pair data corresponding to the found key index as the comparison key; and a comparison module, configured to respond to a query request based on the key-value pair data corresponding to the found key index when the comparison key is the same as the target key.
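The address summation and key comparison steps above can be combined in a short sketch. Storage is modeled as a hypothetical dictionary from absolute data address to a `(key, value)` pair, and the two candidate key index entries deliberately share one key hash to simulate a collision.

```python
# Hypothetical storage: absolute data address -> (key, value).
storage = {4096: (b"apple", b"red"), 4160: (b"melon", b"green")}
IDENTIFIER_ADDRESS = 4096  # identifier data address of the sub-storage space

# Candidate key index entries that share one key hash (a simulated collision):
# (key_hash, relative_address).
candidates = [(0x5B10, 0), (0x5B10, 64)]


def resolve(target_key, candidates, identifier_address):
    """Return the value whose stored key equals target_key, or None."""
    for _key_hash, relative_address in candidates:
        # Summation step: relative address + identifier data address.
        data_address = identifier_address + relative_address
        comparison_key, value = storage[data_address]
        # Only a full key match counts; a matching hash alone is not enough.
        if comparison_key == target_key:
            return value
    return None


print(resolve(b"melon", candidates, IDENTIFIER_ADDRESS))  # b'green'
print(resolve(b"kiwi", candidates, IDENTIFIER_ADDRESS))   # None
```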

[0194] In some embodiments, the data indexing device 6552 further includes: a capacity determination module for determining the value capacity in the found key index; and a loading module for loading the value in the key-value pair data corresponding to the found key index according to the value capacity, in response to a query request.
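Loading a value according to its value capacity might look like the sketch below, which reads exactly `value_size` bytes instead of a whole data block. The record layout (a one-byte key length, the key, then the value) and the use of `io.BytesIO` as a stand-in for the storage device are assumptions made for illustration.

```python
import io


def load_pair(device, data_address, value_size):
    """Seek to data_address and read one record.
    Assumed layout: 1-byte key length, key bytes, then value bytes."""
    device.seek(data_address)
    key_length = device.read(1)[0]
    key = device.read(key_length)
    # Read only as many bytes as the value capacity in the key index says.
    value = device.read(value_size)
    return key, value


record_a = bytes([5]) + b"apple" + b"red"
record_b = bytes([5]) + b"melon" + b"green"
device = io.BytesIO(record_a + record_b)

key, value = load_pair(device, data_address=len(record_a), value_size=5)
print(key, value)  # b'melon' b'green'
```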

[0195] This invention provides a storage medium storing executable instructions. When executed by a processor, these executable instructions cause the processor to execute the data indexing method provided in this invention, for example, the data indexing method shown in Figure 4A or Figure 4B, or the data indexing method shown in Figure 5A, Figure 5B, or Figure 5C.

[0196] In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above-mentioned memories.

[0197] In some embodiments, executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

[0198] As an example, executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple collaborating files (e.g., a file that stores one or more modules, subroutines, or code sections).

[0199] As an example, executable instructions can be deployed to execute on a single computing device, or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.

[0200] In summary, the following technical effects can be achieved through the embodiments of the present invention:

[0201] 1) By storing the key hash in the index, the embodiments of the present invention greatly reduce the memory requirements and the storage cost of the index, which is only about 20% of that of mainstream indexes.

[0202] 2) Because each key-value pair is indexed, only the value in the corresponding key-value pair needs to be loaded during loading, reducing the amount of data loaded each time.

[0203] 3) By setting the hash length and the upper limit on the number of key-value pairs in a sub-storage space, the probability of hash collisions is effectively reduced, and even if a hash collision occurs, the cost is within an acceptable range.

[0204] 4) The indexes created support random queries and range queries, providing high query flexibility. Furthermore, comparing keys prevents incorrect query results caused by hash collisions, so the keys themselves cannot collide. In addition, the key size is variable, rather than only supporting fixed sizes.

[0205] 5) The established indexes greatly improve query performance and reduce time and space costs, as reflected in memory consumption, QPS, and average execution time.

[0206] The above are merely embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of the present invention are included within the scope of protection of the present invention.

Claims

1. A data indexing method, characterized by comprising: storing key-value pair data, after sorting by key, into a sub-storage space divided from the main storage space, wherein the number of key-value pairs stored in the sub-storage space does not exceed an upper limit; determining the largest or smallest key within the sub-storage space as the identifier key, and determining the data address of the key-value pair data containing the identifier key as the identifier data address, wherein the identifier key is used to represent the key range of the sub-storage space; establishing a spatial index of the sub-storage space based on the identifier key and the identifier data address; hashing the keys in the key-value pair data to obtain a key hash; subtracting the identifier data address of the sub-storage space where the key-value pair data is located from the data address of the key-value pair data to obtain the relative data address of the key-value pair data; combining the key hash, the relative data address, and the value capacity into the key index corresponding to the key-value pair data, wherein the value capacity is used to represent the size of the value in the key-value pair data, and the spatial index and the key index are used to respond to query requests for the key-value pair data; and sorting the key index according to the key hash to obtain a key index sequence, wherein the key index sequence is used to respond to a lookup operation on the key index.

2. The data indexing method according to claim 1, characterized in that storing the key-value pair data, after sorting by key, into the sub-storage space divided from the main storage space comprises: determining the storage capacity of the divided sub-storage space; determining an alignment unit based on the storage capacity and the data address encoding length; and writing the sorted key-value pair data into the divided sub-storage space and, during the writing process, performing byte alignment processing on the written key-value pair data according to the alignment unit to obtain the data address of the written key-value pair data.

3. The data indexing method according to claim 2, characterized in that determining the storage capacity of the divided sub-storage space comprises performing any one of the following processes to obtain the storage capacity of the divided sub-storage space: determining the stored capacity of the previous sub-storage space of the divided sub-storage space, and predicting the storage capacity of the divided sub-storage space based on the stored capacity; or determining the key-value pair data to be written to the divided sub-storage space, and determining the data capacity of the key-value pair data to be written as the storage capacity of the divided sub-storage space.

4. The data indexing method according to claim 1, characterized in that, after establishing the spatial index of the sub-storage space based on the identifier key and the identifier data address, the method further comprises: sorting the spatial index according to the identifier key in the spatial index to obtain a spatial index sequence, wherein the spatial index sequence is used to respond to a lookup operation on the spatial index; and hashing the keys in the key-value pair data to obtain the key hash comprises: mapping the keys in the key-value pair data to key hashes of a set hash length, wherein the hash collision probability within the sub-storage space is the ratio between the upper limit on the number and the number of encoding types corresponding to the set hash length.

5. A data indexing method, characterized by comprising: receiving a query request including a target key, and determining a divided sub-storage space according to the target key; hashing the target key to obtain a target key hash; and finding, in the sub-storage space, the key index whose key hash matches the target key hash, and determining the corresponding key-value pair data based on the data address in the found key index to respond to the query request; wherein the sub-storage space is used to store key-value pair data sorted by key; the number of key-value pair data stored in the sub-storage space does not exceed an upper limit; the spatial index of the sub-storage space is established based on an identifier key and an identifier data address; the identifier key is the largest or smallest key in the sub-storage space and is used to represent the key range of the sub-storage space; and the identifier data address is the data address of the key-value pair data containing the identifier key.

6. The data indexing method according to claim 5, characterized in that finding the key index whose key hash matches the target key hash comprises: performing an ordered search on the key index sequence based on the target key hash to obtain the key index that matches the target key hash; and before finding the key index whose key hash matches the target key hash, the method further comprises: performing an ordered search on the spatial index sequence based on the target key to obtain a target spatial index, so as to find, in the key index sequence of the sub-storage space corresponding to the target spatial index, the key index that matches the target key hash; wherein the key range represented by the identifier key of the target spatial index includes the target key.

7. The data indexing method according to claim 6, characterized in that determining the corresponding key-value pair data based on the data address in the found key index comprises: determining the relative data address in the found key index, and determining the identifier data address of the sub-storage space where the found key index is located; and summing the relative data address and the identifier data address, and determining the key-value pair data used to respond to the query request based on the data address obtained from the summation.

8. The data indexing method according to any one of claims 5 to 7, characterized by further comprising: determining the key in the key-value pair data corresponding to the found key index as the comparison key; and when the comparison key is the same as the target key, responding to the query request based on the key-value pair data corresponding to the found key index.

9. The data indexing method according to any one of claims 5 to 7, characterized by further comprising: determining the value capacity in the found key index; and loading, according to the value capacity, the value in the key-value pair data corresponding to the found key index to respond to the query request.

10. A data indexing device, characterized by comprising: a data sorting module, configured to sort the key-value pair data according to the keys in the key-value pair data; a storage module, configured to store the sorted key-value pair data into a sub-storage space divided from the storage space, wherein the number of key-value pair data stored in the sub-storage space does not exceed an upper limit; an identifier determination module, configured to determine the largest or smallest key in the sub-storage space as the identifier key, and to determine the data address of the key-value pair data containing the identifier key as the identifier data address, wherein the identifier key is used to represent the key range of the sub-storage space; a spatial index building module, configured to build a spatial index of the sub-storage space based on the identifier key and the identifier data address; a first hash processing module, configured to perform hash processing on the keys in the key-value pair data to obtain a key hash; an address determination module, configured to subtract the identifier data address of the sub-storage space where the key-value pair data is located from the data address of the key-value pair data to obtain the relative data address of the key-value pair data; a creation module, configured to combine the key hash, the relative data address, and the value capacity into a key index corresponding to the key-value pair data, wherein the value capacity is used to represent the size of the value in the key-value pair data; and a key index sorting module, configured to sort the key index according to the key hash to obtain a key index sequence, wherein the key index sequence is used to respond to a lookup operation on the key index.

11. The apparatus according to claim 10, characterized in that the storage module is further configured to: determine the storage capacity of the divided sub-storage space; determine the alignment unit based on the storage capacity and the data address encoding length; and write the sorted key-value pair data into the divided sub-storage space and, during the writing process, perform byte alignment processing on the written key-value pair data according to the alignment unit to obtain the data address of the written key-value pair data.

12. The apparatus according to claim 11, characterized in that the storage module is further configured to perform any one of the following processes to obtain the storage capacity of the divided sub-storage space: determine the stored capacity of the previous sub-storage space of the divided sub-storage space, and predict the storage capacity of the divided sub-storage space based on the stored capacity; or determine the key-value pair data to be written to the divided sub-storage space, and determine the data capacity of the key-value pair data to be written as the storage capacity of the divided sub-storage space.

13. The apparatus according to claim 10, characterized in that the device further comprises: a spatial index sorting module, configured to sort the spatial index according to the identifier key in the spatial index to obtain a spatial index sequence, wherein the spatial index sequence is used to respond to a lookup operation on the spatial index; and the first hash processing module is further configured to map the keys in the key-value pair data to key hashes of a set hash length, wherein the hash collision probability in the sub-storage space is the ratio between the upper limit on the number and the number of encoding types corresponding to the set hash length.

14. A data indexing device, characterized by comprising: a receiving module, configured to receive a query request including a target key, and to determine a divided sub-storage space according to the target key; a second hash processing module, configured to perform hash processing on the target key to obtain a target key hash; and a lookup module, configured to find, in the sub-storage space, the key index whose key hash matches the target key hash, and to determine the corresponding key-value pair data based on the data address in the found key index to respond to the query request; wherein the sub-storage space is used to store key-value pair data sorted by key; the number of key-value pair data stored in the sub-storage space does not exceed an upper limit; the spatial index of the sub-storage space is established based on an identifier key and an identifier data address; the identifier key is the largest or smallest key in the sub-storage space and is used to represent the key range of the sub-storage space; and the identifier data address is the data address of the key-value pair data containing the identifier key.

15. The apparatus according to claim 14, characterized in that the device further comprises: a spatial index lookup module, configured to perform an ordered lookup on a spatial index sequence based on the target key to obtain a target spatial index, so as to find, in the key index sequence of the sub-storage space corresponding to the target spatial index, the key index that matches the target key hash, wherein the key range represented by the identifier key of the target spatial index includes the target key; and the lookup module is further configured to perform an ordered lookup on the key index sequence based on the target key hash to obtain the key index that matches the target key hash.

16. The apparatus according to claim 15, characterized in that the lookup module is further configured to: determine the relative data address in the found key index and the identifier data address of the sub-storage space where the found key index is located; and sum the relative data address and the identifier data address, and determine the key-value pair data used to respond to the query request based on the data address obtained from the summation.

17. The apparatus according to any one of claims 14 to 16, characterized in that the device further comprises: a comparison key determination module, configured to determine the key in the key-value pair data corresponding to the found key index as the comparison key; and a comparison module, configured to respond to the query request based on the key-value pair data corresponding to the found key index when the comparison key is the same as the target key.

18. The apparatus according to any one of claims 14 to 16, characterized in that the device further comprises: a capacity determination module, configured to determine the value capacity in the found key index; and a loading module, configured to load, according to the value capacity, the value in the key-value pair data corresponding to the found key index to respond to the query request.

19. An electronic device, characterized by comprising: a memory, configured to store executable instructions; and a processor, configured to implement, when executing the executable instructions stored in the memory, the data indexing method according to any one of claims 1 to 4, or the data indexing method according to any one of claims 5 to 9.

20. A storage medium, characterized in that it stores executable instructions which, when executed by a processor, implement the data indexing method according to any one of claims 1 to 4, or the data indexing method according to any one of claims 5 to 9.