A data processing method, a retrieval method, a device, an apparatus, and a storage medium

By choosing between array storage and tree-structure storage according to the number of field values when constructing the inverted index table, the scheme addresses the difficulty of reducing cache misses while maintaining query speed, achieving efficient data storage and querying.

CN116521816B · Active · Publication Date: 2026-06-26 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date: 2023-04-26
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, it is difficult, when improving index performance, to both reduce cache misses and ensure data query speed.

Method used

When the inverted index table is built, the content to be indexed is stored in an array structure if its number of field values does not exceed the threshold; otherwise it is stored in a specified tree structure, such as a B-tree or a prefix tree.

Benefits of technology

It balances reducing the number of cache misses against ensuring data query speed, improving data storage continuity and query efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides a data processing method, a retrieval method, an apparatus, a device and a storage medium, relates to the technical field of data processing, and in particular to the technical field of databases, information flows and knowledge graphs. The specific implementation scheme is: in response to receiving a construction instruction for an inverted index table corresponding to a forward index table, determining a target field value of a specified field in the forward index table; determining, based on the forward index table, to-be-indexed content corresponding to the target field value; constructing the inverted index table in a predetermined construction manner, taking the target field value as an index key and the to-be-indexed content as the key value of the index key; wherein the predetermined construction manner comprises: if the number of field values included in the to-be-indexed content does not exceed a predetermined number threshold, storing the to-be-indexed content in an array structure, otherwise storing the to-be-indexed content in a specified tree structure. With the present scheme, the number of cache misses can be reduced while the data query speed is ensured.

Description

Technical Field

[0001] This disclosure relates to the field of data processing technology, and more particularly to the fields of database, information flow, and knowledge graph technology. Specifically, it relates to a data processing method, retrieval method, apparatus, device, and storage medium.

Background Technology

[0002] Indexes are an important data structure in databases, used to improve data access speed. Therefore, improving index performance is particularly important for databases.

[0003] In related technologies, prefix trees are typically used as the data structure for indexing.

Summary of the Invention

[0004] This disclosure provides a data processing method, a retrieval method, an apparatus, a device, and a storage medium.

[0005] According to one aspect of this disclosure, a data processing method is provided, comprising:

[0006] In response to receiving a construction instruction for an inverted index table corresponding to a forward index table, the target field value of a specified field in the forward index table is determined; wherein, the specified field is a field in the forward index table other than the index field;

[0007] Based on the forward index table, the content to be indexed corresponding to the target field value is determined; wherein, the content to be indexed includes: the field value of the index field contained in the specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value;

[0008] Using the target field value as the index key and the content to be indexed as the key value of the index key, an inverted index table is constructed according to a predetermined construction method;

[0009] The predetermined construction method includes: if the number of field values included in the content to be indexed does not exceed a predetermined threshold, the content to be indexed is stored in an array structure; otherwise, the content to be indexed is stored in a specified tree structure.

[0010] According to another aspect of this disclosure, a retrieval method is provided, comprising:

[0011] In response to receiving a search request, determine the search terms indicated by the search request;

[0012] From a specified index table, determine the key value of the target index key that matches the search term; wherein, the specified index table is an inverted index table constructed based on any of the data processing methods described above;

[0013] From the forward index table corresponding to the specified index table, retrieve the data entries whose index field values match the field values contained in the key value, and use them as the initial search results;

[0014] Based on the initial search results, the search results corresponding to the search request are determined.
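For illustration only, the retrieval steps above can be sketched in Python as follows. The table contents, the function name `retrieve`, and the ranking by bid used to turn initial results into final results are assumptions introduced for this sketch; the disclosure does not specify how the final results are derived from the initial ones.

```python
# Hypothetical forward index table keyed by the index field (advertisement ID).
forward_table = {
    1: {"keyword_signature": "signature A", "bid": 0.5},
    2: {"keyword_signature": "signature A", "bid": 0.7},
    3: {"keyword_signature": "signature B", "bid": 0.2},
}
# Inverted index table: index key -> key value (list of advertisement IDs).
inverted_table = {"signature A": [1, 2], "signature B": [3]}


def retrieve(search_term, inverted, forward):
    """Look up the key value for the search term, then fetch matching rows."""
    ids = inverted.get(search_term, [])                       # key value of the matching index key
    initial = [(i, forward[i]) for i in ids if i in forward]  # initial search results
    # Assumed post-processing step: rank by bid, highest first.
    return sorted(initial, key=lambda item: item[1]["bid"], reverse=True)


results = retrieve("signature A", inverted_table, forward_table)
```

A search term with no matching index key simply yields an empty result list in this sketch.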

[0015] According to another aspect of this disclosure, a data processing apparatus is provided, comprising:

[0016] The first response module is used to respond to receiving a construction instruction for an inverted index table corresponding to a forward index table, and to determine the target field value of a specified field in the forward index table; wherein the specified field is a field in the forward index table other than the index field;

[0017] The first determining module is used to determine the content to be indexed corresponding to the target field value based on the forward index table; wherein, the content to be indexed includes: the field value of the index field contained in the specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value;

[0018] The construction module is used to construct an inverted index table according to a predetermined construction method, using the target field value as the index key and the content to be indexed as the key value of the index key.

[0019] The predetermined construction method includes: if the number of field values included in the content to be indexed does not exceed a predetermined threshold, the content to be indexed is stored in an array structure; otherwise, the content to be indexed is stored in a specified tree structure.

[0020] According to another aspect of this disclosure, a retrieval device is provided, comprising:

[0021] The second response module is used to determine the search terms indicated by the search request in response to receiving a search request;

[0022] The second determining module is used to determine the key value of the target index key that matches the search term from a specified index table; wherein the specified index table is an inverted index table constructed based on any of the data processing methods described above.

[0023] The acquisition module is used to acquire, from the forward index table corresponding to the specified index table, the data entries whose index field values match the field values contained in the key value, as the initial search results;

[0024] The third determining module is used to determine the search result corresponding to the search request based on the initial search result.

[0025] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform any of the data processing methods described above, or the retrieval method described above.

[0026] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform a data processing method or a retrieval method according to any of the preceding claims.

[0027] According to another aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the data processing method or the retrieval method according to any of the preceding claims.

[0028] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0029] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0030] Figure 1 is a flowchart of a data processing method according to the present disclosure;

[0031] Figure 2 is a schematic diagram of the data structure of an index table according to this disclosure;

[0032] Figure 3 is another flowchart of the data processing method according to this disclosure;

[0033] Figure 4 is yet another flowchart of the data processing method according to this disclosure;

[0034] Figure 5 is a flowchart of a retrieval method according to this disclosure;

[0035] Figure 6 is a schematic diagram of the structure of a data processing device according to the present disclosure;

[0036] Figure 7 is a schematic diagram of the structure of a retrieval device according to the present disclosure;

[0037] Figure 8 is a block diagram of an electronic device used to implement embodiments of the present disclosure.

Detailed Implementation

[0038] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0039] The following section will first introduce a data processing method provided by an embodiment of this disclosure.

[0040] It should be noted that, in specific applications, the data processing method provided in this disclosure can be applied to various electronic devices, such as personal computers, servers, and other devices with data processing capabilities. Furthermore, it is understood that the data processing method provided in this disclosure can be implemented through software, hardware, or a combination of both.

[0041] The data processing method provided in this embodiment may include the following steps:

[0042] In response to receiving a construction instruction for an inverted index table corresponding to a forward index table, the target field value of a specified field in the forward index table is determined; wherein, the specified field is a field in the forward index table other than the index field;

[0043] Based on the forward index table, the content to be indexed corresponding to the target field value is determined; wherein, the content to be indexed includes: the field value of the index field contained in the specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value;

[0044] Using the target field value as the index key and the content to be indexed as the key value of the index key, an inverted index table is constructed according to a predetermined construction method;

[0045] The predetermined construction method includes: if the number of field values included in the content to be indexed does not exceed a predetermined threshold, the content to be indexed is stored in an array structure; otherwise, the content to be indexed is stored in a specified tree structure.

[0046] In the solution provided in this disclosure, in response to receiving a construction instruction for an inverted index table corresponding to a forward index table, the target field value in the forward index table is first determined. Then, the content to be indexed corresponding to the target field value in the forward index table is determined. Based on the number of field values ​​included in the content to be indexed, the storage structure of the content to be indexed is determined. Thus, the target field value is used as the index key, and the content to be indexed is used as the key value of the index key. The content to be indexed is stored according to the determined storage structure to obtain the inverted index table. Since array structures have contiguous storage space, when the number of field values ​​included in the content to be indexed does not exceed a predetermined threshold, storing the content to be indexed in an array structure can improve the continuity of data storage, thereby reducing the number of cache misses. When the number of field values ​​included in the content to be indexed exceeds the predetermined threshold, storing the content to be indexed in a specified tree structure can ensure data query speed. Therefore, this solution can balance reducing the number of cache misses and ensuring data query speed.

[0047] The data processing method provided in the embodiments of this disclosure will now be described with reference to the accompanying drawings.

[0048] As shown in Figure 1, the data processing method provided in this embodiment may include the following steps:

[0049] S101, in response to receiving a construction instruction for an inverted index table corresponding to a forward index table, determine the target field value of the specified field in the forward index table; wherein, the specified field is a field in the forward index table other than the index field;

[0050] In this embodiment, the forward index table may include multiple data entries, each data entry being a row of data in the forward index table. Each row of data may include the field values of an index field and multiple information fields, and the specified field can be any of the multiple information fields.

[0051] There are multiple ways to determine the specified field, and this disclosure does not limit how it is determined. Optionally, in practical applications, an operator can preset the specified field to be used when constructing an inverted index table for the forward index table. The specified field can be carried in the construction instruction, so that when the construction instruction for the inverted index table corresponding to the forward index table is received, the specified field is first parsed from the instruction and the target field value of the specified field in the forward index table is then determined. For example, suppose the forward index table is an advertisement details table in which the index field is the advertisement ID and the information fields include keyword signature, advertisement bid, advertiser ID, etc. If the specified field is the keyword signature and its field values include signature A, signature B, and signature C, then on receiving the construction instruction, signature A, signature B, or signature C can be determined as the target field value. Here, the keyword signature can be the keyword of the advertisement.

[0052] In addition, it should be noted that the construction instruction for the inverted index table corresponding to the forward index table can be triggered on a schedule, or when the electronic device detects a change in the stored forward index table. Both are reasonable, and the timing of triggering the construction instruction is not limited in the embodiments of this disclosure.

[0053] S102, based on the forward index table, determine the content to be indexed corresponding to the target field value; wherein, the content to be indexed includes: the field value of the index field contained in the specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value;

[0054] In this embodiment, the content to be indexed is the index content of the inverted index table to be constructed, and may include one or more field values of the index field. It is understood that the field values of the index field in the forward index table are all unique, whereas the field values of an information field can repeat, so the same information field value can correspond to multiple index field values. The content to be indexed corresponding to a target field value can therefore contain one or more field values. For example, if the target field value is signature A, and the field values of the specified field in data entry A and data entry B are both signature A, then data entry A and data entry B are specified data entries, and the field values of the index field in data entry A and data entry B constitute the content to be indexed.
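As a minimal sketch of steps S101 and S102, the following Python code groups the index field values of a forward index table by the value of the specified field. The advertisement table contents and the names `forward_table` and `collect_content_to_index` are hypothetical and are used only for illustration.

```python
from collections import defaultdict

# Hypothetical forward index table: ad_id is the index field (unique per row),
# the remaining columns are information fields.
forward_table = [
    {"ad_id": 1, "keyword_signature": "signature A", "bid": 0.5},
    {"ad_id": 2, "keyword_signature": "signature A", "bid": 0.7},
    {"ad_id": 3, "keyword_signature": "signature B", "bid": 0.2},
]


def collect_content_to_index(rows, index_field, specified_field):
    """Map each target field value to the index field values of matching rows."""
    content = defaultdict(list)
    for row in rows:
        # The row is a "specified data entry" for the value row[specified_field].
        content[row[specified_field]].append(row[index_field])
    return dict(content)


content = collect_content_to_index(forward_table, "ad_id", "keyword_signature")
```

Here the content to be indexed for signature A contains two field values (the IDs of both matching advertisements), while signature B has only one.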

[0055] S103, using the target field value as the index key and the content to be indexed as the key value of the index key, construct an inverted index table according to a predetermined construction method;

[0056] The predetermined construction method includes: if the number of field values included in the content to be indexed does not exceed a predetermined threshold, the content to be indexed is stored in an array structure; otherwise, the content to be indexed is stored in a specified tree structure.

[0057] In this embodiment, after the target field value and the content to be indexed are determined through steps S101 and S102, an inverted index table can be constructed with the target field value as the index key and the content to be indexed as the key value of the index key. For example, in practical applications, an operator can set the predetermined number threshold based on experience, so that the inverted index table is constructed in the predetermined construction manner: the storage structure for the content to be indexed corresponding to the target field value is chosen according to the number of field values that content includes, the content is stored in that structure, and the inverted index table is thereby obtained.

[0058] Understandably, since an array structure occupies contiguous storage space, storing the content to be indexed in an array structure when the number of field values does not exceed the predetermined number threshold improves the continuity of data storage and reduces cache misses. For small amounts of data, array storage can also ensure query speed to some extent. For larger amounts of data, however, query speed must take priority, so when the number of field values exceeds the predetermined number threshold, the content is stored in the specified tree structure to guarantee query speed. For example, the specified tree structure can be a B-tree or one of its variants, such as a B+ tree, or a prefix tree. Compared with storing all content to be indexed in a tree structure, this solution balances reducing cache misses against ensuring data query speed.
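The threshold-based choice of storage structure can be sketched as follows. This is an illustrative sketch only: the threshold value, the class names, and the use of a plain binary search tree as a stand-in for the specified tree structure are all assumptions, not the structures used in the disclosure.

```python
THRESHOLD = 4  # hypothetical predetermined number threshold


class ArrayPostings:
    """Contiguous storage; a membership test is a linear scan."""

    def __init__(self, values):
        self.values = list(values)

    def __contains__(self, v):
        return v in self.values


class TreePostings:
    """Stand-in for the specified tree structure: an unbalanced binary search tree."""

    def __init__(self, values):
        self.root = None
        for v in values:
            self._insert(v)

    def _insert(self, v):
        if self.root is None:
            self.root = [v, None, None]  # node layout: [value, left, right]
            return
        node = self.root
        while v != node[0]:
            side = 1 if v < node[0] else 2
            if node[side] is None:
                node[side] = [v, None, None]
                return
            node = node[side]

    def __contains__(self, v):
        node = self.root
        while node is not None:
            if v == node[0]:
                return True
            node = node[1] if v < node[0] else node[2]
        return False


def build_inverted_index(content, threshold=THRESHOLD):
    """Store each key value in an array or a tree depending on its length."""
    index = {}
    for key, values in content.items():
        store = ArrayPostings if len(values) <= threshold else TreePostings
        index[key] = store(values)
    return index


index = build_inverted_index(
    {"signature A": [1, 2], "signature B": list(range(10, 20))}
)
```

In this example the two-element key value for signature A lands in an array, while the ten-element key value for signature B exceeds the assumed threshold and is stored in the tree.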

[0059] Optionally, in one implementation, the specified tree structure is a specified prefix tree, which is a prefix tree in which tree nodes are stored using an array structure.

[0060] In this implementation, the specified tree structure can be a specified prefix tree. A taller tree leads to more cache misses, and the height of a prefix tree depends only on the length of the index key, not on the number of index keys, so its height is more controllable. It therefore outperforms B-trees and their variants in terms of cache misses. Thus, a specified prefix tree can be used for data storage, with the content to be indexed stored as an array in the tree nodes of the specified prefix tree. It is understandable that, since an array structure occupies contiguous storage space, using an array structure for data storage in the tree nodes of the specified prefix tree further improves the continuity of data storage, thereby further reducing the number of cache misses while still balancing cache misses against query speed.
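To illustrate why the height of a prefix tree depends only on key length, the following sketch stores fixed-length 4-byte integer keys in a byte-wise prefix tree whose nodes are arrays. The key length, the 256-slot node layout, and the function names are assumptions made for this example; the sketch uses fixed-size nodes and is not an adaptive radix tree.

```python
KEY_BYTES = 4  # hypothetical fixed key length (32-bit unsigned keys)


def new_node():
    # Children are held in an array indexed directly by the next key byte.
    return [None] * 256


def insert(root, key):
    """Insert a key; the path length is always KEY_BYTES."""
    *prefix, last = key.to_bytes(KEY_BYTES, "big")
    node = root
    for b in prefix:
        if node[b] is None:
            node[b] = new_node()
        node = node[b]
    node[last] = True  # mark the key as present at the last level


def contains(root, key):
    """Look up a key by following one array slot per key byte."""
    node = root
    for b in key.to_bytes(KEY_BYTES, "big"):
        node = node[b]
        if node is None:
            return False
    return node is True


root = new_node()
for k in (1, 256, 70000):
    insert(root, k)
```

However many keys are inserted, a lookup in this sketch touches exactly `KEY_BYTES` array slots, whereas the height of a B-tree grows with the number of keys.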

[0061] Additionally, it should be noted that the method of storing the content to be indexed corresponding to the target field value by combining an array structure with a specified tree structure can be called artRC (adaptive radix tree of RowContainer). Figure 2 illustrates the data structure of an index table using artRC. Here, K is the index key, and V is the key value, also known as the inverted list. The length of the key value, i.e., the number of field values included in the content to be indexed, determines whether the content to be indexed is stored in an array structure or in a specified tree structure. Furthermore, contiguous storage, both in the array structure and in the leaf nodes of the tree structure, uses an RC-style storage structure. This RC-style storage structure contains three key fields: data (an array), used to store contiguous data; valids (a bitset), used to indicate whether data is stored at the corresponding position (1 for yes, 0 for no); and cursor, which indicates the position currently in use and only ever increments.
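One possible reading of the RC-style storage structure is sketched below. The class name `RowContainer`, its methods, and the interpretation of the cursor as the next unused slot are assumptions made for illustration; the disclosure names only the three fields data, valids, and cursor.

```python
class RowContainer:
    """Sketch of an RC-style block: fixed-capacity data array, valids bitset, cursor."""

    def __init__(self, capacity):
        self.data = [None] * capacity  # contiguous slots
        self.valids = 0                # bit i set to 1 means data[i] holds a live value
        self.cursor = 0                # assumed: next unused slot; only ever increments

    def append(self, value):
        if self.cursor >= len(self.data):
            return False               # full: the caller must switch to a larger container
        self.data[self.cursor] = value
        self.valids |= 1 << self.cursor
        self.cursor += 1
        return True

    def remove(self, value):
        for i in range(self.cursor):
            if (self.valids >> i) & 1 and self.data[i] == value:
                self.valids &= ~(1 << i)  # clear the bit; the slot is not reused
                return True
        return False

    def values(self):
        return [self.data[i] for i in range(self.cursor) if (self.valids >> i) & 1]


rc = RowContainer(3)
for v in (10, 20, 30):
    rc.append(v)
rc.remove(20)
```

Because deletion only clears a bit in `valids`, the remaining values stay where they are in the contiguous array, and the cursor never moves backwards.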

[0062] In the solution provided in this disclosure, in response to receiving a construction instruction for an inverted index table corresponding to a forward index table, the target field value in the forward index table is first determined. Then, the content to be indexed corresponding to the target field value in the forward index table is determined. Based on the number of field values ​​included in the content to be indexed, the storage structure of the content to be indexed is determined. Thus, the target field value is used as the index key, and the content to be indexed is used as the key value of the index key. The content to be indexed is stored according to the determined storage structure to obtain the inverted index table. Since array structures have contiguous storage space, when the number of field values ​​included in the content to be indexed does not exceed a predetermined threshold, storing the content to be indexed in an array structure can improve the continuity of data storage, thereby reducing the number of cache misses. When the number of field values ​​included in the content to be indexed exceeds the predetermined threshold, storing the content to be indexed in a specified tree structure can ensure data query speed. Therefore, this solution can balance reducing the number of cache misses and ensuring data query speed.

[0063] Optionally, in another embodiment of this disclosure, there are multiple types of array structures, and the maximum number of elements stored in different types of array structures is different;

[0064] For example, the array structure types may include RC1, RC7, RC16, etc. The maximum number of elements stored in different types of array structures is different; for example, the maximum number of elements stored in RC1 is 1, and the maximum number of elements stored in RC7 is 7.

[0065] Accordingly, in this embodiment, storing the content to be indexed in an array structure may include steps A1-A2:

[0066] A1, determine the array structure of the target type that meets the predetermined selection criteria from multiple array structures; wherein, the predetermined selection criteria is that the storage space loss after storing the content to be indexed is less than a predetermined capacity threshold.

[0067] Understandably, since the field values included in the content to be indexed can have different data sizes, the storage space required to store this content varies. The array structure of the target type can therefore be determined based on the data size of the content to be indexed, so that the storage space loss after storing the content in the array structure of the target type is less than the predetermined capacity threshold. It should be noted that the storage space loss, also referred to below as storage space consumption, is the amount of unused space remaining in the array structure after the content to be indexed is stored, i.e., the amount of free space.

[0068] For example, the predetermined capacity threshold could require that no more than 8 bytes, 16 bytes, etc. remain unused after the content to be indexed is stored. In practical applications, the predetermined capacity threshold can be set by relevant technical personnel according to their needs, and this disclosure does not limit this.

[0069] A2, store the content to be indexed in an array structure of the target type.

[0070] It is understandable that since the array structure of the target type is an array structure that ensures that the storage space consumption after storing the content to be indexed is less than a predetermined capacity threshold, storing the content to be indexed in the array structure of the target type can help select a suitable array structure for the content to be indexed, so as to maximize the use of the storage space of the array structure and reduce memory consumption.
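Steps A1 and A2 can be sketched as a selection over the available array types. The element size of 4 bytes, the 16-byte threshold, and the behaviour of returning `None` when no type qualifies are assumptions for this sketch; only the type names RC1, RC7, and RC16 come from the text above.

```python
ELEMENT_BYTES = 4        # assumed size of one index field value
CAPACITY_THRESHOLD = 16  # assumed predetermined capacity threshold, in bytes


def pick_array_type(n_values, capacities=(1, 7, 16)):
    """Return the capacity of the target array type, or None if none qualifies."""
    for cap in sorted(capacities):
        if cap < n_values:
            continue  # too small to hold the content to be indexed
        loss = (cap - n_values) * ELEMENT_BYTES  # unused bytes after storing
        # The smallest fitting type has the smallest loss, so it decides the outcome.
        return cap if loss < CAPACITY_THRESHOLD else None
    return None
```

With these assumed numbers, five values fit RC7 with 8 bytes unused, whereas two values would leave 20 bytes unused in RC7, so no listed type meets the criterion; a finer-grained set of array types would close that gap.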

[0071] Optionally, in one implementation, storing the content to be indexed in an array structure of the target type may include steps A21-A23:

[0072] A21, Check if there is a memory block in the pre-set memory pool used to store the array structure of the target type;

[0073] A22, if it exists, use the memory block in the memory pool to store the content to be indexed in an array structure of the target type;

[0074] A23, if it does not exist, request the memory block from the system memory, and use the requested memory block to store the content to be indexed in an array structure of the target type.

[0075] In this implementation, a memory pool can be pre-configured, storing pre-allocated memory blocks. When storing the content to be indexed in an array structure of the target type, the system first checks if a memory block for storing the array structure of the target type exists in the pre-configured memory pool. If it exists, the memory block can be directly retrieved from the memory pool, and the content to be indexed can be stored in the array structure of the target type using that memory block. If it does not exist, the memory block for storing the array structure of the target type can be allocated from system memory, and the content to be indexed can be stored in the array structure of the target type using the allocated memory block. Furthermore, after the data stored in the memory block is released, the memory pool can reclaim the memory block, allowing it to be reused and reducing the system performance overhead caused by constantly allocating or destroying memory blocks.
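The pool lookup in steps A21 to A23, together with the reclaiming of released blocks, can be sketched as follows. The class name `MemoryPool`, the use of Python lists as memory blocks, and the `system_allocs` counter are illustrative assumptions.

```python
class MemoryPool:
    """Sketch of a memory pool that hands out blocks by array capacity."""

    def __init__(self):
        self.free = {}          # capacity -> list of idle blocks
        self.system_allocs = 0  # counts fallbacks to system allocation

    def acquire(self, capacity):
        blocks = self.free.get(capacity)
        if blocks:                   # A21/A22: a suitable block exists, reuse it
            return blocks.pop()
        self.system_allocs += 1      # A23: otherwise allocate a fresh block
        return [None] * capacity

    def release(self, block):
        for i in range(len(block)):  # clear the old contents
            block[i] = None
        self.free.setdefault(len(block), []).append(block)


pool = MemoryPool()
first = pool.acquire(7)
pool.release(first)
second = pool.acquire(7)  # served from the pool, no new allocation
```

In this sketch the second request for a 7-element block is satisfied by the block that was just released, so only one system allocation occurs.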

[0076] As can be seen, this solution can reduce the memory consumption of storing content to be indexed.

[0077] Optionally, in another embodiment of this disclosure, on the basis of the embodiment shown in Figure 1 and as shown in Figure 3, after step S103 above, in which the inverted index table is constructed according to the predetermined construction method with the target field value as the index key and the content to be indexed as the key value of the index key, the method further includes:

[0078] S104, in response to a specified update operation for a data entry in the forward index table, update the key value of the index key in the inverted index table corresponding to the update operation;

[0079] The specified update operation includes deletion or addition operations, and the index key corresponding to the update operation is the field value of the specified field in the data entry indicated by the update operation.

[0080] For example, in practical applications, data logs can be read periodically to update the data content in the forward index table. It is understood that the inverted index table is built upon the forward index table, and the content to be indexed in the inverted index table consists of the index field values contained in the specified data entries of the forward index table. Therefore, when a specified update operation (i.e., a deletion or addition) occurs on a data entry in the forward index table, the data content in the inverted index table needs to be synchronized with the data content in the forward index table. At this time, the key value of the index key corresponding to the update operation in the inverted index table can be updated. For example, if a data entry whose specified field value is signature A and whose index field value is index A is deleted from the forward index table, the key value of the index key signature A in the corresponding inverted index table can be updated, i.e., index A can be deleted from the key value of the index key signature A.
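The synchronization in step S104 can be sketched as follows, using a plain dictionary of lists as the inverted index table. The function name `apply_update`, the operation labels, and the choice to drop an index key whose key value becomes empty are assumptions for this sketch.

```python
def apply_update(inverted, op, row, index_field, specified_field):
    """Mirror an add or delete on a forward-table row into the inverted index."""
    key = row[specified_field]  # index key affected by the update
    value = row[index_field]    # index field value to add or remove
    if op == "add":
        inverted.setdefault(key, []).append(value)
    elif op == "delete":
        values = inverted.get(key, [])
        if value in values:
            values.remove(value)
        if not values:
            inverted.pop(key, None)  # assumed: drop keys with no remaining values
    else:
        raise ValueError(f"unsupported update operation: {op}")


inverted = {"signature A": [1, 2]}
apply_update(inverted, "delete",
             {"ad_id": 1, "keyword_signature": "signature A"},
             "ad_id", "keyword_signature")
apply_update(inverted, "add",
             {"ad_id": 3, "keyword_signature": "signature B"},
             "ad_id", "keyword_signature")
```

After these two calls, signature A maps only to advertisement 2 and a new index key for signature B has been created.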

[0081] Since the key value of the index key corresponding to the update operation in the inverted index table may be updated, in this embodiment, on the basis of the embodiment shown in Figure 3 and as shown in Figure 4, the method further includes:

[0082] S401, detect whether the storage space loss corresponding to the target key value in the inverted index table is greater than a predetermined capacity threshold; wherein, the target key value is a key value stored in an array structure, and the storage space loss corresponding to the target key value is the storage space loss of the array structure in which the target key value is stored;

[0083] Understandably, after the forward index table is updated, its corresponding inverted index table will also be updated. Since the number of field values included in the content to be indexed changes as a result, it is possible to check, after the inverted index table is updated, whether the storage space consumption corresponding to the key values stored in array structures in the inverted index table exceeds the predetermined capacity threshold. It should be noted that the size and setting method of this predetermined capacity threshold can be the same as in step A1 above, and will not be repeated here.

[0084] S402, If yes, determine the array structure of the specified type, and change the array structure where the target key value is currently stored to the array structure of the specified type;

[0085] Wherein, the storage space consumption of the array structure of the specified type is less than a predetermined capacity threshold after storing the target key value.

[0086] Understandably, after the inverted index table is updated, if the storage space consumption corresponding to the target key value in the inverted index table is found to be greater than the predetermined capacity threshold, the array structure in which the target key value is currently stored can be replaced. That is, an array structure of the specified type is determined, and the array structure in which the target key value is currently stored is changed to the array structure of the specified type. This ensures that, after the target key value is stored in the array structure of the specified type, the storage space consumption is less than the predetermined capacity threshold, thereby reducing memory consumption.
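Steps S401 and S402 can be sketched as a check that runs after a key value has shrunk or grown. The element size, the threshold, the capacity list, and the function names are assumptions for illustration, and the sketch leaves the key value in place when no listed type would meet the threshold.

```python
ELEMENT_BYTES = 4        # assumed size of one index field value
CAPACITY_THRESHOLD = 16  # assumed predetermined capacity threshold, in bytes
CAPACITIES = (1, 7, 16)  # RC1, RC7, RC16


def loss_bytes(capacity, n_values):
    """Unused bytes in an array of the given capacity holding n_values elements."""
    return (capacity - n_values) * ELEMENT_BYTES


def maybe_retype(capacity, values):
    """Return (capacity, values), moved to a better-fitting array type if needed."""
    if loss_bytes(capacity, len(values)) <= CAPACITY_THRESHOLD:
        return capacity, values  # S401: loss is not greater than the threshold
    for cap in CAPACITIES:
        if cap >= len(values) and loss_bytes(cap, len(values)) < CAPACITY_THRESHOLD:
            return cap, list(values)  # S402: copy into the specified type
    return capacity, values  # no listed type qualifies; keep the current one


new_capacity, new_values = maybe_retype(16, [1, 2, 3, 4, 5, 6])
```

In this example an RC16 array holding six values wastes 40 bytes under the assumed element size, so the key value is moved to RC7, where only 4 bytes remain unused.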

[0087] Furthermore, it is understood that since the specified update operation includes deletion or addition operations, when a specified update operation occurs on a data entry in the forward index table, the number of key values in the inverted index table may change. In this case, the tree structure can be optimized based on the updated number of key values, i.e., the tree structure can be upgraded or downgraded, including operations such as inserting or merging tree nodes. The method of upgrading or downgrading the tree structure can be the same as the method of upgrading or downgrading the tree structure in the prior art, and this disclosure does not limit this aspect.

[0088] As can be seen, this scheme can update the type of the data structure in which the target key value is stored when the inverted index table is updated, so that the storage space consumption after the updated array structure stores the target key value is less than the predetermined capacity threshold, thereby reducing memory consumption.
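The check in steps S401 and S402 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the capacity list, the 4-byte slot size, the per-item cost model, the threshold of 8 bytes per item, and the names `RC_CAPACITIES`, `bytes_per_item`, and `retype` are all assumptions made for illustration.

```python
# Hypothetical capacity classes for array structures (assumed for this sketch).
RC_CAPACITIES = (1, 7, 16, 80, 256)
SLOT_BYTES = 4     # assumption: each slot holds one 32-bit identifier
THRESHOLD = 8.0    # assumed capacity threshold, in bytes per stored item


def bytes_per_item(capacity: int, live: int) -> float:
    """Assumed cost model: allocated slot bytes divided by the live item count."""
    return capacity * SLOT_BYTES / live


def retype(capacity: int, live: int, threshold: float = THRESHOLD) -> int:
    """Return the capacity class a key value should use after an update."""
    if live == 0:
        return RC_CAPACITIES[0]
    # S401: keep the current array if it still fits and stays within the threshold.
    if live <= capacity and bytes_per_item(capacity, live) <= threshold:
        return capacity
    # S402: switch to the smallest class that holds every live item.
    # Under this cost model the smallest fitting class may be the current one.
    for candidate in RC_CAPACITIES:
        if candidate >= live:
            return candidate
    raise ValueError("too many items for any array type; use the tree structure")
```

Under these assumptions, an 80-slot array left with 10 live items costs 32 bytes per item and is moved to the 16-slot class (6.4 bytes per item), while a 16-slot array with 12 live items is left unchanged.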

[0089] After the inverted index table is constructed using the scheme provided in the above embodiments, as shown in Figure 5, the embodiments of this disclosure also provide a retrieval method, including the following steps:

[0090] S501, in response to receiving a search request, determine the search terms indicated by the search request;

[0091] In this embodiment, in response to receiving a search request, the search term indicated by the search request can be determined first, so as to use the search term for data retrieval. For example, in practical applications, a user can enter a search term in the front-end interface to perform a search. At this time, the front-end interface will send a search request to the corresponding back-end processor. This search request may carry a search term, allowing the back-end processor to determine the search term carried in the search request as the search term indicated by the search request after receiving it; alternatively, the search request may not carry a search term, and the back-end processor, after receiving the search request, will obtain the search term corresponding to the search request from the front-end interface. Both of these are reasonable.

[0092] S502, determine the key value of the target index key that matches the search term from the specified index table; wherein, the specified index table is an inverted index table constructed based on any of the above data processing methods;

[0093] In this embodiment, after the search term is determined, the key value of the target index key that matches the search term can be determined from the inverted index table constructed using the method of the above embodiment. That is, the target index key is the index key in the specified index table that is the same as the search term, and the key value stored under that index key is the key value to be determined.

[0094] For example, if the search term is A, and the key values of the index key A in the specified index table include A_id and B_id, then the key values of the target index key that matches the search term are A_id and B_id.

[0095] S503, retrieve, from the forward index table corresponding to the specified index table, the data entries whose index field values match the field values in the key value, and use them as the initial search results;

[0096] In this embodiment, since the key value is the field value of the index field in the forward index table, and the forward index table stores data entries containing the index field and multiple information fields, after determining the key value of the target index key that matches the search term through step S502, the data entries containing the field value of the index field in the forward index table corresponding to the specified index table can be obtained as the initial search result.

[0097] For example, if the key values of the target index key that matches the search term are A_id and B_id, then the data entries in the forward index table corresponding to the specified index table with index field values of A_id and B_id are the initial search results.

[0098] S504. Based on the initial search results, determine the search results corresponding to the search request.

[0099] For example, in practical applications, the search request can also carry search conditions, such as geographical range, time range, etc., so that after obtaining the initial search results, the initial search results can be filtered according to the search conditions to obtain the search results corresponding to the search request; or, the initial search results can be sorted from high to low according to their popularity value, and the first preset number of search results can be selected as the search results corresponding to the search request. Both of these are reasonable.

[0100] As can be seen, this solution can quickly find data containing specific search terms, thus improving query speed.
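Steps S501 to S504 can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionaries standing in for the two index tables, the function name `search`, the parameters `condition` and `top_n`, and the `region` and `popularity` fields are hypothetical and do not come from the disclosure.

```python
def search(term, inverted_index, forward_table, condition=None, top_n=None):
    """Look up a search term and return matching forward-table entries."""
    # S502: key value of the index key that matches the search term.
    key_value = inverted_index.get(term, [])
    # S503: forward-table entries whose index field value appears in the key value.
    results = [forward_table[i] for i in key_value if i in forward_table]
    # S504 (option 1): filter by a search condition carried in the request.
    if condition is not None:
        results = [row for row in results if condition(row)]
    # S504 (option 2): sort by popularity, high to low, and keep the first top_n.
    if top_n is not None:
        results = sorted(results, key=lambda row: row["popularity"], reverse=True)
        results = results[:top_n]
    return results


# Hypothetical data mirroring the A / A_id / B_id example in the text.
inverted = {"A": ["A_id", "B_id"]}
forward = {
    "A_id": {"id": "A_id", "region": "north", "popularity": 5},
    "B_id": {"id": "B_id", "region": "south", "popularity": 9},
}
```

With this data, `search("A", inverted, forward)` returns both entries, passing `condition=lambda row: row["region"] == "north"` keeps only the A_id entry, and `top_n=1` keeps only the B_id entry because its assumed popularity is higher.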

[0101] To better understand the contents of the embodiments of this disclosure, a specific example will be used for illustration below.

[0102] In in-memory database services, there is typically one update thread and a set of retrieval threads. The update thread periodically reads the data log and then updates the tables and indexes; the retrieval threads handle user retrieval requests, translating them into query operations on the indexes and tables. Below, we will introduce the roles of the update and retrieval threads using the inverted index table construction process in advertising operations as an example. The inverted index table construction process is as follows:

[0103] (1) Create a new ad details table based on the data log (corresponding to the forward index table mentioned above). The index field in this ad details table is "ad id", and the other fields are the information fields in this ad details table. The ad details table is shown in Table 1:

[0104] Table 1

[0105]
Ad ID | Buy Word Signature | Ad bid | Advertiser ID | Plan ID | …
30001 | 666111             | 100    | 10001         | 20001   |
30002 | 666112             | 200    | 10001         | 20001   |
30003 | 666111             | 150    | 10002         | 20002   |
30004 | 666111             | 120    | 10003         | 20003   |
…

[0106] (2) Determine "buy word signature" as the specified field from the ad details table, and construct the inverted index table corresponding to the ad details table. The inverted index table is shown in Table 2:

[0107] Table 2

[0108]

[0109] In this inverted index table, each field value of "buy word signature" in the ad details table serves as an index key, and the "ad set" content is the content to be indexed corresponding to that index key.
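The relationship between Table 1 and the inverted index table can be sketched in Python by grouping ad IDs by buy word signature. The rows are taken from Table 1; the field names, the function name `build_inverted_index`, and the use of plain lists for the "ad set" are assumptions for illustration only.

```python
from collections import defaultdict

# Rows from Table 1; the dictionary field names are hypothetical.
AD_DETAILS = [
    {"ad_id": 30001, "buy_word_signature": 666111, "bid": 100, "advertiser_id": 10001, "plan_id": 20001},
    {"ad_id": 30002, "buy_word_signature": 666112, "bid": 200, "advertiser_id": 10001, "plan_id": 20001},
    {"ad_id": 30003, "buy_word_signature": 666111, "bid": 150, "advertiser_id": 10002, "plan_id": 20002},
    {"ad_id": 30004, "buy_word_signature": 666111, "bid": 120, "advertiser_id": 10003, "plan_id": 20003},
]


def build_inverted_index(rows, specified_field, index_field="ad_id"):
    """Map each value of the specified field to the index field values of its rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[specified_field]].append(row[index_field])
    return dict(index)


inverted_index = build_inverted_index(AD_DETAILS, "buy_word_signature")
# inverted_index == {666111: [30001, 30003, 30004], 666112: [30002]}
```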

[0110] The field values in the "ad set" are implemented using a contiguous index data structure. That is, when the number of field values in the "ad set" does not exceed a predetermined threshold, the content of the "ad set" is stored in an array structure. When the number of field values in the "ad set" exceeds the predetermined threshold, the content of the "ad set" is stored in a prefix tree where the tree nodes are stored in an array structure.
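The threshold rule in the paragraph above can be sketched in Python as follows. This is a simplified illustration, not the disclosed structure: the source states only that the tree nodes are stored in an array structure, so the 4-bit fan-out, the 32-bit identifier width, the threshold value of 4, and the names `ArrayTrie` and `store_ad_set` are all assumptions.

```python
THRESHOLD = 4  # hypothetical predetermined number threshold


class ArrayTrie:
    """Prefix tree over 32-bit IDs; every node is a flat 16-slot array."""
    LEVELS = 8  # eight 4-bit digits cover a 32-bit identifier

    def __init__(self):
        self.root = [None] * 16
        self.size = 0

    def add(self, value):
        node = self.root
        # Walk the seven most significant digits, creating child arrays on demand.
        for level in range(self.LEVELS - 1, 0, -1):
            digit = (value >> (level * 4)) & 0xF
            if node[digit] is None:
                node[digit] = [None] * 16
            node = node[digit]
        # The last digit marks membership in the leaf array.
        if node[value & 0xF] is None:
            node[value & 0xF] = True
            self.size += 1

    def __contains__(self, value):
        node = self.root
        for level in range(self.LEVELS - 1, 0, -1):
            node = node[(value >> (level * 4)) & 0xF]
            if node is None:
                return False
        return node[value & 0xF] is True


def store_ad_set(ad_ids, threshold=THRESHOLD):
    """Store a small ad set as an array, a large one as an array-backed prefix tree."""
    ids = list(ad_ids)
    if len(ids) <= threshold:
        return ids
    trie = ArrayTrie()
    for ad_id in ids:
        trie.add(ad_id)
    return trie
```

A three-element ad set stays a plain list, which keeps its elements contiguous; a nine-element ad set exceeds the assumed threshold and is stored in the tree, where a lookup touches a fixed number of small arrays.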

[0111] The array structure includes five types: RC1, RC7, RC16, RC80, and RC256. The maximum number of elements stored in each type of array structure differs. The RC (RowContainer) storage structure contains three key fields: data (array), used to store contiguous data; valids bitset, used to indicate whether data is stored at the corresponding position (1 for yes, 0 for no); and cursor, used to indicate the current position, which only increments.
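The three key fields of the RC storage structure can be sketched in Python as below. The field names `data`, `valids`, and `cursor` follow the text; everything else is an assumption for illustration, including the reading of the numeric suffix of each type name as its maximum element count, the use of a Python integer as the bitset, and the method names.

```python
# Assumption: the suffix of each type name is its maximum number of elements.
RC_TYPES = {"RC1": 1, "RC7": 7, "RC16": 16, "RC80": 80, "RC256": 256}


class RowContainer:
    def __init__(self, rc_type):
        capacity = RC_TYPES[rc_type]
        self.data = [0] * capacity  # contiguous storage for the elements
        self.valids = 0             # bitset: bit p is 1 if data[p] holds a live value
        self.cursor = 0             # next write position; it only increments

    def append(self, value):
        if self.cursor >= len(self.data):
            raise OverflowError("container full; a larger RC type is needed")
        pos = self.cursor
        self.data[pos] = value
        self.valids |= 1 << pos
        self.cursor += 1
        return pos

    def remove(self, value):
        # Deletion only clears the valid bit; the cursor never moves back.
        for pos in range(self.cursor):
            if (self.valids >> pos) & 1 and self.data[pos] == value:
                self.valids &= ~(1 << pos)
                return True
        return False

    def items(self):
        return [self.data[p] for p in range(self.cursor) if (self.valids >> p) & 1]


rc = RowContainer("RC7")
for ad_id in (30001, 30003, 30004):
    rc.append(ad_id)
rc.remove(30003)
```

After these calls the container reports the live items 30001 and 30004, the cursor stays at 3, and the valid bits are `0b101`, which shows how a deleted slot is skipped rather than compacted in this sketch.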

[0112] When using an array structure to store the content of an "ad set", the array structure of the target type that meets the predetermined selection condition can be determined first. That is, after the content of the "ad set" is stored, the memory consumption of the array structure of the target type does not exceed 8 bytes per item. The content of the "ad set" is then stored in the array structure of the target type.

[0113] Furthermore, in practical applications, the number of RC1 type array structures used reaches hundreds of millions. Therefore, a memory pool approach can be adopted. Before using an RC1 type array structure, the memory pool can be checked to see if a memory block for storing RC1 type array structures exists. If it exists, the memory block in the memory pool is used to store the indexed content in an RC1 array structure; if it does not exist, the memory block for storing the RC1 type array structure is allocated from system memory. After the data stored in the memory block is released, the memory pool can also reclaim the memory block, allowing it to be reused. Thus, when the memory block is needed later, it can be directly obtained from the memory pool, reducing the system performance loss caused by constantly allocating or destroying memory blocks.
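The memory pool behaviour described above can be sketched in Python. The class name `BlockPool`, its methods, and the use of a Python list to stand in for a memory block allocated from system memory are assumptions for illustration; the sketch shows only the check-pool-first and reclaim-on-release logic.

```python
class BlockPool:
    """Keeps released blocks, grouped by capacity, so they can be reused."""

    def __init__(self):
        self._free = {}               # capacity -> list of reclaimed blocks
        self.system_allocations = 0   # counts blocks requested from "system memory"

    def acquire(self, capacity):
        free_blocks = self._free.get(capacity)
        if free_blocks:
            # A suitable block exists in the pool: reuse it.
            return free_blocks.pop()
        # No block in the pool: fall back to a fresh allocation.
        self.system_allocations += 1
        return [0] * capacity

    def release(self, block):
        # Clear the block and hand it back to the pool for later reuse.
        for i in range(len(block)):
            block[i] = 0
        self._free.setdefault(len(block), []).append(block)


pool = BlockPool()
first = pool.acquire(1)    # pool is empty, so this allocates
pool.release(first)
second = pool.acquire(1)   # served from the pool, no new allocation
```

In this sketch the second request returns the very block that was released, so the allocation counter stays at 1.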

[0114] The update thread serves the following purposes:

[0115] After periodically reading the data log, the update thread updates the content of the ad details table based on the latest data log. In response to this update, the key value of the index key corresponding to the update operation in the inverted index table is also updated. Since the update operation includes deletion or addition, the content of the "ad set" in the inverted index table may change. Based on the updated "ad set" content, the tree structure can be optimized, including upgrading or downgrading it. The upgrading or downgrading of the tree structure can be done in the same way as in existing technologies, and this example does not limit this aspect. Furthermore, the type of the array structure can be re-determined based on the number of field values in the "ad set" content to decide whether the array structure type needs to be updated, thereby keeping the memory consumption of the array structure storing the "ad set" content at no more than 8 bytes per item.
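How an addition or deletion in the ad details table propagates to the inverted index table can be sketched in Python. The function name `apply_update`, the operation strings, the field names, and the use of a dictionary of lists for the index are hypothetical; tree optimization and array re-typing are left out of this sketch.

```python
def apply_update(index, op, row,
                 specified_field="buy_word_signature", index_field="ad_id"):
    """Update the key value of the index key that corresponds to one log entry."""
    key = row[specified_field]   # index key: the specified field's value in the row
    value = row[index_field]     # element of the "ad set": the row's index field value
    if op == "add":
        ad_set = index.setdefault(key, [])
        if value not in ad_set:
            ad_set.append(value)
    elif op == "delete":
        ad_set = index.get(key, [])
        if value in ad_set:
            ad_set.remove(value)
        if key in index and not index[key]:
            del index[key]       # drop index keys whose ad set became empty
    else:
        raise ValueError(f"unsupported update operation: {op}")


index = {666111: [30001, 30003, 30004], 666112: [30002]}
apply_update(index, "add", {"ad_id": 30005, "buy_word_signature": 666112})
apply_update(index, "delete", {"ad_id": 30003, "buy_word_signature": 666111})
# index == {666111: [30001, 30004], 666112: [30002, 30005]}
```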

[0116] The retrieval thread serves the following purposes:

[0117] After a user sends a search request, the retrieval thread returns a set of matching advertisements based on the search terms included in the request. The specific implementation process is as follows:

[0118] (1) In response to receiving a search request, determine the search term indicated by the search request; for example, the search term may be a set of field values of "buy word signature";

[0119] (2) From the inverted index table, obtain the key value of the index key that matches the search term, that is, the "ad set" content corresponding to the field value of "buy word signature";

[0120] (3) From the ad details table corresponding to the inverted index table, find the data entries whose index field values are the field values in the "ad set" content, and use them as the initial search results;

[0121] (4) Filter out the initial search results that do not meet the search conditions of the search request to obtain the search results corresponding to the search request; wherein, the search conditions can be conditions such as region, delivery time, etc.

[0122] In this scheme, when the number of field values included in the content to be indexed does not exceed a predetermined threshold, the content is stored in an array structure, which improves the continuity of data storage and reduces the number of cache misses. When the number of field values included in the content to be indexed exceeds the predetermined threshold, the content is stored in a specified tree structure, which ensures data query speed, thus balancing the reduction of cache misses and the guarantee of query speed. Furthermore, by storing the content to be indexed in an array structure of the target type that keeps memory consumption at no more than 8 bytes per item, memory consumption can be reduced. By using a memory pool to hold memory blocks for the different types of array structures, a memory block of the required type can be retrieved from the memory pool first when it is needed. After the data stored in the memory block is released, the memory pool can also reclaim the memory block, allowing it to be reused. This reduces the system performance loss caused by constantly allocating or destroying memory blocks.

[0123] Based on the embodiments of the above data processing methods, this disclosure also provides a data processing apparatus. As shown in Figure 6, the apparatus includes:

[0124] The first response module 610 is configured to, in response to receiving a construction instruction for an inverted index table corresponding to a forward index table, determine the target field value of a specified field in the forward index table; wherein, the specified field is a field in the forward index table other than the index field;

[0125] The first determining module 620 is used to determine the content to be indexed corresponding to the target field value based on the forward index table; wherein, the content to be indexed includes: the field value of the index field contained in the specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value.

[0126] The construction module 630 is used to construct an inverted index table according to a predetermined construction method, using the target field value as the index key and the content to be indexed as the key value of the index key.

[0127] The predetermined construction method includes: if the number of field values included in the content to be indexed does not exceed a predetermined threshold, the content to be indexed is stored in an array structure; otherwise, the content to be indexed is stored in a specified tree structure.

[0128] Optionally, the specified tree structure is a specified prefix tree, which is a prefix tree in which tree nodes are stored using an array structure.

[0129] Optionally, the array structure can be of multiple types, and the maximum number of elements stored in different types of array structures is different;

[0130] The method of storing the content to be indexed in an array structure includes:

[0131] Determine an array structure of a target type that meets a predetermined selection condition from multiple array structures; wherein the predetermined selection condition is that the storage space loss after storing the content to be indexed is less than a predetermined capacity threshold.

[0132] The content to be indexed is stored in an array structure of the target type.

[0133] Optionally, storing the content to be indexed in an array structure of the target type includes:

[0134] Check if there is a memory block in the pre-set memory pool for storing the array structure of the target type;

[0135] If it exists, use the memory block in the memory pool to store the content to be indexed in an array structure of the target type;

[0136] If it does not exist, request the memory block from the system memory, and use the requested memory block to store the content to be indexed in an array structure of the target type.

[0137] Optionally, after constructing an inverted index table using the target field value as the index key and the content to be indexed as the key value of the index key according to a predetermined construction method, the method further includes:

[0138] In response to a specified update operation for a data entry in the forward index table, the key value of the index key in the inverted index table corresponding to the update operation is updated;

[0139] The specified update operation includes deletion or addition operations, and the index key corresponding to the update operation is the field value of the specified field in the data entry indicated by the update operation.

[0140] Optionally, the method further includes:

[0141] Detect whether the storage space loss corresponding to the target key value in the inverted index table is greater than a predetermined capacity threshold; wherein, the target key value is a key value stored in an array structure, and the storage space loss corresponding to the target key value is the storage space loss of the array structure in which the target key value is stored;

[0142] If so, determine the array structure of the specified type, and change the array structure in which the target key value is currently stored to the array structure of the specified type;

[0143] Wherein, the storage space consumption of the specified array structure after storing the target key value is less than a predetermined capacity threshold.

[0144] Based on the embodiments of the above-described retrieval method, this disclosure also provides a retrieval device. As shown in Figure 7, the device includes:

[0145] The second response module 710 is used to determine the search terms indicated by the search request in response to receiving the search request;

[0146] The second determining module 720 is used to determine the key value of the target index key that matches the search term from a specified index table; wherein the specified index table is an inverted index table constructed based on any of the above data processing methods;

[0147] The acquisition module 730 is used to acquire data entries from the forward index table corresponding to the specified index table that match the field values ​​of the index fields with the field values ​​of the key values, as the initial search results;

[0148] The third determining module 740 is used to determine the search result corresponding to the search request based on the initial search result.

[0149] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0150] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0151] An electronic device provided in this disclosure may include:

[0152] At least one processor; and

[0153] A memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the data processing method or retrieval method described above.

[0154] This disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the above-described data processing methods or the steps of the above-described retrieval methods.

[0155] In another embodiment provided in this disclosure, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to perform the steps of any of the data processing methods in the above embodiments, or the steps of the above retrieval methods.

[0156] Figure 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0157] As shown in Figure 8, device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. Input/output (I/O) interface 805 is also connected to bus 804.

[0158] Multiple components in device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, etc.; output unit 807, such as various types of monitors, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0159] The computing unit 801 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as data processing methods or retrieval methods. For example, in some embodiments, the data processing method or retrieval method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed on device 800 via ROM 802 and / or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the data processing method or retrieval method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a data processing method or a retrieval method by any other suitable means (e.g., by means of firmware).

[0160] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0161] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0162] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0163] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0164] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0165] Computer systems can include clients and servers. A client and a server are generally remote from each other and typically interact via a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0166] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0167] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A data processing method, comprising: In response to receiving a construction instruction for an inverted index table corresponding to a forward index table, a target field value for a specified field is determined in the forward index table; wherein, the specified field is any information field in the forward index table other than the index field; Based on the forward index table, the content to be indexed corresponding to the target field value is determined; wherein, the content to be indexed includes: the field value of the index field contained in the specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value; Using the target field value as the index key and the content to be indexed as the key value of the index key, an inverted index table is constructed according to a predetermined construction method; The predetermined construction method includes: if the number of field values ​​included in the content to be indexed does not exceed a predetermined threshold, the content to be indexed is stored in an array structure; otherwise, the content to be indexed is stored in a specified tree structure. The array structure can be of multiple types, and the maximum number of elements stored in different types of array structures varies. The method of storing the content to be indexed in an array structure includes: determining an array structure of a target type that meets predetermined selection criteria from among multiple array structures; wherein the predetermined selection criteria are that the storage space loss after storing the content to be indexed is less than a predetermined capacity threshold; and storing the content to be indexed in the array structure of the target type. 
The step of storing the content to be indexed in an array structure of the target type includes: Check if there is a memory block in the pre-set memory pool for storing the array structure of the target type; If it exists, use the memory block in the memory pool to store the content to be indexed in an array structure of the target type; If it does not exist, request the memory block from the system memory, and use the requested memory block to store the content to be indexed in an array structure of the target type.

2. The method according to claim 1, wherein, The specified tree structure is a specified prefix tree, which is a prefix tree in which tree nodes are stored using an array structure.

3. The method according to claim 1, wherein, After constructing an inverted index table according to a predetermined construction method, using the target field value as the index key and the content to be indexed as the key value of the index key, the method further includes: In response to a specified update operation for a data entry in the forward index table, the key value of the index key in the inverted index table corresponding to the update operation is updated; The specified update operation includes deletion or addition operations, and the index key corresponding to the update operation is the field value of the specified field in the data entry indicated by the update operation.

4. The method according to claim 3, further comprising: Detect whether the storage space loss corresponding to the target key value in the inverted index table is greater than a predetermined capacity threshold; wherein, the target key value is a key value stored in an array structure, and the storage space loss corresponding to the target key value is the storage space loss of the array structure in which the target key value is stored; If so, determine the array structure of the specified type, and change the array structure in which the target key value is currently stored to the array structure of the specified type; Wherein, the storage space consumption of the specified array structure after storing the target key value is less than a predetermined capacity threshold.

5. A retrieval method, comprising: In response to receiving a search request, determine the search terms indicated by the search request; From a specified index table, determine the key value of the target index key that matches the search term; wherein the specified index table is an inverted index table constructed based on the method described in any one of claims 1-4; From the forward index table corresponding to the specified index table, retrieve the data entries whose field values ​​of the index fields match the field values ​​of the key values, and use them as the initial search results; Based on the initial search results, the search results corresponding to the search request are determined.

6. A data processing apparatus, comprising: a first response module configured to, in response to receiving a construction instruction for an inverted index table corresponding to a forward index table, determine a target field value of a specified field in the forward index table, wherein the specified field is any information field in the forward index table other than the index field; a first determining module configured to determine, based on the forward index table, the content to be indexed corresponding to the target field value, wherein the content to be indexed includes the field value of the index field contained in a specified data entry in the forward index table, and the specified data entry is a data entry in which the specified field has the target field value; and a construction module configured to construct the inverted index table according to a predetermined construction method, using the target field value as the index key and the content to be indexed as the key value of the index key; wherein the predetermined construction method includes: if the number of field values included in the content to be indexed does not exceed a predetermined number threshold, storing the content to be indexed in an array structure, and otherwise storing the content to be indexed in a specified tree structure; wherein there are multiple types of array structure, and different types of array structure can store different maximum numbers of elements; wherein storing the content to be indexed in an array structure includes: determining, from among the multiple types of array structure, an array structure of a target type that meets a predetermined selection criterion, the predetermined selection criterion being that the storage space loss after storing the content to be indexed is less than a predetermined capacity threshold, and storing the content to be indexed in the array structure of the target type; and wherein storing the content to be indexed in the array structure of the target type includes: checking whether a preset memory pool contains a memory block for storing the array structure of the target type; if it does, using that memory block from the memory pool to store the content to be indexed in the array structure of the target type; and if it does not, requesting a memory block from system memory and using the requested memory block to store the content to be indexed in the array structure of the target type.
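
The construction logic recited in claim 6 (array versus tree, choice of array type, and memory-pool reuse) can be sketched in Python as below. All constants and names are hypothetical. A Python list stands in for a memory block, allocating a new list stands in for requesting system memory, storage space loss is assumed to be the count of unused slots, and the tree branch is only a placeholder that returns sorted values rather than a real B-tree or prefix tree. The content to be indexed is assumed to be non-empty.

```python
COUNT_THRESHOLD = 16               # hypothetical predetermined number threshold
ARRAY_CAPACITIES = (4, 8, 12, 16)  # hypothetical array types (maximum element counts)
LOSS_THRESHOLD = 4                 # hypothetical capacity threshold, in unused slots


class MemoryPool:
    """Keeps released blocks grouped by capacity so they can be reused."""

    def __init__(self):
        self._free = {}
        self.system_allocations = 0

    def acquire(self, capacity):
        blocks = self._free.get(capacity)
        if blocks:
            return blocks.pop()        # reuse a pooled block of the target type
        self.system_allocations += 1
        return [None] * capacity       # stand-in for requesting system memory

    def release(self, block):
        self._free.setdefault(len(block), []).append(block)


def pick_capacity(count):
    """Smallest array type that holds `count` elements with loss below the threshold."""
    for capacity in ARRAY_CAPACITIES:
        if capacity >= count and capacity - count < LOSS_THRESHOLD:
            return capacity
    raise ValueError(f"no array type fits {count} elements")


def store(values, pool):
    """Store the content to be indexed as an array or, above the threshold, as a tree."""
    if len(values) > COUNT_THRESHOLD:
        return ("tree", sorted(values))  # placeholder for the specified tree structure
    capacity = pick_capacity(len(values))
    block = pool.acquire(capacity)
    block[:len(values)] = values
    # Clear any stale data left in a reused block.
    block[len(values):] = [None] * (capacity - len(values))
    return ("array", block)
```

In this sketch a block released back to the pool is handed out again for the next key value of the same array type, so `system_allocations` stays unchanged on reuse.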

7. A retrieval device, comprising: a second response module configured to, in response to receiving a search request, determine the search term indicated by the search request; a second determining module configured to determine, from a specified index table, the key value of a target index key that matches the search term, wherein the specified index table is an inverted index table constructed by the method according to any one of claims 1-4; an acquisition module configured to acquire, from the forward index table corresponding to the specified index table, the data entries whose index field has a field value matching a field value included in the key value, as initial search results; and a third determining module configured to determine, based on the initial search results, the search results corresponding to the search request.

8. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-5.

9. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-5.

10. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-5.