Data processing method and device based on global index, computer device and readable storage medium

By using a global index data processing method, the problem of poor query performance in PostgreSQL partitioned tables was solved, enabling efficient data querying and index maintenance across partitions, thus improving query efficiency and accuracy.

CN121542271BActive Publication Date: 2026-06-19CHINA TELECOM CLOUD TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM CLOUD TECH CO LTD
Filing Date
2026-01-19
Publication Date
2026-06-19

Smart Images

  • Figure CN121542271B_ABST
    Figure CN121542271B_ABST
Patent Text Reader

Abstract

This application relates to a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product based on a global index. The method includes: for a global index of a partitioned table, obtaining the sub-partition identifier from the currently traversed index tuple; the partitioned table includes multiple sub-partitioned tables; if a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partitioned table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partitioned table, and each key-value pair includes the sub-partition identifier and the relation descriptor pointer. By adopting this method, multiple openings of the same sub-partitioned table can be avoided. By searching the system cache, rapid switching of the memory structure of each sub-partition in the global index lookup is achieved, reducing the overhead of frequently opening sub-partitions and improving the data query performance of the partitioned table.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This application relates to the field of database technology, and in particular to a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product based on a global index. Background Technology

[0002] With the rapid development of information technology, the amount of data in all industries is exploding. To efficiently manage massive amounts of data, mainstream relational databases (such as PostgreSQL) generally provide table partitioning functionality. Partitioning significantly improves query performance by physically dividing a large table into multiple smaller, more manageable sub-tables (called partitions or sub-partitions), while logically maintaining it as a unified table.

[0003] However, PostgreSQL does not support global indexes. In other words, the indexes created for partitioned tables in PostgreSQL are actually local indexes created for each sub-table / sub-partition. Local indexes only index data within their respective sub-partitions and cannot maintain global uniqueness across partitions. Consequently, when the query conditions do not include the partition key, the database engine needs to scan all local indexes on all sub-partitions sequentially (Index Scan) and then merge the results using the Append operator. The process of scanning the local indexes of sub-partitions multiple times is equivalent to performing multiple index lookups. When there are many partitions, the I / O and CPU overhead caused by multiple index lookups is enormous, resulting in poor data query performance. Summary of the Invention

[0004] Therefore, it is necessary to provide a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product based on a global index that can improve data query efficiency in response to the above-mentioned technical problems.

[0005] Firstly, this application provides a data processing method based on a global index, including:

[0006] For the global index of the partition table, obtain the sub-partition identifier in the currently traversed index tuple; the partition table includes multiple sub-partition tables;

[0007] If a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer.

[0008] In some embodiments, the method further includes:

[0009] If the relation descriptor pointer corresponding to the sub-partition identifier is not found in the private hash table, then the target sub-partition table corresponding to the sub-partition identifier is searched in the multiple sub-partition tables based on the sub-partition identifier, and the data corresponding to the index tuple is extracted in the target sub-partition table based on the physical location data contained in the index tuple.

[0010] In some embodiments, the method further includes:

[0011] Open the target sub-partition table, obtain the relation descriptor pointer of the target sub-partition table, form a key-value pair based on the relation descriptor pointer and the sub-partition identifier of the target sub-partition table, and add the key-value pair to the private hash table.

[0012] In some embodiments, the method further includes:

[0013] Receive the global index creation statement corresponding to the partition table; perform syntax parsing on the global index creation statement to obtain keyword data;

[0014] If the keyword data contains the target global keyword, then traverse each sub-partition table in the partition table to obtain the index data of each sub-partition table;

[0015] A global index for the partition table is generated based on the index data of each of the sub-partition tables.

[0016] In some embodiments, the index data includes the index key values ​​of the sub-partition tables and the sub-partition identifiers; generating a global index of the partition table based on the index data of each of the sub-partition tables includes:

[0017] Based on the index key value, sub-partition identifier, and physical location data of each data row, an index tuple is generated for each data row in the sub-partition table;

[0018] A global index for the partition table is generated by combining the index tuples corresponding to multiple data rows in the multiple sub-partition tables.

[0019] In some embodiments, the method further includes:

[0020] Obtain statistical information of the partition table, including basic statistical information and global statistical information;

[0021] The statistical information and each search path are optimized by a preset optimization algorithm to obtain the search consumption data corresponding to each preset search path.

[0022] The system iterates through each index tuple in the global index of the partition table by finding the preset search path that consumes the least amount of data.

[0023] In some embodiments, the method further includes:

[0024] Obtain the first statistical information of each sub-partition table in the partition table;

[0025] Based on the first statistical information of each of the sub-partition tables, the basic statistical information of the partition table is obtained. The basic statistical information includes the total number of disk pages and the total number of rows, where the total number of rows is the total number of tuples.

[0026] In some embodiments, the method further includes:

[0027] The statistical metadata of each sub-partition is aggregated to obtain the global statistical information of the partition table. The global statistical information includes one or more of the following: the proportion of null values, the number of different values, the list of the most common values, and the histogram.

[0028] In some embodiments, after the step of obtaining the sub-partition identifier in the currently traversed index tuple, the method further includes:

[0029] Based on the sub-partition identifier, query the set of sub-partition identifiers to be cleaned corresponding to the partition table;

[0030] If the sub-partition identifier is found in the set of sub-partition identifiers to be cleaned, then a recyclable identifier is added to the sub-partition identifier.

[0031] In some embodiments, the method further includes:

[0032] After performing data definition language operations on the target sub-partition table in the partition table, it is determined that each index tuple corresponding to the target sub-partition table in the global index is in an invalid state, and an invalid data table is generated based on the first identifier of the partition table and the second identifier of each sub-partition table in an invalid state.

[0033] When the triggering conditions for the cleanup operation are met, the invalid data table corresponding to the partition table is queried to generate a set of sub-partition identifiers to be cleaned corresponding to the partition table.

[0034] In some embodiments, the method further includes:

[0035] The cleanup process deletes the index tuples corresponding to the sub-partition identifiers marked with the recyclable identifier.

[0036] Secondly, this application also provides a data processing apparatus based on a global index, comprising:

[0037] The first acquisition module is used to obtain the sub-partition identifier in the index tuple currently being traversed for the global index of the partition table; the partition table includes multiple sub-partition tables.

[0038] The first query module is used to perform a data row read operation through the relation descriptor pointer to obtain the data corresponding to the index tuple if the relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer.

[0039] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0040] For the global index of the partition table, obtain the sub-partition identifier in the currently traversed index tuple; the partition table includes multiple sub-partition tables;

[0041] If a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer.

[0042] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the following steps:

[0043] For the global index of the partition table, obtain the sub-partition identifier in the currently traversed index tuple; the partition table includes multiple sub-partition tables;

[0044] If a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer.

[0045] Fifthly, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps:

[0046] For the global index of the partition table, obtain the sub-partition identifier in the currently traversed index tuple; the partition table includes multiple sub-partition tables;

[0047] If a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer.

[0048] The aforementioned data processing method, apparatus, computer device, computer-readable storage medium, and computer program product based on a global index, wherein the method includes: for a global index of a partitioned table, obtaining the sub-partition identifier from the currently traversed index tuple; the partitioned table includes multiple sub-partitioned tables; if a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partitioned table, then a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partitioned table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer. By adopting this method, multiple openings of the same sub-partitioned table can be avoided, and fast switching of the memory structure of each sub-partition in the global index lookup can be achieved by searching the system cache, reducing the overhead of frequently opening sub-partitions, improving the data query performance of the partitioned table, and achieving efficient global indexing of the partitioned table. Attached Figure Description

[0049] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0050] Figure 1 This is a flowchart illustrating a data processing method based on a global index in one embodiment;

[0051] Figure 2 This is a flowchart illustrating the steps for generating a global index in one embodiment;

[0052] Figure 3 This is a flowchart illustrating the steps of traversing the global index in one embodiment;

[0053] Figure 4 This is a flowchart illustrating the cleaning steps in one embodiment;

[0054] Figure 5 This is a schematic diagram of the storage structure of the global index of a partition table in one embodiment;

[0055] Figure 6 This is a schematic diagram of the structure of a hash table and a global index in one embodiment;

[0056] Figure 7 A flowchart of the cleaning steps in one embodiment;

[0057] Figure 8 This is a structural block diagram of a data processing device based on a global index in one embodiment;

[0058] Figure 9 This is an internal structural diagram of a computer device in one embodiment. Detailed Implementation

[0059] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0060] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from the second element. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and / or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.

[0061] In one embodiment, such as Figure 1 As shown, a data processing method based on a global index is provided. This embodiment illustrates the application of this method to a terminal, but it is understood that the method can also be applied to a server, or to a system including both a terminal and a server, and is implemented through the interaction between the terminal and the server. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, etc. Portable wearable devices can include smartwatches, smart bracelets, head-mounted devices, etc. The server can be a standalone server or a server cluster consisting of multiple servers. In this embodiment, the data processing method based on a global index includes the following steps:

[0062] Step 102: For the global index of the partition table, obtain the sub-partition identifier in the index tuple currently being traversed.

[0063] A partitioned table comprises multiple sub-partitioned tables. A partitioned table refers to a data table in a database system that logically belongs to a whole but physically includes multiple sub-tables. The multiple sub-tables within a partitioned table are called individual sub-partitioned tables. Each sub-partitioned table stores a subset of data, containing multiple data rows. For example, the data stored in each sub-partitioned table can be divided according to a partition key, which can optionally be a date range, a numerical range, a version number, etc. A global index is the index structure of the partitioned table, independent of the physical partitioning of the partitioned table. The construction process of this global index will be described in detail in the following embodiments. The index entries in the global index directly point to specific data rows in the sub-partitioned tables. The global index can be a single, logically unified index object, distinct from the collection of local indexes of multiple sub-partitioned tables. The global index contains multiple index tuples, each corresponding to a data row in any sub-partitioned table. The index tuple / index entry includes a key value, a sub-partition identifier (tableoid), and physical location data (ctid).

[0064] Specifically, the terminal can traverse each index tuple in the global index corresponding to the partition table. For the index tuple currently being traversed by the terminal, it can obtain the sub-partition identifier contained in the index tuple. The sub-partition identifier is used to indicate the sub-partition where the data row corresponding to the index tuple is located. Optionally, the terminal can obtain the data corresponding to the tableoid field in the index tuple to obtain the sub-partition identifier of the index tuple.

[0065] Step 104: If the relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, then the data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple.

[0066] The private hash table comprises multiple key-value pairs corresponding to each sub-partition table. Each key-value pair includes a sub-partition identifier and a relation descriptor pointer. The private hash table corresponds one-to-one with each partition table. This private hash table can be a lightweight hash table private to each partition table, serving as a cache for the relation descriptors of each sub-partition. The relation descriptor pointer can correspond to each sub-partition and represents the complete in-memory representation of the corresponding sub-partition table, containing all necessary metadata and buffer access status for data retrieval. The data corresponding to the index tuple refers to the actual data stored in the data rows of the sub-partition table corresponding to the index tuple.

[0067] Specifically, after obtaining the sub-partition identifier corresponding to the index tuple, the terminal can query the private hash table of the partition table based on this sub-partition identifier. That is, it searches for the sub-partition identifier in the private hash table. If the cache is hit, meaning the terminal finds the sub-partition identifier in the private hash table, it extracts the relation descriptor pointer corresponding to the sub-partition identifier from the private hash table and uses this relation descriptor pointer to perform a data read operation. The result of this data read operation is the data corresponding to the index tuple obtained by the terminal. This data read operation is a data row (HeapTuple) read operation for the sub-partition table.

[0068] In the aforementioned data processing method based on a global index, the sub-partition identifier is retrieved from the currently traversed index tuple for the global index of the partitioned table. The partitioned table comprises multiple sub-partitioned tables. If a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partitioned table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple. The private hash table includes multiple key-value pairs corresponding to each sub-partitioned table, and each key-value pair includes the sub-partition identifier and a relation descriptor pointer. By employing this method, multiple openings of the same sub-partition table can be avoided. By searching the system cache, rapid switching of the memory structure of each sub-partition during global index lookup is achieved, reducing the overhead of frequently opening sub-partitions, improving the data query performance of partitioned tables, and realizing efficient global indexing of partitioned tables. The global index constructed in this embodiment can achieve global consistency across partitions. This global index is created for non-partition columns, covering all sub-partition tables and having unique constraints on the index columns. This further enhances the design flexibility of partitioned tables in business models; that is, uniqueness restrictions can be specified when using global indexes created with non-partition columns, further improving the query efficiency and accuracy of subsequent data queries based on the global index across multiple sub-partition tables.

[0069] In some embodiments, the data processing method based on a global index further includes:

[0070] If the relation descriptor pointer corresponding to the subpartition identifier is not found in the private hash table, then the target subpartition table corresponding to the subpartition identifier is searched in multiple subpartition tables based on the subpartition identifier, and the data corresponding to the index tuple is extracted in the target subpartition table based on the physical location data contained in the index tuple.

[0071] The physical location data represents the physical location of the data corresponding to the index tuple in the sub-partition table, i.e., the offset within the block. The physical location data can include the file block number and the offset within the block. For example, the sub-partition table can be divided into multiple file blocks in the physical dimension.

[0072] Specifically, if a cache miss occurs, meaning the terminal does not find the sub-partition identifier in the private hash table, the terminal can use the system table lookup process to find the target sub-partition table corresponding to the sub-partition identifier, that is, to find the sub-partition table where the index tuple is located. For example, the terminal can search for the target sub-partition table that matches the sub-partition identifier among multiple sub-partition tables contained in the partition table based on the sub-partition identifier. In this way, the terminal can open the target sub-partition table, perform position offset processing on the target sub-partition table based on the physical location data of the data rows contained in the index tuple, and extract the data corresponding to the index tuple.

[0073] Optionally, after opening the target sub-partition table, the terminal can find the corresponding file block in the target sub-partition table based on the file block number in the physical location data in the index tuple, and perform a position offset in the corresponding file block based on the intra-block offset in the physical location data, and determine the data stored in the physical location after the offset as the data stored in the data row indicated by the index tuple.

[0074] In this embodiment, by using the table lookup process and sub-partition identifiers, data can be efficiently and accurately queried and switched to the target sub-partition table for data querying, thereby improving data query efficiency.

[0075] In some embodiments, the data processing method based on a global index further includes:

[0076] Open the target sub-partition table, obtain the relation descriptor pointer of the target sub-partition table, form a key-value pair based on the relation descriptor pointer and the sub-partition identifier of the target sub-partition table, and add the key-value pair to the private hash table.

[0077] Specifically, if a cache miss occurs, meaning the terminal does not find the sub-partition identifier in the private hash table, the terminal can use the system table lookup process to find the target sub-partition table corresponding to the sub-partition identifier. Based on this, the terminal can open the target sub-partition table and, after opening it, obtain the pointer to the relation descriptor of the current target sub-partition table, i.e., obtain the relation structure of the target sub-partition table. The terminal can then combine the relation structure of the target sub-partition table with the sub-partition identifier of the target sub-partition table to form a key-value pair, and insert this key-value pair into the private hash table corresponding to the partition table to obtain the updated private hash table.

[0078] Optionally, the key-value pair can be (target_oid, Relation), where target_oid represents the sub-partition identifier and Relation represents the relation descriptor pointer of the sub-partition table.

[0079] In this embodiment, by generating new key-value pairs in a timely manner when a cache miss occurs, and updating the private hash table of the partition table based on the new key-value pairs, a comprehensive hash table data foundation is provided for subsequent fast, efficient, and accurate data lookup operations based on sub-partition identifiers and relation descriptor pointers.

[0080] In some embodiments, such as Figure 2 As shown, this data processing method based on global indexes also includes:

[0081] Step 202: Receive the global index creation statement corresponding to the partitioned table. Parse the global index creation statement to obtain keyword data.

[0082] The global index creation statement can be a user-triggered index generation operation targeting any partitioned table, such as `CREATE GLOBAL INDEX`. Keyword data can be one or more keywords from this statement parsed by the syntax parser.

[0083] Specifically, the terminal can receive a global index creation statement triggered by the user, and use the syntax parser configured in the terminal to identify keywords in the global index creation statement to obtain keyword data.

[0084] Step 204: If the keyword data contains the target global keyword, then traverse each sub-partition table in the partition table to obtain the index data of each sub-partition table.

[0085] The target global key can be one or more pre-configured keys, such as GLOBAL. The index data for each sub-partition table can be the data used to generate the index tuples corresponding to each data row in each sub-partition table.

[0086] Specifically, if the terminal determines that the keyword data contains a pre-configured target global keyword, that is, if it determines that the global index creation statement contains "GLOBAL", then the terminal can determine that the current user has triggered an execution statement to create a global index for the partition table. In this way, the terminal can perform the operation of creating a global index for the partition table. For example, the terminal can traverse each sub-partition table contained in the partition table. For the sub-partition table currently traversed, the terminal can obtain the index data of the sub-partition table. The index data includes the index key value of the sub-partition table and the sub-partition identifier corresponding to the sub-partition table.

[0087] Optionally, the terminal can use the OID of the sub-partition table as the value of the tableoid of the sub-partition table, that is, determine the OID value of the sub-partition table as the sub-partition identifier of the sub-partition table. Accordingly, for each sub-partition table, the terminal can determine the index data of the sub-partition table by using the obtained sub-partition identifier and the corresponding index key value of the sub-partition table.

[0088] Step 206: Generate a global index for the partition table based on the index data of each sub-partition table.

[0089] Specifically, for the sub-partition table currently being traversed by the terminal, the terminal can insert each index tuple contained in the index data into the pre-configured global index data structure of the data to be inserted, thus obtaining the global index of the partition table. After the terminal has traversed each sub-partition table of the partition table, the current global index data structure can be determined as the global index of the partition table.

[0090] This embodiment provides a novel global index item structure for partitioned tables. By leveraging the include column feature in PostgreSQL, the global index directly locates the physical location of the data, laying the foundation for subsequent high-performance data queries.

[0091] In one embodiment, the index data includes the index key values ​​of the sub-partition tables and the sub-partition identifiers. The specific processing steps for "generating a global index for the partition table based on the index data of each sub-partition table" include:

[0092] Based on the index key value, sub-partition identifier, and physical location data of each data row, an index tuple for each data row in the sub-partition table is generated. The index tuples corresponding to multiple data rows in multiple sub-partition tables are then combined to generate a global index for the partition table.

[0093] The global index comprises multiple index entries, or index tuples, each corresponding one-to-one with a data row in one of the multiple sub-partition tables contained within the partitioned table. Each index tuple includes a key-value portion (KeyColumns) and a non-key-value portion. The key-value portion consists of the user-defined index columns, providing fast key-based lookups. The non-key-value portion includes a hidden system column named `tableoid`, which is the sub-partition identifier. The value of this column is the OID of the specific sub-partition table containing the data row corresponding to this index entry. In other words, this sub-partition identifier indicates the sub-partition table in which the data row corresponding to this index tuple resides. Within the partitioned table, there is a one-to-one correspondence between sub-partition tables and sub-partition identifiers.

[0094] Specifically, the terminal can traverse each sub-partition table. For the currently traversed sub-partition table, the terminal can obtain the corresponding index key value, that is, the key value of each data row contained in the sub-partition table and the physical location data (ctid). Based on the key value, physical location data, and OID of the sub-partition table to which the data row belongs, the terminal constructs an index tuple corresponding to the data row and inserts the index tuple into the data structure of the global index corresponding to the partition table to be inserted. When the terminal has inserted the index tuples of all data rows in the sub-partition table into the data structure, it can be determined that the terminal has completed the traversal of the sub-partition table. Thus, the terminal can traverse the next sub-partition table until the terminal has completed the traversal of all sub-partition tables in the current partition table and obtained the global index corresponding to the partition table.

[0095] In this embodiment, any index entry in the generated global index contains the complete location information of the data row corresponding to the index entry. When searching for data based on the index entry, access can be made only to the global index, thereby achieving efficient and accurate fast access to the data and further improving data query performance.

[0096] In some embodiments, such as Figure 3 As shown, this data processing method based on global indexes also includes:

[0097] Step 302: Obtain the statistical information of the partition table.

[0098] Step 304: Optimize the statistical information and each search path using a preset optimization algorithm to obtain the search consumption data corresponding to each preset search path.

[0099] Step 306: Traverse each index tuple in the global index of the partitioned table by finding the search path with the lowest data consumption.

[0100] The statistical information includes basic statistical information and global statistical information. The preset optimization algorithm can be a dynamic programming algorithm, a genetic algorithm, etc. This embodiment does not limit the specific type of the preset optimization algorithm and can be determined based on the actual application scenario. The search path can be a search strategy, such as a direct table search strategy or a search strategy based on a global index.

[0101] Specifically, the terminal can aggregate basic statistical information and advanced statistical information (global statistical information) for the partition table to obtain the statistical information of the partition table. The terminal can input the above basic statistical information and global statistical information into a preset optimization algorithm, which calculates the query cost data corresponding to each search path, that is, determines the search cost data corresponding to each search path, and determines the search path with the lowest search cost data as the target search path, and performs data search on the partition table based on the target search path. The search cost data can be the amount of data to be read when scanning the partition table and the number of data rows returned by the data table, etc.

[0102] Optionally, the target operation path determined by the terminal can be a search strategy based on a global index. In this way, the terminal can use the search strategy based on a global index to execute the steps of traversing each sub-partition table in the partition table as described in the above embodiment, until the data corresponding to each index tuple is obtained, or the data corresponding to the index tuple indicated by the search request is obtained.

[0103] In this embodiment, by automatically aggregating and updating the global index statistics of the partition table, statistical information that can accurately describe the data distribution of the entire partition table can be obtained. This allows for a more accurate estimation of the search cost data corresponding to each search path, thereby obtaining the optimal search characteristics that better match the performance characteristics of the global index and achieving a balance between search performance and search resource consumption.

[0104] In some embodiments, the data processing method based on a global index further includes:

[0105] Obtain the initial statistics of each sub-partition table within the partition table. Based on the initial statistics of each sub-partition table, obtain the basic statistics of the partition table.

[0106] The basic statistics include the total number of disk pages and the total number of rows, where the total number of rows is the total number of tuples. These basic statistics can be the underlying statistics of sub-partitions, such as the first statistics of each sub-partition table obtained from the pg_class system table corresponding to the partition table.

[0107] Specifically, the terminal can obtain the total number of disk pages and the total number of rows for each sub-partition table from the pg_class system table, and then sum the total number of disk pages and the total number of rows for each sub-partition table to obtain the total number of disk pages (relpages) for all sub-partitions of the partition table, and the total number of rows (reltuples) for all sub-partitions contained in the partition table. In this way, the terminal can determine that the total number of disk pages and the total number of rows are the basic statistical information of the partition table.

[0108] In this embodiment, by automatically and intelligently aggregating the underlying statistical information of the sub-partition table, the total value of the global index of the partition table itself and its parent table can be calculated and cached, and then a search path that is more suitable for the partition table can be determined based on the total value.

[0109] In some embodiments, the data processing method based on a global index further includes:

[0110] The statistical metadata of each sub-partition is aggregated to obtain the global statistical information of the partitioned table.

[0111] The global statistics include the percentage of null values, the number of distinct values, a list of the most common values, and one or more of the following from the histogram: The statistical metadata for each sub-partition can be the pg_statistic data for that sub-partition.

[0112] Specifically, for each sub-partition, the terminal can obtain the corresponding pg_statistic data, such as the number of empty rows in the sub-partition table and the frequency of each value. In this way, the terminal can perform weighted average processing or merge calculation processing on the number of rows and the frequency of each value in each sub-partition table to obtain one or more of the global null value ratio, number of different values, list of most common values, and histogram for the corresponding partition table. Thus, the terminal can obtain the global statistical information corresponding to the partition table.

[0113] Here, a null value can be a NULL value, which indicates that the value is unknown. Null values ​​are identifiers in Structured Query Language that represent data that is unknown, inapplicable, or will be added later. They are used to indicate that the data in a data row is uncertain.

[0114] In this embodiment, information is aggregated from the pg_statistic data of each sub-partition to generate global statistical information that can describe the data distribution of the entire partition table.

[0115] In some embodiments, after obtaining the sub-partition identifier in the currently traversed index tuple, the data processing method based on the global index further includes:

[0116] Based on the sub-partition identifier, query the set of sub-partition identifiers to be cleaned up corresponding to the partition table. If a sub-partition identifier is found in the set of sub-partition identifiers to be cleaned up, add a recyclable identifier to the sub-partition identifier.

[0117] Specifically, the terminal can generate a set of sub-partition identifiers to be cleaned up corresponding to the partition table based on the failed data table of the partition table. After obtaining the set of sub-partition identifiers to be cleaned up, it can perform a scan operation on the global index. For each index item in the global index, that is, each index tuple in the global index, the terminal can obtain the tableoid stored in the INCLUDE column of the index tuple, that is, obtain the sub-partition identifier corresponding to the index tuple, and determine whether the sub-partition identifier is in the set of sub-partition identifiers to be cleaned up. If the sub-partition identifier is found in the set of sub-partition identifiers to be cleaned up, the terminal can add a recyclable identifier to the index tuple corresponding to the sub-partition identifier through the cleanup process, that is, mark the index tuple as recyclable.

[0118] In this embodiment, by scanning the global index in real time, index tuple data that is in an invalid state in the global index can be cleaned up in a timely manner, ensuring the simplification of the global index.

[0119] In some embodiments, such as Figure 4 As shown, this data processing method based on global indexes also includes:

[0120] Step 402: After performing data definition language operations on the target sub-partition table in the partition table, determine that each index tuple corresponding to the target sub-partition table in the global index is in an invalid state, and generate an invalid data table based on the first identifier of the partition table and the second identifier of each invalid sub-partition table.

[0121] Data Definition Language (DDL) operations can be performed on a subpartition of a partitioned table. When a DDL operation is executed on a subpartition of the partitioned table, all index tuples in the global index that point to that subpartition table become invalid. This means that all index tuples pointing to that subpartition table need to be cleaned and removed. The invalidation table can be a newly added system table, such as `pg_global_index_invalid_rel`, used to record invalid subpartitions caused by DDL operations. This table must contain at least the following fields: `parent_oid`, the object identifier (OID, i.e., the first identifier) ​​of the partitioned table, and `child_oid`, the object identifier (OID, i.e., the second identifier) ​​of the deleted subpartition.

[0122] Specifically, when the terminal performs a DDL operation on any one or more sub-partition tables in the partition table, such as when performing a DDL operation on a target sub-partition table in the partition table, the terminal can determine that each index tuple in the global index pointing to the target sub-partition table is in an invalid state. In this way, the terminal can obtain the first identifier of the partition table corresponding to the DDL operation and the second identifier of the target sub-partition table corresponding to the DDL operation, and write the first identifier and the second identifier into a newly added system table to obtain the invalid data table corresponding to the partition table.

[0123] Step 404: When the triggering conditions for the cleanup operation are met, query the invalid data table corresponding to the partition table and generate a set of sub-partition identifiers to be cleaned corresponding to the partition table.

[0124] The cleanup operation can be a cleanup operation targeting the global index, and the triggering condition can be the automatic cleanup daemon (AutoVacuum process) in the partition table or a cleanup command (VACUUM command) triggered by the user.

[0125] Specifically, when the triggering conditions for the cleanup operation are met, the terminal can perform a cleanup operation on the partition table. During the cleanup operation, if the terminal determines that the partition table has a global index, the cleanup operation (Vacuum worker process) executed by the terminal will query the failed data table corresponding to the partition table. Based on the second identifier of each sub-partition table contained therein, a set of sub-partition identifiers to be cleaned corresponding to the partition table will be generated. That is, based on the sub-partition identifiers in the failed data table, the set of sub-partition identifiers to be cleaned corresponding to the partition table is generated. For example, it can be a "set of sub-partition OIDs to be cleaned" {oid1, oid2, ...}.

[0126] In this embodiment, by constructing a set of sub-partition identifiers to be cleaned, a data foundation can be provided for the timely cleanup of subsequent invalid global indexes, avoiding the expansion of global indexes, reducing the time consumption of DDL operations on partition tables, and ensuring data consistency when deleting some data in the global index.

[0127] In some embodiments, the data processing method based on a global index further includes:

[0128] The cleanup process deletes the index tuples corresponding to subpartition identifiers marked as recyclable.

[0129] The cleanup process can be a cleanup operation, i.e., a Vacuum worker process.

[0130] Specifically, the terminal uses the Vacuum process to perform physical deletion operations on index tuples marked as recyclable in the global index. If the index tuple does not have a recyclable mark, it is retained, thus obtaining a global index of cleaned-up invalid data.

[0131] In this embodiment, by scanning the global index in real time, index tuple data that is in an invalid state in the global index can be cleaned up in a timely manner, ensuring the simplification of the global index.

[0132] The following describes in detail the specific implementation steps of the above-mentioned data processing method based on global index, with reference to a specific embodiment:

[0133] PostgreSQL, a database in related technologies, supports declarative partitioning, allowing users to divide tables based on specified keys (such as date ranges, numeric ranges, list values, etc.). When executing queries, the query planner can improve query efficiency by excluding partitions that do not contain relevant data through partition pruning. However, PostgreSQL's partitioned table indexes do not support global indexes; indexes created on partitioned tables are actually local indexes created on each subpartition. Therefore, for queries with non-partition key conditions (i.e., queries that do not contain the partition key) and low selectivity, the database engine must still scan all local indexes on all subpartitions sequentially (Index Scan) and then merge the results using the Append operator. This effectively involves multiple index lookups, and the I / O and CPU overhead is proportional to the number of partitions. When there are many partitions (e.g., hundreds or thousands of partitions generated by daily partitioning), multiple index scans can cause a sharp drop in the query performance of partitioned tables, potentially resulting in performance lower than a full table scan.

[0134] The data processing method based on global indexes provided in this embodiment is essentially a technical solution deeply integrated into the PostgreSQL database kernel, capable of providing efficient global index support for partitioned tables. Specifically, it is a PostgreSQL partitioned table global index method that natively supports the creation and use of global indexes for partitioned tables. The method in this embodiment does not need to access multiple local indexes when executing the query plan, avoiding long and complex data query paths and reducing query execution latency and resource consumption. Data querying through global indexes can maintain the global uniqueness of each index tuple in the global index across partitions. It can create an index covering all partitions with unique constraints on the index columns, improving the flexibility of partitioned tables in business model design. It is simple to implement and maintain, has low performance overhead, and high integration with the database kernel, making it a native and efficient data query solution.

[0135] This embodiment provides a data processing method based on a global index, offering a method for constructing a global index for any partitioned table. It provides a novel global index structure that allows direct location of data rows on sub-partitions via index entries. By introducing an INCLUDE column into the index to store the tableoid (the logically unique ID of the sub-partition), the global index can directly locate the sub-partition pointed to by the index entry.

[0136] The data processing method based on global index provided in this embodiment offers an efficient partition switching and caching mechanism. It enables efficient sub-partition mapping and switching at the execution layer (Executor). It maintains a hash table based on sub-partitions to cache the memory structure of the corresponding sub-partitions for Scan, allowing the database engine to quickly find and switch to the correct sub-table for data reading after obtaining the sub-partition tableoid through the global index.

[0137] The data processing method based on global indexes provided in this embodiment offers an optimized query execution plan. This allows the terminal to directly generate an Index Scan query plan instead of an Append plan for queries using global indexes through the query optimizer (Planner). This enables data queries on partitioned tables to be performed using the search strategy with the lowest search cost. For example, the base table can be accessed directly through the global index, simplifying the execution path and reducing query latency and CPU overhead.

[0138] The data processing method based on global indexes provided in this embodiment can extend the SQL parser of PostgreSQL to add support for the GLOBAL keyword (e.g., CREATE INDEX …… GLOBAL). It can create global indexes on partitioned tables in a similar way to creating ordinary indexes on partitioned tables, and the syntax is compatible with that of ordinary indexes. This is imperceptible to the user, further reducing the learning and usage costs for users and providing a good user experience.

[0139] The data processing method based on global indexes provided in this embodiment offers an asynchronous cleanup mechanism for invalid index entries. This mechanism can solve the problem of index entry invalidation caused by partitioned table data definition language (DDL) operations (such as DROP TABLE and TRUNCATE subpartitions). The asynchronous cleanup mechanism can automatically and gradually clean up global index entries that have become invalid due to DDL operations in the background, thereby ensuring the data consistency of the global indexes. At the same time, it avoids performance blocking and delays caused by synchronous cleanup during DDL execution, ensuring the overall responsiveness and stability of the system.

[0140] The data processing method based on global indexes provided in this embodiment can be applied to a partitioned table global index system. This system includes a join management module, a syntax parsing module, a query optimization module, a plan execution module, and a storage module, wherein:

[0141] The connection management module connects to multiple users to receive SQL commands triggered by users. These SQL commands can be global index building statements or data query instructions, etc. The syntax parsing module is used to parse the incoming SQL commands. For example, the GLOBAL keyword is supported in the index creation command, indicating that the created index is a global index for the partitioned table.

[0142] The query optimization module rewrites the ParseTree of the syntax parsing module into a Query structure and further generates an execution plan. Based on optimization strategies, it selects the optimal execution plan, such as determining the search path with the lowest data consumption as the optimized search path, and then performing data queries on the partitioned table based on this optimized search path. Compared to the original plan of aggregating sub-partition data using the Append operator, this requires adding an Index Scan as a scan plan for the global index, i.e., building a global index corresponding to the partitioned table.

[0143] The execution plan module is used to execute the execution plan generated by the query optimization module. For paths generated by the global index, during an index scan, the required partition table `tableoid` is retrieved from the index. Based on the retrieved `tableoid`, the corresponding partition sub-table is accessed, and table splitting is performed if necessary. The global index can be a `partition globalindex`, and the partition table can be a `Partitioned Table` containing multiple sub-partition tables, such as `partition1`, `partition2`, ..., `partition n`, for example, a partition table containing `n` sub-partition tables. Each sub-partition table in the partition table can be stored on a storage device in the system. The storage module can contain multiple storage devices, and all sub-partitions in the partition table share a single global index.

[0144] In other words, this partitioned table global index system includes a syntax parsing module, an index building module, a query optimization module, an executor routing module, and a garbage collection module, which work together to manage the entire lifecycle of the partitioned table global index. This partitioned table global index system is an implementation based on PostgreSQL partitioned table global indexes, providing a complete technical solution from index definition, construction, querying to maintenance, to natively support efficient global indexes on partitioned tables.

[0145] Optionally, a novel global index tuple structure for partitioned tables is provided. This structure cleverly solves the key problem of global indexes being unable to directly locate the physical location of data by leveraging PostgreSQL's INCLUDE column feature. Specifically, when creating a global index, the system automatically adds the unique identifier of the sub-partition (the system column `tableoid`) as a hidden, system-managed INCLUDE column to the index's tuple structure, thus obtaining the index tuples (index tuple structure) corresponding to each data row within the sub-partition. Each index tuple in the global index includes a key-value part and a non-key-value part; the key-value part (Key Columns) consists of user-defined index columns used to provide fast key-value-based lookups. The non-key-value part (INCLUDE Columns) automatically contains a hidden system column named `tableoid`. The value of this column is the OID of the specific sub-table where the data row corresponding to this index tuple resides.

[0146] The process of creating this global index may include:

[0147] When the terminal receives the `CREATE GLOBAL INDEX ...` statement triggered by the user, its syntax parser can recognize the `GLOBAL` keyword. If the `global` keyword is recognized, the index building engine (such as `index_build`) will traverse all sub-partitions of the partitioned table to build a global index. For example, while retrieving the index key value from each sub-partition, the index building engine will use the sub-partition's OID as the `tableoid` value. The `tableoid` value, together with the index key value, forms a complete index tuple and is inserted into the global index. After traversing all sub-partition tables, the global index corresponding to the partitioned table is generated.

[0148] Based on this, any index entry in the global index contains the complete location information of the target data row corresponding to that index entry. When scanning for data using this index, the engine only needs to access the global index itself to directly extract the (key, tableoid, ctid) information from the index entry. Here, the tableoid explicitly indicates which subpartition the data is located in, while the ctid file block number and intra-block offset indicate the specific physical row location within that subpartition. This eliminates the need for the database engine to determine the partition where the data is located through traditional partition boundary constraint inference or traversing the partition tree, achieving direct location from the logical index to the physical data.

[0149] Optionally, such as Figure 5The diagram shows the storage structure of a global index for a partitioned table. This global index contains multiple index tuples, such as [key1, subpartition identifier1, physical location data1], [key2, subpartition identifier2, physical location data2], [key3, subpartition identifier3, physical location data3], i.e., [key1, ctid1, oid1], [key2, ctid2, oid2], and [key3, ctid3, oid3]. [key1, subpartition identifier1, physical location data1] can point to data row 2 in the P1 subpartition table, corresponding to... [key2, subpartition identifier2, physical location data2] can point to data row 2 in subpartition table P2, and [key3, subpartition identifier3, physical location data3] can point to data row 1 in subpartition table P3. That is, [key1, ctid1, oid1] can point to data row 2 in subpartition table P1, and correspondingly, [key2, ctid2, oid2] can point to data row 2 in subpartition table P2, and [key3, ctid3, oid3] can point to data row 1 in subpartition table P3.

[0150] The global index structure based on the INCLUDE column provided in this embodiment is an innovative global index data structure. By automatically storing the unique identifier (tableoid) of the sub-partition as a hidden INCLUDE column into the index tuple when the index is created, a single index entry can contain the physical location information of the sub-partition where the target data row is located, providing a data foundation for direct location.

[0151] Furthermore, in related technologies, at the database executor level, when scanning a global index and encountering consecutive index entries pointing to different sub-partitions, the traditional index scanning process may require switching the sub-partition table (i.e., switching the Relation data structure) for each row of data read. In PostgreSQL, finding and opening a sub-partition table via tableoid (heap_open or similar operations) involves system cache lookups and lock checks, which are resource-intensive. If this operation is performed frequently within an index scan, the overhead is significant and may negate the performance advantages of the global index. Therefore, this embodiment designs a hash table-based dynamic routing and caching mechanism for sub-partitions at the index scan execution layer. This mechanism enables extremely fast switching of sub-partition memory structures. In this embodiment, a private, lightweight hash table is embedded within the core data structure in PostgreSQL used to represent the index scan status (such as IndexScanDescData), resulting in the private hash table corresponding to the partition table. This private hash table serves as a cache for the sub-partition relation descriptors (Relations). The key in the hash table is the object identifier (OID) of the sub-partition, which is the tableoid extracted from the INCLUDE column of the global index entry; the value can be a pointer to the Relation structure of the corresponding sub-partition table; this structure is the complete representation of the sub-table in memory, containing all the necessary metadata and buffer access status for data reading.

[0152] like Figure 6As shown, when the index scanner retrieves the next tuple from the global index, for example, switching from [key1, ctid1, oid1] to [key2, ctid2, oid2], the terminal can first parse the target sub-partition's tableoid from the INCLUDE column of the index entry. The scanner first searches for the tableoid in its private hash table (hash partitionRelations, which stores sub-partition tables P1, P2, P3 and their corresponding memory pointers, i.e., relation descriptor pointers). If there is a cache hit (i.e., the corresponding tableoid is found in the hash table), the corresponding Relation pointer is immediately retrieved from the hash table. The scanner directly uses this memory structure for subsequent data row (HeapTuple) read operations, completely avoiding the overhead of repeatedly opening sub-tables. If there is a cache miss (i.e., if not found), the corresponding sub-partition table is opened according to the tableoid using the regular system table lookup process to obtain its Relation structure. Subsequently, the (target_oid, Relation) key-value pair is inserted into the private hash table for later use.

[0153] In other words, the hash table-based sub-partition dynamic routing method in this embodiment can protect the private hash table caching mechanism introduced at the executor layer. This mechanism uses the sub-partition's `tableoid` as the key and caches its corresponding relation descriptor in memory, enabling extremely fast switching of the sub-partition's memory structure during global index scanning and avoiding the system overhead of frequently opening sub-partitions.

[0154] Optionally, this embodiment provides a global index execution plan. In PostgreSQL, when querying a partitioned table, the optimizer (Planner) typically generates an execution plan containing Append and / or MergeAppend nodes for scans involving local indexes. This plan requires scanning the indexes of multiple sub-partitions in parallel or sequentially, and then merging the results. Its execution efficiency is negatively correlated with the number of partitions, becoming a performance bottleneck for partitioned table queries. This embodiment improves the query optimizer, enabling it to identify global indexes and generate a new, more efficient execution plan for them.

[0155] Step 1, non-Append execution plan generation, includes path generation and plan generation. Path generation occurs when the optimizer detects a query condition matching a global index during path generation. It creates a new GlobalIndexPath path node. The key difference between this path node and a traditional IndexPath is that it no longer organizes its lower-level sub-partition index scan paths using AppendPath. Plan generation occurs during the path-to-plan conversion phase. The GlobalIndexPath is directly converted into a simple IndexScan (or IndexOnlyScan) plan node, instead of an Append node containing multiple sub-IndexScan nodes.

[0156] The final execution plan generated in this way completely avoids the append operator. The executor only needs to perform a single index scan at runtime, and its execution overhead is constant and independent of the number of sub-partitions in the underlying partition table. This completely solves the problem of performance degradation caused by the increase in the number of partitions in traditional plans.

[0157] Step 2: This embodiment introduces an automatic aggregation and maintenance mechanism for global index statistics, including basic statistics aggregation and advanced statistics aggregation.

[0158] Basic statistics aggregation refers to the system automatically aggregating the underlying statistics of all sub-partitions (mainly from the pg_class system table) to calculate and cache total values ​​for the global index itself and its base table (parent table). For example, it can include the total number of disk pages (relpages) and the total number of rows (reltuples) of all sub-partitions.

[0159] Advanced statistical aggregation refers to the system aggregating information from the pg_statistic data of each sub-partition for the index columns of the global index and columns that may be involved in the query, generating global statistical information that can describe the data distribution of the entire partitioned table, including but not limited to:

[0160] The global percentage of NULL values;

[0161] The number of distinct values ​​globally (NDV).

[0162] A global list of MCVs (Most Common Values) and a histogram; the determination process can be achieved by weighted averaging or merging the statistical information of each sub-partition.

[0163] The system can calculate the cost of GlobalIndexPath based on basic and advanced statistics. The optimizer uses these aggregated statistics, rather than those of any single sub-partition, enabling it to more accurately estimate the amount of data to be read and the number of rows to be returned when scanning the global index. This allows the optimizer to make the optimal plan selection that matches the performance characteristics of the global index. The query optimizer can generate new execution plans for the global index, create a single IndexScan plan node, abandon the traditional Append node, and introduce a global statistics aggregation mechanism to provide accurate cross-partition data distribution information for cost estimation, ensuring the generation of efficient and stable execution plans.

[0164] This embodiment extends the SQL parser of the PostgreSQL database by introducing a new keyword into the existing index creation syntax to explicitly distinguish between the creation behavior of global and local indexes. This allows users to create and use global indexes intuitively and conveniently. Following PostgreSQL's syntax design philosophy, this embodiment adds an optional GLOBAL keyword to the end of the existing CREATE INDEX statement, representing a minimally invasive modification to standard SQL syntax. This maximizes compatibility with all existing SQL scripts, ORM frameworks, and database tools, while completely preserving the original CREATE INDEX syntax. When the user does not add the GLOBAL keyword, a traditional local index is created, behaving exactly the same as in unmodified PostgreSQL. This ensures smooth database version upgrades and zero-impact operation of legacy applications. In this embodiment, the GLOBAL keyword has clear semantics, making it easy for various users and terminal devices to intuitively understand its function. This function creates an index that operates on the entire partitioned table, ensuring the flexibility and accuracy of global index generation, and supporting the syntax parsing extension of the GLOBAL keyword. This protects the extension to the PostgreSQL SQL syntax parser by adding the syntax definition of the optional GLOBAL keyword to the end of the CREATE INDEX statement, providing users with an intuitive, convenient, and fully compatible global index creation interface with the original syntax.

[0165] Specifically, performing Data Definition Language (DDL) operations on a partitioned table, such as performing a DDL operation on a subpartition in a DROP TABLE, will immediately invalidate all global index entries pointing to that subpartition. Related technologies that immediately delete all relevant index entries within a DDL transaction can lead to performance blocking, transaction risks, and resource expropriation. For example, deleting a large partition containing millions of index entries can trigger a large-scale, synchronous deletion of global indexes, causing DDL operations to be extremely time-consuming and severely blocking subsequent operations. Long-running DDL transactions may be interrupted due to failures. Currently, PostgreSQL indexes do not contain transaction information and do not support rollback after deletion, thus compromising data consistency. Synchronous cleanup consumes significant CPU and I / O resources, impacting overall database performance.

[0166] The asynchronous invalidation index reclamation mechanism in this embodiment can avoid the above situation, specifically including:

[0167] S1, Metadata Records, extends system tables. For example, a new system table can be added to record invalid sub-partitions caused by DDL operations. The new system table can be pg_global_index_invalid_rel, which contains at least the following fields: parent_oid, the object identifier (OID) of the partition table, and child_oid, the object identifier (OID) of the deleted sub-partition. In the transaction that performs DDL operations (such as DROP TABLE), the above information is written to the pg_global_index_invalid_rel table to obtain the invalid data table corresponding to the partition table.

[0168] S2, asynchronous cleanup, specifically involves performing global index cleanup through PostgreSQL's automatic cleanup daemon (AutoVacuum) or the user-triggered VACUUM command. This cleanup phase is added within the Vacuum cleanup process, specifically during the global index Vacuum execution. This Vacuum process can be a PostgreSQL background maintenance process or command used to clean up dead tuples, update statistics, and reclaim storage space.

[0169] like Figure 7 As shown, when there is a Vacuum partition table and a global index exists on the partition table, the Vacuum worker process will collect the set of invalid subpartition OIDs through the pg_global_index_invalid_rel table to obtain clean-oids. That is, by querying the pg_global_index_invalid_rel table, a set of subpartition OIDs that need to be cleaned is obtained: {oid1, oid2, ...}.

[0170] In this way, the terminal can use the Vacuum worker process to clean up index entries in the global index whose `oid` is `clean-oids`, thus completing the asynchronous cleanup of the global index. Specifically, the Vacuum worker process can scan the global index. For each index entry, it checks the `tableoid` stored in its `INCLUDE` column. If the `tableoid` exists in the invalid OID set, it means that this index entry points to a non-existent subpartition, making it an invalid entry, and Vacuum will mark it as recyclable and eventually physically delete it. If it does not exist, it is retained.

[0171] The asynchronous garbage collection mechanism for DDL invalid index entries provided in this embodiment is an asynchronous cleanup method triggered by adding invalid subpartitions to the system table and the Vacuum process. By recording the OID of the deleted subpartitions and cleaning up invalid entries uniformly during subsequent Vacuum global indexing, the time-consuming cleanup work is separated from the DDL transaction path, ensuring the efficiency of DDL operations and system consistency, realizing the immediate cleanup of invalid index entries, and avoiding the index from growing indefinitely.

[0172] The goal is to significantly improve the performance of PostgreSQL partitioned tables in non-partition key query scenarios, provide cross-partition uniqueness constraint capabilities, and further enhance the competitiveness of PostgreSQL databases when handling ultra-large-scale data.

[0173] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.

[0174] Based on the same inventive concept, this application also provides a globally indexed data processing apparatus for implementing the aforementioned globally indexed data processing method. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more embodiments of the globally indexed data processing apparatus provided below can be found in the limitations of the globally indexed data processing method described above, and will not be repeated here.

[0175] In one exemplary embodiment, such as Figure 8 As shown, a data processing device 800 based on a global index is provided, comprising:

[0176] The first acquisition module 802 is used to obtain the sub-partition identifier in the index tuple currently being traversed for the global index of the partition table; the partition table includes multiple sub-partition tables.

[0177] The first query module 804 is used to perform a data row read operation through the relation descriptor pointer if the relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, so as to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer.

[0178] In one embodiment, the first query module is further configured to, if the relation descriptor pointer corresponding to the sub-partition identifier is not found in the private hash table, then search for the target sub-partition table corresponding to the sub-partition identifier in multiple sub-partition tables based on the sub-partition identifier, and extract the data corresponding to the index tuple in the target sub-partition table based on the physical location data contained in the index tuple.

[0179] In one embodiment, the apparatus further includes:

[0180] The second acquisition module is used to open the target sub-partition table, obtain the relation descriptor pointer of the target sub-partition table, form a key-value pair based on the relation descriptor pointer and the sub-partition identifier of the target sub-partition table, and add the key-value pair to the private hash table.

[0181] In one embodiment, the apparatus further includes:

[0182] The first receiving module is used to receive the global index creation statement corresponding to the partition table; it performs syntax parsing on the global index creation statement to obtain keyword data.

[0183] The first traversal module is used to traverse each sub-partition table in the partition table and obtain the index data of each sub-partition table if the keyword data contains the target global keyword.

[0184] The first generation module is used to generate a global index for the partition table based on the index data of each sub-partition table.

[0185] In one embodiment, the first generation module is specifically used for:

[0186] The third acquisition module is used to generate index tuples for each data row in the sub-partition table based on the index key value, sub-partition identifier, and physical location data of each data row; and to combine the index tuples corresponding to multiple data rows in multiple sub-partition tables to generate a global index for the partition table.

[0187] In one embodiment, the apparatus further includes:

[0188] The fourth acquisition module is used to acquire statistical information of the partition table, including basic statistical information and global statistical information;

[0189] The optimization module is used to optimize statistical information and each search path through a preset optimization algorithm to obtain the search consumption data corresponding to each preset search path.

[0190] The second traversal module is used to traverse each index tuple in the global index of the partition table by finding the preset search path with the lowest data consumption.

[0191] In one embodiment, the apparatus further includes:

[0192] Retrieve the first statistical information of each sub-partition table in the partition table;

[0193] The fifth acquisition module is used to obtain the basic statistics of the partition table based on the first statistics of each sub-partition table. The basic statistics include the total number of disk pages and the total number of rows. The total number of rows is the total number of tuples.

[0194] In one embodiment, the apparatus further includes:

[0195] The first aggregation module is used to aggregate statistical metadata of each sub-partition to obtain global statistical information of the partition table. The global statistical information includes the proportion of null values, the number of distinct values, the list of the most common values, and one or more of the following in the histogram.

[0196] In one embodiment, after the step of obtaining the sub-partition identifier in the currently traversed index tuple, the apparatus further includes:

[0197] The second query module is used to query the set of sub-partition identifiers to be cleaned corresponding to the partition table based on the sub-partition identifier;

[0198] Add a module to add a recyclable identifier to the subpartition identifier if the subpartition identifier is found in the set of subpartition identifiers to be cleaned.

[0199] In one embodiment, the apparatus further includes:

[0200] The data table generation module is used to determine that each index tuple corresponding to the target sub-partition table in the global index is in an invalid state after performing data definition language operations on the target sub-partition table in the partition table, and to generate an invalid data table based on the first identifier of the partition table and the second identifier of each invalid sub-partition table.

[0201] The third query module is used to query the invalid data table corresponding to the partition table when the triggering conditions for the cleanup operation are met, and generate a set of sub-partition identifiers to be cleaned corresponding to the partition table.

[0202] In one embodiment, the apparatus further includes:

[0203] The deletion module is used to delete the index tuples corresponding to subpartition identifiers marked as recyclable through a cleanup process.

[0204] Each module in the aforementioned global index-based data processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0205] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as follows: Figure 9 As shown, this computer device includes a processor, memory, input / output (I / O) interfaces, and a communication interface. The processor, memory, and I / O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I / O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for the operating system and computer programs stored in the non-volatile storage media. The database stores partition table data. The I / O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When executed by the processor, the computer program implements a data processing method based on a global index.

[0206] Those skilled in the art will understand that Figure 9The structure shown is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0207] In one exemplary embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the steps in the above-described method embodiments.

[0208] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0209] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0210] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0211] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0212] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0213] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A data processing method based on a global index, characterized in that, The method includes: For the global index of the partition table, obtain the sub-partition identifier in the currently traversed index tuple; the partition table includes multiple sub-partition tables; the global index of the partition table is generated based on the index data of each of the sub-partition tables; the index data includes the index key value of the sub-partition table and the sub-partition identifier; If a relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table, a data row read operation is performed through the relation descriptor pointer to obtain the data corresponding to the index tuple; the private hash table includes multiple key-value pairs corresponding to each of the sub-partition tables, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer; the private hash table and the partition table are in one-to-one correspondence. The method further includes: Based on the index key value, sub-partition identifier, and physical location data of each data row, an index tuple for each data row in the sub-partition table is generated; the index tuples corresponding to multiple data rows in multiple sub-partition tables are combined to generate a global index for the partition table.

2. The method according to claim 1, characterized in that, The method further includes: If the relation descriptor pointer corresponding to the sub-partition identifier is not found in the private hash table, then the target sub-partition table corresponding to the sub-partition identifier is searched in the multiple sub-partition tables based on the sub-partition identifier, and the data corresponding to the index tuple is extracted in the target sub-partition table based on the physical location data contained in the index tuple.

3. The method according to claim 2, characterized in that, The method further includes: Open the target sub-partition table, obtain the relation descriptor pointer of the target sub-partition table, form a key-value pair based on the relation descriptor pointer and the sub-partition identifier of the target sub-partition table, and add the key-value pair to the private hash table.

4. The method according to claim 1, characterized in that, The method further includes: Receive the global index creation statement corresponding to the partition table; perform syntax parsing on the global index creation statement to obtain keyword data; If the keyword data contains the target global keyword, then the index data of each sub-partition table in the partition table is obtained by traversing the partition table.

5. The method according to claim 1, characterized in that, The method further includes: Obtain statistical information of the partition table, including basic statistical information and global statistical information; The statistical information and each search path are optimized by a preset optimization algorithm to obtain the search consumption data corresponding to each search path. The global index tuples of the partition table are traversed by finding the search path with the lowest data consumption.

6. The method according to claim 5, characterized in that, The method further includes: Obtain the first statistical information of each sub-partition table in the partition table; Based on the first statistical information of each of the sub-partition tables, the basic statistical information of the partition table is obtained. The basic statistical information includes the total number of disk pages and the total number of rows, where the total number of rows is the total number of tuples.

7. The method according to claim 5, characterized in that, The method further includes: The statistical metadata of each sub-partition is aggregated to obtain the global statistical information of the partition table. The global statistical information includes one or more of the following: the proportion of null values, the number of different values, the list of the most common values, and the histogram.

8. The method according to claim 1, characterized in that, After the step of obtaining the sub-partition identifier in the currently traversed index tuple, the method further includes: Based on the sub-partition identifier, query the set of sub-partition identifiers to be cleaned corresponding to the partition table; If the sub-partition identifier is found in the set of sub-partition identifiers to be cleaned, then a recyclable identifier is added to the sub-partition identifier.

9. The method according to claim 8, characterized in that, The method further includes: After performing data definition language operations on the target sub-partition table in the partition table, it is determined that each index tuple corresponding to the target sub-partition table in the global index is in an invalid state, and an invalid data table is generated based on the first identifier of the partition table and the second identifier of each sub-partition table in an invalid state. When the triggering conditions for the cleanup operation are met, the invalid data table corresponding to the partition table is queried to generate a set of sub-partition identifiers to be cleaned corresponding to the partition table.

10. The method according to claim 9, characterized in that, The method further includes: The cleanup process deletes the index tuples corresponding to the sub-partition identifiers marked with the recyclable identifier.

11. A data processing device based on a global index, characterized in that, The device includes: The first acquisition module is used to acquire the sub-partition identifier in the currently traversed index tuple for the global index of the partition table; the partition table includes multiple sub-partition tables; the global index of the partition table is generated based on the index data of each of the sub-partition tables; the index data includes the index key value of the sub-partition table and the sub-partition identifier; The first query module is used to perform a data row read operation through the relation descriptor pointer to obtain the data corresponding to the index tuple if the relation descriptor pointer corresponding to the sub-partition identifier is found in the private hash table corresponding to the partition table; the private hash table includes multiple key-value pairs corresponding to each sub-partition table, and the key-value pairs include the sub-partition identifier and the relation descriptor pointer; the private hash table and the partition table are in one-to-one correspondence. The device further includes: The first generation module is used to generate an index tuple for each data row in the sub-partition table based on the index key value, sub-partition identifier, and physical location data of each data row; and to combine the index tuples corresponding to multiple data rows in multiple sub-partition tables to generate a global index for the partition table.

12. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 10.

13. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 10.

14. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 10.