A data processing method and related equipment

By constructing shadow tables in the distributed database that use frequently queried table fields as sharding keys, the low query efficiency caused by a mismatch between the sharding key and the join field is resolved, and query requests execute more efficiently.

CN118820236B · Active · Publication Date: 2026-06-30 · PEOPLE'S INSURANCE COMPANY OF CHINA


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: PEOPLE'S INSURANCE COMPANY OF CHINA
Filing Date: 2024-06-20
Publication Date: 2026-06-30


IPC: G06F16/22; G06F16/242; G06F16/27

AI Tagging

Technology Topics

Table (database); Engineering




AI Technical Summary

Technical Problem

In a distributed database, how can the execution efficiency of a query request be ensured when the sharding key of the requested table differs from the join field of the table used to generate the query results?

Method used

By obtaining the query weights of the table fields in the target database table, selecting table fields whose query weights exceed a preset threshold as available sharding keys, and constructing shadow tables based on these available sharding keys, the sharding keys are made to match the join fields, thereby improving the execution efficiency of query requests.

Benefits of technology

The computation pushdown method improves the execution efficiency of query requests and reduces resource consumption and query time.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a data processing method that addresses a problem in the prior art: how to ensure query execution efficiency when the sharding key of the distributed database table requested in a query differs from the join field of the table to be joined when generating the query result. The method includes: obtaining the query weight of at least one table field of a target database table, where the query weight is determined based on the number of queries and the cumulative query time of the table field of the target database table in the distributed database within a target time period; selecting a table field from the at least one table field as an available sharding key, where either the selected table field has a query weight greater than a preset weight threshold, or the query weights and cumulative query times of all table fields in the set formed by the selected table fields together satisfy a preset selection condition; and, if the determined available sharding keys include one that differs from the sharding key of the original shadow table of the target database table, using at least one such available sharding key as the target sharding key to construct a shadow table of the target database table, sharded by the target sharding key and distributed to several data storage nodes of the distributed database. The original shadow table is the shadow table of the target database table currently stored in several data storage nodes of the distributed database. This application also discloses a data processing apparatus, device, storage medium, and computer program product.


Description

Technical Field

[0001] This application relates to the field of database technology, and in particular to a data processing method, apparatus, electronic device, storage medium, and computer program product.

Background Technology

[0002] A distributed database refers to a database that uses high-speed computer networks to connect multiple physically dispersed data storage units to form a logically unified database.

[0003] When storing any database table in a distributed database, the database table can be sharded and then distributed across multiple data storage units.

[0004] Sharding a database table refers to horizontally splitting a database table into multiple data shards according to a specified sharding key (PartitionKey, also known as the table partitioning key).

[0005] The sharding key is a database table field selected from the database table and used to generate sharding rules during the table partitioning process. These sharding rules can be hash modulo sharding rules: the sharding key (e.g., user_id, order_id) is hashed modulo the total number of partitions, and the database table is partitioned based on the modulo result. A concrete example is: when the sharding key is user_id, user_id = 1, modulo 4 gives 1, so it's placed in partition 1; user_id = 3, modulo 4 gives 3, so it's placed in partition 3, and so on. Besides this sharding rule, there are also rules that determine the partitioning range based on the sharding key before partitioning the database table, etc.
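The hash modulo rule in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `shard_for` is hypothetical, and, as in the source example, the integer key value itself is used as its hash.

```python
def shard_for(key: int, num_partitions: int) -> int:
    """Return the partition index for an integer sharding key.

    Assumption: the key value is used directly as its hash, which
    matches the user_id example in the text (1 % 4 == 1, 3 % 4 == 3).
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return key % num_partitions


# Route a few user_id values across 4 partitions.
placement = {user_id: shard_for(user_id, 4) for user_id in (1, 3, 6)}
print(placement)
```

A real system would typically apply a dedicated hash function before taking the modulo so that non-integer keys and clustered key values still spread evenly; the text does not specify which one.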

[0006] According to current technology, a typical distributed database with a storage-compute separation architecture includes compute nodes (CNs) and data nodes (DNs). When querying a distributed database, compared to pulling data to the CN node for execution (without pushdown), directly sending query requests to the DN for execution, with the CN only forwarding the query results, has proven to reduce the overhead of multiple network interactions between the CN and DN, thereby achieving higher query request execution efficiency.

[0007] To ensure the efficiency of query execution as much as possible, existing technologies pay particular attention to whether the sharding key used when sharding the database table requested in the query request is the same as the table field (i.e., the table join field or join key) to be joined when generating the query results. If they are the same, then theoretically, using a "computation pushdown" approach will result in higher query execution efficiency.

[0008] JOIN, or join, refers to searching for table fields in different database tables that match the query conditions (generally including data type and column name), and then automatically joining the found table fields that match the query conditions to return all results that meet the query conditions.

[0009] In real-world scenarios, a common problem arises where the sharding key of the database table being queried differs from the join field of the table used to generate the query results. Ensuring query execution efficiency in such situations is a critical issue that current technologies urgently need to address.

Summary of the Invention

[0010] This application provides a data processing method to address a problem in the prior art: how to ensure query request execution efficiency when the sharding key of the distributed database table requested in the query request differs from the join field of the table to be joined when generating the query result.

[0011] This application also provides a data processing apparatus, an electronic device, a storage medium, and a computer program product.

[0012] The embodiments of this application adopt the following technical solutions:

[0013] A data processing method, comprising:

[0014] Obtain the query weight of at least one table field of the target database table; the query weight is determined based on the number of queries and the cumulative query time of each query on the at least one table field of the target database table in the distributed database within the target time period.

[0015] From the at least one table field, select the table field whose query weight is greater than a preset weight threshold as the available sharding key;

[0016] If among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key of the target database table, then at least one available sharding key that is different from the original shadow table sharding key is used as the target sharding key to construct a shadow table of the target database table distributed to several data storage nodes of the distributed database, with the target sharding key as the sharding key.

[0017] The original shadow table is a shadow table of the target database table currently stored by several data storage nodes of the distributed database.

[0018] A data processing apparatus, comprising:

[0019] The query weight acquisition unit is used to acquire the query weight of at least one table field of the target database table; the query weight is determined based on the number of queries and the cumulative time of query operations performed on the at least one table field of the target database table in the distributed database within the target time period.

[0020] The sharding key selection unit is used to select, from the at least one table field, the table field whose query weight is greater than a preset weight threshold as the available sharding key;

[0021] If, among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key of the target database table, the shadow table construction unit uses at least one available sharding key that is different from the original shadow table sharding key as the target sharding key to construct a shadow table of the target database table distributed to several data storage nodes of the distributed database, with the target sharding key as the sharding key.

[0022] The original shadow table is a shadow table of the target database table currently stored by several data storage nodes of the distributed database.

[0023] An electronic device includes: a memory and a processor, wherein,

[0024] The memory is used to store computer programs;

[0025] The processor, coupled to the memory, is used to execute the computer program stored in the memory for performing the data processing method described above.

[0026] In another aspect, a computer-readable storage medium storing a computer program is also provided, which, when executed by a computer, can implement any of the above-described data processing methods.

[0027] In yet another aspect, a computer program product is also provided, including a computer program that, when executed by a processor, implements any of the above-described data processing methods.

[0028] The above-described technical solutions adopted in the embodiments of this application can achieve the following beneficial effects:

[0029] This solution creates, in a distributed database, shadow tables that use frequently queried table fields as sharding keys. As a result, the sharding key of the table requested by a subsequent query is more likely to match the table join field used to generate the query result, which improves query execution efficiency in scenarios that use the "computation pushdown" approach.

Attached Figure Description

[0030] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0031] Figure 1 is a flowchart illustrating a specific implementation of a data processing method provided in an embodiment of this application;

[0032] Figure 2 is a schematic diagram illustrating the process of selecting potentially suitable columns as sharding keys in an embodiment of this application;

[0033] Figure 3 is a schematic diagram illustrating the implementation process of counting the number of executions and the cumulative execution time in an embodiment of this application;

[0034] Figure 4 is a schematic diagram illustrating the implementation process of determining the table reconstruction scheme in an embodiment of this application;

[0035] Figure 5 is a schematic diagram illustrating the distribution of the new shadow table to the DN nodes of the distributed database in an embodiment of this application;

[0036] Figure 6 is a schematic diagram illustrating the specific execution process of a query statement targeting a shadow table in an embodiment of this application;

[0037] Figure 7 is a schematic diagram illustrating the process of selecting the shadow table to be queried based on rules and costs in sequence in an embodiment of this application;

[0038] Figure 8 is a schematic diagram illustrating the transformation of redistribution into a pushdown query in an embodiment of this application;

[0039] Figure 9 is a schematic diagram illustrating the use of MVCC to achieve global read consistency in an embodiment of this application;

[0040] Figure 10 is a schematic diagram illustrating the use of a Global Transaction Manager (GTM) to handle multi-stage commit transactions of compute nodes in an embodiment of this application;

[0041] Figure 11 is a schematic diagram illustrating the measures taken to ensure data consistency in an embodiment of this application;

[0042] Figure 12 is a schematic diagram of the specific structure of a data processing device provided in an embodiment of this application;

[0043] Figure 13 is a schematic diagram of the specific structure of a computing device provided in an embodiment of this application.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0045] As will be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.

[0046] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.

[0047] Before introducing the methods provided in the embodiments of this application, some related technologies will be introduced below:

[0048] In a distributed database, a "join key" (also called an association key) is the field that relates the data in two tables to each other. It is primarily used to link and integrate data across different tables, enabling more comprehensive and accurate data queries and analysis.

[0049] By using join keys, relationships can be established between tables, enabling data integration and filtering based on these relationships when querying or processing data. This helps improve the efficiency and accuracy of data queries, as well as achieve data integration and sharing.

[0050] In distributed database systems, the use of join keys is crucial for achieving distributed data storage and retrieval. By designing and using join keys appropriately, efficient data access and management can be achieved, improving the performance and reliability of the entire distributed database system.

[0051] The sharded table referred to in this application refers to a table in a distributed database system that originally contained large amounts of data and needs to be sharded across multiple databases. In this way, each shard contains a portion of the data, and all shards constitute a complete dataset. There is no explicit limit to the size of the table in this application; specific rules can be defined according to the actual situation, such as requiring sharding for database tables whose storage space exceeds a preset threshold.

[0052] As described in the background section, existing technologies pay particular attention to whether the sharding key used when sharding the database table requested in the query request is the same as the table field (i.e., the table join field or join key) to be joined when generating the query result of the query request. If they are the same, theoretically, using a "computation pushdown" method will result in higher query execution efficiency.

[0053] However, in real-world scenarios, a common problem arises where the shard key of the distributed database table requested by a query differs from the join field of the table used to generate the query result. Ensuring query execution efficiency in such situations is a pressing issue that current technologies need to address.

[0054] To address the aforementioned problems in the prior art, embodiments of this application provide a data processing method.

[0055] This method ensures high query efficiency in scenarios where the "computation pushdown" approach is used for querying.

[0056] The execution subject of this method can be any computing device, or software running on the computing device. This application does not limit the type of computing device or software that serves as the execution subject.

[0057] Furthermore, different steps of this method can be implemented by different computing devices or software. This application does not limit which computing device or software is used to implement which step in the embodiments.

[0058] Referring to Figure 1 of the accompanying drawings, which is a flowchart illustrating a specific implementation of a data processing method provided in an embodiment of this application, the method includes the following steps:

[0059] Step 11: Obtain the query weight of at least one table field of the target database table;

[0060] The target database table mentioned in this application embodiment refers to any data table that can be split into sharded tables and stored in a distributed database.

[0061] In one alternative implementation, the target database table may specifically refer to the database table that needs to be sharded and stored. The target database table has many fields and a large amount of data (also referred to as a "large table" in this application). There are no explicit restrictions on the number of fields and the amount of data stored in the target database table, and the rules can be specified according to the actual situation.

[0062] In this embodiment of the application, the reason why the target database table can specifically refer to a large table is that, under normal circumstances, tables with relatively small data volume and simple structure—which can be called small tables—usually do not need to be split when stored in a distributed database, and therefore do not have corresponding sharding keys or sharding tables, thus generally do not have the problems existing in the background technology.

[0063] The data volume of a "large table" may exceed 1 million rows and occupy more than 1GB of storage space. In a distributed database, it is usually split and stored according to the sharding key, which makes it more likely to have the problems mentioned in the background technology.

[0064] Of course, those skilled in the art will understand that the method provided in this application embodiment can be applied to cases where the target database table is a large table or a small table.

[0065] Table fields are columns in a table that define the data types and attributes of the table. For example, a "Users" table might contain table fields such as "User ID", "Name", and "Email".

[0066] In this embodiment, the query weight of a table field refers to an indicator value that reflects whether a table field is suitable as a sharding key. Specifically, the query weight of a table field can be determined based on the number of queries and the cumulative time spent on querying a certain table field of a target database table in the distributed database within a target time period. The target time period mentioned here generally refers to a historical time period, such as the past 30 days.

[0067] The number of queries reflects the query frequency of a table field, while the cumulative query time reflects the resources consumed by the query operations performed on the corresponding table field.

[0068] The higher the query frequency of a table field, the greater the probability that the table field will be used as a join field in the target table; likewise, the more resources the query operations on a table field consume, the more urgent it is to complete those operations and release the occupied resources. Therefore, in this embodiment, both of these factors are considered to determine the query weight of the table field, so that a suitable table field can be selected as the sharding key for sharding the target database table based on the query weight.

[0069] In one optional implementation, taking a table field "order_id" in the target database table as an example, after obtaining the number of queries n and the cumulative query time t for querying the table field "order_id" within the target time period, n and t are normalized according to a preset normalization algorithm to obtain n' and t', and the query weight value Q of the table field "order_id" is calculated according to the following formula:

[0070] Q=αn'+βt'

[0071] Here, α and β are weighting coefficients set for the number of queries and the cumulative time spent on query operations, respectively. The specific values of the weighting coefficients can be set according to actual needs.

[0072] The normalization algorithm mentioned above can be, for example, linear normalization, which can scale data to a specified range, such as between 0 and 1.

[0073] In this embodiment of the application, the magnitude of the query weight value can be positively correlated with both the number of queries n and the cumulative time t of the query operation.
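As a worked illustration of the formula Q = αn' + βt', the following Python sketch applies linear (min-max) normalization and then the weighted sum. Several details are assumptions rather than part of the source: the normalization is taken across the candidate fields of one table, α and β both default to 0.5, and the function names and sample statistics are hypothetical.

```python
def min_max(values):
    """Linearly scale values into [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def query_weights(stats, alpha=0.5, beta=0.5):
    """Compute Q = alpha * n' + beta * t' for each table field.

    stats maps a field name to (query_count n, cumulative_seconds t).
    Assumption: n and t are normalized across the fields in stats.
    """
    fields = list(stats)
    n_norm = min_max([stats[f][0] for f in fields])
    t_norm = min_max([stats[f][1] for f in fields])
    return {f: alpha * n + beta * t for f, n, t in zip(fields, n_norm, t_norm)}


# Hypothetical statistics for three fields over the target time period.
field_stats = {
    "order_id": (900, 120.0),
    "user_id": (500, 300.0),
    "order_date": (100, 30.0),
}
weights = query_weights(field_stats)
print(weights)
```

With positive α and β, the resulting weight increases with both n and t, which matches the positive correlation described above.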

[0074] In one optional implementation, the number of queries and the cumulative time spent on querying table fields within a target time period can be dynamically collected based on a pre-set data validity period threshold. The specific implementation process includes the following sub-steps 1 to 3:

[0075] Sub-step 1: Determine the target time period based on the preset data validity period threshold;

[0076] The length of the target time period is the same as the data validity period threshold; the target time period is the time period between a historical moment and the current moment.

[0077] For example, assuming the data validity period threshold is 30 days, then the target time period is the 30 days closest to the current moment. The current moment here can refer to the system time at the time of executing sub-step 1.

[0078] Sub-step 2: Obtain the number of queries and the cumulative time taken for each query to at least one table field of the target database table in the distributed database within the target time period;

[0079] In this embodiment of the application, in order to meet the requirement of obtaining the number of queries and the cumulative time spent on the query operation, the table fields queried in each response to the query request and the corresponding query time can be recorded. At the same time, a counter can be used to count the number of times the table fields are queried.

[0080] Here, the table fields queried in response to the query request are the table fields to be joined when generating the query result.

[0081] The query time for a table field can be the length of time between the moment the query request is sent to the distributed database and the moment the query result is returned.

[0082] Sub-step 3: Based on the obtained number of queries and the cumulative time consumed by the query operations, determine the query weight of each table field in at least one table field.
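Sub-steps 1 and 2 can be sketched as a filter over a query log followed by per-field aggregation. The log layout (completion time, queried field, elapsed seconds), the function name `collect_stats`, and the sample entries are assumptions for illustration; the source only states that the queried fields and their query times are recorded and counted.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def collect_stats(query_log, now, validity_days=30):
    """Aggregate per-field query counts and cumulative query time.

    query_log: iterable of (finished_at, field, elapsed_seconds) records
    (a hypothetical layout). Only records inside the target time period
    [now - validity_days, now] are counted.
    """
    window_start = now - timedelta(days=validity_days)
    counts = defaultdict(int)
    elapsed = defaultdict(float)
    for finished_at, field, seconds in query_log:
        if window_start <= finished_at <= now:
            counts[field] += 1          # counter for number of queries
            elapsed[field] += seconds   # cumulative query time
    return {field: (counts[field], elapsed[field]) for field in counts}


now = datetime(2024, 6, 20)
log = [
    (datetime(2024, 6, 19), "order_id", 0.5),
    (datetime(2024, 6, 1), "order_id", 1.5),
    (datetime(2024, 4, 1), "order_id", 9.0),   # older than 30 days, ignored
    (datetime(2024, 6, 10), "user_id", 2.0),
]
window_stats = collect_stats(log, now)
print(window_stats)
```

The resulting mapping is the input that sub-step 3 would turn into query weights.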

[0083] In one optional implementation, at least one table field of the target database table mentioned in step 11 may refer to all or some of the table fields of the target database table.

[0084] In an optional implementation, at least one table field of the target database table mentioned in step 11 can be a candidate sharding key from a candidate sharding key selection set determined based on the target database table. Here, a candidate sharding key refers to a table field that can potentially serve as a sharding key; the candidate sharding key selection set is a set of sharding keys composed of candidate sharding keys.

[0085] In an optional implementation, the method provided in this application embodiment may further include the following steps ① to ③, which are used to construct a candidate sharding key selection set:

[0086] Step ①: Create an empty candidate set;

[0087] For example, an empty candidate set can be an empty table.

[0088] Step ②: Identify potential candidate sharding keys from the target database table;

[0089] Specifically, columns that serve as both the primary key and the unique index of the target database table can be identified as potential candidate sharding keys;

[0090] Alternatively, if the target database table does not have a column that serves as both a primary key and a unique index, identify the column that serves as the primary key or unique index of the target database table as a potential candidate sharding key.

[0091] Alternatively, if the target database table does not have a column that serves as a primary key or a unique index, all columns of the target database table can be identified as potential candidate sharding keys.

[0092] In a database table, a primary key is a column or combination of columns used to uniquely identify each row of data. Primary key constraints ensure data integrity by automatically creating a unique index on the primary key column to enforce data uniqueness, thus preventing the insertion of duplicate rows. Each table can only have one primary key, and the value of the primary key must be unique and cannot be NULL. Primary keys can be used to guarantee data integrity and consistency, and can also serve as indexes for tables, facilitating fast lookups.

[0093] A unique index is an index that helps maintain data integrity by ensuring that no two rows in a table have exactly the same key value. As with a primary key, the values covered by a unique index cannot be duplicated. Unique indexes can be used as a means of validating data; for example, in a student table, if the ID number field is explicitly prohibited from being duplicated, a unique index is used. Meanwhile, the student ID field is typically set as the primary key of the student table.

[0094] A primary key ensures that every row in a database table is unique and non-repeating, such as an ID card number or student ID number. A unique index serves the same purpose as a primary key, but a table can only have one primary key, which cannot be null, while multiple unique indexes are allowed. A unique index can contain one null record. For example, in a student table, a school typically uses the student ID number as the primary key and the ID card number as the unique index; in an education bureau, the ID card number is used as the primary key, and the student ID number as the unique index. Therefore, the choice of primary key depends on business requirements.
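The three-tier rule of step ② (columns that are both primary key and unique index first, then columns that are either, then all columns) can be sketched as follows. The function name and the way schema information is passed in are hypothetical; the source does not describe how the schema metadata is obtained.

```python
def potential_candidates(columns, primary_key, unique_indexed):
    """Pick potential candidate sharding keys using the three-tier rule.

    columns: all column names of the target database table.
    primary_key: column names that form the primary key.
    unique_indexed: column names covered by a unique index.
    """
    pk, uq = set(primary_key), set(unique_indexed)
    # Tier 1: columns that are both primary key and unique index.
    both = [c for c in columns if c in pk and c in uq]
    if both:
        return both
    # Tier 2: columns that are either primary key or unique index.
    either = [c for c in columns if c in pk or c in uq]
    if either:
        return either
    # Tier 3: no primary key or unique index exists, so use every column.
    return list(columns)


cols = ["order_id", "user_id", "order_date", "product_id", "quantity"]
print(potential_candidates(cols, ["order_id"], ["order_id", "user_id"]))
print(potential_candidates(cols, ["order_id"], ["user_id"]))
print(potential_candidates(cols, [], []))
```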

[0095] Step ③: Select potential candidate sharding keys that meet the predetermined conditions from the potential candidate sharding keys and add them to the empty candidate set to obtain the candidate sharding key selection set.

[0096] The pre-determined conditions mentioned here include the following:

[0097] The data in the column is of simple data type—because if it were a complex data type, the sharding logic would become complicated and difficult to manage;

[0098] The data in the column must not be empty—because if it is empty, the data may not be correctly allocated to the sharded table;

[0099] The data in the column does not exhibit severe skewness—for example, if more than 95% of the values in a column are "male" and less than 5% are "female," then there is severe skewness.

[0100] The data in the column does not contain characters that are not supported by the sharding function—characters that are not supported by the sharding function may cause sharding errors or query failures.

[0101] A sharding function is an algorithm used to shard database tables.
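The four predetermined conditions can be expressed as a single eligibility check over a column's values. This is a sketch under assumptions that the source leaves open: "simple data type" is taken to mean `int` or `str`, severe skew is taken to mean one value covering more than 95% of rows (borrowing the figure from the example), and the set of characters unsupported by the sharding function is a hypothetical placeholder.

```python
from collections import Counter

SIMPLE_TYPES = (int, str)        # assumption: what counts as a simple type
SKEW_LIMIT = 0.95                # assumption: taken from the 95% example
UNSUPPORTED_CHARS = {"\x00"}     # hypothetical: depends on the sharding function


def is_eligible(values):
    """Return True if a column's values satisfy all four conditions."""
    # Condition: the column must not be empty (no missing values).
    if not values or any(v is None for v in values):
        return False
    # Condition: simple data type only.
    if not all(isinstance(v, SIMPLE_TYPES) for v in values):
        return False
    # Condition: no severe skew toward a single value.
    top_share = Counter(values).most_common(1)[0][1] / len(values)
    if top_share > SKEW_LIMIT:
        return False
    # Condition: no characters the sharding function cannot handle.
    return not any(
        isinstance(v, str) and UNSUPPORTED_CHARS & set(v) for v in values
    )


print(is_eligible([1001, 1002, 1003]))
print(is_eligible(["male"] * 96 + ["female"] * 4))
```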

[0102] It should be noted that in a special case, namely when no candidate sharding key can be selected from the target database table, an optional implementation is to select the sharding key with the highest number of distinct values (NDV) from the sharding keys of the original shadow table as a candidate sharding key and add it to the empty candidate set to obtain the candidate sharding key selection set.
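The NDV-based fallback amounts to counting distinct values per existing sharding key and taking the maximum. The sketch below assumes rows are available as dictionaries and that ties are broken by list order; both the function name and the sample rows are hypothetical.

```python
def highest_ndv_key(original_shard_keys, rows):
    """Return the original sharding key with the most distinct values.

    rows: list of dicts mapping column name to value (assumed layout).
    Ties are resolved in favor of the key listed first.
    """
    return max(original_shard_keys, key=lambda k: len({row[k] for row in rows}))


sample_rows = [
    {"order_date": "2023-03-01", "product_id": "P101"},
    {"order_date": "2023-03-02", "product_id": "P102"},
    {"order_date": "2023-04-03", "product_id": "P102"},
]
print(highest_ndv_key(["product_id", "order_date"], sample_rows))
```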

[0103] The "original shadow table" mentioned here refers to the shadow table of the target database table currently stored on several data storage nodes of the distributed database.

[0104] Taking the target database table as an example, the shadow table of the target database table mentioned in this application refers to the sharded table obtained by horizontally partitioning the target database table according to the sharding key.

[0105] For example, a target database table A containing user order data is shown in Table 1.

[0106] Table 1:

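[0107]

| order_id | user_id | order_date | product_id | quantity |
| --- | --- | --- | --- | --- |
| 1001 | 001 | 2023-03-01 | P101 | 2 |
| 1002 | 002 | 2023-03-02 | P102 | 3 |
| 1003 | 003 | 2023-04-03 | P102 | 1 |
| 1004 | 004 | 2023-04-09 | P103 | 4 |
| 1005 | 005 | 2023-04-10 | P101 | 1 |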

[0108] When the order_date field of the table is used as the sharding key, the target database table A can be split into two tables:

[0109] Data with the order_date field value in March is contained in sub-table 1, as shown in Table 2.

[0110] Table 2:

[0111]
order_id | user_id | order_date | product_id | quantity
1001 | 001 | 2023-03-01 | P101 | 2
1002 | 002 | 2023-03-02 | P102 | 3

[0112] Data with the order_date field value in April is contained in sub-table 2, as shown in Table 3.

[0113] Table 3:

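[0114]
order_id | user_id | order_date | product_id | quantity
1003 | 003 | 2023-04-03 | P102 | 1
1004 | 004 | 2023-04-09 | P103 | 4
1005 | 005 | 2023-04-10 | P101 | 1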

[0115]

[0116] Sub-table 1 and sub-table 2 (Table 2 and Table 3) together form a shadow table with order_date as the sharding key.

[0117] When product_id is used as the sharding key, the target database table A can be divided into three sharded tables. Among them, the data with the product_id field value P101 is included in sharded table a, as shown in Table 4.

[0118] Table 4:

[0119]
order_id | user_id | order_date | product_id | quantity
1001 | 001 | 2023-03-01 | P101 | 2
1005 | 005 | 2023-04-10 | P101 | 1

[0120] The data with the product_id field value P102 is contained in sub-table b, as shown in Table 5.

[0121] Table 5:

[0122]
order_id | user_id | order_date | product_id | quantity
1002 | 002 | 2023-03-02 | P102 | 3
1003 | 003 | 2023-04-03 | P102 | 1

[0123] The data with the product_id field value P103 is contained in sub-table c, as shown in Table 6.

[0124] Table 6:

[0125]
order_id | user_id | order_date | product_id | quantity
1004 | 004 | 2023-04-09 | P103 | 4

[0126] Sharded tables a, b, and c (Tables 4 to 6) together form a shadow table with product_id as the sharding key.
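The splits in Tables 2 to 6 can be reproduced with a short sketch. It is only an illustration, assuming that rows are grouped by a key derived from the sharding column (the calendar month for order_date, the raw value for product_id). The helper name `shard_rows` is hypothetical, and the sharding function actually used by the distributed database is not specified here.

```python
from collections import defaultdict

# Rows of Table 1: (order_id, user_id, order_date, product_id, quantity)
COLUMNS = ("order_id", "user_id", "order_date", "product_id", "quantity")
TABLE_A = [
    ("1001", "001", "2023-03-01", "P101", 2),
    ("1002", "002", "2023-03-02", "P102", 3),
    ("1003", "003", "2023-04-03", "P102", 1),
    ("1004", "004", "2023-04-09", "P103", 4),
    ("1005", "005", "2023-04-10", "P101", 1),
]


def shard_rows(rows, column, key_func=lambda value: value):
    """Group rows into sharded tables by key_func(value of `column`)."""
    index = COLUMNS.index(column)
    shards = defaultdict(list)
    for row in rows:
        shards[key_func(row[index])].append(row)
    return dict(shards)


# Shadow table keyed on order_date: one sharded table per month ("YYYY-MM").
by_month = shard_rows(TABLE_A, "order_date", key_func=lambda d: d[:7])
# Shadow table keyed on product_id: one sharded table per product.
by_product = shard_rows(TABLE_A, "product_id")

for key, rows in by_product.items():
    print(key, [row[0] for row in rows])
```

Both shadow tables hold the same five rows; only the grouping differs, which is what allows a query to pick whichever grouping matches its join field.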

[0127] In this embodiment of the application, if the target database table has been split into shadow tables according to the sharding key (referred to as the original sharding key for ease of description), and the shadow tables are distributed to several data storage nodes of the distributed database, such shadow tables can be called original shadow tables.

[0128] Of course, in this embodiment of the application, the target database table can exist in the distributed database in the form of a sharded table, or the target database table itself can be stored on a DN node of the distributed database.

[0129] Step 12: Select a table field from at least one table field of the target database table as an available sharding key;

[0130] The selected table fields satisfy the following conditions: the query weight is greater than the preset weight threshold; or, the set of table fields formed by the selected table fields satisfies the following condition: the query weight of all table fields in the set and the cumulative query time together satisfy the preset selection conditions.

[0131] Since query weight can reflect whether a table field is suitable as a sharding key, and the size of the query weight value can be positively correlated with both the number of queries n and the cumulative time t of the query operation—the larger the query weight value, the more times the table field is queried and / or the longer the cumulative time of the query operation—in an optional implementation, table fields with query weights greater than a preset weight threshold can be selected as available sharding keys.

[0132] In another alternative implementation, a table field can be selected as the available sharding key in the following manner:

[0133] After calculating the query weight of each table field in at least one table field of the target database table, the table fields can be sorted in descending order according to the query weight.

[0134] Then, the cumulative query time t of the table fields with query weights from largest to smallest is accumulated sequentially until the total cumulative query time reaches a certain proportion (e.g., 80%) of the total cumulative query time T of all table fields in at least one table field. At this point, the table field participating in the accumulation is used as the available sharding key.

[0135] The table fields selected in this way can collectively form a set of table fields, and the query weights and cumulative query operation times of all table fields in the set can collectively satisfy the above selection conditions—that is, the query weights of the selected fields are relatively large among the query weights of each table field, and the sum of the cumulative query operation times reaches a certain proportion of the total cumulative query operation times T of all table fields.

[0136] Adding information about the cumulative time spent on query operations to the selection criteria can further constrain the selection of available sharding keys by the cumulative time spent on query operations. This ensures that while minimizing the resources required for resharding the target database table (i.e., controlling the number of available sharding keys), the resource consumption of subsequent queries is minimized (i.e., the resource consumption for queries on the resharded sharded tables / shadow tables is kept within a certain range).
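The two selection rules of Step 12 can be sketched as follows. This is a minimal illustration, assuming the query weight and cumulative query time of each table field have already been computed; the function names and the sample numbers are hypothetical, and 0.8 mirrors the 80% proportion given as an example above.

```python
def select_by_threshold(weights, threshold):
    """Fields whose query weight exceeds the preset weight threshold."""
    return [field for field, weight in weights.items() if weight > threshold]


def select_by_cumulative_time(weights, times, proportion=0.8):
    """Walk fields in descending weight order, accumulating query time
    until the running total reaches `proportion` of the total time T."""
    total = sum(times.values())
    selected, running = [], 0.0
    for field in sorted(weights, key=weights.get, reverse=True):
        if running >= proportion * total:
            break
        selected.append(field)
        running += times[field]
    return selected


# Hypothetical per-field statistics (not taken from the source).
weights = {"user_id": 9.0, "product_id": 6.0, "order_date": 2.0, "quantity": 0.5}
times = {"user_id": 50.0, "product_id": 35.0, "order_date": 10.0, "quantity": 5.0}

print(select_by_threshold(weights, threshold=5.0))
print(select_by_cumulative_time(weights, times))
```

With these sample values, user_id and product_id together account for 85 of the 100 time units, so the accumulation stops after two fields.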

[0137] Step 13: If among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key of the target database table, then use at least one available sharding key that is different from the original shadow table sharding key as the target sharding key, and construct shadow tables distributed to several data storage nodes of the distributed database, with the target sharding key as the sharding key.

[0138] Specifically, the target data table can be sharded based on the target sharding key to obtain a shadow table, and the target sharding key can be set as the sharding key of the obtained shadow table. Then, the obtained shadow table is stored in several data storage nodes of the distributed database.

[0139] In one optional implementation, the determined available sharding keys can be compared one by one with the sharding keys of the original shadow tables of the target database table to determine the available sharding keys that are different from the original shadow table sharding keys. For example, suppose the target database has two original shadow tables with sharding keys A1 and A2 respectively; and the determined available sharding keys are A2, A3, and A4. Then, after comparison, A3 and / or A4 can be selected as the target sharding key.

[0140] In one alternative implementation, a copy table can be obtained by replicating the target database table (e.g., in a DN currently backed up in a distributed database) to achieve the same data and table structure as the target database table. Then, the shard key of the copy table is set to be the same as a target shard key, thus obtaining a fully configured copy table, which serves as a shadow table of the target database table.

[0141] Generally, when the join field (query field) of the table to be joined when generating the query result is exactly the shard key of the original shadow table, the system can use the "key-value lookup" method to directly use the shard key of the original shadow table to locate the original shadow table where the data is located, thus achieving high query efficiency. This is because the system can quickly find the original shadow table containing the required data without searching the entire distributed database.

[0142] If the query field does not match the shard key, the entire distributed database needs to be searched, which may reduce query efficiency.

[0143] In this embodiment of the application, the above method is used to construct a shadow table of the target database table, which is distributed to several data storage nodes, with at least one available shard key that is different from the original shard key as the target shard key. This enables the following: in a distributed database, there will be at least one shadow table of the target database table with a table field with high query frequency as the shard key.

[0144] Based on this, the sharding key of the target database table requested by the query request has a high probability of being the same as the join field of the table to be joined when the query result is generated, because that join field is likely to be a table field with high query frequency (i.e., the table field used as the sharding key of the shadow table). Therefore, at least in scenarios where queries are performed using the "computation pushdown" method, high query efficiency can be ensured.

[0145] In an optional implementation, the method provided in this application embodiment may further include the following steps:

[0146] If among the identified available sharding keys, there exists an available sharding key that is the same as the original shadow table sharding key, then retain the original shadow table currently stored on several data storage nodes whose sharding keys are the same as the available sharding keys.

[0147] If among the identified available sharding keys, there exists an available sharding key that is different from the original shadow table sharding key, then delete the original shadow tables currently stored on several data storage nodes whose sharding keys are different from the available sharding keys.

[0148] By deleting the original shadow table whose sharding key is different from the available sharding key, shadow tables with low query frequency and / or low query resource consumption can be cleared. This can save storage resources of the distributed database without having too much impact on query efficiency.

[0149] After creating a shadow table, the SQL syntax of the distributed database can be extended to support the creation and management of shadow tables, as well as to support the recognition and processing of SQL statements containing shadow table-related syntax. The specific implementation of this technology can refer to the corresponding modifications made to newly added data tables in distributed databases in related technologies, which will not be elaborated upon in this application's embodiments. Furthermore, for querying shadow tables, the query methods for data tables in distributed databases in related technologies can continue to be used, which will not be elaborated upon in this application's embodiments.

[0150] The following describes a specific implementation process of the method provided in the embodiments of this application in practice.

[0151] Before introducing the specific implementation method, some explanations will be given on the querying of distributed databases using existing technologies.

[0152] With current technology, when querying a distributed database, the query can be sent directly to the Data Node (DN) for execution, with the Compute Node (CN) only forwarding the results. This reduces latency caused by multiple network interactions between the CN and DN. We call this ability to "send queries to the DN for execution" "computation pushdown." In distributed database table join scenarios, the degree of "pushdown" directly affects the execution efficiency of Structured Query Language (SQL) statements. Generally, SQL statement pushdown efficiency is higher when the fields in the table join (JOIN) are the same as the table's sharding key. In "multidimensional computation" scenarios, efficiency issues need to be addressed through "replication" and "redistribution."

[0153] When a table's join fields differ from its sharding key, and the table size is small, "replication" can achieve a "pushdown" effect. In distributed database systems, data broadcasting is widely used for joins between fact tables and dimension tables with inconsistent sharding strategies. Under this strategy, data from the smaller table is broadcast to all data nodes involved in the statement, except for itself. The join is then performed in parallel within each data node, and finally, the compute nodes aggregate the results. Furthermore, the broadcast data is not all fields, but rather a selection of fields involved in the SQL statement, minimizing the amount of data communicated.
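The "replication" strategy can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the behavior of any particular database: the data nodes are modeled as a dictionary of row lists, the function name `broadcast_join` and the sample tables are hypothetical, and the per-node loop stands in for work that would run in parallel on real data nodes.

```python
def broadcast_join(fact_shards, small_table, join_col, needed_cols):
    """Join each fact shard locally against a broadcast copy of small_table."""
    # Broadcast only the columns the statement uses, not the full rows.
    projected = [{c: row[c] for c in needed_cols} for row in small_table]
    lookup = {}
    for row in projected:
        lookup.setdefault(row[join_col], []).append(row)

    results = []
    for node, shard in fact_shards.items():  # one iteration per data node
        local = [
            {**fact, **dim}
            for fact in shard
            for dim in lookup.get(fact[join_col], [])
        ]
        results.extend(local)  # the compute node gathers per-node results
    return results


# Hypothetical data: a sharded fact table and a small dimension table.
fact_shards = {
    "dn1": [{"order_id": "1001", "product_id": "P101"},
            {"order_id": "1002", "product_id": "P102"}],
    "dn2": [{"order_id": "1004", "product_id": "P103"}],
}
products = [
    {"product_id": "P101", "name": "pen", "supplier": "S1"},
    {"product_id": "P102", "name": "ink", "supplier": "S2"},
]

joined = broadcast_join(fact_shards, products, "product_id", ["product_id", "name"])
print(joined)
```

The `supplier` column is never shipped because the statement does not need it, which corresponds to the point above about minimizing the amount of data communicated.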

[0154] When a table's join fields differ from its sharding key, and the table is large, data redistribution is needed to achieve a "pushdown" effect. In this scenario, broadcasting the data of either table is unsuitable, especially with a large number of shards (a larger number of shards means a larger amount of data being broadcast ineffectively). An optimized approach is to re-shard the large table, allowing the join process to be completed within each shard. In a distributed database system environment, this execution method is very similar to map-reduce. However, in practice, its efficiency is often much lower than "pushdown," and it consumes significant network I / O and memory resources.

[0155] Therefore, for large tables, "pushdown" is the most efficient solution, and the resource consumption and response time of "replication" are often acceptable, so replication is not the focus of this optimization. However, pushdown requires users to select the correct sharding key when creating the sharded table. Current solutions require users to determine the sharding key during initial table creation, but users often find it difficult to determine which fields to use as sharding keys at this stage. Therefore, a method is needed that reduces the complexity of designing sharded tables for users and automatically reselects the sharding key and redistributes data based on system load after the system goes live.

[0156] To achieve the above objectives, the data processing method provided in this application embodiment can be used. The specific implementation process of this method in a real-world scenario may include the following parts:

[0157] Part 1: Filter out potential columns that can be used as table sharding keys;

[0158] Please refer to Figure 2. This diagram illustrates the process of filtering potentially suitable columns as sharding keys from all columns of a large table X (equivalent to the target database table described in the previous embodiments), including the following core steps:

[0159] (1) First, establish an empty candidate set;

[0160] (2) Determine whether the large table X has a primary key and a unique index. If both exist and share columns, iterate through the columns in the intersection of the two. If there is no such intersection column, iterate through all the primary key columns or unique index columns. If neither a primary key nor a unique index exists, iterate through all the columns in the table.

[0161] (3) Traverse each column of the large table according to the traversal method described in (2), and add columns that simultaneously meet the following conditions to the candidate set to obtain the sharding key candidate set:

[0162] The column is of a simple data type, is not null, does not have severe skew, and does not contain special characters that are not supported by the sharding function.

[0163] The columns in the candidate sharding key set are the candidate sharding keys described in the previous embodiments.
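Steps (1) to (3) of Part 1 can be sketched as follows. This is an illustration under stated assumptions: the per-column statistics (`type`, `nullable`, `top_value_share`, `has_unsupported_chars`) are hypothetical inputs that a real system would obtain from its catalog and data sampling, the list of simple types is assumed, and the 0.95 skew limit is taken from the "more than 95%" example of severe skewness.

```python
SIMPLE_TYPES = {"int", "bigint", "varchar", "char", "date"}  # assumed list


def columns_to_traverse(all_columns, primary_key, unique_index):
    """Step (2): choose which columns to examine."""
    common = [c for c in primary_key if c in unique_index]
    if common:
        return common
    if primary_key or unique_index:
        # dict.fromkeys removes duplicates while keeping order.
        return list(dict.fromkeys(list(primary_key) + list(unique_index)))
    return list(all_columns)


def build_candidate_set(stats, primary_key=(), unique_index=(), max_top_share=0.95):
    """Steps (1) and (3): start empty, keep columns meeting all conditions."""
    candidates = []
    for name in columns_to_traverse(stats, primary_key, unique_index):
        col = stats[name]
        if (col["type"] in SIMPLE_TYPES
                and not col["nullable"]
                and col["top_value_share"] <= max_top_share
                and not col["has_unsupported_chars"]):
            candidates.append(name)
    return candidates


def col(type_, nullable=False, top=0.01, bad_chars=False):
    """Hypothetical helper that builds one column's statistics."""
    return {"type": type_, "nullable": nullable,
            "top_value_share": top, "has_unsupported_chars": bad_chars}


stats = {
    "order_id": col("bigint"),
    "user_id": col("varchar", top=0.02),
    "gender": col("varchar", top=0.96),       # severe skew
    "payload": col("json"),                   # not a simple type
    "remark": col("varchar", nullable=True),  # may be empty
}

print(build_candidate_set(stats))
print(build_candidate_set(stats, primary_key=("order_id",),
                          unique_index=("order_id", "user_id")))
```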

[0164] Part Two: For columns (candidate sharding keys) in the candidate sharding key set, count the number of executions and the cumulative execution time of the corresponding historical query operations; based on the number of executions and the cumulative execution time, select candidate sharding keys whose query weight is greater than the preset weight threshold from the candidate sharding keys as usable sharding keys;

[0165] After obtaining the candidate set of sharding keys, we can count, for each column (candidate sharding key) in the candidate set, the number of times it is executed (that is, the number of times it is queried) and the cumulative execution time when it is used as a join field (join key). These counts serve as the basis for calculating the weight.

[0166] The specific implementation process can be as shown in Figure 3:

[0167] (1) First, collect the query SQL statements executed in the actual business environment and extract all the tables involved in the SQL statements.

[0168] (2) Iterate through the columns of the tables involved in the table join query (columns of JOIN). Each column has two counters: the number of queries and the query duration. Determine whether the column exists in the candidate set of the shard key of the filtered large table. If it exists, increment the number of queries in the column by 1 and add the duration of the query SQL to the query duration counter of the column.

[0169] (3) The query count and query duration counters have an expiration period threshold, meaning that counts before a certain point in time will be cleared. For example, if the expiration period threshold is 30 days, only the cumulative query duration and cumulative query count for the most recent 30 days will be counted, and counts older than 30 days will be cleared.

[0170] (4) The cumulative number of queries and the cumulative query duration of each column (candidate sharding key) in the candidate sharding key set are used as the basis for calculating the query weight of the column. The specific calculation method can be found in the previous description and will not be repeated here.

[0171] Based on the number of executions and the cumulative execution time, candidate shard keys with a query weight greater than a preset weight threshold can be selected from the candidate shard keys as usable shard keys.
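The per-column counters with an expiration period can be sketched as below. This is a minimal illustration, assuming each executed join query reports its join columns, its duration in seconds, and a timestamp; the class name `JoinColumnStats` and the sample events are hypothetical. Individual events are kept so that counts older than the expiration period can be dropped.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class JoinColumnStats:
    """Query count and cumulative duration per candidate sharding key."""

    def __init__(self, candidate_set, expiry=timedelta(days=30)):
        self.candidate_set = set(candidate_set)
        self.expiry = expiry
        self.events = defaultdict(list)  # column -> [(timestamp, duration)]

    def record(self, join_columns, duration, when):
        """Count one executed query for every join column in the candidate set."""
        for column in join_columns:
            if column in self.candidate_set:
                self.events[column].append((when, duration))

    def totals(self, now):
        """Clear expired events and return {column: (count, total duration)}."""
        cutoff = now - self.expiry
        result = {}
        for column, events in list(self.events.items()):
            live = [(ts, d) for ts, d in events if ts >= cutoff]
            self.events[column] = live
            result[column] = (len(live), sum(d for _, d in live))
        return result


tracker = JoinColumnStats(candidate_set=["user_id", "product_id"])
tracker.record(["user_id", "note"], 2.0, datetime(2024, 6, 29))  # "note" is ignored
tracker.record(["user_id"], 3.0, datetime(2024, 6, 1))
tracker.record(["user_id"], 10.0, datetime(2024, 5, 1))          # older than 30 days

print(tracker.totals(now=datetime(2024, 6, 30)))
```

The resulting count and duration per column are the inputs to the query weight calculation described in the previous embodiments.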

[0172] Part 3: Determine the table restructuring plan and execute the restructuring plan;

[0173] The specific strategy for determining the table restructuring plan is as follows:

[0174] If there is no available sharding key in the sharding key candidate set for large table X, then retain the original shadow table with the most distinct sharding key values (Number of Distinct Value, NDV) and delete the rest of the redundant original shadow tables of large table X.

[0175] If the large table X has N available sharding keys (N>=1) and currently has M original shadow tables (M>=1), and some available sharding keys are the same as the sharding keys of some original shadow tables, then these original shadow tables are retained (i.e., they are not deleted).

[0176] If the sharding keys of some original shadow tables of large table X are not available sharding keys, then delete these original shadow tables.

[0177] If some available sharding keys do not have a corresponding original shadow table, then the sharding key conversion scheme is executed.
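The strategy above reduces to a comparison between the available sharding keys and the sharding keys of the original shadow tables. The sketch below is one possible reading of it; the function name `plan_restructure` and the NDV figures are hypothetical, and the keys A1 to A4 reuse the naming of the earlier example.

```python
def plan_restructure(available_keys, original_shadow_keys, ndv):
    """Return (retain, delete, create) lists of sharding keys.

    `ndv` maps each original sharding key to its number of distinct values.
    """
    available = list(available_keys)
    originals = list(original_shadow_keys)
    if not available:
        # No available key: keep only the original shadow table whose
        # sharding key has the highest NDV and delete the rest.
        keep = max(originals, key=lambda k: ndv[k])
        return [keep], [k for k in originals if k != keep], []
    retain = [k for k in originals if k in available]
    delete = [k for k in originals if k not in available]
    create = [k for k in available if k not in originals]
    return retain, delete, create


ndv = {"A1": 10, "A2": 500}  # hypothetical NDV statistics

print(plan_restructure(["A2", "A3", "A4"], ["A1", "A2"], ndv))
print(plan_restructure([], ["A1", "A2"], ndv))
```

Keys in the `create` list are the ones for which the sharding key conversion scheme is executed.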

[0178] In one alternative implementation, the process of performing the sharding key conversion scheme includes:

[0179] Based on available sharding keys for which there are no corresponding original shadow tables, a new shadow table is created. Data is then synchronized from any original shadow table to the new shadow table. After synchronization, any redundant original shadow tables (i.e., those whose sharding keys are not available) are deleted. The specific steps for converting the sharding key are as follows:

[0180] (1) Create a new shadow table using the available sharding key as the sharding key;

[0181] (2) Scan any existing shadow table, store the data in the existing shadow table into the newly created shadow table, and record all new operations on the existing shadow table in a log file during this process.

[0182] (3) Apply the new operations from the log file to the new shadow table. At this point, the new shadow table has completed data synchronization. Ultimately, the data in each shadow table is consistent, but the sharding keys are different.

[0183] (4) Delete redundant original shadow tables (i.e., original shadow tables whose sharding key is not a usable sharding key).
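Steps (1) to (3) can be simulated in memory as follows. This is a sketch under several assumptions that are not in the source: rows are dictionaries with `order_id` as primary key, the sharding function is a CRC32 hash modulo the shard count, and writes that arrive during the copy are passed in as a list of `(operation, row)` pairs. Step (4), deleting redundant original shadow tables, is not modeled.

```python
import zlib


def shard_of(value, num_shards):
    """Assumed sharding function: CRC32 of the value modulo the shard count."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


class ShadowTable:
    def __init__(self, sharding_key, num_shards=2):
        self.sharding_key = sharding_key
        self.shards = [dict() for _ in range(num_shards)]  # order_id -> row

    def apply(self, op, row):
        """Apply an insert, update, or delete identified by row['order_id']."""
        for shard in self.shards:
            shard.pop(row["order_id"], None)  # drop any older version
        if op != "delete":
            target = shard_of(row[self.sharding_key], len(self.shards))
            self.shards[target][row["order_id"]] = row

    def rows(self):
        all_rows = (r for shard in self.shards for r in shard.values())
        return sorted(all_rows, key=lambda r: r["order_id"])


def convert_sharding_key(existing, new_key, concurrent_ops):
    new_table = ShadowTable(new_key, len(existing.shards))  # step (1)
    snapshot = existing.rows()                               # step (2): scan
    log = []
    for op, row in concurrent_ops:    # writes arriving during the copy
        existing.apply(op, row)
        log.append((op, row))         # recorded in the log
    for row in snapshot:
        new_table.apply("insert", row)
    for op, row in log:                                      # step (3): replay
        new_table.apply(op, row)
    return new_table


existing = ShadowTable("order_date")
for r in [
    {"order_id": "1001", "order_date": "2023-03-01", "product_id": "P101", "quantity": 2},
    {"order_id": "1002", "order_date": "2023-03-02", "product_id": "P102", "quantity": 3},
    {"order_id": "1003", "order_date": "2023-04-03", "product_id": "P102", "quantity": 1},
]:
    existing.apply("insert", r)

ops = [
    ("insert", {"order_id": "1006", "order_date": "2023-04-11", "product_id": "P103", "quantity": 7}),
    ("update", {"order_id": "1001", "order_date": "2023-03-01", "product_id": "P101", "quantity": 5}),
    ("delete", {"order_id": "1002"}),
]
new_table = convert_sharding_key(existing, "product_id", ops)
print([r["order_id"] for r in new_table.rows()])
```

After the replay, both shadow tables contain the same rows and differ only in how those rows are placed on shards.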

[0184] In one optional implementation, the specific implementation of "using the cumulative number of queries and cumulative query duration of each column (candidate sharding key) in the candidate sharding key set as the basis for calculating the query weight of the column, and the above-mentioned third part" can be as shown in Figure 4.

[0185] In Figure 4, the candidate columns represent the available sharding keys. In the process shown in Figure 4, the query weights of candidate sharding keys can be calculated first and sorted in descending order. Then, the execution times of candidate sharding keys with decreasing query weights are accumulated sequentially until the accumulated execution time reaches 80% of the total execution time of all candidate sharding keys. At this point, the candidate sharding keys included in the accumulation are considered as usable sharding keys. Then, the relationship between the number of usable sharding keys N and the number of original shadow tables M is used to determine the table reconstruction scheme.

[0186] Part 4: Distribute the newly created shadow tables to the various DN nodes of the distributed database.

[0187] As shown in Figure 5, assume that shadow tables with f1 and f2 as sharding keys are created for the large table TBL_B; these two new shadow tables are distributed to the various DN nodes of the distributed database.

[0188] The following section describes the specific implementation of CRUD operations on a distributed database after adopting the data processing method provided in the embodiments of this application.

[0189] 1. Perform a query operation

[0190] After data synchronization is complete, the data in each shadow table is consistent. Therefore, selecting different shadow tables during queries can ensure the accuracy of the results.

[0191] The specific execution process of the query statement is shown in Figure 6. In Figure 6:

[0192] The first step is to modify the syntax rules and global data dictionary of the distributed database (highlighted in red) so that SQL statements targeting the newly created shadow tables can be parsed.

[0193] The second step is to modify the statement rewriting module (the part in red box that is statement rewriting) so that the query rewriting module can identify which shadow table to use for the specific query.

[0194] The first and second steps mentioned above can be found in related technologies on how to modify the syntax rules, global data dictionary, and statement rewriting module after adding a new data table in a distributed database, and will not be repeated here.

[0195] Based on the above modifications, a query on the shadow tables of a large table can proceed as shown in Figure 7. The first step is to try to select a suitable shadow table based on rules to generate an execution plan. For ease of understanding, it is assumed that TBL_B has only two shadow tables, with f1 and f2 as sharding keys respectively. When a suitable shadow table cannot be selected based on rules (i.e., the join key is neither f1 nor f2, or both f1 and f2 are join keys), a cost-based optimizer can be used: by comparing costs, the shadow table with the lower cost is selected as the shadow table to be queried.
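The selection logic can be sketched as a rule check followed by a cost comparison. This is only an illustration: the function name `choose_shadow_table` and the fixed cost table are hypothetical, and a real optimizer would estimate cost per query rather than per sharding key.

```python
def choose_shadow_table(join_keys, shadow_tables, estimate_cost):
    """Pick a shadow table; `shadow_tables` maps sharding key -> table name."""
    matches = [k for k in join_keys if k in shadow_tables]
    if len(matches) == 1:
        return shadow_tables[matches[0]]  # rule-based choice
    # No match or several matches: fall back to comparing estimated costs.
    candidates = matches or list(shadow_tables)
    best = min(candidates, key=estimate_cost)
    return shadow_tables[best]


shadow_tables = {"f1": "TBL_B'", "f2": "TBL_B''"}
costs = {"f1": 120.0, "f2": 80.0}  # hypothetical cost estimates

print(choose_shadow_table(["f1"], shadow_tables, costs.get))
print(choose_shadow_table(["f3"], shadow_tables, costs.get))
print(choose_shadow_table(["f1", "f2"], shadow_tables, costs.get))
```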

[0196] When the optimizer automatically selects a suitable shadow table as the query object based on cost, as shown in Figure 8, the original redistribution can be automatically transformed into a more efficient pushdown query, thereby improving query efficiency.

[0197] It is important to note that distributed databases typically use Multiversion Concurrency Control (MVCC) to achieve global read consistency. After shadow tables are introduced, MVCC can be applied to the shadow tables selected by the optimizer to achieve global read consistency, essentially in the same way as with regular tables, as shown in Figure 9.

[0198] In Figure 9, if a query requires f1 as the join key, the optimizer will select the shadow table TBL_B' for the query and ensure global read consistency through the global transaction ID and global snapshot. Similarly, if a query uses f2 as the join key, the optimizer will select the shadow table TBL_B'' for the query, ensuring that the join key is the same as the sharding key and also satisfying global MVCC.

[0199] After creating the new shadow table, DML operations on TBL_B will change compared to traditional tables. Each DML operation requires "distributed transactions" to ensure data consistency between TBL_B' and TBL_B''. DML stands for Data Manipulation Language, a programming language used for database operations, performing access operations on objects and data within the database. It is usually a subset of database-specific programming languages, such as SQL, the standard language in the information software industry, which centers on the INSERT, UPDATE, and DELETE commands, representing insertion (meaning adding or creating), update (modifying), and deletion (destroying), respectively.

[0200] In this embodiment, multi-phase commit can be introduced to ensure consistency between shadow tables. Specifically, the distributed transaction processing capabilities on the compute nodes and the global transaction manager can be utilized to achieve data consistency between TBL_B' and TBL_B", thereby ensuring that transactions related to TBL_B are atomic.

[0201] As shown in Figure 10, a Global Transaction Manager (GTM standby) can be used to handle multi-phase commit transactions on compute nodes, including the allocation, release, and querying of executed GTIDs (Executed_Gtid_Set, i.e., already executed transaction codes). Specifically, the compute node coordinates the transaction operations of each participant, and the Global Transaction Manager manages the state of the global transaction. Two-phase commit ensures that the final operations either all succeed or all roll back.

[0202] After adding a new shadow table, in addition to ensuring the consistency of each new shadow table within each data node, it is also necessary to ensure data consistency across multiple shadow tables. Measures to ensure data consistency can be as shown in Figure 11. Generally, the transactions on both TBL_B' and TBL_B'' must complete so that they either both succeed or both roll back. The control logic of the two-phase commit changes from being related to the number of data nodes to being related to the product of the number of data nodes and the number of shadow tables. This introduces some complexity. Therefore, this scheme is suitable for scenarios where query operations are frequent but DML operations are infrequent.
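The effect of shadow tables on two-phase commit can be illustrated with a toy coordinator. This is a sketch, not the transaction protocol of any specific database: the `Participant` class, the `failing` parameter, and the node names are hypothetical, and one participant is created per combination of data node and shadow table to reflect the product relationship described above.

```python
class Participant:
    """One shadow table's portion of the transaction on one data node."""

    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare
        self.state = "idle"

    def prepare(self):
        self.state = "prepared" if self.will_prepare else "failed"
        return self.will_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Phase 1: everyone prepares. Phase 2: commit all or roll back all."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


def make_participants(data_nodes, shadow_tables, failing=()):
    return [
        Participant(f"{table}@{node}", (table, node) not in failing)
        for node in data_nodes
        for table in shadow_tables
    ]


nodes = ["dn1", "dn2", "dn3"]
tables = ["TBL_B'", "TBL_B''"]

ok = make_participants(nodes, tables)
print(len(ok), two_phase_commit(ok))

bad = make_participants(nodes, tables, failing={("TBL_B''", "dn2")})
print(two_phase_commit(bad), {p.state for p in bad})
```

With three data nodes and two shadow tables the coordinator tracks six participants instead of three, which is the added DML overhead that makes the scheme better suited to query-heavy workloads.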

[0203] When multiple transactions concurrently access a database, situations may arise where the same data is read and modified. To prevent disruption of database consistency, concurrency control is necessary. For transactional databases, locks are typically applied to data objects. Lock granularity can be logical units such as columns, rows, tables, and index entries, or physical units such as data pages and index pages. Locks can be categorized by strength, such as shared locks, exclusive locks, and intention locks. When shadow tables are introduced, the database engine should implement separate locking for each shadow table.

[0204] In summary, the solution provided in this application is applicable to distributed databases. For example, if a large table (with many columns and a large amount of data) has multiple fields (typically <= 3; if > 3, this solution would increase DML operation overhead) that are frequently used as join fields in queries with other tables, and these join fields change with business requirements, then multiple shadow tables can be automatically created using the join key as the sharding key. These shadow tables can be automatically added or deleted based on changes in the join fields. When a query needs to be performed using a different join key, the query operation is executed on the shadow table using that join key as the sharding key, thereby significantly improving query efficiency. Simultaneously, this solution still ensures the correctness of database DML operations.

[0205] It should be noted that the specific implementation processes shown in Figures 6 to 11 can be found by referring to the implementation of relevant mature technologies, and will not be elaborated here.

[0206] To address the issue in existing technologies of ensuring query execution efficiency when the sharding key of the distributed database table requested in the query request differs from the join fields of the table used to generate the query result, this application also provides a data processing apparatus, as shown in Figure 12, including:

[0207] The query weight acquisition unit 121 is used to acquire the query weight of at least one table field of the target database table; the query weight is determined based on the number of queries and the cumulative time consumed by the query operations for each of the at least one table field of the target database table in the distributed database within the target time period.

[0208] The available sharding key selection unit 122 is used to select a table field as an available sharding key from the at least one table field; wherein the selected table field satisfies: the query weight is greater than a preset weight threshold; or, the set of table fields formed by the selected table fields satisfies: the query weight and cumulative query operation time of all table fields in the set of table fields together satisfy the preset selection conditions;

[0209] The shadow table construction unit 123 is used to construct a shadow table of the target database table, which is distributed to several data storage nodes of the distributed database and uses the target shard key as the shard key, if there is an available shard key that is different from the original shadow table shard key of the target database table among the determined available shard keys.

[0210] The original shadow table is a shadow table of the target database table currently stored by several data storage nodes of the distributed database.

[0211] In an optional implementation, the apparatus provided in this application embodiment may further include: an original shadow table processing unit. The original shadow table processing unit is used for:

[0212] If among the determined available sharding keys, there is an available sharding key that is the same as the original shadow table sharding key, then the original shadow table currently stored by the several data storage nodes that has the same sharding key as the available sharding key is retained;

[0213] If among the identified available sharding keys, there exists an available sharding key that is different from the original shadow table sharding key, then delete the original shadow table currently stored by the plurality of data storage nodes whose sharding key is different from the available sharding key.

[0214] In one optional implementation, the query weight acquisition unit 121 can specifically be used for:

[0215] The target time period is determined based on a preset data validity period threshold; the length of the target time period is the same as the data validity period threshold; the target time period is the time period between a historical moment and the current moment.

[0216] Obtain the number of queries and the cumulative time consumed by each query on at least one table field of the target database table in the distributed database within the target time period;

[0217] Based on the number of queries and the cumulative time consumed by the query operations, the query weight of each table field in the at least one table field is determined.

[0218] In an optional implementation, the at least one table field is a candidate sharding key from a candidate sharding key selection set determined based on the target database table; therefore, the apparatus provided in this application embodiment may further include:

[0219] The candidate sharding key determination unit is used for:

[0220] Create an empty candidate set;

[0221] Identify potential candidate sharding keys from the target database tables;

[0222] From the potential candidate sharding keys, select the potential candidate sharding keys that meet the predetermined conditions as candidate sharding keys and add them to the empty candidate set to obtain the candidate sharding key selection set.

[0223] The process of identifying potential candidate sharding keys from the target database table includes: identifying columns that serve as both primary keys and unique indexes of the target database table as potential candidate sharding keys; when no such columns exist in the target database table, identifying columns that serve as either primary keys or unique indexes as potential candidate sharding keys; and when the target database table has neither primary key columns nor unique index columns, identifying all columns in the target database table as potential candidate sharding keys.

[0224] The predetermined conditions include: the data type in the column is a simple data type; the data in the column is not empty; the data in the column does not have severe skewness; and the data in the column does not contain characters that are not supported by the slicing function.
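The candidate selection described above can be sketched as follows. The `Column` metadata fields, the set of "simple" data types, and the skew limit are all hypothetical values chosen for the illustration; "not empty" is assumed here to mean that the column contains no null values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    """Column metadata (hypothetical; a real system would read this from the catalog)."""
    name: str
    data_type: str
    is_primary_key: bool = False
    is_unique_index: bool = False
    has_nulls: bool = False
    max_value_share: float = 0.0  # share of rows held by the most common value
    has_unsupported_chars: bool = False


SIMPLE_TYPES = {"int", "bigint", "char", "varchar", "date"}  # assumed list
SKEW_LIMIT = 0.2  # assumed threshold for "severe" skew


def potential_candidates(columns: List[Column]) -> List[Column]:
    # Tier 1: columns that are both the primary key and a unique index.
    both = [c for c in columns if c.is_primary_key and c.is_unique_index]
    if both:
        return both
    # Tier 2: columns that are either the primary key or a unique index.
    either = [c for c in columns if c.is_primary_key or c.is_unique_index]
    if either:
        return either
    # Tier 3: fall back to every column of the table.
    return list(columns)


def meets_conditions(c: Column) -> bool:
    return (c.data_type in SIMPLE_TYPES
            and not c.has_nulls
            and c.max_value_share <= SKEW_LIMIT
            and not c.has_unsupported_chars)


def candidate_selection_set(columns: List[Column]) -> List[str]:
    candidates: List[str] = []  # start from an empty candidate set
    for c in potential_candidates(columns):
        if meets_conditions(c):
            candidates.append(c.name)
    return candidates
```

Note that the tiers are exclusive: once a higher tier yields any column, lower tiers are not consulted, even if the higher-tier columns later fail the predetermined conditions.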

[0225] In an optional implementation, the candidate sharding key determination unit can also be used to: if no candidate sharding key can be selected from the target database table, select, from the sharding keys of the original shadow tables, the sharding key with the highest number of distinct values (NDV) as the candidate sharding key and add it to the empty candidate set to obtain the candidate sharding key selection set.
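A minimal sketch of this fallback is shown below. The function name and the `ndv` mapping (sharding key name to number of distinct values) are hypothetical; how NDV statistics are collected is not specified here and is assumed to be available from the database.

```python
from typing import Dict, List


def selection_set_with_fallback(candidates: List[str],
                                original_shadow_keys: List[str],
                                ndv: Dict[str, int]) -> List[str]:
    """Return the candidate sharding key selection set, using the NDV fallback
    only when no candidate could be selected from the target table."""
    selection = list(candidates)
    if not selection and original_shadow_keys:
        # Pick the original shadow-table sharding key with the most distinct
        # values; keys without statistics are treated as NDV 0 (an assumption).
        best = max(original_shadow_keys, key=lambda k: ndv.get(k, 0))
        selection.append(best)
    return selection
```

A higher NDV generally spreads rows more evenly across shards, which is presumably why it is used as the tie-breaker among the existing sharding keys.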

[0226] Using the apparatus provided in this application, a shadow table can be constructed in a distributed database with a frequently queried table field as its sharding key. As a result, the sharding key of the target database table requested by a subsequent query request is highly likely to be the same as the associated field of the table to be joined when the query result is generated, which helps ensure query request execution efficiency in scenarios where queries are performed using the "computation pushdown" method.
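The way the shadow table construction unit and the original shadow table processing unit act on the selected keys can be sketched as a small planning function. This is an illustrative reading only: it covers the threshold-based selection branch (query weight greater than a preset weight threshold), the function name and the returned dictionary layout are hypothetical, and deletion is applied only when at least one new sharding key is found, which is one interpretation of the conditions recited for these units.

```python
from typing import Dict, List


def plan_shadow_tables(weights: Dict[str, float],
                       weight_threshold: float,
                       original_keys: List[str]) -> Dict[str, List[str]]:
    """Decide which shadow tables to build, retain, and delete.

    weights: query weight per candidate table field.
    original_keys: sharding keys of the shadow tables currently stored.
    """
    # Available sharding keys: fields whose weight exceeds the preset threshold.
    available = {f for f, w in weights.items() if w > weight_threshold}
    original = set(original_keys)

    build = sorted(available - original)   # new target sharding keys
    retain = sorted(available & original)  # existing shadow tables still useful
    # ASSUMPTION: obsolete shadow tables are dropped only when a new key exists.
    delete = sorted(original - available) if build else []
    return {"build": build, "retain": retain, "delete": delete}
```

For example, with weights `{"customer_id": 0.6, "policy_no": 0.3, "region": 0.05}`, a threshold of `0.2`, and existing shadow tables sharded by `policy_no` and `region`, the plan builds a shadow table on `customer_id`, retains the one on `policy_no`, and deletes the one on `region`.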

[0227] Based on the same inventive concept as the preceding embodiments, this application also provides an electronic device to address the prior-art difficulty of ensuring query request execution efficiency when the sharding key of the distributed database table requested in a query request differs from the associated field of the table to be joined when the query result is generated.

[0228] As shown in Figure 13, the electronic device includes a memory 131 and a processor 132. The memory 131 can be configured to store various other data to support operation on the electronic device. Examples of such data include instructions for any application or method used to operate on the electronic device. The memory 131 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0229] The processor 132, coupled to the memory 131, is used to execute the program stored in the memory 131 so as to perform the data processing method described in the embodiments of this application.

[0230] When the processor 132 executes the program in the memory 131, in addition to the functions described above, it can also perform other functions, as detailed in the descriptions of the preceding embodiments.

[0231] Furthermore, as shown in Figure 13, the electronic device also includes other components such as a communication component 133, a display 134, a power supply component 135, and an audio component 136. Figure 13 shows only some of the components schematically; it does not mean that the electronic device includes only the components shown in Figure 13.

[0232] Accordingly, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor of a computing device, can implement the data processing methods provided in the above embodiments.

[0233] Accordingly, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the data processing method and the distributed database data query method provided in the above embodiments.

[0234] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0235] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0236] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A data processing method, characterized in that it comprises:
obtaining the query weight of at least one table field of a target database table, wherein the query weight is determined based on the number of queries and the cumulative time consumed by the query operations for each of the at least one table field of the target database table in a distributed database within a target time period;
selecting a table field from the at least one table field as an available sharding key, wherein the selected table field satisfies: the query weight is greater than a preset weight threshold; or the set of table fields formed by the selected table fields satisfies: the query weights and cumulative query operation times of all table fields in the set together satisfy a preset selection condition;
if, among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key of the target database table, using at least one available sharding key that is different from the original shadow table sharding key as a target sharding key, and constructing a shadow table of the target database table that is distributed to several data storage nodes of the distributed database and uses the target sharding key as its sharding key, wherein the original shadow table is a shadow table of the target database table currently stored by the several data storage nodes of the distributed database;
if, among the determined available sharding keys, there is an available sharding key that is the same as the original shadow table sharding key, retaining the original shadow table, currently stored by the several data storage nodes, whose sharding key is the same as the available sharding key; and
if, among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key, deleting the original shadow table, currently stored by the several data storage nodes, whose sharding key is different from the available sharding key.

2. The method as described in claim 1, characterized in that obtaining the query weight of at least one table field of the target database table comprises:
determining the target time period based on a preset data validity period threshold, wherein the length of the target time period is the same as the data validity period threshold, and the target time period is the time period between a historical moment and the current moment;
obtaining the number of queries and the cumulative time consumed by the query operations for each of the at least one table field of the target database table in the distributed database within the target time period; and
determining, based on the number of queries and the cumulative time consumed by the query operations, the query weight of each of the at least one table field.

3. The method as described in claim 1, characterized in that the at least one table field is a candidate sharding key from a candidate sharding key selection set determined based on the target database table, and the method further comprises:
creating an empty candidate set;
identifying potential candidate sharding keys from the target database table; and
selecting, from the potential candidate sharding keys, those that meet predetermined conditions as candidate sharding keys and adding them to the empty candidate set to obtain the candidate sharding key selection set;
wherein identifying potential candidate sharding keys from the target database table comprises: identifying columns that serve as both the primary key and a unique index of the target database table as potential candidate sharding keys; if no column in the target database table serves as both the primary key and a unique index, identifying columns that serve as either the primary key or a unique index as potential candidate sharding keys; and if no column in the target database table serves as either the primary key or a unique index, identifying all columns in the target database table as potential candidate sharding keys;
and wherein the predetermined conditions include: the data type of the column is a simple data type; the data in the column is not empty; the data in the column is not severely skewed; and the data in the column does not contain characters that are not supported by the sharding function.

4. The method as described in claim 3, characterized in that the method further comprises: if no candidate sharding key can be selected from the target database table, selecting, from the sharding keys of the original shadow table, the sharding key with the highest number of distinct values (NDV) as the candidate sharding key and adding it to the empty candidate set to obtain the candidate sharding key selection set.

5. A data processing apparatus, characterized in that it comprises:
a query weight acquisition unit, used to acquire the query weight of at least one table field of a target database table, wherein the query weight is determined based on the number of queries and the cumulative time consumed by the query operations for each of the at least one table field of the target database table in a distributed database within a target time period;
a sharding key selection unit, used to select a table field from the at least one table field as an available sharding key, wherein the selected table field satisfies: the query weight is greater than a preset weight threshold; or the set of table fields formed by the selected table fields satisfies: the query weights and cumulative query operation times of all table fields in the set together satisfy a preset selection condition;
a shadow table construction unit, used to: if, among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key of the target database table, construct a shadow table of the target database table that is distributed to several data storage nodes of the distributed database and uses the target sharding key as its sharding key, wherein the original shadow table is a shadow table of the target database table currently stored by the several data storage nodes of the distributed database; and
an original shadow table processing unit, used to: if, among the determined available sharding keys, there is an available sharding key that is the same as the original shadow table sharding key, retain the original shadow table, currently stored by the several data storage nodes, whose sharding key is the same as the available sharding key; and if, among the determined available sharding keys, there is an available sharding key that is different from the original shadow table sharding key, delete the original shadow table, currently stored by the several data storage nodes, whose sharding key is different from the available sharding key.

6. An electronic device, characterized in that it comprises a memory and a processor, wherein: the memory is used to store a computer program; and the processor, coupled to the memory, is configured to execute the computer program stored in the memory so as to perform the data processing method according to any one of claims 1 to 4.

7. A computer-readable storage medium storing a computer program, which, when executed by a computer, enables the implementation of the data processing method according to any one of claims 1 to 4.

8. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the data processing method according to any one of claims 1 to 4.