Distributed transaction processing method, distributed database system and computing node thereof
By utilizing the row lock management and caching mechanism of MySQL storage nodes in a distributed database system and managing distributed transactions based on globally unique timestamps, the problems of memory overflow and data inconsistency in large-scale transaction processing are solved, achieving efficient and simplified distributed transaction processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAN TONGXING HENGYAO INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2026-04-28
- Publication Date
- 2026-06-26
AI Technical Summary
Modern distributed database systems are prone to memory overflow due to insufficient memory when handling large-scale transactions. Existing methods require large transactions to be broken down into smaller transactions, which increases the complexity of application development and may lead to data inconsistency.
By leveraging the transaction and computational capabilities of relational data storage nodes, managing distributed transactions through globally unique timestamps, and utilizing the row-lock management and caching mechanisms of MySQL storage nodes, the complexity of computing nodes is reduced, thereby achieving distributed transaction processing.
It effectively avoids memory overflow, simplifies transaction processing, reduces the burden on computing nodes, and ensures data consistency.
Smart Images

Figure CN122285227A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computers, and in particular to a distributed transaction processing method, a distributed database system, and computing nodes thereof. Background Technology
[0002] Modern distributed database systems, such as TiDB and Google Spanner, use key-value (KV) storage. These KV stores only provide atomicity support for single-row operations and lack basic capabilities such as two-phase commit. To achieve distributed consistency across multiple rows, additional features such as two-phase commit, primary key, and non-primary key mechanisms are needed, making implementation complex.
[0003] When TiDB processes transactions, it needs to read all transaction-related data from the storage layer and cache it in the computation layer, where the data processing tasks are then completed. This can lead to Out of Memory (OOM) errors if the transaction is large. Therefore, TiDB users typically need to break large transactions down into several smaller transactions. However, this increases application development complexity; furthermore, multiple smaller transactions are semantically inequivalent to a large transaction, requiring the business logic itself to ensure data consistency. Summary of the Invention
[0004] This disclosure proposes to implement distributed transaction processing based on the transaction and computing capabilities of relational data storage nodes, thereby reducing the complexity of computing nodes.
[0005] This disclosure provides a distributed transaction processing method for computing nodes in a distributed database system, comprising: responding to a distributed transaction from a client, obtaining a globally unique start timestamp, wherein the distributed transaction is associated with multiple relational data storage nodes, and each relational data storage node is used to execute a branch transaction of the distributed transaction; starting the distributed transaction by sending the start timestamp as an identifier of the distributed transaction to the multiple relational data storage nodes; in the write phase, based on the start timestamp, instructing any relational data storage node involved in the write operation to apply a row lock in the row lock management table and write the data to be written to its local data storage table; in the pre-commit phase, obtaining a first commit timestamp, instructing all relational data storage nodes involved in the write operation to update their respective local data storage tables based on the first commit timestamp, and pre-committing the distributed transaction; and in the commit phase, instructing all relational data storage nodes involved in the write operation to commit the distributed transaction so that all write operations are globally effective.
[0006] In some embodiments, instructing any relational data storage node involved in a write operation to apply a row lock in a row lock management table includes: instructing the relational data storage node involved in the write operation to perform: determining whether a user table primary key involved in the write operation exists in the row lock management table; if it does not exist, inserting the user table primary key into the row lock management table and setting the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; if it exists, updating the transaction identifier locking the user table primary key.
[0007] In some embodiments, updating the transaction identifier locking the user table primary key includes: if the transaction identifier locking the user table primary key is less than the start timestamp of the current branch transaction, waiting for the row lock on the user table primary key to be released, and after release, updating the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; or, if the transaction identifier locking the user table primary key is greater than the start timestamp of the current branch transaction, obtaining the update timestamp, using the update timestamp to re-acquire the row lock on the user table primary key, and after successful acquisition, updating the transaction identifier locking the user table primary key to the update timestamp.
[0008] In some embodiments, the data storage table includes a persistent data storage table or a temporary data storage table. Instructing the data to be written to the local data storage table includes: instructing the data to be written to the local persistent data storage table, wherein the written but uncommitted data is maintained through the caching mechanism and transaction isolation mechanism of the relational data storage node involved in the write operation, so that the written but uncommitted data is only visible to the current branch transaction; or, instructing the data to be written to the local temporary data storage table, wherein the temporary data storage table has the same structure as the persistent data storage table and is only visible to the current branch transaction.
[0009] In some embodiments, instructing all relational data storage nodes involved in write operations to update their respective local data storage tables based on the first commit timestamp includes: for the case where the data to be written is written to a local persistent data storage table, instructing all relational data storage nodes involved in write operations to update the commit timestamp parameter in their respective local persistent data storage tables to a second commit timestamp; or, for the case where the data to be written is written to a local temporary data storage table, instructing all relational data storage nodes involved in write operations to synchronize the data in their respective local temporary data storage tables to their respective local persistent data storage tables, and update the commit timestamp parameter in their respective local persistent data storage tables to a second commit timestamp; wherein the second commit timestamp is obtained by left-shifting the first commit timestamp by one bit and performing a bitwise OR operation with a deletion flag, the deletion flag using odd and even numbers to represent deletion operations and non-deletion operations respectively, the non-deletion operations including insertion operations and update operations.
[0010] In some embodiments, the method further includes: during the read phase, based on the start timestamp, instructing the relational data storage node involved in the read operation to perform: in response to the read operation and the detection of a row lock on the data to be read, if the transaction ID holding the row lock is less than the start timestamp of the current branch transaction, if the transaction holding the row lock is not currently executing a locking statement, wait for the transaction holding the row lock to commit, and then read the data to be read from the historical snapshot corresponding to the start timestamp; if the transaction holding the row lock is currently executing a locking statement, directly read the data to be read from the historical snapshot corresponding to the start timestamp.
[0011] In some embodiments, the method further includes: during the read phase, based on the start timestamp, instructing the relational data storage node involved in the read operation to perform: detecting that the read operation is related to data already written in the current branch transaction, querying the local data storage table and row lock management table in the main connection to obtain the incremental data of the current branch transaction to the local data storage table, caching the incremental data to a temporary result table, obtaining the incremental data through the temporary result table in the read connection, merging the historical snapshot corresponding to the start timestamp and the incremental data, and reading the data to be read from the merged data.
[0012] In some embodiments, the method further includes: constructing a global row lock waiting relationship based on row lock holding and waiting information of all relational data storage nodes; detecting whether a deadlock exists based on whether a row lock waiting loop exists in the global row lock waiting relationship; and terminating at least one transaction in the row lock waiting loop in response to detecting a deadlock.
[0013] In some embodiments, the data storage table includes: a primary key field, a secondary primary key field, a value field, and a transaction identifier field for modifying the primary key. The primary key field includes the user table primary key, the secondary primary key field includes various second commit timestamps arranged in descending order within the user table primary key, and the value field includes the corresponding value for each second commit timestamp. The second commit timestamp is obtained by left-shifting the first commit timestamp by one bit and performing a bitwise OR operation with a deletion flag. The deletion flag uses odd and even numbers to represent deletion and non-deletion operations, respectively. The non-deletion operations include insertion and update operations.
[0014] In some embodiments, the relational data storage node automatically transfers cached data to disk when it detects insufficient memory.
[0015] This disclosure provides embodiments of a computing node for a distributed database system, comprising one or more modules that execute a distributed transaction processing method.
[0016] Some embodiments of this disclosure provide a computing node for a distributed database system, including: a memory; and a processor coupled to the memory, the processor being configured to execute a distributed transaction processing method based on instructions stored in the memory.
[0017] This disclosure provides some embodiments of a distributed database system, including: at least one computing node and multiple relational data storage nodes, wherein the computing node is configured to execute a distributed transaction processing method.
[0018] Some embodiments of this disclosure propose a computer-readable storage medium having computer instructions stored thereon that, when executed by a processor, implement a distributed transaction processing method.
[0019] Some embodiments of this disclosure provide a computer program product including computer instructions that, when executed by a processor, implement a distributed transaction processing method. Attached Figure Description
[0020] The accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. This disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings.
[0021] Obviously, the accompanying drawings described below are merely some embodiments of this disclosure. Those skilled in the art can obtain other drawings based on these drawings without any creative effort.
[0022] Figure 1The diagram illustrates a distributed transactional database system (hereinafter referred to as a distributed database system) with storage and computation separation according to some embodiments of this disclosure.
[0023] Figure 2 A schematic diagram of a MySQL storage node is shown, representing some embodiments of this disclosure.
[0024] Figure 3 A schematic diagram of a multi-connection parallel architecture is shown for some embodiments of this disclosure.
[0025] Figure 4 This diagram illustrates multi-connection concurrent processing on a single relational data storage node, representing some embodiments of the present disclosure.
[0026] Figure 5 A schematic diagram illustrating a distributed transaction processing method according to some embodiments of this disclosure is shown.
[0027] Figure 6 A schematic diagram of a computing node in a distributed database system according to some embodiments of the present disclosure is shown.
[0028] Figure 7 A schematic diagram of a computing node in a distributed database system according to some embodiments of the present disclosure is shown. Detailed Implementation
[0029] It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of this disclosure.
[0030] Those skilled in the art will understand that the terms "first," "second," etc., in the embodiments of this disclosure are only used to distinguish different steps, devices, or modules, and do not represent any specific technical meaning, nor do they indicate a necessary logical order between them.
[0031] It should also be understood that in the embodiments disclosed herein, "multiple" can refer to two or more, and "at least one" can refer to one, two or more.
[0032] It should also be understood that any component, data or structure mentioned in the embodiments of this disclosure can generally be understood as one or more unless expressly defined or given to the contrary in the context.
[0033] Furthermore, the term "and / or" in this disclosure is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this disclosure generally indicates that the preceding and following related objects have an "or" relationship.
[0034] It should also be understood that the description of the various embodiments in this disclosure emphasizes the differences between the various embodiments, and the similarities or similarities can be referred to each other. For the sake of brevity, they will not be described in detail.
[0035] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.
[0036] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.
[0037] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.
[0038] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.
[0039] Furthermore, in order to avoid obscuring this disclosure with unnecessary details, only processing steps and / or device structures closely related to the scheme at least according to this disclosure are shown in the accompanying drawings, while other details that are not closely related to this disclosure are omitted.
[0040] Figure 1 The diagram illustrates a distributed transactional database system (hereinafter referred to as a distributed database system) with storage and computation separation according to some embodiments of this disclosure.
[0041] like Figure 1 As shown, the distributed database system in this embodiment is divided into a "computing cluster" and a "storage cluster." The "computing cluster," also known as the computing layer, is responsible for managing data partitioning, that is, horizontally splitting the data table into several shards, which are stored on different relational data storage nodes (or relational data storage instances); it is also responsible for driving distributed transactions, that is, ensuring transaction consistency among multiple autonomous relational data storage nodes. The "storage cluster," also known as the storage layer, includes multiple autonomous relational data storage nodes, each storing a portion of the data table (a shard), and can be configured with a master node and standby nodes. Relational data storage nodes / instances include, but are not limited to, MySQL storage nodes / instances and other relational database nodes / instances. MySQL is a mature open-source relational database system, and the storage cluster can be deployed entirely based on the open-source version of MySQL and managed using publicly available MySQL ecosystem tools, requiring almost no development work.
[0042] The "computing cluster" consists of several computing nodes, each deploying a distributed transaction management module and a two-phase commit management module. The distributed transaction management module, implemented through a distributed transaction coordinator, manages the connections between computing nodes and relational data storage nodes. Based on SQL (Structured Query Language) transactions initiated by clients, it generates execution steps that can be performed on the relational data storage nodes and drives distributed task execution, thus achieving distributed transactions. The two-phase commit (2PC) management module, based on distributed protocols supported by the relational data storage nodes, such as the XA (eXtended Architecture) protocol supported by MySQL, manages two-phase transactions. The management functions of two-phase transactions include: generating and maintaining globally unique XA transaction IDs to ensure the uniqueness of transaction identifiers across multiple relational data storage nodes; performing lifecycle management for XA transactions in abnormal states, including but not limited to automatic transaction recovery after an abnormal restart of the relational data storage node; executing subsequent commit operations on XA transactions that have completed the PREPARE state; and cleaning up invalid XA transactions that have not completed the PREPARE state.
[0043] The following description uses MySQL as an example to describe relational data storage nodes. Compared to TiDB's TiKV storage engine, this disclosure uses MySQL (requires mysql-8.x or later) as the storage layer. TiKV is a key-value store based on RocksDB, supporting only single-row data access and consistency, without additional features. MySQL, on the other hand, supports complete transaction features (such as multi-row transactions and isolation levels) and the basic capabilities of two-phase transactions (XA protocol). Furthermore, MySQL supports various storage engines and possesses full SQL computation capabilities. Therefore, transaction processing tasks and temporary data caching can be completely pushed down to the MySQL storage node, unlike TiDB which reads and caches all transaction-related data at the computation layer and performs all data computation tasks there.
[0044] Figure 2 A schematic diagram of a MySQL storage node is shown, illustrating some embodiments of this disclosure. For example... Figure 2 As shown, MySQL storage nodes include: non-transactional engine, transactional engine, temporary table engine, performance monitoring engine, and SQL engine.
[0045] Non-transactional engine: MySQL's non-transactional engine is called "MyISAM". This engine does not support transactions. When writing indexes, data is first written to the index cache (i.e., the key cache), and when writing data files, it is first written to the operating system's file cache (i.e., the page cache). Only when the key cache and page cache are too large and about to overflow are cached data transferred to disk. Therefore, MyISAM tables mostly only write to memory. The temporary result table TAB_TEMP, mentioned later, can be managed by the non-transactional engine.
[0046] Transaction Engine: MySQL's transaction engine is called "InnoDB." This engine supports MVCC (Multi-Version Concurrency Control) based on pessimistic locking, implementing multiple transaction isolation levels including "Read Committed," "Repeatable Read," and "Serializable." "Read Committed" means a transaction can only read data that has been committed by other transactions. "Repeatable Read" means that multiple reads of the same data during a transaction's execution will yield consistent results. "Serializable" means transactions are executed in a queue. Furthermore, InnoDB implements a complete cache management mechanism; its data cache is called the "Buffer Pool," which uses the LRU (Least Recently Used) mechanism to ensure that the most recently hit data is cached as much as possible. The row lock management table TAB_LOCK and the persistent data storage table TAB_DEFAULT, mentioned later, can be managed by the transaction engine.
[0047] Temporary table engine: Used by MySQL to create temporary tables. This engine stores data in an in-memory table by default, and then transfers the data to an InnoDB table when the data volume becomes too large. Temporary tables are only visible to the connection that created them, and are automatically released once the connection is closed. The temporary data storage table TAB_WRITE, mentioned later, can be managed by the temporary table engine.
[0048] The Performance Schema (PFS) engine is used for runtime performance monitoring of MySQL. It provides a series of in-memory tables, allowing users to externally observe the running status of the current database instance. Some exemplary PFS tables include: Thread Connections (threads): Displays runtime connection management information, including the IDs of all active connections and the SQL commands being executed by different connections; Row Lock Holding (data_locks): Displays information about all held row locks, including the locked table, lock ID, holding connection ID, and lock type; Row Lock Waits (data_lock_waits): Displays all lock wait information, including the waiting table, lock ID, holding connection ID, and waiting connection ID.
[0049] SQL Engine: Supports SQL operations, XA protocol, connection management, etc. MySQL is fully compliant with the SQL:2011 standard (ISO / IEC 9075:2011) and partially compatible with new features in SQL:2016 and SQL:2019, providing rich and well-proven SQL operation capabilities. MySQL implements a basic protocol for two-phase transactions (called "XA transactions"), providing functions including two-phase transaction start (XA START), end (XA END), pre-commit (XA PREPARE), commit (XA COMMIT), and recovery.
[0050] MySQL uses temporary tables, MyISAM tables, and InnoDB's buffer pool (i.e., InnoDB table data page cache) as data caches. When the result set of a query needs to be temporarily cached in a data table, it can be stored in a temporary table (data visible only to the current connection) or a MyISAM table (data visible across connections). For cases where the query result set does not need to be cached in a data table, InnoDB's buffer pool can be used for data caching. MySQL's temporary tables cache data in memory by default, and automatically convert it to a disk table (i.e., an InnoDB table) when the data volume is too large. MyISAM tables can use a delayed disk write strategy, and will not actively flush to disk when memory is sufficient. In addition, the buffer pool also has adaptive memory management capabilities. Data is cached in memory pages after data query and data write, and when memory is insufficient, cached pages are automatically moved to disk table space based on the LRU mechanism.
[0051] Leveraging MySQL's comprehensive SQL computing capabilities and caching management mechanism, on the one hand, all computations and temporary result caching can be pushed down to MySQL, reducing network transmission between the storage and computing layers; on the other hand, based on the temporary table engine, the delayed disk write mechanism of the MyISAM engine, and the adaptive memory management mechanism of the InnoDB cache pool, memory overflow can be avoided while caching data as much as possible, thus supporting ultra-large-scale distributed transactions.
[0052] Driving a distributed transaction requires coordinating all relational data storage nodes associated with the transaction. Therefore, multiple connections need to be initiated from the computing node managing the distributed transaction, simultaneously connecting to all relational data storage nodes, such as... Figure 3 As shown. During distributed transaction execution, each relational data storage node can execute its own branch transaction in parallel based on its own connection. For a distributed transaction, each relational data storage node participating in the distributed transaction execution needs a main connection to drive the branch transaction (also called the main transaction or two-phase transaction, such as the XA transaction in MySQL). On the same relational data storage node, several read connections can also be started to handle the read operations of the current branch transaction in parallel. Read operations are also read-only operations, and read connections are also read-only connections. Multiple connections on the same relational data storage node form a "connection group". A "connection group" includes at least one main connection driving the main transaction, and several read connections assisting in the execution of read tasks.
[0053] refer to Figure 4 For scenarios requiring join queries based on write data visible only to the primary connection, a temporary result table (TAB_TEMP) can be used to pass write data visible only to the primary connection for collaborative reading with the read connection.
[0054] Unlike TiDB's distributed transaction processing method, since relational data storage nodes (such as MySQL) are significantly more powerful than TiKV, most of the transaction management, SQL operations, and data caching can be completed on the relational data storage node side. The computing node responsible for distributed transaction management only needs to send instructions to the relational data storage node to drive transaction execution.
[0055] In distributed systems, globally unique and strictly monotonically increasing timestamps can be generated through global timestamp services such as TSO (Timestamp Oracle) or PD (Placement Driver, or cluster manager) timestamp generators. In the embodiments of this disclosure, the start timestamp, update timestamp, and commit timestamp of distributed transactions are all uniformly allocated by this global timestamp service to ensure the temporal consistency of distributed transactions.
[0056] The following section uses MySQL as an example to describe the basic storage structures and storage functions involved in relational data storage nodes.
[0057] The computation layer uses an in-memory table engine to create an association table (tid_ts_map) that stores the relationship between active thread connection identifiers (primary keys) and distributed transaction identifiers. It may also include the transaction start time and an index of the distributed transaction identifier. The distributed transaction identifier could be, for example, a globally unique start timestamp obtained when the distributed transaction begins. This association table is created on each MySQL storage node. Because it's an in-memory table, the data in the in-memory table is cleared every time the MySQL storage node is restarted. Therefore, after the primary connection starts a distributed transaction, the latest association between the current active thread connections and the current distributed transaction needs to be updated in this table.
[0058] Fields in the associated table tid_ts_map include, for example:
[0059] When the computation layer creates a user table, it creates two tables on the MySQL storage node: a row lock management table "TAB_LOCK" and a data persistence storage table "TAB_DEFAULT". Because MySQL has built-in transaction capabilities, it does not need to introduce an additional CF_WRITE table to store the index of committed data in the CF_DEFAULT table, as TiDB does.
[0060] The user table includes at least: the primary key (represented by PKEY) and the non-primary key (represented by VALUES).
[0061] The TAB_LOCK table includes: a primary key field, a deletion flag field, and a transaction identifier field for locking the primary key. It may also include an index field for the transaction identifier of the locked primary key to optimize search efficiency. The primary key field includes the user table primary key; the deletion flag field is used to indicate whether the transaction locking the primary key performed a deletion operation.
[0062] The fields in the TAB_LOCK table include, for example:
[0063] The TAB_DEFAULT table includes: a primary key field, a secondary primary key field, a value field, and a transaction identifier field for modifying the primary key. The primary key field includes the user table primary key, the secondary primary key field includes the commit timestamp, and the commit timestamps are arranged in descending order in the user table primary key. The value field includes the corresponding value for each commit timestamp.
[0064] Fields in the TAB_DEFAULT table include, for example:
[0065] In other words, the TAB_DEFAULT table stores data in an append-only manner, meaning that all versions of the same primary key modified by different transactions are stored in the TAB_DEFAULT table. Adding the latest commit timestamp to the primary key field ensures that a new version of the corresponding primary key is added with each commit. All write operations are converted to append operations; for example, update / insert / delete write operations are appended with data using the current commit timestamp as the secondary primary key. Subsequent transactions (i.e., transactions whose start timestamp is greater than the commit timestamp) can then see the latest committed data version. The commit timestamps in the secondary primary key field are arranged in descending order because, for a distributed transaction identified by its start timestamp, for a group of data in the TAB_DEFAULT table with the same primary key but different commit timestamps, the visible data version is the one with the largest commit timestamp value among the data whose commit timestamp is less than or equal to the start timestamp of the distributed transaction. Arranging the commit timestamps in descending order allows subsequent data versions to be ignored when searching for the visible data version corresponding to the largest commit timestamp, thus improving search performance.
[0066] As mentioned earlier, delete operations within a write operation are also converted to append operations. Some embodiments of this disclosure do not introduce an additional delete flag field in the TAB_DEFAULT table to mark delete operations; instead, they employ an improved design for the commit timestamp.
[0067] For ease of distinction, the original commit timestamp obtained from the global timestamp service is called the first commit timestamp, and the improved commit timestamp is called the second commit timestamp. The second commit timestamp is obtained by left-shifting the first commit timestamp by one bit and performing a bitwise OR operation with a deletion flag. The deletion flag uses odd numbers (e.g., 1) and even numbers (e.g., 0) to represent deletion and non-deletion operations, respectively. Non-deletion operations include insertion and update operations. It is evident that the second commit timestamp does not change the monotonically increasing characteristic of the first commit timestamp. Compared to the first commit timestamp, the second commit timestamp not only indicates the data commit time but also whether a deletion operation occurred.
[0068] The following is a comparative description of the commit timestamp parameters in the data storage table before and after the commit timestamp improvement. Understandably, this disclosure also describes both implementation methods. Before the commit timestamp improvement, the data storage table could store a first commit timestamp. That is, the data storage table included: a primary key field, a secondary primary key field, a value field, and a transaction identifier field for modifying the primary key. The primary key field included the user table primary key, the secondary primary key field included the first commit timestamp, and the first commit timestamps were arranged in descending order within the user table primary key. The value field included the corresponding value for each first commit timestamp. Additionally, the data storage table could include a deletion flag to indicate whether a deletion operation was performed. After the commit timestamp improvement, the data storage table can store a second commit timestamp. That is, the data storage table includes: a primary key field, a secondary primary key field, a value field, and a transaction identifier field for modifying the primary key. The primary key field included the user table primary key, the secondary primary key field included the second commit timestamp, and the second commit timestamps were arranged in descending order within the user table primary key. The value field included the corresponding value for each second commit timestamp. The data storage table does not need to include a deletion flag.
[0069] The improved formula for commit timestamps can be expressed as: __commit_ts=(commit_ts << 1) |delete_bit, where commit_ts represents the first commit timestamp, __commit_ts represents the second commit timestamp, << 1 represents left shift by one bit, | represents "bitwise OR" operation, and delete_bit represents the least significant bit deletion flag.
[0070] Suppose the TAB_DEFAULT table records the following data:
[0071] Here, key1 + key2 represents the primary key of the TAB_DEFAULT table, __commit_ts represents the second commit timestamp, __start_ts represents the transaction identifier that modifies the primary key key1 + key2, i.e., the start timestamp of the transaction start_ts, the value val1 for key1, and the value val2 for key2. The TAB_DEFAULT table records multiple data versions of the primary key (0, 1).
[0072] For example, if the start_ts of the current transaction is 10, then the data visible to the current transaction is the latest version of __commit_ts≤(10 << 1) | 1 (that is, __commit_ts≤21), which is version __commit_ts = 16. Since 16 is an even number, the current version is not a deletion operation.
[0073] For example, if the current transaction's start_ts = 18, then the data visible to this current transaction is the latest version of __commit_ts ≤ (18 << 1) | 1 (that is, __commit_ts ≤ 37), i.e., version __commit_ts = 37. Since 37 is an odd number, and the current version represents a delete operation, the current transaction cannot read the data for primary key (0, 1).
[0074] In other words, if the TAB_DEFAULT table stores the second commit timestamp, when determining the data visible to the current transaction, the current transaction identifier (i.e., the start timestamp of the current transaction) needs to be shifted left by one bit and bitwise ORed with an odd number "1", and then compared with the second commit timestamp to determine that the data version visible to the current transaction is the latest version when __commit_ts≤(start_ts << 1) |1.
[0075] Understandably, if the TAB_DEFAULT table stores the first commit timestamp, when determining the data visible to the current transaction, the data marked with a deletion flag indicating a deletion operation is removed, and the current transaction identifier (i.e., the start timestamp of the current transaction) is compared with the first commit timestamp. From the remaining data, the version of the data visible to the current transaction is determined to be the latest version when commit_ts ≤ start_ts.
[0076] The SQL statement queries the data visible to the current transaction from the TAB_DEFAULT table, including: (1) determining the range of visible data, i.e. retaining data whose commit timestamp is less than or equal to the upper limit of the visible time of the current transaction and meets the filtering conditions (if any), and new version data exceeding the upper limit of the time is not visible to the current transaction, where the upper limit of the visible time of the current transaction = (N << 1) | 1, and N is the start timestamp of the current transaction; (2) grouping by primary key and sorting by version, i.e. grouping all data that meet the time conditions by primary key, and sorting the multiple historical versions corresponding to the same primary key from newest to oldest by commit timestamp; (3) selecting the latest valid version of each primary key, i.e. retaining only the latest version after sorting for each primary key, and no longer participating in subsequent queries; (4) filtering deleted data, i.e. excluding data marked as deleted, and only retaining valid and undeleted data; (5) outputting the latest valid data visible to the current transaction for each primary key.
[0077] If using MySQL, you can use CTE (Common Table Expression) and the window function ROW_NUMBER() OVER(PARTITION BY… ORDER BY…) to retrieve data visible in the current transaction from the TAB_DEFAULT table. If using other databases, you will need to support syntax that is semantically equivalent to the above. Taking MySQL as an example, assuming the start timestamp of the current transaction is `start_ts = N`, the implementation of querying the visible data of the current transaction from `TAB_DEFAULT` is as follows: First, a temporary result set named `ranked_data` (customizable) is created using a CTE expression. Then, all data in the `TAB_DEFAULT` table that meets the specified conditions are queried. Simultaneously, a window function is used to generate a row number for each row, named `rn` (customizable). The generation rule is to group by the `PKEY` field, and within each group, sort by the value of the `__commit_ts` field from largest to smallest, with row numbers starting from 1 and incrementing sequentially. During the query, only data rows whose `__commit_ts` is less than or equal to the visible timestamp value calculated using the formula `(N << 1) | 1` are selected. If there are filtering conditions, these conditions must also be met. After defining the temporary result set, only `PKEY`, `__commit_ts`, and `VALUES` are queried and displayed from this temporary result set. These three fields do not display the row number field rn, and only filter data where the row number rn is equal to 1, representing the latest record in each group and visible to the current transaction. At the same time, invalid data with deletion marks is filtered out by performing a bitwise AND operation between __commit_ts and 1, where the result is equal to 0.
[0078] Let TAB_VISIBLE(tbname, start_ts, filters) represent the data in the tbname_DEFUALT table that is visible at transaction timestamp start_ts, and filters represent the filtering conditions for this table. For example, TAB_VISIBLE(t, N, val2 > 1) represents the data in the t_DEFUALT table that is visible at transaction timestamp N, and the filtering condition for this table is the value val2 > 1. The filtering condition, such as "val2 > 1", is pushed down to the inner subquery, ensuring that only data matching the filtering condition is grouped by the primary key and sorted by version, thereby optimizing query performance. TiDB requires computational processing such as filtering conditions to be performed by compute nodes. Compute nodes read data from storage nodes and then perform filtering and other computational processing. If the transaction size is large, the compute nodes may experience memory overflow. This disclosure allows computational processing such as filtering conditions to be performed by relational data storage nodes, reducing the burden and complexity of compute nodes.
[0079] In a distributed transaction, when caching data written to a table, this disclosure provides two solutions: one is to directly write to the persistent data storage table TAB_DEFAULT; the other is to write to the temporary data storage table TAB_WRITE, whose table structure is the same as TAB_DEFAULT and is only visible to the current transaction. TAB_WRITE is created based on a temporary table engine, is not visible to other connections, and is automatically released when the current connection ends. Relational databases, such as MySQL, have built-in data caching and transaction isolation capabilities. Data that has been written but not committed is cached in a cache pool and is not visible to other transactions, so directly writing to TAB_DEFAULT is feasible. However, if the TAB_DEFAULT itself contains a large amount of data, the performance of subsequent queries of incremental data will be affected by the existing data. Furthermore, directly writing to TAB_DEFAULT may also affect the parallelism of other transactions. In this case, the TAB_WRITE write solution can be used.
[0080] On each relational data storage node, such as a MySQL storage node, create two stored functions for distributed transaction management, one for maintaining the distributed transaction context and the other for handling transaction exceptions.
[0081] The `update_start_ts` stored function sets the start timestamp (distributed transaction ID) of the distributed transaction for the current database connection, obtains the unique thread ID of the current database connection as an identifier to distinguish different connections, and binds and maintains the database connection thread ID with the distributed transaction ID. This binding relationship can be written to a memory-mapped table to ensure that subsequent data read and write operations can correctly identify the current transaction context.
[0082] Taking MySQL as an example, first, the function `update_start_ts` is declared, accepting an unsigned large integer parameter `start_ts`. Then, the function is specified to return an unsigned large integer and marked as a deterministic function, meaning the same input will produce the same output. Next, the function body logic is executed. First, an unsigned large integer variable `tid` is declared, with its default value set to the thread ID of the current MySQL connection. Then, the local variable `@start_ts` of the current connection is assigned the passed parameter `start_ts`. Afterward, an insert operation is performed, inserting the current thread ID, the local variable `@start_ts` of the current connection, and the current system time into the `tid_ts_map` table. If the same thread ID already exists in the table (duplicate primary key), it is not added again; instead, the corresponding `start_ts` field is updated to the latest value, and the `start_time` field is updated to the current time. Finally, the function returns the thread ID of the current connection, and the function body ends.
[0083] The `ts_stale_error` storage function throws a standard error message when a distributed transaction encounters a timestamp expiration or a write conflict, ensuring that the distributed transaction mechanism triggers the exception as expected.
[0084] Taking MySQL as an example, first declare and create the function `ts_stale_error`; then specify that the function will return an unsigned large integer value after execution, and mark the function as a deterministic function. Next, enter the function body logic, perform the exception throwing operation, use `SQLSTATE '45000'` to define a custom exception state, and set the exception message to "Error: write conflict or transaction start timestamp expired"; finally, end the function body definition.
[0085] Based on the above data storage structure, the storage layer data is managed, and the capabilities of the relational data storage nodes are customized to a certain extent using the above storage functions, thereby realizing distributed transaction processing.
[0086] Figure 5 The diagram illustrates a distributed transaction processing method according to some embodiments of this disclosure. For example... Figure 5 As shown, distributed transaction processing involves the startup phase, read phase, write phase, and two-phase commit phase (pre-commit phase and commit phase) of a distributed transaction. Understandably, depending on the needs of the transaction, a specific distributed transaction may involve some or all of these phases.
[0087] Step 510: The compute nodes of the distributed database system (hereinafter referred to as compute nodes) respond to the client's distributed transaction and obtain a globally unique start timestamp. The distributed transaction is associated with multiple relational data storage nodes, and each relational data storage node corresponds to executing a branch transaction of the distributed transaction.
[0088] Taking a MySQL-type storage node as an example, the compute node responds to the client's distributed transaction by obtaining a globally unique start timestamp from the global timestamp service. The distributed transaction is associated with multiple MySQL storage nodes, and each MySQL storage node corresponds to executing a branch of the distributed transaction.
[0089] Step 520: During the startup phase, the compute node initiates a distributed transaction and sends the start timestamp as the identifier of the distributed transaction to the associated relational data storage nodes.
[0090] The start timestamp obtained from the global timestamp service is globally unique and can be used as an identifier (ID) for distributed transactions.
[0091] Compute nodes can initiate distributed transactions using SQL statements, sending the start timestamp as the identifier for the distributed transaction to multiple associated relational data storage nodes. Taking a MySQL storage node as an example, during the startup phase, the compute node initiates an XA transaction via the main transaction, sending the start timestamp as the identifier for the distributed transaction to multiple associated MySQL storage nodes. Specifically, the compute node uses the XA START operation of the MySQL XA protocol to start the XA transaction; it then uses a SELECT statement to call update_start_ts to update the identifier of the distributed transaction for this connection to the start timestamp.
[0092] The startup process is one of the differences between this embodiment and TiDB. Since TiKV does not have transaction and computation capabilities like MySQL storage nodes, transaction processing and computation need to be performed by TiDB's compute nodes. Therefore, TiDB does not need to send a start timestamp to TiKV.
[0093] The default transaction isolation level for MySQL storage nodes is Read Committed. This means that once a transaction within a MySQL storage node is committed, it becomes visible to other transactions. Based on the MVCC (Multi-Version Concurrency Control) mechanism, isolation between distributed transactions is guaranteed. MVCC is a concurrency control mechanism that eliminates read-write conflicts through multiple version snapshots. It retains several historical versions of data, allowing read transactions to backtrack to a specific historical version using the storage engine's internal timestamp or transaction ID, without waiting for write locks to be released.
[0094] Step 530, during the read phase of the compute node, data is read from the relational data storage node based on the start timestamp.
[0095] When a user executes a read-only query within a distributed transaction, the read-only query can be performed through a read join. Regarding visibility, read queries need to address the "non-repeatable read" problem, requiring waiting for an early transaction commit, and cannot be simply resolved using TAB_VISIBLE().
[0096] To better illustrate the "non-repeatable read" problem, this section describes it as if neither the start timestamp nor the commit timestamp is shifted. However, it's understandable that since shifting doesn't change the monotonicity of timestamps, the "non-repeatable read" problem still exists even when both start and commit timestamps are shifted. The scenario where the "non-repeatable read" problem arises is: the transaction ID (start timestamp) of the currently performing read operation is 100; simultaneously, the data that the current transaction needs to read is locked by another transaction with a transaction ID (start timestamp) of 80, which is less than the start timestamp of the current transaction; and the latest commit timestamp in TAB_DEFAULT is 50. In this case, to avoid the "non-repeatable read" problem, it's necessary to wait for the transaction with a start timestamp of 80 to commit (i.e., generate a data version with a commit timestamp > 80, which may be greater than or less than 100) before determining whether the data committed by the transaction with timestamp 80 is visible to the current transaction. Otherwise, if a transaction with timestamp 80 has already acquired a commit timestamp of 90, and if the current transaction does not wait and directly reads data with commit timestamp = 50, then when the same transaction subsequently initiates the same read operation again, it may read data with commit timestamp = 90 (because the transaction with timestamp 80 has already committed). This would cause the "non-repeatable read" problem.
[0097] Therefore, a read operation first needs to determine whether the data it needs to read is locked by other transactions, that is, whether there is a "read-write conflict". Usually, in a read-heavy, write-light system, such conflicts do not occur frequently, so it is acceptable for the execution of read operations to be delayed in this scenario.
[0098] First, check if the table involved in the current read operation is being locked by other transactions. Only if the table is being locked by other transactions can a write operation from another transaction potentially conflict with the read operation of this transaction. This is because if other transactions do not yet hold row locks on the table, their subsequent locks and writes to the table will have a commit timestamp (which is only acquired during the commit phase) that is greater than the start timestamp of the current transaction, making them invisible to the current transaction.
[0099] Taking MySQL as an example, by using the mapping between the THREAD_ID field in the data_locks table of performance_schema, the THREAD_ID field and the start_ts field in the tid_ts_map table, and the mapping between the THREAD_ID field and the PROCESSLIST_ID field in the threads table of performance_schema, the start_ts (start timestamp) of the connection holding the row lock can be obtained. Here, THREAD_ID is the row number internally assigned to the thread, and PROCESSLIST_ID is the actual MySQL connection ID.
[0100] In some embodiments, during the read phase, the compute node, based on the start timestamp, instructs the relational data storage node involved in the read operation to execute the following: In response to the read operation and the detection of a row lock on the data to be read, if the transaction ID holding the row lock is less than the start timestamp of the current branch transaction, it is necessary to determine whether to wait for the transaction holding the row lock (hereinafter referred to as the lock-holding transaction) to commit; if the transaction holding the row lock is not currently executing a locking statement, it cannot be determined whether the lock-holding transaction has already acquired a commit timestamp, and it is necessary to wait for the transaction holding the row lock to commit before reading the data to be read from the historical snapshot corresponding to the start timestamp; if the transaction holding the row lock is currently executing a locking statement, the lock-holding transaction has definitely not reached the commit phase, and the commit timestamp acquired by the lock-holding transaction cannot be less than the start timestamp of the current transaction, so there is no need to wait, and the data to be read can be directly read from the historical snapshot corresponding to the start timestamp. Specifically, by querying the `threads` table of `performance_schema` and viewing the "PROCESSLIST_INFO" field in the result set (the SQL statement currently being executed by the database connection), the SQL statement currently being executed by the lock-holding connection can be determined.
[0101] The compute node can form the read logic executed by the relational data storage node into an SQL statement, and instruct the relational data storage node to execute the corresponding read logic through the SQL statement.
[0102] While waiting for a locked transaction to commit, it's necessary to continuously check if the locked transaction has become stuck (becoming a "zombie transaction"). A simple and feasible implementation is to preset an upper limit on the transaction execution time, that is, to determine if the locked transaction has become stuck by checking if the time since `start_time` in the `tid_ts_map` table exceeds the preset upper limit. This disclosure does not restrict the use of other transaction deadlock detection mechanisms. If a locked transaction is found to be stuck, it needs to be actively cleaned up so that other transactions that may conflict with it can continue to execute.
[0103] Once the read-write conflict is resolved (i.e., all transactions holding locks whose commit timestamps may be greater than the start timestamp of the current transaction have been committed or cleaned up), you can query using TAB_VISIBLE().
[0104] If there are multiple unrelated read operations in the current transaction (e.g., data needs to be read from multiple tables separately), then these read tasks (including the process of waiting for "read-write conflicts" to be resolved) can be executed in parallel across multiple read connections.
[0105] In some embodiments, during the read phase, the compute node, based on the start timestamp, instructs the relational data storage node involved in the read operation to perform the following: If the read operation is detected to be related to data already written in the current branch transaction, the local data storage table and row lock management table are queried in the main connection to obtain the incremental data of the current branch transaction on the local data storage table. The incremental data is cached in a temporary result table. In the read connection, the incremental data is retrieved through the temporary result table. The historical snapshot and incremental data corresponding to the start timestamp are merged, and the data to be read is read from the merged data. Here, the data storage table can be a persistent data storage table or a temporary data storage table.
[0106] For example, if the current transaction has already written to table t1, and a read-only operation needs to read the result of joining table t1 with table t2 (i.e., a JOIN operation), since the data already written by the current transaction is only visible in the main join (the join of an open XA transaction), a temporary table is needed to transfer the data. In this case, a query can be initiated in the main join to query the tables t1_DEFAULT and t1_LOCK, caching the query results (i.e., the incremental data that the current transaction has locked and modified in table t1) in a temporary result table (t1_TEMP) of the MyISAM engine. Then, in the read join, the incremental data in t1_DEFAULT is supplemented using the temporary result table, and then the join operation is performed with table t2.
[0107] Taking MySQL as an example, first create a temporary result table named t1_TEMP. The structure of this temporary result table is the same as the aforementioned TAB_DEFAULT table (here, t1_DEFAULT table). The temporary result table uses the MyISAM engine. Next, data is inserted into the temporary result table t1_TEMP. The data source is the tables t1_LOCK and t1_DEFAULT. During the query, the two tables are joined through the primary key PKEY. In the query results, the primary key field is directly taken from the original table's PKEY. The __commit_ts field is obtained by left-shifting the local variable @start_ts of the current join and performing a bitwise OR operation with the deletion flag __delete in the t1_LOCK table. This indicates that the data was written by the current transaction. The __start_ts field is always assigned a value of 0. Non-primary key fields are directly taken from the VALUES data in the t1_DEFAULT table. The query condition is set to __start_ts of the t1_LOCK table equal to @start_ts of the current transaction, which means only data locked by the current transaction is filtered out. Other conditions for filtering incremental data can also be added as needed.
[0108] The compute node can form the merge read logic executed by the relational data storage node into an SQL statement, and instruct the relational data storage node to execute the merge read logic through the SQL statement.
[0109] This embodiment does not require caching read data to the compute nodes like TiDB does. This is because in InnoDB, queried data is cached in a cache pool; and the cache pool has the LRU characteristic, which ensures that the most recently accessed data is cached first, and when memory is insufficient, cached data that has not been accessed for a long time will be removed from the cache first.
[0110] Step 540: During the write phase, the compute node, based on the start timestamp, instructs any relational data storage node involved in the write operation to apply a row lock in the row lock management table and write the data to be written to the local data storage table.
[0111] In some embodiments, instructing a compute node to apply a row lock to a row lock management table for any write operation involves the following steps: The compute node instructs the relational data storage node involved in the write operation to perform the following: determine whether a user table primary key involved in the write operation exists in the row lock management table; if the user table primary key involved in the write operation does not exist in the row lock management table, it means that the user table primary key is not yet locked, insert the user table primary key into the row lock management table, and set the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; if the user table primary key involved in the write operation exists in the row lock management table, it means that the user table primary key has been locked, and update the transaction identifier locking the user table primary key. The compute node can formulate the above control logic for applying row locks into an SQL statement, and instruct the relational data storage node to apply a row lock to the row lock management table through the SQL statement.
[0112] Because MySQL natively supports mechanisms such as transactions and row locks, the design of TAB_LOCK is greatly simplified. In an XA transaction, locking the corresponding primary key is achieved by inserting or updating data to be inserted or updated within the transaction and placing it into TAB_LOCK. After the XA transaction is committed or rolled back, the held row locks are automatically released, eliminating the need for additional row lock deletion operations.
[0113] When a write operation involves the primary key of the user table in the row lock management table (i.e., the primary key is already locked), updating the transaction identifier locking the primary key involves: If the transaction identifier locking the primary key is less than the start timestamp of the current branch transaction, wait for the row lock on the primary key to be released, and then update the transaction identifier locking the primary key to the start timestamp of the current branch transaction; or, if the transaction identifier locking the primary key is greater than the start timestamp of the current branch transaction, acquire the update timestamp, re-acquire the row lock on the primary key using the update timestamp, and update the transaction identifier locking the primary key to the update timestamp after successful acquisition. This resolves the write-write conflict. The compute node can implement the above control logic for updating the transaction identifier locking the primary key using an IF statement. Whether the transaction identifier locking the primary key is less than the start timestamp of the current branch transaction is used as a condition. The logic processing for the lesser case is represented by expression 1, and the logic processing for the greater case is represented by expression 2. This condition, expression 1, and expression 2 form an IF statement.
[0114] If using MySQL, you can use the `INSERT INTO… SELECT ... FROM… ON DUPLICATE UPDATE` syntax. If using other databases, you need to support syntax with semantically equivalent meaning. Taking MySQL as an example, first, a locking operation is performed. Data is inserted into the TAB_LOCK table using the "INSERT INTO" statement. The "SELECT ... FROM..." statement specifies that the data comes from the result of the TAB_VISIBLE() function. The selected fields include PKEY, the deletion flag represented by the variable @delete_flag, and the start_ts value obtained through an IF statement. This IF statement checks if the local variable @start_ts of the current connection / transaction is greater than __start_ts in the query result; otherwise, it triggers the timestamp expiration error function ts_stale_error(). The "ON DUPLICATE UPDATE" statement indicates that if a duplicate primary key is encountered during insertion, an update operation is performed. The __delete field is updated to the deletion flag represented by the variable @delete_flag, and the __start_ts field is updated to: if the variable @start_ts is greater than or equal to the existing __start_ts in the table. If the variable `@start_ts` is used, otherwise the timestamp expiration error function `ts_stale_error()` will be triggered. Next, the data write operation is performed. If the `TAB_WRITE` scheme is used, data is inserted into the `TAB_WRITE` table; otherwise, data is inserted into the `TAB_DEFAULT` table by default. The insertion operation is implemented using the `INSERTINTO` statement. The `SELECT ... FROM…` statement indicates that the data also comes from the query result of the `TAB_VISIBLE()` function. The `PKEY` field is selected, `__commit_ts` is fixed at 0, and `__start_ts` is set to the timestamp variable `@start_ts` of the current transaction, along with the original `VALUES` field data. The `ON DUPLICATE UPDATE` statement indicates that if a duplicate primary key is encountered during insertion, an update operation is performed, directly overwriting the original `VALUES` field in the table with the `VALUES` field from the query result `def`.
[0115] As mentioned earlier, data storage tables include persistent data storage tables or temporary data storage tables. In some embodiments, a compute node instructing the data to be written to a local data storage table includes: the compute node instructing the data to be written to a local persistent data storage table, wherein the written but uncommitted data is maintained through the caching mechanism and transaction isolation mechanism of the relational data storage node involved in the write operation, so that the written but uncommitted data is only visible to the current branch transaction; or, the compute node instructing the data to be written to a local temporary data storage table, wherein the temporary data storage table has the same structure as the persistent data storage table and is only visible to the current branch transaction. Both data storage schemes can be used selectively or in combination depending on the needs of the storage scenario. The former storage scheme reuses the native capabilities of the storage engine and is suitable for high-concurrency, low-latency scenarios; the latter storage scheme is suitable for scenarios where a single transaction updates a large amount of data, as the temporary table itself is a transaction-level private space, naturally achieving "visible only to the current branch transaction," without relying on complex MVCC (Multi-Version Concurrency Control) or locking mechanisms.
[0116] Step 550: During the pre-commit phase, the compute node obtains the first commit timestamp and instructs all relational data storage nodes involved in the write operation to update their respective local data storage tables based on the first commit timestamp, thus pre-committing the distributed transaction.
[0117] Unlike TiDB, this embodiment writes data to the data storage table TAB_DEFAULT or TAB_WRITE during the write phase, thus allowing the first commit timestamp to be obtained and updated in TAB_DEFAULT during the pre-commit phase, without needing to obtain it again during the commit phase. As mentioned earlier, the original commit timestamp obtained from the global timestamp service is used as the first commit timestamp. The second commit timestamp is obtained by left-shifting the first commit timestamp by one bit and performing a bitwise OR operation with the deletion flag. The deletion flag uses odd and even numbers to represent deletion and non-deletion operations, respectively. Non-deletion operations include insertion and update operations.
[0118] If the data to be written has already been written to the local persistent storage table, it is only necessary to update the commit timestamp parameter in the persistent storage table. The compute node instructs all relational data storage nodes involved in the write operation to update the commit timestamp parameter in their respective local persistent storage tables to the second commit timestamp.
[0119] Taking MySQL as an example, first, a session variable named `@commit_ts` is set and assigned a specified value, representing the commit timestamp of the current transaction. Next, an update operation is performed, simultaneously operating on the `TAB_DEFAULT` table and the `TAB_LOCK` table. The two tables are joined using the primary key `PKEY`. Data rows in the `TAB_LOCK` table whose `_start_ts` value matches the current transaction's start timestamp `@start_ts` are selected; these are the data rows locked and modified by the current transaction. Then, the selected data is updated by left-shifting the session variable `@commit_ts` by one bit, followed by a bitwise OR operation with the `__delete` flag in the `TAB_LOCK` table. The result is then assigned to the `__commit_ts` field in the `TAB_DEFAULT` table, completing the merging and assignment of the commit timestamp and deletion status.
[0120] When writing data to a local temporary data storage table, the compute node instructs all relational data storage nodes involved in the write operation to synchronize the data in their respective local temporary data storage tables to their respective local persistent data storage tables, and update the commit timestamp parameter in their respective local persistent data storage tables to the second commit timestamp.
[0121] Taking MySQL as an example, first, a session variable named `@commit_ts` is set and assigned a specified value, representing the commit timestamp of the current transaction. Next, an insert operation is performed, writing data to the `TAB_DEFAULT` table. The data sources are the `TAB_LOCK` and `TAB_WRITE` tables, linked by the primary key `PKEY`. Data in the `TAB_LOCK` table whose `__start_ts` value exactly matches the current transaction's start timestamp `@start_ts` is selected; this represents the data locked and modified by the current transaction. In the query results, the primary key field is taken from the `TAB_WRITE` table's `PKEY`, the `__commit_ts` field is obtained by left-shifting the session variable `@commit_ts` by one bit and performing a bitwise OR operation with the `__delete` flag in the `TAB_LOCK` table, the `__start_ts` field directly uses the `__start_ts` value from the `TAB_WRITE` table, and the `VALUE` field is taken from the business data value in the `TAB_WRITE` table. Finally, this retrieved data is batch-inserted into the `TAB_DEFAULT` table.
[0122] After all the data has been written to the persistent storage table, the XA END and XA PREAPRE operations of the MySQL XA protocol are used to complete the pre-committed distributed transaction. At this point, all the data written by the current transaction has been stored on disk. However, this incremental data is still not visible to other transactions, and the lock holding TAB_LOCK has not yet been released.
[0123] Step 560: During the commit phase, the compute node instructs all relational data storage nodes involved in the write operations to commit the distributed transaction so that all write operations take effect globally.
[0124] The commands executed during the commit phase are quite concise. For example, the XA COMMIT operation of the MySQL XA protocol can be used to commit the distributed transaction. At this point, the data written by the current transaction is visible to other transactions, and the lock holding the TAB_LOCK is released.
[0125] In some embodiments, the compute node performs deadlock detection and handling, including: constructing a global row lock waiting relationship based on row lock holding and waiting information of all relational data storage nodes; detecting whether a deadlock exists based on whether a row lock waiting loop exists in the global row lock waiting relationship; and terminating a transaction indicated in the row lock waiting loop in response to the detection of a deadlock.
[0126] Taking MySQL as an example, deadlock detection can be achieved using the data_lock_waits table in performance_schema. This table records the IDs of lock-holding transactions and lock-waiting transactions. Based on this, the row lock holding and waiting information of all MySQL storage nodes can be obtained, thereby constructing a global row lock waiting chain. If there is a waiting loop in the waiting chain, a deadlock situation is determined.
[0127] Modern distributed database systems, such as TiDB and Google Spanner, all use key-value (KV) storage layers. Due to the limited capabilities of the storage layer (lacking transaction capabilities, complex computational capabilities, and only supporting single-row atomicity), the implementation of the computation layer is extremely complex (requiring a complete 2PC protocol and all complex calculations to be performed in the computation layer). Furthermore, all transaction data needs to be read and cached in the computation layer, resulting in a heavy network transmission burden between the computation and storage layers. Large transaction sizes can also cause memory overflow errors in the computation layer, reducing the overall availability of the system.
[0128] This disclosure achieves superior distributed transaction capabilities by using a fully functional and proven relational database such as MySQL. On one hand, the robust transaction features and support for complex computations of relational data storage nodes (such as MySQL storage nodes) significantly reduce the complexity of the computation layer. For example, two-phase commit is implemented based on MySQL's XA protocol, pushing all computations and caching down to the MySQL storage nodes. On the other hand, relational data storage nodes automatically move cached data to disk when insufficient memory is detected, preventing memory overflow. For instance, by utilizing the cache management capabilities of the MySQL storage engine, cached data can be automatically moved to disk when storage node memory is insufficient, preventing memory overflow and enabling support for ultra-large-scale distributed transactions.
[0129] Figure 6 A schematic diagram of a computing node in a distributed database system according to some embodiments of this disclosure is shown. For example... Figure 6 As shown, the computing node 600 in this embodiment includes a memory 610 and a processor 620 coupled to the memory 610. The processor 620 is configured to execute the distributed transaction processing method in any of the foregoing embodiments based on instructions stored in the memory 610.
[0130] The computing node 600 may also include an input / output interface 630, a network interface 640, a storage interface 650, etc. These interfaces 630, 640, 650, as well as the memory 610 and the processor 620, can be connected, for example, via a bus 660.
[0131] The memory 610 may include, for example, system memory, fixed non-volatile storage media, etc. The system memory may store, for example, the operating system, application programs, boot loader, and other programs.
[0132] The processor 620 can be implemented using a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates, or transistors, or other discrete hardware components.
[0133] The input / output interface 630 provides a connection interface for input / output devices such as monitors, mice, keyboards, and touchscreens. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives. The bus 660 can use any bus architecture from a variety of bus structures. For example, bus architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
[0134] Figure 7 The diagram illustrates a computing node of a distributed database system according to some embodiments of the present disclosure. The computing node includes one or more modules that execute distributed transaction processing methods.
[0135] like Figure 7 As shown, the computing node 700 includes a distributed transaction management module 710 and a two-phase commit management module 720.
[0136] The distributed transaction management module 710 is configured to respond to a client's distributed transaction by obtaining a globally unique start timestamp, wherein the distributed transaction is associated with multiple relational data storage nodes, and each relational data storage node is used to execute a branch transaction of the distributed transaction; to start the distributed transaction, the start timestamp is sent as the identifier of the distributed transaction to the multiple relational data storage nodes; during the write phase, based on the start timestamp, any relational data storage node involved in the write operation is instructed to apply a row lock in the row lock management table and write the data to be written to the local data storage table; The two-phase commit management module 720 is configured to, during the pre-commit phase, obtain a first commit timestamp and instruct all relational data storage nodes involved in write operations to update their respective local data storage tables based on the first commit timestamp, thus pre-committing the distributed transaction; during the commit phase, it instructs all relational data storage nodes involved in write operations to commit the distributed transaction, so that all write operations take effect globally. Furthermore, the two-phase commit management module 720 is also configured to assign XA transaction IDs, track the status of initiated XA transactions, and recover suspended XA transactions after a failure.
[0137] In some embodiments, the distributed transaction management module 710 is configured to instruct the relational data storage node involved in the write operation to perform: determining whether the user table primary key involved in the write operation exists in the row lock management table; if it does not exist, inserting the user table primary key into the row lock management table and setting the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; if it exists, updating the transaction identifier locking the user table primary key.
[0138] In some embodiments, the distributed transaction management module 710 is configured to, when the transaction identifier locking the user table primary key is less than the start timestamp of the current branch transaction, wait for the row lock on the user table primary key to be released, and after release, update the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; or, when the transaction identifier locking the user table primary key is greater than the start timestamp of the current branch transaction, obtain the update timestamp, use the update timestamp to re-acquire the row lock on the user table primary key, and after successful acquisition, update the transaction identifier locking the user table primary key to the update timestamp.
[0139] In some embodiments, the distributed transaction management module 710 is configured to instruct the data to be written to a local persistent data storage table, wherein the data that has been written but not committed is maintained through the caching mechanism and transaction isolation mechanism of the relational data storage node involved in the write operation, so that the data that has been written but not committed is only visible to the current branch transaction; or, instruct the data to be written to a local temporary data storage table, wherein the temporary data storage table has the same structure as the persistent data storage table and is only visible to the current branch transaction.
[0140] In some embodiments, the distributed transaction management module 710 is configured to, for the case where the data to be written is written to a local persistent data storage table, instruct all relational data storage nodes involved in the write operation to update the commit timestamp parameter in their respective local persistent data storage tables to a second commit timestamp; or, for the case where the data to be written is written to a local temporary data storage table, instruct all relational data storage nodes involved in the write operation to synchronize the data in their respective local temporary data storage tables to their respective local persistent data storage tables, and update the commit timestamp parameter in their respective local persistent data storage tables to a second commit timestamp; wherein the second commit timestamp is obtained by left-shifting the first commit timestamp by one bit and performing a bitwise OR operation with a deletion flag, the deletion flag using odd and even numbers to represent deletion operations and non-deletion operations respectively, the non-deletion operations including insertion operations and update operations.
[0141] In some embodiments, the distributed transaction management module 710 is configured to, during the read phase, instruct the relational data storage node involved in the read operation to perform the following actions based on the start timestamp: in response to the read operation and the detection of a row lock on the data to be read, if the transaction ID holding the row lock is less than the start timestamp of the current branch transaction, and if the transaction holding the row lock is not currently executing a locking statement, wait for the transaction holding the row lock to commit, and then read the data to be read from the historical snapshot corresponding to the start timestamp; if the transaction holding the row lock is currently executing a locking statement, directly read the data to be read from the historical snapshot corresponding to the start timestamp.
[0142] In some embodiments, the distributed transaction management module 710 is configured to, during the read phase, instruct the relational data storage node involved in the read operation to perform the following based on the start timestamp: detect that the read operation is related to data already written in the current branch transaction, query the local data storage table and row lock management table in the main connection to obtain the incremental data of the current branch transaction to the local data storage table, cache the incremental data in a temporary result table, obtain the incremental data through the temporary result table in the read connection, merge the historical snapshot corresponding to the start timestamp and the incremental data, and read the data to be read from the merged data.
[0143] In some embodiments, the distributed transaction management module 710 is configured to construct a global row lock waiting relationship based on the row lock holding and waiting information of all relational data storage nodes; detect whether a deadlock exists based on whether a row lock waiting loop exists in the global row lock waiting relationship; and terminate at least one transaction in the row lock waiting loop in response to detecting a deadlock.
[0144] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, systems, or computer program products. Therefore, this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this disclosure can take the form of a computer program product embodied on one or more (non-transitory) computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, cloud storage, etc.) containing computer program code. A computer program product should be understood as a software product that primarily implements its solution through a computer program.
[0145] This disclosure is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a machine for implementing the flowchart illustrations and / or block diagrams. Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.
[0146] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which are implemented in a process Figure 1 One or more processes and / or boxes Figure 1 The function specified in one or more boxes.
[0147] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, thereby providing instructions that execute on the computer or other programmable equipment for implementing the process. Figure 1 One or more processes and / or boxes Figure 1 The steps of the function specified in one or more boxes.
Claims
1. A distributed transaction processing method, applied to the computing nodes of a distributed database system, comprising: In response to a distributed transaction from a client, a globally unique start timestamp is obtained, wherein the distributed transaction is associated with multiple relational data storage nodes, and each relational data storage node is used to execute a branch transaction of the distributed transaction; Initiate the distributed transaction by sending the start timestamp as the identifier of the distributed transaction to the multiple relational data storage nodes; During the write phase, based on the start timestamp, any relational data storage node involved in the write operation is instructed to apply a row lock in the row lock management table and write the data to be written to the local data storage table. During the pre-commit phase, the first commit timestamp is obtained, and all relational data storage nodes involved in the write operations are instructed to update their respective local data storage tables based on the first commit timestamp, thus pre-committing the distributed transaction. During the commit phase, all relational data storage nodes involved in the write operations are instructed to commit the distributed transaction so that all write operations take effect globally.
2. The method according to claim 1, wherein, Instructing any write operation to apply a row lock to the row lock management table includes: The relational data storage node involved in the write operation is instructed to perform the following: determine whether the user table primary key involved in the write operation exists in the row lock management table; if it does not exist, insert the user table primary key into the row lock management table and set the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; if it exists, update the transaction identifier locking the user table primary key.
3. The method according to claim 2, wherein, Updating the transaction identifier that locks the primary key of the user table includes: If the transaction identifier locking the user table primary key is less than the start timestamp of the current branch transaction, wait for the row lock on the user table primary key to be released, and after release, update the transaction identifier locking the user table primary key to the start timestamp of the current branch transaction; or, If the transaction identifier locking the primary key of the user table is greater than the start timestamp of the current branch transaction, obtain the update timestamp, use the update timestamp to re-acquire the row lock of the primary key of the user table, and after successful acquisition, update the transaction identifier locking the primary key of the user table to the update timestamp.
4. The method according to claim 1, wherein, The data storage table includes a persistent data storage table or a temporary data storage table, and the data storage table that indicates the data to be written to the local storage includes: The instruction is to write the data to be written to the local persistent storage table, wherein the written but uncommitted data is maintained through the caching mechanism and transaction isolation mechanism of the relational data storage node involved in the write operation, so that the written but uncommitted data is only visible to the current branch transaction; or, The instruction is to write the data to be written to a local temporary data storage table, wherein the temporary data storage table has the same structure as the persistent data storage table and is only visible to the current branch transaction.
5. The method according to claim 4, wherein, Instructing all relational data storage nodes involved in write operations to update their local data storage tables based on the first commit timestamp includes: For cases where the data to be written is to a local persistent data storage table, instruct all relational data storage nodes involved in the write operation to update the commit timestamp parameter in their respective local persistent data storage tables to the second commit timestamp; or, In the case where the data to be written is written to the local temporary data storage table, all relational data storage nodes involved in the write operation are instructed to synchronize the data in their respective local temporary data storage tables to their respective local persistent data storage tables, and update the commit timestamp parameter in their respective local persistent data storage tables to the second commit timestamp. The second submission timestamp is obtained by shifting the first submission timestamp one bit to the left and performing a bitwise OR operation with the deletion flag. The deletion flag uses odd and even numbers to represent deletion and non-deletion operations, respectively. The non-deletion operations include insertion and update operations.
6. The method according to claim 1, further comprising: During the read phase, based on the start timestamp, the relational data storage nodes involved in the read operation are instructed to execute: In response to a read operation and the detection of a row lock on the data to be read, if the transaction ID holding the row lock is less than the start timestamp of the current branch transaction, and if the transaction holding the row lock is not currently executing a locking statement, wait for the transaction holding the row lock to commit, and then read the data to be read from the historical snapshot corresponding to the start timestamp. If the transaction holding the row lock is currently executing a locking statement, directly read the data to be read from the historical snapshot corresponding to the start timestamp.
7. The method according to claim 1, further comprising: During the read phase, based on the start timestamp, the relational data storage nodes involved in the read operation are instructed to execute: If the read operation is detected to be related to data already written in the current branch transaction, the local data storage table and row lock management table are queried in the main connection to obtain the incremental data of the current branch transaction to the local data storage table. The incremental data is cached in a temporary result table. The incremental data is obtained through the temporary result table in the read connection. The historical snapshot corresponding to the start timestamp and the incremental data are merged. The data to be read is read from the merged data.
8. The method according to claim 1, further comprising: Construct a global row lock wait relationship based on the row lock holding and waiting information of all relational data storage nodes; Based on whether there is a row lock waiting loop in the global row lock waiting relationship, detect whether there is a deadlock situation; In response to the detection of a deadlock, at least one transaction in the row lock wait loop is terminated.
9. The method according to any one of claims 1-8, wherein, The data storage table includes: a primary key field, a secondary primary key field, a value field, and a transaction identifier field for modifying the primary key. The primary key field includes the user table's primary key; the secondary primary key field includes each second commit timestamp, and these second commit timestamps are sorted in descending order within the user table's primary key; the value field includes the corresponding value for each second commit timestamp. The second submission timestamp is obtained by shifting the first submission timestamp one bit to the left and performing a bitwise OR operation with the deletion flag. The deletion flag uses odd and even numbers to represent deletion and non-deletion operations, respectively. The non-deletion operations include insertion and update operations.
10. The method according to any one of claims 1-8, wherein, The relational data storage node automatically moves cached data to disk when it detects insufficient memory.
11. A computing node for a distributed database system, comprising: One or more modules that perform the distributed transaction processing method according to any one of claims 1-10.
12. A computing node for a distributed database system, comprising: Memory; And a processor coupled to the memory, the processor being configured to execute the distributed transaction processing method of any one of claims 1-10 based on instructions stored in the memory.
13. A distributed database system, comprising: At least one computing node and multiple relational data storage nodes, wherein the computing node is configured to execute the distributed transaction processing method according to any one of claims 1-10.
14. A computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the distributed transaction processing method according to any one of claims 1-10.
15. A computer program product comprising computer instructions that, when executed by a processor, implement the distributed transaction processing method according to any one of claims 1-10.