Distributed transaction processing method, apparatus and computer device
By integrating a transaction manager into the application and recording the pre-commit state, combined with a multi-level fault handling process, the problem of high failure rate in distributed transaction processing is solved, achieving higher robustness and success rate, and reducing transaction failures caused by network jitter and node failures.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NETEASE (HANGZHOU) NETWORK CO LTD
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, distributed transaction processing schemes have a high failure rate, especially in the case of node failure and network instability, which leads to data inconsistency and transaction processing failure.
By integrating a transaction manager into the application, distributed transactions can be initiated and coordinated directly to multiple resource managers. During the preparation phase, the pre-commit status is recorded in a custom transaction log. Combined with a multi-level fault handling process, including local recovery attempts and fallback handling by the global management server, a hierarchical fault-tolerant system is constructed.
The method reduces the commit failure rate of distributed transactions, improves the robustness and eventual success rate of transaction processing, reduces failures caused by network jitter and node failures, and ensures the persistence and traceability of intermediate transaction states.
Description
Technical Field
[0001] This application relates to the field of distributed data processing technology, and in particular to a distributed transaction processing method, apparatus and computer device.
Background Technology
[0002] In the field of distributed database technology, ensuring the atomicity and data consistency of cross-node transactions is a core challenge, and the two-phase commit protocol is one of the key mechanisms to achieve this goal. This mechanism involves the collaboration of a transaction manager and multiple resource managers, ensuring that all participants achieve a consistent final state through two phases: prepare and commit.
[0003] In existing technologies, a typical implementation architecture consists of an application, a transaction manager, and a resource manager. Taking MySQL as the resource manager as an example, it participates in distributed transactions through its provided XA interface. To improve reliability, MySQL itself can achieve high availability through master-slave replication. In earlier versions, the recovery mechanism for transactions in the preparation phase might be incompatible with the binary log replication mechanism after a node failure, potentially leading to inconsistencies between master and slave data. Later versions have improved this, for example, by recording binary logs during the preparation phase to ensure replication correctness.
[0004] However, existing transaction processing schemes based on such mechanisms have a high failure rate when handling distributed transactions.
Summary of the Invention
[0005] Therefore, it is necessary to provide a distributed transaction processing method, apparatus, and computer device that can reduce the exception rate of transaction processing in order to address the above-mentioned technical problems.
[0006] In a first aspect, this application provides a distributed transaction processing method applied to the application side of a distributed database system. The application side integrates a transaction manager, and the distributed database also includes multiple resource managers and a management server. The method includes:
[0007] initiating and coordinating distributed transactions to multiple resource managers; in response to the successful execution of a distributed transaction during the preparation phase, recording the pre-commit status of the distributed transaction in a preset transaction log; and, if a transaction termination operation for any resource manager fails during the commit phase, executing a multi-level fault handling process. The multi-level fault handling process includes: the transaction manager performs a recovery commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
[0008] In one embodiment, recording the pre-commit status of a distributed transaction in a preset transaction log includes: recording information about the resource managers that succeeded in the preparation phase of the distributed transaction, and flushing the transaction log to disk asynchronously through a timed thread independent of the main business thread.
[0009] In one embodiment, the transaction manager performing the recovery commit operation includes: retrying the failed transaction termination operation through local recovery; and, if the local recovery retry fails, creating an asynchronous processing thread to continue executing the failed transaction termination operation and returning the transaction processing result to the application.
[0010] In one embodiment, the duration of local recovery retry includes a preset first duration; If there are multiple local recovery retries, the interval between two adjacent local recovery retries is a preset second duration.
[0011] In one embodiment, the transaction termination operation includes a rollback operation; the management server coordinating the resource manager to complete the transaction termination operation of the distributed transaction, based on the transaction log and operation failure information, includes: after receiving the operation failure information, the management server continuously retries the transaction termination operation within a configurable timeout period; if the operation has still not completed when the configurable timeout period expires, a rollback command is sent to all resource managers.
[0012] In one embodiment, the management server queries a preset metadata database to obtain and coordinate the status of all resource managers.
[0013] In one embodiment, sending an operation failure message indicating a failed transaction termination operation to the management server includes: The transaction manager creates a network connection with the management server and sends operation failure information through the network connection.
[0014] In one embodiment, the method further includes: In response to a failure of the transaction manager, the management server performs recovery coordination operations based on the transaction log for distributed transactions that have a pre-committed state but no completed state.
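As a non-limiting illustration of how transactions "that have a pre-committed state but no completed state" might be identified, the following minimal Python sketch scans a list of log entries. The entry layout (a dictionary with `txn_id` and `state` keys), the state names `PRE_COMMIT` and `COMPLETED`, and the function name `find_in_doubt` are assumptions made for this example and are not specified by this application.

```python
def find_in_doubt(entries):
    """Return ids of transactions that were pre-committed but never completed.

    `entries` is an iterable of dicts with hypothetical keys "txn_id" and "state".
    Ids are returned in the order their PRE_COMMIT entry was first seen.
    """
    pre_committed = []
    completed = set()
    for entry in entries:
        txn_id = entry["txn_id"]
        if entry["state"] == "PRE_COMMIT":
            if txn_id not in pre_committed:
                pre_committed.append(txn_id)
        elif entry["state"] == "COMPLETED":
            completed.add(txn_id)
    return [txn_id for txn_id in pre_committed if txn_id not in completed]


# Illustrative log contents: T1 finished, T2 is still pending.
sample_entries = [
    {"txn_id": "T1", "state": "PRE_COMMIT"},
    {"txn_id": "T2", "state": "PRE_COMMIT"},
    {"txn_id": "T1", "state": "COMPLETED"},
]
in_doubt = find_in_doubt(sample_entries)
```

In this sketch only `T2` would be handed to the recovery coordination step, since `T1` already has a completion record.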
[0015] In a second aspect, this application also provides a distributed transaction processing apparatus applied to the application side of a distributed database system. The application side integrates a transaction manager, and the distributed database system also includes multiple resource managers and a management server. The apparatus includes: a transaction initiation module, used to initiate and coordinate distributed transactions to multiple resource managers; a logging module, used to record the pre-commit status of the distributed transaction in a preset transaction log in response to the successful execution of the distributed transaction during the preparation phase; and a fault handling module, used to execute a multi-level fault handling process if a transaction termination operation for any resource manager fails during the commit phase. The multi-level fault handling process includes: the transaction manager performs a recovery commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
[0016] In a third aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method.
[0017] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method.
[0018] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described method.
[0019] In the distributed transaction processing method, apparatus, and computer device provided in this application, the application side, which integrates the transaction manager, directly initiates and coordinates distributed transactions with multiple resource managers. This eliminates the network interaction with an independent transaction coordinator node and, at the architectural level, reduces the probability of failures caused by an additional unstable network link. Secondly, by recording the pre-commit state to an independent custom transaction log immediately after a successful preparation phase, a reliable state checkpoint independent of any single resource manager is established for the entire distributed transaction, ensuring the persistence and traceability of intermediate transaction states and providing crucial evidence for fault recovery. Furthermore, when a transaction termination operation failure is detected during the commit phase, a multi-level fault handling process is triggered that combines local recovery attempts by the transaction manager with fallback handling by the global management server, forming a hierarchical, clearly defined, and progressively escalating fault-tolerant system. This system can respond to abnormal scenarios ranging from transient network jitter to persistent node failures, and automatically repair suspended transactions to the greatest extent possible. Thus, this application enhances the robustness and eventual success rate of distributed transaction processing, thereby reducing the commit anomaly rate of distributed transactions.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a schematic diagram of the structure of a distributed database system provided in an embodiment of this application; Figure 2 is a schematic diagram of a distributed transaction processing flow provided in an embodiment of this application; Figure 3 is a flowchart of the steps of a recovery commit operation performed by a transaction manager, as provided in an embodiment of this application; Figure 4 is a schematic diagram of a distributed transaction fault handling process provided in an embodiment of this application; Figure 5 is a schematic diagram of the structure of a distributed transaction processing apparatus provided in an embodiment of this application; Figure 6 is a schematic diagram of the internal structure of a computer device provided in an embodiment of this application.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0023] In one exemplary embodiment, Figure 1 is a schematic diagram of the structure of a distributed database system provided in an embodiment of this application. The distributed transaction processing method provided in this embodiment can be applied to the distributed database system shown in Figure 1, which includes an application side, multiple resource managers, and a management server. The application side integrates a transaction manager.
[0024] In this context, the application side refers to the software entity that runs specific business logic and needs to access a distributed database, such as the server program of an online game. The resource manager refers to the component responsible for managing specific data resources and performing transactional operations; when using a relational database (such as MySQL) at the underlying level, its database engine layer acts as the resource manager. The management server refers to an independently deployed server component used for global coordination and fallback handling of exceptional transactions. The transaction manager refers to the logical component responsible for coordinating distributed transactions across multiple resource managers, driving them through the preparation and commit phases.
[0025] For example, a terminal running a specific business application (such as a game server) integrates the functional code of a transaction manager within its application process. This terminal communicates with an independently deployed management server via a network and can initiate data operations to multiple resource managers (i.e., multiple database nodes) distributed across the network.
[0026] As an example, the game server (application) can integrate the transaction manager code into its own process by importing a specific JAR file. When cross-server player data needs to be updated, it acts as a coordinator, initiating transactions through database connections to the databases of the two involved game regions (Resource Managers A and B), while simultaneously maintaining a heartbeat or reporting connection with a separately deployed management server.
[0027] By integrating the transaction manager into the terminal running specific business applications (such as game servers), the network link between the application and the independent transaction manager in traditional architectures is eliminated. This architectural optimization directly reduces a potential point of failure, thereby reducing the rate of distributed transaction commit anomalies caused by network partitions or latency at the system architecture level.
[0028] In one exemplary embodiment, as shown in Figure 2, which is a schematic diagram of a distributed transaction processing flow provided in an embodiment of this application, a distributed transaction processing method is provided. Taking its application to the application side in Figure 1 as an example, the method includes the following steps S201 to S203:
S201. Initiate and coordinate distributed transactions to multiple resource managers.
[0029] In this context, a resource manager refers to a component responsible for managing specific data resources and performing transactional operations. In a distributed database system, this can be an independent database node instance. Coordinating distributed transactions refers to following a two-phase commit protocol, acting as a transaction coordinator to drive multiple resource managers through the preparation and commit phases, ensuring the atomicity of operations across multiple data nodes.
[0030] For example, when a terminal running a specific business application (such as the server side of an online game) needs to update data on multiple different database nodes, its internally integrated transaction manager component acts as a coordinator, initiating and managing a cross-node transaction process to each resource manager involved. This includes initial transaction branch registration, sending transaction requests containing specific data operations, and organizing the behavior of each participant according to a two-phase commit protocol.
[0031] As an example, in a massively multiplayer online game, when a player performs a cross-server trade, in-game gold must be deducted in the database of one regional server while an item is added in the database of another regional server. The terminal (server) running the game logic initiates a distributed transaction with the two resource managers (database nodes) that represent these two regional databases, and coordinates them to execute, in sequence, the pre-write operations of deducting the gold and adding the item.
[0032] By having the terminal running the specific business application directly assume the role of distributed transaction coordinator, initiating and driving transactions across multiple resource managers, the additional network latency and single point of failure risks that may arise from independent transaction coordinator nodes in traditional architectures can be eliminated. This built-in coordination capability makes the transaction initiation path shorter and more direct, laying a solid foundation for the subsequent reliable execution of two-phase commit, thereby reducing the distributed transaction commit anomaly rate caused by complex or unstable coordination links at the source.
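As a non-limiting illustration of step S201, the following minimal Python sketch shows an in-process coordinator driving participants through the prepare and commit phases of a two-phase commit. The class `InMemoryResourceManager` and the function `run_two_phase_commit` are hypothetical stand-ins invented for this example; a real resource manager would be a database node reached over a connection, for instance through an XA interface.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryResourceManager:
    """Hypothetical stand-in for a database node (not a real database client)."""
    name: str
    fail_prepare: bool = False
    state: dict = field(default_factory=dict)

    def prepare(self, txn_id):
        # Vote "no" if this node is configured to fail the preparation phase.
        if self.fail_prepare:
            return False
        self.state[txn_id] = "prepared"
        return True

    def commit(self, txn_id):
        self.state[txn_id] = "committed"

    def rollback(self, txn_id):
        self.state[txn_id] = "rolled_back"


def run_two_phase_commit(txn_id, resource_managers):
    """Prepare every participant, then commit all or roll back those prepared."""
    prepared = []
    for rm in resource_managers:
        if rm.prepare(txn_id):
            prepared.append(rm)
        else:
            # A single negative vote aborts the whole transaction.
            for done in prepared:
                done.rollback(txn_id)
            return "rolled_back"
    for rm in resource_managers:
        rm.commit(txn_id)
    return "committed"


north = InMemoryResourceManager("DB_North")
east = InMemoryResourceManager("DB_East")
outcome = run_two_phase_commit("T12345", [north, east])
```

Because the coordinator runs inside the application process in this sketch, no separate coordinator node sits between the application and the participants, which is the architectural point made above. Commit-phase failures are deliberately not handled here.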
[0033] S202. In response to the successful execution of the distributed transaction during the preparation phase, the pre-commit status of the distributed transaction is recorded in the preset transaction log.
[0034] The preparation phase can refer to the first phase of a two-phase commit protocol (2PC), where the coordinator sends a preparation request to all participants. Each participant performs the transaction operations but does not commit, and reports the result (success or failure) back to the coordinator. The pre-commit state can refer to an intermediate state of the distributed transaction after all participants have successfully completed the preparation phase. At this point, the transaction data is ready and locked on each participant's node, but has not yet been finally committed. The preset transaction log refers to a persistent file with a system-defined format and storage method, specifically used to record key state information of the distributed transaction, and can be distinguished from the database engine's own system logs (such as binary logs).
[0035] For example, a terminal running a specific business application, after sending preparation instructions to all resource managers it coordinates and receiving all successful responses, determines that the distributed transaction has been successfully executed in the preparation phase. At this point, the terminal does not immediately enter the commit phase, but instead writes this critical phase status—the pre-commit success status—along with the necessary transaction identifiers and participant information, into a specially designed transaction log file. This log can be maintained by the transaction manager itself, independent of the database node's log system.
[0036] As an example, after receiving confirmation of "ready successfully" from both the "North China Region" and "East China Region" database nodes, the game server creates a dedicated log entry on its local disk, recording "Transaction T12345 is ready on nodes DB_North and DB_East". This record indicates that the transaction has crossed the first critical point, providing a basis for possible fault recovery.
[0037] By recording the pre-commit state to a separate transaction log after a successful preparation phase, a reliable checkpoint is established for the entire distributed transaction. This checkpoint is independent of the state of any individual resource manager. Even if network partitions or node failures occur during the subsequent commit phase, the system can still know from this log that a ready but incomplete transaction exists. This proactive and customizable state persistence mechanism is the foundation for building subsequent advanced fault tolerance capabilities, thereby enhancing the system's visibility and recoverability of intermediate state transactions and effectively reducing the commit anomaly rate caused by state loss or confusion.
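As a non-limiting illustration of step S202, the following minimal Python sketch appends a pre-commit record to a log file kept by the coordinator itself. The JSON-lines layout, the field names, the one-file-per-transaction naming, and the function names `record_pre_commit` and `read_entries` are assumptions made for this example; this application only requires that the log have a custom format independent of the database's own logs. The sketch flushes synchronously for simplicity.

```python
import json
import os
import tempfile
import time


def record_pre_commit(log_dir, txn_id, prepared_nodes):
    """Append a PRE_COMMIT entry for txn_id and return the log file path."""
    entry = {
        "txn_id": txn_id,
        "state": "PRE_COMMIT",  # hypothetical state name
        "prepared_nodes": list(prepared_nodes),
        "timestamp": time.time(),
    }
    path = os.path.join(log_dir, f"{txn_id}.log")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the record to disk before returning
    return path


def read_entries(path):
    """Read back every entry in a log file, e.g. during fault recovery."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


log_dir = tempfile.mkdtemp()
log_path = record_pre_commit(log_dir, "T12345", ["DB_North", "DB_East"])
entries = read_entries(log_path)
```

A recovery component that can read this file learns that transaction `T12345` reached the prepared state on both nodes, without having to query either node.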
[0038] S203. If a transaction termination operation for any resource manager fails during the commit phase, a multi-level fault handling process is executed.
[0039] The multi-level fault handling process includes: the transaction manager performs a recovery commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
[0040] The commit phase can refer to the second phase of a two-phase commit protocol, where the coordinator, based on feedback from the preparation phase, sends a final commit or rollback instruction to all participants, ensuring the transaction reaches a consistent final state across all nodes. The transaction termination operation refers to the instruction issued during the commit phase to complete the transaction; it can be a commit instruction or a rollback instruction if necessary. The multi-level fault handling process refers to a predefined, structured fault-tolerant mechanism containing multiple layers of response, from shallow to deep and from local to global, which can resolve transaction termination operation failures by attempting each layer sequentially.
[0041] For example, after the terminal issues a commit command to all resource managers, if it detects that the commit operation on any of the resource managers has failed (e.g., network timeout, connection interruption, or node unresponsiveness), it will not immediately determine that the entire transaction has failed or enter an infinite wait. Instead, it can trigger a built-in, hierarchical fault handling process. This process can first attempt to recover locally on the terminal (i.e., within the transaction manager). If the local recovery capabilities are exhausted and the problem remains unresolved, the issue is escalated to a more global and powerful component in the system (the management server) for handling.
[0042] As an example, after a successful preparation phase, the game server sends commit commands to the "North China Region" and "East China Region" databases. The commit to the "North China Region" database is successful, but the network suddenly interrupts when sending the command to the "East China Region" database, causing the commit to fail. In this situation, the game server doesn't necessarily immediately return a transaction failure message to the player, nor does it indefinitely block and wait. Instead, it initiates its multi-level fault handling process to attempt to automatically resolve the problem.
[0043] By implementing a structured, multi-level fault handling process, this approach changes the traditional two-phase commit protocol's crude approach of either simply retrying or rolling back the entire transaction when some participants fail to commit. This escalating fault tolerance strategy can take the most appropriate measures for faults of different natures and durations (such as transient network jitter, short-term node overload, persistent network partitions, etc.), thereby improving the success rate of resolving suspended transactions (partially committed but partially pending). This systematic fault handling capability directly translates into an improved transaction commit success rate, thus reducing the overall commit anomaly rate of distributed transactions.
[0044] In this embodiment, the application side, which integrates the transaction manager, directly initiates and coordinates distributed transactions with multiple resource managers. This can eliminate the network interaction with an independent transaction coordinator node and, at the architectural level, reduce the probability of failures caused by an additional unstable network link. Secondly, by recording the pre-commit state to an independent custom transaction log immediately after a successful preparation phase, a reliable state checkpoint independent of any single resource manager can be established for the entire distributed transaction, ensuring the persistence and traceability of intermediate transaction states and providing crucial evidence for fault recovery. Furthermore, when a transaction termination operation failure is detected during the commit phase, a multi-level fault handling process is triggered that combines local recovery attempts by the transaction manager with fallback handling by the global management server, forming a hierarchical, clearly defined, and progressively escalating fault-tolerant system. This system can respond to abnormal scenarios ranging from transient network jitter to persistent node failures, and automatically repair suspended transactions to the greatest extent possible. Thus, this application can enhance the robustness and eventual success rate of distributed transaction processing, thereby reducing the commit anomaly rate of distributed transactions.
[0045] In an exemplary embodiment, step S202, recording the pre-commit status of the distributed transaction to a preset transaction log, includes: Record resource manager information for successful preparation phases in distributed transactions, and perform asynchronous disk flushing of transaction logs through a timed thread independent of the main business thread.
[0046] The resource manager information refers to descriptive information that identifies which database node(s) have successfully completed the preparation phase, such as node identifiers and transaction branch identifiers. The main business thread refers to the thread that handles the core business logic of the application and the distributed transaction coordination process. The timed thread refers to a background thread that is woken up at fixed time intervals to execute specific tasks. The asynchronous disk flushing operation refers to persisting the transaction log content held in the memory buffer to the disk storage medium. This operation is not executed synchronously with the main business thread, thus avoiding blocking the main business thread.
[0047] For example, when recording the pre-commit status of a transaction, information about all participating nodes is not recorded; instead, only information about those nodes that successfully return a Ready response is selectively recorded. Furthermore, writing to the log file does not wait for the data to reach the disk immediately after the write system call is issued; instead, the data is first placed in a memory buffer, and a separate background thread is responsible for periodically writing the buffered log contents of multiple transactions to disk together.
[0048] As an example, transaction X involves nodes A, B, and C. Only nodes A and B returned a successful Prepare response, while node C timed out. In this case, the transaction log can simply record "Transaction X successfully prepared on nodes A and B". After writing this record, the main thread continues the subsequent process, while another log flushing thread, which runs every 100 milliseconds, is responsible for finally saving this record to the physical disk.
[0049] In this embodiment, selectively recording success status avoids redundancy and interference caused by recording invalid or failed information. An asynchronous disk flushing mechanism decouples disk I/O operations that may cause performance bottlenecks from the main transaction coordination process. In high-concurrency scenarios such as distributed game databases, this can alleviate the performance pressure on the transaction manager caused by log write delays, reduce timeouts and failures caused by congestion in the coordinator's own processing, and thus reduce the transaction commit anomaly rate from a performance optimization perspective.
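As a non-limiting illustration of asynchronous disk flushing by a timed thread, the following minimal Python sketch buffers log lines in memory and lets a background thread write them to disk at a fixed interval. The class name `AsyncLogWriter`, its methods, and the 0.1 second interval (mirroring the 100 millisecond example above) are assumptions made for this example.

```python
import os
import tempfile
import threading


class AsyncLogWriter:
    """Buffers log lines in memory; a timed background thread flushes them."""

    def __init__(self, path, flush_interval_s=0.1):
        self._path = path
        self._interval = flush_interval_s
        self._buffer = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, line):
        """Called from the business thread; returns without touching the disk."""
        with self._lock:
            self._buffer.append(line)

    def _flush(self):
        # Swap the buffer out under the lock, then do the slow I/O outside it.
        with self._lock:
            pending, self._buffer = self._buffer, []
        if not pending:
            return
        with open(self._path, "a", encoding="utf-8") as f:
            f.write("".join(line + "\n" for line in pending))
            f.flush()
            os.fsync(f.fileno())

    def _run(self):
        # Event.wait returns False on timeout and True once close() sets it.
        while not self._stop.wait(self._interval):
            self._flush()

    def close(self):
        self._stop.set()
        self._thread.join()
        self._flush()  # final flush so buffered lines are not lost on shutdown


path = os.path.join(tempfile.mkdtemp(), "txn.log")
writer = AsyncLogWriter(path, flush_interval_s=0.1)
writer.append("Transaction X successfully prepared on nodes A and B")
writer.close()
with open(path, encoding="utf-8") as f:
    flushed_lines = f.read().splitlines()
```

One consequence of this design is that a record appended after the most recent flush exists only in memory, so a process crash inside that window could lose it; the flush interval therefore trades coordinator latency against the durability of the newest records.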
[0050] In some exemplary implementations, the preset transaction log uses a custom format and creates a separate log file for each distributed transaction, rather than writing to the database's standard binary log.
[0051] In an exemplary embodiment, the multi-level fault handling process in step S203 includes: first, the transaction manager performs a recovery commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
[0052] The "recovery commit operation" refers to the action taken by the transaction manager to retry the commit after detecting a commit failure. The "operation failure information" refers to data describing the specific circumstances of the commit operation failure, which may include the failed transaction identifier, the target resource manager identifier, the error type, and a timestamp.
[0053] For example, upon detecting a failed commit command, the transaction manager first attempts to recover on its own, such as by resending the command. If its recovery attempt ultimately fails, the transaction manager can package the details of the failure into a report and send it to a separately deployed management server in the system, transferring subsequent coordination responsibilities to it.
[0054] As an example, if the game server fails to commit to database A, it first attempts to reconnect and send a COMMIT command. If it still fails after several retries due to network connectivity issues, the game server will stop trying and notify the management server of the message "Transaction X failed to commit to node A" via the network connection.
[0055] In this embodiment, a multi-level fault handling logic is established, prioritizing local retries before reporting for assistance. This achieves a reasonable allocation of fault tolerance responsibilities and a smooth escalation of fault handling. Local recovery by the transaction manager can quickly resolve most transient faults; for persistent faults that cannot be resolved locally (such as persistent network partitions), the global management server can take over, avoiding prolonged blocking or unnecessary resource consumption by the transaction manager. This collaborative mechanism ensures that faults are handled by the most suitable component, thereby systematically improving the efficiency of resolving suspended transactions and reducing the overall transaction anomaly rate.
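As a non-limiting illustration of the escalation order described above, the following minimal Python sketch retries a commit locally and, once the retries are exhausted, hands a failure report to the management server. The `OperationFailure` fields follow the items listed for the operation failure information; the names `handle_commit_failure`, `commit_once`, and `report`, and the use of `ConnectionError` to represent a network fault, are assumptions made for this example. A plain list stands in for the network connection to the management server.

```python
import time
from dataclasses import dataclass


@dataclass
class OperationFailure:
    """Hypothetical failure report sent to the management server."""
    txn_id: str
    resource_manager: str
    error_type: str
    timestamp: float


def handle_commit_failure(txn_id, rm_name, commit_once, report, max_retries=3):
    """Level 1: retry locally. Level 2: report to the management server."""
    last_error = "unknown"
    for _ in range(max_retries):
        try:
            commit_once()
            return "committed_locally"
        except ConnectionError as exc:
            last_error = type(exc).__name__
    # Local recovery is exhausted, so escalate with the failure details.
    report(OperationFailure(txn_id, rm_name, last_error, time.time()))
    return "escalated"


reports = []  # stands in for the connection to the management server


def node_a_down():
    raise ConnectionError("node A unreachable")


result = handle_commit_failure("X", "A", node_a_down, reports.append)
```

In this sketch a transient fault that clears within the retry budget never reaches the management server, while a persistent fault produces exactly one report that identifies the transaction and the failing node.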
[0056] In one exemplary embodiment, Figure 3 is a flowchart of the steps of a recovery commit operation performed by a transaction manager, as provided in an embodiment of this application. As shown in Figure 3, in step S203, the transaction manager performing the recovery commit operation includes:
S301. Retry the failed transaction termination operation locally.
[0057] S302. If local recovery retry fails, an asynchronous processing thread is created to continue executing the failed transaction termination operation and return the transaction processing result to the application.
[0058] Local recovery retry refers to the transaction manager immediately or briefly retrying the failed operation within the current main thread or synchronization context. Asynchronous processing threads refer to threads created specifically to handle the failed operation and running independently in the background.
[0059] For example, when the application's transaction manager initiates local recovery, it can first perform a synchronous retry. If the synchronous retry fails, the transaction manager will create a new thread to continuously perform the retry operation in the background. At the same time, in order not to block the application's main logic, the transaction manager can first return a preliminary transaction processing result to the application (such as "processing" or "committed, awaiting final confirmation").
[0060] As an example, after the game server's first commit to database A fails, it immediately retries twice synchronously. After both attempts fail, it creates a background thread named "Retry Transaction X - Node A" to continue attempting the commit, while the game server's main logic can first return an "Item deduction request accepted" message to the game client, allowing the player to continue with other operations.
[0061] In this embodiment, by separating time-consuming retry operations from the synchronous process to an asynchronous thread and allowing the application to know the results in advance, the entire application service thread is effectively prevented from being suspended for a long time due to the failure of a single resource manager node. This improves the responsiveness and availability of the application, avoids a chain reaction of transaction failures caused by application blocking timeouts, and reduces the risk of a local resource manager failure spreading into a global transaction anomaly.
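Steps S301 and S302 can be sketched as a synchronous retry followed by a background thread. This is a minimal sketch under stated assumptions: `try_commit`, the retry counts, the thread name, and the status strings "committed" and "processing" are illustrative choices, not details fixed by this application.

```python
import threading
from typing import Callable, Optional, Tuple


def commit_with_async_fallback(
    try_commit: Callable[[], bool],  # hypothetical: True if the termination operation succeeded
    sync_retries: int = 2,           # assumed: two synchronous retries, as in the game server example
    async_retries: int = 5,          # assumed upper bound for the background attempts
) -> Tuple[str, Optional[threading.Thread]]:
    """Return a preliminary result and, if one was started, the background worker."""
    # S301: first attempt plus synchronous retries in the caller's thread.
    for _ in range(1 + sync_retries):
        if try_commit():
            return "committed", None

    # S302: keep retrying in the background so the caller is not blocked.
    def background_retry() -> None:
        for _ in range(async_retries):
            if try_commit():
                return

    worker = threading.Thread(target=background_retry, name="retry-txn", daemon=True)
    worker.start()
    # The application receives a preliminary result immediately.
    return "processing", worker
```

The caller can respond to the client as soon as "processing" is returned; the worker handle is returned only so that the example can be observed or joined.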
[0062] In one exemplary embodiment, the duration of local recovery retry includes a preset first duration; if there are multiple local recovery retries, the interval between two adjacent local recovery retries is a preset second duration.
[0063] The first duration refers to the maximum total time allowed by the transaction manager to perform local recovery retries. The second duration refers to the time interval between two consecutive retry attempts.
[0064] For example, when configuring the retry behavior of its transaction manager, two key parameters can be set: one parameter controls the total retry time (first duration), and the other parameter controls how long to wait after each retry before making the next attempt (second duration).
[0065] As an example, the transaction manager can be configured to perform local recovery retries of a failed operation for at most 5 minutes (first duration), with an interval of 30 seconds between consecutive retries (second duration).
[0066] In this embodiment, parameterized configuration of retry duration and interval provides flexibility to cope with different network environments and fault types. A reasonable retry strategy can maximize the chance of repairing temporary faults such as momentary network jitter or brief node unavailability without excessively consuming system resources. This controllable and continuous effort can successfully salvage many transactions that would otherwise fail outright, thereby directly reducing the commit anomaly rate of distributed transactions.
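The two parameters can be illustrated with a small retry loop. This is a sketch only: the function name and the injectable `clock` and `sleep` parameters are assumptions added so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def retry_within_window(
    operation: Callable[[], bool],   # hypothetical: True if the termination operation succeeded
    first_duration_s: float,         # maximum total time spent on local recovery retries
    second_duration_s: float,        # interval between two adjacent retries
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry until success or until the next attempt would exceed the first duration."""
    deadline = clock() + first_duration_s
    while True:
        if operation():
            return True
        # Stop if waiting one more interval would overrun the retry window.
        if clock() + second_duration_s > deadline:
            return False
        sleep(second_duration_s)
```

With the values from the example above, `retry_within_window(op, 300, 30)` makes an attempt every 30 seconds and gives up once the 5-minute window is used up.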
[0067] In one exemplary embodiment, the transaction termination operation includes a rollback operation; the management server coordinating the resource managers to complete the transaction termination operation of the distributed transaction, based on the transaction log and the operation failure information, includes: after receiving the operation failure information, the management server continuously retries the transaction termination operation within a configurable timeout period; if the operation still fails once the configurable timeout period has elapsed, a rollback command is uniformly sent to all resource managers.
[0068] The configurable timeout refers to the maximum time window during which the management server is allowed to retry operations after taking over the processing; this time can be adjusted through system parameters. The unified rollback command refers to the rollback command sent by the management server to all resource managers involved in the distributed transaction, requesting them to revert the transaction.
[0069] For example, a standalone management server, upon receiving a transaction failure report from the application, may not immediately make a final decision. Instead, it may continuously attempt to complete the originally failed operation (such as a commit) within a pre-defined time window set by the administrator. Only after this time window has expired, if the operation still fails, will the management server adopt a conservative strategy, that is, command all resource managers involved in the transaction to perform a rollback to ensure eventual data consistency.
[0070] As an example, after receiving a report that "Transaction X failed to commit to node A," the management server can continuously attempt to send COMMIT commands to node A over the next 10 minutes (the configurable timeout period). If node A remains unreachable after 10 minutes, the management server will simultaneously send ROLLBACK commands to nodes A and B, requesting them to roll back transaction X.
[0071] In this embodiment, the management server provides a final, globally consistent fault-tolerance defense. It continues to complete commits within a configurable grace period to maximize transaction success; after the grace period expires, a unified rollback is enforced to completely resolve pending transactions and prevent long-term data inconsistency. This fallback strategy enhances the system's ability to handle persistent failures, ensures eventual consistency in extreme cases, and reduces the risk of permanent data anomalies caused by unresolved pending transactions.
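The fallback behavior of the management server can be sketched as a bounded retry followed by a unified rollback. This is a minimal sketch that follows the text literally; the function name, the callbacks `retry_commit` and `send_rollback`, the retry interval, and the injectable `clock` and `sleep` are assumptions for illustration.

```python
import time
from typing import Callable, Iterable


def resolve_on_management_server(
    retry_commit: Callable[[], bool],      # hypothetical: True if the failed operation now succeeds
    send_rollback: Callable[[str], None],  # hypothetical: sends ROLLBACK to one resource manager
    participants: Iterable[str],           # all resource managers involved in the transaction
    timeout_s: float,                      # the configurable timeout period
    interval_s: float,                     # assumed pause between retries
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry within the grace period, then roll back every participant."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if retry_commit():
            return "committed"
        sleep(interval_s)
    # Grace period expired: issue the unified rollback command.
    for node in participants:
        send_rollback(node)
    return "rolled_back"
```

For the example above, `timeout_s=600` corresponds to the 10-minute window, and `participants` would be nodes A and B.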
[0072] In one exemplary embodiment, the management server queries a preset metadata database to obtain and coordinate the status of all resource managers.
[0073] The metadata database refers to a dedicated database used by the management server to store and maintain metadata of the entire distributed database system (such as the network addresses, status, and shards of all resource managers).
[0074] For example, a separately deployed management server maintains or accesses a metadata database that records the configuration and real-time status information of all resource managers in the system. When the management server needs to coordinate transactions, it can query this metadata database to determine which specific resource manager nodes to send instructions to.
[0075] As an example, when the management server decides to roll back transaction X, it needs to know that transaction X involves nodes A and B. It queries the metadata of transaction X recorded in the metadata database to accurately obtain the connection information of nodes A and B, and then sends instructions to them respectively.
[0076] In this embodiment, the metadata of the resource manager is centrally managed through a metadata database, enabling the management server to accurately and efficiently fulfill its role as the global coordinator. This avoids the fragmentation and hard-coding of coordination logic, improves the manageability and scalability of the system, and ensures that fault-tolerant operations can accurately target the target nodes in complex distributed environments. This improves the reliability of the fault-tolerant process itself and indirectly reduces the probability of new anomalies caused by coordination errors.
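The lookup described above can be sketched with an in-memory SQLite database standing in for the metadata database. The schema is purely hypothetical: the table names `resource_manager` and `txn_participant`, their columns, and both function names are assumptions, since this application does not specify a storage layout.

```python
import sqlite3
from typing import List, Tuple


def open_metadata_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the metadata database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource_manager (
            node TEXT PRIMARY KEY, host TEXT, port INTEGER, status TEXT
        );
        CREATE TABLE txn_participant (
            txn_id TEXT, node TEXT REFERENCES resource_manager(node)
        );
        """
    )
    return conn


def participants_of(conn: sqlite3.Connection, txn_id: str) -> List[Tuple[str, str, int]]:
    """Return (node, host, port) for every resource manager involved in txn_id."""
    return conn.execute(
        """
        SELECT rm.node, rm.host, rm.port
        FROM txn_participant AS tp
        JOIN resource_manager AS rm ON rm.node = tp.node
        WHERE tp.txn_id = ?
        ORDER BY rm.node
        """,
        (txn_id,),
    ).fetchall()
```

Before rolling back transaction X, the management server would call `participants_of(conn, "X")` and send the instruction to each returned address.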
[0077] In an exemplary embodiment, sending the operation failure information of the failed transaction termination operation to the management server includes: the transaction manager creates a network connection with the management server and sends the operation failure information through the network connection.
[0078] In this context, a network connection refers to a communication link established between two software entities (here, the transaction manager within the application and the management server) for transmitting data, such as a TCP socket connection.
[0079] For example, when the transaction manager in the application decides to report a fault, it can proactively initiate a network connection to the management server address and send a message containing failure details through this newly established connection channel.
[0080] As an example, if the transaction manager on the game server still fails to commit to node A after 5 minutes of asynchronous retries, it creates a socket connection to the management server's IP address and port, and sends a formatted message through the connection describing transaction X and the failure process.
[0081] In this embodiment, reporting is performed by explicitly creating network connections, making the communication process direct and controllable. This method is simpler and more reliable than relying on complex message middleware or shared storage, reducing the dependence on the fault reporting process itself and potential failure points. It ensures that critical fault information can be transmitted to the global coordinator in a timely and accurate manner, providing the necessary conditions for initiating the final fallback processing. This ensures the complete operation of the fault tolerance mechanism and helps reduce the transaction anomaly rate.
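A failure report sent over a plain TCP socket might look like the sketch below. The message format is an assumption: this application only speaks of a "formatted message", so the JSON fields, the newline framing, and all three function names are illustrative.

```python
import json
import socket


def encode_failure_report(txn_id: str, node: str, operation: str, attempts: int) -> bytes:
    """Serialize the failure details as one newline-terminated JSON line (assumed format)."""
    payload = {
        "txn_id": txn_id,
        "node": node,
        "operation": operation,
        "attempts": attempts,
    }
    return (json.dumps(payload, sort_keys=True) + "\n").encode("utf-8")


def send_failure_report(sock: socket.socket, report: bytes) -> None:
    """Write the whole report to an already connected socket."""
    sock.sendall(report)


def report_failure(host: str, port: int, report: bytes, timeout_s: float = 5.0) -> None:
    """Open a new TCP connection to the management server and send the report."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        send_failure_report(sock, report)
```

Separating encoding from sending keeps the connection handling small and lets the message format change without touching the network code.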
[0082] In one exemplary embodiment, the above method further includes: In response to a failure of the transaction manager, the management server performs recovery coordination operations based on the transaction log for distributed transactions that have a pre-committed state but no completed state.
[0083] In this context, a transaction manager failure could refer to an unexpected crash or restart of the terminal process running a specific business application. A completion status refers to the final state of a transaction recorded in the transaction log, such as committed or rolled back. Recovery coordination operations refer to the process by which the management server reads the transaction log, identifies incomplete transactions, and proactively intervenes to drive these transactions to their final state.
[0084] For example, if an application with an integrated transaction manager suddenly crashes and restarts, a standalone management server can proactively assume the responsibility of recovery. The management server can access the transaction log files previously persisted by the application, scan for transactions marked as pre-committed but without a final status record, and can continue to coordinate the completion of these transactions on behalf of the failed transaction manager.
[0085] As an example, the game server might crash and restart. After detecting this event, the management server can read the transaction log file left on the server's hard drive and find that transaction Y has a record "Prepare succeeded on nodes M and N" but no subsequent "Done" record. The management server then takes over and completes transaction Y according to its own strategy (such as deciding to commit or roll back after querying the status of each node).
[0086] In this embodiment, by granting the management server the ability to analyze logs and coordinate recovery after a transaction manager failure, the fatal weakness of traditional 2PC, namely the coordinator being a single point of failure, is addressed. Even if the transaction manager itself fails, the information left in the custom logs can still be acted upon by the management server, ensuring that no transaction is permanently suspended due to the disappearance of the coordinator. This enhances the robustness of the system and the reliability of the data, fundamentally reducing the rate of distributed transaction commit anomalies caused by transaction manager crashes.
[0087] In some exemplary implementations, in order to simplify the recovery process and avoid conflicts, when the transaction manager restarts after a failure, it no longer processes the remaining unfinished transactions, but instead hands them over to the management server for unified processing.
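The log scan performed during recovery coordination can be sketched as below. The log format is an assumption: one JSON object per line with a `txn_id`, a `state` of either "PREPARED" or "DONE", and a `nodes` list on prepared records. The custom log format of this application is not specified at this level of detail.

```python
import json
from typing import Dict, Iterable, List


def find_pending_transactions(log_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return transactions with a pre-commit record but no completion record.

    Maps each pending transaction ID to the resource managers that prepared it.
    """
    pending: Dict[str, List[str]] = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["state"] == "PREPARED":
            pending[record["txn_id"]] = record["nodes"]
        elif record["state"] == "DONE":
            # A completion record resolves the transaction.
            pending.pop(record["txn_id"], None)
    return pending
```

In the example above, the scan would return transaction Y with nodes M and N, and the management server would then drive Y to a final state.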
[0088] In some exemplary embodiments, Figure 4 is a schematic diagram of a distributed transaction fault handling process provided in an embodiment of this application. As shown in Figure 4, the process includes: S401. In response to the successful execution of the distributed transaction during the preparation phase, the application notifies the internally integrated transaction manager to commit the distributed transaction.
[0089] S402. The transaction manager inside the terminal directly issues transaction termination operation commands to each resource manager involved.
[0090] S403. Determine whether all resource managers have successfully executed the transaction termination operation command.
[0091] S404. If a resource manager fails to execute, the transaction manager inside the terminal initiates an asynchronous retry process to retry the failed transaction termination operation command; the transaction log on which the retry is based is maintained by an independent log component.
[0092] S405. Determine whether all resource managers have successfully executed the transaction termination operation command after the retry process; if so, it means that the distributed transaction operation was successful.
[0093] S406. Determine whether the asynchronous retry process has enabled all resource managers to execute successfully within the preset time.
[0094] S407. If the asynchronous retry process times out, the handling of the distributed transaction is transferred to the management server, which then performs periodic retry processing and executes a rollback operation once the configurable final timeout period is reached.
[0095] The transaction termination command refers to the instruction used to complete the transaction during the commit phase; it can be a commit instruction or, in specific cases, a rollback instruction. The asynchronous retry process refers to a repeated attempt mechanism created by the transaction manager and running in the background independently of the main business thread. The independent logging component refers to a module responsible for maintaining a custom-formatted transaction log, whose logging and flushing mechanisms are independent of the underlying database's logging system. The transfer of control refers to transferring the responsibility and authority for coordinating and handling the distributed transaction failure from the application's internal transaction manager to the global management server. Periodic retry processing refers to the management server attempting to complete the transaction termination operation at fixed time intervals. The final timeout refers to the maximum total time allowed for retries by the management server; exceeding this time will trigger a forced rollback.
[0096] In this embodiment, by optimizing the overall architecture for the transaction process of the distributed database, the transaction failure rate caused by network anomalies is initially reduced. Furthermore, by introducing asynchronous retry, management server component takeover mechanism, and customized logging scheme, the success rate of transaction processing is further improved, thereby increasing the final successful commit rate of distributed transactions and reducing the commit anomaly rate caused by various faults.
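The control flow of steps S401 to S407 can be traced with a short sketch. It only models the branching, under the assumption that each callback reports success or failure for one resource manager; the function and parameter names are hypothetical, and the actual retry timing and logging are omitted.

```python
from typing import Callable, List, Sequence


def run_fault_handling_flow(
    nodes: Sequence[str],
    send_command: Callable[[str], bool],      # hypothetical: S402 result for one node
    async_retry: Callable[[str], bool],       # hypothetical: True if the retry succeeds within the preset time
    hand_over: Callable[[List[str]], None],   # hypothetical: transfer to the management server
) -> List[str]:
    """Return the sequence of steps taken, ending in 'success' or 'handed_over'."""
    trace = ["S401", "S402"]
    failed = [n for n in nodes if not send_command(n)]
    trace.append("S403")
    if not failed:
        return trace + ["success"]
    # S404: retry only the resource managers that failed.
    trace.append("S404")
    still_failed = [n for n in failed if not async_retry(n)]
    trace += ["S405", "S406"]
    if not still_failed:
        return trace + ["success"]
    # S407: the retry window expired, so control moves to the management server.
    trace.append("S407")
    hand_over(still_failed)
    return trace + ["handed_over"]
```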
[0097] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0098] The distributed transaction processing apparatus provided in the embodiments of this application is described below. The apparatus is based on the same inventive concept as the distributed transaction processing method described above, and it solves the problem in a similar way. Therefore, for the specific limitations of the one or more apparatus embodiments provided below, refer to the limitations of the distributed transaction processing method above. The descriptions of the apparatus and of the method may be read in light of each other, and the shared details are not repeated here.
[0099] In one exemplary embodiment, Figure 5 is a schematic diagram of the structure of a distributed transaction processing device provided in an embodiment of this application. As shown in Figure 5, the device is applied to an application terminal of a distributed database system, the application terminal integrates a transaction manager, and the distributed database system also includes multiple resource managers and a management server. The distributed transaction processing device 50 includes a transaction initiation module 510, a log recording module 520, and a fault handling module 530, wherein: the transaction initiation module 510 is used to initiate and coordinate distributed transactions to the multiple resource managers; the log recording module 520 is used to record the pre-commit status of the distributed transaction to a preset transaction log in response to the successful execution of the distributed transaction during the preparation phase; and the fault handling module 530 is used to execute a multi-level fault handling process if a transaction termination operation for any resource manager fails during the commit phase. The multi-level fault handling process includes: the transaction manager performs a recovery and commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource managers to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
[0100] In an exemplary embodiment, the log recording module 520 is used to record resource manager information for successful preparation phases in a distributed transaction, and performs asynchronous disk flushing of the transaction log through a timed thread independent of the main business thread.
[0101] In an exemplary embodiment, the fault handling module 530 is used to perform local recovery retry on the failed transaction termination operation; if the local recovery retry fails, an asynchronous processing thread is created to continue executing the failed transaction termination operation and return the transaction processing result to the application.
[0102] In an exemplary embodiment, the duration of local recovery retry includes a preset first duration; if there are multiple local recovery retries, the interval between two adjacent local recovery retries is a preset second duration.
[0103] In an exemplary embodiment, the transaction termination operation includes a rollback operation; the fault handling module 530 is used by the management server to coordinate the resource managers to complete the transaction termination operation of the distributed transaction based on the transaction log and operation failure information, including: the management server continuously retrying the transaction termination operation within a configurable timeout period after receiving the operation failure information; if it still fails after the configurable timeout period, a rollback instruction is uniformly issued to all resource managers.
[0104] In one exemplary embodiment, the fault handling module 530 is used to query a preset metadata database by the management server to obtain and coordinate the status of all resource managers.
[0105] In one exemplary embodiment, the fault handling module 530 is used by the transaction manager to create a network connection with the management server and send operation failure information through the network connection.
[0106] In an exemplary embodiment, the fault handling module 530 is configured to, in response to a failure of the transaction manager, have the management server perform recovery coordination operations on distributed transactions that have a pre-committed state recorded but no completed state recorded, based on the transaction log.
[0107] The modules in the aforementioned distributed transaction processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can invoke and execute the operations corresponding to each module.
[0108] In one exemplary embodiment, this application also provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the steps of any of the distributed transaction processing methods described above.
[0109] In one exemplary embodiment, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of any of the distributed transaction processing methods described above.
[0110] In one exemplary embodiment, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the distributed transaction processing methods described in the above embodiments.
[0111] Illustratively, Figure 6 is a schematic diagram of the internal structure of a computer device 600 provided in an embodiment of this application. The computer device 600 can be provided as a server. Referring to Figure 6, the computer device 600 includes a processor 602, which may comprise one or more processors, and memory resources represented by memory 601 for storing instructions executable by the processor 602, such as a computer program. The computer program stored in memory 601 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processor 602 is configured to execute instructions to perform the distributed transaction processing method of any of the above embodiments. The computer device 600 may operate on an operating system stored in memory 601, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
[0112] The computer device 600 may also include a power supply component 603 configured to perform power management of the computer device 600, a wired or wireless network interface 604 configured to connect the computer device 600 to a network, and an input / output (I / O) interface 605. Wireless operation may be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When the computer program is executed by a processor, it implements a distributed transaction processing method. The display unit 607 of the computer device is used to form a visually visible image and may be a display screen, a projection device, or a virtual reality imaging device. The display screen may be an LCD screen or an e-ink display screen. The input device 606 of the computer device may be a touch layer covering the display screen, or buttons, a trackball, or a touchpad located on the computer device casing, or an external keyboard, touchpad, or mouse, etc.
[0113] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0114] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.
[0115] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.
[0116] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.
[0117] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A distributed transaction processing method, characterized in that the method is applied to an application terminal of a distributed database system, wherein the application terminal integrates a transaction manager, and the distributed database system also includes multiple resource managers and a management server; the method includes: initiating and coordinating distributed transactions with the multiple resource managers; in response to the successful execution of the distributed transaction during the preparation phase, recording the pre-commit status of the distributed transaction in a preset transaction log; and if a transaction termination operation for any resource manager fails during the commit phase, executing a multi-level fault handling process, wherein the multi-level fault handling process includes: the transaction manager performing a recovery and commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
2. The method according to claim 1, characterized in that, The step of recording the pre-commit status of the distributed transaction to a preset transaction log includes: The resource manager information for successful preparation phases in the distributed transaction is recorded, and the transaction log is asynchronously flushed to disk via a timed thread independent of the main business thread.
3. The method according to claim 1, characterized in that, the recovery and commit operation performed by the transaction manager includes: performing a local recovery retry of the failed transaction termination operation; if the local recovery retry fails, creating an asynchronous processing thread to continue executing the failed transaction termination operation, and returning the transaction processing result to the application.
4. The method according to claim 3, characterized in that, The duration of the local recovery retry includes a preset first duration; If there are multiple local recovery retries, the interval between two adjacent local recovery retries is a preset second duration.
5. The method according to claim 1, characterized in that, the transaction termination operation includes a rollback operation; the step of the management server coordinating the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information includes: the management server continuously retries the transaction termination operation within a configurable timeout period after receiving the operation failure information; if the operation still fails once the configurable timeout period has elapsed, a rollback command is uniformly sent to all resource managers.
6. The method according to claim 1 or 5, characterized in that, The management server queries a preset metadata database to obtain and coordinate the status of all resource managers.
7. The method according to claim 1, characterized in that, Sending the operation failure information of the failed transaction termination operation to the management server includes: The transaction manager establishes a network connection with the management server and sends the operation failure information through the network connection.
8. The method according to claim 1, characterized in that, The method further includes: In response to a failure of the transaction manager, the management server performs a recovery coordination operation on the distributed transactions that have a pre-commit status recorded but no completion status recorded, based on the transaction log.
9. A distributed transaction processing device, characterized in that the device is applied to an application terminal of a distributed database system, wherein the application terminal integrates a transaction manager, and the distributed database system further includes multiple resource managers and a management server; the device includes: a transaction initiation module, used to initiate and coordinate distributed transactions to the multiple resource managers; a log recording module, used to record the pre-commit status of the distributed transaction to a preset transaction log in response to the successful execution of the distributed transaction during the preparation phase; and a fault handling module, used to execute a multi-level fault handling process if a transaction termination operation for any resource manager fails during the commit phase, wherein the multi-level fault handling process includes: the transaction manager performing a recovery and commit operation; if the recovery fails, the operation failure information of the failed transaction termination operation is sent to the management server, and the management server coordinates the resource manager to complete the transaction termination operation of the distributed transaction based on the transaction log and the operation failure information.
10. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8.