Data processing method and apparatus, terminal device, and storage medium

By building a unified framework of high-availability management module and NPU Remoting communication layer, high availability of large AI models and stability of training system are achieved, solving the problems of resource waste and low efficiency, and realizing fast fault recovery and low-cost redundant architecture.

CN121303365B · Active · Publication Date: 2026-06-12 · GUANGDONG LAB OF ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY (SZ)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGDONG LAB OF ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY (SZ)
Filing Date
2025-12-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, the training and inference processes of large AI models suffer from resource waste, low efficiency, high cost, and difficulty in supporting flexible architectures through underlying communication, thus failing to meet high availability requirements.

Method used

A unified framework is adopted, consisting of a high-availability management module, an inference session-level JIT Checkpoint component, a training iteration-level RedoLog component, and an NPU Remoting communication layer. This framework enables real-time backup and millisecond-level recovery of critical session states in inference scenarios, as well as fine-grained fault recovery within a single iteration in training scenarios, thus constructing an N+1 low-cost redundancy architecture.

Benefits of technology

This reduces fault recovery time from hours to minutes or even seconds, overcoming the shortcomings of traditional periodic checkpoint resource waste and high-cost redundancy, and improving the high availability of AI large models and the stability of training systems.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the field of artificial intelligence large model training and inference technology, and provides a data processing method, apparatus, terminal device, and storage medium. When a working unit receives an inference question input by a user, the working unit performs inference calculation and outputs a result, and sends first key information generated during the logical inference and calculation for the question to a backup unit for backup. If a control unit detects a failure of the working unit, the control unit sends a first control instruction to the backup unit. After receiving the first control instruction, the backup unit performs logical inference and calculation on the current inference question, or on the next inference question input by the user, according to all the backed-up first key information, and outputs a result. This method guarantees the continuity of the inference process and effectively meets the core requirements of high reliability and low interruption for large models in practical applications.

Description

Technical Field

[0001] This application belongs to the field of artificial intelligence large model training and inference technology, and in particular relates to a data processing method, apparatus, terminal device, and storage medium.

Background Technology

[0002] Currently, artificial intelligence (AI) technology is developing rapidly. Large language models and complex machine learning models such as ChatGPT, DeepSeek, and Qwen are widely used in language processing, autonomous driving, smart healthcare, and industrial quality inspection. Cloud service providers are also using large models to optimize the performance of server virtualization instances. However, the complexity of large-scale distributed training of large models over a long period of time and the high requirements for continuity of online inference services pose severe challenges to the reliability of the underlying infrastructure.

[0003] In related technologies, training relies on periodic CKPT mechanisms, inference uses traditional HA or full redundancy modes, and the underlying layer depends on intra-node IPC. However, these technologies suffer from drawbacks such as wasted training resources and low efficiency, poor inference experience or high cost, and difficulty in supporting flexible architectures through underlying communication, which cannot meet the high availability requirements of large AI models.

Summary of the Invention

[0004] This application provides a data processing method, apparatus, terminal device, and storage medium that can ensure the continuity of the inference process and effectively meet the core requirements of high reliability and low interruption for large models in practical applications.

[0005] In a first aspect, embodiments of this application provide a data processing method, including:

[0006] When the working unit receives the reasoning question input by the user, the working unit performs reasoning calculations and outputs the results, and sends the first key information generated in the process of logical reasoning and calculation of the reasoning question to the backup unit for backup.

[0007] If the control unit detects a fault in the working unit, the control unit sends a first control command to the backup unit;

[0008] After receiving the first control command, the backup unit performs logical reasoning and calculation on the current reasoning problem or the next reasoning problem input by the user based on all the backed-up first key information, and outputs the result.

[0009] In this embodiment, after receiving the inference question input by the user, the working unit executes the complete logical inference and calculation process, generates the final result, and outputs it to the user. The first key information generated during the inference calculation process is also synchronized to the backup unit for storage. The control unit monitors the working status of the working unit in real time, and when a fault is detected, it sends an instruction to the backup unit to take over the inference task of the faulty working unit. Because the backup unit backs up the key data of the working unit in real time during the logical inference process, it can use the complete backup information (i.e., the first key information) to seamlessly take over both the inference tasks left unfinished when the working unit failed and subsequent user-inputted inference tasks. Users do not need to re-enter the data, which ensures the continuity of the inference process and avoids service interruptions due to single-point failures. This effectively meets the core requirements of high reliability and low interruption for large models in practical applications.

[0010] In one possible implementation of the first aspect, the working unit sends the first key information generated during the logical reasoning and computation process of the reasoning problem to the backup unit for backup, including:

[0011] After performing logical reasoning and calculation on the reasoning problem, the working unit stores the first key information generated in the logical reasoning and calculation into the first storage area corresponding to the working unit.

[0012] After detecting a data update in the first storage area, the working unit writes the first key information into the second storage area corresponding to the backup unit through a preset communication method.

[0013] In this embodiment, the working unit synchronizes the generated first key information to the backup unit storage area in real time through a preset communication method. This ensures that the key information is backed up without going through other devices, guaranteeing the timeliness and integrity of the backup information. It also improves the efficiency of the backup unit taking over the task during fault switching by reducing communication latency, thereby enhancing the high availability of the system.

[0014] In one possible implementation of the first aspect, the method further includes:

[0015] The control unit sends multiple heartbeat requests to the working unit;

[0016] If the control unit does not receive a response from the working unit within a preset time after sending a preset number of heartbeat requests, it is determined that the working unit has malfunctioned.

[0017] In this embodiment, the control unit sends a heartbeat request to the working unit and uses "continuous non-response" as the fault judgment criterion, which realizes rapid and accurate detection of working unit faults, avoids misjudgment, and buys time for the subsequent backup unit to take over seamlessly, thus ensuring the high availability of AI large model inference services.

[0018] In one possible implementation of the first aspect, after the control unit sends the first control command to the backup unit, the method further includes:

[0019] The control unit unbinds the access address bound to the working unit and binds the access address to the backup unit.

[0020] In this embodiment, the control unit allows users to always initiate inference requests through a fixed access address by unbinding and rebinding the virtual address, without needing to be aware of the switching between the backend working unit and the backup unit. This achieves seamless continuity of service access after a failure and further enhances the high availability of the AI large model inference service.
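
The unbind-and-rebind step can be pictured with a small lookup table. The sketch below is illustrative only: the `AddressTable` class, its method names, the address `10.0.0.100`, and the unit names are all hypothetical, and a real deployment would rebind a virtual address at the network layer rather than in a Python dictionary.

```python
class AddressTable:
    """Maps a fixed access address to whichever unit currently serves it."""

    def __init__(self):
        self._bindings = {}

    def bind(self, address, unit):
        self._bindings[address] = unit

    def fail_over(self, address, failed_unit, backup_unit):
        # Unbind from the failed unit, then bind the same address to the backup.
        if self._bindings.get(address) == failed_unit:
            del self._bindings[address]
            self._bindings[address] = backup_unit

    def resolve(self, address):
        return self._bindings.get(address)


table = AddressTable()
table.bind("10.0.0.100", "work-unit")          # hypothetical address and names
table.fail_over("10.0.0.100", "work-unit", "backup-unit")
```

Because callers only ever look up the fixed address, they keep using the same entry point before and after the switch.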

[0021] In one possible implementation of the first aspect, the method further includes:

[0022] If the control unit detects a fault in the working unit during forward or backward propagation when the working unit is performing iterative training of the model parameters, it sends a second control command to the working unit.

[0023] After receiving the second control command, the working unit terminates the current iteration training process and restores to the initial state before the current iteration.

[0024] In this embodiment, the working unit terminates the current iteration and restores to the initial state of the current iteration when a fault occurs, quickly avoiding parameter confusion and state inconsistency caused by forward / backward propagation faults. This provides support for restarting subsequent iterations, ensuring the accuracy and continuity of the training process, and improving the reliability of the model training system.
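
As a rough illustration of restoring the state held before the current iteration, the sketch below snapshots the parameters when an iteration begins and restores them when the iteration is aborted. The `IterationGuard` class, its method names, and the use of a deep copy as the snapshot are assumptions made for this example, not details taken from the application.

```python
import copy


class IterationGuard:
    """Keeps a snapshot of the parameters taken at the start of an iteration."""

    def __init__(self, params):
        self.params = params
        self._snapshot = None

    def begin_iteration(self):
        # Remember the initial state of the current iteration.
        self._snapshot = copy.deepcopy(self.params)

    def abort_iteration(self):
        # Second control command: discard partial work, restore the snapshot.
        self.params = copy.deepcopy(self._snapshot)


guard = IterationGuard({"w": [1.0, 2.0]})
guard.begin_iteration()
guard.params["w"][0] = 99.0   # partial result from a faulty forward/backward pass
guard.abort_iteration()
```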

[0025] In one possible implementation of the first aspect, the method further includes:

[0026] The control unit acquires the first log information; wherein, the first log information is the training log generated by the working unit during the iterative training process;

[0027] The control unit filters out the log information corresponding to the time when the working unit malfunctions from the first log information to obtain the second log information;

[0028] The control unit sends the second log information to the working unit, and the working unit re-executes according to the second log information so that it is restored to the second state corresponding to the moment before the fault occurred.

[0029] The working unit continues iterative training of parameters in the second state.

[0030] In this embodiment, the control unit obtains the real-time training logs of the working unit, filters out invalid logs at the time of the fault, and feeds them back to the working unit, enabling it to restore to the state before the fault based on the valid logs and continue iterative optimization. This not only ensures the continuity of model training and the consistency of parameters, but also avoids the interference of fault logs on the training process, thereby improving the stability and fault tolerance of the model training system.
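
The filter-then-replay idea can be sketched as follows. This is a simplified model under stated assumptions: the `LogEntry` fields, the rule that entries stamped at or after the fault time are treated as invalid, and the single scalar "parameter" are all hypothetical and chosen only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    timestamp: float   # when the operation was logged (hypothetical field)
    delta: float       # change applied to a single scalar parameter


def filter_log(first_log, fault_time):
    """Drop entries logged at or after the fault; the rest is the second log."""
    return [e for e in first_log if e.timestamp < fault_time]


def replay(start_value, log):
    """Re-execute the surviving entries in order to rebuild the pre-fault state."""
    value = start_value
    for entry in log:
        value += entry.delta
    return value


first_log = [LogEntry(1.0, 0.5), LogEntry(2.0, 0.25), LogEntry(3.0, 8.0)]
second_log = filter_log(first_log, fault_time=3.0)
restored = replay(1.0, second_log)   # 1.0 + 0.5 + 0.25
```

The entry written at the moment of the fault never reaches the replay step, so it cannot corrupt the restored state.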

[0031] In one possible implementation of the first aspect, the method further includes:

[0032] If the control unit detects a fault in the working unit during parameter update while the working unit is performing iterative training of the model parameters, it sends a third control command to the working unit.

[0033] After receiving the third control command, the working unit updates the parameters corresponding to the current moment to the theoretical parameters calculated in this iteration, and updates the current state to the third state; where the third state is the state at the start of the next iteration corresponding to this iteration.

[0034] The working unit continues iterative training of the model parameters in the third state.

[0035] In the embodiments of this application, when a working unit fails, it directly updates the parameters to the theoretical values of the current iteration and switches to the starting state of the next iteration. This quickly avoids state chaos during the parameter update phase, allowing for seamless continuation of training without complex rollbacks. This ensures the continuity of model training and parameter consistency, and improves the fault tolerance efficiency and reliability of the training system.
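
A minimal sketch of this roll-forward behaviour is shown below. It assumes that the theoretical parameters were fully computed before the update phase began and that plain SGD is the update rule; the function names and the dictionary layout of the state are hypothetical.

```python
def sgd_target(params, grads, lr):
    """Parameters this iteration should end with (plain SGD, assumed here)."""
    return [p - lr * g for p, g in zip(params, grads)]


def roll_forward(state, target):
    """Third control command: adopt the computed target, move to the next iteration."""
    state["params"] = list(target)
    state["iteration"] += 1
    return state


state = {"iteration": 7, "params": [1.0, 2.0]}
target = sgd_target(state["params"], grads=[0.5, 0.5], lr=0.5)
state["params"][0] = target[0]        # fault strikes after a partial update
state = roll_forward(state, target)
```

Overwriting every parameter with the target makes the result independent of how far the interrupted update had progressed.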

[0036] In a second aspect, embodiments of this application provide a data processing apparatus, including:

[0037] The reasoning information backup module is used to back up the first key information generated during the logical reasoning and calculation process of the reasoning problem when the working unit receives the reasoning problem input by the user.

[0038] The backup unit switching module is used to send a first control command to the backup unit if the control unit detects a fault in the working unit.

[0039] The backup unit reasoning module is used by the backup unit to perform logical reasoning and calculation on the current reasoning problem or the next reasoning problem input by the user based on all the backed-up first key information after receiving the first control command, and output the results.

[0040] In a third aspect, embodiments of this application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the data processing method described in any implementation of the first aspect above.

[0041] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data processing method described in any implementation of the first aspect above.

[0042] In a fifth aspect, embodiments of this application provide a computer program product that, when run on a terminal device, causes the terminal device to execute the data processing method described in any implementation of the first aspect above.

[0043] It is understood that the beneficial effects of the second to fifth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.

Attached Figure Description

[0044] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0045] Figure 1 is a schematic flowchart of the data processing method provided in an embodiment of this application;

[0046] Figure 2 is a schematic diagram of the inference information backup process provided in the embodiments of this application;

[0047] Figure 3 is a schematic diagram of the process for determining a working unit fault provided in an embodiment of this application;

[0048] Figure 4 is a first schematic diagram of the training of the working unit provided in the embodiments of this application;

[0049] Figure 5 is a second schematic flowchart of the training process provided in the embodiments of this application;

[0050] Figure 6 is a third schematic flowchart of the training process provided in the embodiments of this application;

[0051] Figure 7 is a schematic diagram of the system architecture of the data processing method provided in the embodiments of this application;

[0052] Figure 8 is a schematic diagram of the inference session recovery process provided in an embodiment of this application;

[0053] Figure 9 is a schematic diagram of the training iteration recovery process provided in the embodiments of this application;

[0054] Figure 10 is a structural block diagram of the data processing apparatus provided in the embodiments of this application;

[0055] Figure 11 is a schematic diagram of the structure of the terminal device provided in the embodiments of this application.

Detailed Implementation

[0056] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0057] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or a collection thereof.

[0058] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0059] As used in this application specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if detected [the described condition or event]" may be interpreted, depending on the context, as meaning "once determined," "in response to determination," "once detected [the described condition or event]," or "in response to detection [the described condition or event]."

[0060] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0061] References to "one embodiment" or "some embodiments" in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized.

[0062] Currently, artificial intelligence (AI) technology is developing rapidly. Large language models and complex machine learning models such as ChatGPT, DeepSeek, and Qwen are widely used in language processing, autonomous driving, smart healthcare, and industrial quality inspection. Cloud service providers are also using large models to optimize the performance of server virtualization instances. However, the complexity of large-scale distributed training of large models over a long period of time and the high requirements for continuity of online inference services pose severe challenges to the reliability of the underlying infrastructure.

[0063] In related technologies, training relies on periodic CKPT mechanisms, inference uses traditional HA or full redundancy modes, and the underlying layer depends on intra-node IPC. However, these technologies suffer from drawbacks such as wasted training resources and low efficiency, poor inference experience or high cost, and difficulty in supporting flexible architectures through underlying communication, which cannot meet the high availability requirements of large AI models.

[0064] To address the aforementioned technical issues, this application provides a data processing method. By constructing a unified framework comprising a high-availability management module, an inference session-level JIT Checkpoint component, a training iteration-level RedoLog component, and an NPU Remoting communication layer, this method achieves real-time backup and millisecond-level recovery of session critical states (KV Cache) in inference scenarios and fine-grained fault recovery within a single iteration in training scenarios under an N+1 low-cost redundancy architecture. This reduces the fault recovery time (RTO) from several hours to minutes or even seconds, overcoming the shortcomings of traditional periodic checkpoint resource waste and high 1+1 primary / backup costs. It can be widely applied in cloud AI platforms, large-model inference services, large-scale model training, and edge computing.

[0065] See Figure 1, which is a schematic flowchart of a data processing method provided in an embodiment of this application. By way of example and not limitation, the method may include the following steps:

[0066] S101, when the working unit receives the reasoning question input by the user, the working unit performs reasoning calculation and outputs the result, and sends the first key information generated in the logical reasoning and calculation process of the reasoning question to the backup unit for backup.

[0067] In this embodiment of the application, when a user initiates a reasoning request (such as asking a question to an AI chatbot or generating code using AI tools), the "working unit" (a computing node equipped with a large model, including acceleration hardware such as NPU / GPU) responsible for processing the request will first perform reasoning calculations: based on the user's input, it calls the algorithm logic of the large model (such as attention mechanism and decoding process), generates the final response result, and returns it to the user.

[0068] Meanwhile, during inference computation, the working unit generates core data that maintains "conversation continuity," namely "first key information" (specifically, a KV cache, or attention key-value pair cache). This information records the contextual logic of the current dialogue (such as semantic association data of the user's previous questions and the AI's historical responses), which is crucial for the AI to "remember" the context and generate coherent responses when the user continues to ask questions. The working unit sends this key information to the "backup unit" (a computing node in hot standby mode) in real time to ensure that the key information stored in the backup unit is completely consistent with that of the working unit, avoiding the loss of session state due to working unit failure.

[0069] In one embodiment, see Figure 2, which is a schematic diagram of the inference information backup process provided in the embodiments of this application. As shown in Figure 2, step S101 includes:

[0070] S201, after the working unit performs logical reasoning and calculation on the reasoning problem, it stores the first key information generated in the logical reasoning and calculation into the first storage area corresponding to the working unit.

[0071] In this embodiment, after the working unit receives the inference question and confirms the establishment of a new session, it first creates the "first storage area" to ensure that key information has a dedicated, high-speed storage medium. The high-bandwidth memory (HBM) of the NPU in the working unit is preferentially used as the "first storage area." This is because HBM is a dedicated high-speed memory for the NPU, with read / write speeds reaching hundreds of GB / s (far exceeding traditional DDR memory or disks), which can match the "millisecond-level access" requirements for context data during large model inference. Using low-speed storage would increase the latency of reading the KV cache during subsequent rounds of inference, directly affecting the user experience (e.g., slower AI responses).

[0072] The core of large-scale model inference is "attention computation." The model first "encodes" the input text (converting characters into vectors), and then calculates the semantic relationships between each word and other words through an attention layer. During this process, "attention keys" and "attention values" are generated. The "JIT Checkpoint component" deployed in the working unit monitors changes in memory data during NPU inference in real time. When it detects the output of a Key and Value from the attention layer, it immediately marks this data as "first key information" and records its data dimensions (e.g., "BatchSize=1, Head=12, Length=50") and data type (e.g., FP16 half-precision floating-point). This ensures accurate reconstruction of the context logic during subsequent storage and retrieval, avoiding inference anomalies caused by incorrect data format. After capturing the "first key information," the working unit writes it to the "first storage area" according to preset rules, ensuring a clear data structure and facilitating fast subsequent retrieval.
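
To illustrate why recording dimensions and data type matters, the sketch below keeps a small metadata record next to each captured tensor and checks a buffer against it before use. The `KVMeta` class and `validate` helper are hypothetical, and the `head_dim` value of 64 is an assumption: the text gives only batch size, head count, and length.

```python
from dataclasses import dataclass

DTYPE_BYTES = {"fp16": 2, "fp32": 4}


@dataclass(frozen=True)
class KVMeta:
    session_id: str
    batch_size: int
    heads: int
    length: int
    head_dim: int    # not given in the text; assumed for the example
    dtype: str

    def expected_bytes(self):
        """Size of one tensor (Key or Value) with this shape and dtype."""
        elements = self.batch_size * self.heads * self.length * self.head_dim
        return elements * DTYPE_BYTES[self.dtype]


def validate(meta, payload):
    """Reject a buffer whose size does not match the recorded shape and dtype."""
    return len(payload) == meta.expected_bytes()


meta = KVMeta("session-1", batch_size=1, heads=12, length=50, head_dim=64, dtype="fp16")
key_bytes = bytes(meta.expected_bytes())   # zero-filled stand-in for a Key tensor
```

A buffer that is even one byte short fails the check, which is the kind of format mismatch the recorded metadata is meant to catch.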

[0073] S202, after the working unit detects that the data in the first storage area has been updated, it writes the first key information into the second storage area corresponding to the backup unit through a preset communication method.

[0074] In this embodiment, the "JIT Checkpoint component" deployed in the working unit is responsible for detection. This component monitors changes in memory address data in the "first storage area" in real time (such as writing new KV Cache data or modifying existing data) through the "memory change monitoring interface" provided by the NPU hardware, without needing to poll the CPU (to avoid consuming computing resources). After detecting an update, the component quickly records the "update type" (full update / incremental update), the "update data range" (such as the memory address range and data length of the newly added data), and the "corresponding session ID," providing accurate data source information for subsequent synchronization.
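
The three pieces of information recorded per update can be modelled as a small record. The sketch below does not model the NPU's memory change monitoring interface at all; it only shows one possible shape for the record, and the rule used to classify an update (first write is full, appended data is incremental) is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    session_id: str
    update_type: str   # "full" or "incremental"
    offset: int        # start of the changed range within the storage area
    length: int        # number of changed bytes


def describe_update(session_id, old_size, new_size):
    """Classify a write: the first write is full, appended data is incremental."""
    if old_size == 0:
        return UpdateRecord(session_id, "full", 0, new_size)
    return UpdateRecord(session_id, "incremental", old_size, new_size - old_size)


first = describe_update("session-1", old_size=0, new_size=4096)
later = describe_update("session-1", old_size=4096, new_size=4608)
```

With the range recorded, a later synchronization step only needs to move the 512 newly appended bytes rather than the whole area.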

[0075] Specifically, the communication method between the working unit and the backup unit is pre-configured as the "NPU Remoting Communication Layer," which is based on the RDMA (Remote Direct Memory Access) / RoCEv2 protocol to achieve direct data transfer between NPU memories across nodes, avoiding the performance bottleneck of traditional communication. The working unit's "synchronization module" sends a "data synchronization request" to the backup unit according to preset communication parameters, including metadata such as "session ID," "transfer data type (incremental / full)," and "data length." After receiving the request, the backup unit responds with "ready" through the response interface of the "NPU Remoting Communication Layer," confirming that its "second storage area" is ready to receive data.
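
The request-and-ready exchange can be sketched without any networking. In the example below the `SyncRequest` fields follow the metadata listed above, while the `BackupUnit` class, its capacity check, and the "rejected" reply are hypothetical additions; the text only describes the "ready" response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncRequest:
    session_id: str
    transfer_type: str   # "incremental" or "full"
    data_length: int     # bytes about to be written


class BackupUnit:
    def __init__(self, capacity):
        self.capacity = capacity   # size of the second storage area, in bytes
        self.used = 0

    def handle_sync_request(self, req):
        """Answer "ready" only if the second storage area can hold the data."""
        if req.transfer_type not in ("incremental", "full"):
            return "rejected"
        if self.used + req.data_length > self.capacity:
            return "rejected"
        return "ready"


backup = BackupUnit(capacity=1024)
reply = backup.handle_sync_request(SyncRequest("session-1", "incremental", 512))
```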

[0076] The working unit's NPU directly writes synchronization data to the memory address corresponding to the "second storage area" of the backup unit's NPU via the RDMA protocol. The data does not need to pass through the CPUs and operating systems of both sides. The entire process is completed collaboratively by the NPU and the high-speed network card (supporting the RoCEv2 protocol). The transmission bandwidth can match the NPU's memory bandwidth (such as hundreds of GB / s), ensuring that even full synchronization (such as tens of MB of data) can be completed within milliseconds.
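
The millisecond-level claim can be checked with simple arithmetic: ideal transfer time is data size divided by bandwidth. The helper below is a back-of-the-envelope sketch; the 50 MB and 100 GB / s figures are example values within the ranges mentioned above, and the calculation ignores latency and protocol overhead, so real transfers would take somewhat longer.

```python
def transfer_time_ms(size_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return size_bytes / bandwidth_bytes_per_s * 1000.0


# 50 MB over a 100 GB/s path: roughly half a millisecond.
t_ms = transfer_time_ms(50e6, 100e9)
```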

[0077] In the above method, the working unit synchronizes the generated first key information to the backup unit storage area in real time through a preset communication method. This ensures that the key information backup does not need to go through other devices, thus guaranteeing the timeliness and integrity of the backup information. It also improves the efficiency of the backup unit taking over the task during fault switching by reducing communication latency, thereby enhancing the high availability of the system.

[0078] S102, if the control unit detects a fault in the working unit, the control unit sends a first control command to the backup unit.

[0079] In this embodiment, the control unit is the high-availability management module provided in this application. This module continuously monitors the running status of all "working units" (computing nodes that execute inference / training tasks). Once it confirms, through heartbeat detection, health checks, or other methods, that a working unit has failed (for example, hardware damage or a software crash that leaves it unable to continue processing tasks), it immediately sends a "first control command" to the "backup unit" in hot standby mode. This command tells the backup unit which working unit has failed, which inference sessions need to be taken over, and where the corresponding key data (such as the KV Cache) is stored in the backup unit, allowing the backup unit to quickly prepare for seamless task takeover and avoid service interruption.

[0080] In one embodiment, see Figure 3, which is a schematic diagram of the process for determining a working unit fault provided in an embodiment of this application. As shown in Figure 3, the process includes:

[0081] S301, the control unit sends multiple heartbeat requests to the working unit.

[0082] In this embodiment, the control unit (i.e., the high availability management module) needs to continuously confirm whether the "working units" (computing nodes that execute inference / training tasks) are operating normally. The core method is to send heartbeat requests multiple times. For example, the control unit sends lightweight "heartbeat request" messages to each working unit at preset time intervals. After receiving the request, the working unit needs to promptly return a "heartbeat response" containing its own health status (such as NPU running status, task progress, and critical data integrity). Through the "multiple send-receive response" loop mechanism, the control unit can track the status of the working units in real time, avoiding misjudgments caused by lost single requests or network fluctuations, and ensuring that faults are accurately and promptly detected.

[0083] For example, the interval is set according to the business's requirements for fault detection sensitivity. Inference scenarios (requiring rapid response) are typically set at 100-500 milliseconds per attempt, while training scenarios (with long task cycles) can be set at 1-5 seconds per attempt. Too short an interval will increase network and computing resource consumption, while too long an interval may delay fault detection. A lightweight format is adopted, containing only three types of core information: "Request ID" (uniquely identifies each heartbeat to avoid duplicate processing), "Target Work Unit ID" (clearly identifies the recipient, such as "WorkUnit_08"), and "Request Timestamp" (facilitates synchronization between the work unit and control unit to determine if the request has timed out). The overall data volume is controlled within tens of bytes to avoid consuming network bandwidth.
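
One way to keep a heartbeat request within tens of bytes is a fixed binary layout. The sketch below packs the three fields with Python's `struct` module; the field widths (8-byte request ID, 16-byte unit ID, 8-byte timestamp) and the function names are assumptions for illustration, not a wire format defined by the application.

```python
import struct
import time

# Network byte order: 8-byte request ID, 16-byte unit ID, 8-byte timestamp.
HEARTBEAT_FORMAT = "!Q16sd"


def pack_heartbeat(request_id, unit_id, timestamp):
    """Serialize one heartbeat request into a fixed 32-byte message."""
    return struct.pack(HEARTBEAT_FORMAT, request_id, unit_id.encode("ascii"), timestamp)


def unpack_heartbeat(data):
    """Recover the three fields, stripping the padding from the unit ID."""
    request_id, raw_unit, timestamp = struct.unpack(HEARTBEAT_FORMAT, data)
    return request_id, raw_unit.rstrip(b"\x00").decode("ascii"), timestamp


message = pack_heartbeat(1, "WorkUnit_08", time.time())
```

With this layout every request is exactly 32 bytes, regardless of the unit name, as long as the name fits in 16 ASCII characters.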

[0084] If no response is received for a single request, a preset number of retries (usually 3 times) and retry interval (e.g., 50 milliseconds for the first retry, doubling for each subsequent retry) are set to prevent misjudgment of faults due to temporary network jitter.
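
The doubling retry schedule described here works out as follows. The function name and its defaults are hypothetical; the values 50 milliseconds and 3 retries are the examples given in the text.

```python
def retry_intervals_ms(first_ms=50, retries=3):
    """Wait before each retry: the first value, then doubling every time."""
    return [first_ms * 2 ** i for i in range(retries)]


intervals = retry_intervals_ms()   # 50 ms, then 100 ms, then 200 ms
```

Under these example values, the retries for one unanswered request add at most 350 milliseconds of waiting before the request is treated as unanswered.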

[0085] S302, if the control unit does not receive a response from the working unit within a preset time after sending a preset number of heartbeat requests, it is determined that the working unit has malfunctioned.

[0086] In this embodiment, the control unit (high availability management module) sends heartbeat requests to the working unit (the computing node that performs inference / training tasks) at fixed intervals to confirm whether the working unit is operating normally. The system presets two key conditions: one is a "preset number of times" (e.g., sending 3 times consecutively), and the other is a "preset time" (e.g., waiting 100 milliseconds after each request).

[0087] After the control unit sends the preset number of heartbeat requests, if it does not receive a response from the working unit within the corresponding preset time for each request (indicating that the working unit may be unable to communicate or operate normally due to hardware damage, software crash, etc.), the control unit will then formally determine that the working unit has failed and initiate subsequent fault handling procedures (such as instructing a backup unit to take over the task). This "continuous multiple confirmations" logic avoids misjudgments caused by accidental factors such as single network fluctuations or temporary lag, ensuring the accuracy of fault diagnosis.

[0088] In the above method, the control unit sends a heartbeat request to the working unit and uses "continuous no response" as the fault judgment criterion, which realizes rapid and accurate detection of working unit faults, avoids misjudgment, and buys time for the subsequent backup unit to take over seamlessly, thus ensuring the high availability of AI large model inference services.
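
The "continuous no response" criterion of S302 can be sketched as a small function. This is a sketch under stated assumptions: `send_heartbeat` is a hypothetical transport hook (not an API named in the source) that returns `True` when a response arrives within the timeout; the defaults of 3 requests and 100 milliseconds are the example values from the text.

```python
from typing import Callable


def working_unit_has_failed(
    send_heartbeat: Callable[[str, float], bool],
    unit_id: str,
    preset_count: int = 3,
    preset_timeout_s: float = 0.1,
) -> bool:
    """Return True only if all preset_count heartbeat requests go unanswered.

    send_heartbeat(unit_id, timeout_s) is an assumed transport hook that returns
    True when the working unit responds within timeout_s seconds.
    """
    for _ in range(preset_count):
        if send_heartbeat(unit_id, preset_timeout_s):
            return False  # one timely response is enough to treat the unit as alive
    return True  # every request timed out: declare a fault
```

Because a single timely response clears the unit, one dropped packet or a brief network stall does not trigger a failover.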

[0089] In one embodiment, after the control unit sends a first control command to the backup unit, the method further includes:

[0090] The control unit unbinds the access address bound to the working unit and binds the access address to the standby unit.

[0091] In this embodiment of the application, in the N+1 high-availability architecture for large AI models, each working unit is bound to a unique "virtual IP" (the network address through which users or clients access the working unit, such as 192.168.1.10), and the client sends inference requests to the working unit through this virtual IP.

[0092] When the control unit (high availability management module) confirms a failure in a working unit, it first performs an "unbinding operation," removing the virtual IP originally bound to the failed working unit from that unit, preventing it from receiving client requests. Then, it performs a "binding operation," reassigning the unbound virtual IP to the standby unit. The core of this process is to ensure that the client is unaware of the address change, continuing to send requests through the original virtual IP. Actual requests are automatically routed to the standby unit, achieving a seamless failover and ensuring uninterrupted inference sessions.

[0093] In the above method, by unbinding and rebinding the virtual address, the control unit allows users to always initiate inference requests through a fixed access address, without needing to be aware of the switch between the backend working unit and the backup unit. This achieves seamless continuity of service access after a failure and further enhances the high availability of the large AI model inference service.
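
The unbind-then-bind sequence can be illustrated with an in-memory address table. This is only a model of the bookkeeping: the class `AddressTable` and its method names are hypothetical, and a real deployment would perform the rebinding through network-layer configuration rather than a Python dictionary.

```python
class AddressTable:
    """Hypothetical table mapping each access address (virtual IP) to its serving unit."""

    def __init__(self) -> None:
        self._owner = {}  # virtual IP -> unit ID

    def bind(self, virtual_ip: str, unit_id: str) -> None:
        self._owner[virtual_ip] = unit_id

    def fail_over(self, failed_unit: str, standby_unit: str) -> list:
        """Unbind every address held by failed_unit and rebind it to standby_unit."""
        moved = [ip for ip, unit in self._owner.items() if unit == failed_unit]
        for ip in moved:
            del self._owner[ip]             # unbinding operation
            self._owner[ip] = standby_unit  # binding operation
        return moved

    def route(self, virtual_ip: str) -> str:
        """Return the unit that currently receives requests sent to virtual_ip."""
        return self._owner[virtual_ip]
```

The client keeps using the same address before and after `fail_over`; only the table entry behind that address changes.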

[0094] S103, after receiving the first control command, the backup unit performs logical reasoning and calculation on the current reasoning problem or the next reasoning problem input by the user based on all the backed-up first key information, and outputs the result.

[0095] In this embodiment of the application, when the standby unit receives the "first control command" (i.e., the faulty working unit takeover command) sent by the control unit, it will immediately switch from the "standby state" to the "business processing state".

[0096] Because the backup unit's second storage area has already backed up the "first key information" (i.e., KV Cache, recording attention key-value pairs of the session context) of all inference sessions of the failed working unit in real time, there is no need to reload the data. Based on this backed-up key information, the system can directly perform the same logical reasoning and calculation process as the original working unit on the user's current inference question (such as a request that was not completed before the failure) or the next inference question initiated later (such as a follow-up question from the user), ultimately generating a coherent response result and outputting it to the user. The entire process is completely imperceptible to the user, as if the service has never been interrupted.

[0097] In the above method, after receiving the inference problem input by the user, the working unit executes the complete logical inference and calculation process, generates the final result, and outputs it to the user. The first key information generated during the inference calculation process is also synchronized to the backup unit for storage. The control unit monitors the working status of the working unit in real time, and when a failure is detected, it sends an instruction to the backup unit to take over the inference task of the failed working unit. Since the backup unit backs up the key data of the working unit during the logical inference process in real time, it can seamlessly take over the inference tasks that were not completed when the current working unit failed, as well as subsequent user-inputted inference tasks, based on the complete backup information (i.e., the first key data). Users do not need to re-enter the information, ensuring the continuity of the inference process and avoiding service interruptions due to single-point failures. This effectively meets the core requirements of high reliability and low interruption for large models in practical applications.
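
The standby unit's transition from the "standby state" to the "business processing state" can be sketched as a small state machine. The names `Mode`, `StandbyUnit`, `receive_backup`, and `infer` are assumptions for illustration, and the returned string is a placeholder for real model decoding, which is outside the scope of this sketch.

```python
from enum import Enum


class Mode(Enum):
    STANDBY = "standby state"
    SERVING = "business processing state"


class StandbyUnit:
    """Hypothetical standby unit that reuses backed-up first key information."""

    def __init__(self) -> None:
        self.mode = Mode.STANDBY
        self.second_storage = {}  # session ID -> list of backed-up entries

    def receive_backup(self, session_id: str, entry: object) -> None:
        """Store one piece of first key information sent by a working unit."""
        self.second_storage.setdefault(session_id, []).append(entry)

    def on_first_control_command(self) -> None:
        """Take over from the failed working unit."""
        self.mode = Mode.SERVING

    def infer(self, session_id: str, question: str) -> str:
        if self.mode is not Mode.SERVING:
            raise RuntimeError("standby unit has not received the first control command")
        context = self.second_storage.get(session_id, [])
        # Placeholder for decoding: report how much backed-up context is reused.
        return f"answer to {question!r} using {len(context)} backed-up entries"
```

Since the context is already in the second storage area when the command arrives, takeover involves no data reload, only the mode switch.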

[0098] In one embodiment, see Figure 4, which is the first schematic flowchart of training on the working units provided in the embodiments of this application. As shown in Figure 4, the method includes:

[0099] S401, when multiple working units are performing iterative training of model parameters, if the control unit detects that any working unit has a fault during forward or backward propagation, it sends a second control command to all the working units.

[0100] In this embodiment of the application, when the working unit is performing "parameter iteration training" of the large AI model (i.e., calculating the prediction results through multiple rounds of forward propagation, updating the model weights through backward propagation, and gradually optimizing the model performance), the control unit (high availability management module) will continuously monitor its training status.

[0101] If a fault is detected in any working unit during the critical "forward propagation" stage (calculating model output from input training data) or "backward propagation" stage (adjusting model parameters based on errors), such as an NPU computational interruption, a gradient calculation anomaly, or a memory overflow, so that the training process cannot continue, the control unit will immediately send a "second control command" to all the working units, including the faulty one. The core purpose is to stop the currently invalid training operation, protect the generated intermediate training data (such as the gradient of the current iteration and model weight snapshots), avoid data corruption or resource waste, and prepare for subsequent fault recovery (such as switching to a backup unit to continue training).

[0102] S402, after receiving the second control command, the plurality of working units terminate the current iteration training process and restore to the initial state before the current iteration.

[0103] In this embodiment, when the working unit (the computing node that performs iterative training of a large AI model) receives the "second control command" (an emergency control command for forward / backward propagation faults) sent by the control unit, it will immediately execute two core operations:

[0104] First, "terminate the current iteration training process": the working unit stops the ongoing training of the current batch (e.g., interrupting unfinished backpropagation gradient updates and stopping NPU computing power scheduling) to avoid escalation of the fault or waste of resources on ineffective computation. Second, "restore to the initial state before this iteration": the working unit rolls back the model parameters, training environment, data state, etc., to the baseline state before the start of this iteration (e.g., the 1000th training batch), for example by loading the model weight snapshot saved before this iteration, resetting the training data pointer, and clearing the temporary gradient data generated in the current iteration. This provides a clean and consistent initial condition for subsequent fault repair (e.g., restarting training or switching to a backup unit to continue training), ensuring that the training process can resume seamlessly.

[0105] In the above method, the working unit terminates the current iteration and restores to the initial state of the current iteration when a fault occurs, which quickly avoids parameter confusion and state inconsistency caused by forward / backward propagation faults. This provides support for restarting subsequent iterations, ensuring the accuracy and continuity of the training process, and improving the reliability of the model training system.
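
The two operations of S402 (terminate, then restore to the pre-iteration state) can be sketched with an in-memory snapshot. The class `TrainingWorker` and its attribute names are hypothetical; the sketch assumes the pre-iteration state fits in memory as a copy of the weights plus a data pointer, which the source describes only in general terms.

```python
import copy


class TrainingWorker:
    """Hypothetical working unit that snapshots its state at the start of each iteration."""

    def __init__(self, weights) -> None:
        self.weights = list(weights)
        self.gradients = []    # temporary gradient data of the current iteration
        self.data_pointer = 0  # position in the training data
        self._snapshot = None

    def begin_iteration(self) -> None:
        """Save the baseline state before the iteration modifies anything."""
        self._snapshot = (copy.deepcopy(self.weights), self.data_pointer)

    def on_second_control_command(self) -> None:
        """Terminate the current iteration and roll back to the saved baseline."""
        if self._snapshot is None:
            raise RuntimeError("no pre-iteration snapshot available")
        weights, pointer = self._snapshot
        self.weights = copy.deepcopy(weights)  # load the saved weight snapshot
        self.data_pointer = pointer            # reset the training data pointer
        self.gradients.clear()                 # discard temporary gradient data
```

Keeping the snapshot in memory is what makes this rollback cheap compared with reloading a full checkpoint from storage.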

[0106] In one embodiment, see Figure 5, which is the second schematic flowchart of the training process provided in the embodiments of this application. As shown in Figure 5, after the working unit is restored to its initial state before the current iteration, the method further includes:

[0107] S501, the control unit acquires the first log information; wherein, the first log information is the training log generated by the multiple working units during the iterative training process.

[0108] In this embodiment of the application, in a distributed training scenario, the present invention achieves rapid fault recovery within a single iteration by recording and replaying lightweight Application Programming Interface (API) call logs, i.e., the first log information, avoiding the huge overhead of rolling back to an entire checkpoint. During each iteration of the training task, the RedoLog component deployed on the Host records all kernel function API calls sent from the Host to the Device (NPU) side, such as kernel launches and memory copies. These records contain only the API name and parameters, so the data volume is very small, forming a lightweight "redo log".

[0109] During the iterative training of parameters for a large AI model, multiple working units (computation nodes executing training tasks) generate "RedoLogs" (i.e., "first log information") in real time. These logs record key details of the entire training process—such as the training progress of each batch of data, the computation results of forward / backward propagation (loss value, gradient norm), model parameter updates, hardware resource usage (NPU computing power, memory usage), and even abnormal prompts during training (such as data format errors, computing power fluctuations). When the control unit detects a failure in a working unit during propagation, the control unit retrieves the log information corresponding to that working unit.

[0110] S502, the control unit filters out the log information corresponding to the time when the working unit fails from the first log information to obtain the second log information.

[0111] In this embodiment, the control unit has already acquired the first log information (including the progress of the entire training process, calculation results, resource usage, and exception prompts) generated during the iterative training of multiple working units. Then, it first identifies the specific time when the working unit malfunctioned (e.g., "15:03:22 during Batch_2001 training"). Next, it filters and removes logs directly related to the time of the malfunction from the complete first log information (e.g., error logs at the time of the malfunction, interrupted calculation progress logs, and hardware status logs at the moment of the malfunction). The remaining normal training logs, unaffected by the malfunction (e.g., normal training records of each batch before the malfunction, parameter update data, etc.), constitute the second log information. Its core purpose is to remove invalid or interfering data related to the malfunction and retain the valid logs of the training process, facilitating subsequent recovery of training based on normal data or analysis of training effects.

[0112] S503, the control unit sends the second log information to the working unit so that the working unit can re-execute according to the second log information, so that the working unit can be restored to the second state corresponding to the moment before the fault occurred.

[0113] In this embodiment, the control unit is responsible for coordinating the state consistency of all NPU computing cards (i.e., the multiple working units). The second log information specifically records the API calls (such as forward propagation calculation, parameter update, and data reading in model training) that were successfully completed on all computing cards before the fault occurred, and the records strictly follow the order in which the operations were executed.

[0114] When recovery is needed, the management module (i.e., the control unit) reads this second log information and re-executes (i.e., "replays") these successfully completed API calls one by one on all compute cards in the order recorded in the log (in essence, it sends the second log information to each working unit so that each working unit re-executes the training operations completed before the failure). After all replays are completed, the state of all compute cards (including model parameters, computation progress, data cache, etc.) is restored to the consistent state immediately before the failure occurred (i.e., the second state). The training task can then continue from the point at which it was interrupted, without inconsistent state or lost progress.

[0115] S504, the working unit continues iterative training of parameters in the second state.

[0116] In this embodiment, the working unit has completed fault recovery through the second log information, accurately restoring to the second state at the moment before the fault occurred (this state includes core data such as model weights, training batch progress, hyperparameter configuration and training environment parameters of the last effective iteration before the fault).

[0117] Based on this, the working unit does not need to re-execute the effective training steps before the failure. Instead, it directly starts from the second state and continues to advance the iterative training process of the model parameters. It executes the forward propagation calculation, backward propagation gradient solution, and parameter update operation based on the optimizer for subsequent batches of data in sequence according to the preset training plan, ensuring the continuity of the training process and the coherent optimization of model performance.

[0118] In the above method, the control unit obtains the real-time training logs of the working unit, filters out invalid logs at the time of the fault, and feeds them back to the working unit, so that it can restore the state before the fault based on the valid logs and continue iterative optimization. This not only ensures the continuity of model training and the consistency of parameters, but also avoids the interference of fault logs on the training process, thereby improving the stability and fault tolerance of the model training system.
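
Steps S501 to S503 (acquire the first log, filter it into the second log, replay it) can be sketched as follows. This is a toy model under explicit assumptions: `ApiCall`, `build_second_log`, `replay`, and the `completed` flag are hypothetical names, and the handlers operate on a plain dictionary rather than on NPU state. The sketch models "log information corresponding to the time of the fault" as entries whose call had not completed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ApiCall:
    """Hypothetical RedoLog entry: API name and parameters only."""
    seq: int         # execution order within the iteration
    name: str        # API name
    args: tuple      # API parameters
    completed: bool  # False if the call was in flight when the fault occurred


def build_second_log(first_log: List[ApiCall]) -> List[ApiCall]:
    """Drop entries tied to the fault and keep the rest in execution order."""
    return [call for call in sorted(first_log, key=lambda c: c.seq) if call.completed]


def replay(second_log: List[ApiCall], state: dict,
           handlers: Dict[str, Callable]) -> dict:
    """Re-execute each logged call, in order, against the iteration-start state."""
    for call in second_log:
        handlers[call.name](state, *call.args)
    return state
```

Replaying must start from the state at the beginning of the iteration; replaying on top of a partially modified state would apply the same calls twice.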

[0119] In one embodiment, see Figure 6, which is the third schematic flowchart of the training process provided in the embodiments of this application. As shown in Figure 6, the method includes:

[0120] S601: When multiple working units are performing iterative training of model parameters, if the control unit detects a fault in any working unit during parameter update, it sends a third control command to all working units.

[0121] In this embodiment of the application, when the working unit is in the critical stage of parameter update during the iterative training of AI large model parameters (i.e., the core step of adjusting model weights through the optimizer based on the gradient data calculated by backpropagation to achieve model performance optimization), the control unit (high availability management module) will continuously monitor the execution status of this stage.

[0122] If a fault is detected in any working unit during parameter update (such as an optimizer calculation error, a weight writing failure, or an NPU computing power interruption that breaks off the update), so that the parameter update for the current round cannot be completed, the control unit will immediately send a "third control command" to all the working units, including the faulty one. Its core purpose is to have the working units urgently terminate the abnormal parameter update operation, protecting the valid model parameters and gradient data from before the fault, preventing invalid updates from contaminating model parameters, and laying the foundation for subsequent training recovery based on complete data.

[0123] S602, after receiving the third control command, the plurality of working units update the parameters corresponding to the current time to the theoretical parameters calculated in this iteration, and update the current state to the third state; wherein, the third state is the state at the start of the next iteration corresponding to this iteration.

[0124] In this embodiment, when multiple parallel working units performing iterative training of model parameters (such as multiple computing nodes in distributed training) all receive the "third control instruction" (an emergency control instruction for faults in the parameter update phase) sent by the control unit, they will simultaneously execute two key operations:

[0125] First, "parameter calibration" – updating the currently incomplete / abnormal model parameters of each working unit to "theoretical parameters calculated in this iteration" (i.e., the standard model parameters that should have been completed, calculated by the optimizer based on the effective gradient data of this iteration, rather than the partially updated parameters or abnormal parameters when a fault occurred); second, "state switching" – synchronously updating the training state of all working units to "third state," which is the standard initial state that should be available when the next iteration (next batch of training) starts after the current iteration has been completed normally (including the calibrated complete parameters, the updated training progress indicator, the unified optimizer state, etc.), ensuring that the states of multiple working units are consistent, clearing obstacles for subsequent parallel training.

[0126] The core objective is to address the issue of inconsistent parameters across multiple working units caused by faults. By uniformly calibrating to the "theoretical parameters," all working units can skip the abnormal state of the faulty stage and synchronously enter the standard starting point of the next iteration, thus avoiding the impact of "state fragmentation" on the overall model performance during distributed training.

[0127] S603, the working unit continues iterative training of the model parameters in the third state.

[0128] In this embodiment, the working unit has completed the operation corresponding to the third control command, successfully calibrating the model parameters to the theoretical parameters of this iteration, and the training state of all relevant working units has been synchronously switched to the third state - which is the standard initial state required for the next iteration to start after the current iteration is completed normally, including complete and consistent model parameters, synchronously updated iteration progress indicators, uniformly configured optimizer state and training hyperparameters.

[0129] Based on this, the working unit does not need to backtrack or repeat invalid training steps related to faults. It directly starts from the third state and continues to advance the iterative training of model parameters according to the preset training plan. It sequentially executes the forward propagation, backward propagation and parameter update process of subsequent batches of data, ensuring the continuity of distributed training, parameter consistency and coherent optimization of model performance.

[0130] In the above method, when a working unit fails, it directly updates the parameters to the theoretical values of the current iteration and switches to the starting state of the next iteration. This quickly avoids state chaos during the parameter update phase and allows for seamless continuation of training without complex rollbacks, ensuring the continuity of model training and parameter consistency, and improving the fault tolerance efficiency and reliability of the training system.
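
The calibration of S602 can be sketched as follows. The sketch assumes a plain gradient-descent update to compute the "theoretical parameters"; the source does not name an optimizer, so `sgd_step`, `forward_recover`, and the dictionary layout of a worker are all hypothetical.

```python
def sgd_step(params, grads, lr):
    """Theoretical parameters for this iteration, assuming plain gradient descent."""
    return [p - lr * g for p, g in zip(params, grads)]


def forward_recover(workers, theoretical_params, iteration):
    """Set every worker to the theoretical parameters and the next-iteration state.

    Each worker is modelled as a dict with "params" and "iteration" keys.
    """
    for worker in workers:
        worker["params"] = list(theoretical_params)  # parameter calibration
        worker["iteration"] = iteration + 1          # state switching (third state)
    return workers
```

For example, if one worker stopped after updating only its first parameter while another had not started, both end up with identical parameters and the same iteration counter, so no worker needs to undo a partial write.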

[0131] See Figure 7, which is a schematic diagram of the system architecture of the data processing method provided in the embodiments of this application. As shown in Figure 7, the architecture specifically includes:

[0132] The high availability management module (i.e., the control unit) is responsible for heartbeat detection and health status monitoring of all working units and standby units. When a failure is detected in any working unit, this module will immediately make a decision and trigger the corresponding failover process.

[0133] Working unit: A compute node or accelerator card that performs the actual AI inference or training tasks. Each unit deploys an agent for the JIT Checkpoint or RedoLog component.

[0134] Standby Unit: A compute node or accelerator card in hot standby mode. It does not carry business traffic, but its memory is ready to receive state backups from any working unit and can quickly take over the tasks of a failed unit after receiving instructions from the management module.

[0135] NPU Remoting Communication Layer: State synchronization between all units is accomplished through the NPU Remoting communication layer proposed in this invention. This layer implements direct memory access across nodes between NPU / GPU memory via the RDMA / RoCEv2 protocol, bypassing the CPU and operating system kernel. It features ultra-low latency and extremely high bandwidth, ensuring that the state synchronization process has a negligible impact on the performance of the main task.

[0136] See Figure 8, which is a schematic diagram of the inference session recovery process provided in the embodiments of this application. As shown in Figure 8, the process specifically includes:

[0137] Initialization and Session Establishment: The user's inference request is routed to worker node A. The large model on node A begins processing the request and creates and gradually populates the KV Cache in its NPU memory.

[0138] Real-time state backup: The JIT Checkpoint component deployed on node A monitors changes to the KV Cache in its NPU memory in real time. Whenever the model generates a new token that causes the KV Cache to update, the component captures this incremental data and writes it directly to the corresponding location in the NPU memory of the standby node B via RDMA through the NPU Remoting communication layer. This process occurs asynchronously with extremely low latency.

[0139] Fault Detection and Switchover: The high availability management module detects that worker node A has lost response via a heartbeat mechanism. The module immediately determines that it has failed and issues a takeover command to standby node B. Simultaneously, through network layer configuration (such as virtual IP migration), user traffic originally destined for node A is transparently redirected to node B.

[0140] Seamless takeover and service continuity: Since node B's NPU memory already contains a KV Cache copy that is completely identical to node A's before the failure, when the user's next request arrives, the model on node B can directly use this KV Cache to continue decoding and generate a coherent response. The entire switchover process is imperceptible to the user, thus achieving zero-interruption session continuity.
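
The per-token incremental backup can be sketched as follows. This is a simplified model: `Node` and `JITCheckpointSketch` are hypothetical names, NPU memory is represented as a Python dictionary, and the RDMA write is replaced by a direct append. The sketch is also synchronous, whereas the text describes the real transfer as asynchronous.

```python
class Node:
    """Hypothetical compute node; NPU memory is modelled as per-session KV lists."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.npu_memory = {}  # session ID -> list of (key, value) entries


class JITCheckpointSketch:
    """Mirrors each KV Cache increment from a worker node to a standby node."""

    def __init__(self, worker: Node, standby: Node) -> None:
        self.worker = worker
        self.standby = standby

    def on_new_token(self, session_id: str, key: str, value: str) -> None:
        entry = (key, value)
        self.worker.npu_memory.setdefault(session_id, []).append(entry)
        # Stand-in for the RDMA write through the NPU Remoting layer:
        # only the new entry is copied, never the whole cache.
        self.standby.npu_memory.setdefault(session_id, []).append(entry)
```

Because only the newest entry is sent on each token, the backup cost per step stays constant as the session context grows.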

[0141] See Figure 9, which is a schematic diagram of the training iteration recovery process provided in the embodiments of this application. As shown in Figure 9, the process specifically includes:

[0142] API Call Records (RedoLog): During each iteration of the training task, the RedoLog component deployed on the host records all kernel function API calls sent from the host to the device (NPU), such as kernel launches and memory copies. These records contain only the API name and parameters, and the amount of data is extremely small, forming a lightweight "redo log".

[0143] Fault Scenario 1: Error during forward / backward propagation:

[0144] Detection: The high availability management module detected an error occurring on a certain accelerator card during forward or backward propagation calculations.

[0145] Recovery: The management module immediately pauses all accelerator cards participating in training and issues the following instructions: a) All cards (including faulty cards, if resettable) restore their internal state to the state at the start of the current iteration (Iteration i). This state is typically in memory and recovery is extremely fast. b) Based on the recorded RedoLog, the management module re-executes (replays) the API calls that were successfully completed before the failure on all cards in sequence. c) After the replay is complete, the state of all cards is restored to exactly the state at the moment of failure, and training resumes from that point.

[0146] Fault Scenario 2: Error occurs during the gradient descent (parameter update) phase:

[0147] Detection: The failure occurred during the stage when all gradient calculations were completed and the model parameters were being updated.

[0148] Recovery: Since parameter updates are a critical, state-transitional step, a failure after a partial update can lead to inconsistent states that are difficult to roll back. The recovery strategy is as follows: the management module instructs all accelerator cards participating in training to directly update their model parameters to the final result calculated in the current iteration (Iteration i), i.e., set the state to the state at the start of Iteration i+1, and then directly begin the next iteration. This "forward recovery" strategy sacrifices the minimal computation that may occur during a failed iteration, but guarantees eventual state consistency and avoids complex reverse operations.

[0149] Global checkpoints as a fallback: The traditional periodic CKPT scheme based on the AI framework is still retained, but its execution frequency can be significantly reduced (for example, from once every 2 hours to once every 24 hours), serving only as a last resort in the event of multiple simultaneous failures or catastrophic scenarios where the RedoLog cannot be recovered.
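
The choice among the three recovery paths above can be summarized as a small decision function. This is one possible reading of the text rather than logic stated in the source: the names `Phase`, `Strategy`, and `choose_strategy`, and the exact precedence of the conditions, are assumptions.

```python
from enum import Enum


class Phase(Enum):
    FORWARD = "forward propagation"
    BACKWARD = "backward propagation"
    PARAM_UPDATE = "parameter update"


class Strategy(Enum):
    ROLLBACK_AND_REPLAY = "restore iteration start, then replay RedoLog"
    FORWARD_RECOVERY = "apply final parameters, start next iteration"
    GLOBAL_CHECKPOINT = "reload periodic global checkpoint"


def choose_strategy(phase: Phase, redo_log_usable: bool = True,
                    simultaneous_failures: int = 1) -> Strategy:
    """Pick a recovery path from the fault phase (assumed precedence, see lead-in)."""
    if simultaneous_failures > 1:
        return Strategy.GLOBAL_CHECKPOINT      # last resort for multiple failures
    if phase is Phase.PARAM_UPDATE:
        return Strategy.FORWARD_RECOVERY       # Fault Scenario 2
    if not redo_log_usable:
        return Strategy.GLOBAL_CHECKPOINT      # RedoLog cannot be recovered
    return Strategy.ROLLBACK_AND_REPLAY        # Fault Scenario 1
```

In this reading, the global checkpoint is reached only when neither fine-grained path applies, which is what allows its frequency to be reduced.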

[0150] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0151] Corresponding to the data processing method in the above embodiments, Figure 10 is a structural block diagram of the data processing apparatus provided in the embodiments of this application. For ease of explanation, only the parts related to the embodiments of this application are shown.

[0152] Referring to Figure 10, the data processing device 10 includes:

[0153] Training module 101 is used for:

[0154] If the control unit detects a fault in the working unit during forward or backward propagation when the working unit is performing iterative training of the model parameters, it sends a second control command to the working unit.

[0155] After receiving the second control command, the working unit terminates the current iteration training process and restores to the initial state before the current iteration.

[0156] Optionally, training module 101 is also used for:

[0157] The control unit acquires the first log information; wherein, the first log information is the training log generated by the working unit during the iterative training process;

[0158] The control unit filters out the log information corresponding to the time when the working unit malfunctions from the first log information to obtain the second log information;

[0159] The control unit sends the second log information to the working unit so that the working unit can re-execute according to the second log information, so that the working unit can be restored to the second state corresponding to the moment before the fault occurred.

[0160] The working unit continues iterative training of parameters in the second state.

[0161] Optionally, training module 101 is also used for:

[0162] If the control unit detects a fault in the working unit during parameter update while the working unit is performing iterative training of the model parameters, it sends a third control command to the working unit.

[0163] After receiving the third control command, the working unit updates the parameters corresponding to the current moment to the theoretical parameters calculated in this iteration, and updates the current state to the third state; where the third state is the state at the start of the next iteration corresponding to this iteration.

[0164] The working unit continues iterative training of the model parameters in the third state.

[0165] The reasoning information backup module 102 is used to perform reasoning calculations and output results when the working unit receives a reasoning problem input by the user, and to send the first key information generated during the logical reasoning and calculation process of the reasoning problem to the backup unit for backup.

[0166] The data processing device 10 also includes a fault determination module 103, used for:

[0167] The control unit sends multiple heartbeat requests to the working unit;

[0168] If the control unit does not receive a response from the working unit within a preset time after sending a preset number of heartbeat requests, it is determined that the working unit has malfunctioned.

[0169] The data processing device 10 also includes an address switching module 104, used for:

[0170] The control unit unbinds the access address bound to the working unit and binds the access address to the standby unit.

[0171] The backup unit switching module 105 is used to send a first control command to the backup unit if the control unit detects a fault in the working unit.

[0172] The backup unit reasoning module 106 is used to perform logical reasoning and calculation on the current reasoning problem or the next reasoning problem input by the user based on all the backed-up first key information after the backup unit receives the first control command, and output the result.

[0173] Optionally, the reasoning information backup module 102 is also used for:

[0174] After performing logical reasoning and calculation on the reasoning problem, the working unit stores the first key information generated in the logical reasoning and calculation into the first storage area corresponding to the working unit.

[0175] After detecting a data update in the first storage area, the working unit writes the first key information into the second storage area corresponding to the backup unit through a preset communication method.
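The store-then-mirror behaviour of these two steps can be sketched as follows. This is an illustration under stated assumptions: `MirroredStore` and `Transport` are hypothetical names, Python dictionaries stand in for the first and second storage areas, and an in-process callback stands in for the preset communication method.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the preset communication method: a callable
# that writes one key/value pair into the backup unit's storage area.
Transport = Callable[[str, bytes], None]


class MirroredStore:
    """First storage area that forwards every detected update to the backup."""

    def __init__(self, transport: Transport) -> None:
        self._first_area: Dict[str, bytes] = {}
        self._transport = transport

    def store(self, key: str, value: bytes) -> None:
        # A write counts as a data update only if it changes the stored value.
        changed = self._first_area.get(key) != value
        self._first_area[key] = value
        if changed:
            self._transport(key, value)


# The backup unit's second storage area, modelled as a plain dictionary.
second_area: Dict[str, bytes] = {}
store = MirroredStore(second_area.__setitem__)
store.store("session-1", b"first key information")
```

After a failover, the backup unit would read `second_area` to continue the session from the last mirrored state instead of recomputing it.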

[0176] It should be noted that the information interaction and execution process between the above-mentioned devices / units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.

[0177] In addition, the data processing device shown in Figure 10 can be a software unit, a hardware unit, or a combination of software and hardware built into an existing terminal device. It can also be integrated into the terminal device as an independent component, or it can exist as a standalone terminal device.

[0178] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0179] Figure 11 is a schematic diagram of the structure of the terminal device provided in the embodiments of this application. As shown in Figure 11, the terminal device 11 of this embodiment includes: at least one processor 110 (only one processor is shown in Figure 11), a memory 111, and a computer program 112 stored in the memory 111 and executable on the at least one processor 110, wherein the processor 110 executes the computer program 112 to implement the steps in any of the above-described data processing method embodiments.

[0180] The terminal device can be a computing device such as a desktop computer, laptop, handheld computer, or cloud server. This terminal device may include, but is not limited to, a processor and memory. Those skilled in the art will understand that Figure 11 is merely an example of terminal device 11 and does not constitute a limitation on terminal device 11. The terminal device may include more or fewer components than shown, combine certain components, or use different components, and may, for example, also include input / output devices, network access devices, etc.

[0181] The processor 110 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0182] In some embodiments, memory 111 may be an internal storage unit of terminal device 11, such as a hard disk or internal memory of terminal device 11. In other embodiments, memory 111 may be an external storage device of terminal device 11, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on terminal device 11. Furthermore, memory 111 may include both an internal storage unit and an external storage device of terminal device 11. Memory 111 is used to store the operating system, applications, boot loader, data, and other programs, such as the program code of the computer program. Memory 111 can also be used to temporarily store data that has been output or will be output.

[0183] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, can implement the steps in the above-described method embodiments.

[0184] This application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps described in the various method embodiments above.

[0185] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium can include at least: any entity or device capable of carrying computer program code to a device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.

[0186] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0187] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0188] In the embodiments provided in this application, it should be understood that the disclosed apparatus / terminal devices and methods can be implemented in other ways. For example, the apparatus / terminal device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

[0189] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0190] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A data processing method, characterized in that it is applied to a data processing system including a working unit, a backup unit, and a control unit, the method comprising: when the working unit receives a reasoning problem input by the user, the working unit performs reasoning calculations and outputs results, and sends the first key information generated during the logical reasoning and calculation process of the reasoning problem to the backup unit for backup; after performing logical reasoning and calculation on the reasoning problem, the working unit stores the first key information generated in the logical reasoning and calculation into the first storage area corresponding to the working unit, wherein the first storage area is the high-bandwidth memory of the NPU of the working unit; after detecting a data update in the first storage area, the working unit writes the first key information into the second storage area corresponding to the backup unit through a preset communication method, wherein the second storage area is the high-bandwidth memory of the NPU of the backup unit, and the preset communication method is the NPU Remoting communication layer communication method based on the RDMA / RoCEv2 protocol; if the control unit detects a fault in the working unit, the control unit sends a first control command to the backup unit; the control unit unbinds the access address bound to the working unit and binds the access address to the backup unit; and the working unit deploys a JIT Checkpoint component, wherein the JIT Checkpoint component monitors changes in memory address data in the first storage area in real time through the memory change monitoring interface provided by the NPU.

2. The data processing method as described in claim 1, characterized in that the method further includes: the control unit sends multiple heartbeat requests to the working unit; if the control unit does not receive a response from the working unit within a preset time after sending a preset number of heartbeat requests, it is determined that the working unit has malfunctioned.

3. The data processing method as described in claim 1, characterized in that the method further includes: if, while multiple working units are performing iterative training of model parameters, the control unit detects a fault in any of the working units during forward or backward propagation, the control unit sends a second control command to all the working units; upon receiving the second control command, the multiple working units terminate the current iteration of training and restore the initial state before the current iteration.

4. The data processing method as described in claim 3, characterized in that the method further includes: the control unit acquires first log information, wherein the first log information is the training logs generated by the multiple working units during the iterative training process; the control unit filters out, from the first log information, the log information corresponding to the time when the working unit malfunctions, to obtain second log information; the control unit sends the second log information to the multiple working units, so that the multiple working units re-execute according to the second log information and are thereby restored to the second state corresponding to the moment before the fault occurred; the multiple working units continue iterative training of the model parameters in the second state.

5. The data processing method as described in claim 4, characterized in that the method further includes: if, while multiple working units are performing iterative training of the model parameters, the control unit detects a fault in any of the working units during parameter update, the control unit sends a third control command to all the working units; after receiving the third control command, the multiple working units update the parameters corresponding to the current moment to the theoretical parameters calculated in this iteration, and update the current state to the third state, wherein the third state is the state at the start of the next iteration corresponding to this iteration; in the third state, the multiple working units continue iterative training of the model parameters.

6. A data processing apparatus, characterized in that it comprises: a reasoning information backup module, used to, when the working unit receives a reasoning problem input by the user, perform reasoning calculations and output results, and send the first key information generated during the logical reasoning and calculation process of the reasoning problem to the backup unit for backup; a backup unit switching module, used so that if the control unit detects a fault in the working unit, the control unit sends a first control command to the backup unit; a backup unit reasoning module, used to, after the backup unit receives the first control command, perform logical reasoning and calculation on the current reasoning problem or the next reasoning problem input by the user based on all the backed-up first key information, and output the result; the reasoning information backup module is also used for: after performing logical reasoning and calculation on the reasoning problem, the working unit stores the first key information generated in the logical reasoning and calculation into the first storage area corresponding to the working unit, wherein the first storage area is the high-bandwidth memory of the NPU of the working unit; after detecting a data update in the first storage area, the working unit writes the first key information into the second storage area corresponding to the backup unit through a preset communication method, wherein the second storage area is the high-bandwidth memory of the NPU of the backup unit, and the preset communication method is the NPU Remoting communication layer communication method based on the RDMA / RoCEv2 protocol; the backup unit switching module is also used for: the control unit unbinds the access address bound to the working unit and binds the access address to the backup unit; and a data monitoring module, used for: the working unit deploys a JIT Checkpoint component, wherein the JIT Checkpoint component monitors changes in memory address data in the first storage area in real time through the memory change monitoring interface provided by the NPU.

7. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the method as described in any one of claims 1 to 5.

8. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 5.