Testing method and apparatus for artificial-intelligence cluster, and computing device cluster

By acquiring architecture and resource information from an artificial intelligence cluster and conducting online fault testing using idle time or data volume thresholds, the problem of increased fault testing time and reduced resource utilization during training and inference processes in existing technologies is solved, achieving efficient fault detection.

WO2026129707A1 · PCT designated stage · Publication Date: 2026-06-25 · HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD
Filing Date: 2025-08-22
Publication Date: 2026-06-25

Abstract

Provided in the present application are a testing method and apparatus for an artificial-intelligence cluster, and a computing device cluster. The method is applied to a management platform, which is used for managing infrastructures, wherein the infrastructures comprise an artificial-intelligence cluster, which comprises a plurality of computing units, the plurality of computing units being used for operating an artificial-intelligence model. The method comprises: acquiring architecture information of an artificial-intelligence model, wherein the architecture information is used for describing the deployment architecture of the artificial-intelligence model in a plurality of computing units; on the basis of the architecture information, determining first fault testing information, which is used for indicating a condition for a target computing unit from among the plurality of computing units to perform fault testing when operating the artificial-intelligence model; and sending the first fault testing information to the target computing unit from among the plurality of computing units. Online fault testing is implemented during the normal operation of an artificial-intelligence model, without consuming additional time, thereby reducing the time for model training and inference, and increasing the resource utilization rate.

Description

A testing method, apparatus, and computing device cluster for an artificial intelligence cluster

[0001] This application claims priority to Chinese Patent Application No. 202411855017.5, filed on December 16, 2024, entitled "A Testing Method, Apparatus and Computing Device Cluster for an Artificial Intelligence Cluster", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This invention relates to the field of artificial intelligence technology, and in particular to a testing method, apparatus, and computing device cluster for artificial intelligence clusters.

Background Technology

[0003] As large models grow from billions to trillions of parameters, the size of the AI clusters required for their training and inference increases rapidly. Because large models have long training cycles and high loads, failures may occur during training.

[0004] To detect faults promptly and minimize their impact, the training and inference of large models can be interrupted, and then online fault testing can be performed.

[0005] However, because this approach requires interrupting the training and inference of large models, it may increase the training and inference time of the models.

Summary of the Invention

[0006] This invention provides a testing method, apparatus, and computing device cluster for artificial intelligence clusters. During the normal operation of the artificial intelligence model, online fault testing can be performed simultaneously without spending additional time, reducing the time for model training and inference, and improving resource utilization.

[0007] The first aspect of the present invention provides a testing method for an artificial intelligence cluster. The method is applied to a management platform for managing infrastructure, including an artificial intelligence cluster. The artificial intelligence cluster includes multiple computing units for running artificial intelligence models. The method includes:

[0008] Obtain the architecture information of the artificial intelligence model, which describes the deployment architecture of the artificial intelligence model in multiple computing units; determine the first fault test information based on the architecture information, which indicates the conditions for the target computing unit in the multiple computing units to perform fault tests when running the artificial intelligence model; and send the first fault test information to the target computing unit in the multiple computing units.

[0009] In this solution, online fault testing is simultaneously implemented during the normal operation of the artificial intelligence model, without requiring additional time, reducing the time for model training and inference, and improving resource utilization.
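The three steps of the first aspect (acquire the architecture information, determine the first fault test information, send it to the target computing unit) can be sketched as follows. This is only an illustrative sketch: the names `FaultTestInfo`, `determine_fault_test_info`, and `send_all`, and the string labels for the architectures and conditions, are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultTestInfo:
    """Hypothetical container for the 'first fault test information'."""
    target_unit: str
    condition: str  # condition under which the unit may run fault tests


def determine_fault_test_info(architecture: str, units: list[str]) -> list[FaultTestInfo]:
    """Map the deployment architecture to a per-unit fault test condition."""
    if architecture == "pipeline_parallel":
        condition = "test during idle time periods"
    elif architecture == "expert_parallel":
        condition = "test when input data volume <= threshold"
    else:
        raise ValueError(f"unsupported architecture: {architecture}")
    return [FaultTestInfo(unit, condition) for unit in units]


def send_all(infos: list[FaultTestInfo]) -> dict[str, str]:
    """Stand-in for the sending step: returns what each unit would receive."""
    return {info.target_unit: info.condition for info in infos}


sent = send_all(determine_fault_test_info("pipeline_parallel", ["unit0", "unit1"]))
print(sent)
```

In a real management platform the sending step would go over whatever control channel the platform uses; the dictionary here only makes the data flow visible.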

[0010] In some possible implementations, the method also includes: obtaining data allocation information of the artificial intelligence model, which describes the strategy for allocating the input data of the artificial intelligence model to multiple computing units;

[0011] Based on the architecture information, determine the first fault test information, including: based on the architecture information and data allocation information, determine the resource usage information, which is used to describe the resource usage of multiple computing units running the artificial intelligence model; based on the resource usage information, determine the first fault test information.

[0012] In one example of this implementation, when the architecture information is a pipelined parallel architecture, the method further includes: determining the idle time model of multiple computing units based on the pipelined parallel architecture, wherein the pipelined parallel architecture is used to describe how multiple parts of the artificial intelligence model are deployed to different computing units in the artificial intelligence cluster, and the output of the first part of the multiple parts is the input of the second part of the multiple parts;

[0013] Based on resource usage information, the first fault test information is determined, including: determining the idle time period of the target computing unit when running the artificial intelligence model among multiple computing units according to the idle time model and resource usage information; using the idle time period as the condition for fault testing of the target computing unit when running the artificial intelligence model to obtain the first fault test information.

[0014] In this solution, by adopting a pipeline parallel approach, fault testing can be performed during idle periods, thereby improving resource utilization efficiency.
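To make the idea of an idle time model concrete, the sketch below computes idle periods for a deliberately simplified pipeline. It assumes a forward-only schedule in which every stage takes the same time per micro-batch and stage `s` starts micro-batch `m` at time `(s + m) * step_time`. The application does not specify its idle time model, so this schedule and the function name `idle_periods` are assumptions made for illustration.

```python
def idle_periods(
    num_stages: int, num_microbatches: int, step_time: float
) -> dict[int, list[tuple[float, float]]]:
    """Return, per pipeline stage, the (start, end) intervals in which it is idle.

    Assumed schedule: stage s is busy from s * step_time until
    (s + num_microbatches) * step_time; the whole pass ends at
    (num_stages + num_microbatches - 1) * step_time.
    """
    makespan = (num_stages + num_microbatches - 1) * step_time
    periods: dict[int, list[tuple[float, float]]] = {}
    for stage in range(num_stages):
        busy_start = stage * step_time
        busy_end = (stage + num_microbatches) * step_time
        gaps = []
        if busy_start > 0:
            gaps.append((0.0, busy_start))      # waiting for the first input
        if busy_end < makespan:
            gaps.append((busy_end, makespan))   # finished before later stages
        periods[stage] = gaps
    return periods


periods = idle_periods(num_stages=3, num_microbatches=4, step_time=1.0)
for stage, gaps in periods.items():
    print(stage, gaps)
```

Under these assumptions the first stage is idle only at the end of the pass and the last stage only at the beginning, and each such interval is a candidate window for fault testing.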

[0015] In one example of this implementation, when the architecture information is an expert parallel architecture, the expert parallel architecture is used to describe multiple expert models deployed to different computing units in an artificial intelligence cluster, wherein multiple expert models are connected to a gating network, and the gating network is used to distribute the input data of the artificial intelligence model to at least some or all of the multiple expert models, and the gating network is deployed in each of the multiple computing units;

[0016] The method also includes: determining the total amount of data and the number of batches based on data allocation information, wherein the total amount of data is used to indicate the amount of input data for multiple computing units;

[0017] Based on resource usage information, the first fault test information is determined, including:

[0018] Based on resource usage information, total data volume, and batch number, a data volume threshold for a target data volume is determined. The target data volume indicates the amount of input data for the expert model deployed in the target computing unit among the multiple computing units. The condition that the target data volume is less than or equal to the data volume threshold is used as the condition for the target computing unit to perform fault testing when running the artificial intelligence model, thereby obtaining the first fault test information.

[0019] In this scheme, in scenarios employing expert model parallelism and gated networks, a data volume threshold is determined based on the total data volume, batch number, and resource usage information. This allows the computing unit to perform fault testing when the data processing volume is low, thereby improving resource utilization efficiency.
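One way such a threshold could be derived is sketched below. The application does not give a formula, so everything here is an assumption: the threshold is taken as a fraction (`low_load_ratio`, a hypothetical parameter standing in for whatever is derived from the resource usage information) of the even per-unit share of one batch.

```python
def data_volume_threshold(
    total_volume: int, num_batches: int, num_units: int, low_load_ratio: float
) -> float:
    """Hypothetical threshold: a fraction of the even per-unit share of a batch."""
    per_batch = total_volume / num_batches
    fair_share = per_batch / num_units
    return fair_share * low_load_ratio


def units_allowed_to_test(per_unit_volume: dict[str, int], threshold: float) -> list[str]:
    """Units whose routed input volume is less than or equal to the threshold."""
    return sorted(u for u, v in per_unit_volume.items() if v <= threshold)


threshold = data_volume_threshold(
    total_volume=8000, num_batches=10, num_units=4, low_load_ratio=0.5
)
eligible = units_allowed_to_test(
    {"unit0": 350, "unit1": 90, "unit2": 100, "unit3": 260}, threshold
)
print(threshold, eligible)
```

Here units that receive at most half of an even share in the current batch are treated as lightly loaded and may run fault tests.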

[0020] In one possible implementation, the method also includes:

[0021] Send test cases to the target computing unit. The test cases are used to test whether the target computing unit has any faults.

[0022] In this scheme, when the condition indicated by the first fault test information is met, the computing unit runs the test cases, which ensures the effectiveness and accuracy of fault testing.

[0023] In one possible implementation, the method also includes:

[0024] Obtain fault indication information, which indicates that at least one computing unit among multiple computing units has failed; determine second fault test information based on the fault indication information, which instructs the computing units among multiple computing units that have not failed to perform fault tests; and send the second fault test information to the computing units among multiple computing units that have not failed.

[0025] In this scheme, if any of the N computing units fails, the failed computing unit needs to be reconfigured. At this time, the other normal computing units will be in a waiting process. Using the waiting process to perform fault testing can improve resource utilization efficiency.
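The selection of recipients for the second fault test information can be sketched as below. The function name `second_fault_test_targets` and the unit identifiers are hypothetical; the sketch only shows that the non-failed units, which would otherwise wait for the failed unit to be reconfigured, are the ones instructed to test.

```python
def second_fault_test_targets(units: list[str], failed: set[str]) -> list[str]:
    """Return the units that have not failed and should run fault tests."""
    unknown = failed - set(units)
    if unknown:
        raise ValueError(f"fault reported for unknown units: {sorted(unknown)}")
    return [unit for unit in units if unit not in failed]


targets = second_fault_test_targets(["unit0", "unit1", "unit2"], failed={"unit1"})
print(targets)
```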

[0026] In one possible implementation, fault testing includes silent data error detection.

[0027] A second aspect of the present invention provides a testing apparatus for an artificial intelligence cluster. The apparatus comprises several modules, each module executing a step in the testing method for an artificial intelligence cluster provided in the first aspect of this invention. The division of modules is not limited here. For the specific functions performed by each module of this testing apparatus and the beneficial effects achieved, please refer to the functions of each step in the testing method for an artificial intelligence cluster provided in the first aspect of this invention; further details will not be repeated here.

[0028] For example, a testing device for an artificial intelligence cluster is applied to a management platform. The management platform manages infrastructure, including an artificial intelligence cluster. The artificial intelligence cluster includes multiple computing units for running artificial intelligence models. The device includes:

[0029] The architecture acquisition module is used to acquire the architecture information of the artificial intelligence model. The architecture information describes the deployment architecture of the artificial intelligence model in multiple computing units.

[0030] The condition acquisition module is used to determine the first fault test information based on the architecture information. The first fault test information is used to indicate the conditions for the target computing unit among multiple computing units to perform fault testing when running the artificial intelligence model.

[0031] The sending module is used to send the first fault test information to the target computing unit among multiple computing units.

[0032] In one possible implementation, a condition acquisition module is used to acquire data allocation information of an artificial intelligence model, which describes a strategy for allocating input data of the artificial intelligence model to multiple computing units; based on the architecture information and the data allocation information, resource usage information is determined, which describes the resource usage of multiple computing units running the artificial intelligence model; and based on the resource usage information, first fault test information is determined.

[0033] In one possible implementation, when the architecture information is a pipelined parallel architecture, the condition acquisition module is used to determine the idle time model of multiple computing units based on the pipelined parallel architecture. The pipelined parallel architecture is used to indicate that multiple parts of the artificial intelligence model are deployed to different computing units in the artificial intelligence cluster, and the output of the first part of the multiple parts is the input of the second part of the multiple parts. Based on the idle time model and resource usage information, the idle time period of the target computing unit when running the artificial intelligence model is determined. The idle time period is used as a condition for fault testing of the target computing unit when running the artificial intelligence model to obtain the first fault test information.

[0034] In one possible implementation, when the architecture information is an expert parallel architecture, the expert parallel architecture is used to describe multiple expert models deployed to different computing units in an artificial intelligence cluster, wherein the multiple expert models are connected to a gating network, the gating network is used to distribute the input data of the artificial intelligence model to at least some or all of the multiple expert models, and the gating network is deployed in each of the multiple computing units.

[0035] The condition acquisition module is used to determine the total amount of data and the number of batches based on data allocation information. The total amount of data is used to indicate the amount of input data for multiple computing units. Based on resource usage information, the total amount of data, and the number of batches, a data volume threshold for the target data volume is determined. The target data volume is used to indicate the amount of input data for the expert model deployed in the target computing unit among the multiple computing units. The target data volume being less than or equal to the data volume threshold is used as a condition for the target computing unit to perform fault testing when running the artificial intelligence model, thus obtaining the first fault test information.

[0036] In one possible implementation, the sending module is also used to send test cases to the target computing unit, and the test cases are used to test whether the target computing unit has any faults.

[0037] In one possible implementation, the condition acquisition module is further configured to acquire fault indication information, which indicates that at least one of the multiple computing units has failed; and determine second fault test information based on the fault indication information, which instructs the computing units that have not failed among the multiple computing units to perform fault testing.

[0038] The sending module is also used to send the second fault test information to the non-faulty computing units among the multiple computing units.

[0039] In one possible implementation, fault testing includes silent data error detection.

[0040] A third aspect of the present invention provides a testing apparatus for an artificial intelligence cluster, comprising: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform a method as provided in the first aspect or any possible design of the first aspect.

[0041] A fourth aspect of the present invention provides a testing apparatus for an artificial intelligence cluster. The apparatus executes computer program instructions to perform a method as provided in the first aspect or any possible design of the first aspect. Exemplarily, the apparatus may be a chip or a processor.

[0042] In one example, the device may include a processor that may be coupled to memory, read instructions from the memory, and execute methods provided by the first aspect or any possible design of the first aspect, according to those instructions. The memory may be integrated into the chip or processor, or it may be independent of the chip or processor.

[0043] A fifth aspect of the invention provides a computing device cluster including at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the computing device to perform a method as provided in the first aspect or any possible design of the first aspect.

[0044] A sixth aspect of the invention provides a computer program product comprising instructions that, when executed by a cluster of computer devices, cause the cluster of computer devices to perform a method as provided in the first aspect or any possible design of the first aspect.

[0045] A seventh aspect of the invention provides a computer-readable storage medium including computer program instructions that, when executed by a cluster of computing devices, perform a method as provided in the first aspect or any possible design of the first aspect.

Attached Figure Description

[0046] To more clearly illustrate the technical methods of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly described below.

[0047] Figure 1 is a schematic diagram of an artificial intelligence framework provided in an embodiment of the present invention;

[0048] Figure 2 is a schematic diagram of the architecture of the model management system provided in an embodiment of the present invention;

[0049] Figure 3 is a schematic diagram of a model training scenario provided by an embodiment of the present invention;

[0050] Figure 4 is a schematic diagram of the structure of the AI cluster provided in an embodiment of the present invention;

[0051] Figure 5a is a schematic diagram of data parallelism provided in an embodiment of the present invention;

[0052] Figure 5b is a schematic diagram of model parallelism provided in an embodiment of the present invention;

[0053] Figure 6 is a schematic diagram of the AI model deployment provided in an embodiment of the present invention;

[0054] Figure 7 is a schematic diagram of AI model inference provided in an embodiment of the present invention;

[0055] Figure 8 is a flowchart illustrating a testing method for an artificial intelligence cluster provided in an embodiment of the present invention;

[0056] Figure 9a is a schematic diagram of pipeline parallelism provided in an embodiment of the present invention;

[0057] Figure 9b is a schematic diagram of the combination of pipeline parallelism and data parallelism provided in an embodiment of the present invention;

[0058] Figure 9c is a schematic diagram of fault testing using idle time periods in the scenario shown in Figure 9a;

[0059] Figure 10a is a schematic diagram of a computing unit deploying an expert model according to an embodiment of the present invention;

[0060] Figure 10b is a schematic diagram of the computing unit deploying two expert models according to an embodiment of the present invention;

[0061] Figure 10c is a schematic diagram of the parallel operation of expert models provided in an embodiment of the present invention;

[0062] Figure 10d is a schematic diagram of the combination of expert model parallelism and data parallelism provided in the embodiment of the present invention;

[0063] Figure 10e is a schematic diagram of a fault test performed on a computing unit with a small amount of data processed in the scenario shown in Figure 10c.

[0064] Figure 11a is a schematic diagram showing the deployment of AI models in N computing units according to an embodiment of the present invention;

[0065] Figure 11b is a schematic diagram of a fault test performed on a computing unit with a small amount of data processed in the scenario of Figure 11a;

[0066] Figure 12 is a schematic diagram showing the relationship between the architecture information and the fault test conditions of the target computing unit when running the AI model, as provided in the embodiment of the present invention.

[0067] Figure 13 is a flowchart illustrating step 802 provided in an embodiment of the present invention;

[0068] Figure 14a is a schematic diagram of data allocation provided in an embodiment of the present invention;

[0069] Figure 14b is a schematic diagram of data allocation provided in an embodiment of the present invention;

[0070] Figure 14c is a schematic diagram of data allocation provided in an embodiment of the present invention;

[0071] Figure 14d is a schematic diagram of data allocation provided in an embodiment of the present invention;

[0072] Figure 15a is a schematic diagram of a resource usage information collection scenario provided in an embodiment of the present invention;

[0073] Figure 15b is a schematic diagram of a resource usage information collection scenario provided in an embodiment of the present invention;

[0074] Figure 16a is a schematic diagram of the fault test conditions for determining the target computing unit when running an AI model, provided by an embodiment of the present invention;

[0075] Figure 16b is a schematic diagram of determining idle time periods provided by an embodiment of the present invention;

[0076] Figure 16c is a schematic diagram of determining the data volume threshold provided in an embodiment of the present invention;

[0077] Figure 16d is a schematic diagram (2) illustrating the determination of the data volume threshold provided in an embodiment of the present invention;

[0078] Figure 16e is a schematic diagram of determining resource utilization rate provided by an embodiment of the present invention;

[0079] Figure 17 is a flowchart illustrating another testing method for an artificial intelligence cluster provided in an embodiment of the present invention;

[0080] Figure 18 is a flowchart illustrating another testing method for an artificial intelligence cluster provided in an embodiment of the present invention;

[0081] Figure 19 is a schematic diagram of the testing device for the artificial intelligence cluster provided in an embodiment of the present invention;

[0082] Figure 20 is a schematic diagram of the structure of a computing device provided in an embodiment of the present invention;

[0083] Figure 21 is a schematic diagram of a computing device cluster provided in an embodiment of the present invention;

[0084] Figure 22 is a schematic diagram of computing devices in a computer cluster connected via a network according to an embodiment of the present invention.

Detailed Implementation

[0085] The following explanations cover some of the terms used in this embodiment. It should be noted that these explanations are for the convenience of those skilled in the art and are not intended to limit the scope of protection claimed by this invention.

[0086] Artificial Intelligence (AI) is a branch of computer science that attempts to understand the nature of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems.

[0087] An AI cluster refers to a collection of computing resources used to deploy AI models. It typically consists of multiple servers or computing nodes connected via a high-speed network to achieve efficient computing power. As AI technology continues to develop, the scale and performance of AI clusters are also constantly improving, making them a key infrastructure driving the advancement of AI technology.

[0088] AI models are systems trained using computer algorithms and data, capable of simulating human intelligent behavior. Through AI models, computers can perform a range of tasks, including image recognition, speech recognition, natural language processing, and machine translation. The core of an AI model is its algorithm and data; through continuous iterative learning and optimization, the model can improve its accuracy and efficiency.

[0089] Silent data errors refer to data being modified or corrupted without triggering system errors or error alerts. Silent data errors are dangerous because they are not immediately detected and may only be discovered after causing serious upper-level application incidents.
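A common way to expose a silent data error is to run a deterministic computation on known inputs and compare the result with a precomputed reference. The sketch below illustrates that idea only; it is not taken from the application, and `faulty_matmul` merely simulates a corrupted result by flipping one bit.

```python
def matmul(a: list[list[int]], b: list[list[int]]) -> list[list[int]]:
    """Plain integer matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def has_silent_error(compute, a, b, expected) -> bool:
    """True if the computation finishes without raising but returns a wrong result."""
    return compute(a, b) != expected


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
GOLDEN = [[19, 22], [43, 50]]  # reference result computed ahead of time


def faulty_matmul(a, b):
    """Simulated faulty unit: flips the lowest bit of one output element."""
    result = matmul(a, b)
    result[0][0] ^= 1
    return result


print(has_silent_error(matmul, A, B, GOLDEN))
print(has_silent_error(faulty_matmul, A, B, GOLDEN))
```

The faulty run raises no exception and reports no error, which is exactly why the mismatch against the reference is the only signal.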

[0090] Large models refer to machine learning models with a large number of parameters and complex structures, typically built from deep neural networks, and possessing billions or even trillions of parameters. The purpose of designing large models is to improve their expressive power and predictive performance, enabling them to excel on a wide range of tasks.

[0091] Infrastructure refers to virtualized computing resources provided through computer technology, including servers, storage, networks, and other services. These resources can be allocated to users on demand. Infrastructure can be cloud infrastructure, which is typically managed and maintained by cloud service providers, and users can access these resources via the internet.

[0092] Neural networks are machine learning algorithms that mimic the connections between neurons in the human brain. By combining multiple layers of neurons and applying non-linear transformations of activation functions, neural networks can learn the features and patterns of data, enabling them to model and predict complex data.

[0093] Forward computation: The input data is processed through the layers of the neural network until the final output is obtained.

[0094] Backpropagation: Based on the loss function, the gradient is calculated using the backpropagation algorithm, and the parameters in the neural network are updated to reduce the loss function.
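Forward computation and backpropagation can be illustrated with the smallest possible case: a single linear neuron with a squared-error loss. The model, learning rate, and data values below are assumptions chosen only to keep the example readable.

```python
def forward(w: float, b: float, x: float) -> float:
    """Forward computation of one linear neuron."""
    return w * x + b


def train_step(w: float, b: float, x: float, target: float, lr: float):
    """One forward pass, gradient computation, and parameter update."""
    pred = forward(w, b, x)
    error = pred - target
    loss = error ** 2
    grad_w = 2 * error * x   # d(loss)/dw
    grad_b = 2 * error       # d(loss)/db
    return w - lr * grad_w, b - lr * grad_b, loss


w, b = 0.0, 0.0
losses = []
for _ in range(20):
    w, b, loss = train_step(w, b, x=2.0, target=4.0, lr=0.05)
    losses.append(loss)

print(losses[0], losses[-1])
```

Each update moves the parameters against the gradient, so the loss shrinks from step to step in this setting.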

[0095] Batch: This refers to dividing the dataset into several batches, each containing a certain number of samples, and then using these batches to train the model. Increasing the batch size will increase memory consumption and may speed up model convergence to some extent, but it may lead to a decrease in the model's generalization ability.
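As a minimal illustration of batching, the sketch below splits a dataset into consecutive batches of a fixed size. The helper name `make_batches` is hypothetical.

```python
def make_batches(samples: list, batch_size: int) -> list[list]:
    """Split samples into consecutive batches; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


batches = make_batches(list(range(10)), batch_size=4)
print(batches)
```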

[0096] The cloud is a software platform that uses application virtualization technology, integrating multiple functions such as software search, download, use, management, and backup.

[0097] Data volume: This term is usually used to describe the capacity or scale of digitized information. Therefore, data volume is the size or quantity of data stored or transmitted.

[0098] Pre-training (or pre-trained) refers to training a model in advance, or to the process of doing so. Pre-training is typically done using a large dataset to facilitate subsequent fine-tuning.

[0099] Fine-tuning refers to the process of applying a pre-trained model to a specific dataset and adapting its parameters to that dataset. Real-world tasks typically have relatively small datasets, so fine-tuning a pre-trained model can often yield good results.

[0100] A Graphics Processing Unit (GPU), also known as a display core, display chip, or video processor, is a coprocessor used for processing images and graphics computations. GPU resources include computing resources and storage resources. Computing resources may include Compute Unified Device Architecture (CUDA) cores, while storage resources may include RAM, which is used to store AI models, data, and the structures that the AI models use to process that data.

[0101] Compute Unified Device Architecture (CUDA): a parallel computing platform and programming model designed and developed by NVIDIA, which includes the CUDA instruction set architecture and the parallel computing engine inside the GPU.

[0102] A Neural Network Processing Unit (NPU) uses circuits to simulate the structure of human neurons and synapses to process specific tasks. NPU resources include computational and storage resources. Computational resources include computational units, the core components of the NPU, specifically designed for neural network computation. NPU computational units typically employ matrix and vector computation methods, enabling rapid execution of matrix multiplication, convolution, and other calculations. Storage resources include memory, used to store AI models, data, and the structures of the data processed by the AI models.

[0103] Central Processing Unit (CPU): As the core of a computer system for computation and control, it is the final execution unit for information processing and program execution.

[0104] Resource utilization rate: refers to the ratio of resources used by a computer system to its total resources within a given time period.
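For a time-based reading of this ratio, the sketch below computes the fraction of an observation window during which a computing unit is busy. Treating "resources used" as busy time, and assuming the busy intervals do not overlap, are simplifications made for illustration.

```python
def utilization(
    busy_intervals: list[tuple[float, float]], window_start: float, window_end: float
) -> float:
    """Fraction of the window covered by (non-overlapping) busy intervals."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    busy = sum(
        min(end, window_end) - max(start, window_start)  # clip to the window
        for start, end in busy_intervals
        if end > window_start and start < window_end
    )
    return busy / (window_end - window_start)


rate = utilization([(-2.0, 3.0), (5.0, 7.0)], window_start=0.0, window_end=10.0)
print(rate)
```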

[0105] A test case is a set of test inputs, execution conditions, and expected results designed for a specific purpose, used to verify whether a particular requirement is met.

[0106] Figure 1 shows a schematic diagram of an artificial intelligence framework, which describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence needs.

[0107] The above-mentioned artificial intelligence framework will be elaborated from two dimensions: "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis).

[0108] The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it could be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation process of "data—information—knowledge—wisdom."

[0109] The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence, information (provided and processed by technology) to the industrial ecosystem of systems.

[0110] (1) Infrastructure:

[0111] The infrastructure provides computing power to support the artificial intelligence system, enabling communication with the external world, and is supported by a basic platform (also known as a management platform). Communication with the outside world is achieved through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, and GPUs). In this embodiment of the invention, the intelligent chip acts as a computing unit; the basic platform includes a distributed computing framework and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, and this data is provided to the intelligent chips in the distributed computing system provided by the basic platform for computation.

[0112] (2) Data

[0113] The data at the next layer of infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.

[0114] (3) Data processing

[0115] Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.

[0116] Among them, machine learning and deep learning can perform intelligent information modeling, extraction, data preprocessing, and training on data, including symbolization and formalization.

[0117] Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.

[0118] Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.

[0119] (4) General capabilities

[0120] After the data processing mentioned above, the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

[0121] In this embodiment of the invention, the general capability can be a model-distributed architecture. This model-distributed architecture can include expert parallelism and pipeline parallelism architectures.

[0122] The Expert Parallel architecture distributes different parts of the model (called "experts") across different computational units, each responsible for processing a portion of the data and computational tasks. This allocation reduces the computational load on individual units and accelerates the overall training process through parallel processing.

[0123] Pipeline parallelism is a technique for achieving model parallelism in a distributed computing environment, primarily used in deep learning, especially when dealing with large-scale neural network models. By distributing different parts of the model (such as layers of a neural network) across different computing units, pipeline parallelism enables multiple machines in an AI cluster to collaboratively complete model training without sacrificing training efficiency.

[0124] (5) Smart Products and Industry Applications

[0125] Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They encapsulate overall artificial intelligence solutions, productize intelligent information decision-making, and realize practical applications. Their application areas mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe city, and intelligent terminals.

[0126] First, the model management system to which the method provided in the embodiments of the present invention may be applied will be described. Figure 2 is a schematic diagram of the architecture of a model management system provided in an embodiment of the present invention. As shown in Figure 2, the system includes a terminal 210 and a computing device cluster 220. The terminal 210 and the computing device cluster 220 are connected via a network. The network can be a wired network or a wireless network. For example, the wired network can be a cable network, an optical fiber network, a Digital Data Network (DDN), etc., and the wireless network can be a telecommunications network, an internal network, the Internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a General Packet Radio Service (GPRS) network, etc., or any combination thereof. It can be understood that the network can use any known network communication protocol to enable communication between different client layers and gateways. These network communication protocols can be various wired or wireless communication protocols, such as Ethernet, Universal Serial Bus (USB), FireWire, Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), New Radio (NR), Bluetooth, Wireless Fidelity (Wi-Fi), and other communication protocols.

[0127] The terminal 210 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. Exemplary embodiments of the terminal 210 involved in this solution include, but are not limited to, electronic devices running iOS, Android, Windows, Harmony OS, or other operating systems. This embodiment of the invention does not specifically limit the type of terminal 210.

[0128] The computing device cluster 220 can be configured as a server cluster or a distributed system consisting of multiple physical servers, or as a cloud server cluster. The cloud server cluster provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.

[0129] In practical use, the computing device cluster 220 can be configured as a management platform 221 and a data center 222. In one possible scenario, the computing device cluster 220 can act as a cloud, in which case the management platform 221 can act as a cloud management platform, and the terminal 210 interacts with the cloud through the cloud management platform. There can be several data centers 222, which can include a server cluster composed of multiple physical servers. The server cluster provides infrastructure, which includes various basic resources such as databases, virtual servers, physical servers, and computing resources, including central processing units (CPUs), graphics processing units (GPUs), and neural network processing units (NPUs). The management platform 221 can include an AI basic development platform 223. The AI ​​basic development platform 223 is a one-stop AI development platform for developers, providing various capabilities throughout the entire AI development process. For example, the capabilities provided by the AI ​​basic development platform 223 can include the following six parts: data preprocessing, model building and training, model management, model deployment, data optimization, and model optimization and updating. The various capabilities in the AI ​​basic development platform 223 can be integrated for users to use throughout the entire AI process, or they can be provided as independent functions for each user.

[0130] The AI basic development platform 223 can be deployed independently on a server or virtual machine in the data center 222, or deployed in a distributed manner across multiple servers or virtual machines in the data center 222. Alternatively, one part of the platform can be deployed, independently or in a distributed manner, on devices in an edge environment (also called edge devices), while the other part is deployed, independently or in a distributed manner, within the data center 222. An edge environment is an environment geographically close to the user's terminal computing device, and includes edge devices such as edge servers and edge stations with computing capabilities.

[0131] In practical applications, users can access the AI ​​basic development platform 223, purchase cloud services, and conduct AI development.

[0132] Cloud services manifest as a package of software capabilities integrated with hardware virtualization infrastructure resources. Furthermore, the underlying resources supporting any process within the AI basic development platform 223 may be distributed across different physical devices. That is, the hardware devices actually executing a process are typically server clusters within the same data center 222, or server clusters distributed across different data centers 222. These data centers 222 can be central cloud data centers of the cloud service provider, or edge data centers provided by the cloud service provider to users. For example, in a scenario combining public and private clouds, resources in the public cloud can be used to run the model training and deployment functions provided by the AI basic development platform 223, while resources in the private cloud can be used to run the data storage and preprocessing functions provided by the AI basic development platform 223. This provides stronger security for user data. In this scenario, public cloud resources can come from the central cloud data center, and private cloud resources can come from the edge data center.

[0133] As shown in Figure 3, AI development can include building AI models, training AI models based on datasets, deploying AI models, and using AI models for inference. Training the AI model can involve pre-training the AI model and then fine-tuning it for multiple downstream tasks (e.g., N tasks) to obtain a downstream task model for each task. It should be noted that when users develop, train, deploy, and use AI models for inference on the AI basic development platform 223, these operations rely on the basic resources (mainly computing resources such as CPUs, GPUs, and NPUs) in the cloud service provider's data center 222.

[0134] The construction and training of AI models are key capabilities of the AI ​​Basic Development Platform 223. These capabilities primarily include: 1. Automatically selecting and training an initial model built into the AI ​​Basic Development Platform 223 based on the user's goals (e.g., task type, target accuracy) to obtain an AI model that meets the user's goals; or 2. Training an initial AI model based on the user's goals and the initial AI model provided by the user or selected by the user on the AI ​​Basic Development Platform 223 to obtain an AI model that meets the user's goals; or 3. Using a background neural network architecture search algorithm, the AI ​​Basic Development Platform 223 automatically searches for a suitable AI model based on the user's goals, trains it, and obtains an AI model that meets the user's goals.

[0135] Of the three methods mentioned above, the first and second methods mainly utilize the computing power of the cloud environment to train the AI ​​model; the third method includes both the search for the AI ​​model architecture and the training of the AI ​​model.

[0136] Typically, on the AI basic development platform 223, distributed parallel training can be used to improve the training efficiency of AI models. In this embodiment of the invention, as shown in Figure 4, the data center 222 can deploy an AI cluster. The AI cluster can include multiple interconnected computing units, and these computing units can be hardware used for computation, such as CPUs, GPUs, and NPUs. In the data center 222, the computing units in the AI cluster can be used for training AI models. When the AI model is large in scale, distributed parallel training is employed. Distributed parallel training of the AI model includes two types: data parallelism and model parallelism (also known as network parallelism).

[0137] As shown in Figure 5a, data parallelism specifically refers to deploying the same AI model to be trained on multiple computing units (e.g., computing units 1, 2, 3, ..., N), dividing the training dataset into multiple data subsets, and distributing the subsets across the computing units. Each computing unit independently trains its copy of the AI model using the data from its corresponding subset. During each training iteration, the gradient values calculated by each computing unit can be synchronized to the other computing units, allowing each unit to obtain the average gradient of that iteration. The average gradient is the sum of the gradients across all computing units in the same iteration divided by the number of computing units, and each computing unit can update the parameters of its respective AI model based on the average gradient. This is equivalent to aggregating the data on many computing units into one large batch for training the AI model, resulting in faster model convergence and improved training efficiency.
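By way of illustration only, the gradient synchronization described above can be sketched as follows. The sketch assumes that gradients are plain lists of floating-point values and that the average is the element-wise mean across computing units; the function names `average_gradients` and `apply_update` and the learning rate `lr` are hypothetical and do not appear in this application.

```python
from typing import List


def average_gradients(per_unit_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of the gradients reported by each computing unit."""
    num_units = len(per_unit_grads)
    # zip(*...) groups the k-th gradient component of every unit together.
    return [sum(component) / num_units for component in zip(*per_unit_grads)]


def apply_update(params: List[float], avg_grad: List[float], lr: float) -> List[float]:
    """Plain gradient-descent step (an assumed update rule, for illustration)."""
    return [p - lr * g for p, g in zip(params, avg_grad)]


# Three computing units, each holding the same two-parameter model.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = average_gradients(grads)
new_params = apply_update([0.0, 0.0], avg, lr=0.5)
```

Because every computing unit applies the same averaged gradient, the model copies stay identical after each iteration.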

[0138] As shown in Figure 5b, model parallelism specifically refers to dividing an AI model into multiple sub-model parts, with each sub-model part deployed on different computing units. During training, following the structural order of the AI ​​model, the sub-model parts on different computing units compute the same data from a dataset. In each iteration, gradient values ​​are obtained through the computation of the sub-model parts on different computing units. Forward or backward propagation based on these gradient values ​​updates the parameters of each sub-model part. After multiple iterations, once the training requirements are met, the sub-model parts on each computing unit are reorganized according to the structure of the AI ​​model to obtain the trained AI model. When the model is large, splitting the model for training allows for the combined computing power of multiple computing units to train the AI ​​model. The trained AI model can also be distributed according to its split structure, and distributed inference is performed when using this AI model for inference.

[0139] Based on the above two ideas of distributed training, a variety of distributed training methods have also evolved. The core idea is to combine the computing power of multiple computing units to improve the accuracy and efficiency of model training.

[0140] It's worth noting that the AI ​​Basic Development Platform 223 incorporates various mainstream AI development frameworks. These frameworks are typically open-source and are commonly used for developing deep learning models, also known as deep learning frameworks. Examples include PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch, and PyTorch. The AI ​​Basic Development Platform 223 can perform distributed training of AI models based on the distributed training logic included in these frameworks. When training an AI model using these frameworks, users only need to create a training job. When creating a training job, users can select the specific AI framework to use, as well as the required dataset, training output path, etc., and configure the necessary resources. When the user selects multiple nodes, the AI ​​Basic Development Platform 223 will utilize the distributed training logic within the selected framework to perform distributed training across multiple nodes.

[0141] It is worth noting that AI development based on the AI ​​basic development platform 223 is only an example and not a specific limitation. In some possible cases, developers can also install the AI ​​development framework on the terminal 210 and then develop AI models locally, or they can use the AI ​​development framework to develop AI models on other online platforms (such as online open source framework platforms).

[0142] After training the AI ​​model, it needs to be deployed.

[0143] In some possible scenarios, as shown in Figure 6, AI models can be deployed on nodes in the cloud or on nodes in an edge environment. Nodes in the cloud can be virtual machine instances, container instances, physical servers, etc. On the one hand, when the AI model is large-scale, it can be deployed in a distributed manner across multiple nodes, such as multiple computing units, based on the idea of model parallelism. On the other hand, AI models can also be deployed independently on multiple nodes to support a large volume of online service access. Nodes in the edge environment can be various edge devices. Based on the application requirements of the AI model, the AI basic development platform 223 can deploy the AI model to edge devices registered with the management platform 221.

[0144] Once deployed, the AI ​​model can become an AI application or a part of an AI application. As shown in Figure 7, users can access the AI ​​application online via a webpage or a client app. When the AI ​​application is used, it can invoke the AI ​​model deployed in the edge or cloud environment to provide a response via online invocation.

[0145] Therefore, the AI model developed and trained through the AI basic development platform 223 can perform inference on online request data and return the inference results. While the AI model provides online services, the management platform 221 can charge based on the number of times the AI model is invoked, or based on the resources consumed by the AI model's inference.

[0146] While the AI model provides online inference services, the AI basic development platform 223 can continuously collect the input and output data of the inference process, and continue to perform data optimization and model optimization updates based on that data. That is, it can continue to enrich the training dataset with data from the inference stage, and continue to optimize and train the AI model based on the inference-stage data and the corresponding manually confirmed results.

[0147] It should be understood that in other cases, AI models developed and trained by the aforementioned AI basic development platform 223 may not be deployed online; instead, users can download the trained AI model to their local machine for free local deployment. For example, users can choose to save the trained AI model to Object Storage Service (OBS), and then download the AI model from OBS to their local machine.

[0148] In other cases, after users have trained an AI model using the aforementioned AI basic development platform 223, they can publish it to the AI marketplace. The AI model in the AI marketplace can be subscribed to and used by other users. For example, the functionality of the AI model can be integrated into other users' AI applications.

[0149] It should be noted that the training and inference processes of the AI ​​models mentioned above both require the use of multiple computing units in the AI ​​cluster, such as CPU, GPU, and NPU, to implement the training and inference of the AI ​​models.

[0150] However, in related technologies, using multiple computing units to train and infer AI models can lead to errors and poor stability of the AI ​​cluster. To detect faults promptly and minimize their impact, online fault testing is necessary during the training and inference processes of the AI ​​model. This can be achieved by running pre-designed test cases online within the computing units (e.g., GPUs, CPUs, NPUs) and checking if the unit produces the correct calculation results. If correct, the test case passes; otherwise, the unit is at risk of failure.
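By way of illustration only, running a pre-designed test case and checking the calculation result can be sketched as follows. The class `FaultTestCase`, the function `run_test_case`, and the two stand-in compute functions are hypothetical; a real test case would run a known workload on the computing unit itself rather than a Python function.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FaultTestCase:
    """A pre-designed input together with its known-correct result."""
    name: str
    inputs: Sequence[int]
    expected: int


def run_test_case(compute: Callable[[Sequence[int]], int], case: FaultTestCase) -> bool:
    """Return True if the unit reproduces the expected result (test passes)."""
    return compute(case.inputs) == case.expected


def healthy_unit(values: Sequence[int]) -> int:
    # Stand-in for a correctly working computing unit.
    return sum(v * v for v in values)


def faulty_unit(values: Sequence[int]) -> int:
    # Stand-in for a unit with a fault: the lowest bit of the result is flipped.
    return sum(v * v for v in values) ^ 1


case = FaultTestCase(name="sum_of_squares", inputs=(1, 2, 3), expected=14)
passed_on_healthy = run_test_case(healthy_unit, case)
passed_on_faulty = run_test_case(faulty_unit, case)
```

A failed comparison does not identify the cause of the fault; it only marks the unit as being at risk, as described above.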

[0151] However, since the test cases need to be run during the training and inference of the AI model, the operation of the AI model has to be interrupted for fault testing, which may affect the user experience.

[0152] Based on this, embodiments of the present invention provide the following testing method for artificial intelligence clusters. For a specific parallel framework of an AI model (used to illustrate the deployment architecture of the AI ​​model in the AI ​​cluster), performance analysis is used to identify the time periods during which fault testing can be performed when the AI ​​model is running normally, thereby improving the reliability of the AI ​​cluster without affecting the normal operation of the AI ​​model.

[0153] Next, in conjunction with the model management system provided above, a testing method for an artificial intelligence cluster provided by an embodiment of the present invention will be described in detail.

[0154] Figure 8 is a flowchart illustrating the testing method for an artificial intelligence cluster provided in an embodiment of the present invention. This embodiment can be applied to a management platform 221, such as a cloud management platform. The management platform 221 manages infrastructure, which includes an AI cluster. The AI ​​cluster includes multiple computing units, such as N (a positive integer greater than or equal to 2), and the N computing units are used to run AI models.

[0155] Step 801: The management platform 221 obtains the architecture information of the AI ​​model. The architecture information is used to describe the deployment architecture of the AI ​​model in N computing units.

[0156] In some possible implementations, terminal 210 can access management platform 221 and input the architecture information of the AI ​​model.

[0157] In some possible embodiments, the architecture information can indicate a pipelined parallel architecture. A pipelined parallel architecture describes dividing the AI model into multiple parts and deploying them to different computing units within an AI cluster, such as N computing units. The output of the first part serves as the input of the second part, where the first part is any one of the multiple parts, and the second part is different from the first part. It should be noted that the principle of a pipelined parallel architecture is to divide the AI model into multiple parts, each part being assigned to a different computing unit. This allows each computing unit to run only a portion of the AI model, significantly reducing the memory consumption of a single computing unit and supporting the training of larger-scale AI models.

[0158] For example, as shown in Figure 9a, the AI ​​model is a multi-layered structure with interconnected layers, such as layer 1, layer 2, ..., layer 2N. Correspondingly, the multi-layered structure of the AI ​​model is divided into N parts and assigned to different computing units in N computing units. Each computing unit in the N computing units runs any one of the N parts, and the N parts run by the N computing units are different. Assume N computing units are denoted as computing unit 1, computing unit 2, ..., computing unit N, and the portions of the AI ​​model deployed in computing units 1, 2, ..., and N are denoted as model part 1, model part 2, ..., model part N. For example, as shown in Figure 9a, the AI ​​model consists of 2N layers. Layers 1 and 2 are deployed as model part 1 in computing unit 1, layers 3 and 4 are deployed as model part 2 in computing unit 2, ..., layers (2N-1) and 2N are deployed as model part N in computing unit N. If model part 1, model part 2, ..., model part N are sequentially connected to form an AI model, then the input data x is input to model part 1, the data output by model part 1 is used as the input of model part 2, the data output by model part 2 is used as the input of model part 3, ..., the data output by model part N-1 is used as the input of model part N.
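By way of illustration only, the assignment of layers to computing units in Figure 9a can be sketched as follows. The sketch assumes an even split of consecutive layers; the function names `partition_layers` and `run_pipeline` are hypothetical, and the model parts are represented by ordinary Python functions.

```python
from typing import Callable, Dict, List, Sequence


def partition_layers(num_layers: int, num_units: int) -> Dict[int, List[int]]:
    """Map computing unit k (1-based) to the consecutive layers it runs."""
    if num_units < 2 or num_layers % num_units != 0:
        raise ValueError("this sketch assumes N >= 2 and an even split of layers")
    per_unit = num_layers // num_units
    return {
        unit: list(range((unit - 1) * per_unit + 1, unit * per_unit + 1))
        for unit in range(1, num_units + 1)
    }


def run_pipeline(parts: Sequence[Callable[[float], float]], x: float) -> float:
    """Feed the output of each model part into the next one, in order."""
    for part in parts:
        x = part(x)
    return x


# 2N = 8 layers on N = 4 computing units: two layers per model part.
placement = partition_layers(num_layers=8, num_units=4)

# Toy model parts: each one stands in for the layers on one computing unit.
result = run_pipeline([lambda v: v + 1, lambda v: v * 2, lambda v: v - 3], 5.0)
```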

[0159] In some possible scenarios, the AI model consists of a gating network and multiple expert models connected to the gating network. The gating network is used to distribute the input data of the AI model to some or all of the multiple expert models. Such an AI model is a Mixture of Experts (MoE), a common neural network architecture designed to improve the performance and efficiency of the model by integrating multiple independent expert models. Correspondingly, the architecture information can indicate an expert parallel architecture, which describes the deployment of the multiple expert models on different computing units among the N computing units and the deployment of the gating network on each of the N computing units.

[0160] It's important to note that the basic concept of expert parallel architecture is to decompose complex tasks into multiple subtasks, each handled by an expert model, thereby achieving more efficient learning and prediction. In expert parallel architecture, the gating network is responsible for deciding which expert model the input data goes into. Expert parallelism is typically used during the training and inference of large models in expert parallel architecture. It improves training efficiency and performance by distributing multiple expert models across different computing units, overcoming limitations in GPU memory and bandwidth while maintaining high performance.

[0161] When the architecture information indicates an expert parallel architecture, different expert models are deployed on different computing units, such as GPUs, within the AI cluster. For example, the AI model includes M expert models (M being a positive integer greater than or equal to 2) and a gating network. The gating network is used to determine, from the input data, the data to be processed by at least some of the M expert models. The gating network is deployed in each of the N computing units, and is used to determine the data corresponding to each of the M expert models. The M expert models are deployed in different computing units among the N computing units. Specifically, each of the N computing units runs one or more of the M expert models, and the expert models run by different computing units are different. For example, as shown in Figure 10a, each of the N computing units deploys one expert model, and no two computing units deploy the same expert model; as shown in Figure 10b, each of the N computing units deploys two different expert models, and the expert models deployed on one computing unit differ from those deployed on every other computing unit.
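By way of illustration only, the placement of expert models and the routing performed by the gating network can be sketched as follows. The sketch assumes 0-based numbering, an even number of expert models per computing unit, and a top-1 gating rule (each input goes to the single highest-scoring expert); the top-1 rule and the names `place_experts`, `gate_top1`, and `inputs_per_unit` are assumptions made for this example and are not specified in this application.

```python
from typing import Dict, List, Sequence


def place_experts(num_experts: int, num_units: int) -> Dict[int, List[int]]:
    """Give each computing unit its own, non-overlapping block of experts."""
    if num_experts % num_units != 0:
        raise ValueError("this sketch assumes M is a multiple of N")
    per_unit = num_experts // num_units
    return {u: list(range(u * per_unit, (u + 1) * per_unit)) for u in range(num_units)}


def gate_top1(scores: Sequence[float]) -> int:
    """Assumed gating rule: pick the expert with the highest score."""
    return max(range(len(scores)), key=lambda e: scores[e])


def inputs_per_unit(all_scores: Sequence[Sequence[float]],
                    placement: Dict[int, List[int]]) -> Dict[int, int]:
    """Count how many inputs the gating network sends to each computing unit."""
    expert_to_unit = {e: u for u, experts in placement.items() for e in experts}
    counts = {u: 0 for u in placement}
    for scores in all_scores:
        counts[expert_to_unit[gate_top1(scores)]] += 1
    return counts


# M = 4 experts on N = 2 units (two experts per unit, as in Figure 10b).
placement = place_experts(num_experts=4, num_units=2)
scores = [
    [0.9, 0.1, 0.0, 0.0],  # routed to expert 0 (unit 0)
    [0.2, 0.7, 0.1, 0.0],  # routed to expert 1 (unit 0)
    [0.1, 0.1, 0.1, 0.7],  # routed to expert 3 (unit 1)
]
load = inputs_per_unit(scores, placement)
```

The per-unit counts show how routing decisions translate into uneven amounts of input data on different computing units.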

[0162] In some possible cases, the architecture information can be a parallel architecture for AI models, for example, as shown in Figure 11a, where each of the N computing units deploys an AI model.

[0163] It should be noted that before or after step 801, the management platform 221 can obtain the type and number N of the computing units. Based on the type and number N of the computing units, it can determine whether the AI ​​model is a large model. In the case where the AI ​​model is a large model, step 802 is executed.

[0164] Step 802: The management platform 221 determines the first fault test information based on the architecture information. The first fault test information is used to indicate the conditions for the target computing unit in the N computing units to perform fault tests when running the AI ​​model.

[0165] In this embodiment of the invention, the management platform 221 can determine, based on the architecture information, the conditions under which a target computing unit among the N computing units performs fault testing when running the AI model, and obtains the first fault test information from these conditions. The target computing unit can be any of the N computing units. The conditions for the target computing unit to perform fault testing when running the AI model can include one or more of the following: resources being in an idle state, the computing unit being in an idle time period, the resource utilization rate being less than or equal to a resource utilization rate threshold, and the target data volume (also referred to as the second data volume) being less than or equal to a data volume threshold. The resource utilization rate describes the ratio of computing resources used by the computing unit to its total computing resources, and the ratio of storage resources used to its total storage resources. The resource utilization rate threshold can be determined with reference to the minimum resource utilization rate of the N computing units running the AI model. The target data volume describes the amount of input data of the target computing unit, or the amount of input data of the expert model. The data volume threshold can be determined with reference to the total amount of input data x of the N computing units.
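By way of illustration only, the conditions carried by the first fault test information and their evaluation on the target computing unit can be sketched as follows. The classes `FaultTestInfo` and `UnitStatus` and the function `should_run_fault_test` are hypothetical. The sketch also assumes that, when several conditions are present, all of them must hold; this application does not specify how multiple conditions are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FaultTestInfo:
    """Hypothetical container for the conditions sent to a target unit."""
    require_idle: bool = False
    utilization_threshold: Optional[float] = None  # fraction in [0, 1]
    data_volume_threshold: Optional[int] = None    # e.g. number of input samples


@dataclass(frozen=True)
class UnitStatus:
    """Hypothetical snapshot of a computing unit while the AI model runs."""
    is_idle: bool
    utilization: float
    input_data_volume: int


def should_run_fault_test(info: FaultTestInfo, status: UnitStatus) -> bool:
    """Return True only if every configured condition holds (assumed AND)."""
    checks: List[bool] = []
    if info.require_idle:
        checks.append(status.is_idle)
    if info.utilization_threshold is not None:
        checks.append(status.utilization <= info.utilization_threshold)
    if info.data_volume_threshold is not None:
        checks.append(status.input_data_volume <= info.data_volume_threshold)
    # With no condition configured, never start a fault test.
    return bool(checks) and all(checks)


pipeline_info = FaultTestInfo(require_idle=True)
expert_info = FaultTestInfo(utilization_threshold=0.2, data_volume_threshold=16)
busy = UnitStatus(is_idle=False, utilization=0.9, input_data_volume=128)
lightly_loaded = UnitStatus(is_idle=False, utilization=0.1, input_data_volume=8)
```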

[0166] In some possible scenarios, where the architecture information is used to indicate a pipelined parallel architecture as shown in Figure 9a, the conditions for the target computing unit to perform fault testing while running the AI ​​model may include resources being in an idle state or an idle period. If the condition for the target computing unit to perform fault testing while running the AI ​​model is that resources are in an idle state, then the target computing unit will check whether its own resources are in an idle state, and if so, perform fault testing. If the condition for the target computing unit to perform fault testing while running the AI ​​model is an idle period, then the target computing unit will check whether it is in an idle period, and if so, perform fault testing.

[0167] It should be noted that, in the case of a pipelined parallel architecture, considering that the data processing of the model parts deployed in different computing units has a sequential order, there may be computing units that need to wait for other computing units to complete their calculations. During the waiting process, the resources of the computing units are in an idle state, and the occurrence of the idle state is regular. Therefore, in order to improve resource utilization, the condition for the target computing unit to perform fault testing when running the AI ​​model can be that the resources are in an idle state or during an idle period.

[0168] When the AI model is a neural network model, it needs to perform forward computation (Forward) and backward computation (Backward), which are explained below. Assume the Forward order of the N computing units is computing unit 1, computing unit 2, ..., computing unit N. As shown in Figure 9a, during the Forward process, model part 1 deployed on computing unit 1 performs its Forward and sends the result to computing unit 2; model part 2 deployed on computing unit 2 performs its Forward and sends the result to computing unit 3; and so on, until the Forward process is complete. After the Forward is complete, the Backward is performed in the reverse order: model part N deployed on computing unit N performs its Backward and sends the result to computing unit N-1; model part N-1 deployed on computing unit N-1 performs its Backward and sends the result to computing unit N-2; and so on, until computing unit 1 completes its Backward. In the model training scenario, after the Backward is complete, the gradients of each layer in the AI model need to be updated uniformly.

[0169] Building upon model parallelism, data parallelism can be further introduced. For example, let Fij denote the forward computation of computing unit i for the j-th batch, and Bij denote the backward computation of computing unit i for the j-th batch, where i represents the computing unit number and j represents the batch number. As shown in Figure 9b, computing unit 0 performs F00 on batch 0 and sends the result of F00 to computing unit 1; it then performs F01 on batch 1 and sends the result of F01 to computing unit 1, repeating this process until the result of F03 is sent to computing unit 1. Computing unit 1, after receiving the result of F00, performs F10 on batch 0 and sends the result of F10 to computing unit 2; it then performs F11 on batch 1 and sends the result of F11 to computing unit 2, repeating this process until the result of F13 is sent to computing unit 2. Computing units 2 and 3 are similar and will not be described further. After completing F33, computing unit 3 performs B33 on batch 3 and sends the result of B33 to computing unit 2; it then performs B32 on batch 2 and sends the result of B32 to computing unit 2, repeating this process until the result of B30 is sent to computing unit 2. Computing unit 2, after receiving the result of B33, performs B23 on batch 3 and sends the result of B23 to computing unit 1; it then performs B22 on batch 2 and sends the result of B22 to computing unit 1, repeating this process until the result of B20 is sent to computing unit 1. Computing units 1 and 0 are similar and will not be described further.

[0170] Therefore, in a pipelined parallel architecture, when N computing units run an AI model in parallel, the computations of different computing units have a sequential order. As a result, each computing unit will regularly have a bubble (idle time), which can be used to test the faults of the computing unit. Therefore, the condition for testing the faults of the target computing unit when it runs the AI ​​model can be an idle time period (bubble) or the resource being in an idle state. For example, as shown in Figure 9c, for computing unit 1, there is a bubble (idle time) after the Forward is completed. Computing unit 1 can use the bubble to test the faults of computing unit 1. The same applies to computing unit 2, and will not be described in detail.
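By way of illustration only, the regular occurrence of bubbles in a pipelined parallel architecture can be sketched by laying the Fij and Bij computations on a time grid. The sketch assumes that every forward or backward computation takes exactly one time slot, that communication time is ignored, and that all forward computations finish before any backward computation starts; the function names `pipeline_schedule` and `idle_slots` are hypothetical.

```python
from typing import Dict, List, Optional


def pipeline_schedule(num_units: int, num_batches: int) -> List[List[Optional[str]]]:
    """Return timeline[i][t]: the operation unit i runs in slot t, or None."""
    phase_len = num_units + num_batches - 1  # slots needed by the forward phase
    total_slots = 2 * phase_len              # forward phase, then backward phase
    timeline: List[List[Optional[str]]] = [[None] * total_slots for _ in range(num_units)]
    for i in range(num_units):
        for j in range(num_batches):
            # Fij can start once unit i-1 has finished batch j.
            timeline[i][i + j] = f"F{i}{j}"
            # Backward runs from the last unit to the first, last batch first.
            back_slot = phase_len + (num_units - 1 - i) + (num_batches - 1 - j)
            timeline[i][back_slot] = f"B{i}{j}"
    return timeline


def idle_slots(timeline: List[List[Optional[str]]]) -> Dict[int, List[int]]:
    """Slots in which each unit has nothing to do (candidate test windows)."""
    return {i: [t for t, op in enumerate(row) if op is None]
            for i, row in enumerate(timeline)}


# Four computing units (0..3) and four batches (0..3), as in Figure 9b.
timeline = pipeline_schedule(num_units=4, num_batches=4)
bubbles = idle_slots(timeline)
```

Under these assumptions each computing unit is idle for 2 × (N − 1) slots per iteration, and the idle slots fall at the same positions in every iteration, which is why an idle time period can serve as a fault testing condition.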

[0171] It is worth noting that in a pipeline parallel scenario, although there is data parallelism, the data processing time still has a sequence. Therefore, the computing unit needs to wait for other computing units to complete their calculations. During the waiting process, the computing unit's resources are idle. In order to improve resource utilization, the condition for fault testing when the target computing unit is running the AI ​​model can be an idle time period (bubble) or resources being idle.

[0172] In some possible scenarios, the architecture information is used to indicate an expert parallel architecture, such as that shown in Figure 10a or 10b. Because the amounts of input data of the multiple expert models differ under an expert parallel architecture, the resource utilization of the N computing units varies, and some computing units may have low resource utilization. In this case, fault testing can be performed on the computing units with low resource utilization to improve resource utilization. Therefore, the condition for the target computing unit to undergo fault testing when running the AI model can be one or more of the following: resources are idle, the resource utilization rate is less than or equal to the resource utilization rate threshold, and the target data volume is less than or equal to the data volume threshold. The target data volume describes the number or size of the input data of the expert model deployed in the target computing unit.

[0173] It should be noted that different expert models are deployed on different computing units, such as GPUs, and the amount of input data varies among these models. In one example, the target computing unit might have idle resources. To improve resource utilization, the condition for fault testing when running the AI ​​model could include resources being idle. In another example, considering the varying amounts of input data for different expert models, when processing less data, the computing unit has a lower load and more idle resources. To improve resource utilization, the condition for fault testing when running the AI ​​model could include the target data volume being less than or equal to a data volume threshold, and the resource utilization rate being less than or equal to a resource utilization rate threshold.

[0174] The target data volume is explained below.

[0175] For example, the data input to the N computing units is the same, as shown in Figure 10c. The input data x is input into the gating network of each of the N computing units. Here, the input data x can be training data or data sent by a large number of users when the AI model is used for inference. Taking the scenario in Figure 10c as an example, during the operation of the AI model, the gating network of the target computing unit processes the input data x to determine the data corresponding to each of the M expert models. The data corresponding to expert model i, which is deployed in the target computing unit, is input into expert model i. Therefore, the amount of data processed by expert model i in the target computing unit is the amount of data that the gating network assigns to expert model i.

[0176] For example, if a data parallel approach is adopted, the input data of the N computing units differ. For instance, as shown in Figure 10d, the input data x is divided into N parts x1, x2, ..., xN, which are input into the gating networks of computing units 1, 2, ..., N, respectively. Continuing with the scenario in Figure 10d, during the operation of the AI model, the gating network of the target computing unit processes its input data to determine the data corresponding to each of the M expert models. The data corresponding to expert model i, which is deployed in the target computing unit, can be input directly into expert model i. The data corresponding to the other expert models needs to be sent to the computing units on which those expert models are located. Therefore, the amount of data processed by expert model i in the target computing unit is the sum of the amount of data assigned to expert model i by the gating network of the target computing unit and the amount of data for expert model i sent by the other computing units.

[0177] Therefore, when the architecture information is used to indicate an expert parallel architecture, different expert models are deployed on different computing units, such as GPUs. When the gating network computation of the N computing units is completed in the scenario shown in Figure 10c or Figure 10d, if the amount of data allocated to the expert model on a computing unit, such as a GPU, is small, the computational load on that computing unit is low and fault testing can be performed on it. For example, as shown in Figure 10e, if, during a certain round of training or inference, the amount of input data allocated to GPU4 after the gating network computation is completed (10) is less than or equal to a preset threshold (15), GPU4 can undergo fault testing.
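
The target data volume of an expert model and the threshold comparison can be sketched as follows. This is a minimal sketch, not the application's implementation: the function names, the representation of a gating result as a list of expert indices, and the example routing are hypothetical. Only the values 10 and 15 echo the GPU4 example above.

```python
from collections import Counter


def expert_data_volumes(routing_per_unit):
    """Sum, per expert, the items routed to it by every gating network.

    `routing_per_unit` holds one list per gating network; each list contains
    the index of the expert chosen for each input item (an assumed format).
    """
    volumes = Counter()
    for routing in routing_per_unit:
        volumes.update(routing)
    return volumes


def experts_eligible_for_test(volumes, num_experts, threshold):
    """Return the experts whose target data volume is <= the threshold."""
    return [e for e in range(num_experts) if volumes.get(e, 0) <= threshold]


# Two gating networks (data-parallel case); expert 3 plays the role of GPU4.
routing = [
    [0] * 20 + [1] * 18 + [2] * 16 + [3] * 4,
    [0] * 10 + [1] * 12 + [2] * 9 + [3] * 6,
]
volumes = expert_data_volumes(routing)
eligible = experts_eligible_for_test(volumes, num_experts=4, threshold=15)
```

With a single gating network in `routing_per_unit`, the same function covers the case in which all computing units receive the same input data.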

[0178] In some possible scenarios, the architecture information is used to indicate an AI model parallel architecture, as shown in Figure 11a. As shown in Figure 11a, each of the N computing units deploys the AI model, the AI model is data-parallel, and at least some of the N computing units process different amounts of data. The condition for the target computing unit to perform fault testing when running the AI model is that the target data volume (the amount of input data to the computing unit) is less than or equal to a data volume threshold, or that the resource utilization rate is less than or equal to a resource utilization rate threshold. The target data volume describes the number or size of the data input to the target computing unit. For example, as shown in Figure 11b, if, during a certain round of training or inference, the input data volume allocated to GPU1 (10) is less than or equal to a preset threshold (15), GPU1 can perform fault testing.

[0179] It should be noted that an AI model can be trained by dividing the data into multiple batches, and within each batch the amounts of data input to the N computing units can differ. For example, for each of the N computing units, the amount of data processed by that unit can increase or decrease from batch to batch. To ensure the effectiveness of AI model training, every one of the N computing units needs to traverse all the data volumes over the multiple batches, so that the total amount of data processed by each of the N computing units remains consistent. For example, the input data can be divided into N batches, with the N computing units receiving N different data volumes in each batch. Each computing unit processes one of the N data volumes in each batch, and the data volume it processes differs from batch to batch, so that over the N batches it traverses all N data volumes.

[0180] In summary, as shown in Figure 12, when the architecture information indicates a pipelined parallel architecture, the conditions for fault testing of the target computing unit when running the AI ​​model can be that the resources are idle or the target computing unit is idle for a specific period of time. When the architecture information indicates an expert parallel architecture, the conditions for fault testing of the target computing unit when running the AI ​​model can include any one or more of the following: resources are idle, the target data volume is less than or equal to a data volume threshold, and resource utilization is less than or equal to a resource utilization threshold. Furthermore, when the architecture information indicates an AI model parallel architecture, and the data volumes input to the N computing units differ, the conditions for fault testing of the target computing unit when running the AI ​​model can include any one or more of the following: the target data volume is less than or equal to a data volume threshold, and resource utilization is less than or equal to a resource utilization threshold. It is worth noting that if the architecture information indicates a combination of pipelined parallel architecture and expert parallel architecture, the conditions for fault testing of the target computing unit when running the AI ​​model can include any one or more of the following: resources are idle, an idle period of time, the target data volume is less than or equal to a data volume threshold, and resource utilization is less than or equal to a resource utilization threshold.

[0181] The conditions for fault testing when the target computing unit runs the AI model may include one or more of the following: resources are idle, the resource utilization rate is less than or equal to the resource utilization rate threshold, and the target data volume is less than or equal to the data volume threshold. In the following possible implementations, the management platform 221 stores the correspondence between the description information of the AI model (including at least the architecture information) and the detection conditions (the conditions for fault testing when running the AI model). This correspondence can be carried by a table (called a mapping table for ease of description and distinction). The detection conditions may include one or more of the following: resources are idle, the data volume is less than or equal to the data volume threshold, and the resource utilization rate is less than or equal to the resource utilization rate threshold. For example, each entry of the mapping table may contain the description information of an AI model (including at least the architecture information) and the corresponding detection conditions. After obtaining the description information of the AI model (including at least the architecture information), the management platform 221 matches it against the mapping table and takes the detection conditions in the matched row as the conditions for fault testing when the target computing unit runs the AI model.
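
A mapping table of this kind could be sketched as a dictionary keyed by architecture. The sketch below is illustrative only: the enum `Condition`, the architecture keys, the `"architectures"` field of the description information, and the inclusion of an idle time period condition (taken from the summary in paragraph [0180]) are assumptions rather than a format defined by the application.

```python
from enum import Enum


class Condition(Enum):
    RESOURCES_IDLE = "resources_idle"
    IDLE_PERIOD = "idle_period"
    DATA_VOLUME_LE_THRESHOLD = "data_volume_le_threshold"
    UTILIZATION_LE_THRESHOLD = "utilization_le_threshold"


# Hypothetical mapping table: architecture -> detection conditions.
MAPPING_TABLE = {
    "pipeline_parallel": frozenset(
        {Condition.RESOURCES_IDLE, Condition.IDLE_PERIOD}
    ),
    "expert_parallel": frozenset(
        {
            Condition.RESOURCES_IDLE,
            Condition.DATA_VOLUME_LE_THRESHOLD,
            Condition.UTILIZATION_LE_THRESHOLD,
        }
    ),
    "model_data_parallel": frozenset(
        {Condition.DATA_VOLUME_LE_THRESHOLD, Condition.UTILIZATION_LE_THRESHOLD}
    ),
}


def lookup_conditions(description):
    """Match the model description against the table and merge the rows.

    A combined architecture (for example pipeline plus expert parallelism)
    yields the union of the conditions of each matched row.
    """
    conditions = set()
    for architecture in description["architectures"]:
        conditions |= MAPPING_TABLE[architecture]  # KeyError if no row matches
    return conditions


combined = lookup_conditions(
    {"architectures": ["pipeline_parallel", "expert_parallel"]}
)
```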

[0182] In some other possible implementations, Figure 13 shows a flowchart of step 802 provided in an embodiment of the present invention. As shown in Figure 13, step 802 further includes at least the following steps:

[0183] Step 8021: The management platform 221 obtains the data allocation information of the AI ​​model. The data allocation information is used to explain the strategy of allocating the input data of the AI ​​model to N computing units.

[0184] In some possible implementations, terminal 210 can access the management platform 221, accept the data allocation information of the AI model as input, and send it to the management platform 221, so that the management platform 221 obtains the data allocation information.

[0185] For example, as shown in Figure 14a, the data allocation information may include inputting the input data x of the AI ​​model into N computing units respectively.

[0186] For example, data allocation information may include data parallelism. For instance, as shown in Figure 14b, the input data x is divided into N parts, and the N parts are input into N computing units. The number of data in each of the N parts may be the same or different, and the specific design can be combined with the actual situation.

[0187] For example, data allocation information may include batch allocation, which can be understood as dividing the data into multiple parts, with each part of the data serving as the input for one training session. In practical applications, data allocation information may include the number of batches. For example, as shown in Figure 14c, the input data x is divided into multiple batches: batch0, batch1, ... For each batch, taking batch0 as an example, the data in batch0 is denoted as x0, and x0 is input into N computing units.

[0188] For example, data allocation information may include batch allocation and data parallelism. For instance, as shown in Figure 14d, the input data x is divided into multiple batches: batch0, batch1, ... For each batch, taking batch0 as an example, the data in batch0 is denoted as x0. x0 is divided into N parts: x01, x02, ..., x0N. x01, x02, ..., x0N are input into N computing units.
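
The allocation strategies described for Figures 14a to 14d can be sketched with two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the even split is only one possible choice, since the application allows the parts to be equal or unequal in size.

```python
def split_evenly(items, parts):
    """Split `items` into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def allocate(x, num_batches, num_units, data_parallel):
    """Return allocation[batch][unit] = data given to that unit in that batch.

    num_batches == 1 corresponds to no batch allocation; data_parallel=False
    means every unit receives the whole batch.
    """
    batches = split_evenly(x, num_batches)
    if data_parallel:
        return [split_evenly(batch, num_units) for batch in batches]
    # Every unit sees the same batch data (the list object is shared, not copied).
    return [[batch] * num_units for batch in batches]


x = list(range(12))
batched_and_split = allocate(x, num_batches=2, num_units=3, data_parallel=True)
```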

[0189] It should be noted that the input data x can be the data during training or the data sent by a large number of users when applying AI model inference.

[0190] Step 8022: The management platform 221 determines the resource usage information based on the architecture information and data allocation information. The resource usage information is used to describe the computing resource usage of the N computing units running the AI ​​model.

[0191] In practical implementation, the management platform can deploy the AI model in the AI cluster based on the architecture information and, based on the data allocation information, input the input data x of the AI model into the N computing units. Each of the N computing units can collect resource usage parameters, which describe the resource usage of the computing unit while it runs the AI model. For example, the resource usage parameters can include the resource utilization rate and/or the resource status of the computing unit, where the resource status can be idle or running (indicating that computation is in progress). Subsequently, each of the N computing units can send its resource usage parameters to the management platform, which thereby obtains the resource usage information, that is, the resource usage parameters of each of the N computing units while they run the AI model.

[0192] For example, as shown in Figure 15a (where the numbers 1, 2, 3, and 4 indicate the order), in practical applications, terminal 210 can send a model training request to the management platform 221. The model training request includes the number N of computing units, the architecture of the AI model, the deployment architecture of the AI model, the data allocation information, and the type of computing unit. Based on the type of computing unit, the management platform 221 selects a suitable AI cluster, deploys the AI model on N computing units in the AI cluster according to the deployment architecture of the AI model, and allocates data to the N computing units according to the data allocation information. Afterward, each of the N computing units can collect resource usage parameters and report them to the management platform 221. It should be noted that the above model training request is merely an example. In other possible scenarios, as shown in Figure 15b (where the numbers 1, 2, 3, and 4 indicate the order), the management platform 221 deploys the AI model on N computing units in the AI cluster according to the deployment architecture of the AI model. Then, a large number of terminals 210 can send model call requests to the management platform 221, each including the AI model identifier and user data. For each model call request, the management platform 221 allocates the user data to the N computing units according to the data allocation information used during AI model training, and the AI models deployed on the N computing units perform inference. During the inference process, resource usage parameters can be collected and reported to the management platform 221.

[0193] The above implementation is merely an example. In other possible implementations, simulation can be used to simulate the training or inference process of an AI model based on architectural and data allocation information, thereby determining resource usage information.

[0194] Additionally, the management platform 221 can determine resource usage parameters based on architecture information. For example, when the architecture information is used to indicate a pipelined parallel architecture, such as that shown in Figure 9a, the resource usage parameter can be resource status. When the architecture information is used to indicate an expert parallel architecture, such as that shown in Figure 10a or 10b, the resource usage parameter can be resource utilization rate, or it can be both resource utilization rate and resource status.

[0195] Step 8023: Based on resource usage information, the management platform 221 determines the first fault test information corresponding to the target computing unit.

[0196] The management platform 221 performs performance analysis based on resource usage information to determine the conditions for fault testing when the target computing unit runs the AI ​​model, thereby obtaining the first fault test information.

[0197] In one example, as shown in Figure 16a, when the management platform 221 determines that the target computing unit has idle resources, it does not need to determine the idle time period of the resources being idle, and uses the idle state of the resources as a condition for the target computing unit to perform fault testing when running the AI ​​model.

[0198] For example, the architecture information is used to indicate a pipelined parallel architecture, such as that shown in Figure 9a. Because the model parts deployed on the different computing units process data in a sequential order, a computing unit needs to wait for other computing units to complete their calculations, and its resources are idle while it waits. In order to improve resource utilization, the condition for the target computing unit to perform fault testing when running the AI model may include that the resources are idle.

[0199] For example, the architecture information is used to indicate an expert parallel architecture, such as shown in Figure 10a or 10b. Considering that the amount of input data of multiple expert models under the expert parallel architecture is different, the resource utilization of N computing units is different. Some computing units may have idle resources. In order to improve resource utilization, the condition for the target computing unit to perform fault testing when running the AI ​​model may include that the resources are idle.

[0200] In another example, as shown in Figure 16a, if the management platform 221 determines that the target computing unit has idle resources and the idle time periods appear regularly, it can determine the idle time periods of the target computing unit based on the resource usage information. For example, the pattern of the occurrence of idle time periods can be discovered based on the resource usage information, and the subsequent idle time periods can be predicted based on the pattern of the occurrence of idle time periods. Subsequently, the idle time periods are used as a condition for fault testing when the target computing unit runs the AI ​​model.

[0201] For example, as shown in Figure 16b, when the management platform 221 determines that the architecture information is a pipelined parallel architecture, it determines the idle time model of the computing unit according to the pipelined parallel architecture. The idle time model is used to indicate the occurrence pattern of resource usage status and resource idle status, such as the alternation of resource usage status and resource idle status. Based on the idle time model and resource usage information, the idle time period for the target computing unit to run the AI ​​model is determined, and the idle time period is used as a condition for the target computing unit to perform fault testing when running the AI ​​model.

[0202] When the AI model is a neural network model, the AI model performs forward computation and then, after the forward computation is completed, backward computation. Correspondingly, as shown in Figure 10b, the idle time model can be: Forward → Bubble → Backward → Bubble → Forward → Bubble. Therefore, if the current computation type of the computing unit is backward, the next Bubble time period is the sum of the backward computation duration and the most recent Bubble time period; if the current computation type is forward, the next Bubble time period is the sum of the forward computation duration and the most recent Bubble time period. In a specific implementation, for each of the N computing units (taking the target computing unit as an example), the management platform 221 can determine, based on the resource usage information, the backward computation duration, the forward computation duration, the most recent idle time period A, and whether the current computation type is backward or forward. If the current computation type is backward, the next idle time period is the sum of idle time period A and the backward computation duration. If the current computation type is forward, the next idle time period is the sum of idle time period A and the forward computation duration.
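
The prediction rule above can be written down directly. The sketch below is one possible reading and not the application's implementation: it assumes that the most recent idle time period A is represented by a single timestamp (for example the moment that idle period ended) and that the durations use the same time unit. The function and parameter names are hypothetical.

```python
def predict_next_idle(recent_idle_a, current_type, forward_duration, backward_duration):
    """Predict when the next idle period (bubble) occurs.

    Assumption: `recent_idle_a` is a timestamp standing for the most recent
    idle period A; the next bubble follows once the current computation,
    forward or backward, has run for its usual duration.
    """
    if current_type == "backward":
        return recent_idle_a + backward_duration
    if current_type == "forward":
        return recent_idle_a + forward_duration
    raise ValueError(f"unknown computation type: {current_type!r}")


# Idle period A observed at t = 10.0; forward takes 3.0, backward takes 5.0.
next_after_forward = predict_next_idle(10.0, "forward", 3.0, 5.0)
next_after_backward = predict_next_idle(10.0, "backward", 3.0, 5.0)
```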

[0203] In some possible scenarios, as shown in Figure 16a, the management platform 221 determines, based on the resource usage information, how the resource usage of the N computing units compares, for example whether their resource utilization rates differ. For instance, the platform determines the difference between the maximum and minimum resource utilization rates of the N computing units; if the difference is large, for example greater than a set difference threshold, resource usage can be considered to differ. A data volume threshold for the target data volume can then be determined, where the target data volume indicates the amount of data processed by the target computing unit. The target data volume being less than or equal to the data volume threshold is used as a condition for fault testing when the target computing unit runs the AI model.

[0204] It should be noted that the amounts of data processed by the N computing units may differ when the architecture information indicates an expert parallel architecture as shown in Figures 10a and 10b, or when it indicates an AI model parallel architecture in which at least some of the N computing units process different amounts of data, as shown in Figure 11a. In the former case, the target data volume describes the number or size of the input data of the expert model deployed in the target computing unit. In the latter case, the target data volume is the number or size of the data input to the target computing unit.

[0205] In some possible implementations, as shown in Figure 16c, when the management platform 221 determines that the architecture information indicates an expert parallel architecture, it determines the total amount of data (also called the first data volume) and the number of batches based on the data allocation information. The total amount of data indicates the amount of input data for the N computing units, which can be the total amount of all input data of the AI model. When the data allocation scheme is as shown in Figure 14a or 14b, the number of batches is 1. When the data allocation scheme is as shown in Figure 14c or 14d, the number of batches is greater than 1. Based on the resource usage information, the total amount of data, and the number of batches, a data volume threshold for the target data volume is determined. The target data volume indicates the amount of input data for the expert model deployed in the target computing unit. The target data volume being less than or equal to the data volume threshold is used as a condition for the target computing unit to perform fault testing when running the AI model.

[0206] In practical implementation, the management platform 221 can determine the total amount of data and the number of batches based on data allocation information. For example, in the scenario shown in Figure 10c, the input data volume is the number of input data x, and in the scenario shown in Figure 10d, the input data volume is x1 + x2 + ... + xN. Based on the total amount of data and the number of batches, the batch data volume (used to describe the data volume of any batch) is determined. Based on resource usage information, the proportion of the batch data volume is determined, and the data volume threshold is obtained by multiplying the batch data volume by the proportion of the batch data volume. It should be noted that the resource usage information may include the resource utilization rate of each computing unit in N units. The higher the minimum resource utilization rate, the lower the proportion of the batch data volume, so that the target computing unit has sufficient resources for fault testing. For example, in practical applications, multiple resource utilization rate intervals and the proportions corresponding to the resource utilization rate intervals can be set. The management platform 221 determines the proportion corresponding to the resource utilization rate interval where the minimum resource utilization rate is located based on the resource usage information.
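
The threshold calculation just described can be sketched as follows. This is a minimal sketch: the utilization intervals and their proportions in `PROPORTION_BY_UTILIZATION` are invented for illustration (the application only states that such intervals can be set and that a higher minimum utilization rate yields a lower proportion), and the function name is hypothetical.

```python
# Hypothetical table: (upper bound of the utilization interval in percent,
# proportion of the batch data volume). Higher utilization -> lower proportion.
PROPORTION_BY_UTILIZATION = [(30, 0.5), (60, 0.25), (100, 0.125)]


def data_volume_threshold(total_volume, num_batches, utilizations):
    """Compute the data volume threshold for the target data volume.

    batch data volume = total data volume / number of batches
    threshold         = batch data volume * proportion
    where the proportion is chosen by the interval that contains the minimum
    resource utilization rate among the N computing units.
    """
    batch_volume = total_volume / num_batches
    lowest = min(utilizations)
    for upper_bound, proportion in PROPORTION_BY_UTILIZATION:
        if lowest <= upper_bound:
            return batch_volume * proportion
    raise ValueError("utilization rate above 100 percent")


# 1000 input items in 10 batches; utilization rates reported by three units.
threshold = data_volume_threshold(1000, 10, utilizations=[20, 70, 90])
```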

[0207] In some other possible implementations, as shown in Figure 16d, the management platform 221 determines the total data volume and the number of batches based on the data allocation information, and determines the data volume threshold for the target data volume based on the total data volume, the number of batches, the number N of computing units, and the number of model parameters of the AI model. For example, it determines the batch data volume based on the total data volume and the number of batches; determines an initial ratio based on the batch data volume; adjusts the initial ratio based on the number N of computing units and the number of model parameters of the AI model to obtain the batch data volume ratio; and multiplies the batch data volume ratio by the total data volume to obtain the data volume threshold. It should be noted that the larger the number of model parameters of the AI model and the smaller the number N of computing units, the lower the batch data volume ratio.

[0208] It should be noted that the target data volume threshold is merely an example. In some possible implementations, the target data volume can be replaced by a data percentage, which describes the proportion of the target data volume within the batch data volume. In some possible cases, such as the scenario shown in Figure 11a, the data percentage describes the proportion of the input data volume of the target computing unit within the batch data volume; in some possible cases, such as the scenarios shown in Figure 10c or Figure 10d, the data percentage describes the proportion of the input data volume of the expert model within the target computing unit within the batch data volume.

[0209] It is worth noting that, in the scenarios shown in Figures 11a and 11b, the proportion of batch data can also be determined based on the minimum amount of input data for each of the N computing units, for example, slightly larger than the minimum amount of input data.

[0210] In some possible scenarios, as shown in Figure 16a, the management platform 221 determines the resource usage of N computing units based on resource usage information, such as differences in resource utilization rates. Then, it can determine a resource utilization rate threshold and use a resource utilization rate less than or equal to the resource utilization rate threshold as a condition for fault testing of the target computing unit when running the AI ​​model.

[0211] As shown in Figure 16e, when the management platform 221 determines, based on the resource usage information, that the resource utilization rates of the N computing units differ, it determines a resource utilization rate threshold. A resource utilization rate less than or equal to this threshold is used as a condition for fault testing of the target computing unit when running the AI model. The resource utilization rate threshold can be determined based on the minimum resource utilization rate of the N computing units. For example, the resource usage information may include the resource utilization rate of each of the N computing units. The management platform 221 determines the minimum resource utilization rate from this information and then determines the resource utilization rate threshold based on it. The resource utilization rate threshold can be slightly greater than the minimum resource utilization rate.
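
A possible way to derive such a threshold is sketched below. The function names and the concrete values of `diff_threshold` and `margin` are assumptions; the application only requires that the utilization rates differ and that the threshold be slightly greater than the minimum.

```python
def utilization_threshold(utilizations, diff_threshold=20, margin=5):
    """Return a utilization rate threshold in percent, or None.

    The threshold is set only when the spread between the highest and the
    lowest utilization rate exceeds `diff_threshold`; it is then placed
    `margin` percentage points above the minimum (both values are assumed).
    """
    lowest, highest = min(utilizations), max(utilizations)
    if highest - lowest <= diff_threshold:
        return None  # utilization rates are too similar to single out a unit
    return lowest + margin


def units_to_test(utilizations, threshold):
    """Indices of the computing units whose utilization rate is <= threshold."""
    return [i for i, rate in enumerate(utilizations) if rate <= threshold]


rates = [90, 85, 30, 33]            # utilization rate of each unit, in percent
limit = utilization_threshold(rates)
candidates = units_to_test(rates, limit)
```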

[0212] It should be noted that, as shown in Figure 16a, when the management platform 221 determines from the resource usage information that the resource usage of the N computing units differs, for example in resource utilization rate, the condition for the target computing unit to perform fault testing when running the AI model can be that the target data volume is less than or equal to the data volume threshold, and/or that the resource utilization rate is less than or equal to the resource utilization rate threshold.

[0213] Step 803: The management platform 221 sends the first fault test information to the target computing unit among the N computing units.

[0214] In this solution, the computing unit performs fault testing while running the AI ​​model, eliminating the need for fault testing at other times during model training. This improves resource utilization and reduces the impact on subsequent model training and inference.

[0215] Figure 17 shows a flowchart illustrating another testing method for an artificial intelligence cluster provided by an embodiment of the present invention.

[0216] As shown in Figure 17, based on steps 801 to 803 shown in Figure 8, this embodiment of the invention further includes at least the following steps:

[0217] Step 804: The management platform 221 sends test cases to the target computing unit. The test cases are used to test whether the target computing unit has any faults.

[0218] In this embodiment of the invention, test cases are used to illustrate the input data, data processing logic, and expected results.

[0219] It should be noted that test cases can be designed based on the actual needs of fault testing.

[0220] In one example, when fault testing includes silent data error detection, test cases are designed based on silent data errors. For instance, test cases are used to test whether the computing unit can perform calculations correctly.

[0221] In one example, when fault testing includes performance testing, test cases are designed based on performance characteristics. For instance, test cases are used to test whether the computing unit's performance is normal. For example, test cases might check whether the computing unit's processing time is between 5 and 8 seconds.
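
A test case carrying input data, data processing logic, and an expected result might be modelled as shown below. This is a sketch under assumptions: the class `FaultTestCase`, the returned status strings, and the matrix multiplication workload are illustrative, and the duration range merely mirrors the "between 5 and 8 seconds" style of check mentioned above.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class FaultTestCase:
    inputs: tuple                      # input data
    logic: Callable[..., Any]          # data processing logic
    expected: Any                      # expected result
    seconds_range: Optional[Tuple[float, float]] = None  # allowed duration


def matmul(a, b):
    """Plain-Python matrix product, used here as a stand-in workload."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def run_test_case(case):
    start = time.perf_counter()
    result = case.logic(*case.inputs)
    elapsed = time.perf_counter() - start
    if result != case.expected:
        # The unit finished without raising an error but computed a wrong value.
        return "silent_data_error"
    if case.seconds_range is not None:
        low, high = case.seconds_range
        if not (low <= elapsed <= high):
            return "performance_abnormal"
    return "ok"


case = FaultTestCase(
    inputs=([[1, 2], [3, 4]], [[5, 6], [7, 8]]),
    logic=matmul,
    expected=[[19, 22], [43, 50]],
    seconds_range=(0.0, 60.0),
)
status = run_test_case(case)
```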

[0222] Step 805: If the target computing unit determines that the condition indicated by the first fault test information is met, it runs the test case.

[0223] In some possible scenarios, as shown in Figure 9a, the first fault test information indicates that the condition for the target computing unit to perform fault testing when running the AI model is an idle time period. The target computing unit runs test cases when it detects that it is in an idle time period and stops running test cases after the idle time period ends. In this scenario, the management platform 221 can send the first fault test information to the target computing unit in real time while the AI cluster is running the AI model.

[0224] In some possible scenarios, as shown in Figure 9a, the first fault test information indicates that the target computing unit performs fault testing, when running the AI model, under the condition that the resources are in an idle state. The target computing unit runs test cases when it detects that the resources are idle, and stops running test cases when the resources are in use. In this scenario, the management platform 221 can determine the first fault test information before the AI cluster runs the AI model and send the first fault test information to the target computing unit.

[0225] In some possible scenarios, the AI model is as shown in Figure 10a or 10b. The first fault test information specifies the conditions under which the target computing unit performs fault testing when running the AI model: the resources are idle, or the target data volume is less than or equal to the data volume threshold. The target computing unit runs test cases when it detects that the resources are idle or that the target data volume input to the expert model is less than or equal to the data volume threshold. The target data volume is the amount of data that the gating network deployed on the N computing units allocates to the expert model deployed on the target computing unit. In this scenario, the management platform 221 can determine the first fault test information and send it to the target computing unit either before the AI cluster runs the AI model or while the AI cluster is running the AI model.

[0226] In some possible scenarios, the AI model is as shown in Figure 10a or 10b. The first fault test information specifies the conditions under which the target computing unit performs fault testing when running the AI model: the resources are in an idle state, or the resource utilization rate is less than or equal to the resource utilization rate threshold. The target computing unit runs test cases when it detects that the resources are in an idle state or that the resource utilization rate is less than or equal to the resource utilization rate threshold. In this scenario, the management platform 221 can determine the first fault test information and send it to the target computing unit either before the AI cluster runs the AI model or while the AI cluster is running the AI model.

[0227] In some possible scenarios, the AI model is as shown in Figure 11a. The first fault test information specifies the conditions under which the target computing unit performs fault testing when running the AI model: the target data volume is less than or equal to the data volume threshold, or the resource utilization rate is less than or equal to the resource utilization rate threshold. The target computing unit runs test cases when it detects that the data volume input to the AI model is less than or equal to the data volume threshold, or that the resource utilization rate is less than or equal to the resource utilization rate threshold. In this scenario, the management platform 221 can determine the first fault test information and send it to the target computing unit either before the AI cluster runs the AI model or while the AI cluster is running the AI model.
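The condition checks described in the scenarios above can be sketched as follows. This is only a minimal illustration, not the implementation of the embodiment: the names FaultTestCondition, UnitSnapshot, and should_run_test, and the encoding of the first fault test information as three optional fields, are assumptions introduced here. The sketch treats the configured conditions as alternatives, so a test case may run as soon as any one of them holds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultTestCondition:
    """Hypothetical encoding of the first fault test information."""
    allow_when_idle: bool = False                  # run when the resource is idle
    data_volume_threshold: Optional[int] = None    # run when target data volume <= threshold
    utilization_threshold: Optional[float] = None  # run when utilization <= threshold


@dataclass(frozen=True)
class UnitSnapshot:
    """Hypothetical view of what a computing unit observes locally."""
    resource_idle: bool
    target_data_volume: int  # e.g. amount of data routed to the local expert model
    utilization: float       # fraction in [0, 1]


def should_run_test(cond: FaultTestCondition, snap: UnitSnapshot) -> bool:
    """Return True if any configured condition is satisfied (logical OR)."""
    if cond.allow_when_idle and snap.resource_idle:
        return True
    if (cond.data_volume_threshold is not None
            and snap.target_data_volume <= cond.data_volume_threshold):
        return True
    if (cond.utilization_threshold is not None
            and snap.utilization <= cond.utilization_threshold):
        return True
    return False
```

For example, a condition built with allow_when_idle=True and data_volume_threshold=128 corresponds to the expert-model scenario: the unit tests itself when its resources are idle or when at most 128 units of data are routed to its expert model.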

[0228] In some cases, the target computing unit can persist test cases to a persistent storage medium. Subsequently, the computing unit can read the test cases from the storage medium and run them. Specifically, the persistent storage medium includes: a database, a Ceph storage device, a Hadoop Distributed File System (HDFS), a Storage Area Network (SAN) storage device, a Network Attached Storage (NAS) storage device, a Redundant Array of Independent Disks (RAID), or an object storage system. For ease of description, this specification uses a database as an example of the persistent storage medium. The database can be, for example, a relational database or an object-oriented database. The test cases stored on the persistent storage medium can be invoked by the computing unit later. The computing unit runs the test cases to determine whether the computing unit has experienced a failure.
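As a rough illustration of persisting test cases to a relational database, the sketch below uses SQLite from the Python standard library as a stand-in. The choice of SQLite, the table layout, and the function names open_store, save_test_case, and load_test_cases are assumptions made for this example; the specification does not prescribe a schema or a particular database product.

```python
import json
import sqlite3
from typing import Any, List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical test-case store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS test_cases ("
        "name TEXT PRIMARY KEY, "
        "input_json TEXT NOT NULL, "
        "expected_json TEXT NOT NULL)"
    )
    return conn


def save_test_case(conn: sqlite3.Connection, name: str,
                   input_data: Any, expected: Any) -> None:
    """Persist one test case; a case with the same name is overwritten."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO test_cases VALUES (?, ?, ?)",
            (name, json.dumps(input_data), json.dumps(expected)),
        )


def load_test_cases(conn: sqlite3.Connection) -> List[Tuple[str, Any, Any]]:
    """Read all stored test cases back, ordered by name."""
    rows = conn.execute(
        "SELECT name, input_json, expected_json FROM test_cases ORDER BY name"
    )
    return [(name, json.loads(i), json.loads(e)) for name, i, e in rows]
```

Only the input data and expected results are stored here; the data processing logic would have to be identified separately (for example by name), since executable logic is not something this sketch serializes.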

[0229] In this embodiment of the invention, test cases are used to describe the input data, data processing logic, and expected results. The computing unit runs the test cases, processes the input data according to the data processing logic, and obtains the processing results. If the processing results are inconsistent with the expected results, it indicates that the computing unit has malfunctioned.

[0230] In one example, when fault testing includes silent data error detection, test cases can be used to test whether the computing unit can perform correct calculations. Correspondingly, the target computing unit runs the test cases to check whether it can obtain the correct calculation results. If it is correct, the test case passes; if it is incorrect, the target computing unit has a fault risk.

[0231] In one example, when fault testing includes performance testing, test cases can be used to test whether the performance of the computing unit is normal. Correspondingly, the target computing unit runs the test cases and checks whether the processing time of the target computing unit is between 5 and 8 seconds. If it is, the test case is passed; if not, the target computing unit has a fault risk.
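A test case of the kind described above (input data, data processing logic, expected results) and its evaluation can be sketched as follows. The names FaultTestCase and run_test_case, the optional time_window_s field, and the injectable clock parameter are hypothetical and are introduced only for illustration; the 5 to 8 second window is the example value given in the text.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class FaultTestCase:
    """Hypothetical test case: input data, processing logic, expected result."""
    input_data: Any
    logic: Callable[[Any], Any]
    expected: Any
    # Optional (min, max) processing time in seconds for performance testing.
    time_window_s: Optional[Tuple[float, float]] = None


def run_test_case(case: FaultTestCase,
                  clock: Callable[[], float] = time.perf_counter) -> bool:
    """Return True if the test case passes, False if a fault risk is detected."""
    start = clock()
    result = case.logic(case.input_data)
    elapsed = clock() - start

    # Silent data error check: the computed result must equal the expected one.
    if result != case.expected:
        return False

    # Performance check: processing time must fall inside the window, if given.
    if case.time_window_s is not None:
        low, high = case.time_window_s
        return low <= elapsed <= high
    return True
```

A silent data error check would then be, for instance, run_test_case(FaultTestCase([1, 2, 3], sum, 6)), while a performance check additionally sets time_window_s=(5.0, 8.0).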

[0232] It should be noted that running test cases for fault testing is only an example. The specific fault testing plan can be flexibly designed according to actual needs. For example, if any two computing units out of N computing units run the same code and the results are different, the computing unit with the risk can be identified.
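The cross-checking idea mentioned above can be sketched as follows. The text does not state how the risky computing unit is singled out when results differ; this sketch assumes a strict-majority vote over the results of all participating units, which is one possible design and not necessarily the one intended. The function name find_suspect_units is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def find_suspect_units(results: Dict[str, Hashable]) -> List[str]:
    """Return ids of units whose result differs from the strict-majority result.

    `results` maps a unit id to the (hashable) result it produced for the
    same code and the same input.
    """
    if not results:
        return []
    value, count = Counter(results.values()).most_common(1)[0]
    if count * 2 <= len(results):
        # No strict majority (e.g. two units that disagree): the faulty unit
        # cannot be singled out, so every participant is flagged for retest.
        return sorted(results)
    return sorted(unit for unit, res in results.items() if res != value)
```

With only two units and differing results, the comparison shows that a risk exists but not which unit carries it; the sketch therefore flags both so that a further test can disambiguate.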

[0233] Figure 18 shows a flowchart illustrating another testing method for an artificial intelligence cluster provided by an embodiment of the present invention.

[0234] As shown in Figure 18, based on steps 801 to 803 shown in Figure 8, this embodiment of the invention further includes at least the following steps:

[0235] Step 806: The management platform 221 obtains fault indication information, which is used to indicate that at least one of the N computing units has failed.

[0236] In practical applications, the computing unit reports its own status (running, idle) to the management platform 221. If the management platform 221 does not receive the status of the computing unit, it determines that the computing unit is faulty.

[0237] Step 807: The management platform 221 determines the second fault test information based on the fault indication information. The second fault test information is used to instruct the computing units that have not experienced faults among the N computing units to perform fault tests.

[0238] Step 808: The management platform 221 sends the second fault test information to the computing units among the N computing units that have not experienced a fault.

[0239] When an AI cluster experiences a failure, the faulty server node or computing unit is typically isolated, and a new computing machine is then scheduled into the AI cluster for training configuration. Therefore, when a computing unit fails, a new computing unit needs to be scheduled for configuration. During this time, other normal (non-failed) computing units will be idle. To improve resource utilization, these idle periods can be used for fault testing, such as running test cases to check for silent data errors.
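Steps 806 to 808 can be sketched as follows. The text only says that a computing unit is considered faulty when the management platform does not receive its status; the timeout-based rule, the function names find_failed_units and plan_second_fault_test, and the dictionary used as the second fault test information are all assumptions made for this illustration.

```python
from typing import Dict, Iterable, List


def find_failed_units(all_units: Iterable[str],
                      last_report: Dict[str, float],
                      now: float,
                      timeout_s: float) -> List[str]:
    """Step 806 (sketch): a unit is treated as failed if it never reported
    or if its last status report is older than `timeout_s` seconds."""
    failed = []
    for unit in all_units:
        seen = last_report.get(unit)
        if seen is None or now - seen > timeout_s:
            failed.append(unit)
    return failed


def plan_second_fault_test(all_units: Iterable[str],
                           failed: Iterable[str]) -> Dict[str, dict]:
    """Steps 807 and 808 (sketch): build one message per non-failed unit.

    The returned mapping stands in for "sending" the second fault test
    information; a real system would transmit each entry to its unit.
    """
    failed_set = set(failed)
    if not failed_set:
        return {}  # no failure, so there is no replacement-induced idle time
    message = {"action": "run_fault_test", "reason": "cluster_fault"}
    return {u: dict(message) for u in all_units if u not in failed_set}
```

The healthy units that receive the message can then use the waiting time, while the replacement computing unit is being scheduled and configured, to run their test cases.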

[0240] It is worth noting that having the management platform 221 send the second fault test information to the computing units that have not failed when a computing unit in the AI cluster fails is merely an example. In other possible implementations, the management platform 221 may send the second fault test information to each computing unit in the AI cluster in advance. Subsequently, when the AI cluster fails, the management platform 221 sends AI cluster fault information to the computing units that have not failed, so that these computing units are aware that the AI cluster has failed and begin to perform fault testing.

[0241] It should be noted that the management platform 221 mentioned above is only an example of the execution entity. In some other possible implementations, the AI cluster has a management node that can execute the steps performed by the management platform 221, such as steps 801 to 803 shown in Figure 8, steps 8021 to 8023 shown in Figure 13, steps 804 and 805 shown in Figure 17, or steps 806 to 808 shown in Figure 18.

[0242] In some possible scenarios, when terminal 210 sends a model training request as shown in Figure 15a to management platform 221, management platform 221 can select a suitable AI cluster based on the type of computing unit and determine the model training task based on the model training request. The model training task is then sent to the management node of the AI cluster. The management node deploys the AI model in N computing units according to the deployment architecture of the AI model, and allocates data to the N computing units according to the data allocation information. Afterwards, each of the N computing units can collect resource usage parameters (such as resource utilization rate or resource status) and report them to the management node. Subsequently, after obtaining the resource usage information (the resource usage parameters of the N computing units), the management node can determine the conditions for each computing unit to perform fault testing when running the AI model. Examples of such conditions are: the resource is in an idle state, there is an idle time period, the resource utilization rate is less than or equal to the resource utilization rate threshold, or the target data volume is less than or equal to the data volume threshold. For details, please refer to Figures 16a to 16e and their descriptions, which will not be repeated here.

[0243] In other possible scenarios, after the management platform 221 has deployed the AI model in N computing units of the AI cluster according to the AI model deployment architecture, the terminal 210 sends a model call request as shown in Figure 15b to the management node of the AI cluster. In response to the model call request, the management node allocates user data to the N computing units according to the data allocation information used during AI model training. The AI models deployed in the N computing units perform inference. During the inference process, resource usage parameters can be collected and reported to the management node. Subsequently, after obtaining the resource usage information (the resource usage parameters of the N computing units), the management node can determine the conditions for each of the N computing units to perform fault testing when running the AI model. Examples of such conditions are: the resources are in an idle state, there is an idle period, the resource utilization rate is less than or equal to the resource utilization rate threshold, or the target data volume is less than or equal to the data volume threshold. For details, please refer to Figures 16a to 16e and their descriptions, which will not be repeated here.

[0244] As shown in Figure 19, this embodiment of the invention also provides a testing device for an artificial intelligence cluster. The device is applied to a management platform that manages infrastructure, and the infrastructure includes an artificial intelligence cluster. The artificial intelligence cluster includes multiple computing units, which are used to run an artificial intelligence model (AI model). The device includes:

[0245] An architecture acquisition module is used to acquire the architecture information of the artificial intelligence model, the architecture information being used to describe the deployment architecture of the artificial intelligence model in the multiple computing units;

[0246] The condition acquisition module is used to determine first fault test information based on the architecture information. The first fault test information is used to indicate the conditions under which the target computing unit among the plurality of computing units performs fault testing when running the artificial intelligence model.

[0247] The sending module is used to send the first fault test information to the target computing unit among the plurality of computing units.

[0248] The architecture acquisition module, condition acquisition module, and sending module can all be implemented in software or hardware. The following uses the architecture acquisition module as an example to describe the implementation. The condition acquisition module and the sending module can be implemented in the same way as the architecture acquisition module.

[0249] As an example of a software functional unit, the architecture acquisition module may include code running on a computing instance. This computing instance may include at least one of a physical host (computing device), a virtual machine, or a container. Furthermore, there may be one or more such computing instances. For example, the architecture acquisition module may include code running on multiple hosts / virtual machines / containers.

[0250] It should be noted that the multiple hosts / virtual machines / containers used to run this code can be distributed within the same region or across different regions. Furthermore, these multiple hosts / virtual machines / containers can be distributed within the same availability zone (AZ) or across different AZs, each AZ comprising one or more geographically proximate data centers. Typically, a region can include multiple AZs.

[0251] Similarly, multiple hosts / virtual machines / containers used to run this code can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.

[0252] As an example of a hardware functional unit, the architecture acquisition module may include at least one computing device, such as a server. Alternatively, the architecture acquisition module may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD can be implemented using a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

[0253] The architecture acquisition module includes multiple computing devices that can be distributed within the same region or in different regions. Similarly, these computing devices can be distributed within the same Availability Zone (AZ) or in different AZs. Likewise, they can be distributed within the same Virtual Private Cloud (VPC) or multiple VPCs. These multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

[0254] It should be noted that, in other embodiments, the architecture acquisition module can be used to execute any step of the above-mentioned artificial intelligence cluster testing method, such as shown in Figures 8, 13, 17, or 18; the condition acquisition module can be used to execute any step of the above-mentioned artificial intelligence cluster testing method, such as shown in Figures 8, 13, 17, or 18; and the sending module can be used to execute any step of the above-mentioned artificial intelligence cluster testing method, such as shown in Figures 8, 13, 17, or 18. The steps implemented by the architecture acquisition module, the condition acquisition module, and the sending module can be specified as needed. The architecture acquisition module, the condition acquisition module, and the sending module respectively implement the different steps of the above-mentioned artificial intelligence cluster testing method, such as shown in Figures 8, 13, 17, or 18, to realize all the functions of the artificial intelligence cluster testing device.

[0255] As shown in Figure 20, the present invention also provides a computing device 2000. The computing device includes a bus 2002, a processor 2004, a memory 2006, and a communication interface 2008. The processor 2004, the memory 2006, and the communication interface 2008 communicate with each other via the bus 2002. It should be understood that the present invention does not limit the number of processors and memories in the computing device 2000.

[0256] Bus 2002 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. Bus 2002 can include pathways for transmitting information between various components of computing device 2000 (e.g., memory 2006, processor 2004, communication interface 2008).

[0257] Processor 2004 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

[0258] The memory 2006 may include volatile memory, such as random access memory (RAM). The memory 2006 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid-state drive (SSD). The memory 2006 stores executable program code, which the processor 2004 executes to implement the aforementioned testing method for the artificial intelligence cluster, as shown in Figures 8, 13, 17, or 18. Specifically, the memory 2006 stores the instructions that the testing device for the artificial intelligence cluster uses to execute the aforementioned testing method for the artificial intelligence cluster, as shown in Figures 8, 13, 17, or 18.

[0259] The communication interface 2008 uses transceiver modules, such as, but not limited to, network interface cards and transceivers, to enable communication between the computing device 2000 and other devices or communication networks.

[0260] This invention also provides a computing device cluster. As shown in Figure 21, the computing device cluster includes at least one computing device 2000. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.

[0261] As shown in Figure 21, the computing device cluster includes at least one computing device 2000. The memory 2006 of one or more computing devices 2000 in the computing device cluster may store the same instructions for executing the test methods of the above-mentioned artificial intelligence cluster, such as those shown in Figures 8, 13, 17 or 18.

[0262] In some possible implementations, one or more computing devices 2000 in the computing device cluster can also be used to execute some of the instructions of the aforementioned artificial intelligence cluster testing methods, such as those shown in Figures 8, 13, 17, or 18. In other words, a combination of one or more computing devices 2000 can jointly execute the instructions of the aforementioned artificial intelligence cluster testing methods, such as those shown in Figures 8, 13, 17, or 18.

[0263] It should be noted that the memory 2006 in different computing devices 2000 within the computing device cluster can store different instructions, which are used to execute certain functions of the aforementioned artificial intelligence cluster testing methods, such as those shown in Figures 8, 13, 17, or 18. That is, the instructions stored in the memory 2006 of different computing devices 2000 can implement the functions of one or more modules among the architecture acquisition module, condition acquisition module, and sending module.

[0264] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 22 illustrates one possible implementation. As shown in Figure 22, two computing devices 2000A and 2000B are connected via a network. Specifically, they are connected to the network through the communication interface in each computing device. In this type of possible implementation, the memory 2006 in computing device 2000A stores instructions for executing the functions of the architecture acquisition module and the condition acquisition module. Meanwhile, the memory 2006 in computing device 2000B stores instructions for executing the functions of the sending module.

[0265] The connection method between the computing device clusters shown in Figure 22 can be considered as follows: considering that the testing method of the artificial intelligence cluster provided by the present invention needs to send the first fault test information to the target computing unit, the functions implemented by the condition acquisition module and the sending module are considered to be executed by the computing device 2000A.

[0266] It should be understood that the functions of computing device 2000A shown in Figure 22 can also be performed by multiple computing devices 2000. Similarly, the functions of computing device 2000B can also be performed by multiple computing devices 2000.

[0267] This invention also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device can store data, or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to execute the aforementioned testing method for the artificial intelligence cluster, as shown in Figures 8, 13, 17, or 18.

[0268] This invention also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any available medium. When the computer program product is run on at least one computer device, it causes the at least one computer device to execute the aforementioned testing method for the artificial intelligence cluster, as shown in Figures 8, 13, 17, or 18.

[0269] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.

Claims

1. A testing method for an artificial intelligence cluster, characterized in that, The method is applied to a management platform for managing infrastructure, including an artificial intelligence cluster. The artificial intelligence cluster includes multiple computing units for running artificial intelligence models. The method includes: Obtain the architecture information of the artificial intelligence model, which is used to describe the deployment architecture of the artificial intelligence model in the multiple computing units; Based on the architecture information, first fault test information is determined, which is used to indicate the conditions under which the target computing unit among the plurality of computing units performs fault testing when running the artificial intelligence model; The first fault test information is sent to the target computing unit among the plurality of computing units.

2. The method according to claim 1, characterized in that, The method further includes: Obtain the data allocation information of the artificial intelligence model, wherein the data allocation information is used to describe the strategy for allocating the input data of the artificial intelligence model to the plurality of computing units; The step of determining the first fault test information based on the architecture information includes: Based on the architecture information and the data allocation information, resource usage information is determined, which is used to describe the resource usage of the multiple computing units running the artificial intelligence model; Based on the resource usage information, the first fault test information is determined.

3. The method according to claim 2, characterized in that, When the architecture information is a pipelined parallel architecture, the method further includes: Based on the pipelined parallel architecture, the idle time model of the plurality of computing units is determined, wherein the pipelined parallel architecture is used to describe how multiple parts of the artificial intelligence model are deployed to different computing units in the artificial intelligence cluster, and the output of the first part of the plurality of parts is the input of the second part of the plurality of parts; The determination of the first fault test information based on the resource usage information includes: Based on the idle time model and the resource usage information, determine the idle time period of the target computing unit among the plurality of computing units when running the artificial intelligence model; The idle time period is used as a condition for the target computing unit to perform fault testing when running the artificial intelligence model, and the first fault test information is obtained.

4. The method according to claim 2, characterized in that, When the architecture information is an expert parallel architecture, the expert parallel architecture describes multiple expert models deployed to different computing units in the artificial intelligence cluster, wherein the multiple expert models are connected to a gating network, and the gating network is used to distribute the input data of the artificial intelligence model to at least some or all of the multiple expert models, and the gating network is deployed in each of the multiple computing units; the method further includes: Based on the data allocation information, the total amount of data and the number of batches are determined, wherein the total amount of data is used to indicate the amount of input data of the multiple computing units; The determination of the first fault test information based on the resource usage information includes: Based on the resource usage information, the total amount of data, and the number of batches, a data volume threshold for the target data volume is determined. The target data volume is used to indicate the amount of input data of the expert model deployed in the target computing unit among the multiple computing units. Using the target data volume being less than or equal to the data volume threshold as a condition for the target computing unit to perform fault testing when running the artificial intelligence model, first fault test information is obtained.

5. The method according to any one of claims 1 to 4, characterized in that, The method further includes: Test cases are sent to the target computing unit, and the test cases are used to test whether the target computing unit has any faults.

6. The method according to any one of claims 1 to 5, characterized in that, The method further includes: Obtain fault indication information, which is used to indicate that at least one of the plurality of computing units has failed; Based on the fault indication information, second fault test information is determined, which is used to instruct the computing units among the plurality of computing units that have not experienced a fault to perform a fault test. The second fault test information is sent to the computing unit among the plurality of computing units that has not experienced a fault.

7. A testing device for an artificial intelligence cluster, characterized in that, The device is applied to a management platform for managing infrastructure, the infrastructure including an artificial intelligence cluster comprising multiple computing units for running artificial intelligence models. The device includes: An architecture acquisition module, used to acquire the architecture information of the artificial intelligence model, the architecture information being used to describe the deployment architecture of the artificial intelligence model in the multiple computing units; A condition acquisition module, used to determine first fault test information based on the architecture information, the first fault test information being used to indicate the conditions under which the target computing unit among the plurality of computing units performs fault testing when running the artificial intelligence model; A sending module, used to send the first fault test information to the target computing unit among the plurality of computing units.

8. The apparatus according to claim 7, characterized in that, The condition acquisition module is used to acquire the data allocation information of the artificial intelligence model, and the data allocation information is used to describe the strategy of allocating the input data of the artificial intelligence model to the multiple computing units; Based on the architecture information and the data allocation information, resource usage information is determined, which is used to describe the resource usage of the multiple computing units running the artificial intelligence model; Based on the resource usage information, the first fault test information is determined.

9. The apparatus according to claim 8, characterized in that, When the architecture information is a pipelined parallel architecture, the condition acquisition module is used to determine the idle time model of the plurality of computing units according to the pipelined parallel architecture, wherein the pipelined parallel architecture is used to indicate that multiple parts of the artificial intelligence model are deployed to different computing units in the artificial intelligence cluster, and the output of the first part of the plurality of parts is the input of the second part of the plurality of parts; based on the idle time model and the resource usage information, the idle time period of the target computing unit in the plurality of computing units when running the artificial intelligence model is determined; the idle time period is used as the condition for the target computing unit to perform fault testing when running the artificial intelligence model, and the first fault test information is obtained.

10. The apparatus according to claim 8, characterized in that, When the architecture information is an expert parallel architecture, the expert parallel architecture is used to describe that multiple expert models are deployed to different computing units in the artificial intelligence cluster, wherein the multiple expert models are connected to a gating network, and the gating network is used to distribute the input data of the artificial intelligence model to at least some or all of the multiple expert models, and the gating network is deployed in each of the multiple computing units; The condition acquisition module is used to determine the total amount of data and the number of batches based on the data allocation information, wherein the total amount of data is used to indicate the amount of input data of the multiple computing units; and to determine a target data amount threshold based on the resource usage information, the total amount of data, and the number of batches, wherein the target data amount is used to indicate the amount of input data of the expert model deployed in the target computing unit among the multiple computing units; and to use the target data amount being less than or equal to the data amount threshold as a condition for the target computing unit to perform fault testing when running the artificial intelligence model, thereby obtaining first fault test information.

11. The apparatus according to any one of claims 7 to 10, characterized in that, The sending module is also used to send test cases to the target computing unit, and the test cases are used to test whether the target computing unit has a fault.

12. The apparatus according to any one of claims 7 to 11, characterized in that, The condition acquisition module is further configured to acquire fault indication information, which indicates that at least one of the plurality of computing units has failed; and to determine second fault test information based on the fault indication information, which instructs the computing units that have not failed among the plurality of computing units to perform fault testing. The sending module is further configured to send the second fault test information to the non-faulty computing unit among the plurality of computing units.

13. A computing device cluster, characterized in that, It includes at least one computing device, each computing device including a processor and memory; The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method as described in any one of claims 1 to 6.

14. A computer program product containing instructions, characterized in that, When the instructions are executed by a cluster of computing devices, they cause the cluster of computing devices to perform the method as described in any one of claims 1 to 6.

15. A computer-readable storage medium, characterized in that, It includes computer program instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method as described in any one of claims 1 to 6.