Task training method and device, equipment and storage medium
A task training method, applied in the fields of program control devices, program control design, instruments, etc. It addresses the problems that the normal operation of training tasks cannot be guaranteed and that resources are wasted, and it achieves high-performance network card multiplexing and improved training efficiency.
Pending Publication Date: 2022-06-21
SUZHOU LANGCHAO INTELLIGENT TECH CO LTD
AI-Extracted Technical Summary
Problems solved by technology
However, for the Roce network card and Infiniband network card in the AI server, due to the situation that the bond cannot be performed and the resources are wasted, two ...
Method used
As can be seen, the embodiment of the present application performs virtualization processing on multiple physical network cards of the AI server respectively to obtain multiple virtual network cards, wherein a virtual relationship exists between each physical network card and its corresponding virtual network cards. All the virtual network cards are then grouped to obtain different resource groups, wherein the virtual relationships between the virtual network cards and the physical network cards within each resource group are not repeated. Finally, the virtual network cards in the target resource group are used to provide network resources for the target task on the AI server, so as to train the target task. On the basis of virtualizing each physical network card into multiple virtual network cards, the virtual network cards of different physical network cards are bound together by means of resource grouping. During training, when any network card in the resource group becomes abnormal, the other, non-abnormal network cards in the resource group can continue to provide network resources, so that high-performance network cards are reasonably allocated to training tasks and high-performance network card multiplexing is realized.
Abstract
The invention discloses a task training method and device, equipment and a storage medium, and the method comprises the steps: carrying out the virtualization processing of a plurality of physical network cards of an artificial intelligence server, so as to obtain a plurality of virtual network cards; wherein a virtual relationship exists between each physical network card and the corresponding virtual network card; all the virtual network cards are grouped, so that different resource groups are obtained; wherein the virtual relationship between the virtual network card and the physical network card in each resource group is not repeated; and providing network resources for a target task on the artificial intelligence server by using the virtual network card in a target resource group so as to train the target task. In the training process, when any network card in the resource group is abnormal, other network cards without abnormity in the resource group can be utilized to continuously provide network resources, so that high-performance network cards are reasonably allocated to the training task, high-performance network card multiplexing is realized, and the task training efficiency is improved.
Application Domain
Data switching details; Software simulation/interpretation/emulation
Technology Topic
Virtualization; Real-time computing
Example Embodiment
[0040] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
[0041] In the prior art, the ROCE NICs and InfiniBand NICs in an AI server cannot be bonded, which wastes resources. Because two high-performance NICs are generally not bonded, the normal operation of a training task cannot be guaranteed when a high-performance NIC becomes abnormal. In view of these technical defects, the present application provides a task training scheme that reasonably allocates high-performance network cards to training tasks and realizes multiplexing of high-performance network cards, thereby improving the efficiency of task training.
[0042] Figure 1 is a flowchart of a task training method provided in an embodiment of the present application. As shown in Figure 1, the task training method includes:
[0043] S11: Perform virtualization processing on multiple physical network cards of the artificial intelligence server, respectively, to obtain multiple virtual network cards; wherein, there is a virtual relationship between each of the physical network cards and the corresponding virtual network card.
[0044] In this embodiment, virtualization processing is performed on multiple physical network cards of an artificial intelligence server (hereinafter referred to as an AI server) to obtain multiple virtual network cards. The physical network cards include but are not limited to high-performance network cards such as ROCE and Infiniband. The following embodiments take ROCE network cards as an example for description, and other high-performance network cards can also achieve the same technical effect.
[0045] In this embodiment, during virtualization processing, the AI server is first controlled to enter the Basic Input Output System (BIOS) to enable the first option (the VT-D option) of the virtualization technology supporting direct input/output (I/O) access. The network card driver is then installed in the AI server and the second option (the SRIOV option) of that virtualization technology is enabled, so that the multiple physical network cards of the AI server are respectively virtualized. For a ROCE network card, the network card driver is the driver of the Mellanox ROCE network card. For example, one physical ROCE NIC is virtualized into 16 virtual ROCE NICs, that is, 1 PF (physical function) is virtualized into 16 VFs (virtual functions). If the AI server has 2 physical ROCE NICs, 32 virtual ROCE NICs are finally visible. It should be noted that there is a virtual relationship between each physical network card and its corresponding virtual network cards, and resources are subsequently grouped according to this relationship. VT-D stands for Intel Virtualization Technology for Directed I/O, which is part of Intel Virtualization Technology. The full name of SRIOV is Single Root I/O Virtualization. Based on this technology, a physical network card is virtualized into multiple lightweight PCI-E devices, and these lightweight PCI-E devices can be allocated to containers or virtual machines.
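As an illustration of the SRIOV step only, the following minimal Python sketch assumes a Linux host whose network card driver exposes the standard `sriov_totalvfs` and `sriov_numvfs` sysfs attributes. The function name `enable_vfs` and the interface name used in the usage note are hypothetical, and the patent itself does not specify how the virtual functions are created.

```python
from pathlib import Path


def enable_vfs(pf_name: str, num_vfs: int, sysfs_root: str = "/sys/class/net") -> int:
    """Request `num_vfs` virtual functions (VFs) on the physical function `pf_name`.

    Assumption: the driver exposes the standard Linux SR-IOV sysfs attributes.
    """
    device_dir = Path(sysfs_root) / pf_name / "device"
    total = int((device_dir / "sriov_totalvfs").read_text().strip())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"{pf_name} supports at most {total} VFs, requested {num_vfs}")

    numvfs_file = device_dir / "sriov_numvfs"
    # The kernel does not allow changing one non-zero VF count to another
    # directly, so reset the count to 0 before writing the new value.
    numvfs_file.write_text("0")
    numvfs_file.write_text(str(num_vfs))
    return num_vfs
```

On a real host this would be called with root privileges, for example `enable_vfs("ens1f0", 16)` for each physical ROCE NIC, where `ens1f0` stands in for the actual interface name.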
[0046] In particular, this embodiment implements network card multiplexing on an AI training platform built on a specific container orchestration platform, such as the Kubernetes platform. Kubernetes is an open source container orchestration project. For this purpose, it is necessary to deploy a Kubernetes cluster and a resource management component. The resource management component is built on the DevicePlugin mechanism of Kubernetes and abstracts the differing numbers of physical network card resources on multiple AI servers into a single virtual network resource that represents the network resources of the physical servers. Even when different servers in the cluster have different numbers of physical network cards, they can all be represented by one resource name, which simplifies how resources are applied for and used when creating training tasks. For example, when creating a training task, applying for "roce-network" (the resource name) indicates that the task applies for the physical network card resources present on the physical host where the container is located. These may be a single physical network card or multiple physical network cards, depending on the number of NICs on the physical host, and there is no need to apply for a separate resource for each physical NIC. When a training task is created, it specifies an application for virtual (ROCE) network card resources, and the virtual (ROCE) network cards are allocated to the training task (container) by the resource scheduling and allocation mechanism of Kubernetes.
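A minimal sketch of such a resource application is shown below as a Pod manifest built in Python. Kubernetes extended resources advertised by a device plugin are named with a domain prefix, so the prefix `example.com/` is an assumption added here for illustration; the patent only gives the name "roce-network". The function name, pod name and image are likewise hypothetical.

```python
import json

# Hypothetical extended-resource name; only "roce-network" comes from the text.
RESOURCE_NAME = "example.com/roce-network"


def build_training_pod(name: str, image: str, nic_groups: int = 1) -> dict:
    """Build a Pod manifest that requests `nic_groups` units of the virtual NIC resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # Extended resources are requested under `limits`;
                    # the scheduler places the Pod on a node advertising them.
                    "resources": {"limits": {RESOURCE_NAME: nic_groups}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_training_pod("train-job-0", "trainer:latest"), indent=2))
```

The task requests one unit of the abstract resource regardless of how many physical NICs the node has, which mirrors the single-resource-name design described above.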
[0047] S12: Group all the virtual network cards to obtain different resource groups; wherein the virtual relationship between the virtual network cards and the physical network cards in each resource group is not repeated.
[0048] In this embodiment, all the virtual network cards are grouped to obtain different resource groups, wherein the virtual relationships between the virtual network cards and the physical network cards within each resource group are not repeated. During grouping, the virtual relationships between the virtual network cards and the physical network cards in the physical host are automatically detected, and the virtual relationship between each physical network card and its corresponding virtual network cards, together with the resource groups, is reported to the Kubernetes platform so that the resource groups can be allocated through the Kubernetes platform. That is, virtual network cards belonging to different physical network cards are combined and reported to the Kubernetes platform: a virtual network card belonging to one physical network card is paired with a virtual network card of another physical network card. It can be understood that when there are more physical network cards in the physical machine, pairing continues according to this rule. Each combined resource represents one virtual network resource. After the resource is reported to the Kubernetes platform, a training task container automatically applies for this type of resource when it is created.
[0049] As shown in Figure 2, there are two physical ROCE NICs in an AI server, and each physical ROCE NIC is virtualized into two virtual ROCE NICs, that is, PF1 is virtualized as 1VF-a/1VF-b, and PF2 is virtualized as 2VF-a/2VF-b. During resource grouping, 1VF-a is selected from PF1 and 2VF-a is selected from PF2 to obtain the resource group "1VF-a/2VF-a". In this resource group, the virtual relationship between 1VF-a and PF1 and the virtual relationship between 2VF-a and PF2 do not overlap.
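The pairing rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the patent's implementation: the function name `group_virtual_nics` is hypothetical, and the VF names follow the Figure 2 example with PF1's functions named 1VF-a and 1VF-b.

```python
from typing import Dict, List, Tuple


def group_virtual_nics(pf_to_vfs: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Form resource groups that contain exactly one VF from each physical NIC.

    The i-th VF of every PF goes into the i-th group, so no group holds two
    VFs that share the same physical NIC. If PFs expose different numbers of
    VFs, the surplus VFs are left ungrouped (zip stops at the shortest list).
    """
    return list(zip(*pf_to_vfs.values()))


groups = group_virtual_nics({
    "PF1": ["1VF-a", "1VF-b"],
    "PF2": ["2VF-a", "2VF-b"],
})
print(groups)  # [('1VF-a', '2VF-a'), ('1VF-b', '2VF-b')]
```

With more physical NICs in the host, each group simply grows by one VF per additional PF, which matches the "pairing can be continued according to this rule" remark.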
[0050] S13: Use the virtual network card in the target resource group to provide network resources for the target task on the artificial intelligence server, so as to train the target task.
[0051] In this embodiment, the virtual network cards in the target resource group are used to provide network resources for the target task on the AI server, so as to train the target task. Before this, the target task and the target container for running the target task need to be created. On AI platforms, containers are usually used to allocate and isolate GPUs. When multiple tasks run on a single AI server at the same time, high-performance network cards need to be allocated to containers. For containers running on different servers, the number of virtual network cards usable inside a container depends on the number of physical network cards of the host. Then, the target resource group corresponding to the target task is determined from all the resource groups based on the Kubernetes platform, and the network resource is requested from the AI server that hosts the physical network cards corresponding to the target resource group. For scenarios in which high-performance NICs are multiplexed, the AI platform needs to provide a mechanism that reasonably allocates high-performance NICs to containers and ensures that training tasks use the correct high-performance NICs to transmit data.
[0052] It can be seen that the embodiment of the present application first performs virtualization processing on multiple physical network cards of the AI server to obtain multiple virtual network cards, wherein there is a virtual relationship between each physical network card and its corresponding virtual network cards. All the virtual network cards are then grouped to obtain different resource groups, wherein the virtual relationships between the virtual network cards and the physical network cards within each resource group are not repeated. Finally, the virtual network cards in the target resource group are used to provide network resources for the target task on the AI server, so as to train the target task. In this embodiment of the present application, on the basis of virtualizing each physical network card into multiple virtual network cards, the virtual network cards of different physical network cards are bound together by means of resource grouping. During training, when any network card in the resource group becomes abnormal, the other, non-abnormal network cards in the resource group can continue to provide network resources, so that high-performance network cards are reasonably allocated to training tasks and high-performance network card multiplexing is realized.
[0053] Figure 3 is a flowchart of a specific task training method provided in an embodiment of the present application. As shown in Figure 3, the task training method includes:
[0054] S21: Perform virtualization processing on multiple physical network cards of the artificial intelligence server, respectively, to obtain multiple virtual network cards; wherein, there is a virtual relationship between each of the physical network cards and the corresponding virtual network card.
[0055] S22: Group all the virtual network cards to obtain different resource groups; wherein the virtual relationship between the virtual network cards and the physical network cards in each resource group is not repeated.
[0056] S23: Create the target task and a target container for running the target task;
[0057] S24: Determine a target resource group corresponding to the target task from all the resource groups based on the container orchestration platform, and request the network resource from the artificial intelligence server that hosts the physical network cards corresponding to the target resource group.
[0058] In this embodiment, for the specific process from step S21 to step S24, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.
[0059] S25: Acquire the network card status of the physical network card corresponding to each virtual network card in the target resource group, and determine whether the network card status is normal; if it is normal, configure the virtual network card in the target resource group whose corresponding physical network card is in the normal state;
[0060] S26: If it is abnormal, detect the network card status of the physical network cards corresponding to the other virtual network cards in the target resource group.
[0061] In this embodiment, in order for the training task to be served by a normal network card, it is necessary to obtain the network card status of the physical network card corresponding to each virtual network card in the target resource group and to determine whether that status is normal. If it is normal, the virtual network card in the target resource group whose physical network card is in the normal state is configured. If it is abnormal, the status of the physical network cards corresponding to the other virtual network cards in the target resource group is detected, so as to find a network card in the normal state. In this embodiment, when a virtual resource is applied for on behalf of a training task, the real physical network card corresponding to the virtual resource is looked up, and the virtual network card is configured into the container only when the state of the physical network card is UP. To this end, abnormal network card events can be further monitored through the resource management component of the Kubernetes platform, and the network card status of the physical network card is updated according to the monitoring result.
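The check in steps S25 and S26 can be sketched as a simple filter over a resource group. This is a minimal sketch under stated assumptions: the function name `usable_vfs` and the mapping arguments are hypothetical, and the state strings are assumed to be "up"/"down" values such as those Linux reports in `/sys/class/net/<interface>/operstate`.

```python
from typing import Dict, List, Sequence


def usable_vfs(
    group: Sequence[str],
    vf_to_pf: Dict[str, str],
    pf_state: Dict[str, str],
) -> List[str]:
    """Return the VFs of `group` whose backing physical NIC is UP.

    Each VF is checked in turn; a VF whose physical NIC is abnormal (or whose
    state is unknown) is skipped and the remaining VFs in the group are tried.
    """
    return [
        vf for vf in group
        if pf_state.get(vf_to_pf[vf], "down").lower() == "up"
    ]


vf_to_pf = {"1VF-a": "PF1", "2VF-a": "PF2"}
print(usable_vfs(("1VF-a", "2VF-a"), vf_to_pf, {"PF1": "down", "PF2": "up"}))
# ['2VF-a']
```

Only the VFs returned by the filter would be configured into the container; an empty result means no NIC in the group is currently usable.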
[0062] Taking the ROCE NIC as an example, when a physical ROCE NIC in the cluster becomes abnormal, the node resource management component automatically recognizes the event. When a new container is created, only the virtual NICs corresponding to physical NICs in the normal state are used; for an abnormal physical network card, the information of its virtual network cards is not configured (for example, it is not written into the nccl.conf file). When a physical ROCE NIC is abnormal, a newly submitted task is assigned only virtual ROCE NICs that are normally usable. For a training task that is already running, the task enters an abnormal running state; through the monitoring mechanism, the training task is resubmitted. Because the training task is based on a checkpoint mechanism, training resumes at the step where it terminated, without the user perceiving the interruption. This is shown in Figure 4.
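The resubmission behaviour can be illustrated with a toy checkpoint loop. This is only a sketch of the general checkpoint-and-resume idea: the function `run_training`, the JSON checkpoint format and the simulated failure are all assumptions made for illustration, and a real training job would checkpoint model and optimizer state through its framework.

```python
import json
from pathlib import Path
from typing import List, Optional


def run_training(total_steps: int, ckpt_path: str, fail_at: Optional[int] = None) -> List[int]:
    """Run steps [start, total_steps), where `start` comes from the checkpoint.

    `fail_at` simulates a NIC failure that aborts the task at a given step.
    Returns the list of steps executed during this invocation.
    """
    ckpt = Path(ckpt_path)
    start = json.loads(ckpt.read_text())["next_step"] if ckpt.exists() else 0

    executed = []
    for step in range(start, total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated NIC failure at step {step}")
        executed.append(step)  # stand-in for one real training step
        # Persist progress so that a resubmitted task resumes from here.
        ckpt.write_text(json.dumps({"next_step": step + 1}))
    return executed
```

If the first submission fails at step 3, the monitoring mechanism resubmits the same task and the second invocation executes only steps 3 onward.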
[0063] S27: Configure the transmission bus address and device name of the virtual network card in the target resource group to obtain a configuration file.
[0064] S28: Use the virtual network card configured in the configuration file to communicate through a communication library built by a multi-card communication framework, so as to provide network resources for the target task on the artificial intelligence server.
[0065] In this embodiment, during configuration, the transmission bus address and device name of each virtual network card in the target resource group are first configured to obtain a configuration file. If the system architecture uses the high-speed serial computer expansion bus standard (PCI-Express, peripheral component interconnect express) for data transmission, the transmission bus address of the virtual network card is generally a PCI-E bus address. Then, the virtual network cards configured in the configuration file are used to communicate through a communication library built on a multi-card communication framework, so as to provide network resources for the target task on the AI server. This communication library can be NCCL, the NVIDIA Collective Communications Library, which implements collective communication (all-gather, reduce, broadcast) across multiple GPUs. With a low-version Linux kernel, when there are multiple physical ROCE NICs on the physical machine, a larger number of virtual ROCE NICs appear and NCCL cannot find the correct virtual ROCE NIC. That is, the virtualized ROCE network cards cannot be isolated inside the container. In that case, when the training task uses NCCL for communication acceleration, the correct virtual network card cannot be selected, and the training task does not run normally.
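One way to obtain the device name for a given PCI-E address is sketched below. It assumes a Linux host with an RDMA-capable driver, where the RDMA device name appears as a subdirectory of `infiniband/` under the PCI device's sysfs node; the function name `rdma_device_name` and the example address are hypothetical, and the patent does not state how the lookup is performed.

```python
from pathlib import Path
from typing import Optional


def rdma_device_name(pci_address: str, sysfs_root: str = "/sys/bus/pci/devices") -> Optional[str]:
    """Map a PCI-E bus address (e.g. '0000:3b:00.2') to its RDMA device name.

    Returns None when the PCI device exposes no RDMA device, for example when
    the address does not exist or the driver is not loaded.
    """
    ib_dir = Path(sysfs_root) / pci_address / "infiniband"
    if not ib_dir.is_dir():
        return None
    names = sorted(entry.name for entry in ib_dir.iterdir())
    return names[0] if names else None
```

The resulting pair of PCI-E address and device name is what the configuration step records for each virtual network card in the target resource group.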
[0066] Therefore, when a training task applies for virtual ROCE network card resources, in addition to allocating the PCI-E address of the virtual ROCE network card, the device name corresponding to the virtual ROCE network card is also automatically configured, so that the correct virtual ROCE network card is used to communicate, as shown in Figure 5. Further, when a training task applies for virtual network card resources, the distributed training task is scheduled to different physical nodes based on the container orchestration mechanism of the Kubernetes platform, the "/etc/nccl.conf" file is automatically generated for the training task, and the names of the usable virtual network cards are written to this file, thereby informing NCCL which virtual network cards to use. In this process, the training task does not need to perceive the physical network resource information of the node when it applies for network resources; by setting the correct NCCL environment, the training task is directed to use the correct virtual network card. When a ROCE network card becomes abnormal, the abnormality can be detected in real time to ensure that a normal ROCE network card is allocated to a newly submitted training task.
[0067] It can be seen that when a virtual network card is selected in the embodiment of the present application, the network card status of the physical network card corresponding to each virtual network card in the target resource group is first obtained, and it is determined whether that status is normal. If it is normal, the virtual network card in the group whose physical network card is in the normal state is configured; if it is abnormal, the status of the physical network cards corresponding to the other virtual network cards in the target resource group is detected. While physical network cards are flexibly allocated to multiple training tasks, the effective management of the many virtual network cards ensures that training tasks use the correct virtual network cards. When a node in the cluster has a network card abnormality, the normal submission of training tasks is not affected.
[0068] As shown in Figure 6, the embodiment of the present application correspondingly also discloses a task training device, including:
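A minimal sketch of generating such a file is given below. NCCL reads `NCCL_`-prefixed settings from `/etc/nccl.conf` as `KEY=value` lines, and `NCCL_IB_HCA` restricts which RDMA devices it uses. The function name `render_nccl_conf` and the device names are hypothetical, and the choice of `NCCL_IB_HCA` as the setting is an assumption, since the patent only says that the usable virtual network card names are written to the file.

```python
from typing import Sequence


def render_nccl_conf(device_names: Sequence[str]) -> str:
    """Render nccl.conf content that limits NCCL to the given RDMA devices."""
    if not device_names:
        raise ValueError("at least one usable virtual NIC is required")
    # NCCL_IB_HCA takes a comma-separated list of device names.
    return "NCCL_IB_HCA=" + ",".join(device_names) + "\n"


if __name__ == "__main__":
    # Inside the task's container this text would be written to /etc/nccl.conf.
    print(render_nccl_conf(["mlx5_2", "mlx5_3"]), end="")
```

Because only devices backed by healthy physical NICs are passed in, an abnormal NIC never appears in the generated file for a newly created container.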
[0069] The virtualization module 11 is used to perform virtualization processing on multiple physical network cards of the artificial intelligence server respectively, so as to obtain multiple virtual network cards; wherein, there is a virtual relationship between each of the physical network cards and the corresponding virtual network card;
[0070] The pairing module 12 is configured to group all the virtual network cards to obtain different resource groups; wherein, the virtual relationship between the virtual network cards and the physical network cards in each of the resource groups is not repeated;
[0071] The training module 13 is configured to provide network resources for the target task on the artificial intelligence server by using the virtual network card in the target resource group, so as to train the target task.
[0072] It can be seen that in this embodiment of the present application, multiple physical network cards of the artificial intelligence server are respectively virtualized to obtain multiple virtual network cards, wherein there is a virtual relationship between each physical network card and its corresponding virtual network cards. All the virtual network cards are then grouped to obtain different resource groups, wherein the virtual relationships between the virtual network cards and the physical network cards within each resource group are not repeated. Finally, the virtual network cards in the target resource group are used to provide network resources for the target task on the artificial intelligence server, so as to train the target task. In this embodiment of the present application, on the basis of virtualizing each physical network card into multiple virtual network cards, the virtual network cards of different physical network cards are bound together by means of resource grouping. During training, when any network card in the resource group becomes abnormal, the other, non-abnormal network cards in the resource group can continue to provide network resources, so that high-performance network cards are reasonably allocated to training tasks and high-performance network card multiplexing is realized.
[0073] In some specific embodiments, the virtualization module 11 specifically includes:
[0074] a first enabling unit, configured to control the artificial intelligence server to enter the basic input output system to enable the first option of the virtualization technology supporting direct input/output access;
[0075] a second enabling unit, configured to install a network card driver in the artificial intelligence server and enable the second option of the virtualization technology supporting direct input/output access of the artificial intelligence server, so that the multiple physical network cards are respectively virtualized.
[0076] In some specific embodiments, the task training device further includes:
[0077] a reporting module, configured to report the virtual relationship between each physical network card and the corresponding virtual network card and the resource group to the container orchestration platform, so as to allocate the resource group through the container orchestration platform;
[0078] A creation module for creating the target task and a target container for running the target task;
[0079] a determination request module, configured to determine the target resource group corresponding to the target task from all the resource groups based on the container orchestration platform, and to request the network resource from the artificial intelligence server that hosts the physical network cards corresponding to the target resource group;
[0080] a judging module, configured to obtain the network card status of the physical network card corresponding to each virtual network card in the target resource group, determine whether the network card status is normal, and, if it is normal, configure the virtual network card in the target resource group whose corresponding physical network card is in the normal state;
[0081] a detection module, configured to detect, if the status is abnormal, the network card status of the physical network cards corresponding to the other virtual network cards in the target resource group;
[0082] a configuration module, configured to obtain a configuration file by configuring the transmission bus address and device name of the virtual network card in the target resource group;
[0083] A communication module, configured to use the virtual network card configured in the configuration file to communicate through a communication library built by a multi-card communication framework, so as to provide network resources for the target task on the artificial intelligence server;
[0084] The updating module is used for monitoring the abnormal events of the network card through the resource management component of the container orchestration platform, and updating the network card state of the physical network card according to the monitoring result.
[0085] Further, the embodiments of the present application also provide an electronic device. Figure 7 is a structural diagram of an electronic device 20 according to an exemplary embodiment, and the content of the diagram should not be considered as any limitation on the scope of application of the present application.
[0086] Figure 7 is a schematic structural diagram of an electronic device 20 provided in an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input and output interface 25 and a communication bus 26. The memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps of the task training method disclosed in any of the foregoing embodiments.
[0087] In this embodiment, the power supply 23 is used to provide working voltage for each hardware device on the electronic device 20. The communication interface 24 can create a data transmission channel between the electronic device 20 and external devices; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited here. The input and output interface 25 is used to obtain external input data or to output data to the outside world; its specific interface type can be selected according to specific application needs and is not specifically limited here.
[0088] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, a random access memory, a magnetic disk, an optical disk, etc. The resources stored on it can include an operating system 221, a computer program 222, data 223, etc., and the storage can be short-term or permanent.
[0089] The operating system 221 is used to manage and control each hardware device and computer program 222 on the electronic device 20, so as to realize the operation and processing of the massive data 223 in the memory 22 by the processor 21, which can be Windows Server, Netware, Unix, Linux etc. The computer program 222 may further include a computer program that can be used to complete other specific tasks in addition to the computer program that can be used to complete the task training method performed by the electronic device 20 disclosed in any of the foregoing embodiments. Data 223 may include training tasks collected by electronic device 20 .
[0090] Further, an embodiment of the present application further discloses a storage medium, where a computer program is stored in the storage medium, and when the computer program is loaded and executed by a processor, the steps of the task training method disclosed in any of the foregoing embodiments are implemented.
[0091] The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments may be referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part can be referred to the description of the method.
[0092] Finally, it should also be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or sequence between these entities or operations. Moreover, the terms "comprising", "including" or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article or device comprising a list of elements includes not only those elements, but also other elements not explicitly listed or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.
[0093] The task training method, apparatus, device and storage medium provided by the present invention have been introduced in detail above. Specific examples are used herein to describe the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this description should not be understood as limiting the present invention.