AI task failure rescheduling method and system, storage medium and electronic device

The rescheduling strategy for AI tasks is optimized with a deep reinforcement learning algorithm, addressing the high latency and uneven resource utilization that AI tasks suffer after cluster node failures. This enables rapid task recovery and efficient resource use, and improves the stability and efficiency of the cluster.

CN122285263A · Pending · Publication Date: 2026-06-26 · CHINA MOBILE (SUZHOU) SOFTWARE TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHINA MOBILE (SUZHOU) SOFTWARE TECH CO LTD
Filing Date: 2026-03-05
Publication Date: 2026-06-26

Figure CN122285263A_ABST

Abstract

This application provides an AI task failure rescheduling method, an AI task failure rescheduling system, a computer-readable storage medium, and an electronic device, and relates to the field of cloud computing technology. The method includes: when a faulty node is detected in the cluster, determining the healthy nodes and executing a fault handling process to form a set of AI tasks to be rescheduled; obtaining input parameters including the task scheduling priority, the health index of the healthy nodes, and the scheduling latency; determining an optimal scheduling strategy based on a deep reinforcement learning algorithm, with minimizing scheduling latency and minimizing cluster resource imbalance as joint optimization objectives; and executing task rescheduling according to the optimal scheduling strategy. This application addresses the problems of high latency and unbalanced cluster resource utilization in AI task rescheduling after a cluster node failure.
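
The joint objective described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not give the reward formula, the state encoding, or the network, so this sketch swaps the deep reinforcement learning policy for a greedy one-step search over a hand-written score. The class names, the `min_health` threshold, the equal weights, and the use of the standard deviation of node utilization as the imbalance measure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Task:
    name: str
    priority: int   # assumed convention: higher value is rescheduled first
    demand: float   # fraction of one node's capacity the task needs


@dataclass
class Node:
    name: str
    health: float   # assumed health index in [0, 1]
    used: float     # fraction of capacity already in use
    latency: float  # assumed estimate of scheduling latency, in seconds


def score(healthy, target, task, w_latency=0.5, w_balance=0.5):
    """Negative weighted sum of latency and imbalance (higher is better).

    Imbalance is taken here as the population standard deviation of node
    utilization after placing `task` on `target` (an assumption).
    """
    utils = [n.used + (task.demand if n is target else 0.0) for n in healthy]
    return -(w_latency * target.latency + w_balance * pstdev(utils))


def reschedule(tasks, nodes, min_health=0.5):
    """Greedy stand-in for the learned policy: place tasks by priority."""
    healthy = [n for n in nodes if n.health >= min_health]
    plan = {}
    for task in sorted(tasks, key=lambda t: t.priority, reverse=True):
        # Only nodes with enough free capacity are candidates.
        candidates = [n for n in healthy if n.used + task.demand <= 1.0]
        if not candidates:
            continue  # task stays pending in this sketch
        best = max(candidates, key=lambda n: score(healthy, n, task))
        best.used += task.demand
        plan[task.name] = best.name
    return plan


# Hypothetical cluster: node C has a low health index and is excluded.
nodes = [
    Node("A", health=0.9, used=0.2, latency=0.1),
    Node("B", health=0.9, used=0.4, latency=0.1),
    Node("C", health=0.2, used=0.0, latency=0.05),
]
tasks = [Task("t1", priority=2, demand=0.3), Task("t2", priority=1, demand=0.3)]
plan = reschedule(tasks, nodes)
print(plan)
```

In a reinforcement learning setting, a quantity like `score` would more plausibly serve as the per-step reward, with the placement decision made by a trained policy rather than by the exhaustive `max` used here.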