Methods, systems, and devices for scheduling and computing power optimization of multi-task large models in medical data
By employing a dual-order priority algorithm for diagnosis and treatment and a three-pool isolation mechanism, the scheduling and computing power optimization problems of multi-task large models in medical scenarios are solved, achieving resource guarantee and collaborative optimization of computing power for critical tasks, and improving scheduling stability and resource utilization efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- UNIV OF SCI & TECH BEIJING
- Filing Date
- 2026-04-23
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies struggle to effectively unify constraints on medical timeliness, resource isolation, and heterogeneous model access in healthcare scenarios. This leads to high-risk tasks being blocked by ordinary tasks, online inference interfering with backend training, and a mismatch between computing power and task complexity, affecting the continuity of doctor interactions and the achievement rate of key time limits.
The system employs a dual-sequence priority algorithm for diagnosis and treatment, dual-anchor resource templates, and a three-pool isolation and elastic recycling mechanism. By generating isomorphic task elements for medical and computational tasks, it performs standardized task access, priority calculation, resource isolation, template mapping, and model routing to form a closed-loop control chain, ensuring resource guarantee and computational optimization for critical tasks such as emergency care.
It improves scheduling stability and computing power utilization for multi-task large models in medical scenarios, reduces the cost of heterogeneous model access and switching, and strengthens resource guarantees for critical tasks such as emergency care.
Figure CN122309084A_ABST
Description
Technical Field
[0001] This invention relates to the field of medical artificial intelligence scheduling and control and heterogeneous computing power orchestration technology, and in particular to a method, system and device for scheduling and computing power optimization of multi-task large models in medical data that can be used in hospital intelligent computing platforms.
Background Technology
[0002] As large-scale models continue to be deployed in medical scenarios such as critical value interpretation, emergency and outpatient question answering, medical record generation, medical record quality control, consultation support, and incremental model training, in-hospital intelligent computing platforms often simultaneously handle online clinical inference tasks, real-time doctor interaction tasks, batch processing tasks, and background training tasks. These tasks differ significantly in terms of timeliness, risk level, model capability requirements, and resource consumption patterns. Among them, emergency assistance and critical value interpretation tasks require low latency and high stability, while training, fine-tuning, and batch processing tasks typically have long durations and high GPU memory consumption.
[0003] In existing technologies, common first-in-first-out (FIFO), static priority, or single global scoring scheduling methods struggle to translate medical constraints such as the urgency of diagnosis and treatment, clinical risk level, remaining time limits, and clinical workflow dependencies into directly executable scheduling rules. This can easily lead to high-risk tasks being blocked by ordinary tasks. Furthermore, online inference and background training often run concurrently on the same computing cluster, lacking resource pool isolation, elastic borrowing boundaries, and recoverable preemption mechanisms, resulting in latency jitter in clinical inference. In medical scenarios, this jitter directly impacts the continuity of doctor interactions and the timeliness of key procedures.
[0004] Meanwhile, different vendors' models differ in input / output protocols, context constraints, health checks, and canary-scale switching methods. Without a unified service interface and capability description mechanism, model access and switching costs are high. Existing complexity estimations mostly remain at the task type or input length level, making it difficult to achieve fine-grained allocation of computing power based on task complexity. Therefore, a multi-task large-model priority scheduling and computing power collaborative optimization scheme is needed that can uniformly bear the constraints of medical timeliness, resource isolation, and heterogeneous model access, in order to improve overall computing power utilization efficiency and scheduling stability while ensuring the response speed of critical tasks such as emergency care.
Summary of the Invention
[0005] The purpose of this invention is to provide a method, system, and device for priority scheduling and computational power collaborative optimization of multi-task large models in medical scenarios, so as to solve the problems in the prior art, such as priority evaluation remaining at a single score, mutual interference between online inference and background training, difficulty in switching between multiple vendor models, and mismatch between computational power configuration and task complexity.
[0006] Firstly, this invention provides a scheme for scheduling and computing power optimization of multi-task large models in medical data under medical scenarios. This scheme takes medical and computing isomorphic task elements as unified input, the diagnosis and treatment dual-order escort priority algorithm as the scheduling core, the three-pool isolation and elastic recycling mechanism as the resource guarantee framework, and the dual-anchor resource template as the resource allocation carrier to generate candidate model constraints and resource request vectors, and respectively pass them to the model routing and resource pool control links. The unified service interface of the model hub and the model capability fingerprint book are used as the basis for model routing, thereby forming a closed-loop control chain of "standardized task access - priority calculation - resource isolation - template mapping - model routing - dynamic recycling - restricted update".
[0007] In one implementation, task requests from the hospital information system, electronic medical record system, doctor workstation, bedside terminal, and backend training platform are first subjected to field mapping, validity verification, and standardized encapsulation to generate a medical computing isomorphic task element T. This task element includes at least a source identifier, task category, urgency level E, clinical risk level R, remaining time limit $T_{rem}$, waiting time W, process dependency depth D, input length $L_{in}$, context rounds $N_{ctx}$, modal complexity $M_{mod}$, reasoning depth $P_{dep}$, target output length $L_{out}$, target latency $T_{lat}$, capability requirement identifier, degradability identifier, and interruptibility identifier.
[0008] In one implementation, the dual-order escort algorithm for diagnosis and treatment calculates priority by "dual-order calculation, hierarchical comparison". The first order is the medical treatment escort rank G, which expresses the level of medical safety protection; the second order is the intra-rank efficiency order value S, which expresses the scheduling order within the same protection level. Preferably, the medical treatment escort rank G is calculated according to ..., where $Q_{est}$ represents the estimated queuing delay, $X_{est}$ represents the estimated execution delay, a smoothing term is included, and a1 to a4 are preset weights; the intra-rank efficiency order value S is calculated according to ..., where C represents the estimated execution cost and λ1 to λ4 are preset weights. More preferably, during scheduling, G is compared first and S second, and waiting compensation is introduced only within adjacent ranks to avoid long-term starvation of low-rank tasks.
[0009] Preferably, the urgency level E, clinical risk level R, and process dependency depth D can be discrete values on a scale of 1 to 5, or values within the [0,1] interval after calibration using historical samples; the estimated queuing delay $Q_{est}$ and the estimated execution delay $X_{est}$ can be generated from the moving average, quantile, or exponentially smoothed value of the most recent n similar tasks; when $T_{rem}$ ≤ 0, U is directly set to 1. To avoid abnormal amplification of individual features, the system normalizes E, R, W, D, $L_{in}$, $N_{ctx}$, and $L_{out}$ before computation and sets upper bound constraints on a1~a4 and λ1~λ4, so that the dual-order escort algorithm for diagnosis and treatment not only reflects differences in medical risk but also remains comparable and stable across departments, time periods, and model versions.
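The estimation and normalization options named in this paragraph can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the smoothing factor `alpha`, and the choice of min-max scaling are assumptions, and the formula for U itself is not reproduced here, so only the stated override for an exhausted time limit is shown.

```python
from typing import Sequence


def exp_smooth(samples: Sequence[float], alpha: float = 0.3) -> float:
    """Exponentially smoothed estimate over recent delays (oldest first).

    alpha is a hypothetical smoothing factor, not a value from the source.
    """
    if not samples:
        raise ValueError("need at least one sample")
    est = float(samples[0])
    for x in samples[1:]:
        est = alpha * x + (1 - alpha) * est
    return est


def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max scale into [0, 1]; out-of-range inputs are clipped.

    Min-max scaling is an assumed choice; the source only says the
    features are normalized.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def urgency_override(t_rem: float, u: float) -> float:
    """Force U to 1 once the remaining time limit is exhausted (T_rem <= 0)."""
    return 1.0 if t_rem <= 0 else u
```

Clipping in `normalize` is one way to keep a single extreme feature from dominating the index, which is the concern the paragraph raises.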
[0010] In one implementation, the dual-order priority index I=(G,S) does not directly determine the resource template, but first drives hierarchical enqueueing and intra-queue sorting, and then drives resource pool control. Specifically, high-priority tasks enter the high-escort queue, while mid-to-low-priority tasks enter the regular queue or background queue; the queue scheduler determines the dequeue order based on the I value and sends the resource request level to the resource pool controller, so that the priority evaluation result can be propagated to the subsequent resource reclamation and model routing stages.
[0011] In one implementation, the computing power used in the execution of the solution (e.g., the computing power of a medical intelligent computing platform applicable to in-hospital data processing) is divided into an online guaranteed inference resource pool, an offline training / batch processing resource pool, and a shared elastic resource pool. The online guaranteed inference resource pool handles time-sensitive tasks such as emergency assistance, critical value interpretation, and real-time doctor Q&A, and is configured with a minimum guaranteed quota. The offline training / batch processing resource pool handles deferred tasks such as model training, incremental fine-tuning, knowledge extraction, index reconstruction, and batch analysis. The shared elastic resource pool is used for limited borrowing between online and offline resources. Shared resources are only lent to the offline side when the idle resources on the online side are higher than the recycling threshold, and are preferentially reclaimed when the idle resources on the online side are lower than the recycling threshold. If necessary, checkpoint saving, pausing, or migration are performed on interruptible tasks in the offline pool.
[0012] In one implementation, a dual-anchor resource template is jointly mapped based on the task complexity score K and the medical escort level G. Preferably, the complexity score K is determined according to .... The system pre-configures at least four types of dual-anchor resource templates: high-risk, high-complexity templates; high-risk, low-complexity templates; regular-risk, high-complexity templates; and regular-risk, low-complexity templates. Each template includes at least the target model category, acceleration resource quota, memory budget, concurrency threshold, output length limit, timeout control parameters, and fallback strategy. Therefore, high-risk, high-complexity tasks are prioritized for stable, high-configuration resources, while ordinary, low-complexity tasks are prioritized for shared or lightweight resources.
[0013] Furthermore, the dual-anchor resource template can be implemented using a two-dimensional mapping matrix F(G,K), where the row dimension represents the diagnostic and escort level interval, and the column dimension represents the complexity interval. When G exceeds a preset high escort threshold and K exceeds a preset high complexity threshold, an exclusive high-configuration template is selected and batch merging is prohibited. When G is high and K is low, a low-latency lightweight template is preferred. When G is low and K is high, a shared high-performance template is selected and a concurrency limit is set. When both G and K are low, a shared lightweight template or a batch processing template is selected. Thus, the priority calculation result does not stop at the queue sorting layer, but is linked to the specific computing power allocation and model type selection.
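The four-quadrant mapping F(G,K) described above can be sketched as a small lookup. The enum names and the threshold parameters `g_high` and `k_high` are hypothetical; the source only states that preset thresholds exist.

```python
from enum import Enum


class Template(Enum):
    EXCLUSIVE_HIGH_CONFIG = "exclusive high-configuration, batch merging prohibited"
    LOW_LATENCY_LIGHT = "low-latency lightweight"
    SHARED_HIGH_PERF = "shared high-performance with concurrency limit"
    SHARED_LIGHT_OR_BATCH = "shared lightweight or batch processing"


def select_template(G: float, K: float, g_high: float, k_high: float) -> Template:
    """Map escort rank G and complexity K to one of four template classes.

    g_high and k_high stand in for the preset high-escort and
    high-complexity thresholds (assumed parameter names).
    """
    high_g = G > g_high
    high_k = K > k_high
    if high_g and high_k:
        return Template.EXCLUSIVE_HIGH_CONFIG
    if high_g:
        return Template.LOW_LATENCY_LIGHT
    if high_k:
        return Template.SHARED_HIGH_PERF
    return Template.SHARED_LIGHT_OR_BATCH
```

A real matrix would likely use more than two intervals per axis; two are used here only to mirror the four cases the text enumerates.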
[0014] In one implementation, the unified service interface for model hubs includes at least a unified task input field, a unified context field, a unified security control field, a unified call parameter field, a unified result output field, and a unified audit field, and is used to unify the calls of large models from different vendors, versions, and deployment domains into a consistent input and output protocol. The model capability fingerprint book records at least the modal support capabilities, maximum context length, average latency range, resource consumption range, health status, compliance label, deployment domain label, and standby relationship of candidate models. The scheduler first filters candidate models that meet resource constraints based on the dual-anchor resource template, then performs capability matching, compliance verification, and health verification based on the model capability fingerprint book, and finally determines the target model and generates a unified call request.
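The two-stage routing order in this paragraph (resource filter first, then capability, compliance, and health checks) can be sketched as follows. The record fields are a reduced, hypothetical subset of the fingerprint entries listed above, and returning the first surviving candidate is an assumption, since the source does not state the final selection rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Fingerprint:
    """Reduced, illustrative fingerprint record (field names are assumed)."""
    name: str
    modalities: FrozenSet[str]
    max_context: int
    mem_gb: float
    healthy: bool
    compliant: bool


def route(models: Iterable[Fingerprint], mem_budget_gb: float,
          needed: FrozenSet[str], context_len: int) -> Optional[str]:
    # Stage 1: keep models that fit the template's memory budget.
    fits = [m for m in models if m.mem_gb <= mem_budget_gb]
    # Stage 2: capability matching, compliance and health verification.
    for m in fits:
        if (needed <= m.modalities and context_len <= m.max_context
                and m.compliant and m.healthy):
            return m.name  # first match wins (assumed tie-break)
    return None
```

Usage: with a book containing a large multimodal model, an unhealthy mid-size model, and a small text model, a text request under a 40 GB budget resolves to the small model.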
[0015] In one implementation, when a high-priority diagnostic and treatment support task arrives and the online guaranteed inference resource pool is insufficient, the system performs dynamic scheduling in the following order: reclaim resources from the shared pool first, save checkpoints for interruptible tasks in the offline pool, pause or migrate those tasks, and restore them according to recovery priority after the pressure is relieved. For tasks with a clinical risk level higher than a preset threshold, model downgrading solely for the purpose of releasing computing power is prohibited; when the preferred model is unavailable, only switching to a backup model that has passed capability verification and meets minimum safety requirements is permitted; if no such backup model exists, a controlled failure status is output and manual review is triggered.
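The model fallback rule for high-risk tasks can be sketched as a three-way decision. The status strings and the function name are hypothetical; the list of verified backups is assumed to have already passed capability verification and the minimum safety requirements.

```python
from typing import Optional, Sequence, Tuple


def resolve_high_risk(preferred: str, preferred_ok: bool,
                      verified_backups: Sequence[str]) -> Tuple[str, Optional[str]]:
    """Pick a model for a task whose clinical risk exceeds the threshold.

    Downgrading to a weaker model just to free compute is never an option
    on this path; the only alternatives are a verified backup or a
    controlled failure that triggers manual review.
    """
    if preferred_ok:
        return "PREFERRED", preferred
    if verified_backups:
        return "BACKUP", verified_backups[0]
    return "CONTROLLED_FAILURE_MANUAL_REVIEW", None
```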
[0016] In a further implementation, the dual-order priority index, the three-pool isolation mechanism, the dual-anchor resource template, and the model routing are associated through a unified control token. When a task is dequeued, the scheduler generates a control token containing fields such as rank level, template identifier, minimum resource requirement, whether demotion is allowed, and whether preemption of other tasks is allowed. This control token is simultaneously transmitted to the resource pool controller and the model router. The resource pool controller performs backup, reclamation, or preemption based on the control token, while the model router constrains the candidate model set based on the control token. This avoids the separation between priority evaluation, resource allocation, and model switching, thus forming a single-link control from diagnostic risk identification to computational execution.
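A control token of this kind can be sketched as an immutable record handed to both consumers. The field types, the GPU-count form of the minimum resource requirement, and the decision logic in the two consumer functions are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ControlToken:
    rank: int              # escort rank level
    template_id: str       # dual-anchor template identifier
    min_gpus: int          # minimum resource requirement (assumed unit)
    allow_demotion: bool   # may the task fall back to a lighter model?
    allow_preempt: bool    # may the task preempt other tasks?


def pool_action(tok: ControlToken, free_gpus: int) -> str:
    """Resource pool controller side (illustrative decision rule)."""
    if free_gpus >= tok.min_gpus:
        return "allocate"
    return "preempt" if tok.allow_preempt else "reclaim_shared"


def candidate_models(tok: ControlToken, primary: Sequence[str],
                     lighter: Sequence[str]) -> List[str]:
    """Model router side: lighter models are admitted only if demotion is allowed."""
    return list(primary) + list(lighter) if tok.allow_demotion else list(primary)
```

Freezing the dataclass means neither consumer can alter the token after dequeue, which fits the goal of keeping priority evaluation, resource allocation, and model switching on a single control link.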
[0017] In one implementation, the platform continuously records queuing time, startup latency, total inference latency, GPU utilization, peak memory usage, shared resource borrowing time, preemption count, checkpoint write time, model switching success rate, and time limit achievement rate. In the offline environment, it generates new rank thresholds, complexity thresholds, template mapping rules, and candidate values for resource pool quotas. The candidate values must be subject to change boundary pruning, replay verification, or sandbox verification before they can be released in a version-restricted manner to suppress scheduling parameter oscillations.
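One possible reading of "change boundary pruning" followed by verification is sketched below: a candidate parameter is clamped to a maximum step around the current value, and the clamped value is released only if a verification callback accepts it. The absolute step bound and the callback interface are assumptions, not details from the source.

```python
from typing import Callable


def prune_change(current: float, candidate: float, max_step: float) -> float:
    """Clamp a candidate so it moves at most max_step away from current."""
    lo, hi = current - max_step, current + max_step
    return min(hi, max(lo, candidate))


def release(current: float, candidate: float, max_step: float,
            verify: Callable[[float], bool]) -> float:
    """Return the new parameter value, or keep the current one if
    replay / sandbox verification (the verify callback) rejects it."""
    pruned = prune_change(current, candidate, max_step)
    return pruned if verify(pruned) else current
```

Bounding each step is what suppresses oscillation: even a badly fitted offline candidate can only shift a threshold or quota by a limited amount per released version.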
[0018] Secondly, based on the same inventive concept, this application also provides a multi-task large model scheduling and computing power optimization system for medical data, which includes at least a task element generation module, a dual-order priority index module, a three-pool isolation and elastic reclamation module, a dual-anchor resource template module, a model hub unified service interface module, a dynamic scheduling execution module, and a log auditing and restricted update module. The above modules work together to implement the aforementioned multi-task large model scheduling and computing power optimization method for medical data.
[0019] Thirdly, based on the same inventive concept, this application also provides an electronic device and a computer-readable storage medium. The electronic device includes a processor, a memory, and a communication interface. When the processor executes program instructions stored in the memory, it implements the aforementioned method. The computer-readable storage medium stores a program, and when the program is executed by the processor, it implements the aforementioned method for scheduling and optimizing computing power for multi-task large-scale models in medical data.
[0020] By adopting the above technical solution, the present invention has at least the following beneficial effects: First, it replaces the single total score mixed sorting with the dual-order escort priority algorithm for diagnosis and treatment, enabling priority evaluation to have a clear algorithmic expression and the ability to bear medical safety constraints; second, it enhances the resource guarantee capability for critical tasks such as emergency treatment by using the dual-order priority index to drive the three-pool isolation and elastic recycling mechanism; third, it maps diagnosis and treatment risks and task complexity to resource quotas simultaneously with the dual-anchor resource template, improving the utilization rate of computing power; fourth, it reduces the cost of rapid access and switching of heterogeneous models by using the unified service interface and model capability fingerprint book of the model hub; and fifth, it improves the stability, interpretability, and auditability of scheduling behavior by using restricted updates.
Attached Figure Description
[0021] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings used in the specific embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.
[0022] Figure 1 is a flowchart of the overall process of the method provided in an embodiment of the present invention; Figure 2 is a flowchart of the dual-order escort algorithm for diagnosis and treatment provided in an embodiment of the present invention; Figure 3 is a flowchart of the three-pool isolation and elastic reclamation mechanism provided in an embodiment of the present invention; Figure 4 is a flowchart of the dual-anchor resource template mapping process provided in an embodiment of the present invention; Figure 5 is a flowchart of the collaborative routing process between the unified service interface for model hubs and the model capability fingerprint book provided in an embodiment of the present invention; Figure 6 is a flowchart of dynamic scheduling, preemption, and recovery provided in an embodiment of the present invention; Figure 7 is a structural diagram of a multi-task large model priority scheduling and computing power collaborative optimization system in a medical scenario provided by an embodiment of the present invention.
Detailed Implementation
[0023] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. It should be noted that the following embodiments are used to explain the present invention, and not to limit the scope of protection of the present invention. Based on the disclosure of the present invention, any equivalent substitutions or modifications obtained without inventive effort should fall within the scope of protection of the present invention.
[0024] In the description of this invention, it should be noted that the terms "task element", "hierarchy", "resource pool", "template", "fingerprint book", "module", etc. are only used to distinguish different functional logical objects or logical sets, and should not be construed as limiting their physical deployment location, hardware form, calling order or one-to-one correspondence; in specific implementation, the above functions can be implemented by a single computing node or by multiple service nodes working together.
[0025] It should also be noted that the following specific embodiments are optimized configurations for further explanation of the present invention. These configurations can be combined or used in conjunction with each other. As long as there is no contradiction or conflict between their technical features, they should be considered to fall within the scope of the present invention.
[0026] Example 1. As shown in Figures 1 to 6, this embodiment provides a method for scheduling and computing power optimization of multi-task large models in medical data, which can be deployed in the scheduling and control plane of a hospital intelligent computing platform. The method sequentially completes task element generation, dual-order priority index calculation, three-pool resource control, dual-anchor resource template mapping, model routing, dynamic recycling, and restricted updates.
[0027] Step 1: Receive task requests from the hospital information system, electronic medical record system, doctor workstation, bedside terminal, model gateway, and backend training platform; perform field mapping, legality verification, desensitization processing, and standardized encapsulation to generate medical computing isomorphic task elements.
[0028] In this embodiment, tasks such as emergency and outpatient auxiliary question answering, critical value interpretation, consultation assistance, inpatient medical record summary, progress record generation, quality control analysis, knowledge base update, model incremental fine-tuning, and batch report generation are all normalized into the same type of scheduling input object to eliminate the impact of inconsistent field structures of different business systems on the scheduler.
[0029] Preferably, the medical computing isomorphic task element is denoted as $T = \{id, src, type, E, R, T_{rem}, W, D, L_{in}, N_{ctx}, M_{mod}, P_{dep}, L_{out}, T_{lat}, cap, deg, intf\}$, where id is the task identifier, src is the source system, type is the task category, E is the urgency of diagnosis and treatment, R is the clinical risk level, $T_{rem}$ is the remaining time limit, W is the current waiting time, D is the process dependency depth, $L_{in}$ is the input length, $N_{ctx}$ is the number of context rounds, $M_{mod}$ is the modal complexity, $P_{dep}$ is the reasoning depth, $L_{out}$ is the target output length, $T_{lat}$ is the target latency, cap is the capability requirement identifier, deg is the degradability identifier, and intf is the interruptibility identifier.
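The task element maps naturally onto a typed record. The sketch below mirrors the field names listed in this paragraph; the Python types and units (seconds, tokens, booleans for the two flags) are assumptions, since the source does not fix them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskElement:
    id: str        # task identifier
    src: str       # source system
    type: str      # task category
    E: float       # urgency of diagnosis and treatment
    R: float       # clinical risk level
    T_rem: float   # remaining time limit (assumed: seconds)
    W: float       # current waiting time (assumed: seconds)
    D: float       # process dependency depth
    L_in: int      # input length (assumed: tokens)
    N_ctx: int     # number of context rounds
    M_mod: float   # modal complexity
    P_dep: float   # reasoning depth
    L_out: int     # target output length (assumed: tokens)
    T_lat: float   # target latency (assumed: seconds)
    cap: str       # capability requirement identifier
    deg: bool      # degradability identifier
    intf: bool     # interruptibility identifier


# Example values are invented for illustration only.
example = TaskElement(
    id="t-001", src="emergency-workstation", type="critical_value_interpretation",
    E=5, R=5, T_rem=30.0, W=0.0, D=2, L_in=1200, N_ctx=1, M_mod=1.0,
    P_dep=2.0, L_out=300, T_lat=2.0, cap="text", deg=False, intf=False,
)
```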
[0030] In actual deployment, the E in the task element can be mapped from emergency identification, critical value identification, medical order time limit level, or departmental preset weight; R can be mapped from factors such as the risk level involved in the clinical task, whether it affects treatment decisions, and whether it enters the manual review link; and D can be calculated from the number of predecessor nodes, the number of successor dependencies, or the critical path length of the current task in the clinical flowchart. For missing fields, the system can fill them with hierarchical default values and record the filling source in the audit field to ensure the completeness of subsequent algorithm inputs. Specifically: for fields with default values defined by the source system, the default value of the source system is used directly; for fields that can be obtained from the statistics of similar historical tasks, the median or mode of the most recent n similar tasks is used as the default value; for the risk-related fields E and R, conservative default values of the department or task category are used; for the complexity-related fields $L_{in}$, $N_{ctx}$, and $L_{out}$, the historical percentile values of similar tasks are used; for fields that cannot be determined in the above ways, a preset safety default value is written and marked as "to be reviewed". For example, if an emergency task is missing R, the higher risk default value of the task's category is used first to avoid underestimating the scheduling priority.
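The hierarchical default filling can be sketched as a cascade that also returns the fill source for the audit field. The precedence between the field-specific rules and the generic history rule is one possible reading of the paragraph, and the 90th percentile default, the nearest-rank percentile method, and all names are assumptions.

```python
import math
from statistics import median
from typing import Optional, Sequence, Tuple

RISK_FIELDS = {"E", "R"}
COMPLEXITY_FIELDS = {"L_in", "N_ctx", "L_out"}


def nearest_rank_percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fill_missing(name: str, source_default: Optional[float] = None,
                 history: Sequence[float] = (),
                 conservative: Optional[float] = None,
                 safe_default: float = 0, pct: int = 90) -> Tuple[float, str]:
    """Return (value, fill_source) for a missing field."""
    if source_default is not None:
        return source_default, "source_default"
    if name in RISK_FIELDS and conservative is not None:
        return conservative, "conservative_default"
    if name in COMPLEXITY_FIELDS and history:
        return nearest_rank_percentile(history, pct), f"history_p{pct}"
    if history:
        return median(history), "history_median"
    return safe_default, "to_be_reviewed"
```

Using a high percentile for the complexity fields errs toward over-provisioning, in the same spirit as the conservative default for risk fields.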
[0031] Step 2: Calculate the medical escort rank G and the intra-rank efficiency order value S based on the task element.
[0032] In this embodiment, instead of using a single total score to sort all tasks, a dual-order calculation method of "first determining the rank, then sorting within the rank" is adopted. The system first forms a basic escort quantity based on the urgency level E and the clinical risk level R, and then combines the remaining-time-limit urgency U and the process dependency depth D to raise or lower the rank, obtaining the medical escort rank G; the efficiency order value S is then calculated only within the same rank.
[0033] Preferably, the remaining-time-limit urgency U satisfies ...; the medical escort rank G satisfies ...; and the intra-rank efficiency order value S satisfies ..., where $Q_{est}$ represents the estimated queuing delay, $X_{est}$ represents the estimated execution delay, C represents the estimated execution cost, a smoothing term is included, and a1 to a4 and λ1~λ4 are all preset weights. This calculation method can simultaneously transform treatment risks, time constraints, and execution costs into an executable scheduling index I=(G,S).
[0034] To improve the feasibility of the algorithm, the basic escort quantity ... can be calculated first, then the urgency correction amount ..., and the medical escort rank is finally obtained by ...; at the same time, C is defined as ..., where v1 to v5 are offline calibration coefficients. For tasks of the same rank, the scheduler sorts them by S from largest to smallest; when the difference between the S values of two tasks is less than the preset threshold δ, the task with lower estimated memory usage that meets the target latency is selected first, in order to reduce fragmentation and improve throughput within the online window.
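The near-tie rule for tasks of the same rank can be sketched as a pairwise choice. S is taken as an already computed input because its formula is not reproduced here; the record fields, and the ordering of the tie-break (latency compliance first, then lower memory), are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    S: float              # intra-rank efficiency order value (precomputed)
    est_mem_gb: float     # estimated memory usage
    meets_latency: bool   # whether the target latency can be met


def pick(a: Candidate, b: Candidate, delta: float) -> Candidate:
    """Choose between two same-rank tasks."""
    if abs(a.S - b.S) >= delta:
        # Clear gap: larger S goes first.
        return a if a.S > b.S else b
    # Near tie: prefer a task that meets latency, then lower memory.
    return min(a, b, key=lambda c: (not c.meets_latency, c.est_mem_gb))
```

Because a threshold comparison is not transitive, a full scheduler would need to apply this rule pairwise at dequeue time rather than as a global sort key.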
[0035] Under a set of example parameters, the emergency critical value interpretation task can be set to E=5, R=5, the consultation summary task to E=3, R=3, and the nighttime training task to E=1, R=1. If the emergency critical value interpretation task also has a small $T_{rem}$ and a high D, its G value is promoted to the highest rank; even if its input length is longer, as long as its rank is higher, it is still dequeued ahead of ordinary summarization and training tasks.
[0036] Step 3: Perform hierarchical enqueueing and in-queue sorting according to the double-order priority index I.
[0037] Specifically, the system establishes at least a high-priority escort queue, a regular queue, and a background queue. During scheduling, the medical escort rank G is compared first, with higher-rank tasks entering the scheduling window first. When ranks are equal, the intra-rank efficiency order value S is compared. For low-rank tasks that have been waiting for a long time, a waiting compensation term can be added within adjacent ranks, but this must not violate the basic requirement of prioritizing higher-rank tasks. Preferably, the waiting compensation term can be denoted as $B_{wait} = \min(\beta \cdot W / W_0,\ B_{max})$, where β is the compensation gain coefficient, W is the current waiting time, $W_0$ is the waiting threshold at which compensation starts, and $B_{max}$ is the compensation upper limit. When a task satisfies $G = G_{cur} - 1$ and $W \ge W_0$, it participates in scheduling comparisons at the adjacent rank boundary using $S' = S + B_{wait}$, while ranks more than one level above the task keep strict priority. For example, when the waiting time of a regular queue task exceeds a set threshold, its waiting compensation term gradually increases, but it can only participate in scheduling comparisons at adjacent rank boundaries and cannot overtake tasks of higher escort ranks.
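The waiting compensation and the hierarchical comparison can be sketched as follows. G and S are taken as precomputed inputs, `G_cur` is read as the rank currently being served, and the function names are hypothetical. How S' is weighed against tasks at the adjacent boundary is left open in the text, so the sketch only computes it.

```python
from typing import List, Tuple


def wait_bonus(W: float, W0: float, beta: float, B_max: float) -> float:
    """B_wait = min(beta * W / W0, B_max)."""
    return min(beta * W / W0, B_max)


def compensated_S(S: float, G: int, W: float, G_cur: int,
                  W0: float, beta: float, B_max: float) -> float:
    """S' = S + B_wait, applied only one rank below G_cur and once W >= W0."""
    if G == G_cur - 1 and W >= W0:
        return S + wait_bonus(W, W0, beta, B_max)
    return S


def dequeue_order(tasks: List[Tuple[str, int, float]]) -> List[str]:
    """Strict hierarchical order: compare G first, then S, both descending.

    Each task is (name, G, S).
    """
    return [name for name, _, _ in sorted(tasks, key=lambda t: (-t[1], -t[2]))]
```

The cap `B_max` bounds how far a long-waiting task can climb, so compensation relieves starvation without letting waiting time alone outrank escort protection.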
[0038] Step 4: Divide the computing resources into an online guaranteed inference resource pool, an offline training / batch processing resource pool, and a shared elastic resource pool to form a three-pool isolation and elastic reclamation mechanism.
[0039] like Figure 3 As shown, the online guaranteed inference resource pool undertakes time-sensitive tasks such as emergency assistance, critical value interpretation, and real-time doctor interaction, and is set with a minimum guaranteed quota G. minThe offline training / batch processing resource pool supports deferred tasks such as model training, incremental fine-tuning, knowledge extraction, index reconstruction, and batch analysis; the shared elastic resource pool is used to perform controlled borrowing between online and offline operations.
[0040] In this embodiment, the resource controller first generates a resource demand vector R based on the dual-anchor resource template. req =(gpu_req, mem_req, slot_req, timeout_req), and obtain the online side idle resource vector R in real time. avl_on and the reclaimable resource vector R in the shared elastic resource pool rec_sh Only when the online idle resources are higher than the reclamation threshold Th borrow Only when the idle resources in the shared pool are below the recycling threshold (Th) can they be lent to the offline side; when the idle resources on the online side are below the recycling threshold (Th)... recall Or R appears avl_on <R req And R avl_on +R rec_sh ≥R req In such cases, the system immediately stops new offline borrowing and prioritizes reclaiming resources from the shared pool. The specific process for reclaiming resources from the shared pool is as follows: First, freeze new borrowing requests in the shared pool; then, traverse the borrowing instances in the shared pool in the order of "shortest borrowing duration, lowest business impact, and imminent natural termination"; issue reclamation commands to instances that meet the reclamation conditions one by one, and release the corresponding GPU quota, video memory quota, and concurrent slots; when the accumulated available resources meet R... req The recycling process stops at a certain point. After recycling is complete, the released resources are directly bound to the resource request record of the current high-priority task, and the resource pool controller adds them to the online guaranteed inference resource pool for priority use by that task. If the shared pool still meets R after recycling... avl_on +R rec_sh <R req If the result is "still insufficient after recycling", the system will further select interruptible task execution checkpoints from the offline training / batch processing resource pool to save, pause, or migrate them.
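The shared-pool reclamation loop can be sketched as below. For brevity the resource vectors are reduced to a single GPU count, which is a simplification of the vector form in the text; treating the stated traversal order as a lexicographic sort key, and all field names, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Loan:
    name: str
    gpus: int            # GPUs borrowed from the shared pool
    borrowed_s: float    # how long the loan has been held
    impact: int          # business impact score (lower = cheaper to reclaim)
    remaining_s: float   # estimated time until natural termination


def reclaim(avail: int, req: int, loans: Sequence[Loan]) -> Tuple[int, List[str], bool]:
    """Reclaim shared-pool loans until avail >= req.

    Returns (available GPUs, reclaimed loan names, satisfied flag). A False
    flag means the caller should escalate to interruptible offline tasks.
    New borrowing is assumed to be frozen before this function is called.
    """
    order = sorted(loans, key=lambda l: (l.borrowed_s, l.impact, l.remaining_s))
    reclaimed: List[str] = []
    for loan in order:
        if avail >= req:
            break  # stop as soon as the request is covered
        avail += loan.gpus
        reclaimed.append(loan.name)
    return avail, reclaimed, avail >= req
```

Stopping at the first point where the request is covered keeps reclamation minimal, so offline work that did not need to be disturbed keeps its loan.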
[0041] To reduce the jitter caused by frequent preemption, a preemption selection value $P_{pre}$ can further be calculated, where Chk represents the checkpoint storage cost, Mig represents the migration cost, Rec represents the recovery cost, and b1 to b4 are preset weights. The system preferentially preempts offline tasks with a high $P_{pre}$ and sets a cooldown window for tasks that have already been preempted, to prevent them from being preempted again in quick succession.
[0042] When preemption occurs, the system first determines whether the offline task is in a saveable phase. If it is saveable, a checkpoint is written and the resources are then released. If it is not saveable but migration is supported, the task is migrated to the remaining offline nodes. If neither condition is met, the task is skipped and the search continues with the next candidate task. During the recovery phase, a recovery priority value $R_{rec}$ can be defined, where $P_{rog}$ indicates the completed percentage, $G_{org}$ indicates the task's original diagnosis and treatment escort level, and Rec represents the restart and recovery cost. The system restores interrupted tasks in descending order of $R_{rec}$, balancing the protection of completed work against the priority restoration of high-value tasks.
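The per-task preemption decision and the cooldown window can be sketched as below. This is an illustrative sketch under stated assumptions: `OfflineTask`, `preempt_action`, and `select_preemptions` are hypothetical names, the candidate list is assumed to be already ordered by the preemption selection value (whose formula is not reproduced in this text), and the cooldown is modelled as a simple time difference.

```python
from dataclasses import dataclass


@dataclass
class OfflineTask:
    name: str
    saveable: bool      # currently in a phase where a checkpoint can be written
    migratable: bool    # can be moved to another offline node
    last_preempted: float = float("-inf")  # timestamp of the last preemption


def preempt_action(task, now, cooldown_s):
    """Decide how to handle one candidate: checkpoint, migrate, or skip."""
    if now - task.last_preempted < cooldown_s:
        return "skip"        # still inside the cooldown window
    if task.saveable:
        return "checkpoint"  # write a checkpoint, then release resources
    if task.migratable:
        return "migrate"     # move to the remaining offline nodes
    return "skip"            # neither option is available


def select_preemptions(candidates, now, cooldown_s, needed):
    """Walk candidates in order and collect up to `needed` preemptions."""
    plan = []
    for task in candidates:
        if len(plan) == needed:
            break
        action = preempt_action(task, now, cooldown_s)
        if action != "skip":
            task.last_preempted = now  # starts this task's cooldown window
            plan.append((task.name, action))
    return plan
```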
[0043] Step 5: Calculate the task complexity score K, and jointly map the task complexity score K and the diagnosis and treatment escort level G into a dual-anchor resource template.
[0044] Preferably, the complexity score K is computed from the following quantities: $L_{in}$ represents the input length, $N_{ctx}$ the number of context rounds, $M_{mod}$ the modal complexity, $P_{dep}$ the reasoning depth, $L_{out}$ the target output length, and $T_{lat}$ the target latency, together with a smoothing term. $M_{mod}$ characterizes the differences in processing load caused by different input modalities; $P_{dep}$ characterizes the additional computational overhead caused by chained reasoning, retrieval augmentation, or multi-turn tool calls; and $T_{lat}$ characterizes the strength of the task's latency constraint.
[0045] This embodiment predefines four types of dual-anchor resource templates, namely $Tpl_{HH}$, $Tpl_{HL}$, $Tpl_{CH}$, and $Tpl_{CL}$. $Tpl_{HH}$ is for high-escort, high-complexity tasks and provides a high GPU quota, a large memory budget, and strict timeout control; $Tpl_{HL}$ is for high-escort, low-complexity tasks and prioritizes low-latency lightweight models; $Tpl_{CH}$ is for routine-escort, high-complexity tasks and uses a shared high-performance model with concurrency limits; $Tpl_{CL}$ is for routine-escort, low-complexity tasks and uses a shared lightweight model or batch processing mode.
[0046] Preferably, the dual-anchor resource template includes at least the target model category modelclass, acceleration resource quota accquota, memory budget mem, concurrency threshold conc, batch processing parameter batch, maximum output length limit outlimit, timeout period timeout, and fallback strategy fallback. When determining the template category based on (G, K), G is first compared with the high escort threshold $\theta_G$ and K with the high complexity threshold $\theta_K$, and the template category is then determined by the two-dimensional mapping matrix F(G, K): when $G \ge \theta_G$ and $K \ge \theta_K$, map to $Tpl_{HH}$; when $G \ge \theta_G$ and $K < \theta_K$, map to $Tpl_{HL}$; when $G < \theta_G$ and $K \ge \theta_K$, map to $Tpl_{CH}$; when $G < \theta_G$ and $K < \theta_K$, map to $Tpl_{CL}$. To avoid frequent template jitter, hysteresis intervals ΔG and ΔK can further be set; when G or K falls into a hysteresis interval near its threshold, the template from the previous scheduling cycle is used preferentially. Once the template is determined, the system uses it as the direct basis for subsequent model selection and resource allocation. Specifically, a set of candidate model constraints is first generated based on modelclass, mem, conc, outlimit, timeout, and fallback in the template; only models that support the required modalities, have a maximum context length not less than $L_{in} + L_{out}$, have a predicted memory usage not exceeding mem, have a predicted latency not exceeding timeout, and meet the deployment domain and compliance tag requirements are retained as candidate models. Subsequently, a resource requirement vector $R_{req}$ is generated based on accquota, mem, conc, and batch in the template, and resource allocation is completed by the resource pool controller in the following order: online guaranteed inference resource pool first, shared elastic resource pool second, offline pool preemption and reclamation third.
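The two-dimensional mapping F(G, K) with hysteresis can be sketched as a small function. This is a sketch under assumptions: the function name `select_template` and the string labels are illustrative, and the hysteresis interval is assumed to be the symmetric band within ΔG (or ΔK) of the threshold, which the text does not define precisely.

```python
def select_template(g, k, theta_g, theta_k, delta_g=0.0, delta_k=0.0, previous=None):
    """Map (G, K) to one of the four template names, with optional hysteresis.

    If G or K lies inside the assumed band around its threshold and a template
    from the previous scheduling cycle is known, that template is kept.
    """
    in_band = abs(g - theta_g) < delta_g or abs(k - theta_k) < delta_k
    if previous is not None and in_band:
        return previous
    escort = "H" if g >= theta_g else "C"       # H: high escort, C: routine escort
    complexity = "H" if k >= theta_k else "L"   # H: high complexity, L: low
    return f"Tpl_{escort}{complexity}"
```

For example, with thresholds 0.7 and 0.6, a task at G = 0.72 that was previously on `Tpl_CL` stays there when ΔG = 0.05, instead of flipping to `Tpl_HL` on a marginal change.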
[0047] In a set of optional implementations, $Tpl_{HH}$ can support 2 GPU instances, a high memory budget, and a single-task dedicated queue; $Tpl_{HL}$ can support 1 GPU instance and stricter timeout control; $Tpl_{CH}$ can support a concurrency of 2 to 4 on a shared high-performance model; $Tpl_{CL}$ supports shared lightweight models and allows batch processing. The `fallback` field in the template specifies the list of backup large models and their switching order for cases where the preferred large model times out, fails a health check, or its deployment domain becomes unavailable. The specific process is as follows: first, the preferred model category and the ordered list of backup models in the template are read; second, the candidate models are checked in sequence against the context, memory, latency, deployment domain, and compliance conditions given by the template; third, a routing score is calculated for the candidates that meet the conditions, and the model with the highest score is selected as the preferred execution model. If the preferred model fails the health check before the call or times out during the call, the next backup model is tried in the order given by `fallback`; if no backup model meets the template constraints either, a controlled failure status is returned. Thus, the template constrains not only resource allocation but also model selection, switching order, and the failure fallback path.
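The switching behaviour driven by the `fallback` field can be sketched as follows. This is a simplified illustration: `call_with_fallback` and its callable parameters (`meets_constraints`, `is_healthy`, `invoke`) are hypothetical, the preferred model is assumed to have been chosen already by the routing score, and a timeout is represented by Python's built-in `TimeoutError`.

```python
def call_with_fallback(preferred, fallback, meets_constraints, is_healthy, invoke):
    """Try the preferred model, then each backup in the order given by fallback.

    meets_constraints(model) -> bool : template constraint check
    is_healthy(model)        -> bool : health check before the call
    invoke(model)            -> any  : performs the call, may raise TimeoutError
    """
    for model in [preferred, *fallback]:
        if not meets_constraints(model) or not is_healthy(model):
            continue  # move on to the next model in the fallback order
        try:
            return {"status": "ok", "model": model, "output": invoke(model)}
        except TimeoutError:
            continue  # timed out during the call, try the next backup
    # No model satisfied the template: report a controlled failure.
    return {"status": "controlled_failure", "reason": "no model satisfied the template"}
```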
[0048] Step 6: Select the target model through the unified service interface of the model hub and the model capability fingerprint book.
[0049] The unified service interface of the model hub encapsulates input fields, context fields, security control fields, call parameter fields, output fields, and audit fields, enabling models from different vendors and deployment domains to be invoked using a unified protocol. The model capability fingerprint records the model name, version number, vendor identifier, deployment domain, supported modalities, maximum context length, average latency range, resource consumption range, health status, compliance tags, and standby relationships.
[0050] During model selection, the system first filters out candidate models that do not meet resource constraints based on the dual-anchor resource template, and then performs capability matching, deployment domain matching, and health status verification based on the model capability fingerprint. For tasks with clinical risk levels higher than a preset threshold, a hard constraint gating is performed using Gate(task, model) before entering RouteScore scoring, retaining only candidate models that meet the minimum safety requirements; RouteScore scoring and selection are only performed within this candidate set. For example, for high-risk escort tasks requiring in-hospital deployment and support for long contexts, if a cloud-based model has the capability but does not meet the deployment domain constraints, it will not enter the final candidate set.
[0051] Preferably, a large-model selection score RouteScore can be defined, where CapFit represents the match between model capabilities and task requirements, Health represents the health status score, Locality represents the deployment domain match, LatencyFit represents the latency adaptability, ResCost represents the resource consumption cost, and SwitchCost represents the cost of switching from the currently running model to the candidate model. The system selects the model with the highest RouteScore only from the candidate set that has passed compliance verification and deployment domain verification.
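One possible shape for such a score is sketched below. The scoring formula itself is not reproduced in this text, so the linear form, the signs (fit terms add, cost terms subtract), and all weight values are assumptions made purely for illustration; `Candidate`, `WEIGHTS`, `route_score`, and `pick_model` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cap_fit: float      # CapFit
    health: float       # Health
    locality: float     # Locality
    latency_fit: float  # LatencyFit
    res_cost: float     # ResCost
    switch_cost: float  # SwitchCost


# Hypothetical weights; the source does not give values.
WEIGHTS = {
    "cap_fit": 0.3, "health": 0.2, "locality": 0.2,
    "latency_fit": 0.1, "res_cost": 0.1, "switch_cost": 0.1,
}


def route_score(c, w=WEIGHTS):
    """Assumed linear score: benefits are added, costs are subtracted."""
    return (
        w["cap_fit"] * c.cap_fit
        + w["health"] * c.health
        + w["locality"] * c.locality
        + w["latency_fit"] * c.latency_fit
        - w["res_cost"] * c.res_cost
        - w["switch_cost"] * c.switch_cost
    )


def pick_model(candidates):
    """Return the highest-scoring candidate, or None if the set is empty.

    The candidates are assumed to have already passed compliance and
    deployment domain verification.
    """
    return max(candidates, key=route_score, default=None)
```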
[0052] Step 7: Perform dynamic scheduling when high-priority tasks arrive and online pool resources are insufficient.
[0053] The system first reclaims reclaimable resources from the shared elastic resource pool. Specifically, the resource controller generates a resource demand vector $R_{req}$ based on the template corresponding to the current high-level task and compares it, dimension by dimension, with the current idle resources $R_{avl\_on}$ in the online pool: when $R_{avl\_on} \ge R_{req}$, online resources are allocated directly; when $R_{avl\_on} < R_{req}$ and $R_{avl\_on} + R_{rec\_sh} \ge R_{req}$, shared pool reclamation is performed; when $R_{avl\_on} + R_{rec\_sh} < R_{req}$, reclamation is judged insufficient and the offline task preemption process is triggered. For shared pool reclamation, the system first stops new lending of shared resources and then reclaims the resources already lent, following the low-impact-first principle. The reclaimed GPU quota, video memory quota, and concurrent slots are immediately transferred to the resource binding record of the current high-priority task. If resources are still insufficient after reclamation, interruptible tasks are selected from the offline training / batch processing resource pool for checkpoint saving, pausing, or migration. The criteria for an interruptible task include at least: the task's interruptible flag intf is true, the current execution stage supports checkpoint saving or migration, the task is not within the preemption cooldown window, and its historical preemption count has not exceeded the threshold. After the tasks meeting these conditions form an interruptible candidate set, the preemption order is determined according to the sorting rule "low checkpoint overhead first, low migration overhead first, low recovery cost first", and the released resources are redistributed to high-priority tasks. When the online pressure decreases, the system resumes the interrupted tasks according to save time, remaining workload, and original priority.
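The three-way decision and the interruptibility filter can be sketched as below. This is a minimal illustration with hypothetical function names; resource vectors are plain tuples compared dimension by dimension, and "less than" is interpreted as "does not cover the demand in at least one dimension", which is an assumption about how the vector inequality is meant.

```python
def covers(avail, req):
    """True if avail meets req in every dimension."""
    return all(a >= r for a, r in zip(avail, req))


def schedule_decision(r_req, r_avl_on, r_rec_sh):
    """Choose between direct allocation, shared pool reclamation, and preemption."""
    if covers(r_avl_on, r_req):
        return "allocate_online"
    combined = tuple(a + s for a, s in zip(r_avl_on, r_rec_sh))
    if covers(combined, r_req):
        return "reclaim_shared"
    return "preempt_offline"


def is_interruptible(intf, stage_supports_save_or_migrate, in_cooldown,
                     preempt_count, max_preempts):
    """Apply the four minimum criteria for an interruptible offline task."""
    return (
        intf
        and stage_supports_save_or_migrate
        and not in_cooldown
        and preempt_count <= max_preempts  # count has not exceeded the threshold
    )
```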
[0054] Regarding medical safety constraints, this embodiment stipulates that, for tasks with a clinical risk level higher than a preset threshold, model downgrading must not be performed solely for the purpose of releasing computing power; if the preferred model is unavailable, switching is permitted only to a backup model that has passed capability verification and meets the minimum safety requirements; if no backup model meets the requirements, a controlled failure status is output and a manual takeover process is triggered. The controlled failure status includes at least the reason for failure, the recommended manual handling actions, and an audit record number. To ensure that these functional constraints are enforced in the underlying implementation, the system applies a hard constraint gating rule Gate(task, model) before model routing. For tasks with a clinical risk level higher than the preset threshold, the downgradability flag deg is first forcibly set to 0; second, only models that meet the minimum safety requirements are allowed into the candidate set. The minimum safety requirements include at least: CapCov = 1 (capability coverage flag, i.e., the candidate model fully supports the modalities and context length required by the task); Comp = 1 (compliance flag); Local = 1 (deployment domain matching flag); historical stability Stable not lower than the safety threshold $T_{safe}$; and the anomaly rate Err within the most recent statistical window not higher than the threshold $T_{err}$. A model is allowed to enter the final candidate set only when Gate(task, model) = 1.
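The gating rule can be sketched as a predicate over a task and a model fingerprint. This is an illustration, not the embodiment's code: the `Task` and `Model` records are hypothetical, the thresholds are passed as extra parameters rather than looked up from configuration, and tasks at or below the risk threshold are assumed to pass the gate unconditionally, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Task:
    risk: int      # clinical risk level
    deg: int = 1   # downgradability flag; 1 means downgrading is allowed


@dataclass
class Model:
    name: str
    cap_cov: int   # CapCov: 1 if required modalities and context length are covered
    comp: int      # Comp: compliance flag
    local: int     # Local: deployment domain matching flag
    stable: float  # Stable: historical stability
    err: float     # Err: anomaly rate in the most recent statistical window


def gate(task, model, risk_threshold, t_safe, t_err):
    """Return 1 if the model may enter the final candidate set, else 0."""
    if task.risk <= risk_threshold:
        return 1  # assumption: the hard gate only applies above the threshold
    task.deg = 0  # high-risk tasks may not be downgraded to free computing power
    ok = (
        model.cap_cov == 1
        and model.comp == 1
        and model.local == 1
        and model.stable >= t_safe
        and model.err <= t_err
    )
    return 1 if ok else 0
```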
[0055] Step 8: Record the execution process and perform limited updates.
[0056] The system uniformly records metrics such as queuing time, startup latency, total inference latency, GPU utilization, peak memory usage, shared resource borrowing time, preemption count, checkpoint write time, model switching count, backup model hit rate, target time limit overrun rate, and anomaly rate. In an offline environment, it also generates new candidate values for the escort level threshold, the complexity threshold, the template mapping rules, and the resource pool quotas. Preferably, the new candidate value for the escort level threshold and the new candidate value for the complexity threshold can each be calculated with a quantile function, where $Q_p$ denotes the quantile function at a given quantile p, $\{G_i\}$ represents the set of diagnosis and treatment escort level samples within the statistical window, $\{K_i\}$ represents the set of task complexity score samples within the statistical window, and pG and pK represent the target quantiles used for the escort level threshold and the complexity threshold, respectively. By adopting quantile statistics, the thresholds can be updated smoothly as the workload distribution changes, without being excessively affected by individual extreme samples.
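A quantile-based candidate computation can be sketched as follows. The exact formula is not reproduced in this text, so this sketch assumes the simplest reading: each candidate threshold is the target quantile of the samples in the statistical window. The linear-interpolation quantile and the function names are illustrative choices.

```python
def quantile(samples, p):
    """Quantile of samples at level p in [0, 1], using linear interpolation."""
    xs = sorted(samples)
    if not xs:
        raise ValueError("samples must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])


def threshold_candidates(g_samples, k_samples, p_g, p_k):
    """Assumed form: candidate thresholds are the pG and pK quantiles."""
    return quantile(g_samples, p_g), quantile(k_samples, p_k)
```

Because a quantile depends on rank rather than magnitude, a single extreme sample in the window moves the candidate very little, which matches the stated motivation.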
[0057] The restricted update process may include: offline statistical analysis of the previous version's metrics across different escort levels, complexity ranges, and time periods; generating candidate parameters; performing change boundary pruning on the candidate parameters; releasing the version after it passes replay or sandbox verification; and retaining the version number, release time, trigger reason, and rollback information. The candidate parameters include at least the high escort threshold $\theta_G$, the high complexity threshold $\theta_K$, the interval boundaries in the template mapping matrix F(G, K), the online guaranteed quota $G_{min}$, the shared pool borrowing threshold $Th_{borrow}$, the shared pool reclamation threshold $Th_{recall}$, the concurrency threshold conc, and the backup model switching order. Preferably, change boundary pruning on the candidate parameters can be performed with a clip operation, where $x_{cand}$ represents the candidate parameter generated by offline analysis, $x_{old}$ represents the current online parameter, r represents the maximum allowable change percentage in a single release, and clip indicates that the parameter is clipped to within the given upper and lower bounds. For integer parameters, the clipped values are rounded and subjected to preset minimum and maximum limits. This avoids scheduling policy oscillations caused by sudden parameter changes and keeps scheduling behavior stable, interpretable, and auditable.
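Change boundary pruning can be sketched as below. The clipping formula is not reproduced in this text; the sketch assumes the bounds are the current value plus or minus the fraction r of its magnitude, which is one natural reading of "maximum allowable change percentage in a single release". The function names and the optional limit parameters are hypothetical.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def bounded_update(x_cand, x_old, r, integer=False, min_value=None, max_value=None):
    """Limit a candidate parameter to within a fraction r of the current value.

    Assumed bounds: x_old - r*|x_old| and x_old + r*|x_old|.
    """
    span = abs(x_old) * r
    x_new = clip(x_cand, x_old - span, x_old + span)
    if integer:
        x_new = round(x_new)  # Python's round() rounds halves to the even integer
    if min_value is not None:
        x_new = max(min_value, x_new)
    if max_value is not None:
        x_new = min(max_value, x_new)
    return x_new
```

For instance, with a current concurrency threshold of 4 and r = 0.25, an offline candidate of 10 is released as 5, so reaching the candidate takes several audited releases rather than one jump.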
[0058] For example, consider three tasks arriving at the same time: task A is the interpretation of critical values in the emergency department, task B is the summarization of inpatient medical records, and task C is nighttime incremental training. The system calculates that A has the highest G and a medium-to-high K, so it is mapped to $Tpl_{HH}$ or $Tpl_{HL}$; B has a moderate G and a moderate K and is mapped to $Tpl_{CH}$; C has the lowest G and a high K and enters the offline pool. If the available resources in the online pool are insufficient at this time, the resources lent from the shared pool are reclaimed first; if resources are still insufficient after reclamation, a checkpoint is saved for C and its resources are released, allowing A to be executed with priority in the online guaranteed inference resource pool. After A finishes, C is restored according to the recovery priority value. This process shows that the dual-order priority index is not an isolated score but continuously drives three-pool isolation, template mapping, and model routing, so that the response speed of emergency tasks and the overall resource utilization efficiency of the system improve together.
[0059] Example 2 As shown in Figure 7, this embodiment provides a multi-task large model scheduling and computing power optimization system for medical data in a medical scenario. It includes a task element generation module, a dual-order priority index module, a three-pool isolation and elastic reclamation module, a dual-anchor resource template module, a model-hub unified service interface module, a dynamic scheduling execution module, and a log auditing and restricted update module. Each module can be deployed in a unified control plane or as a microservice, and can interface with a container orchestration system, a GPU resource management system, a model gateway, and a log platform. Specifically, the task element generation module receives task requests from multiple source business systems, performs field mapping, validity verification, desensitization, and standardized encapsulation, and outputs a medical computing isomorphic task element T. The dual-order priority index module receives the task element T, calculates the diagnosis and treatment escort level G, the intra-level efficiency order value S, and the dual-order priority index I, and sends I to the three-pool isolation and elastic reclamation module and the dual-anchor resource template module. The dual-anchor resource template module generates a template identifier, a resource requirement vector $R_{req}$, and candidate model constraints based on G and K, and sends them to the three-pool isolation and elastic reclamation module and the unified service interface module, respectively. The three-pool isolation and elastic reclamation module completes resource allocation, shared pool reclamation, or offline task preemption based on the priority information and the resource requirement vector $R_{req}$, and sends the available resource information to the dynamic scheduling execution module.
The unified service interface module of the model hub completes model screening, protocol adaptation, and call encapsulation based on candidate model constraints and model capability fingerprints, and sends target model information to the dynamic scheduling execution module. The dynamic scheduling execution module is responsible for initiating model calls, monitoring execution status, triggering rollback or recovery processes, and writing execution results and running indicators to the log audit and restricted update module. The log audit and restricted update module performs offline analysis of historical indicators, generates threshold and quota candidate values, and writes them back to the dual-order priority index module, dual-anchor resource template module, and three-pool isolation and elastic reclamation module after verification.
[0060] Example 3 This embodiment provides an electronic device. The electronic device includes a processor, a memory, a communication interface, and an acceleration resource management interface. When the processor executes program instructions stored in the memory, it implements the multi-task large-model scheduling and computing power optimization method for medical data in a medical scenario as described in Embodiment 1 above. The computer-readable storage medium stores a program, and when the program is executed by the processor, it implements the aforementioned method.
[0061] Furthermore, this solution can also be implemented using a computer-readable storage medium that can be read by a device configured with a processor to perform the various steps of the method disclosed in Embodiment 1.
[0062] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for scheduling and optimizing computing power of multi-task large models in medical data, characterized in that the method includes:
S1. Receive task requests for medical data from multiple sources and encapsulate them into medical computing isomorphic task elements;
S2. Calculate the diagnosis and treatment escort level G based on the urgency level E, clinical risk level R, remaining time limit $T_{rem}$, and process dependency depth D in the task element, and calculate the intra-level efficiency order value S based on the waiting time W, the remaining time limit urgency U, the process dependency depth D, and the estimated execution cost C, forming a dual-order priority index I = (G, S);
S3. Perform hierarchical enqueueing and in-queue sorting on the task elements according to the dual-order priority index;
S4. Divide the computing power into an online guaranteed inference resource pool, an offline training / batch processing resource pool, and a shared elastic resource pool; jointly map the task complexity score K and the diagnosis and treatment escort level G to a dual-anchor resource template, and generate a resource requirement vector and candidate model constraints from the dual-anchor resource template;
S5. Based on the candidate model constraints generated in step S4, select the target large model through the unified service interface of the model hub and the model capability fingerprint book; initiate a resource request to the resource pool controller based on the resource requirement vector generated in step S4, and initiate the call after the resources are satisfied;
S6. When a high-level task arrives and the online pool resources are insufficient, perform dynamic scheduling: first reclaim resources from the shared elastic resource pool, and if resources are still insufficient after reclamation, perform checkpoint saving, pausing, or migration of interruptible tasks in the offline training / batch processing resource pool; and complete resource reallocation and backup model switching according to the resource quota and fallback strategy in the dual-anchor resource template;
S7. Record the scheduling execution results and perform a restricted update after replay verification or sandbox verification is passed.
2. The method according to claim 1, characterized in that the task element includes at least a source identifier, task category, urgency level E, clinical risk level R, remaining time limit $T_{rem}$, waiting time W, process dependency depth D, input length $L_{in}$, number of context rounds $N_{ctx}$, modal complexity $M_{mod}$, reasoning depth $P_{dep}$, target output length $L_{out}$, target latency $T_{lat}$, capability requirement identifier, degradability identifier, and interruptibility identifier.
3. The method according to claim 1, characterized in that the diagnosis and treatment escort level G and the remaining time limit urgency U are each calculated by a preset formula, in which $Q_{est}$ represents the estimated queuing delay, $X_{est}$ represents the estimated execution delay, a smoothing term is included, a1 to a4 are preset weights, and $T_{rem}$ indicates the remaining time limit.
4. The method according to claim 1, characterized in that the intra-level efficiency order value S is calculated by a preset formula, where C represents the estimated execution cost and the coefficients are preset weights; when scheduling tasks, the diagnosis and treatment escort level G is compared first, then the intra-level efficiency order value S, and waiting compensation is applied to long-waiting tasks.
5. The method according to claim 1, characterized in that S4 further includes: setting, for the online guaranteed inference resource pool, a minimum guaranteed quota that cannot be occupied by the offline training / batch processing resource pool; allowing idle resources in the shared elastic resource pool to be lent to the offline side only when the idle resources on the online side are higher than the borrowing threshold; and immediately stopping new offline borrowing and reclaiming shared resources when the pressure on the online side increases and its idle resources fall below the reclamation threshold.
6. The method according to claim 1, characterized in that the task complexity score K is calculated from the following quantities: $L_{in}$ represents the input length, $N_{ctx}$ the number of context rounds, $M_{mod}$ the modal complexity, $P_{dep}$ the reasoning depth, $L_{out}$ the target output length, and $T_{lat}$ the target latency, together with a smoothing term; $M_{mod}$ characterizes the differences in processing load caused by different input modalities, $P_{dep}$ characterizes the additional computational overhead caused by chained reasoning, retrieval augmentation, or multi-turn tool calls, and $T_{lat}$ characterizes the strength of the task's latency constraint; the task complexity score K and the diagnosis and treatment escort level G are jointly mapped to a dual-anchor resource template, and the dual-anchor resource template includes at least the target model category, acceleration resource quota, memory budget, concurrency threshold, output length limit, timeout control parameters, and fallback strategy.
7. The method according to claim 1, characterized in that, In S6, dynamic scheduling also includes: determining the preemption order of offline tasks based on interruptibility flags, checkpoint overhead, migration overhead, and recovery cost; prohibiting model downgrading for tasks with clinical risk levels higher than a preset threshold, solely for the purpose of releasing computing power; and allowing switching to a backup large model that has passed capability verification and meets minimum safety requirements only when the preferred large model is unavailable.
8. The method according to claim 1, characterized in that, In S7, the restricted update includes: performing offline analysis on queuing time, execution latency, resource consumption, model routing results and preemption recovery results to generate candidate values for the diagnosis and treatment escort level threshold, complexity threshold, dual-anchor resource template mapping rules and resource pool quota; performing change boundary pruning on the candidate values, and releasing, rolling back and auditing them by version after passing replay verification or sandbox verification.
9. A multi-task large model scheduling and computing power optimization system for medical data, characterized in that it includes a task element generation module, a dual-order priority index module, a three-pool isolation and elastic reclamation module, a dual-anchor resource template module, a model-hub unified service interface module, a dynamic scheduling execution module, and a log auditing and restricted update module, and the modules work together to implement the method described in any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that it includes a processor, a memory, and a computer program stored in the memory, wherein the computer program, when executed by the processor, implements the method as described in any one of claims 1 to 8.