A GPU memory-aware dynamic task scheduling method for multimodal large models in the cultural tourism industry

By adopting an edge-cloud collaborative architecture and a memory-aware dynamic task scheduling method based on graph attention networks, the problems of memory overflow and low resource utilization in cloud computing are solved. This achieves memory load balancing and inference latency reduction, thereby improving the computational efficiency and user experience of the multimodal large model of cultural tourism.

CN121900919B · Active · Publication Date: 2026-06-23 · UNIV OF ELECTRONICS SCI & TECH OF CHINA


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA
Filing Date: 2026-03-24
Publication Date: 2026-06-23

Application Information


IPC: G06F9/48; G06F9/455; G06N5/04; G06N3/0455; G06N3/092

AI Tagging

Application Domain

Program initiation/switching; Biological models

Technology Topics

Shard; Decision networks

Technical Efficacy Phrases

high speed; increase profit


AI Technical Summary

Technical Problem

Existing cloud computing task scheduling technologies lack memory awareness and ignore modal differences and security constraints, leading to problems such as memory overflow, high inference latency, and low resource utilization in large model inference clusters.

Method used

It adopts an edge-cloud collaborative architecture, extracts the memory supply and demand features of the task-resource heterogeneous graph through graph attention network, and combines deep reinforcement learning with action mask to generate the optimal scheduling strategy to avoid memory overflow, including memory-aware graph attention network, memory safety mask mechanism and multi-objective hybrid reward function.

Benefits of technology

Effectively prevent service crashes caused by GPU memory fragmentation, improve the response speed of large model inference and the utilization of cluster resources, ensure balanced GPU memory load of computing cluster, reduce inference latency, and enhance the interactive experience for visitors.



Abstract

The application discloses a GPU memory-aware dynamic task scheduling method for multimodal large models in the cultural tourism industry, and relates to the technical field of cloud computing and artificial intelligence. In the task-resource heterogeneous graph modeling stage, the application introduces GPU-memory-specific features such as the estimated peak GPU memory, the GPU memory fragmentation index, and the maximum contiguous GPU memory block size, so that the scheduler can perceive changes in GPU memory demand caused by the dynamic growth of the key-value cache. In the decision generation stage, an action mask mechanism masks scheduling actions that violate the GPU memory constraints before the decision network produces its output, fundamentally eliminating the system avalanche problem caused by an end-to-end reinforcement learning model outputting illegal actions in the initial exploration stage. In the reward function design, the application adopts the time to first token (TTFT), which is directly related to user experience, as the main optimization target and introduces a cross-node data migration penalty term, reflecting the engineering reality of large model inference: the key-value cache is large and costly to migrate.


Description

TECHNICAL FIELD

[0001] The present application relates to the technical field of cloud computing and artificial intelligence, and particularly relates to a GPU memory-aware dynamic task scheduling method for a multi-modal large model in the tourism and travel industry.

BACKGROUND

[0002] With the development of artificial intelligence technology, intelligent guides and digital human interaction based on large language models (LLMs) have rapidly become popular in the tourism and travel industry. These applications usually involve real-time processing of multi-modal data such as text, voice, and images, which places high demands on the inference performance of the back-end computing cluster.

[0003] Current computing power scheduling techniques mainly include traditional heuristic algorithms (such as round robin and minimum connection number) and intelligent scheduling algorithms based on deep reinforcement learning (DRL).

[0004] Although traditional algorithms are simple, they struggle to cope with the highly heterogeneous task dependency relationships in the tourism and travel industry (for example, text must be generated first, then the lip movements are driven, and finally the video is rendered). Existing scheduling methods based on deep reinforcement learning (such as DQN and PPO) have some adaptive ability, but they are mostly designed for general cloud computing tasks. These methods usually focus only on CPU / memory utilization or task completion time, ignoring the "GPU memory bottleneck" problem unique to large model inference. Specifically, the existing technology has the following shortcomings:

[0005] (1) Lack of GPU memory awareness: Large model inference requires contiguous high-speed GPU memory (VRAM), and the growth of the KV-Cache (key-value cache) is dynamic. Existing schedulers cannot estimate GPU memory fragmentation and are prone to allocating long-text tasks to nodes with severe GPU memory fragmentation, resulting in OOM (out-of-memory) errors on the GPU during inference.

[0006] (2) Ignoring modal differences: Existing methods usually treat all tasks as homogeneous computing packages, ignoring the huge differences in hardware requirements between "text-to-image" (computationally intensive) and "text-to-text" (memory-intensive), leading to mismatched resource allocation.

[0007] (3) Lack of security constraints: Existing end-to-end reinforcement learning models are prone to output illegal actions (such as assigning to fully loaded nodes) during the initial exploration period, leading to service avalanches.

SUMMARY

[0008] This invention proposes a memory-aware dynamic task scheduling method for large-scale multimodal models in the cultural and tourism industry. It aims to solve the problems of memory overflow (OOM), high inference latency, and low resource utilization in large-scale model inference clusters caused by the lack of memory awareness, neglect of modal differences, and lack of security constraints in traditional cloud computing task scheduling technology. The method of this invention adopts an "edge-cloud" collaborative architecture, extracts the memory supply and demand features of tasks from heterogeneous graphs through graph attention networks, and combines deep reinforcement learning (DRL) with action masking to generate an optimal scheduling strategy to avoid memory overflow.

[0009] The technical solution adopted in this invention is as follows:

[0010] A memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism includes the following steps:

[0011] Step 1: Analyze the multimodal inference request for the cultural tourism big model and construct a task-resource heterogeneous graph containing task nodes, resource nodes and dependency edges; among them, the features of task nodes include input token length and modality type, and the features of resource nodes include memory fragmentation index and tensor core utilization.

[0012] Step 2: Use a multi-layer graph attention network to extract features from the task-resource heterogeneous graph and generate graph state features that integrate topological structure information and memory supply and demand relationship.

[0013] Step 3: Input the graph state features into the Proximal Policy Optimization (PPO) agent and, in combination with the action masking mechanism based on memory capacity, output through the policy network of the PPO agent the optimal task-node mapping action that avoids memory overflow.

[0014] Step 4: Compute the reward based on the time to first token (TTFT) corresponding to the optimal task-node mapping action and the cluster memory load variance, and update the model parameters for the cultural tourism big model through a closed reinforcement learning loop.

[0015] Furthermore, the task-resource heterogeneous graph is represented as: H=(O,V,C,E);

[0016] Where O represents the set of task nodes, and each task node is represented by a task feature vector. The representation includes: task type, input token length, modality type, modality embedding dimension, and estimated peak memory usage; the subscript i is used to identify the task node, and the subscript j is used to identify the feature dimension.

[0017] V represents a set of resource nodes, where each resource node is represented by a memory feature vector. The representation includes: current video memory utilization, maximum contiguous video memory block size, tensor core utilization, and the subscript k is used to identify resource nodes;

[0018] C represents the set of constraints for nodes (including resource and task nodes), which includes memory constraint information for resource nodes / task nodes. The constraints for task nodes include the estimated peak memory usage; the constraints for resource nodes include memory capacity, maximum contiguous memory block size, and current available memory.

[0019] E represents the set of connecting edges, where each weighted edge represents a potential mapping relationship between tasks and resources.

[0020] Furthermore, in step 2, before feature extraction, an attention mechanism is used to calculate the weight coefficient of each weighted edge in the task-resource heterogeneous graph:

[0021]

[0022] In this expression, a nonlinear activation function is applied; the attention mechanism network enters through its transpose (superscript T); the memory feature vector of the resource node and the task feature vector of the task node are each paired with a learnable parameter (the memory characteristic index and the task characteristic index, respectively), both optimized during model training; and the GPU memory fragmentation index of the resource node is included as a further term;

[0023] The weight coefficients $e_{ik}^{j}$ are then Softmax-normalized over the resource nodes to obtain the normalized weight coefficients $\alpha_{ik}^{j} = \exp(e_{ik}^{j}) / \sum_{k'=1}^{n} \exp(e_{ik'}^{j})$, where n is the number of resource nodes.

[0024] Furthermore, in step 2, feature extraction of the task-resource heterogeneous graph includes:

[0025] Based on the normalized weight coefficients $\alpha_{ik}^{j}$, the features of the resource nodes are aggregated by weighted summation to obtain the resource context vector $c_{i}^{j}$ of task node i in the j-th feature dimension: $c_{i}^{j} = \sum_{k=1}^{n} \alpha_{ik}^{j}\,\phi(v_{k})$; where $\phi(\cdot)$ is a linear transformation function that projects the resource feature vector $v_{k}$ into a higher-dimensional space, j ranges over the feature dimensions, and n is the number of resource nodes;

[0026] A multi-layer graph attention network is used to extract the task state embedding vector of each task node. For any task node i, the task state embedding vector output by a given layer is computed from the task state embedding vector of the previous layer (whose initial value, at layer 0, is the task feature vector) and the aggregated resource context, using the weight matrix of that layer, a concatenation operation, an aggregate function, and an activation function;

[0027] Let L be the number of layers in the multi-layer graph attention network. After passing through the L layers, the final task state embedding vector of task node i is obtained; global aggregation over the final embeddings of all task nodes yields the graph state features of the task-resource heterogeneous graph.

[0028] Furthermore, step 3 includes:

[0029] The policy network takes the graph state features as input and, after nonlinear mapping, outputs a probability distribution matrix $P$ of resource node scheduling actions, where each element $p_{ik}$ is the probability of the scheduling action that assigns the i-th task node to the k-th resource node;

[0030] Based on the current remaining video memory of each resource node and the estimated memory of each task, a mask matrix $M$ with the same dimensions as $P$ is generated: if the remaining video memory of resource node k is less than the estimated memory of the i-th task node, the mask element $M_{ik}$ for that resource node and task node is set to 0; otherwise it is set to 1;

[0031] The Hadamard product of the mask matrix $M$ and the matrix $P$ gives the probability distribution matrix of legitimate resource node scheduling actions, $P' = M \odot P$. The optimal task-node mapping action is then output based on this matrix.

[0032] Furthermore, the optimal task-node mapping action is determined as follows: in the matrix $P'$, the scheduling action probabilities of all task nodes are traversed for each resource node, and the task node with the highest probability is selected as the allocation object for that resource node.

[0033] Furthermore, in step 4, the reward adopts a multi-objective hybrid reward function, the expression of which is:

[0034]

[0035] where the first term is the time to first token (TTFT), the second term is the standard deviation of the memory utilization of the resource nodes, and the third term is the penalty for data migration across resource nodes, which is positively correlated with the amount of data migrated.

[0036] Furthermore, the cross-resource-node data migration penalty is calculated as:

[0037]

[0038] where the coefficients are preset hyperparameters and the remaining variable is the size of the data migrated across resource nodes.

[0039] The technical solution provided by this invention brings at least the following beneficial effects:

[0040] This invention introduces a Memory-Aware Graph Attention Network (GAT), a safety masking mechanism, and a multi-objective hybrid reward function based on the time to first token (TTFT). It can effectively prevent service crashes caused by memory fragmentation in complex cultural and tourism interaction scenarios, and improves the inference response speed of large models and the utilization rate of cluster resources through a multi-stage reinforcement learning mechanism.

[0041] First, in the task-resource heterogeneous graph modeling stage, this invention overcomes the limitation of traditional schedulers that only focus on CPU / GPU utilization. It introduces memory-specific features such as "estimated peak VRAM", "memory fragmentation index", and "maximum contiguous memory block size", enabling the scheduler to perceive changes in memory demand caused by the dynamic growth of the key-value cache. Second, in the decision generation stage, the action masking mechanism applies a hard constraint: scheduling actions that do not meet memory constraints are masked before the decision network (Actor network) produces its output, fundamentally preventing the system avalanche problem caused by illegal actions output by the end-to-end reinforcement learning model in the early exploration stage. Finally, in the design of the reward function, this invention abandons the average completion time (Makespan) metric used in general cloud computing and instead adopts the time to first token (TTFT), which is directly related to user experience, as the main optimization objective. It also introduces a cross-node data migration penalty term, reflecting the engineering reality of large key-value cache volume and high migration cost in large model inference.

BRIEF DESCRIPTION OF THE DRAWINGS

[0042] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0043] Figure 1 is a flowchart of the memory-aware dynamic task scheduling method for a multimodal large model of cultural tourism provided in this embodiment of the invention.

[0044] Figure 2 is a schematic diagram illustrating the implementation process of an embodiment.

DETAILED DESCRIPTION

[0045] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be described in detail and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Generally, the components of the embodiments of the present invention described and shown in the accompanying drawings can be arranged and designed using different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present invention.

[0046] This embodiment provides a memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism, which includes the following steps:

[0047] First, the multimodal inference requests for the cultural tourism big model are analyzed and a task-resource heterogeneous graph containing task nodes, resource nodes and dependency edges is constructed. The features of task nodes include the input token length and the modality type, and the features of resource nodes include the memory fragmentation index and tensor core utilization.

[0048] Secondly, a graph attention network is used to extract features from the task-resource heterogeneous graph, generating graph state features that integrate topological information and memory supply and demand relationships.

[0049] Subsequently, the graph state features are input into the Proximal Policy Optimization (PPO) agent. Combined with the action mask mechanism based on memory capacity, the policy network of the PPO agent outputs the optimal task-node mapping action to avoid memory overflow.

[0050] Finally, the reward is computed based on the time to first token (TTFT) corresponding to the optimal task-node mapping action and the cluster memory load variance, and the model parameters for the cultural tourism big model are updated through a closed-loop reinforcement learning process.

[0051] The method proposed in this embodiment can effectively solve the problems of large differences in memory consumption and large fluctuations in concurrent traffic for multimodal tasks in cultural tourism scenarios. While ensuring the balanced memory load of the computing cluster, it can significantly reduce the inference latency of large models and improve the interactive experience of tourists.

[0052] In one embodiment, referring to Figure 1 and Figure 2, the specific implementation steps of the memory-aware dynamic task scheduling method for a multimodal large model of cultural tourism provided in this embodiment include:

[0053] Step 1: Construct a task-resource heterogeneous graph for large model inference

[0054] The computational environment of the target scenic area is modeled as a task-resource heterogeneous graph H=(O,V,C,E), with a special reconstruction tailored to the inference characteristics of large models:

[0055] (1.1) Task node set (O): Represents the large model inference subtasks to be processed (e.g., the "CLIP (contrastive language-image pre-training) encoding task" of text-to-image generation, the "Prefill task" of text-to-text generation, and the "speech synthesis task" of the digital human).

[0056] The task feature vector is constructed to include not only the task type but also the "input token sequence length," "modal embedding dimension," and "peak VRAM estimation." The subscript 'i' identifies the task node, i.e., the task index; the subscript 'j' identifies the feature dimension.

[0057] (1.2) Resource node set (V): represents heterogeneous accelerator cards (such as A100, 4090, etc.) in the computing cluster.

[0058] The memory feature vector is constructed to include "current memory utilization", "maximum contiguous memory block size (used to assess fragmentation level)" and "Tensor Core utilization". The subscript 'k' identifies the resource node. Tensor Core utilization refers to the percentage of time that the GPU's (Graphics Processing Unit) Tensor Cores are actually active within a specified time period.

[0059] (1.3) Node constraint set (C): contains video memory constraint information. The constraints of task nodes include the estimated peak video memory; the constraints of resource nodes include: video memory capacity, maximum contiguous video memory block size, and current available video memory.

[0060] (1.4) Connection edge set (E): Represents the potential mapping relationship between tasks and resources, as well as the logical dependencies between tasks (such as the transmission path of KV-Cache).
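The four components of H=(O,V,C,E) described above can be sketched as plain data structures. This is an illustrative sketch only: the class names, field names, units (GB) and the sample values are assumptions introduced here, not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:                      # element of O
    task_id: str
    task_type: str                   # e.g. "prefill", "clip_encode", "speech_synthesis"
    modality: str                    # e.g. "text", "image", "speech"
    input_tokens: int
    embed_dim: int
    peak_vram_gb: float              # estimated peak video memory (task-side constraint)


@dataclass
class ResourceNode:                  # element of V
    node_id: str
    vram_capacity_gb: float
    vram_free_gb: float
    max_contig_gb: float             # largest contiguous free block
    tensor_core_util: float          # fraction of time Tensor Cores are active

    @property
    def vram_utilization(self) -> float:
        return 1.0 - self.vram_free_gb / self.vram_capacity_gb


@dataclass
class HeteroGraph:
    tasks: dict = field(default_factory=dict)            # O
    resources: dict = field(default_factory=dict)        # V
    mapping_edges: set = field(default_factory=set)      # E: (task_id, node_id)
    dependency_edges: set = field(default_factory=set)   # E: (upstream, downstream)

    def add_resource(self, r: ResourceNode) -> None:
        self.resources[r.node_id] = r
        # every existing task gets a potential mapping edge to the new node
        self.mapping_edges.update((t, r.node_id) for t in self.tasks)

    def add_task(self, t: TaskNode, depends_on=()) -> None:
        self.tasks[t.task_id] = t
        self.mapping_edges.update((t.task_id, r) for r in self.resources)
        self.dependency_edges.update((up, t.task_id) for up in depends_on)

    def constraints(self) -> dict:                       # C
        return {
            "task_peak_gb": {i: t.peak_vram_gb for i, t in self.tasks.items()},
            "node_limits": {
                k: (r.vram_capacity_gb, r.max_contig_gb, r.vram_free_gb)
                for k, r in self.resources.items()
            },
        }


# Hypothetical sample data
g = HeteroGraph()
g.add_resource(ResourceNode("gpu-1", 24.0, 12.0, 2.0, 0.15))
g.add_resource(ResourceNode("gpu-3", 24.0, 10.0, 10.0, 0.60))
g.add_task(TaskNode("guide-text", "prefill", "text", 128, 4096, 1.5))
g.add_task(TaskNode("guide-audio", "speech_synthesis", "speech", 128, 1024, 3.0),
           depends_on=["guide-text"])
```

Keeping mapping edges and dependency edges in separate sets is a design choice made here for readability; the text only states that E contains both kinds of relationship.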

[0061] After constructing the task-resource heterogeneous graph, the original feature vectors of tasks and resources (token length, peak memory usage, fragmentation index, etc.) have been obtained. However, these features are isolated node information and cannot reflect the "matching relationship between tasks and resources" and the "mutual constraints between heterogeneous nodes". In order to integrate these isolated features into a global state representation that can guide scheduling decisions, it is necessary to use a graph attention network for message passing and feature aggregation, which leads to step 2.

[0062] Step 2: State embedding based on Memory-Aware Graph Attention Network (GAT), that is, using the message passing mechanism of the graph attention network to aggregate memory constraint information into node features:

[0063] (2.1) Resource bottleneck identification: Calculate the "memory carrying capacity score" of each computing node. An improved graph attention mechanism is used to calculate the weight coefficient of each weighted edge in the connection edge set E. The coefficient depends not only on the matching degree of computing power but also includes a memory fragmentation penalty term.

[0064] Its expression is:

[0065]

[0066] Here, the weighting coefficient characterizes the correlation strength between task node i and resource node k in feature dimension j, and measures how well the resource node suits a specific dimension of the task. The expression involves a nonlinear activation function; an attention mechanism network, which enters through its transpose (superscript T); the memory feature vector and the task feature vector; the memory characteristic index and the task characteristic index, which are learnable parameters optimized during model training; and the memory fragmentation index.

[0067] The memory fragmentation index is a non-parametric scalar that is updated in real time from the current memory status of the resource node and is calculated using the following formula:

[0068]

[0069] At the low end of its value range, the video memory has good continuity and no fragmentation; at the high end, the video memory is extremely fragmented and the usable contiguous blocks are very small.
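The formula for the fragmentation index is not reproduced in this text. One common definition that behaves as described (near 0 when free memory is contiguous, near 1 when it is split into many small blocks) is one minus the ratio of the largest contiguous free block to the total free memory. The sketch below assumes that definition; the function name and the block-list input are hypothetical.

```python
def fragmentation_index(free_blocks_mb):
    """Assumed definition: 1 - (largest contiguous free block / total free memory).

    free_blocks_mb: sizes of the free contiguous blocks on one GPU, in MB.
    Returns a value in [0, 1): 0 means all free memory is one block.
    """
    total_free = sum(free_blocks_mb)
    if total_free <= 0:
        # No free memory at all: nothing to fragment. A capacity check,
        # not this index, has to reject such a node.
        return 0.0
    return 1.0 - max(free_blocks_mb) / total_free


contiguous = fragmentation_index([8192])          # one 8 GB block
shattered = fragmentation_index([1024] * 12)      # twelve 1 GB blocks
```

Under this assumed definition a node with 12 GB free in twelve 1 GB pieces scores about 0.92, which matches the intuition that it cannot serve a request needing a large contiguous allocation.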

[0070] After the weight coefficients $e_{ik}^{j}$ are calculated, Softmax normalization is performed over all resource nodes to obtain the attention weight distribution of task i over the resource nodes in the j-th feature dimension. The larger a normalized weight $\alpha_{ik}^{j}$, the more noteworthy the corresponding mapping relationship. With n denoting the number of resource nodes, the formula is:

[0071] $\alpha_{ik}^{j} = \dfrac{\exp(e_{ik}^{j})}{\sum_{k'=1}^{n} \exp(e_{ik'}^{j})}$

[0072] (2.2) Context-dependent aggregation: based on the normalized weights $\alpha_{ik}^{j}$, the features of the resource nodes are aggregated by weighted summation to obtain the resource context vector $c_{i}^{j}$ of task i in the j-th feature dimension:

[0073] $c_{i}^{j} = \sum_{k=1}^{n} \alpha_{ik}^{j}\,\phi(v_{k})$

[0074] where $\phi(\cdot)$ is a linear transformation function that projects the resource feature vector $v_{k}$ into a higher-dimensional space, and n is the number of resource nodes.
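The scoring, normalization and aggregation of steps (2.1) and (2.2) can be sketched in plain Python. The Softmax and the weighted sum follow the description directly. The raw score, however, is an assumption: the text only says it combines an activation, an attention vector, the two projected feature vectors and a fragmentation penalty, so the form used here (LeakyReLU of the attention vector applied to the concatenated projections, minus the fragmentation index) and all names and sample numbers are illustrative.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def raw_score(a, W_task, W_mem, task_vec, mem_vec, frag):
    # ASSUMED form: LeakyReLU(a^T [W_task * task || W_mem * mem]) - fragmentation
    z = matvec(W_task, task_vec) + matvec(W_mem, mem_vec)   # list concat = "||"
    return leaky_relu(sum(ai * zi for ai, zi in zip(a, z))) - frag


def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resource_context(alpha, projected_mem):
    # weighted sum over resource nodes of the projected memory features
    dim = len(projected_mem[0])
    return [sum(a * p[d] for a, p in zip(alpha, projected_mem)) for d in range(dim)]


# Hypothetical numbers: two GPUs with identical features, different fragmentation
identity = [[1.0, 0.0], [0.0, 1.0]]
a = [0.1, 0.1, 0.1, 0.1]
task_vec = [1.0, 0.5]
mem_vecs = [[0.5, 1.0], [0.5, 1.0]]
frags = [0.1, 0.9]

scores = [raw_score(a, identity, identity, task_vec, v, f)
          for v, f in zip(mem_vecs, frags)]
alpha = softmax(scores)
context = resource_context(alpha, [matvec(identity, v) for v in mem_vecs])
```

With identical memory features, the only difference between the two nodes is the fragmentation term, so the less fragmented node receives the larger attention weight.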

[0075] (2.3) Task state embedding vector update

[0076] In this embodiment, an L-layer GAT is used. In any j-th (j=1,…,L) layer, the features of the task node itself (initially the task feature vector) and the aggregated resource context vector are merged to generate a new task state embedding:

[0077]

[0078] Here, the inputs are the task state embedding vector of the (j-1)th layer, whose initial value (at layer 0) is the task feature vector; the activation function (such as ReLU); and the concatenation operation.

[0079] Through multiple iterations of GAT, the task state embedding vector of the task node gradually incorporates the memory constraint information of the resources, enabling the PPO agent to see the complete memory supply and demand relationship.

[0080] (2.4) After L layers of GAT, the final task state embedding vector of task node i is obtained. The final task state embeddings of all task nodes are globally aggregated (e.g., summed or averaged) to obtain the graph state features of the entire task-resource heterogeneous graph:

[0081]

[0082] The resulting graph state features represent the current complete supply and demand situation for video memory.
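Steps (2.3) and (2.4) can be sketched as follows. The update rule assumed here is ReLU applied to a layer weight matrix acting on the concatenation of the previous embedding and the resource context vector, and the global aggregation is taken as a mean; both choices, along with the function names and sample numbers, are assumptions consistent with the description rather than the patent's exact formulas.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def update_embedding(W_layer, h_prev, context):
    # ASSUMED form: h_new = ReLU(W_layer @ [h_prev || context])
    return relu(matvec(W_layer, h_prev + context))   # list concat = "||"


def graph_state(final_embeddings):
    # Global aggregation by averaging (summation is the other option named in the text)
    n = len(final_embeddings)
    dim = len(final_embeddings[0])
    return [sum(h[d] for h in final_embeddings) / n for d in range(dim)]


# Hypothetical single-layer example with 2-dimensional embeddings
W1 = [[1.0, 0.0, 1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0]]
h_task_a = update_embedding(W1, [1.0, 2.0], [0.5, 3.0])
h_task_b = update_embedding(W1, [0.0, 2.5], [0.5, 0.5])
state = graph_state([h_task_a, h_task_b])
```

Stacking L such layers only requires calling `update_embedding` repeatedly with each layer's own weight matrix and a freshly aggregated context vector.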

[0083] Step 3: Dynamic policy generation based on Masked-PPO, which uses the Proximal Policy Optimization (PPO) algorithm as the decision-making core and introduces a security constraint mechanism:

[0084] (3.1) Actor network: the graph state features are input to a nonlinear mapping neural network (such as a multilayer perceptron, MLP), which outputs the probability distribution of resource node scheduling actions. The corresponding formula is as follows:

[0085]

[0086] where each element $p_{ik}$ of the probability distribution matrix $P$ represents the probability of assigning the task corresponding to the i-th task node to the k-th resource node.

[0087] (3.2) Memory safety mask: before the Actor network outputs an action, the remaining memory of each resource node and the estimated memory of each task are used to generate a mask matrix $M$ whose dimensions are consistent with those of the probability distribution matrix $P$ of resource node scheduling actions.

[0088] If the remaining memory of resource node k is less than the estimated memory of the i-th task node, the corresponding mask element $M_{ik}$ is set to 0; otherwise it is set to 1.

[0089] This approach forces the probability of illegal actions to zero, fundamentally eliminating memory overflow errors caused by improper scheduling.

[0090] Specifically, the Hadamard product of the mask matrix and the resource node scheduling action probability distribution matrix is taken. The processing formula is as follows:

[0091] $P' = M \odot P$

[0092] where $P'$ is the probability distribution matrix of legitimate resource node scheduling actions and $\odot$ denotes the Hadamard product.

[0093] Finally, the matrix $P'$ is traversed: for each resource node, the task node with the highest probability is selected and used as the allocation target for that resource node.
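The masking and selection of step 3 can be sketched in plain Python. The mask rule, the element-wise product and the per-resource-node selection follow the description; the function names and the sample probabilities and memory figures are hypothetical. The description does not say how ties are broken or what happens when the same task is the top choice of several nodes, so the sketch simply takes the first maximum and reports every node's pick.

```python
def memory_mask(task_need_gb, node_free_gb):
    # M[i][k] = 0 if node k has less free memory than task i needs, else 1
    return [[0 if free < need else 1 for free in node_free_gb]
            for need in task_need_gb]


def hadamard(A, B):
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def pick_per_node(P_legal):
    # For each resource node (column), choose the task (row) with the highest
    # legal probability; nodes whose column is all zeros get no task.
    picks = {}
    for k in range(len(P_legal[0])):
        column = [row[k] for row in P_legal]
        best = max(range(len(column)), key=column.__getitem__)
        if column[best] > 0:
            picks[k] = best
    return picks


# Hypothetical example: 2 tasks (rows) x 3 resource nodes (columns)
P = [[0.6, 0.3, 0.1],
     [0.1, 0.5, 0.4]]
need = [2.0, 11.0]          # estimated memory per task, GB
free = [12.0, 10.0, 24.0]   # remaining memory per node, GB

M = memory_mask(need, free)
P_legal = hadamard(M, P)
picks = pick_per_node(P_legal)
```

Here the second task's most probable node (index 1, with only 10 GB free) is zeroed out by the mask, so that action can never be emitted regardless of what the Actor network prefers.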

[0094] Step 4: Design of a multi-objective hybrid reward function

[0095] The Critic network evaluates the long-term reward of the current state, focusing on predicting the scarcity of memory resources in future time steps. To balance inference speed and system stability, the following reward function is designed:

[0096]

[0097] where:

[0098] TTFT (Time to First Token): the latency until the large model outputs its first token, which is directly related to the visitor's interactive experience.

[0099] Load-balance term: the standard deviation of memory utilization across all resource nodes. The smaller this value, the more balanced the memory load, which prevents single-point memory exhaustion.

[0100] Migration term: the cross-node data migration penalty. Given the large size of the KV-Cache in large models, frequent migrations can severely slow down inference; therefore, a penalty is imposed.

[0101] In this embodiment, the cross-node data migration penalty is calculated as follows:

[0102]

[0103] where the coefficients are hyperparameters, i.e., preset values, and the remaining variable is the size of the data migrated across nodes.
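The reward formulas themselves are not reproduced in this text, so the following is only a sketch of one reward with the three described ingredients. Everything specific here is an assumption: the negative weighted sum, the weight names `w_ttft`, `w_bal`, `w_mig`, and the power-law migration penalty `alpha * size ** beta` (chosen only because it has two preset hyperparameters and grows with the migrated data size).

```python
import statistics


def migration_penalty(migrated_gb, alpha=0.1, beta=1.0):
    # ASSUMED form: grows monotonically with the amount of migrated data
    return alpha * migrated_gb ** beta


def reward(ttft_s, mem_utils, migrated_gb, w_ttft=1.0, w_bal=1.0, w_mig=1.0):
    """ASSUMED form: R = -(w_ttft*TTFT + w_bal*std(mem util) + w_mig*penalty)."""
    sigma = statistics.pstdev(mem_utils)   # population std. dev. across nodes
    return -(w_ttft * ttft_s + w_bal * sigma + w_mig * migration_penalty(migrated_gb))


# Hypothetical comparisons
balanced = reward(0.5, [0.5, 0.5, 0.5], 0.0)
skewed = reward(0.5, [0.1, 0.9], 0.0)
with_migration = reward(0.5, [0.5, 0.5], 4.0)
```

Whatever the exact functional form, the qualitative behavior described in the text is preserved: lower TTFT, more even memory utilization and less KV-Cache migration all increase the reward.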

[0104] Step 5: Model Training and Hot-Swap Adaptation

[0105] Offline pre-training: Simulation training is performed using historical multimodal request logs from the scenic area.

[0106] Dynamic node adaptation: When a scenic area temporarily adds computing nodes (such as renting cloud computing power), the inductive learning ability of graph attention networks can be used to generate embedding vectors directly based on the memory features of the new nodes without retraining the network, thus realizing hot-swappable scheduling of computing resources.

[0107] Example

[0108] Taking a scenic area's service interaction system as an example, the system simultaneously receives concurrent requests from three tourists:

[0109] (1) Tourist A (voice interaction): Asks "What is the sunrise time at a certain scenic spot today?"

[0110] Task characteristics: It is a memory-bound task with a short input token, but it relies on knowledge base retrieval and requires fast context loading.

[0111] (2) Visitor B (image recognition): Uploads a photo of a plant and asks for it to be identified.

[0112] Task characteristics: It is a computationally intensive task that relies on a visual encoder and has high requirements for video memory bandwidth.

[0113] (3) Visitor C (Video Generation): Requests the generation of a 3D guided video of a specified "attraction".

[0114] Task characteristics: It is a memory-sensitive task, with large memory usage and long duration.

[0115] The specific implementation process for the above tasks includes:

[0116] 1. State Awareness and Graph Construction: The system breaks down these three requests into multiple subtasks and, in conjunction with the current status of the five background GPU servers (including different models), constructs a real-time heterogeneous graph.

[0117] Key actions: The system collects the remaining video memory and video memory fragmentation index of each server in real time. For example, server 1 has 12GB of remaining video memory but is severely fragmented; server 3 has 10GB of remaining video memory but it consists of large, contiguous blocks.

[0118] 2. Safety masking: Before a decision is made, the memory safety mask is applied.

[0119] For visitor C's video generation task (estimated to require 11GB of video memory), the masking mechanism sets the action probabilities of servers 1 and 3 to zero, making them unselectable: server 1 is too fragmented to allocate a large contiguous block of video memory, and server 3 has only 10GB of remaining video memory, less than the 11GB required. This prevents out-of-memory (OOM) crashes caused by blind scheduling.
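The masking decision for visitor C can be sketched as follows. This is a minimal illustration under stated assumptions: the contiguous-block check is added here to reproduce the exclusion of server 1 (the mask defined in claim 1 compares only remaining video memory against the task's estimated memory), and every number except server 1's 12GB, server 3's 10GB and the 11GB estimate is hypothetical. The function name `memory_mask` is also hypothetical.

```python
def memory_mask(est_gb, nodes):
    """Return {node: 1 or 0}; 0 marks a node that must not be selected.

    nodes maps a node name to (free_gb, largest_contiguous_block_gb).
    """
    mask = {}
    for name, (free_gb, largest_block_gb) in nodes.items():
        enough_free = free_gb >= est_gb                 # capacity check, as in claim 1
        enough_contiguous = largest_block_gb >= est_gb  # assumed extra check for fragmentation
        mask[name] = 1 if (enough_free and enough_contiguous) else 0
    return mask


nodes = {
    "server1": (12.0, 3.0),   # 12GB free, severely fragmented (block size hypothetical)
    "server2": (16.0, 14.0),  # hypothetical
    "server3": (10.0, 10.0),  # 10GB free, contiguous
    "server4": (24.0, 24.0),  # hypothetical
    "server5": (12.0, 12.0),  # hypothetical
}

mask_for_c = memory_mask(11.0, nodes)
print(mask_for_c)
```

With these values, only servers 1 and 3 are masked out for the 11GB task, matching the example.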

[0120] 3. Intelligent decision-making and dispatch: Based on the features extracted by the graph attention network and the action space remaining after mask filtering, the PPO model outputs the following strategy:

[0121] (1) Visitor A -> Server 1: Although server 1's video memory is heavily fragmented, text tasks need only a small amount of non-contiguous video memory, and server 1's Tensor Cores are relatively idle, which satisfies the low time-to-first-token (TTFT) requirement.

[0122] (2) Visitor B -> Server 2: Server 2 has high-bandwidth video memory, which suits visual encoding tasks.

[0123] (3) Visitor C -> Server 4 (high-performance node): The system recognizes that although server 4 currently has queued tasks, its video memory is unfragmented and it qualifies for KV-Cache reuse (a video of a similar scene was generated on it shortly before). Scheduling to this server can reduce the model weight loading time.
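The way the mask restricts the policy output can be sketched in plain Python. This is an illustrative sketch, not the method's implementation: the logits are hypothetical values standing in for the policy network's output for visitor C, and renormalizing after the element-wise product is an assumption (claim 5 specifies only the Hadamard product).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_policy(logits, mask):
    """Zero out masked nodes (element-wise product), then renormalize (assumed step)."""
    probs = softmax(logits)
    masked = [p * m for p, m in zip(probs, mask)]
    total = sum(masked)
    if total == 0:
        raise ValueError("no feasible node for this task")
    return [p / total for p in masked]


logits_c = [2.0, 0.5, 1.5, 1.0, 0.2]  # hypothetical policy outputs for servers 1..5
mask_c = [0, 1, 0, 1, 1]              # servers 1 and 3 blocked for visitor C
probs_c = masked_policy(logits_c, mask_c)

# Greedy choice: index of the most probable feasible server.
best = max(range(len(probs_c)), key=probs_c.__getitem__)
print("chosen server:", best + 1)
```

Note that server 1 has the highest raw logit in this hypothetical input, yet it can never be chosen, because its masked probability is exactly zero.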

[0124] 4. Execution feedback and reward calculation: After the task is completed, the system records the actual metrics:

[0125] (1) The time to first token (TTFT) and the peak memory usage are calculated.

[0126] (2) If the scheduling strategy reduces the fluctuation in video memory usage (that is, the standard deviation of the memory load across the cluster decreases), the model is given a high positive reward to strengthen its "memory awareness".
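A reward of this kind can be sketched as below. The weighted negative sum, the weights, and the linear migration penalty are all assumptions made for illustration; the text states only that the reward depends on the TTFT, the standard deviation of memory utilization across resource nodes, and a migration penalty that grows with the amount of data migrated.

```python
from statistics import pstdev


def reward(ttft_s, mem_util, migrated_gb, w_ttft=1.0, w_std=1.0, w_mig=0.1):
    """Assumed form: higher reward for lower TTFT, more even memory load, less migration.

    ttft_s      -- time to first token, in seconds
    mem_util    -- memory utilization (0..1) of each resource node
    migrated_gb -- data moved across resource nodes, in GB
    """
    load_spread = pstdev(mem_util)      # population standard deviation across nodes
    migration_penalty = w_mig * migrated_gb
    return -(w_ttft * ttft_s + w_std * load_spread + migration_penalty)


# Same TTFT and no migration; only the evenness of the memory load differs.
balanced = reward(0.4, [0.5, 0.5, 0.5, 0.5, 0.5], 0.0)
skewed = reward(0.4, [0.9, 0.1, 0.9, 0.1, 0.5], 0.0)
print(balanced > skewed)
```

In this sketch, the evenly loaded cluster earns the higher reward, which is the signal that pushes the policy toward memory-balanced placements.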

[0127] 5. Dynamic expansion scenario:

[0128] When a sudden surge in visitors to a scenic area (such as during holidays) necessitates the temporary connection of a new leased server (node 6), thanks to GAT's inductive learning capability, the model does not need to be retrained. It can directly read the memory specifications of node 6 and incorporate it into the scheduling network, achieving seamless hot-swapping of computing power.
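The inductive property can be illustrated with a deliberately simplified sketch: an embedding is computed from a node's memory features using shared weights, so a node that was never seen in training can be embedded without changing those weights. The weight values, the feature layout (free-memory fraction, fragmentation index, tensor core utilization) and the `embed` helper are hypothetical, and the neighbor attention of a real graph attention network is omitted.

```python
def embed(features, weight):
    """Project a node's feature vector with shared weights, then apply ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weight]


# Hypothetical weights learned on the original five nodes (2 outputs x 3 features).
W = [[0.5, -1.0, 0.2],
     [0.1, 0.3, -0.4]]

# Features: [free-memory fraction, fragmentation index, tensor core utilization].
nodes = {
    "node1": [0.5, 0.75, 0.1],
    "node3": [0.4, 0.0, 0.6],
}

# A newly leased server joins; only its features are read, W is left untouched.
nodes["node6"] = [1.0, 0.0, 0.0]

embeddings = {name: embed(feats, W) for name, feats in nodes.items()}
print(embeddings["node6"])
```

Because the embedding depends on features rather than on node identity, adding node 6 requires no retraining in this sketch; only the graph and the action space grow by one node.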

[0129] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

[0130] The above descriptions are merely some embodiments of the present invention. Those skilled in the art can make various modifications and improvements without departing from the inventive concept of the present invention, and these all fall within the scope of protection of the present invention.

Claims

1. A memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism, characterized in that it comprises the following steps:

Step 1: Parse the multimodal inference requests for the cultural tourism large model and construct a task-resource heterogeneous graph containing task nodes, resource nodes and dependency edges, wherein the features of the task nodes include input token length, modality type and estimated peak memory, and the features of the resource nodes include memory fragmentation index and tensor core utilization.

Step 2: Use a multi-layer graph attention network to extract features from the task-resource heterogeneous graph and generate graph state features that integrate topological structure information and the memory supply-demand relationship.

Step 3: Input the graph state features into a proximal policy optimization (PPO) agent and, in combination with an action masking mechanism based on video memory capacity, output through the policy network of the PPO agent the optimal task-node mapping action that avoids memory overflow. The action masking mechanism based on video memory capacity is as follows: based on the current remaining video memory of each resource node and the estimated memory of each task, generate a mask matrix with the same dimensions as the resource node scheduling action probability distribution matrix; if the remaining video memory of the k-th resource node is less than the estimated memory of the i-th task node, the mask entry for that resource node and task node is set to 0, and otherwise it is set to 1; each element of the probability distribution matrix characterizes the probability of the scheduling action between a resource node and a task node.

Step 4: Calculate the reward based on the time to first token, the cluster memory load standard deviation and the cross-resource-node data migration penalty corresponding to the optimal task-node mapping action, and update the model parameters for the cultural tourism large model through a reinforcement learning closed loop. The reward employs a multi-objective hybrid reward function [formula not reproduced] whose terms are: the time to first token; the standard deviation of the memory utilization of the resource nodes; and the cross-resource-node data migration penalty, which is positively correlated with the amount of data migrated across resource nodes.

2. The memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism as described in claim 1, characterized in that the task-resource heterogeneous graph is represented as H=(O, V, C, E), where:

O represents the set of task nodes, each task node being represented by a task feature vector that includes: task type, input token length, modality type, modality embedding dimension and estimated peak memory usage; the subscript i identifies the task node and the subscript j identifies the feature dimension;

V represents the set of resource nodes, each resource node being represented by a memory feature vector that includes: current video memory utilization, maximum contiguous video memory block size and tensor core utilization; the subscript k identifies the resource node;

C represents the set of node constraints, comprising memory constraint information for resource nodes and task nodes; the constraints for task nodes include the estimated peak memory usage, and the constraints for resource nodes include memory capacity, maximum contiguous memory block size and current available memory;

E represents the set of connecting edges, where each weighted edge represents a potential mapping relationship between a task and a resource.

3. The memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism as described in claim 1, characterized in that, in Step 2, before feature extraction, an attention mechanism is used to calculate the weight coefficient of each weighted edge in the task-resource heterogeneous graph [formula not reproduced], the terms of which are: a non-linear activation function; the attention mechanism network, where the superscript T denotes the transpose operation; the memory feature index; the memory feature vector; the task feature vector; the task feature index; and the memory fragmentation index. The memory feature index and the task feature index are learnable parameters that are optimized during model training. The video memory fragmentation index is given by a separate expression [formula not reproduced]. Softmax normalization is then applied to the weight coefficients to obtain the normalized weight coefficients.

4. The memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism as described in claim 3, characterized in that, in Step 2, feature extraction from the task-resource heterogeneous graph includes:

using the normalized weight coefficients to perform weighted aggregation of the features of the resource nodes, obtaining the resource context vector of task node i in the j-th feature dimension [formula not reproduced], the terms of which include a linear transformation function that projects the memory feature vector into a higher-dimensional space, the number of feature dimensions, and n, the number of resource nodes;

using a multi-layer graph attention network to extract the task state embedding vector of each task node, wherein the task state embedding vector output for any task node by a given layer of the multi-layer graph attention network is given by [formula not reproduced], the terms of which are: the layer-wise task state embedding vector and its initial value; an activation function; a concatenation operation; an aggregation function; and the weight matrix of the layer;

letting L be the number of layers in the multi-layer graph attention network, obtaining the final task state embedding vector of task node i after L layers [formula not reproduced]; and

performing global aggregation over all task nodes to obtain the graph state features of the task-resource heterogeneous graph [formula not reproduced].

5. The memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism as described in claim 1, characterized in that Step 3 includes: the policy network takes the graph state features as input and, after nonlinear mapping, outputs the resource node scheduling action probability distribution matrix; the Hadamard product of the mask matrix and this probability distribution matrix is computed to obtain the probability distribution matrix of legal resource node scheduling actions; and the optimal task-node mapping action is then output based on that matrix.

6. The memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism as described in claim 5, characterized in that the optimal task-node mapping action is determined as follows: in the scheduling action probability matrix, the scheduling action probabilities of all task nodes are traversed for each resource node, and the task node with the highest probability is selected as the allocation object for the current resource node.

7. The memory-aware dynamic task scheduling method for a multimodal large-scale model of cultural tourism as described in claim 1, characterized in that the cross-resource-node data migration penalty is calculated by a formula [formula not reproduced] whose terms are: preset hyperparameters; and the size of the data migrated across resource nodes.