Resource allocation method and apparatus for terminal device, electronic device, and storage medium
By acquiring operational status information from edge devices and utilizing deep Q-networks to evaluate resource allocation actions, the problem of resource allocation mismatch in existing technologies is solved, thereby improving the adaptability and real-time performance of resource allocation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- RDA MICROELECTRONICS (SHANGHAI) CO LTD
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-16
Figure CN122220106A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a resource allocation method, apparatus, electronic device, and storage medium for a terminal device.
Background Technology
[0002] With the development of artificial intelligence, models are increasingly deployed on edge devices such as mobile terminals, embedded devices, and IoT devices. While performing intelligent computing tasks such as speech recognition, image processing, and natural language processing, these devices typically must also handle multiple business functions such as communication, display, media processing, data transmission, and interactive response.
[0003] To address the multi-task processing needs of edge devices, existing technologies typically employ methods such as preset priorities, fixed quota allocation, rule control, load monitoring and adjustment, and predictive scheduling based on historical data to manage the resource consumption of different services during operation. For example, processing order or usage ratios can be preset according to service categories and scheduled according to preset rules; relevant operating parameters can also be adjusted based on load changes during device operation; and historical operating data can be combined to estimate subsequent service demands and arrange the execution process of each service accordingly.
[0004] However, when edge devices run model tasks and other concurrent services simultaneously, the resource consumption relationship between different services changes over time. Existing technologies struggle to generate allocation results that match the current load in a timely manner, affecting the efficiency of coordinated processing of multiple services.
Summary of the Invention
[0005] This application provides a resource allocation method, apparatus, electronic device, and storage medium for a terminal device, which aims to improve the adaptability of resource coordination and allocation between model inference tasks and concurrent business tasks in the terminal device, and to improve the overall operating efficiency of the terminal device.
[0006] In a first aspect, embodiments of this application provide a resource allocation method for a terminal device, including:
[0007] Obtain the operating status information of the terminal device, and obtain the current status characteristics based on the operating status information; wherein, the operating status information includes device resource status information, task operating status information of the model inference task, and task operating status information of at least one concurrent business task;
[0008] Based on the current state characteristics, obtain, through a pre-trained action value evaluation model, the value evaluation result corresponding to each candidate resource allocation action in a preset candidate resource allocation action set;
[0009] Based on the value evaluation results corresponding to each of the candidate resource allocation actions, determine a target resource allocation action, and allocate and adjust the resources of the model inference task and each of the concurrent business tasks according to the target resource allocation action.
[0010] In one possible implementation, obtaining the operating status information of the terminal device and obtaining the current status characteristics based on the operating status information includes:
[0011] By using pre-set sensors and system logs, the device resource status information of the terminal device, the task running status information of the model inference task in the terminal device, and the task running status information of at least one concurrent business task in the terminal device are obtained.
[0012] Based on the device resource status information, the task execution status information of the model inference task, and the task execution status information of the concurrent business task, the current status features of the terminal device are obtained through feature extraction processing.
[0013] In one possible implementation, the action value evaluation model is a deep Q-network action value evaluation model;
[0014] Accordingly, based on the current state characteristics, the value evaluation results corresponding to each candidate resource allocation action in the preset candidate resource allocation action set are obtained through a pre-trained action value evaluation model, including:
[0015] Obtain a preset set of candidate resource allocation actions, and obtain a pre-trained deep Q-network action value evaluation model;
[0016] The current state features are input into the action value evaluation model, so that the action value evaluation model performs approximate calculations on the current state features and each candidate resource allocation action in the preset candidate resource allocation action set according to the state action value function, and outputs the value evaluation results corresponding to each candidate resource allocation action under the current state features.
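The evaluation step described above can be sketched as a single forward pass that returns one value per candidate action. This is only an illustration under stated assumptions: the two-layer network, its randomly initialized weights, the six-element state vector, and the action names in `CANDIDATE_ACTIONS` are all hypothetical and are not specified by this application.

```python
import random

# Hypothetical candidate action set; the application does not enumerate one.
CANDIDATE_ACTIONS = ["raise_npu_freq", "lower_npu_freq",
                     "grow_inference_memory", "keep_current"]
STATE_DIM = 6  # assumed length of the current state feature vector


def make_layer(n_in, n_out, rng):
    """Create one fully connected layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def q_values(state, layers):
    """Approximate Q(s, a) for every candidate action in one pass."""
    h = state
    for i, layer in enumerate(layers):
        h = dense(h, layer, relu=(i < len(layers) - 1))
    return h


rng = random.Random(0)
layers = [make_layer(STATE_DIM, 16, rng),
          make_layer(16, len(CANDIDATE_ACTIONS), rng)]
state = [0.72, 0.35, 0.10, 0.58, 0.41, 0.90]  # made-up normalized features
values = q_values(state, layers)
evaluation = dict(zip(CANDIDATE_ACTIONS, values))
```

Producing all action values from one pass over the state, rather than one pass per state-action pair, is the usual deep Q-network layout and keeps the per-cycle cost independent of how the actions are later compared.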
[0017] In one possible implementation, training the deep Q-network action value evaluation model includes:
[0018] Construct an initial deep Q-network action value evaluation model and a target network model corresponding to the initial deep Q-network action value evaluation model;
[0019] Acquire the current status characteristics of the terminal device, the resource allocation action, the reward value corresponding to the execution of the resource allocation action, and the running status information after the execution of the resource allocation action in multiple resource allocation cycles;
[0020] Based on the running status information after the resource allocation action is executed, the adjusted status features are generated;
[0021] The current state features, the resource allocation action, the reward value, and the adjusted state features are combined as an experience sample and stored in the sample library;
[0022] Training samples are randomly sampled from the sample library, and the current state features in the training samples are input into the initial deep Q network action value evaluation model to obtain the action value corresponding to each candidate resource allocation action under the current state features, and the current action value is determined based on the resource allocation action.
[0023] The adjusted state features from the training samples are input into the target network model to obtain the action value of each candidate resource allocation action in the preset candidate resource allocation action set under the adjusted state features.
[0024] The target action value is determined based on the reward value and the action value corresponding to each candidate resource allocation action;
[0025] A loss function is constructed based on the target action value and the current action value, and the initial deep Q-network action value evaluation model is trained by minimizing the loss function to obtain the pre-trained deep Q-network action value evaluation model.
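A minimal sketch of the sample library, the target action value, and the loss follows. It assumes the commonly used deep Q-network form, in which the target is the reward plus a discounted maximum of the target network's action values; the application itself does not state the discount factor or the exact combination rule, so `GAMMA`, the squared-error loss, and the toy `online_q` and `target_q` functions are assumptions made for illustration.

```python
import random
from collections import deque

GAMMA = 0.9  # assumed discount factor, not given in the application


def td_target(reward, next_q_values, gamma=GAMMA):
    """Target action value: reward plus discounted best next-state value."""
    return reward + gamma * max(next_q_values)


def batch_loss(batch, online_q, target_q):
    """Mean squared error between current and target action values."""
    total = 0.0
    for state, action, reward, next_state in batch:
        current = online_q(state)[action]          # value of the action taken
        target = td_target(reward, target_q(next_state))
        total += (target - current) ** 2
    return total / len(batch)


# Toy stand-ins for the evaluation model and the target network model.
def online_q(state):
    return [0.5 * state[0], 1.0 * state[0]]


def target_q(state):
    return [0.4 * state[0], 0.8 * state[0]]


# Sample library holding (state, action, reward, adjusted state) tuples.
sample_library = deque(maxlen=1000)
sample_library.append(([1.0], 1, 0.5, [2.0]))
sample_library.append(([2.0], 0, -0.2, [1.0]))

batch = random.sample(list(sample_library), k=2)
loss = batch_loss(batch, online_q, target_q)
```

In an actual training loop the loss would be minimized with a gradient-based optimizer over the evaluation model's parameters, while the target network model is held fixed between synchronizations.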
[0026] In one possible implementation, determining the target resource allocation action based on the value assessment results corresponding to each of the candidate resource allocation actions includes:
[0027] Based on a preset exploration probability, a candidate resource allocation action is randomly selected from the preset candidate resource allocation action set, and the randomly selected candidate resource allocation action is determined as the target resource allocation action;
[0028] Alternatively, the value assessment results corresponding to each of the candidate resource allocation actions can be compared, and the candidate resource allocation action with the highest value assessment result can be determined as the target resource allocation action.
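The two selection branches above correspond to what is usually called an epsilon-greedy rule. The sketch below assumes the value assessment results arrive as a list indexed by candidate action; the function name `select_action` and its parameters are hypothetical.

```python
import random


def select_action(value_results, exploration_prob, rng=random):
    """Pick a random action with probability exploration_prob, else the best."""
    if rng.random() < exploration_prob:
        # Exploration branch: any candidate action, chosen uniformly.
        return rng.randrange(len(value_results))
    # Exploitation branch: index of the highest value assessment result.
    return max(range(len(value_results)), key=value_results.__getitem__)


greedy_choice = select_action([0.1, 0.9, 0.3], exploration_prob=0.0)
```

A nonzero exploration probability lets the device occasionally try actions the model currently rates lower, which supplies fresh experience for later model updates.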
[0029] In one possible implementation, the step of allocating and adjusting the resources of the model inference task and each of the concurrent business tasks according to the target resource allocation action includes:
[0030] Based on the target resource allocation action, raise or lower the operating frequency of at least one heterogeneous computing resource among the processor, graphics processor, and neural network processor in the terminal device;
[0031] And/or, based on the target resource allocation action, increase or decrease the memory resources allocated to at least one of the model inference task and the concurrent business tasks;
[0032] And/or, based on the target resource allocation action, increase or decrease the network bandwidth allocated to at least one of the model inference task and the concurrent business tasks;
[0033] And/or, based on the target resource allocation action, raise or lower the execution priority of at least one of the model inference task and the concurrent business tasks.
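One way to picture these four kinds of adjustment is as signed deltas applied to a table of current allocations. The sketch below is an assumption-laden illustration: the `Action` type, the `LIMITS` ranges, the units, and the task and resource names are all hypothetical, and a real device would apply the change through platform-specific interfaces rather than a dictionary.

```python
from dataclasses import dataclass

# Assumed lower and upper bounds for each adjustable quantity.
LIMITS = {
    "freq": (200, 2000),        # MHz, per compute unit
    "memory": (64, 2048),       # MB, per task
    "bandwidth": (0, 10000),    # kbps, per task
    "priority": (0, 10),        # scheduler priority level, per task
}


@dataclass(frozen=True)
class Action:
    kind: str     # "freq", "memory", "bandwidth", or "priority"
    target: str   # compute unit or task the change applies to
    delta: int    # positive to raise or increase, negative to lower or decrease


def apply_action(allocation, action):
    """Return a new allocation with one quantity adjusted and clamped."""
    low, high = LIMITS[action.kind]
    updated = {kind: dict(values) for kind, values in allocation.items()}
    current = updated[action.kind][action.target]
    updated[action.kind][action.target] = min(high, max(low, current + action.delta))
    return updated


allocation = {
    "freq": {"cpu": 1200, "gpu": 600, "npu": 800},
    "memory": {"inference": 512, "video_call": 256},
    "bandwidth": {"inference": 500, "video_call": 2000},
    "priority": {"inference": 5, "video_call": 7},
}
after = apply_action(allocation, Action("freq", "npu", 300))
```

Returning a new allocation instead of mutating the old one keeps the pre-adjustment state available, which is convenient when the state before and after the action both need to be recorded.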
[0034] In one possible implementation, after adjusting the resource allocation for the model inference task and each of the concurrent business tasks according to the target resource allocation action, the method further includes:
[0035] Obtain runtime status information and runtime feedback information after the resource allocation action is executed;
[0036] Based on the operational feedback information, determine the reward value corresponding to the target resource allocation action, and obtain the adjusted state characteristics based on the operational state information after the resource allocation action is executed;
[0037] Based on the current state characteristics, the target resource allocation action, the reward value, and the adjusted state characteristics, a model update training sample is constructed, and the action value evaluation model is updated based on the model update training sample.
[0038] The updated model parameters of the action value evaluation model are copied to the target network model to update the target network model.
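The feedback-driven update can be sketched as follows. The application does not define how the reward is computed from the operational feedback or how often parameters are copied to the target network model, so the latency-budget reward in `reward_from_feedback`, the `sync_every` interval, and the `OnlineUpdater` class are purely hypothetical choices for illustration.

```python
import copy


def reward_from_feedback(inference_latency_ms, dropped_frames,
                         latency_budget_ms=50.0):
    """Hypothetical reward: zero when within budget, negative otherwise."""
    overrun = max(0.0, inference_latency_ms - latency_budget_ms) / latency_budget_ms
    return -(overrun + 0.1 * dropped_frames)


class OnlineUpdater:
    """Collects update samples and keeps a target copy of the parameters."""

    def __init__(self, online_params, sync_every=100):
        self.online_params = online_params
        self.target_params = copy.deepcopy(online_params)
        self.sync_every = sync_every   # assumed synchronization interval
        self.steps = 0
        self.samples = []

    def record(self, state, action, reward, adjusted_state):
        """Store one model update training sample."""
        self.samples.append((state, action, reward, adjusted_state))

    def after_update(self):
        """Call after each parameter update; copy to the target when due."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target_params = copy.deepcopy(self.online_params)
            return True
        return False


updater = OnlineUpdater({"w": [0.0]}, sync_every=2)
updater.record([0.7, 0.3], 1, reward_from_feedback(80.0, 1), [0.6, 0.4])
```

Copying parameters only at intervals, rather than after every update, is a common way to keep the target values stable while the evaluation model continues to change.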
[0039] In a second aspect, embodiments of this application provide a resource allocation apparatus for a terminal device, comprising:
[0040] The data acquisition module is used to acquire the operating status information of the terminal device and acquire the current status characteristics based on the operating status information; wherein, the operating status information includes device resource status information, task operating status information of the model inference task, and task operating status information of at least one concurrent business task;
[0041] The value assessment module is used to obtain the value assessment results of each candidate resource allocation action in the preset candidate resource allocation action set based on the current state characteristics and through a pre-trained action value assessment model.
[0042] The resource adjustment module is used to determine the target resource allocation action based on the value assessment results corresponding to each of the candidate resource allocation actions, and to adjust the resource allocation of the model inference task and each of the concurrent business tasks according to the target resource allocation action.
[0043] In a third aspect, embodiments of this application provide an electronic device, including: a memory and a processor;
[0044] The memory stores computer-executable instructions;
[0045] The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of the first aspect and/or the various possible implementations of the first aspect as described above.
[0046] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method of the first aspect and/or the various possible implementations of the first aspect.
[0047] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the method of the first aspect and/or the various possible implementations of the first aspect.
[0048] In a sixth aspect, this application provides a chip including at least one processor for executing program instructions to perform the methods involved in the first aspect and any possible implementation.
[0049] This application provides a resource allocation method, apparatus, electronic device, and storage medium for a terminal device. By acquiring the terminal device's operating status information and obtaining current status characteristics based on that information, it can uniformly represent the terminal device's current resource occupancy, as well as the operation status of model inference tasks and at least one concurrent business task. This allows resource status and task status to be comprehensively reflected on the same processing basis. Subsequent resource allocation is based on a perception of the terminal device's overall current operating status, improving the correspondence and relevance between resource allocation decisions and the actual operating status. Based on the current status characteristics, a pre-trained action value evaluation model obtains the value evaluation results corresponding to each candidate resource allocation action in a preset set of candidate resource allocation actions, enabling a quantitative expression of the adaptation relationship between the terminal device's current operating status and different candidate resource allocation actions. This eliminates reliance on fixed quotas or static rules in the resource allocation process, allowing for targeted screening of multiple candidate resource allocation actions based on current status characteristics, improving the flexibility and accuracy of resource allocation action selection. Next, based on the value assessment results corresponding to each candidate resource allocation action, the target resource allocation action is determined. Then, the resources for the model inference task and each concurrent business task are allocated and adjusted according to the target resource allocation action, so that different tasks receive resource support that better meets their execution needs under the current operating state. 
In summary, this application improves the real-time performance and adaptability of resource coordination between the model inference task and concurrent business tasks, solving the technical problem of difficulty in timely generating resource allocation results that match the current operating state when resources are limited and task loads change dynamically.
Attached Figure Description
[0050] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0051] Figure 1 is a schematic diagram of the application architecture provided in an embodiment of this application;
[0052] Figure 2 is a flowchart of the resource allocation method for a terminal device provided in an embodiment of this application;
[0053] Figure 3 is a schematic diagram of the method for training a deep Q-network action value evaluation model provided in an embodiment of this application;
[0054] Figure 4 is a schematic diagram of the structure of the resource allocation apparatus for a terminal device provided in an embodiment of this application;
[0055] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0056] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0057] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0058] Figure 1 is a schematic diagram of the application architecture provided in an embodiment of this application. As shown in Figure 1, the application architecture is deployed inside the terminal device 11. During operation, the terminal device 11 acquires operational status information and uses this information as the input basis for resource allocation decisions. The terminal device 11 processes the operational status information to obtain current status characteristics that characterize the current operational status, and then inputs these current status characteristics into the action value assessment model. The action value assessment model evaluates the value of each candidate resource allocation action in a preset set of candidate resource allocation actions based on the current status characteristics and outputs the value assessment result corresponding to each candidate resource allocation action, from which the target resource allocation action is determined. The terminal device 11 then executes resource allocation adjustments based on the target resource allocation action, turning the processing chain of operational status information, action value assessment model, and resource allocation action into actual resource control behavior. This allows the resource occupancy relationship between the model inference task and the concurrent business tasks within the terminal device 11 to be coordinated and adjusted as the current operational status of the terminal device 11 changes.
[0059] The inventive concept of this application lies in providing a resource allocation method for terminal devices. Addressing the situation where resource occupancy dynamically changes with task load when a terminal device simultaneously runs a model inference task and at least one concurrent business task under resource-constrained conditions, the method first performs a unified perception of the terminal device's current operating state. This integrates device resource status information with the task operating state information of the model inference task and concurrent business tasks, further transforming it into current state features that characterize the overall operating state of the terminal device. Based on this, a pre-trained action value evaluation model is introduced to evaluate different candidate resource allocation actions from a preset set of candidate resource allocation actions, establishing a quantifiable value correspondence between the terminal device's current state and each candidate resource allocation action.
[0060] Furthermore, based on the value assessment results corresponding to each candidate resource allocation action, a target resource allocation action that better matches the current state characteristics is determined. Then, the resources for the model inference task and each concurrent business task are allocated and adjusted according to the target resource allocation action. Thus, this application establishes a processing chain encompassing operational state information, current state characteristics, value assessment results of candidate resource allocation actions, target resource allocation actions, and resource allocation adjustments, enabling the resource allocation process to make dynamic decisions and actually execute based on the current operational state of the terminal device. It is evident that this application uses an action value assessment model to make targeted judgments among multiple candidate resource allocation actions, forming a resource allocation result adapted to the current operational state. The terminal device can dynamically coordinate and allocate resources based on changes in the current state when the model inference task and concurrent business tasks are running in parallel.
[0061] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.
[0062] Figure 2 is a flowchart of the resource allocation method for a terminal device provided in an embodiment of this application. As shown in Figure 2, the method includes:
[0063] S21. Obtain the operating status information of the terminal device, and obtain the current status characteristics based on the operating status information; wherein the operating status information includes device resource status information, task operating status information of the model inference task, and task operating status information of at least one concurrent business task.
[0064] In this embodiment, the terminal device typically carries out model inference tasks and at least one concurrent business task simultaneously during operation. The occupancy of processing power, storage capacity, and transmission capacity by different tasks dynamically changes during operation. To ensure that the subsequent resource allocation process reflects the actual operating status of the terminal device, the terminal device's operating status information is first acquired. This operating status information characterizes the overall operating status of the terminal device at the current moment. Specifically, device resource status information characterizes the occupancy and availability of various resources within the terminal device; the model inference task's task operating status information characterizes the execution status of the model inference task at the current moment; and the concurrent business task's task operating status information characterizes the execution status of each concurrent business task at the current moment. By uniformly acquiring these various types of information, the terminal device's resource status and task status can be comprehensively reflected on the same processing basis, providing a data foundation for determining subsequent resource allocation actions and improving the correspondence between subsequent resource allocation decisions and the actual operating status.
[0065] After acquiring the operational status information, current status features are obtained based on this information. Current status features are the state representation results formed after organizing, representing, and extracting the operational status information. They are used to map device resource status information, model inference task operational status information, and at least one concurrent business task operational status information into data that can be used for subsequent processing. Specifically, operational status information from different sources and of different types can be standardized to eliminate differences between different information representation methods, and data features reflecting the current operational status can be formed based on this standardized processing. Current status features retain both the current resource occupancy relationships and task execution relationships of the terminal device, and facilitate subsequent evaluation of different candidate resource allocation actions.
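The standardization described above can be illustrated with simple min-max scaling, which maps readings with different units onto a common 0 to 1 range. This is a sketch only: the field names and value ranges in `RANGES` are hypothetical, and the application does not prescribe min-max scaling or any particular feature layout.

```python
# Assumed raw fields and their expected ranges (low, high).
RANGES = {
    "cpu_util_pct": (0.0, 100.0),
    "free_memory_mb": (0.0, 4096.0),
    "downlink_kbps": (0.0, 10000.0),
    "temperature_c": (20.0, 90.0),
    "inference_latency_ms": (0.0, 200.0),
    "business_queue_len": (0.0, 32.0),
}


def to_state_features(raw):
    """Clamp each reading to its range and scale it into [0, 1]."""
    features = []
    for name, (low, high) in RANGES.items():
        value = min(high, max(low, float(raw[name])))
        features.append((value - low) / (high - low))
    return features


raw_status = {
    "cpu_util_pct": 50,
    "free_memory_mb": 1024,
    "downlink_kbps": 25000,      # above the assumed range, so clamped
    "temperature_c": 20,
    "inference_latency_ms": 100,
    "business_queue_len": 8,
}
state_features = to_state_features(raw_status)
```

Because every feature ends up on the same scale, a megabyte count and a temperature reading contribute comparably to the model input instead of one dominating by magnitude.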
[0066] In summary, by acquiring current state characteristics based on the current operational status information, the subsequent resource allocation process can be based on the real-time operational status of the terminal devices, making the resource allocation results more in line with the actual needs of model inference tasks and concurrent business tasks. This achieves the perception of the current operational status of the terminal devices, providing a data foundation for subsequently obtaining the value assessment results corresponding to candidate resource allocation actions and determining the target resource allocation action.
[0067] S22. Based on the current state characteristics, obtain the value evaluation results corresponding to each candidate resource allocation action in the preset candidate resource allocation action set through the pre-trained action value evaluation model.
[0068] In this embodiment, the pre-trained action value evaluation model refers to a model that has been trained and possesses state evaluation capabilities before resource allocation is applied. The preset candidate resource allocation action set is a pre-defined set of multiple candidate resource allocation actions that can be selected. The value evaluation result is used to characterize the applicability of each candidate resource allocation action under the current state characteristics. By using the pre-trained action value evaluation model, the current state characteristics can be associated with the preset candidate resource allocation action set, so that the resource demand relationship between the model inference task and at least one concurrent business task in the current running state is mapped to the evaluation result corresponding to each candidate resource allocation action. This realizes the transformation from state representation to action evaluation result, providing a basis for subsequent determination of target resource allocation actions.
[0069] Specifically, the pre-trained action value assessment model takes the current state features as input and comprehensively processes the device resource status, the task operation status of the model inference task, and the task operation status of at least one concurrent business task represented by the current state features. It then combines this with the adaptation relationship of each candidate resource allocation action in the pre-set set of candidate resource allocation actions to obtain the value assessment result corresponding to each candidate resource allocation action. The value assessment result can reflect the degree of matching, adaptability, or expected effect between different candidate resource allocation actions and the current state features, reflecting the resource coordination situation that may result from adopting different candidate resource allocation actions in the current operating state. By providing value assessment results for each candidate resource allocation action in the pre-set set of candidate resource allocation actions, subsequent processing can compare and filter among multiple candidate resource allocation actions, avoiding direct resource allocation based on a single rule, and improving the pertinence and distinguishability of the resource allocation action determination process.
[0070] In summary, the value assessment results obtained based on the current state characteristics can reflect the response relationship of the terminal device to different candidate resource allocation actions in the current operating environment. Thus, when the model inference task and at least one concurrent business task are running simultaneously, the applicability of different candidate resource allocation actions to resource coordination can be reflected in the value assessment results, ensuring that subsequent resource allocation adjustments are based on the current operating state. This achieves quantitative processing of actions oriented towards the current operating state, which is beneficial to improving the accuracy and real-time performance of resource allocation action determination.
[0071] S23. Based on the value assessment results corresponding to each candidate resource allocation action, determine the target resource allocation action, and adjust the resource allocation for the model inference task and each concurrent business task according to the target resource allocation action.
[0072] In this embodiment, the target resource allocation action is a resource allocation action selected based on the current operating state of the terminal device, serving as the basis for subsequent resource allocation adjustments. Since the value assessment results corresponding to each candidate resource allocation action already characterize the adaptation relationship between different candidate resource allocation actions and the current state characteristics, comparisons, screenings, or selections can be made among the candidate resource allocation actions to determine the target resource allocation action that best matches the current operating state from the preset set of candidate resource allocation actions. This process of determining the target resource allocation action moves resource allocation processing beyond simply evaluating candidate resource allocation actions; it further generates executable resource allocation results. This achieves a transformation from action evaluation to action decision-making, providing a clear execution target for resource allocation adjustments.
[0073] After determining the target resource allocation action, the resources for the model inference task and each concurrent business task are adjusted according to this action. This resource allocation adjustment refers to adjusting the resource usage relationship of the model inference task and each concurrent business task on the terminal device according to the allocation relationship corresponding to the target resource allocation action, so that different tasks receive resource support appropriate to their execution needs in the current running state. In this way, the resource allocation relationship between the model inference task and each concurrent business task can be adjusted according to changes in the current running state, ensuring that resource usage corresponds to the task execution state.
[0074] Furthermore, by adjusting the resource allocation for the model inference task and each concurrent business task based on the target resource allocation action, the terminal device can respond to the resource requirements of different tasks in the current operating scenario. For the model inference task, it can obtain corresponding resource support based on its execution status and resource requirements; for each concurrent business task, it can also obtain corresponding resource allocation based on its own operating status. By considering both the model inference task and each concurrent business task in the same resource allocation process, different tasks in the terminal device can form a coordinated resource occupation relationship under the condition of sharing limited resources. This realizes dynamic resource coordination of the terminal device in multi-tasking scenarios and improves the consistency between resource allocation and the current operating status.
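Steps S21 to S23 can be read as one control cycle. The sketch below wires them together with caller-supplied functions so that it stays independent of any particular platform; the function `allocation_cycle`, its parameters, and the stub implementations are hypothetical, and the greedy choice of the highest-valued action is one of the selection options rather than the only one.

```python
def allocation_cycle(read_status, extract_features, evaluate, apply_action, actions):
    """Run one S21 -> S22 -> S23 cycle and return the chosen action."""
    status = read_status()                       # S21: operating status information
    state = extract_features(status)             # S21: current state features
    values = evaluate(state)                     # S22: one value per candidate action
    best = max(range(len(actions)), key=values.__getitem__)
    apply_action(actions[best])                  # S23: adjust resource allocation
    return actions[best], values


# Stub components standing in for device-specific code.
applied = []
chosen, values = allocation_cycle(
    read_status=lambda: {"npu_util": 0.9, "call_active": 1.0},
    extract_features=lambda s: [s["npu_util"], s["call_active"]],
    evaluate=lambda state: [0.2 * state[0], 0.8 * state[0], 0.1],
    apply_action=applied.append,
    actions=["lower_npu_freq", "raise_npu_freq", "keep_current"],
)
```

On a device, such a cycle would typically be repeated once per resource allocation period, so that each decision reflects the most recently observed state.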
[0075] In one embodiment, obtaining the operation status information of the terminal device and obtaining the current state features based on the operation status information includes:
[0076] S211, by using pre-set sensors and system logs, obtain the device resource status information of the terminal device, the task running status information of the model inference task in the terminal device, and the task running status information of at least one concurrent business task in the terminal device.
[0077] Specifically, pre-set sensors are used to collect resource and environmental status information during the operation of the terminal device, while system logs record task scheduling, resource usage, and operational feedback information during the terminal device's operation. Through the combined collection of data from pre-set sensors and system logs, both resource-side and task-side information of the terminal device can be obtained simultaneously, enabling the subsequently generated current status characteristics to more comprehensively reflect the terminal device's operational status at the current moment. By collecting operational data from multiple sources, synchronous perception of the terminal device's current resource usage and task execution status is achieved, providing fundamental data support for the evaluation of subsequent resource allocation actions.
[0078] The device resource status information includes the current usage and availability of various resources within the terminal device. This information can include the real-time utilization of the processor, graphics processing unit (GPU), and neural processing unit (NPU), and the load status of each computing core, as well as memory usage, network bandwidth usage, battery status, and temperature status. Memory usage can include total memory, used memory, and available memory; network bandwidth usage can include uplink and downlink speeds; battery status can include battery level, charging status, and battery health; and temperature status can include the internal and external temperatures of the terminal device. By acquiring this device resource status information, the current resource conditions of the terminal device can be characterized from multiple dimensions, including computing resources, storage resources, transmission resources, power supply status, and thermal status. Detailed collection of device resource status information provides a comprehensive description of the terminal device's resource-side operating status, which helps subsequent resource allocation adjustments match the actual resource conditions.
[0079] The task execution status information of the model inference task and at least one concurrent business task is used to characterize the execution status of different tasks at the current moment. The task execution status information of the model inference task and the concurrent business task can include task type and task execution status. The task execution status can include active, standby, and paused states. Concurrent business tasks can include communication tasks, display tasks, media processing tasks, network request tasks, or real-time data processing tasks, etc. By acquiring the task execution status information of the model inference task and the concurrent business task, the execution activity level, resource consumption tendency, and processing priority requirements of different tasks at the current moment can be reflected. This allows the terminal device to not only perceive the resource consumption itself, but also to perceive which tasks are demanding resources and what stage of execution the tasks are currently in.
[0080] S212, based on the device resource status information, the task operation status information of the model inference task, and the task operation status information of the concurrent business task, the current status features of the terminal device are obtained through feature extraction processing.
[0081] Specifically, feature extraction processing refers to organizing, normalizing, encoding, fusing, and vectorizing the collected multi-source operational status data to form a data representation that can be used for subsequent action value assessment. The current state features can be represented as a feature vector that comprehensively characterizes the terminal device's current resource usage, task execution status, battery status, thermal status, and network status. Data of different dimensions, types, and sources can first be processed in a unified manner to form current state features suitable for input into the subsequent action value assessment model. In this way, the raw operational status information is transformed from scattered monitoring data into a structured, computable state representation, improving data availability and the completeness of the state representation for subsequent value assessment processing.
[0082] Furthermore, the current state feature is a comprehensive extraction result of device resource status information, model inference task execution status information, and concurrent business task execution status information. It retains both key resource information from the current operating state of the terminal device and execution information from the task side. By simultaneously introducing device resource status information and task execution status information, and obtaining the current state feature through feature extraction processing, the current state feature can more comprehensively characterize the real-time load and resource competition relationship of the terminal device. By constructing the current state feature for multi-task collaboration scenarios, a direct input is provided for subsequently obtaining the value assessment results corresponding to each candidate resource allocation action based on the current state feature.
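As a non-limiting illustration, the feature extraction of S212 can be sketched as follows, assuming min-max normalization for numeric readings and one-hot encoding for task execution states. The field names, the value ranges in `RANGES`, and the helper `build_state_features` are hypothetical and chosen only for illustration; the application does not specify them.

```python
from typing import Dict, List

# Hypothetical value ranges used for min-max normalization (assumptions).
RANGES = {
    "cpu_util": (0.0, 100.0),      # percent
    "gpu_util": (0.0, 100.0),      # percent
    "npu_util": (0.0, 100.0),      # percent
    "mem_used_mb": (0.0, 8192.0),  # assumes 8 GB of total memory
    "battery_pct": (0.0, 100.0),   # percent
    "temp_c": (0.0, 100.0),        # degrees Celsius
}

# Task execution states named in the description.
TASK_STATES = ["active", "standby", "paused"]


def normalize(value: float, lo: float, hi: float) -> float:
    """Clamp value into [lo, hi] and scale it to [0, 1]."""
    clamped = min(max(value, lo), hi)
    return (clamped - lo) / (hi - lo)


def one_hot(state: str) -> List[float]:
    """Encode a task execution state as a one-hot vector."""
    return [1.0 if state == s else 0.0 for s in TASK_STATES]


def build_state_features(
    resources: Dict[str, float],
    inference_state: str,
    business_states: List[str],
) -> List[float]:
    """Fuse resource-side and task-side information into one feature vector."""
    features = [normalize(resources[key], *RANGES[key]) for key in RANGES]
    features += one_hot(inference_state)
    for state in business_states:
        features += one_hot(state)
    return features


features = build_state_features(
    {"cpu_util": 50.0, "gpu_util": 0.0, "npu_util": 100.0,
     "mem_used_mb": 4096.0, "battery_pct": 80.0, "temp_c": 25.0},
    inference_state="active",
    business_states=["paused"],
)
```

In this sketch the vector length depends on the number of concurrent business tasks; a fixed-size input for the action value evaluation model would require padding or a fixed task slot count, which is left open here.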
[0083] In one embodiment, the action value evaluation model is a Deep Q-Network action value evaluation model. A Deep Q-Network (DQN) uses a deep neural network to approximate the state-action value function, establishing a mapping between the current state features and the values of candidate resource allocation actions. By setting the action value evaluation model to a Deep Q-Network model, the model can process the complex operational states formed under the combined action of the model inference task and at least one concurrent business task in the resource allocation scenario of the terminal device, and output the value evaluation results corresponding to each candidate resource allocation action. By adopting the Deep Q-Network action value evaluation model, a non-linear approximation of the state-action value function is achieved, enhancing the action value evaluation model's ability to express and fit complex state information.
[0084] Specifically, the deep Q-network action value evaluation model uses a deep neural network as its basic structure and learns the state-action value function through model parameters. The state-action value function can be expressed as $Q(S_t, A_t; \theta)$, where $S_t$ represents the state at time t, $A_t$ represents the action at time t, and $\theta$ represents the model parameters of the deep neural network. In a resource allocation scenario, $S_t$ can be characterized by the current state features, and $A_t$ can be represented by a candidate resource allocation action from the preset set of candidate resource allocation actions. By inputting the current state features into the deep Q-network action value evaluation model, the value evaluation results corresponding to each candidate resource allocation action of the terminal device in the current operating state can be obtained, providing a quantitative basis for determining the subsequent target resource allocation action. By representing the resource allocation problem as a state-action value evaluation problem, the resource allocation decision-making process becomes computable and comparable.
[0085] Furthermore, the operational status of a terminal device typically involves device resource status information, model inference task operational status information, and concurrent business task operational status information simultaneously. The Deep Q-Network Action Value Evaluation Model leverages the hierarchical feature representation capabilities of deep neural networks to jointly process multi-dimensional information from the current state features, extracting key features related to resource allocation actions. Employing the Deep Q-Network Action Value Evaluation Model also enhances the adaptability of the action value evaluation process. Since the Deep Q-Network Action Value Evaluation Model does not directly rely on fixed rules for judgment, but rather establishes a mapping relationship between states and action values through training, it can re-evaluate candidate resource allocation actions based on current state characteristics when the terminal device's operational status changes. For scenarios involving resource competition, load fluctuations, and task switching between model inference tasks and concurrent business tasks, the Deep Q-Network Action Value Evaluation Model can provide different action value evaluation results based on different state combinations, making the determination of resource allocation actions more closely aligned with the actual operational situation of the terminal device. By adopting the Deep Q-Network Action Value Evaluation Model, the adaptability of the action value evaluation process to dynamic load changes is improved, which is beneficial for enhancing the pertinence and real-time nature of resource allocation adjustments, providing a stable model foundation for subsequent resource allocation decisions.
[0086] In one embodiment, using a pre-trained action value evaluation model to obtain, based on the current state features, the value evaluation results corresponding to each candidate resource allocation action in a preset set of candidate resource allocation actions includes:
[0087] S221, obtain a preset set of candidate resource allocation actions, and obtain a pre-trained deep Q-network action value evaluation model.
[0088] In this embodiment, a preset candidate resource allocation action set is used to characterize multiple candidate resource allocation actions that the terminal device can choose from during the resource allocation process. A pre-trained deep Q-network action value evaluation model is used to evaluate the value of each candidate resource allocation action under the current state characteristics. The preset candidate resource allocation action set can be pre-established before the resource allocation strategy is executed and set according to the adjustable resource types and task management methods of the terminal device. The candidate resource allocation actions in the preset candidate resource allocation action set may include actions to adjust the operating frequency of the processor, graphics processor, and neural network processor; actions to adjust memory resource allocation; actions to adjust network bandwidth allocation; and actions to adjust task execution priority. By pre-setting the candidate resource allocation action set, the resource allocation decision-making process can be limited to the range of actions that the terminal device can execute, ensuring that the subsequent output value evaluation results can directly correspond to the executable resource allocation adjustment scheme.
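As a non-limiting illustration, a preset candidate resource allocation action set can be sketched as an immutable tuple of records, where the position of each record serves as the action index. The class `AllocationAction`, its fields, and every numeric value below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AllocationAction:
    """One executable allocation scheme (all fields are illustrative)."""
    name: str
    cpu_freq_mhz: int                 # processor operating frequency
    npu_freq_mhz: int                 # neural processing unit frequency
    inference_mem_mb: int             # memory reserved for model inference
    inference_bandwidth_share: float  # share of network bandwidth for inference
    inference_priority: int           # 0 is the highest priority


# The preset set is fixed before the allocation strategy runs, so every
# evaluated action is one that the device can actually execute.
CANDIDATE_ACTIONS: Tuple[AllocationAction, ...] = (
    AllocationAction("favor_inference", 2400, 1000, 2048, 0.6, 0),
    AllocationAction("balanced", 1800, 800, 1024, 0.4, 1),
    AllocationAction("favor_business", 1800, 400, 512, 0.2, 2),
    AllocationAction("power_save", 1000, 400, 512, 0.2, 2),
)
```

Keeping the set as a fixed, ordered tuple means that an action value evaluation model with one output per action can be mapped back to a concrete scheme by index.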
[0089] S222, Input the current state features into the action value assessment model so that the action value assessment model can perform approximate calculations on the current state features and each candidate resource allocation action in the preset candidate resource allocation action set according to the state action value function, and output the value assessment results corresponding to each candidate resource allocation action under the current state features.
[0090] Specifically, after receiving the current state characteristics, the action value assessment model combines them with a pre-defined set of candidate resource allocation actions to quantitatively evaluate the potential resource coordination effects of different candidate resource allocation actions in the current state. Approximate calculation refers to using a deep Q-network action value assessment model to approximate the state-action value function to obtain the value correspondence between the current state characteristics and each candidate resource allocation action. In this way, multiple candidate resource allocation actions that were originally difficult to compare directly can be transformed into measurable and comparable value assessment results. This achieves the transformation from state representation to quantitative action value results, providing a direct basis for determining subsequent target resource allocation actions.
[0091] Furthermore, the value assessment results output by the action value assessment model are used to characterize the applicability of each candidate resource allocation action under the current state characteristics. Since the current state characteristics comprehensively reflect the resource occupancy and task operation status of the terminal device at the current moment, the value assessment results corresponding to each candidate resource allocation action can reflect the degree of matching between different candidate resource allocation actions and the current resource competition relationship, task operation relationship, and resource demand relationship. In scenarios where model inference tasks and concurrent business tasks run simultaneously, one candidate resource allocation action may be more suitable for a state with high resource pressure, while another candidate resource allocation action may be more suitable for a state with rapidly changing task load. By outputting value assessment results for each candidate resource allocation action separately, the action value assessment model allows the differences between different candidate resource allocation actions to be reflected within a unified evaluation framework. By outputting the value assessment results corresponding to each candidate resource allocation action under the current state characteristics, comparable evaluation of candidate resource allocation actions is achieved, which is beneficial to improving the accuracy of subsequent target resource allocation action selection.
[0092] The calculation process between the current state characteristics and the preset set of candidate resource allocation actions essentially establishes a mapping relationship between the current operating state of the terminal device and the executable resource allocation actions. This mapping relationship uses an action value assessment model to uniformly evaluate the expected effects of each candidate resource allocation action in the current state, enabling subsequent resource allocation decisions to be based on quantitative analysis. It can better reflect the coupling relationship between device resource status, model inference task status, and concurrent business task status, making the value assessment results of candidate resource allocation actions closer to the actual operating conditions of the terminal device.
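As a non-limiting illustration, the evaluation in S222 can be sketched with a small fully connected network that takes the current state features and returns one value per candidate action. The class `TinyQNetwork`, its layer sizes, and its randomly initialized weights are assumptions that merely stand in for a pre-trained deep Q-network; the application does not specify a network architecture.

```python
import random
from typing import List


class TinyQNetwork:
    """A one-hidden-layer network: state features in, one Q-value per action out."""

    def __init__(self, n_features: int, n_hidden: int, n_actions: int, seed: int = 0):
        rng = random.Random(seed)
        # Random weights are placeholders for trained parameters (assumption).
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_features)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_actions)]
        self.b2 = [0.0] * n_actions

    def q_values(self, state: List[float]) -> List[float]:
        """Approximate Q(S_t, A; theta) for every candidate action A at once."""
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, state)) + b)  # ReLU
            for row, b in zip(self.w1, self.b1)
        ]
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]


net = TinyQNetwork(n_features=3, n_hidden=4, n_actions=2, seed=1)
values = net.q_values([0.5, 0.2, 0.9])  # one value assessment result per action
```

Producing all action values in a single forward pass is a common design choice for deep Q-networks with a discrete action set, because the values can then be compared directly.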
[0093] In one embodiment, Figure 3 is a schematic flowchart of the method for training a deep Q-network action value evaluation model provided in an embodiment of this application. Based on the above embodiments, as shown in Figure 3, the method includes:
[0094] S31, Construct the initial deep Q-network action value evaluation model and the target network model corresponding to the initial deep Q-network action value evaluation model;
[0095] S32, acquire the current state characteristics of the terminal device, the resource allocation action, the reward value corresponding to the execution of the resource allocation action, and the running status information after the execution of the resource allocation action in multiple resource allocation cycles;
[0096] S33, Generate adjusted state characteristics based on the running status information after the resource allocation action is executed;
[0097] S34, store the current state characteristics, resource allocation actions, reward values and adjusted state characteristics as an experience sample in the sample library;
[0098] S35, randomly sample training samples from the sample library, input the current state features in the training samples into the initial deep Q network action value evaluation model, obtain the action value of each candidate resource allocation action under the current state features, and determine the current action value based on the resource allocation action;
[0099] S36, Input the adjusted state features from the training samples into the target network model to obtain the action value of each candidate resource allocation action in the preset candidate resource allocation action set under the adjusted state features;
[0100] S37. Determine the target action value based on the reward value and the action value corresponding to each candidate resource allocation action;
[0101] S38. Construct a loss function based on the target action value and the current action value, and train the initial deep Q network action value evaluation model by minimizing the loss function to obtain a pre-trained deep Q network action value evaluation model.
[0102] In this embodiment, the initial deep Q-network action value evaluation model serves as the main evaluation network during training, outputting the action value corresponding to each candidate resource allocation action under the current state features. The target network model serves as the auxiliary evaluation network during training, providing a relatively stable basis for calculating the target action value during the training phase. Both can adopt the same or similar network structures, differing only in their parameter update timing. The parameters of the initial deep Q-network action value evaluation model are used for continuous training optimization, while the parameters of the target network model are used to participate in the calculation of the target action value during the training phase. By simultaneously constructing the initial deep Q-network action value evaluation model and the target network model, the main network and target network are collaboratively set during the training process of the deep Q-network action value evaluation model, improving the stability of the action value learning process and providing a model foundation for subsequently obtaining a pre-trained deep Q-network action value evaluation model.
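As a non-limiting illustration, the relationship between the main evaluation network and the target network can be sketched with plain parameter dictionaries. The names `build_networks` and `sync_target` are hypothetical, and when the copy is performed is an assumption left to the caller, since the text does not fix a schedule.

```python
import copy
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def build_networks(initial: Params) -> Tuple[Params, Params]:
    """Create the main network and a target network with identical parameters."""
    online = copy.deepcopy(initial)
    target = copy.deepcopy(initial)
    return online, target


def sync_target(online: Params, target: Params) -> None:
    """Copy the main network's parameters into the target network in place."""
    target.clear()
    target.update(copy.deepcopy(online))


online, target = build_networks({"w": [0.1, 0.2], "b": [0.0]})
online["w"][0] = 0.9          # a training update changes only the main network
frozen = target["w"][0]       # the target network still holds the old value
sync_target(online, target)   # after the copy, both networks agree again
```

Because the two parameter sets are deep copies, training updates to the main network do not disturb the target network until the copy is made, which is what keeps the target action value stable during training.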
[0103] After model construction, the current state characteristics, resource allocation actions, corresponding reward values, and post-action running state information of the terminal device are acquired across multiple resource allocation cycles. The current state characteristics characterize the terminal device's state at the start of a resource allocation cycle; the resource allocation actions characterize the actual resource allocation adjustment scheme executed within that cycle; the reward value characterizes the feedback effect of the resource allocation action; and the post-action running state information characterizes the subsequent running state the terminal device enters after executing the resource allocation action. The reward value can be determined based on task completion, system energy consumption, model inference performance, and user experience. Task completion reflects whether high-priority tasks receive timely responses; system energy consumption reflects the impact of resource allocation actions on the terminal device's energy consumption; model inference performance reflects the processing speed and accuracy of model inference tasks; and user experience reflects the terminal device's performance in terms of interaction smoothness and response time. By continuously acquiring the above information across multiple resource allocation cycles, multi-cycle state transition data can be generated for training a deep Q-network action value evaluation model. By collecting status information before and after resource allocation, resource allocation actions and their execution feedback, continuous accumulation of sample data on the resource allocation process is achieved, providing a data source for the construction of subsequent experience samples.
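As a non-limiting illustration, a reward value combining the four factors named above can be sketched as a weighted sum. The weighted-sum form, the weights, and the assumption that every input is already scaled to [0, 1] are hypothetical; the application states only which factors the reward value can be based on.

```python
from typing import Tuple


def reward_value(
    task_completion: float,   # 1.0 if high-priority tasks responded in time
    energy_ratio: float,      # energy consumed divided by an assumed budget
    inference_score: float,   # combined speed and accuracy of model inference
    ux_score: float,          # interaction smoothness and response time
    weights: Tuple[float, float, float, float] = (1.0, 0.5, 0.5, 1.0),
) -> float:
    """Weighted sum in which energy consumption counts as a penalty."""
    w_task, w_energy, w_infer, w_ux = weights
    return (
        w_task * task_completion
        - w_energy * energy_ratio
        + w_infer * inference_score
        + w_ux * ux_score
    )


r = reward_value(task_completion=1.0, energy_ratio=0.5,
                 inference_score=0.5, ux_score=1.0)
```

With the default weights in this sketch, an action that consumes more energy for the same task outcome receives a lower reward value.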
[0104] Next, adjusted state features are generated based on the running status information after the resource allocation action is executed. The adjusted state features represent the terminal device's running state after the resource allocation action is executed and are the subsequent-state representation corresponding to the current state features. Since the running status information after the resource allocation action already reflects the changes in the terminal device's resource status, the model inference task's running status, and the concurrent business tasks' running status, it can be organized and represented using a feature extraction method corresponding to that used to generate the current state features, forming the adjusted state features. The adjusted state features describe the subsequent impact of the resource allocation action on the terminal device's running state, enabling the training process to establish the connection between the current state features, the resource allocation action, and the subsequent state. This achieves a structured representation of the state after the resource allocation action is executed, allowing experience samples to fully reflect the state transition of a single resource allocation cycle.
[0105] After generating the adjusted state features, the current state features, resource allocation actions, reward values, and adjusted state features are combined as experience samples and stored in the sample library. An experience sample characterizes the complete process of the terminal device within a resource allocation cycle: starting from the current state features, executing a resource allocation action, obtaining a reward value, and transitioning to the adjusted state features. Essentially, it corresponds to a sequence of state, action, reward, and next state, i.e., $(S_t, A_t, R_t, S_{t+1})$, where $S_t$ is the current state features, $A_t$ is the resource allocation action, $R_t$ is the reward value, and $S_{t+1}$ is the adjusted state features. By continuously storing experience samples in the sample library, resource allocation experience under different operating states can be accumulated over multiple resource allocation cycles, providing a sample basis for subsequent training phases.
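As a non-limiting illustration, the sample library of S34 and the random sampling used in S35 can be sketched as follows. The class names, the fixed capacity, and the policy of discarding the oldest sample when the library is full are assumptions not stated in the application.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Optional


class Experience(NamedTuple):
    """One experience sample (S_t, A_t, R_t, S_{t+1})."""
    state: List[float]       # current state features
    action: int              # index of the executed resource allocation action
    reward: float            # reward value
    next_state: List[float]  # adjusted state features


class SampleLibrary:
    """Bounded store of experience samples with uniform random sampling."""

    def __init__(self, capacity: int, seed: Optional[int] = None):
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._buffer: Deque[Experience] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, sample: Experience) -> None:
        self._buffer.append(sample)

    def sample(self, batch_size: int) -> List[Experience]:
        """Draw samples without replacement, breaking temporal ordering."""
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)


library = SampleLibrary(capacity=3, seed=0)
for cycle in range(5):
    library.store(Experience([float(cycle)], cycle % 2, 1.0, [float(cycle + 1)]))
batch = library.sample(2)
```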
[0106] After the sample library is formed, training samples are randomly sampled from the library. The current state features of the training samples are then input into the initial deep Q-network action value evaluation model to obtain the action value corresponding to each candidate resource allocation action under the current state features. The current action value is then determined based on the resource allocation action. Random sampling of training samples (such as an experience replay mechanism) breaks the continuous correlation of samples over time by randomly sampling historical experience samples, avoiding excessive reliance on similar data in adjacent resource allocation cycles and improving the stability of the training process. After the current state features are input into the initial deep Q-network action value evaluation model, the model outputs the action value corresponding to each candidate resource allocation action under the current state features. Then, based on the resource allocation actions recorded in the training samples, the current action value corresponding to the actually executed resource allocation action is extracted from the output action value. This current action value is used to characterize the value estimation result of the initial deep Q-network action value evaluation model for the executed resource allocation action under the current state features.
[0107] Simultaneously, the adjusted state features from the training samples are input into the target network model to obtain the action value corresponding to each candidate resource allocation action in the preset candidate resource allocation action set under the adjusted state features. Since the adjusted state features characterize the subsequent operating state after the resource allocation action is executed, inputting them into the target network model yields the action value corresponding to each candidate resource allocation action in the subsequent state of the terminal device. This value reflects the expected effects of continuing to execute different resource allocation actions in the subsequent state. The target network model provides a relatively stable value estimation result in this step, which is used to participate in the calculation of the target action value. This achieves the prediction of the candidate resource allocation action value in the subsequent state, providing a subsequent value basis for determining the target action value.
[0108] After obtaining the current action value and the action value corresponding to each candidate resource allocation action under the adjusted state characteristics, the target action value is determined based on the reward value and the action value corresponding to each candidate resource allocation action. The target action value can be determined based on the combination relationship between the immediate reward and the action value in the subsequent state, and can be expressed as the sum of the reward value and the discounted maximum subsequent action value.
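As a non-limiting illustration, the target action value described above, i.e., the reward value plus the discounted maximum subsequent action value, can be computed as follows. The function name and the discount factor value used in the example are assumptions.

```python
from typing import Sequence


def target_action_value(
    reward: float,
    next_action_values: Sequence[float],
    gamma: float,
) -> float:
    """Return R_t + gamma * max over A' of Q(S_{t+1}, A'; theta')."""
    return reward + gamma * max(next_action_values)


# Worked example: reward 1.0, target-network values [0.5, 2.0, 1.0] for the
# adjusted state, discount factor 0.5, giving 1.0 + 0.5 * 2.0 = 2.0.
y = target_action_value(1.0, [0.5, 2.0, 1.0], gamma=0.5)
```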
[0109] After the target action value is determined, a loss function is constructed based on the target action value and the current action value. The initial deep Q-network action value evaluation model is then trained by minimizing the loss function to obtain a pre-trained deep Q-network action value evaluation model. The loss function reflects the degree of difference between the target action value and the current action value. The smaller the difference, the closer the initial deep Q-network action value evaluation model's estimate of the resource allocation action value in the current state is to the training objective.
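As a non-limiting illustration, the degree of difference between target action values and current action values over a batch of training samples can be measured as a mean squared difference. The squared form and the function name are assumptions; the text states only that the loss function reflects the difference between the two values.

```python
from typing import Sequence


def mean_squared_td_loss(
    current_values: Sequence[float],
    target_values: Sequence[float],
) -> float:
    """Average of (target - current)^2 over a batch of training samples."""
    if not current_values or len(current_values) != len(target_values):
        raise ValueError("both sequences must be non-empty and of equal length")
    squared = [(y - q) ** 2 for q, y in zip(current_values, target_values)]
    return sum(squared) / len(squared)


# Two samples: the first is already on target, the second is off by 1.0.
loss = mean_squared_td_loss([2.0, 1.0], [2.0, 2.0])
```

A loss of zero in this sketch means that the main evaluation network already reproduces the target action value for every sample in the batch.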
[0110] Specifically, the loss function uses the following formula:
[0111]
[0112] The loss function $L(\theta)$ measures the estimation error of the deep Q-network action value evaluation model during training, and $\theta$ represents the model parameters of the action value evaluation model. $\mathbb{E}[\cdot]$ represents the expectation over the sample distribution in the sample library, which can be approximated by the average over the randomly sampled training samples; $S_t$ represents the state at time t, i.e., the current state features; $A_t$ represents the resource allocation action selected and executed at time t, i.e., the target resource allocation action (or the resource allocation action recorded in the training samples); $R_t$ represents the reward value corresponding to the execution of the resource allocation action; $S_{t+1}$ represents the subsequent state after the resource allocation action is executed, i.e., the adjusted state features; $\gamma$ represents the discount factor, used to describe the trade-off between immediate rewards and future benefits; $A'$ represents a candidate resource allocation action available in the subsequent state $S_{t+1}$; $Q(S_t, A_t; \theta)$ represents the action value output by the current action value evaluation model for executing action $A_t$ in state $S_t$; and $Q(S_{t+1}, A'; \theta')$ represents the action value output by the target network model for executing candidate action $A'$ in the subsequent state $S_{t+1}$, where $\theta'$ represents the model parameters of the target network model. This formula enables the action value evaluation model to gradually reduce temporal difference errors under the different operating states covered by the training samples, improving the accuracy of the action value evaluation results for candidate resource allocation actions and providing a more reliable value evaluation basis for subsequently determining target resource allocation actions.
Furthermore, the model parameters of the trained action value evaluation model are copied to the target network model to update it.
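A form of the loss function consistent with the symbol definitions in [0112] can be written as follows, assuming the usual squared temporal-difference error of deep Q-learning. The squaring is an assumption: the text states only that the loss function reflects the difference between the target action value and the current action value.

```latex
L(\theta) = \mathbb{E}\left[ \left( R_t + \gamma \max_{A'} Q(S_{t+1}, A'; \theta') - Q(S_t, A_t; \theta) \right)^{2} \right]
```

In this form, $R_t + \gamma \max_{A'} Q(S_{t+1}, A'; \theta')$ is the target action value of [0108], and $Q(S_t, A_t; \theta)$ is the current action value of [0106].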
[0113] In one embodiment, determining the target resource allocation action based on the value assessment results corresponding to each candidate resource allocation action includes:
[0114] S231, Based on the preset exploration probability, randomly select a candidate resource allocation action from the preset candidate resource allocation action set, and determine the randomly selected candidate resource allocation action as the target resource allocation action.
[0115] S232, when random selection is not used, compare the value assessment results corresponding to each candidate resource allocation action, and determine the candidate resource allocation action with the highest value assessment result as the target resource allocation action.
[0116] In this embodiment, the action selection method is used to balance the utilization of existing evaluation results and the exploration of the effects of other candidate resource allocation actions during the resource allocation decision-making stage. By establishing a selection mechanism between random selection and determination based on value evaluation results, the terminal device, in scenarios where model inference tasks and concurrent business tasks are running simultaneously, can not only perform resource allocation adjustments based on the current value evaluation results, but also try other candidate resource allocation actions with a certain probability, continuously providing richer resource allocation experience for the action value evaluation model.
[0117] The preset exploration probability characterizes the probability of using random selection in the current resource allocation decision. This action selection method corresponds to the ε-greedy (epsilon-greedy) strategy, in which an action is randomly selected with probability ε and the action with the highest value is selected with probability 1-ε. In this embodiment, ε can be represented by the preset exploration probability. When the selection condition corresponding to the preset exploration probability is met, a candidate resource allocation action is randomly selected from the preset candidate resource allocation action set and determined as the target resource allocation action. This random selection means that the terminal device, under the current state features, is not limited to always selecting the candidate resource allocation action with the highest current value assessment result, but tries other candidate resource allocation actions with a certain probability. In this way, the execution results of different candidate resource allocation actions under different states can be continuously obtained during the resource allocation application process, providing a sample basis for subsequent reward value acquisition, model updates, and action value assessment model optimization.
[0118] Specifically, the target resource allocation action is determined using an ε-greedy strategy to generate the action selection policy $\pi(S_t)$, where $S_t$ is the state at time t, i.e., the current state features, and $\pi(S_t)$ is the policy for selecting a resource allocation action in state $S_t$. When selecting an action, with probability ε a resource allocation action is randomly selected from the set of candidate resource allocation actions as $A_t$, which allows different candidate actions to be explored; with probability 1-ε, the resource allocation action that maximizes $Q(S_t, A; \theta)$ is taken as $A_t$, that is, $A_t = \arg\max_{A} Q(S_t, A; \theta)$, which leverages the value assessment results learned by the current action value assessment model. As a result, resource allocation decisions favor actions with better value assessment results in most cases, while other candidate resource allocation actions are still attempted with a certain probability. This continuously accumulates execution feedback from different actions in dynamic load scenarios involving model inference tasks and concurrent business tasks, preventing action selection from becoming stuck in locally optimal choices and providing richer experience samples to support the subsequent training and updating of the action value assessment model.
[0119] Furthermore, when random selection is not used, the value assessment results of each candidate resource allocation action are compared, and the candidate resource allocation action with the highest value assessment result is determined as the target resource allocation action. By comparing the value assessment results of each candidate resource allocation action, a candidate resource allocation action that better matches the characteristics of the current state can be selected from the preset set of candidate resource allocation actions as the target resource allocation action. By comparing the value assessment results and selecting the candidate resource allocation action with the highest value assessment result, the existing learning results of the action value assessment model are effectively utilized, which helps to improve the pertinence of the determination of the target resource allocation action.
[0120] The random selection branch and the maximum value selection branch together constitute the mechanism for determining the target resource allocation action. These two types of branches work together, enabling the terminal device to execute resource allocation based on current state characteristics and the output of the action value assessment model during the resource allocation decision-making process, while continuously acquiring new action execution feedback during the formation of the resource allocation strategy. For scenarios where model inference tasks and concurrent business tasks run simultaneously, and the task load changes over time, this action selection mechanism improves the flexibility of the target resource allocation action determination process, allowing resource allocation processing to consider both the immediate adaptation effect under the current state and the learning needs of the subsequent action value assessment model.
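The two branches described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_action`, the list-of-floats representation of the value assessment results, and the optional `rng` argument are assumptions introduced here for clarity.

```python
import random
from typing import Optional, Sequence


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return the index of the target resource allocation action.

    q_values[i] is the value assessment result of candidate action i under
    the current state features; epsilon is the preset exploration probability.
    (Names and data layout are hypothetical.)
    """
    if not q_values:
        raise ValueError("candidate action set is empty")
    rng = rng or random
    # Random selection branch: explore with probability epsilon.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Maximum value branch: pick the highest value assessment result.
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With `epsilon=0.0` the function always takes the maximum value branch; with `epsilon=1.0` it always takes the random selection branch.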
[0121] In one embodiment, resource allocation adjustments are made to the model inference task and each concurrent business task based on the target resource allocation action, including:
[0122] S231, based on the target resource allocation action, increase or decrease the operating frequency of at least one heterogeneous computing resource among the processor, graphics processor, and neural network processor in the terminal device;
[0123] S232, based on the target resource allocation action, increase or decrease the memory resources allocated to at least one task among the model inference task and the concurrent business tasks;
[0124] S233, based on the target resource allocation action, increase or decrease the network bandwidth allocated to at least one task among the model inference task and the concurrent business tasks;
[0125] S234, based on the target resource allocation action, raise or lower the execution priority of at least one task among the model inference task and the concurrent business tasks.
[0126] In this embodiment, heterogeneous computing resources refer to computing units within a terminal device that undertake different types of computing tasks. Since the computational resource requirements of model inference tasks and concurrent business tasks differ during execution, and different computing units have varying degrees of adaptability to different task types, after determining the target resource allocation action, the operating frequency of at least one heterogeneous computing resource among the processor, graphics processor, and neural network processor can be increased or decreased based on the target resource allocation action to change the computational power supply level of the corresponding computing unit. For scenarios with high model inference task loads requiring increased inference throughput or reduced processing latency, the operating frequency of the corresponding heterogeneous computing resources can be increased to provide higher computational power for the model inference task. For scenarios with high concurrent business task loads, or where the terminal device's current load is approaching saturation and computational power constraints on some tasks are required, the operating frequency of the corresponding heterogeneous computing resources can be decreased to control the computational resource usage of related tasks.
[0127] Memory resources support data caching, intermediate result storage, and runtime data exchange during task execution. The memory requirements of different tasks vary with the task type, task status, and current processing load. After the target resource allocation action is determined, the memory resources occupied by the model inference task or by at least one of the concurrent business tasks can be reallocated according to the target resource allocation action. Memory can thus be shifted toward tasks that currently require more resource support, or partly reclaimed from tasks with relatively lower memory requirements. For the model inference task, when memory demand increases due to model parameter loading, intermediate feature storage, or batch data processing, the allocated memory resources can be increased to improve the data carrying capacity of the model inference task. For a concurrent business task that is active and needs to maintain continuous data processing or interface responsiveness, its memory resources can likewise be increased. Conversely, when a task enters a standby or paused state, or when its memory demand decreases, its allocated memory resources can be reduced according to the target resource allocation action to free up memory for other tasks.
[0128] Network bandwidth is used to support the transmission capacity of terminal devices during data reception, data transmission, remote interaction, and business communication. Different tasks at different stages of operation have varying degrees of dependence on network bandwidth. The current network status of a terminal device includes its upload and download speeds. For model inference tasks that rely on external data input, remote requests, result uploads, or streaming data interaction, the allocated network bandwidth can be increased when higher data transmission capabilities are required. For concurrent business tasks such as communication tasks, network request tasks, or real-time data processing tasks, network bandwidth allocation can be adjusted according to their current task status and processing needs. Conversely, when the network transmission demand of certain tasks decreases, their allocated network bandwidth can be reduced, freeing up transmission capacity for other tasks.
[0129] Execution priority is used to characterize the order of tasks and the priority of resource acquisition during the scheduling process of terminal devices. Different tasks have different execution priority requirements under different operating states. The task running states of model inference tasks and various concurrent business tasks can include active state, standby state, and paused state. Therefore, the execution priority of at least one task among the model inference tasks or concurrent business tasks can be adjusted by raising or lowering it based on the target resource allocation action and the current running state of the task. When a model inference task is in a critical inference stage and needs to obtain more timely processing support, the execution priority of the model inference task can be increased so that it can obtain resources first during the scheduling process. When a concurrent business task corresponds to a real-time sensitive business such as communication, display, media processing, or interactive response, the execution priority of the concurrent business task can also be increased to ensure its timely response. Conversely, for tasks with low current processing urgency or whose resource requirements can be appropriately postponed, their execution priority can be lowered to reduce their impact on the allocation of resources for critical tasks.
[0130] Furthermore, adjustments to the operating frequency of heterogeneous computing resources, memory resources, network bandwidth, and task execution priorities can be implemented individually or in combination. The target resource allocation action can act on only one type of resource or simultaneously on multiple types, forming a composite resource allocation adjustment scheme tailored to the current operating state. The computing resources, storage resources, communication resources, and scheduling priorities within the terminal device can change synergistically around the target resource allocation action, rather than being adjusted independently. This achieves overall coordination of resource relationships between model inference tasks and various concurrent business tasks, which is beneficial for improving the resource management capabilities of terminal devices in complex multi-tasking scenarios.
[0131] In summary, by implementing target resource allocation actions through specific adjustments to the operating frequency, memory resources, network bandwidth, and task execution priorities of heterogeneous computing resources, model inference tasks and concurrent business tasks can obtain more suitable resource support based on their current load relationships while sharing limited resources. This achieves the transformation of resource allocation decisions into concrete resource control behaviors, improves the adaptability of resource allocation adjustments to the real-time operating status of terminal devices, and helps to balance model inference performance, concurrent business processing efficiency, system energy consumption control, and overall user experience.
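The composite adjustment described in this embodiment can be illustrated with a small sketch. Everything named here is hypothetical and not taken from the application: the `TaskAllocation` and `AllocationAction` records, the units (MHz, MB, kbps), the signed-step encoding of an action, and the clamping at zero are all assumptions chosen only to show how one action can touch several resource types at once.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Tuple


@dataclass
class TaskAllocation:
    """Hypothetical per-task allocation record."""
    memory_mb: int
    bandwidth_kbps: int
    priority: int


@dataclass
class AllocationAction:
    """Hypothetical composite action; a step of 0 leaves a resource unchanged."""
    task: str
    freq_step_mhz: Dict[str, int] = field(default_factory=dict)  # "cpu"/"gpu"/"npu"
    memory_step_mb: int = 0
    bandwidth_step_kbps: int = 0
    priority_step: int = 0


def apply_action(
    freqs_mhz: Dict[str, int],
    tasks: Dict[str, TaskAllocation],
    action: AllocationAction,
) -> Tuple[Dict[str, int], Dict[str, TaskAllocation]]:
    """Return new frequency and task tables; the inputs are not modified."""
    # Raise or lower the operating frequency of the named computing units.
    new_freqs = {
        unit: max(0, mhz + action.freq_step_mhz.get(unit, 0))
        for unit, mhz in freqs_mhz.items()
    }
    # Adjust memory, bandwidth, and priority of the targeted task only.
    current = tasks[action.task]
    new_tasks = dict(tasks)
    new_tasks[action.task] = replace(
        current,
        memory_mb=max(0, current.memory_mb + action.memory_step_mb),
        bandwidth_kbps=max(0, current.bandwidth_kbps + action.bandwidth_step_kbps),
        priority=current.priority + action.priority_step,
    )
    return new_freqs, new_tasks
```

On a real terminal device the returned tables would be handed to platform-specific frequency, memory, traffic, and scheduling controls, which the sketch deliberately leaves out.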
[0132] In one embodiment, after adjusting the resource allocation for the model inference task and each concurrent business task according to the target resource allocation action, the method further includes:
[0133] S241, obtain the running status information and running feedback information after the resource allocation action is executed;
[0134] S242, based on the running feedback information, determine the reward value corresponding to the target resource allocation action, and obtain the adjusted state characteristics based on the running status information after the resource allocation action is executed;
[0135] S243, construct model update training samples based on the current state characteristics, the target resource allocation action, the reward value, and the adjusted state characteristics, and update the action value evaluation model based on the model update training samples;
[0136] S244, copy the updated action value assessment model parameters to the target network model to update the target network model.
[0137] In this embodiment, the operational status information after the resource allocation action is executed can continue to reflect the device resource status information, as well as the task operational status information of the model inference task and concurrent business tasks. For example, it can reflect changes in heterogeneous computing resource usage, memory usage, network bandwidth usage, battery status, temperature, and task operational status. Operational feedback information can characterize the task completion status, system energy consumption changes, model inference performance changes, and user-side response status after the resource allocation action is executed. By acquiring operational status information and operational feedback information after the target resource allocation action is executed, a direct observation of the execution result of this resource allocation action can be formed, allowing subsequent model updates to be based on the actual execution results. This achieves closed-loop acquisition of the resource allocation action execution effect, providing a feedback basis for reward value determination and action value assessment model updates.
[0138] After obtaining operational feedback information, the reward value corresponding to the target resource allocation action is determined, and adjusted state features are obtained based on the operational status information after the resource allocation action is executed. The reward value is used to quantitatively characterize the execution effect of the target resource allocation action in the current resource allocation cycle and is an important feedback quantity for the action value evaluation model to conduct subsequent learning. For example, the reward value can be determined by comprehensively considering task completion, system energy consumption, model inference performance, and user experience. Higher rewards can be assigned when high-priority tasks receive timely responses, when system energy consumption is reduced, when model inference speed and accuracy are improved, and when user operation smoothness and response time are guaranteed. The above feedback quantities can first undergo unified quantification and normalization processing, and then form corresponding reward values according to a preset method. Simultaneously, the operational status information after the resource allocation action is executed can be processed using a feature extraction method consistent with the current state features to generate adjusted state features. The adjusted state features are used to characterize the terminal device state after the target resource allocation action is executed, enabling the state changes before and after the action to be expressed in a structured manner. It achieves a unified representation of the execution effect and subsequent state of actions, enabling the results of resource allocation actions to be further utilized by the action value assessment model.
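One possible reading of "unified quantification and normalization, then a preset method" is a weighted sum of normalized feedback terms. The sketch below assumes exactly that; the application does not specify the preset method, so the weights, the four term names, and the min-max normalization are illustrative assumptions only.

```python
from typing import Dict

# Hypothetical weights; the application does not give concrete values.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "task_completion": 0.4,
    "energy_saving": 0.2,
    "inference_performance": 0.2,
    "user_experience": 0.2,
}


def normalize(value: float, low: float, high: float) -> float:
    """Min-max normalize a raw feedback quantity into [0, 1], clamping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def reward_value(feedback: Dict[str, float],
                 weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Combine already-normalized feedback terms into a single reward value.

    Quantities where lower is better (for example energy consumption) are
    expected to be inverted by the caller, e.g. 1 - normalize(energy, lo, hi).
    """
    return sum(weight * feedback[name] for name, weight in weights.items())
```

Under these assumed weights, a cycle in which every normalized term equals 1 yields a reward close to 1, and a cycle in which every term equals 0 yields 0.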
[0139] After obtaining the reward value and adjusted state features, model update training samples are constructed, and the action value assessment model is updated based on these samples. The model update training samples essentially characterize the state transition relationships in a complete resource allocation process, corresponding to a combination of state, action, reward, and next state. Specifically, the current state features characterize the terminal device state before the target resource allocation action is executed; the target resource allocation action characterizes the actual resource allocation decision executed in this resource allocation cycle; the reward value characterizes the execution effect brought about by this resource allocation decision; and the adjusted state features characterize the terminal device state after the resource allocation action is executed. By combining the above information as model update training samples, the feedback results and state change relationships after the target resource allocation action is executed in the current state can be completely recorded. Furthermore, the model update training samples can be stored in a sample library, and update samples can be randomly sampled from the sample library to break the temporal correlation between consecutive resource allocation cycles and improve the stability of the action value assessment model update process. This realizes the transformation of resource allocation action execution experience into learnable samples, providing a data foundation for the continuous optimization of the action value assessment model.
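The sample library with random sampling can be sketched as a small replay buffer. The class and field names are hypothetical, and the bounded capacity (oldest samples discarded first) is an assumption; the application only states that samples are stored and randomly sampled.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Optional, Sequence


class Transition(NamedTuple):
    """One model update training sample: state, action, reward, next state."""
    state: Sequence[float]       # current state features
    action: int                  # index of the target resource allocation action
    reward: float                # reward value
    next_state: Sequence[float]  # adjusted state features


class SampleLibrary:
    """Bounded store of transitions with uniform random sampling."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._buffer: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        # When the buffer is full, deque(maxlen=...) drops the oldest sample.
        self._buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Sampling without replacement breaks the temporal ordering of cycles.
        count = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), count)

    def __len__(self) -> int:
        return len(self._buffer)
```

Because consecutive resource allocation cycles tend to produce similar states, drawing a random batch instead of the most recent samples is what decorrelates the updates.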
[0140] When updating the action value evaluation model based on the model update training samples, the current action value and the target action value can be calculated based on the current state features and the adjusted state features, respectively. Specifically, the current state features from the model update training samples can be input into the action value evaluation model to obtain the action value corresponding to each candidate resource allocation action under the current state features, and the current action value corresponding to the actual execution action can be determined based on the target resource allocation action. Simultaneously, the adjusted state features from the model update training samples can be input into the target network model to obtain the action value corresponding to each candidate resource allocation action in the preset candidate resource allocation action set under the adjusted state features, and the target action value can be determined based on the reward value and the action value. The target action value can be jointly determined by the immediate reward and the superior action value among the candidate resource allocation actions in subsequent states. For example, the action value in subsequent states can be weighted using a discount factor γ to take into account both the current reward and subsequent benefits.
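The two quantities described above can be computed as in the following sketch. It assumes the common choice in which the "superior action value" is the maximum target-network value over the candidate actions; the function names and the list representation of network outputs are hypothetical.

```python
from typing import Sequence


def current_action_value(q_online: Sequence[float], action: int) -> float:
    """Value the action value evaluation model assigns to the executed action.

    q_online holds the model's outputs for the current state features,
    one entry per candidate resource allocation action.
    """
    return q_online[action]


def target_action_value(reward: float, q_target_next: Sequence[float],
                        gamma: float) -> float:
    """Reward plus the discounted best target-network value in the next state.

    q_target_next holds the target network model's outputs for the adjusted
    state features; gamma is the discount factor.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return reward + gamma * max(q_target_next)
```

For example, with a reward of 1.0, next-state values `[0.0, 2.0, 1.0]`, and γ = 0.5, the target action value is 1.0 + 0.5 × 2.0 = 2.0.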
[0141] Furthermore, a loss function can be constructed based on the target action value and the current action value. By minimizing the loss function, the model parameters of the action value evaluation model can be adjusted, allowing the action value output by the model to gradually approach the target action value determined by actual execution feedback. This enables the action value evaluation model to continuously learn from the actual resource allocation execution results, which is beneficial for improving the model's adaptability to dynamic task loads and complex operating states.
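The application does not state the form of the loss function. A common choice for deep Q-networks, shown here only as an assumption, is the mean squared difference between the target action value and the current action value. The symbols $R_t$ (reward value), $S_{t+1}$ (adjusted state features), $\theta^{-}$ (target network model parameters), and $\alpha$ (learning rate) are notation introduced for this illustration.

```latex
y_t = R_t + \gamma \max_{A'} Q(S_{t+1}, A'; \theta^{-}), \qquad
L(\theta) = \mathbb{E}\left[ \bigl( y_t - Q(S_t, A_t; \theta) \bigr)^{2} \right], \qquad
\theta \leftarrow \theta - \alpha \, \nabla_{\theta} L(\theta)
```

Here $y_t$ is held fixed during the gradient step, so only the action value evaluation model parameters $\theta$ are adjusted.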
[0142] The updated action value assessment model parameters are copied to the target network model to update the target network model. The target network model participates in the calculation of the target action value during model updates. Its parameter update frequency is typically lower than that of the action value assessment model, providing a relatively stable reference value output during training. By copying the updated action value assessment model parameters to the target network model, the target network model can gradually follow the learning results of the action value assessment model while maintaining relative stability, avoiding excessively rapid fluctuations in the target action value during continuous updates. The purpose of the target network update process is to improve training stability and reduce the risk of oscillations in the action value assessment model during continuous updates. This achieves parameter synchronization between the target network model and the action value assessment model, providing a stable target reference for model updates and action value calculations in subsequent resource allocation cycles.
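The parameter copy can be sketched as below. The application says only that the target network model is typically updated less frequently than the action value assessment model; copying every `period` updates is one way to realize that and is an assumption, as are the class name and the dictionary representation of model parameters.

```python
import copy
from typing import Any, Dict


class TargetSync:
    """Copies online model parameters to the target network every `period` updates."""

    def __init__(self, online_params: Dict[str, Any], period: int) -> None:
        if period < 1:
            raise ValueError("period must be at least 1")
        self.period = period
        self.updates = 0
        # Deep copy so later in-place changes to the online model do not leak in.
        self.target_params = copy.deepcopy(online_params)

    def step(self, online_params: Dict[str, Any]) -> bool:
        """Call once after each model update; returns True when a copy happened."""
        self.updates += 1
        if self.updates % self.period == 0:
            self.target_params = copy.deepcopy(online_params)
            return True
        return False
```

Setting `period=1` reproduces a copy after every update, while larger values keep the target action value stable for longer between synchronizations.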
[0143] In summary, after each target resource allocation action is executed, the terminal device can generate a reward value and adjusted state features based on the post-execution operational status information and feedback information. This information is then used to construct model update training samples, update the action value evaluation model, and synchronously update the target network model. In this way, the action value evaluation model can not only perform resource allocation action selection based on pre-training, but also continuously absorb operational feedback from model inference tasks and concurrent business tasks during actual application, enabling the resource allocation strategy to continuously optimize as the actual operational status of the terminal device changes. This continuous optimization of the resource allocation strategy is beneficial for improving the real-time performance, adaptability, and long-term optimization capabilities of resource allocation adjustments in dynamic multi-task scenarios.
[0144] Figure 4 is a schematic structural diagram of the resource allocation device for the terminal device provided in an embodiment of this application. As shown in Figure 4, the resource allocation device 40 for the terminal device provided in this embodiment includes:
[0145] The data acquisition module 401 is used to acquire the operating status information of the terminal device and acquire the current status characteristics based on the operating status information; wherein, the operating status information includes device resource status information, task operating status information of the model inference task, and task operating status information of at least one concurrent business task;
[0146] The value assessment module 402 is used to obtain the value assessment results of each candidate resource allocation action in the preset candidate resource allocation action set based on the current state characteristics and through a pre-trained action value assessment model;
[0147] The resource adjustment module 403 is used to determine the target resource allocation action based on the value assessment results corresponding to each candidate resource allocation action, and to adjust the allocation of resources for the model inference task and each concurrent business task based on the target resource allocation action.
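The cooperation of the three modules can be pictured as one decision cycle. The sketch below is purely illustrative: the class name, the callable-based wiring, and the omission of the exploration branch (only the maximum value branch is shown) are assumptions, not the structure disclosed in Figure 4.

```python
from typing import Callable, Dict, List, Sequence


class ResourceAllocationDevice:
    """Hypothetical wiring of modules 401, 402, and 403 as plain callables."""

    def __init__(
        self,
        acquire: Callable[[], Dict[str, float]],
        extract: Callable[[Dict[str, float]], List[float]],
        evaluate: Callable[[Sequence[float]], Sequence[float]],
        adjust: Callable[[int], None],
    ) -> None:
        self.acquire = acquire    # data acquisition module 401: status info
        self.extract = extract    # data acquisition module 401: state features
        self.evaluate = evaluate  # value assessment module 402
        self.adjust = adjust      # resource adjustment module 403

    def run_cycle(self) -> int:
        """Run one resource allocation cycle and return the chosen action index."""
        status = self.acquire()
        state = self.extract(status)
        values = self.evaluate(state)
        # Maximum value branch only; exploration is left out of this sketch.
        action = max(range(len(values)), key=lambda i: values[i])
        self.adjust(action)
        return action
```

In use, `evaluate` would wrap the pre-trained action value assessment model and `adjust` would apply the selected action to the device resources.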
[0148] The resource allocation device 40 of the terminal device provided in this embodiment can execute the method provided in the above method embodiment. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0149] Figure 5 is a schematic structural diagram of an electronic device provided in an embodiment of this application. As shown in Figure 5, the electronic device 50 provided in this embodiment includes at least one processor 501 and a memory 502. Optionally, the electronic device 50 further includes a communication component 503. The processor 501, memory 502, and communication component 503 are connected via a bus 504.
[0150] In a specific implementation, at least one processor 501 executes computer execution instructions stored in memory 502, causing at least one processor 501 to perform the above-described method.
[0151] The specific implementation process of processor 501 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.
[0152] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.
[0153] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.
[0154] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0155] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0156] This application provides a chip, which includes at least one processor. The processor is used to run program instructions to execute the model inference method involved in the above method embodiments.
[0157] This application also provides a chip module on which a computer program is stored. When the computer program is executed by the chip module, it implements the model inference method involved in the above method embodiments.
[0158] This application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the above-described method.
[0159] The aforementioned readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The readable storage medium can be any available medium accessible to a general-purpose or special-purpose computer.
[0160] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0161] Finally, it should be noted that other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein, and is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims
1. A resource allocation method for a terminal device, characterized in that the method comprises: obtaining operating status information of the terminal device, and obtaining current state characteristics based on the operating status information, wherein the operating status information includes device resource status information, task operating status information of a model inference task, and task operating status information of at least one concurrent business task; obtaining, based on the current state characteristics and through a pre-trained action value evaluation model, the value evaluation results corresponding to each candidate resource allocation action in a preset candidate resource allocation action set; and determining a target resource allocation action based on the value evaluation results corresponding to each of the candidate resource allocation actions, and adjusting the resource allocation of the model inference task and each of the concurrent business tasks according to the target resource allocation action.
2. The method according to claim 1, characterized in that the step of obtaining the operating status information of the terminal device and obtaining the current state characteristics based on the operating status information includes: obtaining, by using preset sensors and system logs, the device resource status information of the terminal device, the task operating status information of the model inference task in the terminal device, and the task operating status information of at least one concurrent business task in the terminal device; and obtaining the current state characteristics of the terminal device through feature extraction processing based on the device resource status information, the task operating status information of the model inference task, and the task operating status information of the concurrent business task.
3. The method according to claim 1, characterized in that the action value evaluation model is a deep Q-network action value evaluation model; accordingly, the step of obtaining, based on the current state characteristics and through the pre-trained action value evaluation model, the value evaluation results corresponding to each candidate resource allocation action in the preset candidate resource allocation action set includes: obtaining the preset candidate resource allocation action set, and obtaining the pre-trained deep Q-network action value evaluation model; and inputting the current state characteristics into the action value evaluation model, so that the action value evaluation model performs approximate calculation on the current state characteristics and each candidate resource allocation action in the preset candidate resource allocation action set according to the state-action value function, and outputs the value evaluation results corresponding to each candidate resource allocation action under the current state characteristics.
4. The method according to claim 3, characterized in that training the deep Q-network action value evaluation model includes: constructing an initial deep Q-network action value evaluation model and a target network model corresponding to the initial deep Q-network action value evaluation model; acquiring, in multiple resource allocation cycles, the current state characteristics of the terminal device, the resource allocation action, the reward value corresponding to the execution of the resource allocation action, and the operating status information after the execution of the resource allocation action; generating adjusted state characteristics based on the operating status information after the resource allocation action is executed; combining the current state characteristics, the resource allocation action, the reward value, and the adjusted state characteristics into an experience sample and storing it in a sample library; randomly sampling training samples from the sample library, inputting the current state characteristics in the training samples into the initial deep Q-network action value evaluation model to obtain the action value of each candidate resource allocation action under the current state characteristics, and determining the current action value based on the resource allocation action; inputting the adjusted state characteristics in the training samples into the target network model to obtain the action value of each candidate resource allocation action in the preset candidate resource allocation action set under the adjusted state characteristics, and determining the target action value based on the reward value and the action value corresponding to each candidate resource allocation action; and constructing a loss function based on the target action value and the current action value, and training the initial deep Q-network action value evaluation model by minimizing the loss function to obtain the pre-trained deep Q-network action value evaluation model.
5. The method according to claim 1, characterized in that the step of determining the target resource allocation action based on the value evaluation results corresponding to each of the candidate resource allocation actions includes: randomly selecting, based on a preset exploration probability, a candidate resource allocation action from the preset candidate resource allocation action set, and determining the randomly selected candidate resource allocation action as the target resource allocation action; or comparing the value evaluation results corresponding to each of the candidate resource allocation actions, and determining the candidate resource allocation action with the highest value evaluation result as the target resource allocation action.
6. The method according to claim 1, characterized in that the step of adjusting the resource allocation of the model inference task and each of the concurrent business tasks according to the target resource allocation action includes: increasing or decreasing, based on the target resource allocation action, the operating frequency of at least one heterogeneous computing resource among the processor, graphics processor, and neural network processor in the terminal device; and/or increasing or decreasing, based on the target resource allocation action, the memory resources allocated to at least one task among the model inference task and the concurrent business tasks; and/or increasing or decreasing, based on the target resource allocation action, the network bandwidth allocated to at least one task among the model inference task and the concurrent business tasks; and/or raising or lowering, based on the target resource allocation action, the execution priority of at least one task among the model inference task and the concurrent business tasks.
7. The method according to claim 1, characterized in that, after adjusting the resource allocation for the model inference task and each of the concurrent business tasks according to the target resource allocation action, the method further comprises: obtaining running status information and running feedback information after the target resource allocation action is executed; determining, based on the running feedback information, the reward value corresponding to the target resource allocation action, and obtaining adjusted state features based on the running status information after the target resource allocation action is executed; constructing a model update training sample based on the current state features, the target resource allocation action, the reward value, and the adjusted state features, and updating the action value evaluation model based on the model update training sample; and copying the updated model parameters of the action value evaluation model to the target network model to update the target network model.
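The online update loop of claim 7 might look like the sketch below. Several parts are assumptions rather than content of the application: the reward formula in `reward_from_feedback` (a latency budget and a dropped-frame penalty), the feedback field names, the table-based `TabularQ` stand-in for the action value evaluation model, and the temporal-difference update with `gamma` and `lr`.

```python
import copy


def reward_from_feedback(feedback, latency_budget_ms=50.0, drop_penalty=1.0):
    """Hypothetical reward: positive when inference beats the latency budget,
    reduced by a penalty for each frame dropped by concurrent tasks."""
    latency_term = (latency_budget_ms - feedback["inference_latency_ms"]) / latency_budget_ms
    return latency_term - drop_penalty * feedback["dropped_frames"]


class TabularQ:
    """Table-based stand-in for the action value evaluation model."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.table = {}

    def q_values(self, state):
        return self.table.setdefault(tuple(state), [0.0] * self.n_actions)


def online_update(model, target_model, sample, gamma=0.9, lr=0.5):
    """Update the model from one sample, then sync the target network model."""
    state, action, reward, next_state = sample
    td_target = reward + gamma * max(target_model.q_values(next_state))
    q = model.q_values(state)
    q[action] += lr * (td_target - q[action])
    # Copy the updated parameters to the target network model.
    target_model.table = copy.deepcopy(model.table)


model = TabularQ(n_actions=2)
target_model = TabularQ(n_actions=2)

feedback = {"inference_latency_ms": 25.0, "dropped_frames": 0}
reward = reward_from_feedback(feedback)

# (current state features, target action, reward, adjusted state features)
sample = ((1, 0), 1, reward, (0, 1))
online_update(model, target_model, sample)
```

The sketch copies parameters after every update, which is the literal reading of the claim; copying only every few updates is a common alternative, but the claim does not state an interval.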
8. A resource allocation apparatus for a terminal device, characterized by comprising: a data acquisition module, configured to acquire the operating status information of the terminal device and to acquire the current status characteristics based on the operating status information, wherein the operating status information includes device resource status information, task operating status information of the model inference task, and task operating status information of at least one concurrent business task; a value assessment module, configured to obtain, based on the current status characteristics and through a pre-trained action value assessment model, the value assessment result of each candidate resource allocation action in the preset candidate resource allocation action set; and a resource adjustment module, configured to determine the target resource allocation action based on the value assessment results corresponding to each of the candidate resource allocation actions, and to adjust the resource allocation of the model inference task and each of the concurrent business tasks according to the target resource allocation action.
9. An electronic device, characterized by comprising: a memory and a processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 7.