Task processing method, electronic device and storage medium
By constructing a two-layer mapping relationship between thread blocks and task indices, and between task indices and sub-model indices, the target thread block and sub-model are dynamically determined, solving the problems of low efficiency and poor flexibility in heterogeneous multi-task processing, and achieving efficient parallel computing.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: ALIBABA (CHINA) CO LTD
- Filing Date: 2025-12-09
- Publication Date: 2026-07-02
AI Technical Summary
Existing heterogeneous multi-task batch processing frameworks lack versatility and universality, resulting in low processing efficiency and poor flexibility.
By constructing a first mapping relationship between the thread block index and the task index, and a second mapping relationship between the task index and the sub-model index of the hybrid inference model, the target sub-model corresponding to the target thread block is dynamically determined, and the target sub-model is used to analyze and calculate multiple tasks to be processed.
The method improves processor utilization efficiency and flexibility in heterogeneous multitasking, and provides a flexible, efficient and general parallel computing framework that can handle tasks of different types, scales or computing needs at the same time.
Description
Task processing method, electronic device and storage medium
Technical Field
[0001] This disclosure relates to large model technology and computer technology, and more specifically, to a task processing method, electronic device and storage medium.
Background Technology
[0002] With the increasing prevalence of deep learning models, more and more heterogeneous multi-task batch processing frameworks for Graphics Processing Units (GPUs) have been proposed. The Mixture-of-Experts (MoE) model is a model design that can dynamically increase the number of model parameters while keeping the computational load relatively constant, so it can significantly improve the quality or performance of model output without significantly increasing the computational burden. MoE model inference can essentially be viewed as a problem of batching heterogeneous tasks, where the heterogeneity lies in the fact that each expert has a different computational task. Because of the heterogeneity and dynamic load characteristics of MoE models, related heterogeneous multi-task batch processing frameworks can usually be optimized only for specific hardware architectures and task types when processing MoE model inference, and therefore lack generality. How to efficiently execute heterogeneous multi-task batch processing on computer processors has thus become an urgent problem to be solved.
[0003] There is currently no effective solution to the above problems.
Summary of the Invention
[0004] This disclosure provides a task processing method, an electronic device, and a storage medium to at least solve the technical problems of low processing efficiency and poor flexibility in the related art when processing heterogeneous multitasking.
[0005] According to one aspect of the present disclosure, a task processing method is provided, comprising: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed, wherein each task to be processed includes at least one word vector, the first mapping relationship is used to represent the mapping relationship between a thread block index and a task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and a sub-model index of a hybrid inference model; determining a target sub-model corresponding to a target thread block in the hybrid inference model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be processed; and analyzing and calculating the multiple tasks to be processed using the target sub-model to obtain a task processing result.
[0006] According to another aspect of the embodiments of this disclosure, a task processing method is also provided, comprising: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be computed, wherein the first mapping relationship is used to represent the mapping relationship between a thread block index and a task index of a task to be computed, and the second mapping relationship is used to represent the mapping relationship between a task index of a task to be computed and a sub-model index of a heterogeneous multi-task computing model; determining a target sub-model corresponding to a target thread block in the heterogeneous multi-task computing model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be computed; and analyzing and computing the multiple tasks to be computed using the target sub-model to obtain task processing results.
[0007] According to another aspect of the embodiments of this disclosure, a task processing method is also provided, comprising: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multiple tasks to be processed, each task to be processed including at least one word vector; and returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes: a task processing result, wherein the task processing result is obtained by analyzing and calculating multiple tasks to be processed using a target sub-model corresponding to a target thread block, the target sub-model is determined in a hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block is used to execute multiple tasks to be processed, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0008] According to another aspect of the embodiments of this disclosure, a task processing method is also provided, comprising: acquiring a currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, each task to be processed including at least one word vector; responding to the data processing dialogue request, returning a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: a task processing result, wherein the task processing result is obtained by analyzing and calculating multiple tasks to be processed using a target sub-model corresponding to a target thread block, the target sub-model is determined in a hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block is used to execute multiple tasks to be processed, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model; and displaying the task processing result in a graphical user interface.
[0009] According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored executable program, wherein, when the executable program is executed, it controls the device where the computer-readable storage medium is located to perform the methods of the various embodiments of the present disclosure.
[0010] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, including a computer program that, when executed by a processor, implements the methods of various embodiments of this disclosure.
[0011] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, including a non-volatile computer-readable storage medium storing a computer program that, when executed by a processor, implements the methods of various embodiments of this disclosure.
[0012] According to another aspect of the embodiments of this disclosure, a computer program is also provided, which, when executed by a processor, implements the methods of the various embodiments of this disclosure.
[0013] In this embodiment of the disclosure, a first mapping relationship and a second mapping relationship are constructed to associate multiple tasks to be processed. Each task to be processed includes at least one word vector. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model. Then, based on the first mapping relationship and the second mapping relationship, the target sub-model corresponding to the target thread block is determined in the hybrid inference model. Finally, the target sub-model is used to analyze and calculate multiple tasks to be processed to obtain the task processing result. Thus, the target sub-model corresponding to each thread block is dynamically determined through a two-layer mapping mechanism, and the target sub-model is used to batch process multiple tasks to be processed. This can effectively balance the computing load during thread operation and improve the utilization efficiency of the processor. The task processing method in this embodiment of the disclosure provides a flexible, efficient and general parallel computing framework by constructing and utilizing a two-layer mapping mechanism. This allows for the simultaneous processing of multiple tasks of different types, scales or computing requirements in parallel or distributed computing environments, thereby achieving the goal of batch processing heterogeneous multitasking. This improves the processing efficiency and flexibility of heterogeneous multitasking, and solves the technical problems of low processing efficiency and poor flexibility in related technologies when processing heterogeneous multitasking.
[0014] It is worth noting that the above general description and the following detailed description are merely illustrative and explanatory and do not constitute a limitation of this disclosure.
Description of the Drawings
[0015] The accompanying drawings, which are included to provide a further understanding of this disclosure and form part of this disclosure, illustrate exemplary embodiments of the present disclosure and are used to explain the disclosure, but do not constitute an undue limitation of the disclosure. In the drawings:
[0016] Figure 1 is a schematic diagram of an application scenario of a task processing method according to an embodiment of the present disclosure;
[0017] Figure 2 is a flowchart of a task processing method according to an embodiment of the present disclosure;
[0018] Figure 3 is a flowchart of another task processing method according to an embodiment of the present disclosure;
[0019] Figure 4 is a flowchart of another task processing method according to an embodiment of the present disclosure;
[0020] Figure 5 is a flowchart of another task processing method according to an embodiment of the present disclosure;
[0021] Figure 6 is a structural block diagram of a task processing device according to an embodiment of the present disclosure;
[0022] Figure 7 is a structural block diagram of another task processing device according to an embodiment of the present disclosure;
[0023] Figure 8 is a structural block diagram of another task processing device according to an embodiment of the present disclosure;
[0024] Figure 9 is a structural block diagram of another task processing device according to an embodiment of the present disclosure;
[0025] Figure 10 is a structural block diagram of a computing device according to an embodiment of the present disclosure;
[0026] Figure 11 is a structural block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0027] To enable those skilled in the art to better understand the present disclosure, the technical solutions of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present disclosure, and not all embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present disclosure.
[0028] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0029] The technical solution disclosed herein is primarily implemented using large-scale model technology. Here, "large-scale model" refers to a deep learning model with a massive number of parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of parameters. Large-scale models, also known as foundation models, are pre-trained using large-scale unlabeled corpora to produce pre-trained models with hundreds of millions of parameters. These models are adaptable to a wide range of downstream tasks and exhibit good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.
[0030] It should be noted that, in practical applications, large models can be fine-tuned using a small number of samples to adapt them to different tasks. For example, large models can be widely used in Natural Language Processing (NLP), computer vision, and speech processing. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. Therefore, the main application scenarios for large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design. In this embodiment, data processing using a hybrid inference model in a task processing scenario is used as an example for explanation.
[0031] First, some nouns or terms that appear in the description of the embodiments of this disclosure shall be interpreted as follows:
[0032] Kernel function: A function that executes on the GPU and can be executed concurrently by multiple threads. Kernel functions are typically used to perform large-scale parallel tasks, such as matrix operations, image processing, and deep learning model inference.
[0033] Streaming Multiprocessor (SM): An SM contains a large number of processing cores, and a GPU contains a large number of SMs.
[0034] Thread: A thread executes on a single processing core.
[0035] Thread block: A group of threads executing on the same Streaming Multiprocessor (SM). Threads within the same thread block can communicate with each other at low cost.
[0036] Warp: A group of threads that are scheduled simultaneously. It is the smallest unit of SM thread scheduling and is typically 32 consecutive threads on a common GPU.
[0037] General matrix multiplication (GEMM): Usually refers to the multiplication of two matrices.
[0038] Tensor Core: A unit on a GPU dedicated to calculating matrix multiplication.
[0039] Tile: A sub-block of a matrix that is computed by a single thread block when computing GEMM.
[0040] Batching: Combining multiple tasks into a batch for simultaneous execution.
[0041] Token: The smallest unit of input in a language model.
[0042] Inference: Given the model structure and parameters, a sequence of tokens is input, and the next token is calculated and output.
[0043] According to embodiments of this disclosure, a task processing method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases, the steps shown or described may be executed in a different order than that shown here.
[0044] Considering the large number of model parameters in large models and the limited computing resources of mobile terminals, the method provided in this disclosure can be applied to the application scenario shown in Figure 1, but is not limited thereto. In the application scenario shown in Figure 1, the large model is deployed on server 10. Server 10 can connect to one or more client devices 20 via a local area network (LAN), wide area network (WAN), Internet, or other types of data networks. These client devices 20 may include, but are not limited to, smartphones, tablets, laptops, PDAs, personal computers, smart home devices, and in-vehicle devices. Client devices 20 can interact with users through a graphical user interface to invoke the large model, thereby implementing the method provided in this disclosure.
[0045] In this embodiment, the system comprising a client device and a server can perform the following steps: the server obtains a currently input data processing dialogue request from the client device, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, each task to be processed including at least one word vector; in response to the data processing dialogue request, the server returns a data processing dialogue response to the client device. The information carried in the data processing dialogue response includes: a task processing result, which is obtained by analyzing and calculating multiple tasks to be processed using a target sub-model corresponding to the target thread block; the target sub-model is determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship; the target thread block is used to execute multiple tasks to be processed; the first mapping relationship represents the mapping relationship between the thread block index and the task index of the task to be processed; and the second mapping relationship represents the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model. After receiving the task processing result, the client device displays the task processing result in a graphical user interface.
[0046] It should be noted that, with the rapid development of high-performance computing units, the methods provided in this disclosure can also be applied to integrated model machines in other application scenarios. In one optional embodiment, the integrated model machine has multiple built-in models. Users can select one model and adjust it as needed to obtain their own model. The high-performance computing unit built into the integrated model machine can then directly call the adjusted model to execute the methods provided in this disclosure. In another optional embodiment, the integrated model machine has a built-in pre-trained model, so the high-performance computing unit built into the integrated model machine can directly call this model to execute the methods provided in this disclosure.
[0047] Furthermore, when users need to train their own models, they can upload their own datasets via the client. These datasets are then sent to the server, allowing the server to adjust the pre-trained model using the dataset to obtain the user's customized model, which can then be deployed to the production environment. To facilitate users' model adjustment needs, the server provides complete adjustment tools, development frameworks, and processes, supporting multiple adjustment strategies. This allows the adjusted model to better adapt to different application domains and achieve a high degree of customization.
[0048] In the above operating environment, this disclosure provides a task processing method as shown in Figure 2. Figure 2 is a flowchart of a task processing method according to an embodiment of this disclosure. As shown in Figure 2, the method may include the following steps:
[0049] Step S21: Construct a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed, wherein each task to be processed includes at least one word vector, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model;
[0050] Step S22: Based on the first mapping relationship and the second mapping relationship, determine the target sub-model corresponding to the target thread block in the hybrid inference model, wherein the target thread block is used to execute multiple tasks to be processed;
[0051] Step S23: Analyze and calculate multiple tasks to be processed using the target sub-model to obtain the task processing results.
[0052] The aforementioned tasks are computational tasks that need to be executed during the inference process of the hybrid inference model. These tasks are heterogeneous, and in parallel or distributed computing environments, multiple tasks of different types, scales, or computational requirements can be processed simultaneously. Each task requires different computational resources, execution time, data structures, or algorithms. Specifically, different types of tasks involve different mathematical operations, such as matrix multiplication, convolution, and activation functions. In tasks of different scales, the amount of data processed by each task varies, depending on the size of the input data, the complexity of the expert model, or the number of tokens allocated to the model. Different tasks have different computational resource requirements; for example, some tasks require more Tensor Core resources for intensive matrix operations, while others rely more on floating-point units.
[0053] It should be noted that efficient processing of heterogeneous multitasking is a common computing challenge, and different types of processors may achieve similar functions in their own ways. Specifically, the task processing method can be applied to, but is not limited to, GPUs, central processing units (CPUs), field-programmable gate arrays (FPGAs), and tensor processing units (TPUs). The embodiments disclosed herein are only examples and do not constitute specific limitations.
[0054] Hybrid inference models can be categorized as MoE models. MoE is a machine learning model architecture that introduces expert mechanisms into traditional deep learning models, aiming to address model efficiency and performance issues in large-scale data and complex task scenarios. The core idea of MoE models is to decompose a complex model into multiple relatively simple expert models. These expert models are sub-models of the hybrid inference model, with each expert focusing on processing a specific part or type of data. A controller is responsible for deciding which expert to assign the input data to, allowing the model to dynamically select the appropriate expert based on the characteristics of the input data. This improves overall model capability while reducing unnecessary computation and increasing computational efficiency.
[0055] In the MoE model, multiple experts constitute multiple sub-models or components, each with its own weights, parameters, and specific computational functions. For example, in language processing tasks, different experts may be responsible for different types of language structures or semantic understanding. When data is input into the MoE model, the controller dynamically allocates task data to different experts based on a strategy, such as the experts' abilities and the characteristics of the input data. Each expert independently processes the assigned task data, and then the controller merges or selects the outputs of each expert to generate the final model output. The MoE model can significantly improve the scalability and flexibility of the model, allowing it to handle larger-scale and more complex tasks while maintaining high accuracy. Especially in large language models, the MoE architecture overcomes the computational bottleneck of a single model when handling a large number of tasks through parallel computation among experts and dynamic task allocation.
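The dynamic allocation of task data by the controller can be illustrated with a minimal sketch. The disclosure does not fix a routing strategy, so top-1 selection by score is an assumption made only for illustration, and the names `route_tokens` and `gate_scores` are hypothetical.

```python
from collections import defaultdict

def route_tokens(gate_scores):
    """Group token indices by the expert with the highest gate score.

    gate_scores[t][e] is a hypothetical controller score of expert e for
    token t. Top-1 selection is an assumption made for illustration only.
    """
    assignment = defaultdict(list)
    for token_idx, scores in enumerate(gate_scores):
        # Pick the expert with the largest score (first one wins on ties).
        best_expert = max(range(len(scores)), key=lambda e: scores[e])
        assignment[best_expert].append(token_idx)
    return dict(assignment)

# Four tokens, three experts; expert 1 receives no tokens.
scores = [
    [0.7, 0.1, 0.2],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
print(route_tokens(scores))  # {0: [0, 2], 2: [1, 3]}
```

In this toy input the per-expert workloads differ and one expert receives nothing, which is the kind of uneven, input-dependent load that makes MoE inference a heterogeneous batching problem.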
[0056] The tasks to be processed mentioned above are the non-empty tasks among multiple original tasks. The number of original tasks is equal to the number of experts in the MoE model. The original tasks corresponding to the experts in the MoE model are highly heterogeneous; therefore, MoE inference optimization can be viewed as a batching problem of heterogeneous tasks. If an expert in the MoE model is not assigned any token during computation, that expert corresponds to an empty task; if an expert is assigned at least one token during computation, that expert corresponds to a non-empty task. For example, the MoE model includes N > 0 experts, and the N experts can execute N original tasks. Among the N original tasks, M are non-empty tasks, where M ≤ N, and the expert of each non-empty task is assigned at least one token.
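As a small illustration of the distinction between original tasks and tasks to be processed, the following sketch filters out empty tasks from per-expert token counts. The function name `non_empty_tasks` and the sample counts are hypothetical.

```python
def non_empty_tasks(tokens_per_expert):
    """Return the expert indices whose original task is non-empty.

    tokens_per_expert[e] is the number of tokens assigned to expert e
    (N entries, one per original task).
    """
    return [e for e, count in enumerate(tokens_per_expert) if count > 0]

tokens_per_expert = [5, 0, 3, 0, 1]   # N = 5 original tasks
tasks = non_empty_tasks(tokens_per_expert)
# Prints N, M and the experts that have work to do.
print(len(tokens_per_expert), len(tasks), tasks)  # 5 3 [0, 2, 4]
```

Here N = 5 and M = 3, so only three tasks need thread blocks at all.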
[0057] The aforementioned thread block index is a numerical value used to uniquely identify a parallel execution unit (i.e., a thread block) within the GPU parallel computing framework. On each streaming multiprocessor of the GPU, multiple threads form a thread block to execute kernel functions, and the thread block index helps determine which thread block executes which part of the task to be processed.
[0058] The task index mentioned above is a numerical value used to uniquely identify each task. During the inference process of the MoE model, each task may involve different sets of word vectors and different expert models. The task index enables the system to distinguish and manage different computational requirements.
[0059] The sub-model index of the hybrid inference model described above, i.e., the expert index, is a numerical value used to identify the individual experts in the MoE model. An MoE model typically consists of multiple experts, each responsible for calculating a portion of the model. The expert index helps specify which expert should handle a particular task.
[0060] The aforementioned first mapping relationship is a correspondence between the thread block index and the task index of the task to be processed, i.e., block index -> task index. This first mapping relationship allows kernel functions to locate the specific task to be processed based on the current thread block index, thus achieving efficient matching between thread blocks and tasks. By pre-determining this first mapping relationship when the GPU kernel starts, the corresponding task index can be quickly calculated during thread block execution, further improving computational efficiency.
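One possible way to realize the block index -> task index lookup is sketched below. The disclosure only states that the mapping is determined in advance when the kernel starts; the use of prefix sums with a binary search, and the names `build_block_offsets` and `block_to_task`, are assumptions for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

def build_block_offsets(blocks_per_task):
    """Exclusive prefix sums: offsets[i] is the first block index of task i."""
    return [0] + list(accumulate(blocks_per_task))

def block_to_task(block_idx, offsets):
    """Find the task whose range [offsets[i], offsets[i+1]) holds block_idx."""
    if not 0 <= block_idx < offsets[-1]:
        raise IndexError("block index outside the launched grid")
    return bisect_right(offsets, block_idx) - 1

offsets = build_block_offsets([2, 1, 3])   # three tasks, six blocks in total
print([block_to_task(b, offsets) for b in range(offsets[-1])])
# [0, 0, 1, 2, 2, 2]
```

The offsets table has one entry per task plus one, so it stays small even when the number of thread blocks is large.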
[0061] The second mapping relationship described above connects the task index of the task to be processed with the sub-model index of the hybrid inference model, i.e., task index -> expert index. Each sub-model index corresponds to a specific expert in the MoE model, and each expert is responsible for the computation of a specific part of the model. This second mapping relationship allows the system to further determine which expert should handle the task based on its index, thus achieving a direct association between the task to be processed and the model sub-parts, ensuring accurate allocation of computational resources.
[0062] Based on the first and second mapping relationships, the target sub-model corresponding to the target thread block is determined in the hybrid inference model. The target thread blocks are used to execute the multiple tasks to be processed; that is, a target thread block can represent a thread block required to execute a single non-empty task. By combining the first and second mapping relationships, when the GPU kernel starts, the task to be processed by a thread block can be determined from the thread block index, and the expert that processes the task can then be determined from the task index. This achieves efficient allocation and utilization of computing resources, reduces unnecessary computation and data movement, improves the utilization of the GPU's peak computing power, and ultimately accelerates the inference computation of the MoE model.
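The two-step lookup described above (thread block index to task index, then task index to expert index) can be sketched with two plain lookup tables. Storing the first mapping as a dense per-block table is an assumption chosen for readability, and `build_mappings` and `target_expert` are hypothetical names.

```python
def build_mappings(blocks_per_expert):
    """Build both tables from the per-expert required block counts.

    Returns (block_to_task, task_to_expert). Experts that need zero
    blocks are empty tasks and receive no task index.
    """
    block_to_task, task_to_expert = [], []
    for expert_idx, n_blocks in enumerate(blocks_per_expert):
        if n_blocks == 0:
            continue
        task_idx = len(task_to_expert)          # next free task index
        task_to_expert.append(expert_idx)       # second mapping
        block_to_task.extend([task_idx] * n_blocks)  # first mapping
    return block_to_task, task_to_expert

def target_expert(block_idx, block_to_task, task_to_expert):
    """Resolve block index -> task index -> expert index."""
    return task_to_expert[block_to_task[block_idx]]

b2t, t2e = build_mappings([2, 0, 1, 0, 3])
print(b2t, t2e)   # [0, 0, 1, 2, 2, 2] [0, 2, 4]
print([target_expert(b, b2t, t2e) for b in range(len(b2t))])
# [0, 0, 2, 4, 4, 4]
```

Because empty tasks never receive a task index, no thread block is ever resolved to an expert that has no tokens.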
[0063] After identifying the target sub-model, the target thread block can perform specific analysis and calculations based on the parameters and structure of the target sub-model and the tasks to be processed. During the inference process of the MoE model, the thread block uses the parameters of a specific expert to calculate the assigned word vectors, generating the expert's output for the task. After the calculation is completed, the target thread block generates the corresponding task processing result, which may include, but is not limited to, the expert's prediction or classification of the input words, or, in more complex tasks, the expert's intermediate calculation results. Finally, by collecting and integrating the calculation results of all thread blocks, a complete inference output is formed. Through the above process, a large number of tasks to be processed can be effectively distributed in parallel to multiple thread blocks on the GPU, while ensuring that each thread block can accurately find and utilize the correct sub-model for calculation, thereby achieving efficient batch processing of MoE model inference and significantly improving the overall computing speed and resource utilization. In addition, the dynamic scheduling mechanism based on the two-layer mapping relationship enables the model to flexibly adapt to different scales and types of input data, thereby improving the performance and adaptability of the MoE architecture in practical applications.
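The compute-and-collect flow above can be emulated on the CPU with a toy example. Everything here is a simplification for illustration: scalars stand in for word vectors and expert parameters, the loop runs the "thread blocks" one after another, and `TILE` and `run_blocks` are hypothetical names.

```python
TILE = 2  # hypothetical number of tokens handled by one thread block

def run_blocks(token_values, tokens_of_expert, expert_weight):
    """Emulate thread blocks sequentially and gather their outputs.

    token_values[t]      toy scalar standing in for a word vector
    tokens_of_expert[e]  token indices routed to expert e
    expert_weight[e]     toy scalar standing in for expert e's parameters
    """
    # One work item per (expert, chunk of tokens): the work of one block.
    work = []
    for expert_idx, tokens in enumerate(tokens_of_expert):
        for start in range(0, len(tokens), TILE):
            work.append((expert_idx, tokens[start:start + TILE]))

    output = [None] * len(token_values)
    for expert_idx, chunk in work:            # each iteration = one block
        for t in chunk:
            # Write the result back to the token's original position.
            output[t] = expert_weight[expert_idx] * token_values[t]
    return output

values = [1.0, 2.0, 3.0, 4.0, 5.0]
routing = [[0, 2, 4], [], [1, 3]]             # expert 1 is an empty task
print(run_blocks(values, routing, [10.0, 20.0, 30.0]))
# [10.0, 60.0, 30.0, 120.0, 50.0]
```

Each emulated block touches only the tokens of its own chunk, so on a real GPU the blocks could run in parallel without conflicting writes.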
[0064] Based on steps S21 to S23 above, a first mapping relationship and a second mapping relationship are constructed to associate multiple tasks to be processed. Each task to be processed includes at least one word vector. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model. Then, based on the first mapping relationship and the second mapping relationship, the target sub-model corresponding to the target thread block is determined in the hybrid inference model. Finally, the target sub-model is used to analyze and calculate multiple tasks to be processed to obtain the task processing results. Thus, the target sub-model corresponding to each thread block is dynamically determined through a two-layer mapping mechanism. Then, the target sub-model is used to batch process multiple tasks to be processed, which can effectively balance the computing load during thread operation and improve the utilization efficiency of the processor. The task processing method in this embodiment of the disclosure provides a flexible, efficient and general parallel computing framework by constructing and utilizing a two-layer mapping mechanism. This allows for the simultaneous processing of multiple tasks of different types, scales or computing requirements in parallel or distributed computing environments, thereby achieving the goal of batch processing heterogeneous multitasking. This improves the processing efficiency and flexibility of heterogeneous multitasking, and solves the technical problems of low processing efficiency and poor flexibility in related technologies when processing heterogeneous multitasking.
[0065] The task processing method in the embodiments of this disclosure will be further described below.
[0066] In an optional embodiment, step S21, constructing the first mapping relationship includes:
[0067] The first mapping relationship is constructed using a multi-task batching framework, which is used to allocate corresponding target thread blocks to multiple tasks to be processed according to the target required number.
[0068] The aforementioned multi-task batching framework addresses the challenge of executing heterogeneous tasks on GPUs, handling tasks with varying shapes, sizes, or computational requirements. Using thread blocks as the basic scheduling and computation unit, the framework dynamically and efficiently allocates multiple tasks to these blocks on the GPU, avoiding the scheduling overhead and resource waste common in traditional parallel computing. The core of the multi-task batching framework lies in calculating the mapping from thread block indices to task indices, ensuring that the tasks executed by each thread block are matched to its computational capabilities and resources.
[0069] Specifically, the multi-task batching framework uses the thread block as its unit. The mapping from thread blocks to tasks is derived in reverse inside the kernel with minimal overhead; that is, the initial mapping relationship is established and the position of the current thread block within its task is determined. This allows the kernel to dynamically determine the task to be executed by each thread block. When the kernel starts, the number of thread blocks allocated exactly covers the sum of the numbers required by all tasks. Thus, by assigning corresponding thread blocks to multiple tasks according to the target required number, batch execution of multiple tasks can be achieved.
[0070] Based on the above optional embodiments, by using a multi-task batching framework to construct a first mapping relationship, the multi-task batching framework is used to allocate corresponding target thread blocks to multiple tasks to be processed according to the target demand quantity. This not only optimizes task scheduling on the GPU and achieves efficient matching of tasks to thread blocks, but also significantly reduces unnecessary computation and data transfer, improves the utilization rate of GPU resources, thereby achieving efficient and flexible parallel computing.
[0071] In one optional embodiment, the target thread block includes: at least one thread scheduling group, the thread scheduling group including: multiple consecutive threads, and constructing the first mapping relationship using a multi-task batching framework includes:
[0072] Obtain thread block requirement information for multiple pending tasks, wherein the thread block requirement information is used to determine the number of target thread blocks required by the pending tasks; determine the prefix sum array corresponding to at least one thread scheduling group based on the thread block requirement information; and construct a first mapping relationship using the prefix sum array.
[0073] In GPU computing, a target thread block is the smallest unit of parallel execution. A target thread block includes at least one thread scheduling group (warp), which contains multiple consecutive threads. For example, a warp might contain 32 consecutive threads. If the number of tasks exceeds the warp size of 32, the tasks can be split into multiple groups, each handled by 32 threads. Because the threads of a target thread block run on the same streaming multiprocessor (SM) and share the same storage space, they can communicate and collaborate quickly through shared memory. Parallel execution of thread blocks can significantly improve GPU computational efficiency, especially when processing large-scale data and computationally intensive tasks.
[0074] A warp is the basic unit of thread scheduling, consisting of a group of consecutive threads, typically 32 threads. Threads within the same warp are scheduled and executed almost simultaneously, which helps maximize the utilization of computing resources through hardware parallel processing. Threads within a warp can execute the same instructions in parallel, but can also achieve a degree of asynchronous or branched execution through conditional execution and masked instructions.
[0075] In constructing the first mapping relationship using a multi-task batching framework, it is first necessary to analyze the computational requirements of each task to determine the number of thread blocks required for each task's execution. This means pre-calculating the required number of thread blocks for each task on the host before starting the kernel. Specifically, the computational requirements of the task can be determined based on its computational complexity, required memory size, and the characteristics of GPU thread blocks, such as the number of threads and computational resource allocation. For example, a task involving a large number of matrix multiplications requires allocating more thread blocks to improve parallel computing efficiency.
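As a rough host-side illustration of this pre-calculation, the Python sketch below assumes that each task is a matrix multiplication whose output has `rows` x `cols` elements and that one thread block produces one `tile_m` x `tile_n` output tile. The function name `blocks_needed`, the tile sizes, and the sample token counts are hypothetical and are not specified in this disclosure.

```python
def blocks_needed(rows, cols, tile_m=64, tile_n=64):
    """Hypothetical estimate: one thread block per output tile."""
    if rows == 0 or cols == 0:
        return 0  # an empty task needs no thread blocks
    tiles_m = (rows + tile_m - 1) // tile_m  # ceiling division
    tiles_n = (cols + tile_n - 1) // tile_n
    return tiles_m * tiles_n

# Three tasks (experts) that received different numbers of tokens.
token_counts = [100, 0, 300]
hidden_size = 128
block_demand = [blocks_needed(m, hidden_size) for m in token_counts]
```

Under these assumptions the demand array is computed once on the host and then passed to the later steps; a real implementation may weigh memory footprint and other factors as well.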
[0076] After determining the number of thread blocks required for each task, a prefix sum array needs to be constructed to indicate the cumulative thread block requirement up to and including each task. Each element of the prefix sum array represents the total number of thread blocks required by the current task and all tasks preceding it. This prefix sum array helps the thread scheduling group find the corresponding task when building the initial mapping. Specifically, the number of thread blocks is arranged into an array with the task number as the index. The length of this array equals the number of tasks, and the value of each element is the number of thread blocks required for that task. For example, if there are three tasks requiring 5, 10, and 15 thread blocks respectively, the array would be [5, 10, 15]. The prefix sum of this array is then calculated.
[0077] A prefix sum array is a data structure used for quickly querying cumulative sums. For an array A consisting of numbers, its prefix sum array P is defined as P[i] = A[0] + A[1] + ... + A[i], meaning that each element of array P is the sum of all elements in array A from the first element to the element itself. Based on the thread block demand array, a prefix sum array can be calculated. For example, for the thread block demand array [5, 10, 15] above, the calculated prefix sum array is [5, 15, 30].
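The definition above can be checked with a few lines of Python. This is only a minimal host-side sketch; the function name `inclusive_prefix_sum` is chosen here for illustration.

```python
from itertools import accumulate

def inclusive_prefix_sum(block_demand):
    """P[i] = A[0] + A[1] + ... + A[i] for the thread block demand array A."""
    return list(accumulate(block_demand))

prefix = inclusive_prefix_sum([5, 10, 15])
```

For the demand array [5, 10, 15] this yields [5, 15, 30], matching the example in the text.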
[0078] The prefix sum array is used to determine the allocation of thread blocks when GPU computation is initiated. Specifically, when the GPU kernel function is started, the number of thread blocks to be allocated is set to the last element of the prefix sum array, which is the total number of thread blocks required by all tasks, and the prefix sum array is passed as an argument to the kernel function. Each thread block can query the prefix sum array to determine which task it belongs to and its position within that task, thereby establishing the first mapping relationship.
[0079] Based on the above optional embodiments, by obtaining thread block requirement information of multiple tasks to be processed, then determining the prefix sum array corresponding to at least one thread scheduling group based on the thread block requirement information, and finally using the prefix sum array to construct a first mapping relationship, a flexible and efficient task-to-thread block mapping mechanism is provided, which reduces data transmission and scheduling overhead, improves cache locality and GPU resource utilization, and thus significantly improves the parallel computing performance of the GPU when processing irregular multi-tasks.
[0080] In an optional embodiment, constructing the first mapping relationship using a prefix sum array includes:
[0081] The prefix sum array is compared with the current thread block index to obtain the comparison calculation result. The comparison calculation result is used to determine whether the current thread block index is greater than or equal to the current element in the prefix sum array.
[0082] A target mask is generated based on the comparison and calculation results using a thread group voting mechanism.
[0083] Determine the current task index corresponding to the target thread block based on the target mask;
[0084] Construct the first mapping relationship using the current thread block index, the current task index, and the prefix sum array.
[0085] When each thread block starts on the GPU, a comparison calculation operation is performed, which compares the current thread block index with the elements of the prefix sum array to determine whether the current thread block index is greater than or equal to each element. This comparison calculation helps determine which task to be processed the current thread block belongs to.
[0086] Each warp can scan the prefix sum array in parallel. For each element, it checks if the current thread's block index is greater than or equal to that element. After threads within a warp perform the comparison calculations in parallel, a thread group voting mechanism (Warp Vote) is used to aggregate the comparison results. Warp Vote generates a 32-bit unsigned integer mask, where each bit corresponds to a thread within the warp. If the thread's comparison calculation result is true (i.e., the current thread's block index is greater than or equal to the corresponding element value in the prefix sum array), the corresponding bit in the target mask is set to 1; otherwise, it is set to 0. Through Warp Vote, data aggregation and conditional branching operations can be performed efficiently at the warp level, avoiding unnecessary thread switching and waiting, and improving parallel execution efficiency.
[0087] A target mask is generated using Warp Vote to determine whether the current thread block should participate in the computation of the current task. The target mask is a 32-bit unsigned integer in which the number of bits set to 1 represents the number of valid threads, i.e., the number of threads whose comparison result is true.
[0088] Furthermore, each thread block can use the `popc` instruction to calculate the number of 1-bits in the target mask, thereby determining its position in the task allocation, i.e., the current task index. This allows for the rapid determination of which specific task each thread block should execute without requiring additional instructions or data from the host, further reducing communication overhead during task allocation.
[0089] The first mapping relationship is constructed using the current thread block index, the current task index, and a prefix sum array. This means that each thread block can determine its own computation task and its relative position within the task based on its own index, task index, and prefix sum array. The construction of this first mapping relationship allows thread blocks to immediately begin executing their assigned tasks without additional scheduling or query operations, improving the continuity and efficiency of computation.
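The comparison, vote, and bit-count steps described above can be emulated on the host. The Python sketch below is not GPU code: `ballot` stands in for the warp vote, `bin(mask).count("1")` stands in for the `popc` instruction, and the sketch assumes at most 32 tasks so that one warp covers the whole prefix sum array. All function names are hypothetical.

```python
WARP_SIZE = 32

def ballot(predicates):
    """Emulates a warp vote: bit i of the mask is 1 if lane i's predicate is true."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask

def block_to_task(block_index, prefix):
    """Returns (task_index, position_within_task) for one thread block."""
    assert len(prefix) <= WARP_SIZE  # this sketch handles a single warp-sized chunk
    # Lane i compares the block index with prefix[i]; lanes past the array vote false.
    predicates = [lane < len(prefix) and block_index >= prefix[lane]
                  for lane in range(WARP_SIZE)]
    mask = ballot(predicates)
    task_index = bin(mask).count("1")  # population count of the target mask
    start = prefix[task_index - 1] if task_index > 0 else 0
    return task_index, block_index - start

prefix = [5, 15, 30]
mapping = [block_to_task(b, prefix) for b in range(prefix[-1])]
```

With the prefix sum array [5, 15, 30], blocks 0 to 4 map to task 0, blocks 5 to 14 to task 1, and blocks 15 to 29 to task 2. A task that requires zero thread blocks repeats the previous prefix value and is therefore skipped by the count.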
[0090] Based on the above optional embodiments, by comparing the prefix sum array with the current thread block index to obtain the comparison calculation result, a target mask is generated based on the comparison calculation result using a thread group voting mechanism. Subsequently, the current task index corresponding to the target thread block is determined based on the target mask. Finally, the first mapping relationship is constructed using the current thread block index, the current task index, and the prefix sum array. This can be used to more flexibly schedule resources when processing heterogeneous multitasking, thereby further improving task processing efficiency.
[0091] In an optional embodiment, step S22, determining the target sub-model corresponding to the target thread block in the hybrid inference model based on the first mapping relationship and the second mapping relationship includes:
[0092] A task information structure is created based on multiple tasks to be processed. The task information structure includes: the task index of the task to be processed and the sub-model index corresponding to the task to be processed.
[0093] Generate a task information array using the task information structure;
[0094] The target processing task is determined by using the target thread block and the first mapping relationship;
[0095] Based on the target processing task and the second mapping relationship, the task information array is queried and processed to obtain the target sub-model corresponding to the target thread block.
[0096] Specifically, a corresponding task information structure is created for each non-empty task. This structure contains the task index of the task to be processed, its corresponding sub-model index, and other information required for MoE expert calculations. The task information structures of multiple tasks to be processed are then combined into a task information array, the length of which is the number of non-empty tasks.
[0097] The target processing task is determined using the target thread block and the first mapping relationship. The target processing task is the non-empty task corresponding to the target thread block obtained through the block index->task index mapping. Further, the task information array is queried based on the target processing task and the second mapping relationship to obtain the target sub-model corresponding to the target thread block.
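A host-side Python sketch of the two-layer lookup follows. The `TaskInfo` fields beyond the task index and sub-model index, the helper names, and the sample numbers are assumptions made for illustration; the first mapping is evaluated here with a plain linear scan rather than a warp vote.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskInfo:
    task_index: int    # index of the non-empty task
    expert_index: int  # sub-model (expert) index that serves this task
    num_tokens: int    # hypothetical extra field used by the expert computation

def build_task_info(tokens_per_expert):
    """Keeps only non-empty tasks, so the array length equals their count."""
    infos = []
    for expert_index, n in enumerate(tokens_per_expert):
        if n > 0:
            infos.append(TaskInfo(len(infos), expert_index, n))
    return infos

def expert_for_block(block_index, prefix, task_infos):
    # First mapping: thread block index -> task index.
    task_index = sum(1 for p in prefix if block_index >= p)
    # Second mapping: task index -> sub-model index via the task information array.
    return task_infos[task_index].expert_index

task_infos = build_task_info([100, 0, 300])  # expert 1 received no tokens
prefix = [4, 14]                             # hypothetical demands of 4 and 10 blocks
```

In this example thread blocks 0 to 3 resolve to expert 0 and thread blocks 4 to 13 resolve to expert 2, while the empty expert 1 never appears in the task information array.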
[0098] Based on the above optional embodiments, a task information structure is created based on multiple tasks to be processed, and then a task information array is generated using the task information structure. Subsequently, the target processing task is determined using the target thread block and the first mapping relationship. Finally, the task information array is queried based on the target processing task and the second mapping relationship to obtain the target sub-model corresponding to the target thread block. This enables efficient batch processing of heterogeneous multi-tasks on the GPU, significantly improving computing performance and resource utilization, especially when processing heterogeneous computing tasks such as MoE model inference.
[0099] In an optional embodiment, in step S23, the target sub-model is used to analyze and calculate multiple tasks to be processed, and the task processing results are obtained, including:
[0100] Obtain the model inference parameters corresponding to the target sub-model and the word vectors corresponding to multiple tasks to be processed;
[0101] Determine the token indices corresponding to multiple sub-models in the hybrid inference model based on the token vectors corresponding to multiple tasks to be processed;
[0102] Bucketing is performed on the token indices corresponding to multiple sub-models in the hybrid inference model to obtain the bucketing result, which is used to represent the token index arrays corresponding to multiple sub-models.
[0103] Determine the token index array corresponding to the target sub-model from the bucketing results;
[0104] The model inference parameters and token index array are input into the target sub-model for analysis and calculation to obtain the task processing results.
[0105] The aforementioned model inference parameters include, but are not limited to, the weights, biases, activation functions, etc. of the target sub-model. These model inference parameters can be stored in the GPU's global memory so that all thread blocks can access them.
[0106] In the MoE model, each input token can be assigned to one or more sub-models for processing. The assignment process is usually based on the features of the token and the characteristics of the sub-models, such as their domain of expertise or computing power. By analyzing the input token vectors, it can be determined which tokens should be processed by which sub-models, thereby generating a mapping that contains the set of token indices that each sub-model is responsible for. This allows GPU thread blocks to directly access the token indices related to their tasks according to the mapping of the task information array.
[0107] Instead of repeatedly copying token vectors, atomic operations are used to bucket the token indices corresponding to each sub-model, resulting in an array of token indices assigned to each sub-model. Specifically, first, a bucket array is created in the GPU's global memory to store the token indices for each sub-model. The size of the bucket array is typically equal to the number of sub-models, and each element is initialized to an empty list or a specific value to indicate that the sub-model has not yet been assigned any token indices. Next, for the token vector of each input token, atomic operations, such as atomic addition, atomic minimum, or atomic maximum, are used to add the token index to the corresponding position in the bucket array. In the MoE model, each token can be assigned to one or more sub-models for processing. Atomic operations ensure that even under high parallelism, when multiple threads simultaneously attempt to update the same array position, no data conflicts or inconsistencies occur. Once all tokens have been processed, the bucket array contains the set of token indices that each sub-model should process. Furthermore, the token index array corresponding to the target sub-model can be quickly determined from the bucketing results. The model inference parameters and token index array are then input into the target sub-model for analysis and calculation to obtain the task processing results.
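The bucketing step can be sketched sequentially in Python. Here `atomic_add` is only a stand-in for a GPU atomic addition that returns the previous counter value, and `assignments` is a hypothetical router output listing the sub-models chosen for each token; neither name comes from this disclosure. The sketch also assumes a token is assigned to any given sub-model at most once.

```python
def atomic_add(counters, i, value):
    """Sequential stand-in for a GPU atomic add: returns the old value."""
    old = counters[i]
    counters[i] = old + value
    return old

def bucket_tokens(assignments, num_experts):
    """assignments[t] lists the sub-models chosen for token t."""
    capacity = len(assignments)  # each bucket can hold every token at most once
    counts = [0] * num_experts
    buckets = [[-1] * capacity for _ in range(num_experts)]  # -1 marks an unused slot
    for token_index, experts in enumerate(assignments):
        for e in experts:
            slot = atomic_add(counts, e, 1)  # reserve the next free slot in bucket e
            buckets[e][slot] = token_index   # store the index, not the token vector
    return [buckets[e][:counts[e]] for e in range(num_experts)]

assignments = [[0, 2], [2], [0], [1, 2]]
token_index_arrays = bucket_tokens(assignments, num_experts=3)
```

On a GPU the order of indices inside a bucket would depend on thread timing; the sequential loop here happens to keep them in ascending order.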
[0108] Based on the above optional embodiments, by obtaining the model inference parameters corresponding to the target sub-model and the word vectors corresponding to multiple tasks to be processed, the word indexes corresponding to multiple sub-models in the hybrid inference model are determined based on the word vectors corresponding to multiple tasks to be processed. Subsequently, the word indexes corresponding to multiple sub-models in the hybrid inference model are bucketed to obtain the bucketing results. The word index array corresponding to the target sub-model is determined from the bucketing results. Finally, the model inference parameters and the word index array are input into the target sub-model for analysis and calculation to obtain the task processing results. This further improves the computing performance and cache utilization, realizes dynamic load balancing, simplifies the parallel strategy, ensures scalability, avoids the waste of computing resources, and ultimately improves the accuracy and quality of inference. It has significant advantages and practical application value for processing inference tasks of large-scale language models such as the MoE model.
[0109] In one optional embodiment, inputting the model inference parameters and the token index array into the target sub-model for analysis and calculation to obtain the task processing results includes:
[0110] The target sub-model is used to perform general matrix multiplication (GEMM) on the model inference parameters and the token index array to obtain the task processing results.
[0111] In related technologies, each expert in MoE performs a GEMM operation. One input to the GEMM is the parameter matrix corresponding to the expert, and the other input is a matrix consisting of the token vectors assigned to that expert. However, the outer dimensions of the token matrix corresponding to each expert may be different, and the tokens themselves are not stored contiguously. Due to problems such as non-contiguous data access, inconsistent data shape, data copying and rearrangement, and uneven expert load, the task processing efficiency is low, resources are wasted, and parallel computing capabilities are not fully utilized, which further limits the improvement of MoE model inference performance.
[0112] In this embodiment of the disclosure, after each thread block determines its corresponding expert, it performs the GEMM operation based on the token index array and the model inference parameters. The token index array allows the corresponding data to be read directly from the original token vector sequence. Furthermore, by utilizing the token index array and the GEMM operation, the thread block can efficiently and accurately execute expert model inference, thereby further reducing memory access overhead and optimizing cache utilization.
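To make the index-based read concrete, the following pure-Python sketch multiplies only the token rows selected by an index array with an expert weight matrix, leaving the token sequence in its original order. The function name `gathered_gemm` and the tiny matrices are illustrative assumptions; a real kernel would perform this on tiles in GPU memory.

```python
def gathered_gemm(tokens, token_indices, weight):
    """Multiplies the selected token rows by an expert weight matrix.

    tokens: token vectors (rows), kept in their original order
    token_indices: rows assigned to this expert
    weight: expert parameter matrix with shape [hidden][out]
    """
    out = []
    for t in token_indices:  # read rows through the index array, no copy or reorder
        row = tokens[t]
        out.append([sum(row[k] * weight[k][j] for k in range(len(row)))
                    for j in range(len(weight[0]))])
    return out

tokens = [[1, 0], [0, 1], [2, 3], [4, 5]]
weight = [[1, 2], [3, 4]]
result = gathered_gemm(tokens, [0, 2], weight)
```

Only rows 0 and 2 are touched, so the expert produces two output rows without the token matrix being rearranged first.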
[0113] In one optional embodiment, using the target sub-model to perform general matrix multiplication on the model inference parameters and the token index array to obtain the task processing results includes:
[0114] Obtain the load scale information of the target sub-model;
[0115] Based on the load scale information, a general matrix multiplication operation is performed on the model inference parameters and the word index array to obtain the task processing result.
[0116] The aforementioned load scale information refers to the number of tokens corresponding to the target sub-model. Due to uneven expert loads, the shapes of the GEMMs corresponding to different experts vary significantly. Therefore, a series of GEMMs with different shapes are implemented. The kernel dynamically selects a suitable GEMM based on the problem size corresponding to the mapped expert, i.e., the load scale information. This ensures that even under extremely uneven load conditions, the GEMM operations of each expert can be appropriately processed, avoiding the performance loss that may occur when a general GEMM algorithm processes matrices with special shapes. Dynamically selecting GEMMs can improve the computational performance of the GPU when processing MoE models, reduce unnecessary computational overhead, and utilize hardware resources more efficiently. Especially for cases with uneven expert loads, it can significantly improve the efficiency of parallel computing and the overall system throughput.
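The selection logic can be pictured as a small dispatch table keyed by token count. The thresholds and variant names in this Python sketch are invented for illustration; the disclosure does not list concrete GEMM shapes.

```python
# Hypothetical tile configurations, ordered from smallest to largest load.
GEMM_VARIANTS = [
    (16, "tile_16x128"),
    (64, "tile_64x128"),
    (None, "tile_128x128"),  # None means no upper bound
]

def select_gemm(num_tokens):
    """Picks the first variant whose token limit covers the expert's load."""
    for max_tokens, name in GEMM_VARIANTS:
        if max_tokens is None or num_tokens <= max_tokens:
            return name

choices = [select_gemm(n) for n in (3, 16, 17, 5000)]
```

An expert with only a few tokens is then served by a small-tile variant, while a heavily loaded expert uses the large-tile variant.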
[0117] To optimize the inference performance of MoE models on GPUs, especially when batching irregular tasks, tile swizzling technology can be used to improve the efficiency of L2 cache utilization for GEMMs with large tiles. Tile swizzling technology reorganizes data storage, ensuring effective cache utilization even when dealing with large tiles, improving data locality, and reducing the number of times data is read from slower storage levels, thereby significantly reducing latency and improving task computation efficiency. Embodiments of this disclosure can also utilize asynchronous Tensor Cores to maximize hardware computing power. By supporting asynchronous operations, it ensures that Tensor Cores can perform computations while reading data, avoiding idle time while waiting for data transfer, thus fully utilizing GPU computing resources and significantly improving the speed and throughput of MoE model inference. Embodiments of this disclosure can use asynchronous copying to reduce latency caused by memory access. Asynchronous copying technology allows data transfer and computation to proceed in parallel, reducing latency caused by data access, avoiding idle computing units due to waiting for data transfer, and improving overall computational efficiency. The embodiments disclosed herein can also employ a two-stage pipeline to perform global memory to shared memory prefetching and circular copying, ensuring that the Tensor Core remains fully loaded, avoiding a decrease in computing unit efficiency due to insufficient data, and further improving the performance and resource utilization of MoE model inference.
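The disclosure does not state which swizzling scheme is used. One widely used form, shown here purely as an assumption, remaps the linear thread block index to output tile coordinates so that consecutive thread blocks work on a compact group of tiles and reuse the same input data from the L2 cache. The function name `swizzle_tile` and the parameter `group_m` are hypothetical.

```python
def swizzle_tile(pid, num_m, num_n, group_m):
    """Maps a linear block id to (tile_row, tile_col) in groups of group_m rows."""
    blocks_per_group = group_m * num_n
    group_id = pid // blocks_per_group
    first_m = group_id * group_m
    rows_in_group = min(num_m - first_m, group_m)  # the last group may be shorter
    local = pid % blocks_per_group
    return first_m + local % rows_in_group, local // rows_in_group

order = [swizzle_tile(p, num_m=4, num_n=4, group_m=2) for p in range(16)]
```

With a 4 x 4 tile grid and `group_m=2`, the first eight thread blocks cover tile rows 0 and 1 column by column before any block moves on to rows 2 and 3.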
[0118] Figure 3 is a flowchart of another task processing method according to an embodiment of the present disclosure. As shown in Figure 3, the method may include the following steps:
[0119] Step S31: Construct a first mapping relationship and a second mapping relationship associated with multiple tasks to be computed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be computed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be computed and the sub-model index of the heterogeneous multi-task computing model.
[0120] Step S32: Based on the first mapping relationship and the second mapping relationship, determine the target sub-model corresponding to the target thread block in the heterogeneous multi-task computing model, wherein the target thread block is used to execute multiple tasks to be computed;
[0121] Step S33: Analyze and calculate multiple tasks to be calculated using the target sub-model to obtain the task processing results.
[0122] Based on steps S31 to S33 above, a first mapping relationship and a second mapping relationship are constructed to associate multiple tasks to be computed. Each task to be computed includes at least one word vector. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be computed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be computed and the sub-model index of the heterogeneous multi-task computing model. Then, based on the first mapping relationship and the second mapping relationship, the target sub-model corresponding to the target thread block is determined in the heterogeneous multi-task computing model. Finally, the target sub-model is used to analyze and compute multiple tasks to be computed to obtain the task processing results. Thus, the target sub-model corresponding to each thread block is dynamically determined through a two-layer mapping mechanism, and multiple tasks to be processed are processed in batches using the target sub-model. This can effectively balance the computing load during thread operation and improve the utilization efficiency of the processor. The task processing method in this embodiment of the disclosure provides a flexible, efficient and general parallel computing framework by constructing and utilizing a two-layer mapping mechanism. This allows for the simultaneous processing of multiple tasks of different types, scales or computing requirements in parallel or distributed computing environments, thereby achieving the goal of batch processing heterogeneous multitasking. This improves the processing efficiency and flexibility of heterogeneous multitasking, and solves the technical problems of low processing efficiency and poor flexibility in related technologies when processing heterogeneous multitasking.
[0123] In an optional embodiment, in step S33, the target sub-model is used to analyze and calculate multiple tasks to be computed, and the task processing results are obtained, including:
[0124] Obtain the model inference parameters corresponding to the target sub-model and the word vectors corresponding to multiple tasks to be computed;
[0125] Determine the token indices corresponding to multiple sub-models in the heterogeneous multi-task computing model based on the token vectors corresponding to multiple tasks to be computed;
[0126] Bucketing is performed on the token indices corresponding to multiple sub-models in the heterogeneous multi-task computing model to obtain the bucketing result, which is used to represent the token index arrays corresponding to multiple sub-models.
[0127] Determine the word index array corresponding to the target sub-model from the bucketing results;
[0128] The model inference parameters and token index array are input into the target sub-model for analysis and calculation to obtain the task processing results.
[0129] The aforementioned model inference parameters include, but are not limited to, the weights, biases, activation functions, etc. of the target sub-model. These model inference parameters can be stored in the GPU's global memory so that all thread blocks can access them.
[0130] In the MoE model, each input token can be assigned to one or more sub-models for processing. The assignment process is usually based on the features of the token and the characteristics of the sub-models, such as their domain of expertise or computing power. By analyzing the input token vectors, it can be determined which tokens should be processed by which sub-models, thereby generating a mapping that contains the set of token indices that each sub-model is responsible for. This allows GPU thread blocks to directly access the token indices related to their tasks according to the mapping of the task information array.
[0131] Instead of repeatedly copying token vectors, atomic operations are used to bucket the token indices corresponding to each sub-model, resulting in an array of token indices assigned to each sub-model. Specifically, first, a bucket array is created in the GPU's global memory to store the token indices for each sub-model. The size of the bucket array is typically equal to the number of sub-models, and each element is initialized to an empty list or a specific value to indicate that the sub-model has not yet been assigned any token indices. Next, for the token vector of each input token, atomic operations, such as atomic addition, atomic minimum, or atomic maximum, are used to add the token index to the corresponding position in the bucket array. In the MoE model, each token can be assigned to one or more sub-models for processing. Atomic operations ensure that even under high parallelism, when multiple threads simultaneously attempt to update the same array position, no data conflicts or inconsistencies occur. Once all tokens have been processed, the bucket array contains the set of token indices that each sub-model should process. Furthermore, the token index array corresponding to the target sub-model can be quickly determined from the bucketing results. The model inference parameters and token index array are then input into the target sub-model for analysis and calculation to obtain the task processing results.
[0132] Based on the above optional embodiments, by obtaining the model inference parameters corresponding to the target sub-model and the word vectors corresponding to multiple tasks to be computed, the word indexes corresponding to multiple sub-models in the heterogeneous multi-task computing model are determined based on the word vectors corresponding to multiple tasks to be computed. Subsequently, the word indexes corresponding to multiple sub-models in the heterogeneous multi-task computing model are bucketed to obtain the bucketing results. The word index array corresponding to the target sub-model is determined from the bucketing results. Finally, the model inference parameters and the word index array are input into the target sub-model for analysis and calculation to obtain the task processing results. This further improves the computing performance and cache utilization, realizes dynamic load balancing, simplifies the parallel strategy, ensures scalability, avoids the waste of computing resources, and ultimately improves the accuracy and quality of inference. It has significant advantages and practical application value for processing inference tasks of large-scale language models such as the MoE model.
[0133] Figure 4 is a flowchart of another task processing method according to an embodiment of the present disclosure. As shown in Figure 4, the method may include the following steps:
[0134] Step S41: Obtain a data processing request through the first application programming interface, wherein the request data carried in the data processing request includes: multiple tasks to be processed, and each task to be processed includes at least one word vector;
[0135] Step S42: Return a data processing response through the second application programming interface. The response data carried in the data processing response includes: task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block. The target sub-model is determined in the hybrid inference model based on the first mapping relationship and the second mapping relationship. The target thread block is used to execute multiple tasks to be processed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0136] Based on steps S41 to S42 above, a data processing request is obtained through the first application programming interface (API). The request data carried in the data processing request includes multiple tasks to be processed, each task including at least one word vector. Then, a data processing response is returned through the second API. The response data carried in the data processing response includes task processing results. The task processing results are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block. The target sub-model is determined in the hybrid inference model based on the first mapping relationship and the second mapping relationship. The target thread block is used to execute multiple tasks to be processed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model. Thus, the target sub-model corresponding to each thread block is dynamically determined through a two-layer mapping mechanism. Then, the target sub-model is used to batch process multiple tasks to be processed, which can effectively balance the computational load during thread operation and improve the utilization efficiency of the processor. The task processing method in this embodiment of the disclosure provides a flexible, efficient and general parallel computing framework by constructing and utilizing a two-layer mapping mechanism. This allows for the simultaneous processing of multiple tasks of different types, scales or computing requirements in parallel or distributed computing environments, thereby achieving the goal of batch processing heterogeneous multitasking. This improves the processing efficiency and flexibility of heterogeneous multitasking, and solves the technical problems of low processing efficiency and poor flexibility in related technologies when processing heterogeneous multitasking.
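The request and response exchanged through the two APIs can be pictured with the following Python sketch. All names here (`PendingTask`, `DataProcessingRequest`, `DataProcessingResponse`, `handle_request`, `process_tasks`) are hypothetical and chosen only for illustration; `process_tasks` is a placeholder for the thread block and sub-model pipeline, which the sketch does not implement.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class PendingTask:
    word_vectors: List[Vector]  # each task carries at least one word vector


@dataclass
class DataProcessingRequest:
    tasks: List[PendingTask]


@dataclass
class DataProcessingResponse:
    task_results: list


def handle_request(
    request: DataProcessingRequest,
    process_tasks: Callable[[List[PendingTask]], list],
) -> DataProcessingResponse:
    """Validate the request, run the tasks, and wrap the results in a response."""
    for i, task in enumerate(request.tasks):
        if not task.word_vectors:
            raise ValueError(f"task {i} has no word vector")
    return DataProcessingResponse(task_results=process_tasks(request.tasks))
```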
[0137] Figure 5 is a flowchart of another task processing method according to an embodiment of the present disclosure. As shown in Figure 5, the method may include the following steps:
[0138] Step S51: Obtain the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, and each task to be processed includes at least one word vector.
[0139] Step S52: In response to the data processing dialogue request, a data processing dialogue response is returned. The information carried in the data processing dialogue response includes: task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block. The target sub-model is determined in the hybrid inference model based on the first mapping relationship and the second mapping relationship. The target thread block is used to execute multiple tasks to be processed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0140] Step S53: Display the task processing results within the graphical user interface.
[0141] Based on steps S51 to S53 above, the currently input data processing dialogue request is acquired, wherein the request data carried in the data processing dialogue request includes multiple tasks to be processed, each task including at least one word vector. Then, in response to the data processing dialogue request, a data processing dialogue response is returned. The information carried in the data processing dialogue response includes task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block. The target sub-model is determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship. The target thread block is used to execute multiple tasks to be processed. The first mapping relationship represents the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship represents the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model. Finally, the task processing results are displayed in the graphical user interface. Thus, through a two-layer mapping mechanism, the target sub-model corresponding to each thread block is dynamically determined, and the multiple tasks to be processed are batch processed using the target sub-model. This effectively balances the computational load during thread execution and improves processor utilization efficiency. The task processing method in this embodiment of the disclosure provides a flexible, efficient and general parallel computing framework by constructing and utilizing a two-layer mapping mechanism. This allows for the simultaneous processing of multiple tasks of different types, scales or computing requirements in parallel or distributed computing environments, thereby achieving the goal of batch processing heterogeneous multitasking. This improves the processing efficiency and flexibility of heterogeneous multitasking, and solves the technical problems of low processing efficiency and poor flexibility in related technologies when processing heterogeneous multitasking.
[0142] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.
[0143] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, because according to this disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this disclosure.
[0144] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, they can also be implemented by hardware. Based on this understanding, the technical solutions of this disclosure, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this disclosure.
[0145] According to embodiments of this disclosure, a task processing apparatus for implementing the above-described task processing method is also provided. Figure 6 is a structural block diagram of a task processing apparatus according to an embodiment of this disclosure. As shown in Figure 6, the apparatus includes:
[0146] The construction module 601 is configured to construct a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed, wherein the tasks to be processed include at least one word vector, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0147] The determination module 602 is configured to determine the target sub-model corresponding to the target thread block in the hybrid inference model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple pending tasks;
[0148] The processing module 603 is configured to analyze and calculate multiple tasks to be processed using the target sub-model to obtain the task processing results.
[0149] Optionally, the construction module 601 is further configured to: construct a first mapping relationship using a multi-task batching framework, wherein the multi-task batching framework is used to allocate corresponding target thread blocks to multiple tasks to be processed according to the target required number.
[0150] Optionally, the construction module 601 is further configured to: obtain thread block requirement information for multiple tasks to be processed, wherein the thread block requirement information is used to determine the number of target thread blocks required by the tasks to be processed; determine a prefix sum array corresponding to at least one thread scheduling group based on the thread block requirement information; and construct a first mapping relationship using the prefix sum array.
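As a rough illustration of the prefix sum array, the Python sketch below accumulates the number of thread blocks each task requires. The function name `build_prefix_sum` and the choice of an inclusive prefix sum are assumptions made for this example; the disclosure does not state which form is used.

```python
from itertools import accumulate


def build_prefix_sum(blocks_required):
    """Inclusive prefix sum of the per-task thread block counts.

    prefix[i] is the first thread block index that belongs to a task after
    task i, so task i owns block indices prefix[i - 1] .. prefix[i] - 1
    (with the value before prefix[0] taken as 0).
    """
    if any(n < 0 for n in blocks_required):
        raise ValueError("thread block counts must be non-negative")
    return list(accumulate(blocks_required))
```

For example, three tasks needing 2, 3 and 1 thread blocks give the array `[2, 5, 6]`: blocks 0 to 1 belong to task 0, blocks 2 to 4 to task 1, and block 5 to task 2.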
[0151] Optionally, the construction module 601 is further configured to: perform a comparison calculation using the prefix sum array and the current thread block index to obtain a comparison calculation result, wherein the comparison calculation result is used to determine whether the current thread block index is greater than or equal to the current element in the prefix sum array; generate a target mask based on the comparison calculation result using a thread group voting mechanism; determine the current task index corresponding to the target thread block based on the target mask; and construct a first mapping relationship using the current thread block index, the current task index, and the prefix sum array.
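The comparison, voting and mask steps can be emulated on the CPU as follows. This is a sketch under assumptions: `ballot`, `task_index_for_block` and `WARP_SIZE` are illustrative names, the prefix sum is taken to be inclusive, and a population count of the mask is one plausible way to turn the mask into a task index. On an NVIDIA GPU the vote and the count would typically map to `__ballot_sync` and `__popc`.

```python
WARP_SIZE = 32  # lanes per thread scheduling group (assumed)


def ballot(predicates):
    """Emulate a warp vote: bit i of the mask is set when lane i's predicate holds."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def task_index_for_block(block_idx, prefix_sum):
    """Map a thread block index to its task index via compare, vote and popcount.

    prefix_sum is the inclusive prefix sum of the thread blocks required per
    task; lane i compares block_idx against prefix_sum[i].
    """
    if len(prefix_sum) > WARP_SIZE:
        raise ValueError("this sketch handles at most one warp of tasks")
    if not 0 <= block_idx < prefix_sum[-1]:
        raise IndexError("block index outside the launched grid")
    mask = ballot(block_idx >= p for p in prefix_sum)
    # Every task whose block range ends at or before block_idx set a bit, so
    # the number of set bits is the index of the task that owns this block.
    return bin(mask).count("1")
```

With the prefix sum array `[2, 5, 6]`, block indices 0 through 5 resolve to task indices 0, 0, 1, 1, 1, 2.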
[0152] Optionally, the determination module 602 is further configured to: create a task information structure based on multiple tasks to be processed, wherein the task information structure includes: the task index of the tasks to be processed and the sub-model index corresponding to the tasks to be processed; generate a task information array using the task information structure; determine the target processing task using the target thread block and the first mapping relationship; and perform query processing on the task information array based on the target processing task and the second mapping relationship to obtain the target sub-model corresponding to the target thread block.
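The two lookups can be chained as in the Python sketch below. `TaskInfo` and `target_sub_model` are hypothetical names; the `blocks_required` field is an extra assumption added so that the example is self-contained, and a binary search stands in for the GPU-side vote used for the first mapping.

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class TaskInfo:
    task_index: int
    sub_model_index: int
    blocks_required: int  # hypothetical extra field used for the first mapping


def target_sub_model(block_idx, task_infos):
    """Resolve thread block -> task (first mapping) -> sub-model (second mapping)."""
    prefix = list(accumulate(info.blocks_required for info in task_infos))
    if not 0 <= block_idx < prefix[-1]:
        raise IndexError("block index outside the launched grid")
    # First mapping: bisect_right counts the prefix entries <= block_idx.
    task_idx = bisect_right(prefix, block_idx)
    # Second mapping: the task information array is indexed by task index.
    info = task_infos[task_idx]
    return info.task_index, info.sub_model_index
```

Note that two different tasks may point to the same sub-model index, which is why the sub-model is looked up per task rather than per thread block.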
[0153] Optionally, the processing module 603 is further configured to: obtain the model inference parameters corresponding to the target sub-model and the word vectors corresponding to multiple tasks to be processed; determine the word indices corresponding to multiple sub-models in the hybrid inference model based on the word vectors corresponding to multiple tasks to be processed; perform bucketing processing on the word indices corresponding to multiple sub-models in the hybrid inference model to obtain the bucketing processing result, wherein the bucketing processing result is used to represent the word index array corresponding to multiple sub-models; determine the word index array corresponding to the target sub-model from the bucketing processing result; input the model inference parameters and the word index array into the target sub-model for analysis and calculation to obtain the task processing result.
[0154] Optionally, the processing module 603 is further configured to: perform general matrix multiplication operations on the model inference parameters and the word index array using the target sub-model to obtain the task processing result.
[0155] Optionally, the processing module 603 is further configured to: obtain the load scale information of the target sub-model; and perform general matrix multiplication operations on the model inference parameters and the word index array based on the load scale information to obtain the task processing result.
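A pure-Python sketch of a general matrix multiplication driven by the word index array is given below. The names `choose_tile_rows` and `expert_gemm` are hypothetical, the tile thresholds are arbitrary illustrative values, and treating the load scale as the number of tokens routed to the sub-model is an assumption; a real GPU kernel would pick its tiling from hardware and shape considerations not covered here.

```python
def choose_tile_rows(num_tokens):
    """Pick a row-tile height from the load scale (thresholds are illustrative)."""
    if num_tokens <= 16:
        return 16
    if num_tokens <= 64:
        return 64
    return 128


def expert_gemm(token_vectors, token_indices, weight):
    """Multiply the tokens routed to one sub-model by its weight matrix.

    token_vectors: all token vectors, each of length K.
    token_indices: indices of the tokens assigned to this sub-model.
    weight: K x N matrix (the model inference parameters).
    Rows are gathered through token_indices tile by tile, so the token
    vectors are not copied into a per-sub-model buffer beforehand.
    """
    k, n = len(weight), len(weight[0])
    tile = choose_tile_rows(len(token_indices))
    out = []
    for start in range(0, len(token_indices), tile):
        for idx in token_indices[start:start + tile]:
            row = token_vectors[idx]
            out.append([sum(row[p] * weight[p][j] for p in range(k)) for j in range(n)])
    return out
```

In this CPU sketch the tile size only changes how the loop is chunked; on a GPU the same choice would determine how much work each thread block receives.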
[0156] It should be noted that the aforementioned construction module 601, determination module 602, and processing module 603 correspond to steps S21 to S23 in the above embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the aforementioned modules or units may be hardware or software components stored in memory and processed by one or more processors. The aforementioned modules may also be part of a device and run in the server provided in the above embodiments.
[0157] Figure 7 is a structural block diagram of another task processing apparatus according to an embodiment of the present disclosure. As shown in Figure 7, the apparatus includes:
[0158] The construction module 701 is configured to construct a first mapping relationship and a second mapping relationship associated with multiple tasks to be computed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be computed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be computed and the sub-model index of the heterogeneous multi-task computing model.
[0159] The determination module 702 is configured to determine the target sub-model corresponding to the target thread block in the heterogeneous multi-task computing model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be computed;
[0160] The processing module 703 is configured to analyze and calculate multiple tasks to be calculated using the target sub-model to obtain the task processing results.
[0161] Optionally, the processing module 703 is further configured to: obtain the model inference parameters corresponding to the target sub-model and the word vectors corresponding to multiple tasks to be computed; determine the word indexes corresponding to multiple sub-models in the heterogeneous multi-task computation model based on the word vectors corresponding to multiple tasks to be computed; perform bucketing processing on the word indexes corresponding to multiple sub-models in the heterogeneous multi-task computation model to obtain the bucketing processing result, wherein the bucketing processing result is used to represent the word index array corresponding to multiple sub-models; determine the word index array corresponding to the target sub-model from the bucketing processing result; input the model inference parameters and the word index array into the target sub-model for analysis and calculation to obtain the task processing result.
[0162] It should be noted that the aforementioned construction module 701, determination module 702, and processing module 703 correspond to steps S31 to S33 in the above embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the aforementioned modules or units may be hardware or software components stored in memory and processed by one or more processors. The aforementioned modules may also be part of a device and run in the server provided in the above embodiments.
[0163] Figure 8 is a structural block diagram of another task processing apparatus according to an embodiment of the present disclosure. As shown in Figure 8, the apparatus includes:
[0164] The acquisition module 801 is configured to acquire a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multiple tasks to be processed, and each task to be processed includes at least one word vector;
[0165] The return module 802 is configured to return a data processing response via a second application programming interface. The response data carried in the data processing response includes: task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block. The target sub-model is determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship. The target thread block is used to execute multiple tasks to be processed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0166] It should be noted that the acquisition module 801 and the return module 802 correspond to steps S41 to S42 in the above embodiments. The two modules and the corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware components or software components stored in memory and processed by one or more processors. The above modules can also be part of the device and run in the server provided in the above embodiments.
[0167] Figure 9 is a structural block diagram of another task processing apparatus according to an embodiment of the present disclosure. As shown in Figure 9, the apparatus includes:
[0168] The acquisition module 901 is configured to acquire the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, and each task to be processed includes at least one word vector;
[0169] The return module 902 is configured to respond to the data processing dialogue request and return a data processing dialogue response. The information carried in the data processing dialogue response includes: task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block. The target sub-model is determined in the hybrid inference model based on the first mapping relationship and the second mapping relationship. The target thread block is used to execute multiple tasks to be processed. The first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed. The second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0170] The display module 903 is configured to display the task processing results within a graphical user interface.
[0171] It should be noted that the acquisition module 901, return module 902, and display module 903 correspond to steps S51 to S53 in the above embodiments. The three modules and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in the above embodiments. It should be noted that the above modules or units can be hardware or software components stored in memory and processed by one or more processors. The above modules can also run as part of the device in the server provided in the above embodiments.
[0172] It should be noted that the preferred embodiments involved in the above embodiments of this disclosure are the same as the solutions, application scenarios and implementation processes provided in the above embodiments, but are not limited to the solutions provided in the above embodiments.
[0173] Embodiments of this disclosure may provide a computing device. Figure 10 is a structural block diagram of a computing device according to an embodiment of the present disclosure. As shown in Figure 10, the computing device 100 may include: one or more (only one is shown in the figure) processors 102, memory 104, memory controller, and peripheral interfaces.
[0174] The aforementioned computing device can be understood as an integrated smart terminal, including but not limited to servers, desktop computers, PCs (Personal Computers), all-in-one model machines, etc., and the computing device may have the model described in the above embodiments of this disclosure pre-installed.
[0175] Specifically, this computing device can pre-install various types of models, including but not limited to models in natural language processing, visual processing, speech processing, code processing, and multimodal task processing, thus providing diverse model selection. In different product forms, this computing device can support one or more model usage methods, including but not limited to model training, model invocation, model fine-tuning, model deployment, model inference, and application. In some product forms, this computing device also supports model management, including but not limited to multi-type model management (supporting the management of discriminative, generative, and other model types), model version control (supporting the control of different model versions), and model evaluation (evaluating model performance and effectiveness based on model evaluation tools). In other product forms, this computing device can also create applications based on models, providing API calling capabilities, allowing models to be called into created applications through API interfaces, and providing application management tools for application management and monitoring.
[0176] Furthermore, the computing device may also include data management (supporting the creation and management of model tuning datasets), a training center (providing abundant training resources to help users learn and master AI technology), and basic control capabilities (providing enterprise-level basic control capabilities to ensure the security and efficient operation of the system). Through the above functions, it provides a comprehensive and integrated device for AI development, training, deployment, and application.
[0177] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0178] The processor can invoke information and application programs stored in memory via a transmission device to perform the following steps: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed, wherein each task to be processed includes at least one word vector; the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed; the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model; determining the target sub-model corresponding to the target thread block in the hybrid inference model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be processed; and analyzing and calculating the multiple tasks to be processed using the target sub-model to obtain the task processing result.
[0179] Optionally, the processor may also execute program code that performs the following steps: constructing a first mapping relationship using a multi-task batching framework, wherein the multi-task batching framework is used to allocate corresponding target thread blocks to multiple tasks to be processed according to the target required number.
[0180] Optionally, the processor may also execute program code that performs the following steps: obtaining thread block requirement information for multiple tasks to be processed, wherein the thread block requirement information is used to determine the number of target thread blocks required by the tasks to be processed; determining a prefix sum array corresponding to at least one thread scheduling group based on the thread block requirement information; and constructing a first mapping relationship using the prefix sum array.
[0181] Optionally, the processor may also execute program code that performs the following steps: comparing the current thread block index with the prefix sum array to obtain a comparison result, wherein the comparison result is used to determine whether the current thread block index is greater than or equal to the current element in the prefix sum array; generating a target mask based on the comparison result using a thread group voting mechanism; determining the current task index corresponding to the target thread block based on the target mask; and constructing a first mapping relationship using the current thread block index, the current task index, and the prefix sum array.
[0182] Optionally, the processor may also execute program code that performs the following steps: creating a task information structure based on multiple tasks to be processed, wherein the task information structure includes: the task index of the tasks to be processed and the sub-model index corresponding to the tasks to be processed; generating a task information array using the task information structure; determining the target processing task using the target thread block and the first mapping relationship; and querying the task information array based on the target processing task and the second mapping relationship to obtain the target sub-model corresponding to the target thread block.
[0183] Optionally, the processor may also execute program code that performs the following steps: obtaining model inference parameters corresponding to the target sub-model and word vectors corresponding to multiple tasks to be processed; determining word indices corresponding to multiple sub-models in the hybrid inference model based on word vectors corresponding to multiple tasks to be processed; performing bucketing processing on word indices corresponding to multiple sub-models in the hybrid inference model to obtain bucketing results, wherein the bucketing results are used to represent word index arrays corresponding to multiple sub-models; determining the word index array corresponding to the target sub-model from the bucketing results; and inputting the model inference parameters and word index array into the target sub-model for analysis and calculation to obtain task processing results.
[0184] Optionally, the processor may also execute program code that performs the following step: performing general matrix multiplication on the model inference parameters and the word index array using the target sub-model to obtain the task processing result.
[0185] Optionally, the processor may also execute program code that performs the following steps: obtaining the load scale information of the target sub-model; and performing general matrix multiplication on the model inference parameters and the word index array based on the load scale information to obtain the task processing result.
[0186] Optionally, the processor may also execute program code that performs the following steps: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be computed, wherein the first mapping relationship represents the mapping relationship between the thread block index and the task index of the task to be computed, and the second mapping relationship represents the mapping relationship between the task index of the task to be computed and the sub-model index of the heterogeneous multi-task computing model; determining the target sub-model corresponding to the target thread block in the heterogeneous multi-task computing model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be computed; and analyzing and computing the multiple tasks to be computed using the target sub-model to obtain the task processing results.
[0187] Optionally, the processor may also execute program code that performs the following steps: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multiple tasks to be processed, each task to be processed including at least one word vector; and returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes: task processing results, wherein the task processing results are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block, the target sub-model is determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block is used to execute multiple tasks to be processed, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0188] Optionally, the processor may also execute program code that performs the following steps: obtaining the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, each task including at least one word vector; responding to the data processing dialogue request, returning a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block, the target sub-model being determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block being used to execute multiple tasks to be processed, the first mapping relationship being used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship being used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model; and displaying the task processing results in the graphical user interface.
[0189] By employing embodiments of this disclosure, a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed are constructed, where each task to be processed includes at least one word vector. The first mapping relationship represents the mapping between the thread block index and the task index of the task to be processed, and the second mapping relationship represents the mapping between the task index of the task to be processed and the sub-model index of the hybrid inference model. Based on the first and second mapping relationships, the target sub-model corresponding to the target thread block is determined in the hybrid inference model, and the target sub-model is then used to analyze and calculate the multiple tasks to be processed to obtain the task processing results. The target sub-model corresponding to each thread block is thus determined dynamically through a two-layer mapping mechanism and used to batch process the multiple tasks to be processed, which can balance the computational load during thread execution and improve processor utilization. By constructing and using this two-layer mapping mechanism, the task processing method of this embodiment provides a flexible, efficient, and general parallel computing framework that can process multiple tasks of different types, scales, or computing requirements simultaneously in parallel or distributed computing environments, thereby achieving batch processing of heterogeneous tasks. This improves the efficiency and flexibility of heterogeneous multi-task processing and solves the technical problem in the related art of low processing efficiency and poor flexibility when processing heterogeneous tasks.
[0190] Embodiments of this disclosure can provide an electronic device. FIG. 11 is a structural block diagram of an electronic device according to an embodiment of this disclosure. As shown in FIG. 11, the electronic device may include: an input/output device 112; a memory 114; and a processor 116, wherein the processor 116 is connected to the input/output device 112 and the memory 114 via a bus 118.
[0191] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the methods and apparatus in the embodiments of this disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the methods in the above embodiments. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0192] The processor can invoke an executable program stored in memory via a transmission device to execute the method described in any of the above embodiments.
[0193] It will be understood by those skilled in the art that the structure shown in the figure is merely illustrative, and the electronic device may also be a smartphone, tablet computer, personal digital assistant (PDA), mobile internet device (MID), or other terminal device. The figure does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components (such as network interfaces, display devices, etc.) than shown in the figure, or may have a different configuration from that shown in the figure.
[0194] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.
[0195] Embodiments of this disclosure also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store program code for executing the methods provided in the above embodiments.
[0196] Optionally, in this embodiment, the storage medium may be located in a computing device.
[0197] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed, wherein each task to be processed includes at least one word vector, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model; determining the target sub-model corresponding to the target thread block in the hybrid inference model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be processed; and analyzing and calculating the multiple tasks to be processed using the target sub-model to obtain the task processing result.
[0198] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: constructing a first mapping relationship using a multi-task batching framework, wherein the multi-task batching framework is used to allocate corresponding target thread blocks to multiple tasks to be processed according to the target required number.
[0199] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining thread block requirement information for multiple tasks to be processed, wherein the thread block requirement information is used to determine the number of target thread blocks required by the tasks to be processed; determining a prefix sum array corresponding to at least one thread scheduling group based on the thread block requirement information; and constructing a first mapping relationship using the prefix sum array.
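The construction of the prefix sum array from the thread block requirement information can be illustrated with a minimal Python sketch. It is not the implementation of this disclosure: the tile size `TILE_ROWS`, the helper names, and the rule that a task needs one thread block per tile of word vectors are all assumptions made for illustration.

```python
from itertools import accumulate
from math import ceil

# Assumption: one thread block processes at most TILE_ROWS word vectors of a task.
TILE_ROWS = 4

def blocks_required(rows_per_task, tile_rows=TILE_ROWS):
    """Thread block requirement information: number of blocks each task needs."""
    return [ceil(rows / tile_rows) for rows in rows_per_task]

def build_prefix_sum(required_blocks):
    """Inclusive prefix sum; entry i is the first block index owned by a later task."""
    return list(accumulate(required_blocks))

rows_per_task = [10, 3, 0, 8]                 # word vectors per task (hypothetical)
required = blocks_required(rows_per_task)     # [3, 1, 0, 2]
prefix_sum = build_prefix_sum(required)       # [3, 4, 4, 6]
total_blocks = prefix_sum[-1] if prefix_sum else 0
```

In this sketch the last entry of the prefix sum array equals the total number of thread blocks to launch, and a task with no word vectors simply repeats the previous entry.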
[0200] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: comparing the current thread block index with the prefix sum array to obtain a comparison calculation result, wherein the comparison calculation result is used to determine whether the current thread block index is greater than or equal to the current element in the prefix sum array; generating a target mask based on the comparison calculation result using a thread group voting mechanism; determining the current task index corresponding to the target thread block based on the target mask; and constructing a first mapping relationship using the current thread block index, the current task index, and the prefix sum array.
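The comparison, voting, and task index steps can be emulated on the host with a short Python sketch. It assumes a thread scheduling group of 32 lanes, an inclusive prefix sum array with one element per lane, and a population count of the mask as the way the task index is read from it; the disclosure does not fix these details, so they are illustrative assumptions, as are the function names.

```python
WARP_SIZE = 32  # assumption: one thread scheduling group has 32 lanes

def ballot(predicates):
    """Emulate a group vote: bit i of the mask is set when lane i's predicate holds."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask

def task_index_for_block(block_idx, prefix_sum):
    """Return (task index, target mask) for one thread block index."""
    if len(prefix_sum) > WARP_SIZE:
        raise ValueError("this sketch handles at most one group of tasks")
    # Each lane compares the block index with its own prefix sum element.
    mask = ballot(block_idx >= end for end in prefix_sum)
    # Assumption: the number of set bits is the index of the owning task.
    return bin(mask).count("1"), mask

def offset_in_task(block_idx, task_idx, prefix_sum):
    """Position of the block among the blocks assigned to its task."""
    start = prefix_sum[task_idx - 1] if task_idx > 0 else 0
    return block_idx - start

prefix_sum = [3, 4, 4, 6]  # tasks need 3, 1, 0 and 2 blocks
block_to_task = [task_index_for_block(b, prefix_sum)[0] for b in range(6)]
```

With this prefix sum array, blocks 0 to 2 map to task 0, block 3 maps to task 1, and blocks 4 and 5 map to task 3; the task that requires no thread blocks is skipped without any special case.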
[0201] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: creating a task information structure based on multiple tasks to be processed, wherein the task information structure includes: a task index of the tasks to be processed and a sub-model index corresponding to the tasks to be processed; generating a task information array using the task information structure; determining the target processing task using the target thread block and a first mapping relationship; querying the task information array based on the target processing task and a second mapping relationship to obtain the target sub-model corresponding to the target thread block.
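The two-step lookup described above can be sketched in Python as follows. The `TaskInfo` fields mirror the task index and sub-model index named in the text; the function names, the list used as the first mapping relationship, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskInfo:
    """Task information structure: a task index and its sub-model index."""
    task_index: int
    sub_model_index: int

def build_task_info_array(sub_model_per_task):
    """Generate the task information array, one entry per task to be processed."""
    return [TaskInfo(i, m) for i, m in enumerate(sub_model_per_task)]

def target_sub_model(block_idx, block_to_task, task_info_array):
    """Resolve thread block -> task (first mapping) -> sub-model (second mapping)."""
    task_idx = block_to_task[block_idx]
    return task_info_array[task_idx].sub_model_index

# Hypothetical inputs: four tasks served by sub-models 2, 0, 5 and 2,
# and six thread blocks already mapped to tasks.
task_info_array = build_task_info_array([2, 0, 5, 2])
block_to_task = [0, 0, 0, 1, 3, 3]
resolved = [target_sub_model(b, block_to_task, task_info_array) for b in range(6)]
```

Keeping the sub-model index in a per-task array means a thread block only needs its own index to locate the sub-model it should run.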
[0202] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining model inference parameters corresponding to the target sub-model and word vectors corresponding to multiple tasks to be processed; determining word indices corresponding to multiple sub-models in the hybrid inference model based on word vectors corresponding to multiple tasks to be processed; performing bucketing processing on word indices corresponding to multiple sub-models in the hybrid inference model to obtain bucketing processing results, wherein the bucketing processing results are used to represent word index arrays corresponding to multiple sub-models; determining the word index array corresponding to the target sub-model from the bucketing processing results; inputting the model inference parameters and word index array into the target sub-model for analysis and calculation to obtain task processing results.
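The bucketing step can be illustrated with a minimal Python sketch. It assumes that a sub-model has already been selected for every word vector (for example by a routing step that is not shown), so the input is one sub-model index per word; the function name and the sample assignment are hypothetical.

```python
def bucket_word_indices(assigned_sub_model, num_sub_models):
    """Group word indices by the sub-model they are assigned to.

    assigned_sub_model[i] is the sub-model index chosen for word i.
    Returns one word index array per sub-model, in ascending word order.
    """
    buckets = [[] for _ in range(num_sub_models)]
    for word_index, sub_model in enumerate(assigned_sub_model):
        buckets[sub_model].append(word_index)
    return buckets

# Hypothetical assignment of six words to three sub-models.
assigned = [1, 0, 1, 2, 0, 1]
buckets = bucket_word_indices(assigned, 3)
target_word_index_array = buckets[1]  # word index array of target sub-model 1
```

The length of each bucket also gives the number of rows the corresponding sub-model has to process, which is one possible source of the thread block requirement information.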
[0203] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: using the target sub-model to perform general matrix multiplication operations on the model inference parameters and the word index array to obtain the task processing result.
[0204] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining the load scale information of the target sub-model; performing general matrix multiplication operations on the model inference parameters and the word index array based on the load scale information to obtain the task processing result.
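One way to read the load-dependent general matrix multiplication is sketched below in plain Python. The sketch assumes that the load scale information is the number of entries in the word index array and that it is used only to pick a tile height; the thresholds, tile sizes, and function names are hypothetical and are not taken from this disclosure.

```python
def choose_tile_rows(load, small=4, large=16, threshold=8):
    """Hypothetical policy: lightly loaded sub-models use smaller tiles."""
    return small if load <= threshold else large

def matmul(a, b):
    """Plain row-by-column product of two matrices stored as lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def sub_model_gemm(word_vectors, word_index_array, weights):
    """Gather the rows named by the word index array and multiply them tile by tile."""
    load = len(word_index_array)          # assumed form of the load scale information
    tile_rows = choose_tile_rows(load)
    outputs = {}
    for start in range(0, load, tile_rows):
        tile = word_index_array[start:start + tile_rows]
        gathered = [word_vectors[i] for i in tile]
        for word_index, row in zip(tile, matmul(gathered, weights)):
            outputs[word_index] = row
    return outputs, tile_rows

word_vectors = [[1, 0], [0, 1], [2, 3], [4, 5]]
weights = [[1, 2], [3, 4]]                # model inference parameters (weights only)
outputs, tile_rows = sub_model_gemm(word_vectors, [0, 2, 3], weights)
```

In a GPU implementation each tile would typically correspond to one thread block, so the tile height chosen from the load also determines how many thread blocks the sub-model requires.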
[0205] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be computed, wherein the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be computed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be computed and the sub-model index of the heterogeneous multi-task computing model; determining the target sub-model corresponding to the target thread block in the heterogeneous multi-task computing model based on the first mapping relationship and the second mapping relationship, wherein the target thread block is used to execute multiple tasks to be computed; and analyzing and computing the multiple tasks to be computed using the target sub-model to obtain task processing results.
[0206] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multiple tasks to be processed, each task to be processed including at least one word vector; returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes: task processing results, the task processing results being obtained by analyzing and calculating the multiple tasks to be processed using a target sub-model corresponding to the target thread block, the target sub-model being determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block being used to execute the multiple tasks to be processed, the first mapping relationship being used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship being used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
[0207] Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining the currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, each task to be processed including at least one word vector; responding to the data processing dialogue request, returning a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: task processing results, which are obtained by analyzing and calculating multiple tasks to be processed using the target sub-model corresponding to the target thread block, the target sub-model being determined in the hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block being used to execute multiple tasks to be processed, the first mapping relationship being used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship being used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model; and displaying the task processing results in a graphical user interface.
[0208] By employing embodiments of this disclosure, a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed are constructed, where each task to be processed includes at least one word vector. The first mapping relationship represents the mapping between the thread block index and the task index of the task to be processed, and the second mapping relationship represents the mapping between the task index of the task to be processed and the sub-model index of the hybrid inference model. Based on the first and second mapping relationships, the target sub-model corresponding to the target thread block is determined in the hybrid inference model, and the target sub-model is then used to analyze and calculate the multiple tasks to be processed to obtain the task processing results. The target sub-model corresponding to each thread block is thus determined dynamically through a two-layer mapping mechanism and used to batch process the multiple tasks to be processed, which can balance the computational load during thread execution and improve processor utilization. By constructing and using this two-layer mapping mechanism, the task processing method of this embodiment provides a flexible, efficient, and general parallel computing framework that can process multiple tasks of different types, scales, or computing requirements simultaneously in parallel or distributed computing environments, thereby achieving batch processing of heterogeneous tasks. This improves the efficiency and flexibility of heterogeneous multi-task processing and solves the technical problem in the related art of low processing efficiency and poor flexibility when processing heterogeneous tasks.
[0209] Embodiments of this disclosure also provide a computer program product. Optionally, in this embodiment, the computer program product may include a computer program that, when executed by a processor, implements the methods provided in the embodiments described above.
[0210] Embodiments of this disclosure also provide a computer program product. Optionally, the computer program product may include a non-volatile computer-readable storage medium, which can be used to store a computer program that, when executed by a processor, implements the methods provided in the embodiments described above.
[0211] Embodiments of this disclosure also provide a computer program. Optionally, in this embodiment, when the computer program is executed by a processor, it implements the method provided in the above embodiments.
[0212] In the above embodiments of this disclosure, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0213] In the several embodiments provided in this disclosure, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through certain interfaces, and the indirect couplings or communication connections between units or modules may be electrical or take other forms.
[0214] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0215] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0216] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.
[0217] The above description is only a preferred embodiment of this disclosure. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principles of this disclosure, and these improvements and modifications should also be considered within the scope of protection of this disclosure.
Claims
1. A task processing method, comprising: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be processed, wherein each task to be processed includes at least one word vector, the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of a hybrid inference model; determining, based on the first mapping relationship and the second mapping relationship, the target sub-model corresponding to a target thread block in the hybrid inference model, wherein the target thread block is used to execute the multiple tasks to be processed; and analyzing and calculating the multiple tasks to be processed using the target sub-model to obtain task processing results.
2. The method according to claim 1, wherein constructing the first mapping relationship comprises: constructing the first mapping relationship using a multi-task batching framework, wherein the multi-task batching framework is used to allocate the corresponding target thread blocks to the multiple tasks to be processed according to a target required quantity.
3. The method according to claim 2, wherein the target thread block includes at least one thread scheduling group, the thread scheduling group includes multiple consecutive threads, and constructing the first mapping relationship using the multi-task batching framework comprises: obtaining thread block requirement information for the multiple tasks to be processed, wherein the thread block requirement information is used to determine the number of target thread blocks required by the tasks to be processed; determining, based on the thread block requirement information, a prefix sum array corresponding to the at least one thread scheduling group; and constructing the first mapping relationship using the prefix sum array.
4. The method according to claim 3, wherein constructing the first mapping relationship using the prefix sum array comprises: comparing the current thread block index with the prefix sum array to obtain a comparison calculation result, wherein the comparison calculation result is used to determine whether the current thread block index is greater than or equal to the current element in the prefix sum array; generating a target mask based on the comparison calculation result using a thread group voting mechanism; determining the current task index corresponding to the target thread block based on the target mask; and constructing the first mapping relationship using the current thread block index, the current task index, and the prefix sum array.
5. The method according to claim 1, wherein determining the target sub-model corresponding to the target thread block in the hybrid inference model based on the first mapping relationship and the second mapping relationship comprises: creating a task information structure based on the multiple tasks to be processed, wherein the task information structure includes: the task index of the tasks to be processed and the sub-model index corresponding to the tasks to be processed; generating a task information array using the task information structure; determining the target processing task using the target thread block and the first mapping relationship; and querying the task information array based on the target processing task and the second mapping relationship to obtain the target sub-model corresponding to the target thread block.
6. The method according to claim 1, wherein analyzing and calculating the multiple tasks to be processed using the target sub-model to obtain the task processing results comprises: obtaining the model inference parameters corresponding to the target sub-model and the word vectors corresponding to the multiple tasks to be processed; determining, based on the word vectors corresponding to the multiple tasks to be processed, the word indices corresponding to multiple sub-models in the hybrid inference model; performing bucketing processing on the word indices corresponding to the multiple sub-models in the hybrid inference model to obtain a bucketing processing result, wherein the bucketing processing result is used to represent the word index arrays corresponding to the multiple sub-models; determining the word index array corresponding to the target sub-model from the bucketing processing result; and inputting the model inference parameters and the word index array into the target sub-model for analysis and calculation to obtain the task processing results.
7. The method according to claim 6, wherein inputting the model inference parameters and the word index array into the target sub-model for analysis and calculation to obtain the task processing results comprises: performing, using the target sub-model, a general matrix multiplication operation on the model inference parameters and the word index array to obtain the task processing results.
8. The method according to claim 7, wherein performing, using the target sub-model, the general matrix multiplication operation on the model inference parameters and the word index array to obtain the task processing results comprises: obtaining load scale information of the target sub-model; and performing, based on the load scale information, the general matrix multiplication operation on the model inference parameters and the word index array to obtain the task processing results.
9. The method according to claim 1, wherein the multiple tasks to be processed are heterogeneous tasks in parallel computing environments or distributed computing environments, and the task types, task sizes and task computing requirements of the multiple tasks to be processed are different.
10. The method according to claim 1, wherein the hybrid inference model includes a controller and multiple sub-models, and the controller is used to assign the multiple tasks to be processed to the multiple sub-models and to merge or select the output results of the multiple sub-models to obtain the task processing results.
11. The method according to any one of claims 1 to 5, wherein the threads in the target thread block communicate and cooperate through shared memory.
12. The method according to claim 3, wherein threads within the thread scheduling group are used to execute the same instructions in parallel, or threads within the thread scheduling group perform asynchronous or branched execution through conditional execution mechanisms and mask instructions.
13. The method according to claim 4, wherein generating the target mask based on the comparison calculation result using the thread group voting mechanism comprises: in response to the current thread block index being greater than or equal to the current element in the prefix sum array, determining the mask value at the corresponding position in the target mask to be 1; and in response to the current thread block index being less than the current element in the prefix sum array, determining the mask value at the corresponding position in the target mask to be 0.
14. The method according to claim 6, wherein the model inference parameters include at least the weights, biases, and activation functions of the target sub-model, and the model inference parameters are stored in the global memory of the graphics processing unit.
15. A task processing method, comprising: constructing a first mapping relationship and a second mapping relationship associated with multiple tasks to be computed, wherein the first mapping relationship is used to represent the mapping relationship between the thread block index and the task index of the task to be computed, and the second mapping relationship is used to represent the mapping relationship between the task index of the task to be computed and the sub-model index of a heterogeneous multi-task computing model; determining, based on the first mapping relationship and the second mapping relationship, the target sub-model corresponding to a target thread block in the heterogeneous multi-task computing model, wherein the target thread block is used to execute the multiple tasks to be computed; and analyzing and computing the multiple tasks to be computed using the target sub-model to obtain task processing results.
16. The method according to claim 15, wherein analyzing and computing the multiple tasks to be computed using the target sub-model to obtain the task processing results comprises: obtaining the model inference parameters corresponding to the target sub-model and the word vectors corresponding to the multiple tasks to be computed; determining, based on the word vectors corresponding to the multiple tasks to be computed, the word indices corresponding to multiple sub-models in the heterogeneous multi-task computing model; performing bucketing processing on the word indices corresponding to the multiple sub-models in the heterogeneous multi-task computing model to obtain a bucketing processing result, wherein the bucketing processing result is used to represent the word index arrays corresponding to the multiple sub-models; determining the word index array corresponding to the target sub-model from the bucketing processing result; and inputting the model inference parameters and the word index array into the target sub-model for analysis and calculation to obtain the task processing results.
17. A task processing method, comprising: obtaining a data processing request through a first application programming interface, wherein the request data carried in the data processing request includes: multiple tasks to be processed, each task to be processed including at least one word vector; and returning a data processing response through a second application programming interface, wherein the response data carried in the data processing response includes: task processing results, the task processing results being obtained by analyzing and calculating the multiple tasks to be processed using the target sub-model corresponding to a target thread block, the target sub-model being determined in a hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block being used to execute the multiple tasks to be processed, the first mapping relationship being used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship being used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model.
18. A task processing method, comprising: obtaining a currently input data processing dialogue request, wherein the request data carried in the data processing dialogue request includes: multiple tasks to be processed, each task to be processed including at least one word vector; returning, in response to the data processing dialogue request, a data processing dialogue response, wherein the information carried in the data processing dialogue response includes: task processing results, the task processing results being obtained by analyzing and calculating the multiple tasks to be processed using the target sub-model corresponding to a target thread block, the target sub-model being determined in a hybrid inference model based on a first mapping relationship and a second mapping relationship, the target thread block being used to execute the multiple tasks to be processed, the first mapping relationship being used to represent the mapping relationship between the thread block index and the task index of the task to be processed, and the second mapping relationship being used to represent the mapping relationship between the task index of the task to be processed and the sub-model index of the hybrid inference model; and displaying the task processing results in a graphical user interface.
19. An electronic device, comprising: a memory storing an executable program; and a processor connected to the memory via a bus and used to run the program, wherein the program, when run, executes the method according to any one of claims 1 to 18.
20. A computer-readable storage medium, wherein the computer-readable storage medium includes a stored executable program, and the executable program, when executed, controls the device on which the computer-readable storage medium is located to perform the method according to any one of claims 1 to 18.