Large-model-based data processing method and apparatus, storage medium, and electronic device

By storing the key-value information of each key-value logical block in a contiguous video memory region and using parallel processing in split inference of large models, transmission latency is reduced and data processing efficiency is improved.

WO2026123848A1 · PCT designated stage · Publication Date: 2026-06-18 · CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD
Filing Date
2025-09-10
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

In large-model split inference, existing technologies lay out the physical key-value cache so that each model layer occupies its own contiguous memory region. As a result, transmitting key-value information requires multiple API calls, which causes significant transmission delays and degrades computational performance.

Method used

Key-value information belonging to the same key-value logical block identifier is stored at consecutive addresses within the target video memory region, which reduces the number of transmissions. Parallel processing and weight parameter splitting are employed to ensure effective transmission of key-value information between the pre-filling and decoding stages.

Benefits of technology

It improves the data processing efficiency of large-scale model split inference, reduces transmission latency, and enhances overall computing performance.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN2025120495_18062026_PF_FP_ABST

Abstract

Disclosed in the present disclosure are a large-model-based data processing method and apparatus, a storage medium, and an electronic device. The present disclosure relates to the technical field of artificial intelligence, wherein the method comprises: processing description information by means of a first inference model in a first graphics processing unit corresponding to a prefilled node, to obtain key-value information and a first reply character outputted by the first inference model in the first graphics processing unit; on the basis of a key-value logic block identifier corresponding to the key-value information, storing key-value information belonging to the same key-value logic block identifier into a target video memory region in the first graphics processing unit; and processing the key-value information and the first reply character in the target video memory region by means of a second inference model in a second graphics processing unit corresponding to a decoding node, to obtain target reply information corresponding to the description information. The present disclosure solves the problem of low data processing efficiency.

Description

Data processing method and apparatus based on a large model, storage medium, and electronic device

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and more specifically, to a data processing method and apparatus, storage medium and electronic device based on a large model.

Background Technology

[0002] For generative large-model inference services, reusing key-value caches as much as possible to reduce the required computational resources is key to improving overall throughput. For split inference frameworks, prefilling and decoding are separated so that each runs independently. A crucial process in a split architecture is transferring the reusable key-value cache generated during the prefilling phase to the selected decoder; this process needs to be as fast as possible, otherwise it will impact the response speed of the inference service.

[0003] In existing technologies, to facilitate layer-by-layer forward propagation computation of the model, the physical cache structure of the key-value (kv) cache is designed so that each layer of the model uses a contiguous region of video memory. For example, the data structure of the kv cache is: list[2, num_blocks, block_size*num_kv_heads*head_size], where each list element represents the contiguous kv cache of a single layer. This design introduces the problem of inter-block discreteness: the data of one kv block is scattered across the per-layer regions. Because the send application programming interface (send API) and receive API (recv API) of the collective communication library (used for communication between multiple graphics processing units (GPUs)) can transfer only one contiguous region of kv cache in video memory per call, the number of API calls required for kv cache transfer grows with the number of discrete memory regions, which leads to significant transfer latency. For example, a single kv cache block may be spread over a number of discrete memory regions equal to the number of layers (64 in a 32B model) multiplied by 2 (for key (k) and value (v)), so transferring a single block requires 128 atomic API calls. This discreteness negatively impacts overall computational performance.
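The call count in the example above can be checked with simple arithmetic. The following is a minimal sketch, assuming that each send call moves exactly one contiguous region and that a block-contiguous layout (as a point of comparison) would hold all layers' keys and values for one block in a single region. The function names are illustrative and do not come from the disclosure.

```python
def sends_per_block_layer_major(num_layers: int) -> int:
    """Per-layer layout: K and V of every layer live in separate regions,
    so one kv block is scattered over num_layers * 2 fragments,
    and each fragment needs its own send call."""
    return num_layers * 2


def sends_per_block_block_major() -> int:
    """Assumed block-contiguous layout: all layers' K and V for one block
    form a single contiguous region, so one send call is enough."""
    return 1


num_layers = 64  # layer count used in the example above
print(sends_per_block_layer_major(num_layers))  # 128
print(sends_per_block_block_major())            # 1
```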

[0004] There is currently no effective solution to the above problems.

Summary of the Invention

[0005] This disclosure provides a data processing method, apparatus, storage medium, and electronic device based on a large model, to at least solve the following technical problem: in large-model split inference, the key-value pairs (kv) output by each layer of the model during pre-filling are stored in a separate contiguous video memory area per layer, so that when the kv needs to be transmitted to the decoding part for decoding, only one contiguous video memory region of kv can be transmitted at a time, resulting in transmission delay and thus low data processing efficiency.

[0006] According to one aspect of the present disclosure, a data processing method based on a large model is provided, comprising: processing description information input by a target object through a first inference model in a first graphics processor corresponding to a pre-filled node to obtain key-value information and a first response character output by the first inference model in the first graphics processor; storing key-value information belonging to the same key-value logic block identifier in a target video memory region in the first graphics processor based on the key-value logic block identifier corresponding to the key-value information, wherein the key-value information belonging to the same key-value logic block identifier has contiguous storage addresses in the target video memory region; and processing the key-value information and the first response character in the target video memory region through a second inference model in a second graphics processor corresponding to a decoding node to obtain target response information corresponding to the description information.

[0007] Furthermore, before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor, the method further includes: determining whether the weight parameters corresponding to the target inference model need to be split; if the weight parameters corresponding to the target inference model need to be split, then based on the description information, determining the first number of required first graphics processors and the second number of required second graphics processors; splitting the weight parameters of the target inference model to obtain the first number of first inference models, and determining the first graphics processor based on the first number of first inference models; splitting the weight parameters of the target inference model to obtain the second number of second inference models, and determining the second graphics processor based on the second number of second inference models.
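The weight-splitting step described above can be pictured with a small sketch. It assumes, purely for illustration, that a weight matrix is split evenly by columns across the graphics processors (one common form of tensor parallelism); the disclosure does not specify the splitting scheme, and `split_columns` and `vec_mat` are hypothetical helper names.

```python
def split_columns(weight: list[list[int]], num_shards: int) -> list[list[list[int]]]:
    """Split a (rows x cols) weight matrix into num_shards column slices."""
    num_cols = len(weight[0])
    if num_cols % num_shards != 0:
        raise ValueError("columns must divide evenly across shards")
    width = num_cols // num_shards
    return [[row[r * width:(r + 1) * width] for row in weight]
            for r in range(num_shards)]


def vec_mat(x: list[int], weight: list[list[int]]) -> list[int]:
    """Compute x @ weight for a vector x and a (rows x cols) matrix."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
x = [1, 1]

shards = split_columns(weight, num_shards=2)   # one shard per (simulated) GPU
partials = [vec_mat(x, shard) for shard in shards]
combined = [value for part in partials for value in part]

print(combined)             # [6, 8, 10, 12]
print(vec_mat(x, weight))   # [6, 8, 10, 12], same as the unsplit model
```

Each shard produces only its own slice of the output, which is why the key-value information held by each graphics processor also covers only part of the full model state.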

[0008] Furthermore, before processing the key-value information and the first reply character in the target video memory region through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information, the method further includes: determining whether the number of the first graphics processors is the same as the number of the second graphics processors; if the number of the first graphics processors is the same as the number of the second graphics processors, then sending the key-value information in the target video memory region to the second target graphics processor through the first graphics processor, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

[0009] Further, sending the key-value information in the target video memory region to the second graphics processor via the first graphics processor includes: determining the second target graphics processor based on the mapping relationship between the first graphics processor and the second graphics processor; determining the identifier (ID) of the first target key-value information required by the second target graphics processor in the target video memory region; reading the first target key-value information from the target video memory region based on the ID of the first target key-value information via the first graphics processor; and sending the first target key-value information to the second target graphics processor via the first graphics processor.

[0010] Furthermore, after sending the first target key value information to the second target graphics processor through the first graphics processor, the method further includes: receiving the key value information through the second target graphics processor and determining a video memory address in the second target graphics processor for storing the first target key value information; and writing the first target key value information into the video memory address through the second target graphics processor.
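The equal-count path described in the two paragraphs above (look up the target processor, read blocks by ID, send, then write on the receiving side) can be sketched as follows. This is a simulation in plain Python, not GPU code: the `Gpu` class, the identity `rank_map`, and the dictionary standing in for video memory are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    rank: int
    # block_id -> contiguous kv payload for that block (stand-in for video memory)
    kv_region: dict[int, list[float]] = field(default_factory=dict)


def transfer_blocks(src: Gpu, dst: Gpu, block_ids: list[int]) -> int:
    """Send the requested blocks from src to dst; return the number of sends."""
    calls = 0
    for block_id in block_ids:
        payload = src.kv_region[block_id]          # read by ID on the sender
        dst.kv_region[block_id] = list(payload)    # receiver writes to its own memory
        calls += 1                                 # one send per contiguous block
    return calls


prefill_gpus = [Gpu(0, {10: [1.0, 2.0], 11: [3.0, 4.0]}), Gpu(1, {10: [5.0, 6.0]})]
decode_gpus = [Gpu(0), Gpu(1)]
rank_map = {0: 0, 1: 1}            # assumed one-to-one mapping between the two sides
needed = {0: [10, 11], 1: [10]}    # block IDs each target processor requires

total_calls = 0
for src in prefill_gpus:
    dst = decode_gpus[rank_map[src.rank]]
    total_calls += transfer_blocks(src, dst, needed[src.rank])

print(total_calls)                 # 3
print(decode_gpus[0].kv_region)    # {10: [1.0, 2.0], 11: [3.0, 4.0]}
```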

[0011] Furthermore, after determining whether the number of the first graphics processors is the same as the number of the second graphics processors, the method further includes: if the number of the first graphics processors is not the same as the number of the second graphics processors, determining the ID of the second target key value information to be sent in the first graphics processor; reading the second target key value information from the target video memory area based on the ID of the second target key value information using a parallel copy operator, and writing the second target key value information in parallel into a preset sending buffer; and sending the third target key value information in the sending buffer to the second graphics processor corresponding to the decoding node using a preset sending function.

[0012] Furthermore, after sending the third target key value information in the sending buffer to the plurality of decoding nodes through a preset sending function, the method further includes: receiving the third target key value information through a preset receiving function and writing the third target key value information into a preset receiving buffer; determining the fourth target key value information required by the second graphics processor corresponding to the decoding node; splitting the key value information in the receiving buffer through a parallelized copy operator to obtain the fourth target key value information, and writing the fourth target key value information into the video memory area of the second graphics processor corresponding to the decoding node.
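The unequal-count path in the two paragraphs above (pack into a sending buffer, send once, then split the receiving buffer) might look roughly like the sketch below. It is a CPU simulation under stated assumptions: a thread pool stands in for the parallel copy operator, a list copy stands in for the preset sending and receiving functions, and each decoding-side processor is assumed to need an equal slice of every block. None of these details are specified in the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def pack_send_buffer(kv_region: dict[int, list[int]], block_ids: list[int],
                     block_len: int) -> list[int]:
    """Copy the selected blocks into one contiguous sending buffer."""
    buffer = [0] * (len(block_ids) * block_len)

    def copy_one(slot: int) -> None:
        block = kv_region[block_ids[slot]]
        if len(block) != block_len:
            raise ValueError("unexpected block length")
        # Each slot writes a disjoint range, so the copies are independent.
        buffer[slot * block_len:(slot + 1) * block_len] = block

    with ThreadPoolExecutor() as pool:
        list(pool.map(copy_one, range(len(block_ids))))
    return buffer


def split_recv_buffer(recv_buffer: list[int], num_blocks: int, block_len: int,
                      decode_rank: int, num_decode_gpus: int) -> list[list[int]]:
    """Extract the slice of every block that one decoding-side processor needs."""
    shard = block_len // num_decode_gpus
    pieces = []
    for slot in range(num_blocks):
        start = slot * block_len + decode_rank * shard
        pieces.append(recv_buffer[start:start + shard])
    return pieces


kv_region = {3: [1, 2, 3, 4], 7: [5, 6, 7, 8]}
send_buffer = pack_send_buffer(kv_region, block_ids=[3, 7], block_len=4)
recv_buffer = list(send_buffer)   # stands in for a single send/recv pair

print(send_buffer)                                  # [1, 2, 3, 4, 5, 6, 7, 8]
print(split_recv_buffer(recv_buffer, 2, 4, 0, 2))   # [[1, 2], [5, 6]]
print(split_recv_buffer(recv_buffer, 2, 4, 1, 2))   # [[3, 4], [7, 8]]
```

Packing first means the transfer itself is a single call regardless of how many blocks are selected; the per-block work moves into the copy step, which can run in parallel.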

[0013] Furthermore, before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: for the first graphics processor, dividing the description information into key-value logic blocks according to a preset key-value block size to obtain the key-value logic block identifiers of the characters in the description information; and determining the key-value logic block identifier corresponding to the key-value information to be generated based on the key-value logic block identifiers of the characters in the description information.
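The block-division step above can be illustrated with a minimal sketch. It assumes that logical block identifiers start at 0 and are assigned in token order, which the disclosure does not state; `assign_block_ids` is a hypothetical helper name.

```python
def assign_block_ids(num_tokens: int, block_size: int) -> list[int]:
    """Return the key-value logical block ID for each token position."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [position // block_size for position in range(num_tokens)]


ids = assign_block_ids(num_tokens=10, block_size=4)
print(ids)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

With a block size of 4, ten tokens fall into three logical blocks, the last of which is only half full.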

[0014] According to another aspect of the embodiments of this disclosure, a data processing apparatus based on a large model is also provided, comprising: a first processing unit, configured to process description information through a first inference model in a first graphics processor corresponding to a pre-filled node, to obtain key-value information and a first reply character output by the first inference model in the first graphics processor, wherein the first inference model in the first graphics processor corresponding to the pre-filled node is determined by a target inference model; a storage unit, configured to store key-value information belonging to the same key-value logic block identifier in a target video memory area in the first graphics processor based on the key-value logic block identifier corresponding to the key-value information, wherein the key-value information belonging to the same key-value logic block identifier is stored at contiguous addresses in the target video memory area; and a second processing unit, configured to process the key-value information and the first reply character in the target video memory area through a second inference model in a second graphics processor corresponding to a decoding node, to obtain target reply information corresponding to the description information, wherein the second inference model in the second graphics processor corresponding to the decoding node is determined by the target inference model.

[0015] Furthermore, the apparatus further includes: a first judging unit, configured to determine whether the weight parameters corresponding to the target inference model need to be split before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor; a first determining unit, configured to determine, based on the description information, a first number of required first graphics processors and a second number of required second graphics processors if the weight parameters corresponding to the target inference model need to be split; a first splitting unit, configured to split the weight parameters of the target inference model to obtain a first number of first inference models, and determine the first graphics processor based on the first number of first inference models; and a second splitting unit, configured to split the weight parameters of the target inference model to obtain a second number of second inference models, and determine the second graphics processor based on the second number of second inference models.

[0016] Furthermore, the device further includes: a second judging unit, configured to judge whether the number of the first graphics processors is the same as the number of the second graphics processors before processing the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information; and a first sending unit, configured to send the key-value information in the target video memory area to the second target graphics processor through the first graphics processor if the number of the first graphics processors is the same as the number of the second graphics processors, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

[0017] Further, the first sending unit includes: a first determining module, configured to determine, using the first graphics processor, the second target graphics processor based on the mapping relationship between the first graphics processor and the second graphics processor; a second determining module, configured to determine the ID of the first target key value information required by the second target graphics processor in the target video memory region; a reading module, configured to read the first target key value information from the target video memory region using the first graphics processor based on the ID of the first target key value information; and a sending module, configured to send the first target key value information to the second target graphics processor using the first graphics processor.

[0018] Furthermore, the apparatus further includes: a first receiving unit, configured to, after the first graphics processor sends the first target key value information to the second target graphics processor, receive the key value information through the second target graphics processor and determine a video memory address in the second target graphics processor for storing the first target key value information; and a writing unit, configured to write the first target key value information to the video memory address via the second target graphics processor.

[0019] Furthermore, the device further includes: a second determining unit, configured to, after it is determined whether the number of the first graphics processors is the same as the number of the second graphics processors, determine the ID of the second target key value information to be sent in the first graphics processor if the two numbers are not the same; a reading unit, configured to read the second target key value information from the target video memory area based on the ID of the second target key value information using a parallel copy operator, and write the second target key value information in parallel into a preset sending buffer; and a second sending unit, configured to send the third target key value information in the sending buffer to the second graphics processor corresponding to the decoding node using a preset sending function.

[0020] Furthermore, the device further includes: a second receiving unit, configured to receive the third target key value information through a preset receiving function after sending the third target key value information in the sending buffer to the second graphics processor corresponding to the decoding node through a preset sending function, and write the third target key value information into a preset receiving buffer; a third determining unit, configured to determine the fourth target key value information required by the second graphics processor corresponding to the decoding node; and a third splitting unit, configured to split the key value information in the receiving buffer through a parallelized copying operator to obtain the fourth target key value information, and write the fourth target key value information into the video memory area of the second graphics processor corresponding to the decoding node.

[0021] Furthermore, the device further includes: a partitioning unit, configured to partition the description information into key-value logic blocks according to a preset key-value block size for the first graphics processor before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, so as to obtain the key-value logic block identifier of the characters in the description information; and a fourth determining unit, configured to determine the key-value logic block identifier corresponding to the key-value information to be generated based on the key-value logic block identifier of the characters in the description information.

[0022] According to another aspect of the embodiments of this disclosure, a computer-readable storage medium is also provided, the computer-readable storage medium including a stored program, wherein, when the program is executed, it controls the device where the storage medium is located to perform the data processing method based on the large model described in any one of the above embodiments.

[0023] According to another aspect of the present disclosure, an electronic device is also provided, including a memory storing an executable program; and a processor for running the program, wherein the program executes the large-model-based data processing method described in any of the preceding embodiments.

[0024] According to another aspect of the embodiments of this disclosure, a computer program product is also provided, the computer program product including a stored computer program that, when executed by a processor, implements the large model-based data processing method described in any one of the preceding embodiments.

[0025] In this embodiment, the description information is processed by a first inference model in a first graphics processor corresponding to a pre-filled node to obtain key-value information and a first response character output by the first inference model in the first graphics processor. Based on the key-value logic block identifier corresponding to the key-value information, key-value information belonging to the same key-value logic block identifier is stored in the target video memory area of the first graphics processor, wherein the key-value information belonging to the same key-value logic block identifier is stored at contiguous addresses in the target video memory area. The key-value information and the first response character in the target video memory area are then processed by a second inference model in a second graphics processor corresponding to a decoding node to obtain the target response information corresponding to the description information. Because key-value information belonging to the same key-value logic block identifier is stored in a contiguous target video memory area in the first graphics processor, the number of key-value transmissions is reduced. This improves the data processing efficiency of the inference model and solves the technical problem that, in large-model split inference, the key-value output of each model layer during pre-filling is stored in a separate contiguous video memory area per layer, so that when the key-value information needs to be transmitted to the decoding part for decoding, only one contiguous video memory region can be transmitted at a time, resulting in transmission delay and low data processing efficiency.

Description of the Drawings

[0026] The accompanying drawings, which are included to provide a further understanding of this disclosure and form part of this disclosure, illustrate exemplary embodiments of the present disclosure and are used to explain the disclosure, but do not constitute an undue limitation of the disclosure. In the drawings:

[0027] Figure 1 is a hardware structure block diagram of a computer terminal provided according to Embodiment 1 of this disclosure;

[0028] Figure 2 is a flowchart of a data processing method based on a large model according to Embodiment 1 of this disclosure;

[0029] Figure 3 is a flowchart of a data processing method based on a large model according to Embodiment 1 of this disclosure;

[0030] Figure 4 is a flowchart of the data processing method based on a large model provided according to Embodiment 1 of this disclosure;

[0031] Figure 5 is a schematic diagram of a data processing device based on a large model according to Embodiment 2 of this disclosure;

[0032] Figure 6 is a structural block diagram of an electronic device provided according to Embodiment 3 of this disclosure.

Detailed Implementation

[0033] To enable those skilled in the art to better understand the present disclosure, the technical solutions of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present disclosure, and not all embodiments. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present disclosure.

[0034] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0035] First, some nouns or terms that appear in the description of the embodiments of this disclosure shall be interpreted as follows:

[0036] Key-Value Block (kv block): A cache unit that stores the keys and values of a fixed number of characters (tokens).

[0037] num_blocks: The number of kv blocks.

[0038] num_layers: Number of model layers.

[0039] block_size: The size of the kv block.

[0040] num_kv_heads: The number of heads in the kv cache.

[0041] head_size: The size of the head of the key-value cache.

[0042] cache_dim: The tensor dimension of the actual key-value cache stored in a single key-value block, cache_dim = block_size * num_kv_heads * head_size.

[0043] Tensor Parallelism (TP) is a technique for distributed training and inference. In Tensor Parallelism, the model's parameters (such as the weight matrix) are divided into multiple small blocks and distributed to different GPUs, with each GPU responsible for processing a different part.

[0044] GPU rank: Typically refers to the unique identifier (ID or rank) of each GPU in a multi-GPU system.
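As a worked example of the terms defined above, the sketch below computes cache_dim and the total size of one kv block across all layers. The numeric values and the 2-byte element size are hypothetical and chosen only to make the arithmetic concrete; they are not taken from the disclosure.

```python
# Hypothetical configuration values, for illustration only.
block_size = 16       # tokens per kv block
num_kv_heads = 8      # kv heads
head_size = 128       # dimension of each head
num_layers = 64       # model layers
bytes_per_element = 2 # assumes a 16-bit element type

# cache_dim as defined above: elements of K (or V) in one block of one layer.
cache_dim = block_size * num_kv_heads * head_size

# One block holds K and V for every layer.
elements_per_block = num_layers * 2 * cache_dim
bytes_per_block = elements_per_block * bytes_per_element

print(cache_dim)           # 16384
print(elements_per_block)  # 2097152
print(bytes_per_block)     # 4194304 (4 MiB)
```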

[0045] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant regions, and corresponding operation portals are provided for users to choose to authorize or refuse.

[0046] According to embodiments of this disclosure, a data processing method based on a large model is also provided. It should be noted that the steps shown in the flowcharts in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0047] The method embodiment provided in this disclosure can be executed in a mobile terminal, computer terminal, or similar computing device. Figure 1 shows a hardware structure block diagram of a computer terminal (or mobile device) for implementing a large-model-based data processing method. As shown in Figure 1, the computer terminal (or mobile device) 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n in Figure 1; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a field-programmable gate array (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, it may also include: a display, an input / output interface (I / O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and / or a camera. Those skilled in the art will understand that the structure shown in Figure 1 is only illustrative and does not limit the structure of the above-described electronic device. For example, computer terminal 10 may also include more or fewer components than shown in Figure 1, or have a different configuration than shown in Figure 1.

[0048] It should be noted that the aforementioned one or more processors 102 and / or other data processing circuitry are generally referred to herein as "data processing circuitry". This data processing circuitry may be embodied, in whole or in part, in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuitry may be a single, independent processing module, or may be integrated, in whole or in part, into any other element within the computer terminal 10 (or mobile device). As involved in embodiments of this disclosure, the data processing circuitry serves as a processor control mechanism (e.g., selection of a variable resistor termination path connected to an interface).

[0049] The memory 104 can be used to store software programs and modules of application software, such as the program instructions / data storage device corresponding to the large-model-based data processing method in this embodiment of the present disclosure. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, thereby realizing the aforementioned large-model-based data processing method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0050] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.

[0051] The display can be a touchscreen liquid crystal display (LCD) that allows the user to interact with the user interface of the computer terminal 10 (or mobile device).

[0052] Under the aforementioned operating environment, this disclosure provides a data processing method based on a large model, as shown in Figure 2. Figure 2 is a flowchart of the data processing method based on a large model according to Embodiment 1 of this disclosure.

[0053] Step S201: The description information input by the target object is processed by the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key value information and the first reply character output by the first inference model in the first graphics processor.

[0054] Step S202: Based on the key value logical block identifier corresponding to the key value information, store the key value information belonging to the same key value logical block identifier into the target video memory area in the first graphics processor, wherein the key value information belonging to the same key value logical block identifier has a continuous storage address in the target video memory area.

[0055] Optionally, the split inference framework separates the pre-filling and decoding of the inference model, with each being processed on different GPU (Graphics Processing Unit) nodes. The specific inference process is shown in Figure 3. The front-end interface receives descriptive information from the target object (e.g., the question to be answered) and sends this information to the GPU node corresponding to pre-filling. During the pre-filling stage, the received descriptive information is converted into key-value pairs (kv) and stored in its own GPU memory. The first response character, or first token, is then output. The kv and first token are sent to the GPU node corresponding to decoding for further decoding until the final response information is obtained.

[0056] In existing technologies, to facilitate layer-by-layer forward propagation of the model, the physical structure of the kv cache is designed so that each model layer uses its own contiguous memory region. For example, the kv cache data structure is: list[2, num_blocks, block_size*num_kv_heads*head_size], where each list element represents the contiguous kv cache of a single layer. This design makes the data of any one block discontiguous, because the cache of a block is scattered across the layers and across the k and v halves of each layer. Since the send and recv APIs of the collective communication library (used for communication between multiple GPUs) can only transfer one contiguous region of kv cache per call, the number of API calls required for kv cache transfer grows in proportion to the number of discrete memory segments, which leads to significant transfer latency. For example, the number of discrete memory segments in one block of kv cache is the number of layers (for example, 64 layers in a 32B-parameter model) multiplied by 2 (for k and v), so transferring one block requires 128 atomic API calls.
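The arithmetic in the example above can be checked with a minimal sketch. The function name `send_calls_layerwise` is hypothetical, and the sketch assumes that segments of different blocks are never merged into a single call.

```python
def send_calls_layerwise(num_layers: int, blocks_to_send: int) -> int:
    """Count send calls when every layer owns its own [2, num_blocks, ...] tensor.

    Inside one layer, the K slice and the V slice of a given block are
    separated by the slices of other blocks, so each (layer, k-or-v) pair
    is a separate contiguous segment and needs its own send call.
    Assumption: segments of different blocks are not coalesced.
    """
    segments_per_block = num_layers * 2  # one K and one V segment per layer
    return segments_per_block * blocks_to_send


print(send_calls_layerwise(num_layers=64, blocks_to_send=1))   # 128
print(send_calls_layerwise(num_layers=64, blocks_to_send=10))  # 1280
```

The call count scales with both the layer count and the number of blocks, which is why the per-layer layout becomes costly for long prompts.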

[0057] To address the aforementioned issues, in the data processing method based on a large model provided in Embodiment 1 of this disclosure, the description information is processed by the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor corresponding to the pre-filled node.

[0058] It should be noted that the first inference model in the first graphics processor corresponding to the aforementioned pre-filled nodes is determined by the target inference model. The target inference model is a large model, which refers to a machine learning model with a large number of parameters and complex computational structure, usually constructed from deep neural networks.

[0059] After obtaining the aforementioned kv cache (i.e., the key-value information), the kv cache structure is modified. For the first graphics processor corresponding to the pre-filled node, the key-value information belonging to the same key-value logical block identifier in the kv cache output by the first graphics processor is stored in the target video memory area of the first graphics processor. It should be noted that the storage addresses of key-value information belonging to the same key-value logical block identifier in the target video memory area are contiguous, that is, the physical addresses of key-value information belonging to the same key-value logical block identifier are contiguous. The kv cache structure is modified into a single contiguous tensor, i.e., [num_blocks, num_layers, 2, block_size * num_kv_heads * head_size], compared to the per-layer [2, num_blocks, block_size * num_kv_heads * head_size] in the prior art. This improvement effectively reduces the number of network API calls, because the key-value cache of a block now occupies one contiguous physical address range and no longer has to be read separately from each layer. Therefore, the number of network API calls for each key-value block can be reduced by a factor of "number of layers * 2 (for key and value)".
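The contiguity claim can be illustrated with a small sketch that computes row-major flat offsets for the [num_blocks, num_layers, 2, block_elems] layout. The helper names and the toy sizes are hypothetical, and `block_elems` stands in for block_size * num_kv_heads * head_size.

```python
def block_major_offset(block, layer, kv, elem, num_layers, block_elems):
    """Row-major flat offset in a [num_blocks, num_layers, 2, block_elems] tensor."""
    return ((block * num_layers + layer) * 2 + kv) * block_elems + elem


def block_range(block, num_layers, block_elems):
    """Half-open [start, end) range of flat offsets that holds one whole block."""
    start = block_major_offset(block, 0, 0, 0, num_layers, block_elems)
    return start, start + num_layers * 2 * block_elems


# Toy sizes (assumed for illustration only).
NUM_LAYERS, BLOCK_ELEMS = 4, 8

# Enumerate every element of block 3 across all layers and both k and v.
offsets = sorted(
    block_major_offset(3, layer, kv, elem, NUM_LAYERS, BLOCK_ELEMS)
    for layer in range(NUM_LAYERS)
    for kv in range(2)
    for elem in range(BLOCK_ELEMS)
)
start, end = block_range(3, NUM_LAYERS, BLOCK_ELEMS)
print((start, end))                        # (192, 256)
print(offsets == list(range(start, end)))  # True: one gap-free segment
```

Because the whole block maps to one gap-free range, a single send call per block is enough in this layout.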

[0060] Step S203: The key-value information and the first reply character in the target memory area are processed by the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information. The second inference model in the second graphics processor corresponding to the decoding node is determined by the target inference model.

[0061] Optionally, after obtaining the corresponding reusable key-value cache and the first response character (first token), the key-value information and the first response character in the target video memory area are processed by the second inference model in the second graphics processor corresponding to the decoding node, thereby obtaining the target response information corresponding to the description information.

[0062] It should be noted that the number of first graphics processors corresponding to pre-filled nodes can be one or more, and the number of second graphics processors corresponding to decoding nodes can also be one or more. When there are multiple first graphics processors, for any one first graphics processor, only its own output key-value cache is stored.

[0063] In summary, after the key-value information is output from the first inference model in the first graphics processor corresponding to the pre-filled node, the key-value information belonging to the same key-value logic block identifier in the key-value information output by the first graphics processor is stored in a contiguous video memory area. When it is necessary to transmit the key-value information in a key-value logic block to the decoding part for decoding, it is not necessary to repeatedly read the key-value information from the video memory area, which can effectively reduce the number of key-value transmissions, improve the transmission efficiency of key-value information, and thus improve the efficiency of data processing.

[0064] To improve the efficiency of parallel processing, in the data processing method based on a large model provided in Embodiment 1 of this disclosure, before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: determining whether the weight parameters corresponding to the target inference model need to be split; if the weight parameters corresponding to the target inference model need to be split, then determining the first number of required first graphics processors and the second number of required second graphics processors based on the description information; splitting the weight parameters of the target inference model to obtain the first number of first inference models, and determining the first graphics processor based on the first number of first inference models; splitting the weight parameters of the target inference model to obtain the second number of second inference models, and determining the second graphics processor based on the second number of second inference models.

[0065] Optionally, to improve the speed of pre-filling and decoding, multiple first graphics processors can be set in the pre-filling node, and multiple second graphics processors can be set in the decoding node. Based on actual needs, it can be determined whether the weight parameters corresponding to the target inference model need to be split, i.e., whether parallel pre-filling using multiple first graphics processors and parallel decoding using multiple second graphics processors are required.

[0066] If the weight parameters corresponding to the inference model need to be split, then the required number of first graphics processors and the required number of second graphics processors are determined according to the type of descriptive information. For example, long text summary question answering tasks are typically long input and short output, while content creation and generation tasks are short input and long output. For long input and short output tasks, the number of pre-filled nodes can be set to be larger, and the number of decoding nodes can be set to be smaller. Conversely, for short input and long output tasks, the number of pre-filled nodes can be set to be smaller, and the number of decoding nodes can be set to be larger.
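As a purely illustrative sketch of this idea, a fixed GPU budget could be divided according to the ratio of input length to expected output length. The disclosure does not specify such a formula; the function `choose_gpu_counts`, its parameters, and the proportional rule are all assumptions.

```python
def choose_gpu_counts(input_tokens: int, expected_output_tokens: int,
                      total_gpus: int = 8) -> tuple:
    """Split a GPU budget between prefill and decoding (illustrative heuristic).

    Long-input/short-output tasks get more prefill GPUs; short-input/
    long-output tasks get more decoding GPUs. Each side keeps at least one GPU.
    Real deployments usually also require the count to divide the number of
    attention heads; that constraint is not modeled here.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per side")
    share = input_tokens / (input_tokens + expected_output_tokens)
    prefill = min(max(round(total_gpus * share), 1), total_gpus - 1)
    return prefill, total_gpus - prefill


print(choose_gpu_counts(8000, 200))  # (7, 1): long text summary style task
print(choose_gpu_counts(100, 4000))  # (1, 7): content generation style task
```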

[0067] After obtaining the first number and the second number mentioned above, the weight parameters of the target inference model can be split using Tensor Parallelism (TP) to obtain the first number of first inference models, and the first graphics processors mentioned above are then determined based on these first inference models. Similarly, the weight parameters of the target inference model can be split using TP to obtain the second number of second inference models, and the second graphics processors are determined based on these second inference models.
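A minimal sketch of column-wise weight splitting, one common form of tensor parallelism, is shown below. The disclosure does not state which splitting scheme is used, so the column split and the helper names `split_columns` and `matvec` are assumptions for illustration.

```python
def split_columns(weight, tp_size):
    """Split a [rows][cols] weight matrix into tp_size column shards."""
    cols = len(weight[0])
    if cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    width = cols // tp_size
    return [[row[r * width:(r + 1) * width] for row in weight]
            for r in range(tp_size)]


def matvec(x, weight):
    """Multiply a vector x (length rows) by weight [rows][cols]."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
x = [1, 10]

full = matvec(x, weight)                     # result on a single GPU
shards = split_columns(weight, tp_size=2)    # one shard per GPU
partial = [matvec(x, s) for s in shards]     # each GPU computes its slice
gathered = [v for p in partial for v in p]   # concatenate the slices

print(full)              # [51, 62, 73, 84]
print(gathered == full)  # True
```

Each shard plays the role of one "first inference model" (or "second inference model") in the text: it holds a fraction of the weights and produces a fraction of the output.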

[0068] By splitting the weight parameters of the target inference model, parallel pre-filling and parallel decoding can be achieved, improving the efficiency of the target inference model in data processing.

[0069] It should be noted that if the weight parameters of the target inference model do not need to be split, then the first inference model in the first graphics processor corresponding to a pre-filled node is the target inference model, and the second inference model in the second graphics processor corresponding to a decoding node is also the target inference model. Furthermore, the number of GPUs corresponding to a pre-filled node is one, and the number of GPUs corresponding to a decoding node is also one.

[0070] To improve data processing efficiency, in the large model-based data processing method provided in Embodiment 1 of this disclosure, before processing the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information, the method further includes: determining whether the number of first graphics processors and the number of second graphics processors are the same; if the number of first graphics processors and the number of second graphics processors are the same, then the key-value information in the target video memory area is sent to the second target graphics processor through the first graphics processor, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

[0071] Optionally, when the inference service enables Tensor Parallelism (TP), the last dimension of the original key-value cache changes from list[num_blocks,num_layers,2,cache_dim] to list[num_blocks,num_layers,2,cache_dim / TP size]. This is because TP splits the cache along the head dimension, and each prefill GPU computes only its own share of the key-value cache. The TP size refers to the tensor-parallel degree, which is either the first number or the second number mentioned above. When transmitting the key-value cache, if the tensor-parallel degrees of prefill and decode are inconsistent, the first graphics processors cannot write data directly into the key-value cache of the decoding node in a single transmission because the dimensions do not match. Therefore, to improve transmission efficiency, it is first determined whether the number of first graphics processors and the number of second graphics processors are the same, i.e., whether the tensor-parallel degrees of prefill and decode are the same.
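The effect of TP on the cache shape can be sketched as follows. The function `kv_cache_shape` and the example sizes are hypothetical; only the shape formula comes from the paragraph above.

```python
def kv_cache_shape(num_blocks, num_layers, block_size, num_kv_heads,
                   head_size, tp_size=1):
    """Per-GPU kv cache shape [num_blocks, num_layers, 2, cache_dim / tp_size]."""
    if num_kv_heads % tp_size != 0:
        raise ValueError("num_kv_heads must be divisible by tp_size")
    cache_dim = block_size * num_kv_heads * head_size
    return (num_blocks, num_layers, 2, cache_dim // tp_size)


# Assumed example sizes: 16 tokens per block, 8 kv heads, head size 128.
prefill_shape = kv_cache_shape(1024, 64, 16, 8, 128, tp_size=4)
decode_shape = kv_cache_shape(1024, 64, 16, 8, 128, tp_size=2)

print(prefill_shape)                  # (1024, 64, 2, 4096)
print(decode_shape)                   # (1024, 64, 2, 8192)
print(prefill_shape == decode_shape)  # False: a direct in-place write is not possible
```

With prefill at TP size 4 and decode at TP size 2, one decode GPU needs the slices of two prefill GPUs, which is the dimension mismatch described above.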

[0072] If the tensor-parallel degrees of prefill and decode are the same (i.e., the TP size is the same), so that the key-value cache structure of all nodes in the cluster remains consistent, then the key-value information in the target memory area is sent directly to the second target graphics processor through the first graphics processor. It should be noted that the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor. There may be one or more second target graphics processors.

[0073] By checking whether the number of the first graphics processor is the same as the number of the second graphics processor, it can be ensured that the dimensions of prefill and decode are consistent when transmitting the key-value cache, thus guaranteeing that the key-value cache is transmitted and received correctly.

[0074] It should be noted that if the inference service does not enable TP (i.e., the weight parameters of the target inference model do not need to be split), then the number of GPUs corresponding to a prefill node is one, and the number of GPUs corresponding to a decoding node is also one. In this case, the tensor-parallel degrees of prefill and decode are the same. Therefore, the key-value information in the target video memory area is sent directly to the second graphics processor through the first graphics processor.

[0075] If the tensor-parallel degrees of prefill and decode are the same, in the data processing method based on a large model provided in Embodiment 1 of this disclosure, sending key-value information in the target video memory region to the second target graphics processor through the first graphics processor includes: determining the second target graphics processor based on the mapping relationship between the first and second graphics processors; determining the ID of the first target key-value information in the target video memory region required by the second target graphics processor; reading the first target key-value information from the target video memory region based on the ID of the first target key-value information through the first graphics processor; and sending the first target key-value information to the second target graphics processor through the first graphics processor.

[0076] After the first target key value information is sent to the second target graphics processor through the first graphics processor, the second target graphics processor receives the key value information and determines the video memory address in the second target graphics processor for storing the first target key value information; the second target graphics processor writes the first target key value information to the video memory address.

[0077] Optionally, before data processing via pre-filled nodes and decoding nodes, if weight splitting has been performed, the mapping relationship between the first graphics processor and the second graphics processor will be determined based on the split weight parameters, i.e., the kv cache metadata list will be determined. This kv cache metadata list is divided into a send list (send_meta) and a receive list (recv_meta). The send list send_meta and the receive list recv_meta include the mapping relationship of GPU rank between sending and receiving, as well as the index information of the kv block (i.e., the ID of the kv block to be sent or the ID of the kv block to be received).

[0078] Therefore, sending key-value information from the target video memory region to the second target graphics processor via the first graphics processor includes the following steps: based on the mapping relationship between the first and second graphics processors, the first graphics processor determines the second target graphics processor and the ID of the first target key-value information required by the second target graphics processor within the target video memory region. That is, the sending node (i.e., the aforementioned first graphics processor) specifies the block ID index to be sent, as well as the current fragment number and the total number of fragments. These two values are needed because TP may divide a block's key-value cache into smaller fragments, so the fragment currently being transmitted and the total number of fragments for that block's key-value cache must be specified during transmission.

[0079] Then, the first graphics processor executes the corresponding send API to send the kv cache stored at the address of that block index in its own video memory to the GPU of the receiving rank (i.e., the second target graphics processor mentioned above).

[0080] For the receiving node (i.e., the second target GPU mentioned above), the corresponding recv API is executed to accept the tensor from the sending rank GPU (i.e., the first GPU mentioned above) and write it directly to the address of the corresponding block index. That is, the second target GPU receives the key-value information and determines the video memory address in the second target GPU for storing the first target key-value information; the first target key-value information is written to the video memory address through the second target GPU.

[0081] The mapping relationship between pre-filled nodes and decoding nodes enables accurate transmission of key-value cache between them.
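The in-place path described in the preceding paragraphs can be simulated in plain Python. This is a sketch under stated assumptions: the `TransferMeta` fields are hypothetical stand-ins for the contents of send_meta and recv_meta, dictionaries stand in for GPU video memory, and a list copy stands in for a real send/recv pair.

```python
from dataclasses import dataclass, field


@dataclass
class TransferMeta:
    """Hypothetical metadata entry: the peer GPU rank and the kv block ids."""
    peer_rank: int
    block_ids: list = field(default_factory=list)


def transfer_in_place(prefill_cache, decode_cache, send_meta, recv_meta):
    """Copy each listed block with one simulated send/recv pair.

    Returns the number of simulated send calls. No staging buffer is used:
    the receiver writes each block straight to its destination block id.
    """
    if len(send_meta.block_ids) != len(recv_meta.block_ids):
        raise ValueError("send and receive lists must have the same length")
    calls = 0
    for src_id, dst_id in zip(send_meta.block_ids, recv_meta.block_ids):
        payload = prefill_cache[src_id]       # one contiguous block, one send
        decode_cache[dst_id] = list(payload)  # written directly at the block address
        calls += 1
    return calls


prefill_cache = {4: [1.0, 2.0], 9: [3.0, 4.0]}  # block id -> contiguous kv data
decode_cache = {}
send_meta = TransferMeta(peer_rank=0, block_ids=[4, 9])
recv_meta = TransferMeta(peer_rank=0, block_ids=[0, 1])

calls = transfer_in_place(prefill_cache, decode_cache, send_meta, recv_meta)
print(calls)         # 2
print(decode_cache)  # {0: [1.0, 2.0], 1: [3.0, 4.0]}
```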

[0082] If the number of first graphics processors is different from the number of second graphics processors, in the data processing method based on a large model provided in Embodiment 1 of this disclosure, after determining whether the number of first graphics processors is the same as the number of second graphics processors, the method further includes: determining the ID of the second target key value information to be sent in the first graphics processor; reading the second target key value information from the target video memory area based on the ID of the second target key value information using a parallel copy operator, and writing the second target key value information in parallel into a preset transmission buffer; and sending the third target key value information in the transmission buffer to the second graphics processor corresponding to the decoding node using a preset transmission function.

[0083] After sending the third target key value information in the transmission buffer to the second graphics processor corresponding to the decoding node through a preset transmission function, the third target key value information is received through a preset reception function and written into a preset reception buffer; the fourth target key value information required by the second graphics processor corresponding to the decoding node is determined; the key value information in the reception buffer is split through a parallelized copy operator to obtain the fourth target key value information, and the fourth target key value information is written into the video memory area of the second graphics processor corresponding to the decoding node.

[0084] Optionally, if the tensor-parallel degrees of prefill and decode are inconsistent, the first graphics processors cannot write directly into the key-value cache of the second graphics processor in the decoding node because the dimensions do not match. As a result, transmission of the key-value cache from prefill to decoding would again be limited by discrete memory segments. To avoid this problem, when the tensor-parallel degrees of prefill and decode are inconsistent, the ID of the second target key-value information to be sent in the first graphics processor is determined from the send list send_meta. Then, the parallelized copy operator writes the key-value cache of the scattered key-value blocks into the send buffer in the order given in send_meta. It should be noted that the size of the reserved send buffer for the key-value cache is determined based on the model size and GPU memory resources.

[0085] After parallel writing is completed, the nccl.send atomic operation (i.e., the preset transmission function mentioned above) is executed to send the contents of the send buffer to the GPU of the receiving rank.

[0086] The receiving node (i.e., the second graphics processor mentioned above) executes the nccl.recv atomic operation (i.e., the preset reception function mentioned above) to receive the kv cache from the GPU of the sending rank and writes the received kv cache (i.e., the third target key-value information mentioned above) into a preset receive buffer. After writing, the fourth target key-value information required by the second graphics processor corresponding to the decoding node is determined according to the kv cache metadata list (recv_meta). Then, the consecutive kv caches in the receive buffer are split into the corresponding video memory areas of the second graphics processor using the Triton operator (i.e., the parallelized copy operator mentioned above). It should be noted that if the size of the kv cache tensor to be sent exceeds the reserved send / receive buffer size, multiple write + send + receive operations are performed.
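The buffered path, including the multi-round case in which the data exceeds the reserved buffer, can be sketched as follows. This is a simplified simulation, not the Triton or NCCL implementation: `buffered_transfer` and its parameters are hypothetical, list copies stand in for nccl.send and nccl.recv, all blocks are assumed to have the same length, and the re-slicing along the head dimension needed for differing TP sizes is omitted.

```python
def buffered_transfer(src_blocks, send_ids, recv_ids, dst_blocks, buffer_blocks):
    """Move kv blocks through a fixed-capacity staging buffer.

    src_blocks / dst_blocks: dict of block id -> list of values.
    send_ids / recv_ids: block ids in send_meta / recv_meta order.
    buffer_blocks: how many blocks fit in the send / receive buffer.
    Returns the number of write + send + receive rounds.
    """
    if buffer_blocks < 1:
        raise ValueError("buffer must hold at least one block")
    if len(send_ids) != len(recv_ids):
        raise ValueError("send and receive lists must have the same length")
    rounds = 0
    for start in range(0, len(send_ids), buffer_blocks):
        chunk_src = send_ids[start:start + buffer_blocks]
        chunk_dst = recv_ids[start:start + buffer_blocks]
        # Gather: pack scattered blocks into one contiguous send buffer.
        send_buffer = [v for bid in chunk_src for v in src_blocks[bid]]
        # One simulated send/recv moves the whole buffer.
        recv_buffer = list(send_buffer)
        rounds += 1
        # Scatter: split the contiguous receive buffer back into blocks.
        block_len = len(recv_buffer) // len(chunk_dst)
        for i, bid in enumerate(chunk_dst):
            dst_blocks[bid] = recv_buffer[i * block_len:(i + 1) * block_len]
    return rounds


src = {i: [i * 10, i * 10 + 1] for i in range(5)}
dst = {}
rounds = buffered_transfer(src, send_ids=[4, 0, 2, 1, 3],
                           recv_ids=[0, 1, 2, 3, 4],
                           dst_blocks=dst, buffer_blocks=2)
print(rounds)  # 3 rounds: 2 + 2 + 1 blocks
print(dst[0])  # [40, 41]
```

The number of communication calls now depends on the buffer capacity rather than on the number of scattered segments, at the cost of two extra copies per round.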

[0087] The above steps achieve low-latency key-value cache transmission whether the tensor-parallel degrees of prefill and decode are consistent or inconsistent, which improves transmission efficiency and thus data processing efficiency.

[0088] In the data processing method based on a large model provided in Embodiment 1 of this disclosure, before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: for the first graphics processor, dividing the description information into key-value logic blocks according to a preset key-value block size to obtain the key-value logic block identifier of the characters in the description information; and determining the key-value logic block identifier corresponding to the key-value information to be generated based on the key-value logic block identifier of the characters in the description information.

[0089] Optionally, before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, the description information is divided into key-value logical blocks according to the preset key-value block size (i.e., the size of the kv block). That is, the key-value logical block identifier corresponding to the token in the description information is determined, and then the key-value logical block identifier corresponding to the key-value information to be generated by the token is determined, so that the kv cache belonging to the same kv block is stored in a continuous physical address according to the key-value logical block identifier. That is, the key-value information belonging to the same key-value logical block identifier is stored in a continuous address in the target video memory area.

[0090] It should be noted that the size of a key-value block refers to the number of key-value pairs stored in a key-value block, while the key-value logical block refers to the logical address corresponding to the key-value pairs belonging to the same key-value block. In large model inference, the key-value cache corresponds to both logical and physical addresses. When stored in the target video memory area of the first graphics processor, it corresponds to the physical address, while the logical address is the address used when operating on the data. Before generating key-value pairs, the logical addresses corresponding to the key-value pairs are first allocated, and then the corresponding physical addresses are determined.
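The division of tokens into logical blocks and the later assignment of physical blocks can be sketched as follows. The function names and the free-list allocation policy are assumptions; the disclosure only states that logical addresses are allocated first and physical addresses are determined afterwards.

```python
def logical_block_ids(num_tokens: int, block_size: int) -> list:
    """Logical kv block id for each token position of the description information."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    return [t // block_size for t in range(num_tokens)]


def build_block_table(logical_ids, free_physical_blocks):
    """Assign each distinct logical block a physical block from a free list."""
    table = {}
    free = list(free_physical_blocks)
    for lid in logical_ids:
        if lid not in table:
            if not free:
                raise RuntimeError("out of physical kv blocks")
            table[lid] = free.pop(0)
    return table


ids = logical_block_ids(num_tokens=10, block_size=4)
table = build_block_table(ids, free_physical_blocks=[7, 2, 9, 5])
print(ids)    # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
print(table)  # {0: 7, 1: 2, 2: 9}
```

All tokens that share a logical block id have their key-value information written into the same physical block, which is what keeps that block's data at contiguous addresses.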

[0091] It should be noted that, because the key-value cache structure has changed, the operators that read from and compute on the key-value cache also need to be modified. These operators use thread blocks and threads to process data in parallel. Each thread calculates the address of the data it accesses in global memory from a thread index combined with an offset and a stride (step size). The thread index is calculated from the thread ID and the data layout, so the method for calculating this global index should change accordingly when the dimensions of the data structure change. Likewise, because the operators use the offset and stride to locate each element, the formulas for calculating the offset and stride of each element need to be adjusted when the dimensions or layout of the data structure change.
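To make the offset and step size (stride) adjustment concrete, the sketch below derives row-major strides for the old per-layer layout and for the new block-major layout, then computes the flat address of the same logical element in each. The helper names and toy sizes are hypothetical; a real operator would perform the same arithmetic per thread on the GPU.

```python
def row_major_strides(shape):
    """Element strides of a contiguous row-major tensor with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_address(index, strides, base_offset=0):
    """Address = base offset + sum of index * stride over all dimensions."""
    return base_offset + sum(i * s for i, s in zip(index, strides))


# Toy sizes (assumed): 4 blocks, 3 layers, 8 elements per block per k or v.
old_strides = row_major_strides((2, 4, 8))     # per-layer [2, num_blocks, cache_dim]
new_strides = row_major_strides((4, 3, 2, 8))  # [num_blocks, num_layers, 2, cache_dim]
print(old_strides)  # (32, 8, 1)
print(new_strides)  # (48, 16, 8, 1)

# Same logical element: block 1, layer 2, v (kv=1), element 5.
old_addr = flat_address((1, 1, 5), old_strides)     # index order: kv, block, elem
new_addr = flat_address((1, 2, 1, 5), new_strides)  # index order: block, layer, kv, elem
print(old_addr)  # 45, relative to the start of layer 2's own tensor
print(new_addr)  # 93, relative to the start of the single shared tensor
```

In the old layout the layer selects which tensor to index, whereas in the new layout the layer is one more stride term, so both the index order and the stride values used by the operators change.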

[0092] In an optional embodiment, the transfer of the kv cache can be implemented through the flowchart shown in Figure 4. As shown in Figure 4, the kv cache structure is adjusted so that the kv cache of one kv block is aggregated into a contiguous region, and the operators are adjusted to match the new kv cache structure. The lists of kv cache metadata to be sent and received, i.e., send_meta and recv_meta, are obtained. It is then determined whether the tensor-parallel degrees of prefill and decode are consistent. If they are consistent, the kv blocks of the prefill node can be transferred directly through in-place operations (i.e., operations performed directly at the original storage location without establishing a kv cache send / receive buffer). If they are inconsistent, kv cache send / receive buffers are established, and the data read / write process is optimized through parallel processing: the Triton parallel copy operator writes the kv caches of each block into the send buffer in the order specified in send_meta; the receiving node receives the kv cache from the GPU of the sending rank and writes it directly into the receive buffer; the Triton parallel copy operator then splits the consecutive kv caches in the receive buffer and writes them to the addresses of the corresponding block indexes.

[0093] In the large model-based data processing method provided in Embodiment 1 of this disclosure, the description information is processed by the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor; based on the key-value logic block identifier corresponding to the key-value information, the key-value information belonging to the same key-value logic block identifier is stored in the target video memory area in the first graphics processor, wherein the key-value information belonging to the same key-value logic block identifier is stored at a continuous address in the target video memory area; the key-value information and the first reply character in the target video memory area are processed by the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information. This solves the problem in the related technology that when pre-filling the large model split inference, the key-value pairs output by each layer of the model are stored in a continuous video memory area, and when the key-value pairs need to be transmitted to the decoding part for decoding, only one continuous video memory space of key-value pairs can be transmitted at a time, resulting in transmission delay and thus low data processing efficiency. In this scheme, after the key-value information is output from the first inference model in the first graphics processor corresponding to the pre-filled node, for the first graphics processor, the key-value information belonging to the same key-value logic block identifier in the key-value information output by the first graphics processor is stored in a contiguous video memory area. 
When it is necessary to transmit the key-value information in a key-value logic block to the decoding part for decoding, it is not necessary to repeatedly read the key-value information from the video memory area, which can effectively reduce the number of key-value transmissions, improve the transmission efficiency of key-value information, and thus improve the efficiency of data processing.

[0094] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, because according to this disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this disclosure.

[0095] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as read-only memory (ROM) / random access memory (RAM), magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods of the various embodiments of this disclosure.

[0096] According to an embodiment of this disclosure, a large-model-based data processing apparatus for implementing the above-described large-model-based data processing method is also provided, as shown in Figure 5. The apparatus includes: a first processing unit 501, a storage unit 502, and a second processing unit 503.

[0097] The first processing unit 501 is configured to process the description information input by the target object through the first inference model in the first graphics processor corresponding to the pre-filled node, and obtain the key value information and the first reply character output by the first inference model in the first graphics processor.

[0098] The storage unit 502 is configured to store key value information belonging to the same key value logical block identifier into the target video memory area of the first graphics processor based on the key value logical block identifier corresponding to the key value information. The key value information belonging to the same key value logical block identifier is stored at a continuous address in the target video memory area.

[0099] The second processing unit 503 is configured to process the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node, so as to obtain the target reply information corresponding to the description information.

[0100] In the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the first processing unit 501 processes the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor; the storage unit 502 stores the key-value information belonging to the same key-value logic block identifier in the target video memory area of the first graphics processor based on the key-value logic block identifier corresponding to the key-value information, wherein the key-value information belonging to the same key-value logic block identifier is stored at contiguous addresses in the target video memory area; the second processing unit 503 processes the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information. This solves the problem in the related technology that, when pre-filling in separated large-model inference, the key-value pairs output by each layer of the model are stored in that layer's own contiguous video memory area, so that when the key-value pairs need to be transmitted to the decoding part for decoding, only the key-value pairs in one contiguous video memory space can be transmitted at a time, resulting in transmission delay and thus low data processing efficiency. In this scheme, after the first inference model in the first graphics processor corresponding to the pre-filled node outputs the key-value information, the key-value information belonging to the same key-value logic block identifier is stored in a contiguous video memory area of that first graphics processor. When the key-value information in a key-value logic block needs to be transmitted to the decoding part for decoding, it is not necessary to repeatedly read the key-value information from the video memory area, which effectively reduces the number of key-value transmissions, improves the transmission efficiency of key-value information, and thus improves the efficiency of data processing.
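
The effect of the layout change can be illustrated with a small sketch. It is not taken from the disclosure: the function names, the flat element-offset model of video memory, and the parameter values are assumptions chosen only for illustration. It compares a layer-major layout (each layer owns one contiguous region, as in the related technology) with a block-major layout (the data of all layers for one key-value logic block sit side by side) and counts how many contiguous spans must be read to transmit one block.

```python
def layer_major_offsets(block_id, num_layers, num_blocks, block_elems):
    """Related-art layout (assumed): one contiguous region per layer.

    The data of a single logic block is scattered over num_layers regions.
    """
    return [(layer * num_blocks + block_id) * block_elems
            for layer in range(num_layers)]


def block_major_offsets(block_id, num_layers, block_elems):
    """Layout described in the text: one contiguous region per logic block."""
    return [(block_id * num_layers + layer) * block_elems
            for layer in range(num_layers)]


def contiguous_spans(offsets, block_elems):
    """Merge adjacent [start, end) ranges; each span models one transfer."""
    spans = []
    for start in sorted(offsets):
        if spans and spans[-1][1] == start:
            # This range begins where the previous one ends: extend it.
            spans[-1] = (spans[-1][0], start + block_elems)
        else:
            spans.append((start, start + block_elems))
    return spans


# Hypothetical sizes: 4 layers, 8 logic blocks, 16 elements per layer per block.
scattered = contiguous_spans(layer_major_offsets(3, 4, 8, 16), 16)
packed = contiguous_spans(block_major_offsets(3, 4, 16), 16)
print(len(scattered), len(packed))
```

Under these assumptions, transmitting block 3 takes four separate spans in the layer-major layout and a single span in the block-major layout, which is the reduction in transmissions that the paragraph above describes.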

[0101] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the apparatus further includes: a first judgment unit, configured to determine whether the weight parameters corresponding to the target inference model need to be split before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor; a first determination unit, configured to determine, based on the description information, a first number of required first graphics processors and a second number of required second graphics processors if the weight parameters corresponding to the target inference model need to be split; a first splitting unit, configured to split the weight parameters of the target inference model to obtain a first number of first inference models, and determine the first graphics processor based on the first number of first inference models; and a second splitting unit, configured to split the weight parameters of the target inference model to obtain a second number of second inference models, and determine the second graphics processor based on the second number of second inference models.
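
As a rough illustration of the splitting step, the sketch below divides a model's attention heads evenly across a given number of graphics processors, once for the pre-filled side and once for the decoding side. This passage of the disclosure does not state how the weights are divided or how the need for splitting is judged, so the per-head split, the memory-based criterion in `needs_split`, and all names and numbers are assumptions. The step of deriving the two counts from the description information is not modeled.

```python
def needs_split(model_bytes, gpu_free_bytes):
    """Hypothetical criterion: split when the weights do not fit on one GPU."""
    return model_bytes > gpu_free_bytes


def split_heads(num_heads, num_gpus):
    """Assign an equal, contiguous range of attention heads to each GPU."""
    if num_gpus <= 0 or num_heads % num_gpus != 0:
        raise ValueError("num_heads must be divisible by num_gpus")
    per_gpu = num_heads // num_gpus
    return [list(range(g * per_gpu, (g + 1) * per_gpu))
            for g in range(num_gpus)]


# Hypothetical configuration: 8 heads, 2 pre-fill GPUs, 4 decoding GPUs.
if needs_split(model_bytes=140, gpu_free_bytes=80):
    prefill_shards = split_heads(8, 2)   # first number of first inference models
    decode_shards = split_heads(8, 4)    # second number of second inference models
    print(prefill_shards)
    print(decode_shards)
```

When the two counts differ, as in this example, each decoding shard covers only part of what a pre-fill shard holds, so the key-value information cannot simply be forwarded one-to-one between graphics processors.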

[0102] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the apparatus further includes: a second judgment unit, configured to determine whether the number of first graphics processors and the number of second graphics processors are the same before processing the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information; and a first sending unit, configured to send the key-value information in the target video memory area to the second target graphics processor through the first graphics processor if the number of first graphics processors and the number of second graphics processors are the same, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

[0103] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the first sending unit includes: a first determining module, configured to determine, using the first graphics processor, the second target graphics processor based on the mapping relationship between the first graphics processor and the second graphics processor; a second determining module, configured to determine the ID of the first target key-value information required by the second target graphics processor in the target video memory area; a reading module, configured to read the first target key-value information from the target video memory area using the first graphics processor based on the ID of the first target key-value information; and a sending module, configured to send the first target key-value information to the second target graphics processor using the first graphics processor.

[0104] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the apparatus further includes: a first receiving unit, configured to, after the first target key-value information is sent to the second target graphics processor via the first graphics processor, receive the key-value information via the second target graphics processor and determine a video memory address in the second target graphics processor for storing the first target key-value information; and a writing unit, configured to write the first target key-value information to the video memory address via the second target graphics processor.
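
The equal-count path described in the preceding paragraphs (determine the target, look up the block IDs, read, send, then write at a destination address) can be sketched as follows. Python lists stand in for video memory, and a plain slice copy stands in for the device-to-device send. The dictionaries `gpu_mapping`, `needed_ids` and `dst_slots`, along with every function name, are hypothetical and not part of the disclosure.

```python
def read_block(memory, block_id, block_elems):
    """Read one logic block; it is a single contiguous slice by construction."""
    start = block_id * block_elems
    return memory[start:start + block_elems]


def transfer_equal_counts(prefill_mems, decode_mems, gpu_mapping,
                          needed_ids, dst_slots, block_elems):
    """Send each needed block from a pre-fill GPU to its mapped decoding GPU.

    Returns the number of transfers, one per logic block.
    """
    transfers = 0
    for src_rank, dst_rank in gpu_mapping.items():
        for block_id in needed_ids[dst_rank]:
            payload = read_block(prefill_mems[src_rank], block_id, block_elems)
            # The receiver chooses the video memory address (modeled as a slot).
            start = dst_slots[dst_rank][block_id] * block_elems
            decode_mems[dst_rank][start:start + block_elems] = payload
            transfers += 1
    return transfers


# Two pre-fill GPUs and two decoding GPUs, two blocks of four elements each.
prefill_mems = [list(range(8)), list(range(100, 108))]
decode_mems = [[None] * 8, [None] * 8]
count = transfer_equal_counts(
    prefill_mems, decode_mems,
    gpu_mapping={0: 0, 1: 1},
    needed_ids={0: [1], 1: [0, 1]},
    dst_slots={0: {1: 0}, 1: {0: 1, 1: 0}},
    block_elems=4,
)
print(count, decode_mems)
```

Because each logic block is contiguous at the source, one read and one send per block suffice in this model, and the destination slot need not match the source position.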

[0105] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the apparatus further includes: a second determining unit, configured to, after it is determined whether the number of first graphics processors and the number of second graphics processors are the same, determine the ID of the second target key-value information to be sent in the first graphics processor if the two numbers are not the same; a reading unit, configured to read the second target key-value information from the target video memory area based on the ID of the second target key-value information using a parallel copy operator, and write the second target key-value information in parallel into a preset sending buffer; and a second sending unit, configured to send the third target key-value information in the sending buffer to the second graphics processor corresponding to the decoding node through a preset sending function.

[0106] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the apparatus further includes: a second receiving unit, configured to, after the third target key-value information in the sending buffer is sent to the second graphics processor corresponding to the decoding node through the preset sending function, receive the third target key-value information through a preset receiving function and write the third target key-value information into a preset receiving buffer; a third determining unit, configured to determine the fourth target key-value information required by the second graphics processor corresponding to the decoding node; and a third splitting unit, configured to split the key-value information in the receiving buffer through a parallel copy operator to obtain the fourth target key-value information and write the fourth target key-value information into the video memory area of the second graphics processor corresponding to the decoding node.
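
For the case where the two counts differ, the gather, send, receive and split sequence can be sketched as below. This is a simplified model, not the disclosed implementation: sequential loops stand in for the parallel copy operator, passing a list stands in for the preset sending and receiving functions, and the rule that each block is divided into equal slices per decoding graphics processor is an assumption. All names are hypothetical.

```python
def pack_send_buffer(memory, block_ids, block_elems):
    """Gather the selected logic blocks into one contiguous send buffer."""
    buffer = []
    for block_id in block_ids:
        start = block_id * block_elems
        buffer.extend(memory[start:start + block_elems])
    return buffer


def split_receive_buffer(recv_buffer, block_elems, num_parts):
    """Split every received block into num_parts equal slices.

    Slice p of each block is what decoding GPU p is assumed to need.
    """
    if block_elems % num_parts or len(recv_buffer) % block_elems:
        raise ValueError("buffer and block sizes must divide evenly")
    part = block_elems // num_parts
    outputs = [[] for _ in range(num_parts)]
    for base in range(0, len(recv_buffer), block_elems):
        for p in range(num_parts):
            lo = base + p * part
            outputs[p].extend(recv_buffer[lo:lo + part])
    return outputs


# One pre-fill GPU holding three blocks of four elements; blocks 2 and 0 are sent.
prefill_memory = list(range(12))
send_buffer = pack_send_buffer(prefill_memory, [2, 0], block_elems=4)
recv_buffer = list(send_buffer)          # a single send and receive
per_gpu = split_receive_buffer(recv_buffer, block_elems=4, num_parts=2)
print(send_buffer, per_gpu)
```

Packing first means that a single send carries all selected blocks, and the per-processor split happens only after the data has arrived on the decoding side.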

[0107] Optionally, in the large model-based data processing apparatus provided in Embodiment 2 of this disclosure, the apparatus further includes: a partitioning unit, configured to partition the description information into key-value logic blocks according to a preset key-value block size for the first graphics processor before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, so as to obtain the key-value logic block identifier of the characters in the description information; and a fourth determining unit, configured to determine the key-value logic block identifier corresponding to the key-value information to be generated based on the key-value logic block identifier of the characters in the description information.
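
A minimal sketch of the partitioning step, assuming that the characters of the description information are indexed from zero and that the key-value logic block identifier is simply the index divided by the preset block size. The function names and the example block size are illustrative only.

```python
def token_block_ids(num_tokens, block_size):
    """Return the key-value logic block identifier for each token position."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [index // block_size for index in range(num_tokens)]


def group_by_block(tokens, block_size):
    """Group tokens by the logic block their key-value data will belong to."""
    blocks = {}
    for index, token in enumerate(tokens):
        blocks.setdefault(index // block_size, []).append(token)
    return blocks


ids = token_block_ids(5, block_size=2)
groups = group_by_block(list("abcde"), block_size=2)
print(ids, groups)
```

Since the identifiers depend only on token positions and the block size, they can be computed before inference runs, so the storage location of the key-value information to be generated is known in advance.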

[0108] It should be noted that the first processing unit 501, storage unit 502, and second processing unit 503 mentioned above correspond to steps S201 to S203 in Embodiment 1. The three units and their corresponding steps implement the same instances and application scenarios, but are not limited to the content disclosed in Embodiment 1. It should also be noted that the above units, as part of the device, can run in the computer terminal 10 provided in Embodiment 1.

[0109] It should be noted that the preferred implementation schemes involved in the above embodiments of this disclosure are the same as the schemes, application scenarios and implementation processes provided in the foregoing embodiments, but are not limited to the schemes provided in the foregoing embodiments.

[0110] Embodiments of this disclosure can provide an electronic device, which can be any one of a group of electronic devices. Optionally, in this embodiment, the aforementioned electronic device can also be replaced with a terminal device such as a mobile terminal.

[0111] Optionally, in this embodiment, the aforementioned electronic device may be located in at least one of a plurality of network devices in a computer network.

[0112] In this embodiment, the aforementioned electronic device can execute the program code for the following steps in the data processing method based on a large model: processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor; storing the key-value information belonging to the same key-value logic block identifier in the target video memory area of the first graphics processor based on the key-value logic block identifier corresponding to the key-value information, wherein the key-value information belonging to the same key-value logic block identifier is stored at contiguous addresses in the target video memory area; processing the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information.

[0113] The aforementioned electronic device can execute the program code for the following steps in the data processing method based on a large model: before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: determining whether the weight parameters corresponding to the target inference model need to be split; if the weight parameters corresponding to the target inference model need to be split, then based on the description information, determining the first number of required first graphics processors and the second number of required second graphics processors; splitting the weight parameters of the target inference model to obtain the first number of first inference models, and determining the first graphics processor based on the first number of first inference models; splitting the weight parameters of the target inference model to obtain the second number of second inference models, and determining the second graphics processor based on the second number of second inference models.

[0114] The aforementioned electronic device can execute the program code for the following steps in the data processing method based on a large model: before processing the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information, the method further includes: determining whether the number of first graphics processors and the number of second graphics processors are the same; if the number of first graphics processors and the number of second graphics processors are the same, then sending the key-value information in the target video memory area to the second target graphics processor through the first graphics processor, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

[0115] The aforementioned electronic device can execute program code for the following steps in the data processing method based on a large model: sending key-value information in the target video memory area to the second target graphics processor via the first graphics processor includes: determining the second target graphics processor based on the mapping relationship between the first and second graphics processors via the first graphics processor; determining the ID of the first target key-value information in the target video memory area required by the second target graphics processor; reading the first target key-value information from the target video memory area via the first graphics processor based on the ID of the first target key-value information; and sending the first target key-value information to the second target graphics processor via the first graphics processor.

[0116] The aforementioned electronic device can execute program code for the following steps in the data processing method based on a large model: after sending the first target key value information to the second target graphics processor through the first graphics processor, the method further includes: receiving the key value information through the second target graphics processor and determining the video memory address in the second target graphics processor for storing the first target key value information; and writing the first target key value information to the video memory address through the second target graphics processor.

[0117] The aforementioned electronic device can execute the following steps in the data processing method based on a large model: after determining whether the number of first graphics processors is the same as the number of second graphics processors, the method further includes: if the number of first graphics processors is not the same as the number of second graphics processors, determining the ID of the second target key value information to be sent in the first graphics processor; reading the second target key value information from the target video memory area based on the ID of the second target key value information using a parallel copy operator, and writing the second target key value information in parallel into a preset transmission buffer; and sending the third target key value information in the transmission buffer to the second graphics processor corresponding to the decoding node using a preset transmission function.

[0118] The aforementioned electronic device can execute the program code for the following steps in the data processing method based on a large model: after sending the third target key value information in the transmission buffer to the second graphics processor corresponding to the decoding node through a preset transmission function, the method further includes: receiving the third target key value information through a preset reception function and writing the third target key value information into a preset reception buffer; determining the fourth target key value information required by the second graphics processor corresponding to the decoding node; splitting the key value information in the reception buffer through a parallel copy operator to obtain the fourth target key value information, and writing the fourth target key value information into the video memory area of the second graphics processor corresponding to the decoding node.

[0119] The aforementioned electronic device can execute the program code for the following steps in the data processing method based on a large model: before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: for the first graphics processor, dividing the description information into key-value logic blocks according to a preset key-value block size to obtain the key-value logic block identifier of the characters in the description information; and determining the key-value logic block identifier corresponding to the key-value information to be generated based on the key-value logic block identifier of the characters in the description information.

[0120] Optionally, FIG. 6 is a structural block diagram of an electronic device according to an embodiment of the present disclosure. As shown in FIG. 6, the electronic device 60 may include: one or more (only one is shown in FIG. 6) processors 602 and memory 604. The electronic device 60 may also include a storage controller for controlling and managing the memory 604; the electronic device 60 may also include a peripheral interface for connecting to a radio frequency module, an audio module, a display screen, etc.

[0121] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the large-model-based data processing method and apparatus in this embodiment. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby realizing the aforementioned large-model-based data processing method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the electronic device 60 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0122] The processor can invoke the information and application program stored in the memory through the transmission device to perform the following steps: process the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key value information and the first reply character output by the first inference model in the first graphics processor; based on the key value logic block identifier corresponding to the key value information, store the key value information belonging to the same key value logic block identifier in the target video memory area of the first graphics processor, wherein the key value information belonging to the same key value logic block identifier is stored at contiguous addresses in the target video memory area; process the key value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information.

[0123] Optionally, the processor may also execute program code for the following steps: before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: determining whether the weight parameters corresponding to the target inference model need to be split; if the weight parameters corresponding to the target inference model need to be split, then based on the description information, determining the first number of required first graphics processors and the second number of required second graphics processors; splitting the weight parameters of the target inference model to obtain the first number of first inference models, and determining the first graphics processor based on the first number of first inference models; splitting the weight parameters of the target inference model to obtain the second number of second inference models, and determining the second graphics processor based on the second number of second inference models.

[0124] Optionally, the processor may also execute program code for the following steps: before processing the key-value information and the first reply character in the target video memory area through the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information, the method further includes: determining whether the number of the first graphics processors is the same as the number of the second graphics processors; if the number of the first graphics processors is the same as the number of the second graphics processors, then sending the key-value information in the target video memory area to the second target graphics processor through the first graphics processor, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

[0125] Optionally, the processor may also execute program code for the following steps: sending key-value information in the target video memory region to the second target graphics processor via the first graphics processor includes: determining the second target graphics processor based on the mapping relationship between the first and second graphics processors; determining the ID of the first target key-value information in the target video memory region required by the second target graphics processor; reading the first target key-value information from the target video memory region based on the ID of the first target key-value information via the first graphics processor; and sending the first target key-value information to the second target graphics processor via the first graphics processor.

[0126] Optionally, the processor may also execute program code for the following steps: after sending the first target key value information to the second target graphics processor through the first graphics processor, the method further includes: receiving the key value information through the second target graphics processor and determining the video memory address in the second target graphics processor for storing the first target key value information; and writing the first target key value information to the video memory address through the second target graphics processor.

[0127] Optionally, the processor may also execute program code for the following steps: after determining whether the number of first graphics processors is the same as the number of second graphics processors, the method further includes: if the number of first graphics processors is not the same as the number of second graphics processors, determining the ID of the second target key value information to be sent in the first graphics processor; reading the second target key value information from the target video memory area based on the ID of the second target key value information using a parallel copy operator, and writing the second target key value information in parallel into a preset transmission buffer; and sending the third target key value information in the transmission buffer to the second graphics processor corresponding to the decoding node using a preset transmission function.

[0128] Optionally, the processor may also execute program code for the following steps: after sending the third target key value information in the transmission buffer to the second graphics processor corresponding to the decoding node through a preset transmission function, the method further includes: receiving the third target key value information through a preset reception function and writing the third target key value information into a preset reception buffer; determining the fourth target key value information required by the second graphics processor corresponding to the decoding node; splitting the key value information in the reception buffer through a parallel copy operator to obtain the fourth target key value information, and writing the fourth target key value information into the video memory area of the second graphics processor corresponding to the decoding node.

[0129] Optionally, the processor may also execute program code for the following steps: before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node, the method further includes: for the first graphics processor, dividing the description information into key-value logic blocks according to a preset key-value block size to obtain the key-value logic block identifier of the characters in the description information; and determining the key-value logic block identifier corresponding to the key-value information to be generated based on the key-value logic block identifier of the characters in the description information.

[0130] It will be understood by those skilled in the art that the structure shown in FIG. 6 is merely illustrative, and the electronic device 60 may also be a smartphone, tablet computer, PDA, mobile internet device (MID), PAD, or other terminal device. FIG. 6 does not limit the structure of the aforementioned electronic device. For example, the electronic device 60 may also include more or fewer components (such as network interfaces, display devices, etc.) than shown in FIG. 6, or have a different configuration than shown in FIG. 6.

[0131] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0132] Embodiments of this disclosure also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store the program code executed by the large-model-based data processing method provided in Embodiment 1.

[0133] Embodiments of this disclosure also provide a computer program product. Optionally, in this embodiment, the computer program product can be used to store the program code executed by the large-model-based data processing method provided in Embodiment 1.

[0134] Optionally, in this embodiment, the computer program product may be located in any computer terminal in a group of computer terminals in a computer network, or in any mobile terminal in a group of mobile terminals.

[0135] The sequence numbers of the embodiments disclosed above are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0136] In the above embodiments of this disclosure, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0137] In the several embodiments provided in this disclosure, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some interfaces; indirect couplings or communication connections between units or modules may be electrical or other forms.

[0138] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0139] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0140] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.

[0141] The above description is only a preferred embodiment of this disclosure. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principles of this disclosure, and these improvements and modifications should also be considered within the scope of protection of this disclosure.

Claims

1. A data processing method based on a large model, comprising: processing description information input by a target object through a first inference model in a first graphics processor corresponding to a pre-filled node, to obtain key-value information and a first reply character output by the first inference model in the first graphics processor; storing, based on a key-value logic block identifier corresponding to the key-value information, the key-value information belonging to the same key-value logic block identifier in a target video memory area of the first graphics processor, wherein the key-value information belonging to the same key-value logic block identifier is stored at contiguous addresses in the target video memory area; and processing the key-value information and the first reply character in the target video memory area through a second inference model in a second graphics processor corresponding to a decoding node, to obtain target reply information corresponding to the description information.

2. The method according to claim 1, wherein, before processing the description information through the first inference model in the first graphics processor corresponding to the pre-filled node to obtain the key-value information and the first reply character output by the first inference model in the first graphics processor, the method further includes: determining whether the weight parameters corresponding to the target inference model need to be split; and if the weight parameters corresponding to the target inference model need to be split, splitting the weight parameters of the target inference model to obtain a first number of first inference models and determining the first graphics processor based on the first number of first inference models, and splitting the weight parameters of the target inference model to obtain a second number of second inference models and determining the second graphics processor based on the second number of second inference models.

3. The method according to claim 2, wherein, before splitting the weight parameters of the target inference model, the method further includes: determining, based on the description information, the first number of required first graphics processors and the second number of required second graphics processors.

4. The method according to claim 1, wherein, before processing the key-value information and the first reply character in the target video memory area by means of the second inference model in the second graphics processor corresponding to the decoding node to obtain the target reply information corresponding to the description information, the method further comprises: determining whether the number of first graphics processors is the same as the number of second graphics processors; and, if the number of first graphics processors is the same as the number of second graphics processors, sending the key-value information in the target video memory area to a second target graphics processor by means of the first graphics processor, wherein the second target graphics processor is the second graphics processor corresponding to the decoding node that needs to decode the key-value information stored in the first graphics processor.

5. The method according to claim 4, wherein sending the key-value information in the target video memory area to the second target graphics processor by means of the first graphics processor comprises: reading, by the first graphics processor, first target key-value information from the target video memory area based on an ID, in the target video memory area, of the first target key-value information required by the second target graphics processor; and sending the first target key-value information to the second target graphics processor by means of the first graphics processor.

6. The method according to claim 5, wherein the method further comprises: determining, by the first graphics processor and based on a mapping relationship between the first graphics processor and the second graphics processor, the second target graphics processor, and determining the ID, in the target video memory area, of the first target key-value information required by the second target graphics processor.

7. The method according to claim 6, wherein, after sending the first target key-value information to the second target graphics processor by means of the first graphics processor, the method further comprises: receiving the first target key-value information by means of the second target graphics processor, and determining a video memory address in the second target graphics processor for storing the first target key-value information; and writing the first target key-value information to the video memory address by means of the second target graphics processor.
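
The equal-count path of claims 4 to 7 can be sketched as a small host-side simulation. The identity rank mapping, the helper names `map_prefill_to_decode` and `read_block`, the class `DecodeGPU`, and the bump-style address allocation are all assumptions introduced for illustration; byte arrays stand in for video memory and no real device transfer is performed.

```python
def map_prefill_to_decode(prefill_rank: int, num_prefill: int, num_decode: int) -> int:
    """Assumed mapping: with equal counts, rank i sends to rank i."""
    if num_prefill != num_decode:
        raise ValueError("this path applies only when the counts are equal")
    return prefill_rank


def read_block(memory: bytes, block_id: int, block_bytes: int) -> bytes:
    """Read one block by ID; it is a single slice because blocks are contiguous."""
    start = block_id * block_bytes
    return memory[start:start + block_bytes]


class DecodeGPU:
    """Stand-in for a second target graphics processor."""

    def __init__(self, capacity_blocks: int, block_bytes: int) -> None:
        self.block_bytes = block_bytes
        self.memory = bytearray(capacity_blocks * block_bytes)
        self.address_of: dict[int, int] = {}  # block ID -> local address
        self._next_free = 0

    def receive(self, block_id: int, payload: bytes) -> int:
        """Pick a local address for the block, write it, and return the address."""
        if len(payload) != self.block_bytes:
            raise ValueError("payload must be exactly one block")
        address = self._next_free * self.block_bytes
        if address + self.block_bytes > len(self.memory):
            raise MemoryError("no free block slot left")
        self.memory[address:address + self.block_bytes] = payload
        self.address_of[block_id] = address
        self._next_free += 1
        return address


prefill_memory = bytes(range(12))            # three 4-byte blocks
decode = DecodeGPU(capacity_blocks=2, block_bytes=4)

target_rank = map_prefill_to_decode(prefill_rank=1, num_prefill=2, num_decode=2)
payload = read_block(prefill_memory, block_id=2, block_bytes=4)
address = decode.receive(block_id=2, payload=payload)
```

In this sketch the sender needs only the block ID to locate the data, and the receiver alone decides where the block lands in its own memory.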

8. The method according to claim 5, wherein, after determining whether the number of first graphics processors is the same as the number of second graphics processors, the method further comprises: if the number of first graphics processors is different from the number of second graphics processors, determining an ID of second target key-value information to be sent in the first graphics processor; reading, by means of a parallelized copy operator, the second target key-value information from the target video memory area based on the ID of the second target key-value information, and writing the second target key-value information into a preset sending buffer in parallel; and sending third target key-value information in the sending buffer to the second graphics processor corresponding to the decoding node by means of a preset sending function.

9. The method according to claim 8, wherein, after sending the third target key-value information in the sending buffer to the second graphics processor corresponding to the decoding node by means of the preset sending function, the method further comprises: receiving the third target key-value information by means of a preset receiving function, and writing the third target key-value information into a preset receiving buffer; determining fourth target key-value information required by the second graphics processor corresponding to the decoding node; and splitting the key-value information in the receiving buffer by means of a parallelized copy operator to obtain the fourth target key-value information, and writing the fourth target key-value information into a video memory area of the second graphics processor corresponding to the decoding node.
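
The unequal-count path of claims 8 and 9 amounts to a gather, a single send, and a scatter. The sketch below simulates it on the host with byte arrays. The function names, the plain copy used in place of the preset sending and receiving functions, and the rule that each decoding rank takes an equal slice of every block are assumptions for illustration; the sequential loops stand in for the parallelized copy operator, which would run on the graphics processor.

```python
def gather_blocks(memory: bytes, block_ids: list[int], block_bytes: int) -> bytearray:
    """Copy the requested blocks into one contiguous sending buffer."""
    send_buffer = bytearray(len(block_ids) * block_bytes)
    for slot, block_id in enumerate(block_ids):
        start = block_id * block_bytes
        send_buffer[slot * block_bytes:(slot + 1) * block_bytes] = \
            memory[start:start + block_bytes]
    return send_buffer


def send_and_receive(send_buffer: bytearray) -> bytes:
    """Stand-in for the preset send/receive pair: one transfer of the whole buffer."""
    return bytes(send_buffer)


def scatter_for_rank(recv_buffer: bytes, block_bytes: int,
                     rank: int, num_ranks: int) -> bytearray:
    """Extract the slice of every received block that one decoding rank needs."""
    if block_bytes % num_ranks != 0:
        raise ValueError("block size must be divisible by the rank count")
    part = block_bytes // num_ranks
    out = bytearray()
    for slot in range(len(recv_buffer) // block_bytes):
        start = slot * block_bytes + rank * part
        out += recv_buffer[start:start + part]
    return out


prefill_memory = bytes(range(16))                       # four 4-byte blocks
send_buffer = gather_blocks(prefill_memory, [2, 0], 4)  # blocks to be sent
recv_buffer = send_and_receive(send_buffer)
rank0 = scatter_for_rank(recv_buffer, block_bytes=4, rank=0, num_ranks=2)
rank1 = scatter_for_rank(recv_buffer, block_bytes=4, rank=1, num_ranks=2)
```

Staging the blocks in one buffer means the transfer itself is a single call regardless of how many blocks are sent, at the cost of one extra copy on each side.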

10. The method according to claim 1, wherein, before processing the description information by means of the first inference model in the first graphics processor corresponding to the pre-filled node, the method further comprises: for the first graphics processor, dividing the description information into key-value logic blocks according to a preset key-value block size, to obtain key-value logic block identifiers of characters in the description information; and determining, based on the key-value logic block identifiers of the characters in the description information, the key-value logic block identifier corresponding to the key-value information to be generated.
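
The block division in claim 10 can be sketched as integer division of each character position by the block size. The function names and the choice of zero-based, position-derived identifiers are assumptions for illustration; the claim does not state how identifiers are numbered.

```python
def block_ids_for_tokens(num_tokens: int, block_size: int) -> list[int]:
    """Assign each character position to a key-value logic block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [position // block_size for position in range(num_tokens)]


def blocks_to_generate(num_tokens: int, block_size: int) -> list[int]:
    """Distinct block identifiers whose key-value information will be produced."""
    return sorted(set(block_ids_for_tokens(num_tokens, block_size)))


per_token = block_ids_for_tokens(num_tokens=5, block_size=2)
needed = blocks_to_generate(num_tokens=5, block_size=2)
```

With five characters and a block size of two, the last block is only partly filled, which is why the identifier list is derived from the characters rather than from a fixed block count.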

11. The method according to claim 2, wherein, for a long-input, short-output task, the first number of first graphics processors is greater than the second number of second graphics processors; and for a short-input, long-output task, the first number of first graphics processors is less than the second number of second graphics processors.

12. The method according to claim 2, wherein, if the weight parameters corresponding to the target inference model do not need to be split, the target inference model serves as both the first inference model in the first graphics processor corresponding to the pre-filled node and the second inference model in the second graphics processor corresponding to the decoding node.

13. The method according to claim 12, wherein, when the weight parameters corresponding to the target inference model do not need to be split, one pre-filled node corresponds to one first graphics processor, and one decoding node corresponds to one second graphics processor.

14. The method according to claim 12, wherein, when the weight parameters corresponding to the target inference model do not need to be split, the tensor parallelism degree of the pre-filled node is the same as the tensor parallelism degree of the decoding node.

15. The method according to claim 4, wherein, before sending the key-value information in the target video memory area to the second target graphics processor by means of the first graphics processor, the method further comprises: when the weight parameters corresponding to the target inference model need to be split, determining a mapping relationship between the first graphics processor and the second graphics processor based on the splitting of the weight parameters.

16. The method according to claim 8, wherein the preset size of the sending buffer is determined based on the size of the target inference model, the video memory resources of the first graphics processor, and the video memory resources of the second graphics processor.

17. A data processing apparatus based on a large model, comprising: a first processing unit configured to process description information input by a target object by means of a first inference model in a first graphics processor corresponding to a pre-filled node, to obtain key-value information and a first reply character output by the first inference model in the first graphics processor; a storage unit configured to store, based on a key-value logic block identifier corresponding to the key-value information, the key-value information belonging to the same key-value logic block identifier in a target video memory area of the first graphics processor, wherein the key-value information belonging to the same key-value logic block identifier is stored at consecutive addresses in the target video memory area; and a second processing unit configured to process the key-value information and the first reply character in the target video memory area by means of a second inference model in a second graphics processor corresponding to a decoding node, to obtain target reply information corresponding to the description information.

18. A computer-readable storage medium comprising a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the data processing method based on a large model according to any one of claims 1 to 16.

19. An electronic device, comprising: a memory storing an executable program; and a processor configured to run the program, wherein the program, when run, performs the data processing method based on a large model according to any one of claims 1 to 16.

20. A computer program product comprising a stored computer program that, when executed by a processor, implements the large-model-based data processing method of any one of claims 1 to 16.