A data processing method, device and electronic equipment
By constructing a multi-specification static computation graph library and dynamically selecting the appropriate static computation graph, the problems of resource waste and low memory efficiency of static computation graphs in terminal devices are solved, thereby improving the utilization of computing resources and processing performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LENOVO (BEIJING) LTD
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing static computation graphs in terminal devices suffer from limitations in computational resource utilization and processing performance due to fixed sequence length constraints, resulting in wasted computational resources and low memory efficiency.
Multiple static computation graph libraries with different context capacities are constructed. The appropriate static computation graph is dynamically selected for processing based on the actual sequence length. The model computation parameters are shared, and seamless switching is achieved by using a shared key-value cache, avoiding duplicate computation and storage redundancy.
It improves the utilization rate and processing performance of computing resources, reduces first-word latency and overall response time, and achieves dynamic matching of computing resources and processing tasks.
Description
Technical Field
[0001] This application relates to the field of computer technology, and more specifically to a data processing method, apparatus, and electronic device.
Background Technology
[0002] With the application of artificial intelligence technology in terminal devices, these devices are often equipped with dedicated hardware accelerators such as Neural Processing Units (NPUs) to achieve high-efficiency artificial intelligence (AI) computing. However, to achieve high execution efficiency and low power consumption, such hardware usually executes models in the form of static computation graphs.
[0003] Under the constraints of static computation graphs, the maximum sequence length that can be processed is usually set to a specific value, and the corresponding static computation graph is compiled and generated once based on this maximum length for execution. This results in limited utilization of computing resources and processing performance in relevant application scenarios.
Summary of the Invention
[0004] In view of the above, this application provides the following technical solution:
[0005] A data processing method, comprising:
[0006] In response to the target model performing inference tasks, obtain the length information of the current sequence to be processed;
[0007] Based on the length information, a target static computation graph is determined in a static computation graph library, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model;
[0008] The target static computation graph is invoked to process the sequence to be processed, and the model inference result corresponding to the sequence to be processed is obtained.
[0009] A data processing apparatus, comprising:
[0010] The acquisition unit is used to acquire the length information of the current sequence to be processed in response to the target model performing inference tasks.
[0011] A determining unit is configured to determine a target static computation graph in a static computation graph library based on the length information, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model;
[0012] The calling unit is used to call the target static computation graph to process the sequence to be processed and obtain the model inference result corresponding to the sequence to be processed.
[0013] An electronic device, comprising:
[0014] A memory for storing computer programs and the data generated by the execution of said computer programs;
[0015] A processor for executing the computer program to achieve:
[0016] In response to the target model performing inference tasks, obtain the length information of the current sequence to be processed;
[0017] Based on the length information, a target static computation graph is determined in a static computation graph library, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model;
[0018] The target static computation graph is invoked to process the sequence to be processed, and the model inference result corresponding to the sequence to be processed is obtained.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0020] Figure 1 is a flowchart of a data processing method provided in an embodiment of this application;
[0021] Figure 2 is a flowchart of a method for constructing a static computation graph library provided in an embodiment of this application;
[0022] Figure 3 is a data processing flowchart for a target model application scenario provided in an embodiment of this application;
[0023] Figure 4 is a schematic structural diagram of a data processing apparatus provided in an embodiment of this application.
Detailed Implementation
[0024] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0025] The terms "first" and "second," etc., used in this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may also include steps or units not listed.
[0026] This application provides a data processing method, apparatus, and electronic device. The data processing method can be applied to the inference data processing of large language models on edge devices, such as edge devices equipped with hardware accelerators like Neural Processing Units (NPUs), including smartphones, tablets, smart home devices, and edge computing nodes. This method constructs a graph library containing multiple static computation graphs of different specifications and dynamically selects the most suitable static computation graph for execution during inference based on the length of the actual processing sequence. This maintains the advantages of static graph execution while achieving dynamic matching of computing resources with actual needs.
[0027] Figure 1 illustrates a data processing method provided in an embodiment of this application, which may include the following steps:
[0028] S101. In response to the target model performing inference tasks, obtain the length information of the current sequence to be processed.
[0029] The target model refers to a neural network model that requires sequence processing, including language models based on the Transformer architecture (such as generative pre-trained transform models), multimodal models (such as vision-language models), visual models (such as image classification and object detection models), speech models (such as speech recognition and speech synthesis models), and other deep learning models that require sequence processing. The inference task refers to the process by which the target model generates a corresponding output sequence based on the input sequence. Examples include language models generating response text based on text prompts, visual models generating classification results or detection boxes based on image sequences, and multimodal models generating cross-modal understanding results based on text and image input.
[0030] The length information of the current sequence to be processed refers to the actual length of the sequence that the target model needs to process. This length is usually measured in units of tokens. A token is the basic unit of text that the target model processes; it can be a word or a character. It should be noted that the specific meaning of this length information may differ at different stages of model inference. For example, in the pre-filling stage, the sequence to be processed is the input prompt sequence, and its length is the number of tokens in the input prompt sequence. In the decoding and generation stage, the sequence to be processed is the concatenated sequence of the generated sequence and the input prompt sequence, and its length is the sum of the number of tokens in both sequences.
[0031] For example, a user opens a dialogue assistant application on their smartphone based on a target model (such as a language model) and inputs a 50-word sequence, such as "Introduce the main tourist attractions of city A...". In response to the start of this dialogue task, the electronic device first obtains the length of the current sequence to be processed, i.e., the length of the input sequence, 50. As the model begins to generate responses word by word, the length of the current sequence to be processed is updated with each word generated. For example, when the model has generated 10 words of response content, the length of the current sequence to be processed is updated to 60 (the 50-word input sequence plus the 10 generated words); when 20 words are generated, the length is updated to 70, and so on.
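The length bookkeeping in this example can be sketched in a few lines. This is an illustrative sketch only; the function name `current_length` is hypothetical and does not appear in the application.

```python
def current_length(prompt_tokens: int, generated_tokens: int) -> int:
    """Length of the sequence to be processed.

    Pre-filling stage: generated_tokens is 0, so the length is the prompt length.
    Decoding stage: the length is the prompt length plus the tokens generated so far.
    """
    return prompt_tokens + generated_tokens


prompt_len = 50  # the 50-token input from the example above
prefill_len = current_length(prompt_len, 0)
decode_lens = [current_length(prompt_len, n) for n in (10, 20)]
```

With the values from the example, the length is 50 at pre-filling, 60 after 10 generated tokens, and 70 after 20.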
[0032] In this step, by obtaining the actual length information of the current sequence to be processed, basic data is provided for the subsequent dynamic selection of the appropriate static computation graph based on the sequence length. This enables the entire scheduling process to make decisions based on the real-time load of the target model, solving the problem of inaccurate preset assumptions or fixed processing methods regarding the sequence length.
[0033] S102. Based on the length information, determine the target static computation graph in the static computation graph library.
[0034] The static computation graph library is a pre-built collection of multiple static computation graphs. A static computation graph refers to a computation graph whose computational structure, tensor shapes, and data flow are completely determined during the compilation phase and cannot be changed during execution. Since dedicated hardware accelerators such as neural processing units typically require models to be executed in the form of static computation graphs to achieve efficient execution and low power consumption, this application embodiment uses static computation graphs as the hardware execution carrier for the model.
[0035] The static computation graph library includes multiple static computation graphs corresponding to different context capacities. Context capacity refers to the maximum sequence length that a static computation graph can handle, such as 128, 256, 512, 1024, etc. Each static computation graph represents the computational structure and data flow of the target model under its corresponding context capacity. That is, the graph defines how the model's computational nodes (such as matrix multiplication MatMul, attention mechanisms, etc.) are organized and how data is transferred between nodes when processing sequences not exceeding its capacity limit. It should be noted that although the structures of the various static computation graphs differ due to different processing capacities—for example, the tensor dimension of the key-value cache (KV cache) and the size of the input tensor will adjust with changes in capacity—they share the same set of model computation parameters for the target model. That is, core parameters such as model weights and biases are shared. This ensures that the model's knowledge and capabilities remain consistent regardless of which graph is used for inference, and does not incur significant storage overhead from constructing multiple static graphs.
[0036] After obtaining the length information of the current sequence to be processed, the electronic device determines the target static computation graph from the static computation graph library based on this length information. Specifically, it can select a static computation graph from the static computation graph library that can accommodate the current sequence length as the target static computation graph. For example, it can select a static computation graph with a context capacity no less than the current length and the smallest capacity to achieve a precise match between computing resources and actual needs.
[0037] Continuing the dialogue assistant example above: in the pre-filling stage, the current sequence length to be processed is 50. The static computation graph library has four pre-set static computation graphs with context capacities of 128, 256, 512, and 1024. The electronic device determines the target static computation graph from the library based on the length 50. Since 50 does not exceed 128, the smallest available capacity, the static computation graph with a capacity of 128 is selected as the target static computation graph. As the model enters the decoding and generation stage, when the generated content reaches 79 tokens, the current total length becomes 129 (50 + 79). At this point, the length of 129 exceeds the processing capacity of the currently used static graph with a capacity of 128. The electronic device reselects based on the length of 129. Since 129 is greater than 128 and does not exceed 256, it switches to the static computation graph with a capacity of 256 as the new target static computation graph.
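The selection rule described above (the smallest context capacity that is no less than the current length) can be sketched as a binary search over the sorted capacities. This is a minimal sketch, not the application's implementation: the names `CAPACITIES` and `select_capacity` are hypothetical, and raising an error for over-long sequences is an assumption, since the text does not say how that case is handled.

```python
from bisect import bisect_left

# Context capacities from the example, sorted in ascending order.
CAPACITIES = [128, 256, 512, 1024]


def select_capacity(length: int, capacities=CAPACITIES) -> int:
    """Return the smallest capacity that is >= length."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    # bisect_left finds the first position whose capacity is >= length.
    i = bisect_left(capacities, length)
    if i == len(capacities):
        # Assumption: sequences longer than the largest graph are rejected.
        raise ValueError(f"length {length} exceeds the largest capacity")
    return capacities[i]
```

For the lengths in the example, `select_capacity(50)` gives 128 and `select_capacity(129)` gives 256; a length of exactly 128 still fits the 128-capacity graph.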
[0038] In this step, a suitable static computation graph is dynamically selected from a multi-specification static graph library based on real-time length information, achieving dynamic matching between computational resources and the scale of the processing task. Compared to the approach of using a single maximum static graph, this application can solve the problems of wasted computational resources and invalid padding computation caused by executing large-scale static graphs when processing short sequences, and can also solve the problem of low storage efficiency caused by reserving fixed memory for the maximum context length.
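As a rough illustration of the padding saved, the share of sequence positions that hold only padding can be compared for a single maximum graph and for the matched graph. This is a simplified, slot-level estimate under the assumption that every position up to the graph capacity is computed; it does not model actual NPU cost. The function name `padded_fraction` is hypothetical.

```python
def padded_fraction(length: int, capacity: int) -> float:
    """Fraction of the graph's sequence positions that hold only padding."""
    if not 1 <= length <= capacity:
        raise ValueError("length must be between 1 and capacity")
    return (capacity - length) / capacity


# A 50-token prompt on a single 1024-capacity graph versus the matched 128-capacity graph.
single_max_graph = padded_fraction(50, 1024)  # 974 of 1024 positions are padding
matched_graph = padded_fraction(50, 128)      # 78 of 128 positions are padding
```

Under this estimate, roughly 95% of positions are padding with the single maximum graph, against roughly 61% with the 128-capacity graph.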
[0039] S103. Call the target static computation graph to process the sequence to be processed and obtain the model inference results corresponding to the sequence to be processed.
[0040] After the target static computation graph is determined, the electronic device calls this graph to process the sequence to be processed and obtain the corresponding model inference result. This process corresponds to the complete flow of target model inference, including a pre-filling stage and a decoding generation stage. In the pre-filling stage, the target static computation graph performs parallel computation on the sequence to be processed (such as an input prompt sequence) to generate an initial key-value cache. The key-value cache is an intermediate computation result in the target model inference process, used to store the key and value vectors of the generated lexical units so that they can be reused in the subsequent decoding generation stage to avoid redundant computation. In the decoding generation stage, the target static computation graph decodes word by word based on the generated key-value cache to generate new lexical units until the termination condition is met (such as generating an end symbol, reaching the maximum generation length, etc.), and finally outputs the complete word sequence as the model inference result. Since the target static computation graph is dynamically selected according to the current sequence length, its computational structure and memory layout match the scale of the current processing task, thus enabling computation to be performed in the most efficient way. At the same time, since all static computation graphs share the same set of model parameters, switching graphs does not require reloading the model weights; only the graph definition needs to be switched, resulting in extremely low switching overhead.
[0041] For example, in the dialogue assistant example above, a static computation graph with a capacity of 128 was selected during the pre-filling stage. The electronic device calls this graph to process an input prompt sequence of length 50. This graph uses its internal computation nodes, such as matrix multiplication and attention nodes, to perform parallel computations on the input, generating an initial key-value cache, which is then written to a shared key-value cache area. Subsequently, in the decoding and generation stage, when 79 tokens have been generated and the total length reaches 129, the electronic device switches to a static computation graph with a capacity of 256. The newly loaded graph directly reads the key-value data of the existing 129 tokens from the shared key-value cache area and continues to decode and generate the 130th and subsequent tokens based on this data, ultimately outputting the complete answer text as the model's inference result. Throughout this process, the user experiences a smooth dialogue, while the system uses two computation graphs in succession (capacity 128, then capacity 256), switching according to the processing progress and achieving efficient utilization of computing resources.
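The pre-filling, decoding, and graph-switching flow can be simulated without any real model. The sketch below tracks only which capacity is active and how many entries the shared key-value cache holds. All names (`SharedKVCache`, `run_inference`, `select_capacity`) are hypothetical, and the exact point at which the switch is triggered (before the position that would overflow the active graph is processed) is an assumption.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

CAPACITIES = [128, 256, 512, 1024]


@dataclass
class SharedKVCache:
    """Stand-in for the shared key-value cache area: one entry per processed token."""
    entries: list = field(default_factory=list)


def select_capacity(length: int) -> int:
    """Smallest capacity that can hold `length` tokens."""
    i = bisect_left(CAPACITIES, length)
    if i == len(CAPACITIES):
        raise ValueError("sequence exceeds the largest graph")
    return CAPACITIES[i]


def run_inference(prompt_len: int, max_new_tokens: int):
    cache = SharedKVCache()
    active = select_capacity(prompt_len)
    used = [active]  # capacities of the graphs used, in order
    # Pre-filling: the whole prompt is processed in one pass and fills the cache.
    cache.entries.extend(range(prompt_len))
    # Decoding: one token per step; the cache is kept across graph switches.
    for _ in range(max_new_tokens):
        next_len = len(cache.entries) + 1
        needed = select_capacity(next_len)
        if needed != active:
            # Switch graphs; weights and cache are shared, so nothing is copied.
            active = needed
            used.append(active)
        cache.entries.append(next_len - 1)
    return used, len(cache.entries)


used_graphs, cached_tokens = run_inference(prompt_len=50, max_new_tokens=100)
```

In this run the 128-capacity graph handles lengths up to 128 and the 256-capacity graph takes over from length 129, with the cache carried across the switch.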
[0042] This application provides a data processing method that obtains the length information of the current sequence to be processed in response to the inference task performed by the target model. Based on this length information, a target static computation graph is determined in a static computation graph library, and the target static computation graph is invoked to process the sequence to be processed to obtain the model inference result. The static computation graph library includes multiple static computation graphs corresponding to different context capacities. These static graphs represent the computational structure and data flow of the target model at their respective capacities and share the same set of model computation parameters. This method dynamically selects the appropriate static computation graph for inference based on the actual processing sequence length. While meeting the static graph execution requirements of the hardware accelerator, it solves the problems of wasted computational resources and low memory efficiency caused by using a single maximum static graph, improving the inference performance and resource utilization of the target model on the edge device, and reducing first-word latency and overall response time.
[0043] The following describes, with reference to relevant application scenarios, the technical features and possible implementations of the embodiment illustrated in Figure 1. It should be noted that the specific implementation methods described in the related embodiments are not intended to limit the scope of protection of this application, but are exemplary descriptions of one or more preferred implementation paths based on the current application scenario. Furthermore, the implementation methods of some steps or features can be adaptively adjusted or replaced according to actual application needs or differences in device performance.
[0044] Figure 2 is a flowchart of a method for constructing a static computation graph library according to an embodiment of this application. The construction process of the static computation graph library mainly includes the following steps:
[0045] S201. Based on the statistical distribution of context length in the target application scenario, determine at least two context capacity ranges.
[0046] Before constructing a static computation graph library, it is necessary to determine the context capacity range corresponding to each static computation graph. The context capacity range refers to the interval of sequence lengths that the static computation graph can handle, such as [1-128], [129-256], [257-512], etc. Each context capacity range has a corresponding upper limit value, which is the actual value used when compiling the static computation graph and determines the tensor shapes and memory layout sizes in the graph. In this embodiment, the context capacity range can be determined based on the statistical distribution of context lengths in the target application scenario. The target application scenario refers to the actual business scenario in which the method is applied, such as intelligent dialogue assistants, text summarization, code completion, machine translation, etc. Different application scenarios have different context length distribution characteristics. Some scenarios consist mainly of short text interactions, where user input is usually short; others need to process long documents or long dialogues, where the context length may be large. By statistically analyzing the historical context length data in the target application scenario, the pattern of the length distribution can be obtained, such as a normal distribution or a long-tail distribution, and the context capacity ranges that cover most usage scenarios can be determined accordingly.
[0047] In this embodiment, the electronic device or development platform can first collect context length sample data in the target application scenario, then perform statistical analysis on these data, plot the length distribution curve, or calculate the probability density of each length interval. Based on the analysis results, at least two context capacity ranges are determined, such that each range can cover a certain proportion of samples, and the union of all ranges can cover most actual usage scenarios. A bucket strategy can typically be used to divide continuous length intervals into several discrete ranges, such as dividing the lengths into [1-128], [129-256], [257-512], [513-1024], [1025-2048], etc., according to business requirements. The size of each range can be the same, or it can be differentiated according to the distribution density, such as setting a smaller range in densely sampled areas to improve resource utilization efficiency, and setting a larger range in sparsely sampled areas to reduce the number of graphs.
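One possible bucket strategy is sketched below. It doubles the capacity upper limit until a chosen proportion of the observed lengths is covered. The doubling rule, the 95% coverage target, the starting capacity of 128, and the function name `capacity_ranges` are all assumptions for illustration; the application leaves the exact division to business requirements.

```python
import math


def capacity_ranges(samples, coverage=0.95, min_capacity=128):
    """Return (low, high) ranges whose union covers `coverage` of the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Length below which the requested proportion of samples falls.
    idx = min(len(ordered) - 1, math.ceil(coverage * len(ordered)) - 1)
    target = ordered[idx]
    ranges, low, cap = [], 1, min_capacity
    while True:
        ranges.append((low, cap))
        # Stop once the target is covered; keep at least two ranges, as in S201.
        if cap >= target and len(ranges) >= 2:
            break
        low, cap = cap + 1, cap * 2
    return ranges


# Hypothetical long-tailed sample of context lengths (in tokens).
history = [40] * 70 + [200] * 20 + [900] * 8 + [3000] * 2
ranges = capacity_ranges(history)
```

For this sample, the 95% point is 900 tokens, so the sketch stops at a capacity of 1024 and leaves the rare 3000-token contexts uncovered. Density-aware bucketing (narrower ranges where samples are dense) would need a different rule.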
[0048] In this step, the contextual capacity range is determined based on statistical distribution, ensuring that the generation of the static computational graph library is data-supported and business-specific, maximizing resource utilization efficiency without affecting user experience. Different application scenarios can flexibly adjust the capacity range division according to their own characteristics, improving the adaptability and configurability of this application.
[0049] S202. Based on each context capacity range, compile and generate the corresponding static computation graph.
[0050] Each static computation graph shares the model computation parameters of the target model, and each static computation graph is used to process data sequences whose lengths fall within the corresponding context capacity range.
[0051] After determining the context capacity range, a corresponding static computation graph is compiled and generated for each range. In one implementation of this application, the length-related parameter configurations in the static computation graph, such as tensor shape and memory layout, can be determined based on the capacity upper limit of each context capacity range. Then, based on these parameter configurations and the model computation parameters of the target model, the static computation graph is generated by the compiler. In this approach, the static computation graphs corresponding to different context capacity ranges typically have different tensor shapes and memory layouts, but share the same set of model computation parameters. In another implementation, a template-based compilation technique can be used. A basic computation graph template is first generated, and then the variable dimensions in the template are parametrically expanded according to different capacity upper limits to generate multiple specific instances. This approach can reduce the compilation workload and improve the efficiency of graph library construction.
[0052] It should be noted that regardless of the implementation method used to compile and generate static computation graphs, a static computation graph with a fixed computational structure that can be efficiently executed on the target hardware is generated for each context capacity range, and these graphs can share the model computation parameters of the target model, avoiding duplicate storage.
[0053] In one specific implementation of this application, the process of compiling and generating a corresponding static computation graph based on each context capacity range may include the following steps: determining the input-output tensor shapes and memory layouts of computation nodes in the static computation graph according to the upper limit of the context capacity range; and generating a static computation graph with the target computation structure based on the input-output tensor shapes, memory layouts, and model computation parameters of the target model. The static computation graphs corresponding to different context capacity ranges have different input-output tensor shapes and memory layouts.
[0054] Computational nodes refer to the basic operational units that make up a static computation graph, such as matrix multiplication nodes, attention mechanism nodes, layer normalization nodes, and activation function nodes. The input and output tensor shapes refer to the dimensions of the input tensor received by each computational node and the dimensions of its output tensor. Since the capacity upper limit differs between context capacity ranges, the tensor shapes related to sequence length in the static computation graph change accordingly. For example, in attention computational nodes, the first dimension of the key-value cache tensor is usually directly related to the sequence length: for a static computation graph with a capacity upper limit of 128, its key-value cache tensor might be defined with the shape [128, d_k] (where d_k is the dimension of the key vector); for a static computation graph with a capacity upper limit of 256, the key-value cache tensor at the same location is defined with the shape [256, d_k]. Memory layout refers to how data is stored in physical memory, including data arrangement order, alignment, address offset, etc. The compiler optimizes the memory layout of tensors of different sizes according to the memory access characteristics of the target hardware (such as an NPU) to ensure that data is stored contiguously in memory and aligned to appropriate boundaries, thereby improving memory access efficiency.
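How the length-dependent shapes vary with the capacity upper limit, while the weights stay shared, can be sketched as a parametric expansion of one template. This is illustrative only: `GraphSpec`, `expand_template`, the naming scheme, the key dimension of 64, the 64-byte alignment, and the weights file name are assumptions, and no real compiler is invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSpec:
    """Length-dependent configuration of one static computation graph."""
    name: str
    capacity: int
    kv_cache_shape: tuple  # (capacity, d_k)
    alignment_bytes: int
    weights_ref: str       # every graph points at the same parameter file


def expand_template(upper_limits, d_k=64, alignment_bytes=64,
                    weights_ref="model_weights.bin"):
    """Instantiate one spec per capacity upper limit from a common template."""
    return [
        GraphSpec(
            name=f"graph_{cap}",
            capacity=cap,
            kv_cache_shape=(cap, d_k),
            alignment_bytes=alignment_bytes,
            weights_ref=weights_ref,
        )
        for cap in upper_limits
    ]


specs = expand_template([128, 256, 512, 1024])
```

Only the sequence-length dimension differs between the specs; the reference to the model parameters is identical in all of them.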
[0055] Based on the defined input / output tensor shapes, memory layout, and model computation parameters of the target model, a static computation graph with the target computation structure is generated. Model computation parameters refer to the core weight parameters of the target model, including weight matrices and bias vectors for each layer; these parameters are shared across static computation graphs of different capacities. During generation, the compiler first loads the model computation parameters of the target model, and then, based on the defined tensor shape and memory layout, instantiates the model's computation logic into a specific sequence of computation instructions. The compiler also optimizes the computation graph, for example, by merging multiple consecutive small operators into a large computation kernel to reduce data transfer overhead, optimizing the storage location of intermediate results through memory reuse to reduce memory usage, and optimizing the computation order through pipeline scheduling to improve hardware utilization. The final generated static computation graph has a fixed computation structure; that is, the number of computation nodes, the connections between nodes, the data flow, and the memory access patterns are all predetermined and cannot be changed at runtime.
[0056] For example, continuing from the previous dialogue assistant example, given the four defined context capacity ranges [1-128], [129-256], [257-512], and [513-1024], the development team needs to compile and generate the corresponding static computation graphs. Taking G_128 as an example, the compiler first determines the shape of all length-related tensors in the graph based on the capacity limit of 128: in the attention calculation module, the key-value cache tensor shape is set to [128, 64] (assuming the key vector dimension is 64); in the matrix multiplication module, the input tensor dimension related to the sequence length is also set to 128. Simultaneously, the compiler optimizes the tensor memory layout based on the memory access characteristics of the target NPU, for example, requiring the key-value cache tensor to be stored in contiguous memory in row-major order and ensuring that the starting address is aligned to 64 bytes. Then, the compiler loads the shared weight parameters of the target model, which may be provided in binary format, containing billions of floating-point numbers. The compiler combines computational logic with tensor shapes, memory layout, and weight parameters, performing optimizations such as operator fusion and memory reuse, ultimately generating a static computation graph file G_128.npu that can be directly executed on the NPU. Similarly, the key-value cache tensor shape of G_256 is set to [256, 64], that of G_512 to [512, 64], and that of G_1024 to [1024, 64], but their weight parameters are exactly the same as those of G_128, all coming from the same model parameter file. Thus, the total storage space of the four static computation graph files is approximately equal to the size of one model parameter file plus four graph structure metadata files, and far less than the size of four complete models.
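The storage comparison at the end of this example is simple arithmetic, sketched below. The byte counts are hypothetical placeholders chosen only to show the calculation; the application gives no concrete sizes.

```python
def shared_library_size(weights_bytes, graph_meta_bytes):
    """One copy of the weights plus the structure metadata of every graph."""
    return weights_bytes + sum(graph_meta_bytes)


def unshared_size(weights_bytes, graph_meta_bytes):
    """Baseline in which every graph carries its own full copy of the weights."""
    return sum(weights_bytes + meta for meta in graph_meta_bytes)


# Hypothetical sizes: a 4 GB parameter file and four graph-structure files.
weights = 4_000_000_000
metadata = [2_000_000, 3_000_000, 5_000_000, 9_000_000]

shared_total = shared_library_size(weights, metadata)
unshared_total = unshared_size(weights, metadata)
```

With these placeholder sizes, the shared library takes about 4.02 GB, against about 16.02 GB if each of the four graphs stored its own weights.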
[0057] Through the compilation process described above, each static computation graph is determined as a computational executor optimized for a specific capacity range, capable of efficiently processing sequences at runtime with its designed capacity as the upper limit. When the actual sequence length falls within the graph's capacity range, all computation nodes in the graph execute at a precisely fitted size, avoiding invalid fill computations. When the sequence length exceeds the current graph's capacity and a switch is required, because all graphs share weight parameters and key-value caches are managed uniformly, the switch process only requires loading the new graph structure without migrating weight data, achieving a low-overhead, seamless transition.
[0058] S203. Generate a static computation graph library based on each static computation graph.
[0059] After compiling the static computation graphs corresponding to all context capacity ranges, these static computation graphs are integrated together to form a static computation graph library. The static computation graph library can be generated in the form of file system directories, database records, memory index tables, etc., in which each static computation graph establishes a mapping relationship with its corresponding context capacity range, so that it can be quickly retrieved and loaded at runtime based on length information.
[0060] Generating the static computation graph library also includes configuring the necessary metadata for each static computation graph, such as the graph identifier, context capacity range, capacity limit, graph file storage path, and memory address offset parameters. This metadata is used during runtime scheduling to quickly locate the target graph and configure its execution environment. Furthermore, the static computation graph library can also contain an index table or lookup function that, when given length information, returns either the identifier of a suitable static computation graph or the graph object directly.
[0061] For example, in the example above, the development team stores the four static computation graph files G_128, G_256, G_512, and G_1024 generated by compilation in a specified directory, and creates the following index table: the index corresponding to the context capacity range [1-128] can be G_128.npu; the index corresponding to the context capacity range [129-256] can be G_256.npu; the index corresponding to the context capacity range [257-512] can be G_512.npu; and the index corresponding to the context capacity range [513-1024] can be G_1024.npu.
[0062] Simultaneously, a capacity limit (128, 256, 512, 1024) is recorded for each static computation graph for subsequent shared key-value cache configuration and access range control. This index table and metadata can be packaged into a library configuration file and deployed to the edge device along with the static computation graph files. During device runtime, when the scheduler needs to select a target static computation graph based on length information, it can directly query this index table to obtain the corresponding graph file path, load it, and execute it.
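The index table and its lookup described above could be represented as follows. This is a minimal sketch rather than the implementation of this application: the `GraphEntry` record, the `LIBRARY_INDEX` list, and the `lookup` function are hypothetical names, and only the file names and capacity ranges are taken from the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GraphEntry:
    graph_id: str   # graph identifier
    low: int        # lower bound of the context capacity range
    high: int       # upper bound, i.e. the capacity limit
    path: str       # graph file storage path


# Index table from the example: capacity range -> graph file.
LIBRARY_INDEX = [
    GraphEntry("G_128", 1, 128, "G_128.npu"),
    GraphEntry("G_256", 129, 256, "G_256.npu"),
    GraphEntry("G_512", 257, 512, "G_512.npu"),
    GraphEntry("G_1024", 513, 1024, "G_1024.npu"),
]


def lookup(length: int) -> Optional[GraphEntry]:
    """Return the entry whose capacity range contains `length`, if any."""
    for entry in LIBRARY_INDEX:
        if entry.low <= length <= entry.high:
            return entry
    return None  # length is outside every configured range


print(lookup(50))
print(lookup(129))
```

A linear scan is sufficient here because the library contains only a handful of graphs; a scheduler could equally keep the table sorted and use binary search.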
[0063] By constructing a static computation graph library in the above manner, the number of static computation graphs in the library is limited and matches the business scenario, avoiding redundant storage and waste of computing resources. At the same time, by sharing model computation parameters among the graphs, efficient use of storage space is achieved, providing basic support for dynamic scheduling of adapted static computation graphs based on the actual sequence length during subsequent runtime.
[0064] This application embodiment also provides a method for configuring a shared key-value cache, which may include the following steps:
[0065] S301. Configure a shared key-value cache.
[0066] The total capacity of the shared key-value cache must be no less than the largest upper limit among the at least two context capacity ranges. The key-value cache (KV cache) is the core intermediate data in the target model's inference process and stores the key vectors and value vectors of the tokens already processed. During autoregressive decoding, each time a new token is generated, the attention distribution needs to be calculated based on the cached keys and values of all previous tokens. Caching these intermediate results avoids redundant calculations and significantly improves inference efficiency.
[0067] In this embodiment, based on the construction of a static computation graph library, a unified shared key-value cache area is further configured. This shared key-value cache area is a physically contiguous storage region allocated in memory (e.g., dynamic random access memory DRAM) to store key-value cache data generated by all static computation graphs during execution. The total capacity of the shared key-value cache area needs to be determined based on the maximum capacity limit of all context capacity ranges in the static computation graph library; that is, its total capacity is not less than the space required for key-value cache corresponding to the maximum capacity limit. For example, if the static computation graph library contains four static computation graphs with context capacities of 128, 256, 512, and 1024, and the maximum capacity limit is 1024, then the total capacity of the shared key-value cache area must be able to accommodate at least all key-value cache data corresponding to a sequence of length 1024.
[0068] It's important to note that the capacity of the shared key-value cache can be configured to equal the maximum capacity limit, or slightly larger to provide some buffer margin. Regardless of the specific value, its main characteristic is that this cache is a unified physical storage area accessed by all static computation graphs, rather than allocating a separate cache for each static computation graph. By configuring a unified, large-capacity shared cache, a unified physical storage foundation is provided for subsequent key-value cache reads and writes across all static computation graphs, resolving the issues of data redundancy and migration overhead between multiple caches.
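The sizing rule above can be made concrete with a small calculation. The sketch assumes a single layer, a single key-value head, a vector dimension of 64, and 2 bytes per element; these parameters and the function name `kv_cache_bytes` are illustrative assumptions, not values stated in this application.

```python
def kv_cache_bytes(capacity_limit, head_dim=64, num_layers=1,
                   num_kv_heads=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for `capacity_limit` tokens."""
    # The leading factor 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * capacity_limit * head_dim * bytes_per_elem


capacity_limits = [128, 256, 512, 1024]

# One shared cache sized for the maximum capacity limit.
shared = kv_cache_bytes(max(capacity_limits))

# For comparison: a separate cache allocated for every graph.
separate = sum(kv_cache_bytes(c) for c in capacity_limits)

print(shared)    # 262144
print(separate)  # 491520
```

With these assumed parameters the shared cache needs 256 KiB, whereas four separate caches would need 480 KiB in total.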
[0069] S302. Configure each static computation graph in the static computation graph library to access the shared key-value cache, and ensure that the operation range of each static computation graph on the shared key-value cache does not exceed its corresponding context capacity range.
[0070] After configuring the shared key-value cache, it is necessary to associate each static computation graph in the static computation graph library with the cache to ensure that each static computation graph can correctly read and write data in the cache during execution, while limiting its operations to a range that matches its capacity.
[0071] Specifically, during compilation, each static computation graph's internal read and write operations on the key-value cache are configured to point to this unified shared cache area. However, because different static computation graphs have different context capacities, the range of the cache area they can actually access also differs. For example, a static computation graph G_128 with a context capacity of 128 is configured to access only the storage locations corresponding to the first 128 tokens in the shared cache area (such as the region corresponding to address offsets 0 to 127); a static computation graph G_256 with a context capacity of 256 is configured to access the storage locations corresponding to the first 256 tokens in the shared cache area (such as the region corresponding to address offsets 0 to 255), and so on. This control of the access range can be implemented in various ways, such as setting an address offset upper limit parameter for each static computation graph, or hardcoding the access range directly into the instructions of the static computation graph during compilation.
[0072] It should be noted that although the access ranges of the static computation graphs differ in length, they are all contiguous intervals within the same physical cache, and each interval begins at the cache's starting address, so a smaller graph's interval is a prefix of a larger graph's interval. This ensures that when switching from a smaller static computation graph to a larger one, the cached data already written by the smaller static computation graph is still located at the beginning of the larger static computation graph's accessible range. The larger static computation graph can then directly read this data without any data migration or format conversion.
[0073] In this embodiment, all static computation graphs share the same cache, avoiding the memory waste caused by allocating a separate cache for each graph. When switching between static computation graphs of different capacities, since the cache is shared and the access range of each graph is contiguous, the generated key-value cache data is retained and can be directly reused in the new graph, achieving zero data migration during the switching process. Because the access range of each graph is limited by its capacity, the security of cache access is ensured, and the problem of data corruption caused by out-of-bounds operations can also be solved.
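The prefix-style access ranges can be sketched with a single buffer and bounded views over it. In the sketch below, the slot size of 4 bytes per token and the helper names `access_window` and `write_slot` are hypothetical; a real cache would hold key and value vectors rather than 4-byte payloads.

```python
SLOT_BYTES = 4        # assumed bytes per token slot, for illustration only
MAX_CAPACITY = 1024   # maximum capacity limit in the library

# One physically contiguous buffer shared by every graph.
shared_cache = bytearray(MAX_CAPACITY * SLOT_BYTES)


def access_window(capacity_limit):
    """Return a writable view over the first `capacity_limit` token slots."""
    if capacity_limit > MAX_CAPACITY:
        raise ValueError("capacity exceeds the shared cache")
    # Slicing a memoryview does not copy; the view aliases shared_cache.
    return memoryview(shared_cache)[: capacity_limit * SLOT_BYTES]


def write_slot(window, position, payload):
    """Write one token slot, rejecting positions outside the window."""
    start = position * SLOT_BYTES
    if position < 0 or start + SLOT_BYTES > len(window):
        raise IndexError("position outside this graph's access range")
    window[start:start + SLOT_BYTES] = payload


g128 = access_window(128)   # slots 0..127
g256 = access_window(256)   # slots 0..255

# G_128 writes its last slot; G_256 sees the same bytes without any copy.
write_slot(g128, 127, b"\x01\x02\x03\x04")
print(bytes(g256[127 * SLOT_BYTES:128 * SLOT_BYTES]))
```

Because both views alias the same buffer and start at offset 0, data written through the smaller window is immediately visible through the larger one, and a write to slot 128 through the G_128 window is rejected.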
[0074] Based on the above data processing method, embodiments of this application also provide a method for dynamically scheduling a static computation graph, which may include the following steps:
[0075] S401. During the process of calling the target static computation graph for processing, monitor the changes in the length information of the current sequence to be processed.
[0076] The electronic device continuously monitors changes in the length of the current sequence to be processed, i.e., it tracks the dynamic growth of the sequence length in real time. During the target model's decoding and generation process, the sequence length increases by 1 for each new token generated. The electronic device needs to obtain the updated length information after each generation operation or before each decoding step and compare it with the capacity of the currently executing static computation graph. This monitoring can be implemented in various ways, such as through counters or callback functions, and this application embodiment does not specifically limit the methods used.
[0077] S402. In response to detecting that the length of the sequence to be processed has increased and the increased length exceeds the context capacity range corresponding to the target static computation graph currently being called, a first static computation graph is determined in the static computation graph library based on the increased length.
[0078] For example, the current static computation graph G_128 has a context capacity range of [1-128], with a capacity limit of 128. When the length increases to 129, it exceeds the processing capacity of G_128, and it is necessary to switch to a static computation graph that can accommodate the new length. That is, based on the increased length, a first static computation graph is determined in the static computation graph library to serve as the new target static computation graph.
[0079] There are several ways to determine the first static computation graph based on the increased length. One possible implementation is to select the static computation graph whose context capacity range can cover the increased length and has the smallest upper capacity value. For example, when the length reaches 129, G_256 with an upper capacity value of 256 is selected as the first static computation graph. Another implementation is to select any static computation graph with an upper capacity value greater than the increased length, such as directly selecting the graph with the largest capacity. This application does not limit the specific selection strategy, and a detailed description of the preferred method will be provided later. Here, it is only necessary to explain that the electronic device determines a new static computation graph in the graph library that can accommodate the current sequence based on the increased length.
[0080] S403. Switch the current processing task from the target static computation graph to the first static computation graph for processing.
[0081] After determining the first static computation graph, the electronic device switches the current processing task from the currently executing target static computation graph to the first static computation graph to continue execution. The switching process needs to ensure the continuity of the task and the integrity of the state, that is, after the switch, the new graph can seamlessly continue the processing progress of the old graph and continue to generate subsequent tokens.
[0082] There are several ways to implement the switching process. One possible approach is to pause the execution of the current static computation graph before switching, record the execution progress of the current processing task (e.g., the position already processed, the intermediate results already generated, etc.), then load the new static computation graph (such as the first static computation graph), and resume execution from the breakpoint based on the recorded execution progress and the configuration information of the new graph. Another approach is to utilize the characteristics of a shared key-value cache, where the new graph directly reads the existing key-value data in the cache and continues execution in conjunction with the current processing position information. Detailed implementation details of the switching process will be provided in subsequent embodiments of this application and will not be elaborated upon here.
[0083] Through the dynamic scheduling method described above, this embodiment of the application achieves automatic switching of the static computation graph based on real-time changes in sequence length during inference. This ensures that the target model always uses a static computation graph that matches the current sequence length, avoiding resource waste caused by using a large-capacity static computation graph for small sequences while guaranteeing smooth expansion for large sequences, thus solving the capacity limitation problem of a single static graph.
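Steps S401 to S403 can be simulated with a short loop. The sketch below assumes the smallest-covering-graph strategy described as one possible implementation above; the function names `smallest_covering` and `simulate_decoding` are hypothetical, and actual graph loading is replaced by bookkeeping.

```python
import bisect

CAPACITY_LIMITS = [128, 256, 512, 1024]  # sorted capacity limits in the library


def smallest_covering(length):
    """Smallest capacity limit that is >= length."""
    i = bisect.bisect_left(CAPACITY_LIMITS, length)
    if i == len(CAPACITY_LIMITS):
        raise ValueError("sequence exceeds the largest graph in the library")
    return CAPACITY_LIMITS[i]


def simulate_decoding(prompt_len, new_tokens):
    """Return the list of (length, old_capacity, new_capacity) switches."""
    current = smallest_covering(prompt_len)
    length = prompt_len
    switches = []
    for _ in range(new_tokens):
        length += 1                          # S401: length grows by one per token
        if length > current:                 # S402: capacity exceeded
            first = smallest_covering(length)
            switches.append((length, current, first))
            current = first                  # S403: continue on the new graph
    return switches


print(simulate_decoding(prompt_len=120, new_tokens=140))
# [(129, 128, 256), (257, 256, 512)]
```

Starting from a 120-token prompt and generating 140 tokens, the simulated scheduler switches twice: at length 129 and again at length 257.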
[0084] Furthermore, in this embodiment of the application, monitoring changes in the length information of the current sequence to be processed can be achieved by adopting corresponding monitoring methods in the pre-filling stage and the decoding generation stage.
[0085] During the pre-filling phase, the length of the input prompt sequence is monitored; changes in the length of the input prompt sequence are identified as changes in the length information of the sequence to be processed. The pre-filling phase is the first stage of the target model's inference process. It refers to the process where the target model performs a one-time calculation on the user's original input prompt sequence to generate initial KV cache data. At this stage, the sequence to be processed is only the input prompt sequence; there is no generated token sequence. The input prompt sequence refers to the original instructions, questions, text, etc., input by the user to the target model, serving as the starting data for inference. After receiving the user's inference request and entering the pre-filling phase, the electronic device tokenizes the input prompt sequence and calculates its initial length. If the input prompt sequence involves segmented input or content completion, its length will change. The monitoring module calculates the changed prompt sequence length in real time and directly uses this change as the change in the length information of the sequence to be processed, without adding other sequence lengths.
[0086] During the decoding and generation phase, the length of the generated sequence is monitored; the change in the total length of the generated sequence and the input prompt sequence is determined as the change in the length information of the sequence to be processed. The decoding and generation phase is the second stage of the target model's inference process. It refers to the process by which the target model generates a new token sequence word by word based on the initial KV cache data generated in the pre-filling phase. The sequence to be processed in this stage is the total sequence of the input prompt sequence and the generated sequence. The generated sequence refers to the token sequence generated word by word by the target model in the decoding and generation phase, and its length increases continuously as the inference process progresses.
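The two monitoring modes can be captured by a small counter object. This is a sketch under the assumption that monitoring is done with counters, which the text names as one option; the class `LengthMonitor` and its method names are hypothetical.

```python
class LengthMonitor:
    """Tracks the length of the sequence to be processed in both phases."""

    def __init__(self):
        self.phase = "prefill"
        self.prompt_len = 0
        self.generated_len = 0

    def on_prompt_update(self, prompt_token_count):
        """Pre-filling: the prompt length alone is the monitored length."""
        if self.phase != "prefill":
            raise RuntimeError("prompt can only change during pre-filling")
        self.prompt_len = prompt_token_count
        return self.current_length()

    def start_decoding(self):
        self.phase = "decode"

    def on_token_generated(self):
        """Decoding: each generated token extends the total length by one."""
        if self.phase != "decode":
            raise RuntimeError("tokens are only generated during decoding")
        self.generated_len += 1
        return self.current_length()

    def current_length(self):
        if self.phase == "prefill":
            return self.prompt_len
        return self.prompt_len + self.generated_len


monitor = LengthMonitor()
monitor.on_prompt_update(50)      # initial prompt
monitor.on_prompt_update(70)      # prompt extended by segmented input
monitor.start_decoding()
print(monitor.on_token_generated())  # 71
```

The returned length is what a scheduler would compare against the capacity limit of the graph currently in use.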
[0087] Correspondingly, in one implementation of this application embodiment, a target static computation graph is determined in a static computation graph library based on length information; the target static computation graph is invoked to process the sequence to be processed, and the model inference result corresponding to the sequence to be processed is obtained, including:
[0088] During the pre-filling stage, a second static computation graph is determined from the static computation graph library based on the length of the input prompt sequence. The second static computation graph is then called to perform pre-filling calculations on the input prompt sequence and generate an initial key-value cache.
[0089] Specifically, the electronic device first obtains the length of the input prompt sequence and, based on this length, selects a static computation graph from the static computation graph library that can accommodate the input prompt sequence; this is denoted as the second static computation graph. In one implementation, the static computation graph with the smallest upper limit of its context capacity that can cover the length of the input prompt sequence is selected as the second static computation graph. For example, if the length of the input prompt sequence is 50, and the static computation graph library contains static computation graphs with capacities of 128, 256, 512, and 1024, then the static computation graph with a capacity of 128 is selected as the second static computation graph. This minimizes the consumption of computational resources while meeting processing requirements and avoids resource waste caused by using excessively large graphs.
[0090] After the second static computation graph is determined, the electronic device calls this graph to perform pre-filling calculations on the input prompt sequence. During the calculation process, the second static computation graph processes each word in the input prompt sequence according to its internally defined computation structure, calculates the key vector and value vector for each word, and stores these key-value data in a key-value cache. Because it is a parallel computation, all words in the input prompt sequence can be processed simultaneously, generating a complete initial key-value cache in one go.
[0091] The generated initial key-value cache is written to the shared key-value cache area. Based on the configuration of the shared cache area, the second static computation graph stores the key-value data within a range corresponding to its capacity. For example, if the capacity of the second static computation graph is 128, the key-value data it writes will be stored in the storage locations corresponding to the first 128 tokens in the shared cache area, with each location corresponding one-to-one with the token index of the input prompt sequence. Through the above pre-filling stage, the initial computation can be efficiently completed with a static computation graph matching the length of the input sequence, generating an initial key-value cache that can be reused in subsequent decoding stages, effectively reducing first-token latency.
[0092] During the decoding phase, a third static computation graph is determined from the static computation graph library based on the total length of the current sequence. The third static computation graph is then called to perform word-by-word decoding based on the initial key-value cache, and the generated word sequence is output as the inference result of the model.
[0093] At the start of the decoding generation phase, the total length of the current sequence is the sum of the length of the input prompt sequence and the length of the already generated sequence. Based on this total length, the electronic device determines a static computation graph from the static computation graph library that can accommodate the current sequence; this is denoted as the third static computation graph. It should be noted that this third static computation graph can be the same as the second static computation graph from the pre-filling phase, or it can be a different static computation graph. If the total length of the current sequence is still within the capacity of the second static computation graph, then that graph continues to be used; if it exceeds the capacity, then a graph with a larger capacity needs to be switched to.
[0094] After the third static computation graph is determined, the electronic device calls this graph to perform token-by-token decoding and generation. During the decoding process, the third static computation graph generates a new token at each step, based on the key-value data already stored in the shared key-value cache: first, a query vector is calculated based on the most recently generated token; then, attention is calculated against all existing key vectors in the cache to obtain the attention distribution; next, the existing value vectors in the cache are weighted and summed based on the attention distribution; finally, the next token is predicted through the output layer. Each time a new token is generated, its corresponding key and value vectors are also stored in the corresponding positions in the shared key-value cache, updating the cache content.
[0095] The decoding and generation process continues until a preset termination condition is met, such as generating an end symbol or reaching the maximum generation length. When generation ends, the electronic device outputs the generated complete word sequence as the model's inference result. Through the above-described decoding and generation stages, the static computation graph can be adapted in real time according to the dynamic changes in sequence length, ensuring continuous generation while achieving efficient utilization of computational resources, and ultimately outputting a complete inference result.
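The attention part of a single decoding step can be sketched in plain Python. The sketch assumes standard scaled dot-product attention with a softmax, which the text does not spell out, and a single attention head; the function name `attention_step` is hypothetical, and the output layer that predicts the next token is omitted.

```python
import math


def attention_step(query, cached_keys, cached_values):
    """Attend from one query vector over all cached key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    # Similarity of the query with every cached key.
    scores = [scale * sum(q * k for q, k in zip(query, key))
              for key in cached_keys]
    # Numerically stable softmax gives the attention distribution.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached value vectors.
    dim = len(cached_values[0])
    output = [sum(w * value[j] for w, value in zip(weights, cached_values))
              for j in range(dim)]
    return weights, output


# Two cached tokens with identical keys receive equal attention.
weights, output = attention_step(
    query=[1.0, 0.0],
    cached_keys=[[1.0, 0.0], [1.0, 0.0]],
    cached_values=[[2.0, 0.0], [4.0, 0.0]],
)
print(weights, output)
```

In a static computation graph the loop bounds would be fixed by the graph's capacity limit, with unused cache positions masked out; the sketch simply iterates over the entries that exist.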
[0096] This application also provides a static computation graph switching method, which switches the current processing task from the target static computation graph to the first static computation graph for processing. The process may include the following steps:
[0097] S501. Pause the execution of the current processing task by the target static computation graph and obtain the execution progress of the current processing task.
[0098] When the current sequence length exceeds the capacity of the currently used static computation graph, requiring a switch to a new static computation graph, the execution of the current static computation graph must first be paused. Pausing execution means stopping the current static computation graph's computational operations on the processing task, putting it into an interruptible state for subsequent switching. Simultaneously, the execution progress of the current processing task needs to be recorded, including the position the task has reached and the intermediate result data already generated. The position reached refers to the word positions processed during sequence generation, such as the number of words already generated and the word currently being generated. This position information determines where the new static computation graph should continue execution after the switch. The intermediate result data refers to data that has been computed but not yet output or still needs to be used in subsequent computations, mainly including generated key-value cache data. This data is the foundation of the decoding generation process; the new static computation graph needs this data to correctly generate subsequent words.
[0099] It should be noted that, in this embodiment of the application, since a shared key-value cache is configured, the generated intermediate result data is actually stored in the shared cache. Therefore, when recording the execution progress, the generated intermediate result data can be indirectly obtained by recording its storage location and length in the shared cache, without the need for data copying.
[0100] S502. Load the first static computation graph.
[0101] After obtaining the current execution progress, the electronic device loads a new static computation graph, namely the first static computation graph. The loading process includes reading the file or memory image of the first static computation graph from the static computation graph library, loading it into the execution unit of the NPU or other hardware accelerator, and making it ready for execution.
[0102] When loading the first static computation graph, since it shares the same set of model computation parameters as the target static computation graph, there is no need to reload the model weights; only the graph structure definition and capacity-related metadata need to be loaded. This makes the loading process very lightweight with extremely low switching overhead.
[0103] S503. Based on the context capacity range corresponding to the target static computation graph, determine the data access range of the first static computation graph in the shared key-value cache.
[0104] After loading the first static computation graph, it is necessary to determine the range of data that the graph can access in the shared key-value cache. Since the shared key-value cache is a unified physical storage area, the access permissions of each static computation graph to it are limited by its corresponding context capacity range. In this embodiment, the determined data access range is based not only on the capacity of the first static computation graph itself, but also on the range of data already written by the target static computation graph, to ensure that the new graph can correctly read existing cached data.
[0105] Specifically, during execution, the target static computation graph has already written a portion of key-value cache data into the shared cache. This data is located within a contiguous region in the cache starting from the starting address, with a length equal to the length of the sequence already processed by the target static computation graph. The first static computation graph needs to be able to access this existing data while simultaneously being able to write subsequently generated data. Therefore, the data access range of the first static computation graph in the shared cache is defined as a contiguous region starting from the starting address and ending at the address corresponding to its own capacity limit. This range includes all the data already written by the target static computation graph, because the capacity of the target static computation graph is smaller than that of the first static computation graph, and its written data is located at the beginning of the cache.
[0106] S504. Configure the first static computation graph to read cached key-value data from the data access range.
[0107] After determining the data access range, the first static computation graph is configured to read cached key-value data from that range. The configuration process may include setting memory access address offsets, configuring DMA (Direct Memory Access) transfer parameters, and updating the address mapping of the memory management unit (MMU), etc. The specific implementation method can be determined based on the architecture of the target hardware.
[0108] Once configured, the first static computation graph has the ability to access existing data in the shared cache. This existing data consists of intermediate results generated previously by the target static computation graph, including the key and value vectors of all processed tokens, and is the foundation necessary for continued decoding and generation. For example, the electronic device configures G_256 to read key-value data from positions 0 to 255 in the shared cache. At this point, G_256 can directly access the data already written by G_128 in positions 0 to 127 without any data copying or migration.
[0109] S505. Based on the read key-value data and execution progress, control the first static computation graph to continue executing the current processing task.
[0110] After configuration, the electronic device, based on the key-value data read from the shared cache and the previously recorded execution progress, controls the first static computation graph to continue executing the current processing task. Specifically, the first static computation graph starts from the position after the "processed position" recorded in the execution progress, uses the existing key-value data in the cache as the context for attention calculation, and generates subsequent tokens one by one.
[0111] Because the first static computation graph can directly access all existing data, and the execution progress indicates where to continue generation, the switching process has minimal impact on subsequent generation, achieving a seamless transition. From the user's perspective, the model's response is continuous, with no perceived interruption or delay.
[0112] Through the switching processing method described above, this embodiment of the application achieves seamless switching between static computation graphs of different capacities. Due to the existence of a shared key-value cache, there is no need to migrate a large amount of intermediate data during the switching process; only the execution progress needs to be recorded and the access configuration of the new graph needs to be adjusted, resulting in extremely low switching overhead. This enables the dynamic scheduling mechanism to operate efficiently in practical applications, ensuring both the continuity of model inference and the dynamic adaptation of computing resources.
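Steps S501 to S505 can be summarized as a bookkeeping routine. The sketch below models graphs as plain records and uses 0-based token positions with half-open ranges; the `Graph` and `Progress` classes and the `switch_graph` function are hypothetical, and hardware operations such as NPU loading, DMA setup, or MMU updates are represented only by field assignments.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    next_position: int   # position at which generation resumes
    cache_start: int     # start of valid data in the shared cache
    cache_length: int    # number of valid token slots


@dataclass
class Graph:
    name: str
    capacity_limit: int
    loaded: bool = False
    access_range: tuple = (0, 0)   # half-open [start, end) in token slots


def switch_graph(target, first, tokens_in_cache):
    """Move the current task from `target` to the larger graph `first`."""
    if tokens_in_cache > target.capacity_limit:
        raise ValueError("target graph cannot have written that many tokens")
    if first.capacity_limit <= tokens_in_cache:
        raise ValueError("first graph cannot hold the next token")
    # S501: pause the target graph and record progress by location and length
    # only, since the data itself already lives in the shared cache.
    target.loaded = False
    progress = Progress(next_position=tokens_in_cache,
                        cache_start=0, cache_length=tokens_in_cache)
    # S502: load the new graph; the shared weights are not reloaded.
    first.loaded = True
    # S503: the access range starts at the cache start and ends at the new
    # graph's capacity limit, so it covers everything the target wrote.
    access_range = (0, first.capacity_limit)
    # S504: configure the new graph to read from that range.
    first.access_range = access_range
    # S505: the caller resumes generation at progress.next_position.
    return progress


g128 = Graph("G_128", 128, loaded=True)
g256 = Graph("G_256", 256)
progress = switch_graph(g128, g256, tokens_in_cache=128)
print(progress, g256.access_range)
```

No cache bytes are moved at any step; only the progress record and the access configuration of the new graph change.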
[0113] In one implementation of this application, in order to minimize computational resource consumption and avoid resource waste and invalid fill computation caused by using an excessively large static computation graph while meeting sequence processing requirements, the method described above for determining the target static computation graph in the static computation graph library based on the length information can be implemented in the following preferred manner:
[0114] In the static computation graph library, candidate static computation graphs whose context capacity range has an upper limit not less than the length value corresponding to the length information are selected; among the candidate static computation graphs, the one with the smallest context capacity range is determined as the target static computation graph.
[0115] Specifically, after obtaining the length information of the current sequence to be processed, the electronic device first traverses all static computation graphs in the static computation graph library and compares their corresponding context capacity range with the current length information. The filtering criterion is that the context capacity range of the static computation graph can cover the current length, that is, the upper limit of the capacity range is greater than or equal to the current length value. Through this filtering, static computation graphs whose capacity is insufficient to accommodate the current sequence can be eliminated, ensuring that the selected candidate graphs have the basic ability to process the current sequence.
[0116] Suppose the static computation graph library contains four static computation graphs with corresponding context capacity ranges of [1-128], [129-256], [257-512], and [513-1024], and capacity limits of 128, 256, 512, and 1024, respectively. If the length of the current sequence to be processed is 50, then the candidate static computation graphs whose capacity range can cover 50 are selected as follows: [1-128] (upper limit 128 ≥ 50), [129-256] (upper limit 256 ≥ 50), [257-512] (upper limit 512 ≥ 50), and [513-1024] (upper limit 1024 ≥ 50). That is, all four graphs meet the selection criteria. If the current length is 150, the candidate static computation graphs whose capacity range can cover 150 are selected as [129-256] (upper limit 256 ≥ 150), [257-512] (upper limit 512 ≥ 150), and [513-1024] (upper limit 1024 ≥ 150), while [1-128] (upper limit 128 < 150) is excluded. If the current length is 300, the candidate static computation graphs selected are [257-512] (upper limit 512 ≥ 300) and [513-1024] (upper limit 1024 ≥ 300), while [1-128] and [129-256] are excluded.
[0117] After selecting candidate static computation graphs, the static computation graph with the smallest context capacity range among the candidate static graphs is determined as the target static computation graph. In this embodiment, "smallest context capacity range" refers to the smallest upper limit capacity, i.e., the smallest specification graph that can just meet the current length requirement. This ensures that the processing task is executed using the smallest capacity static computation graph that can accommodate the current sequence, avoiding resource waste caused by using excessively large graphs. For example, if the current length is 50, the candidate static computation graphs include all four graphs, and the one with the smallest capacity range is [1-128] (upper limit 128), so the corresponding G_128 is determined as the target static computation graph. If the current length is 150, the candidate static computation graphs include [129-256], [257-512], and [513-1024], and the one with the smallest capacity range is [129-256] (upper limit 256), so the corresponding G_256 is determined as the target static computation graph. If the current length is 300, the candidate static computation graphs include [257-512] and [513-1024], among which the smallest capacity range is [257-512] (upper limit 512), so the corresponding G_512 is determined as the target static computation graph.
[0118] Through the above screening and selection process, the target static computation graph is both capable of processing the current sequence and the smallest of all capable static computation graphs. This minimizes computational resource consumption while meeting processing requirements, avoiding the unnecessary padding computations and memory access overhead caused by using excessively large graphs. This reduces inference latency and power consumption, and provides a basis for smooth transitions during subsequent dynamic switching.
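The screening and minimum-fit selection described above can be sketched as follows. This is a minimal illustration only, assuming the four upper capacity limits from the example; the names `CAPACITIES`, `candidate_graphs`, and `select_target_graph` are hypothetical and do not come from the application.

```python
from bisect import bisect_left

# Upper capacity limits from the worked example; the list must stay sorted.
CAPACITIES = [128, 256, 512, 1024]

def candidate_graphs(length, capacities=CAPACITIES):
    """Screening step: keep every capacity whose upper limit covers the length."""
    return [cap for cap in capacities if cap >= length]

def select_target_graph(length, capacities=CAPACITIES):
    """Selection step: return the smallest capacity that covers the length."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    idx = bisect_left(capacities, length)  # index of the first capacity >= length
    if idx == len(capacities):
        raise ValueError(f"length {length} exceeds the largest graph ({capacities[-1]})")
    return capacities[idx]
```

Because the capacities are sorted, the binary search gives the same result as filtering the candidates and taking their minimum, without traversing the whole library.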
[0119] Figure 3 illustrates a data processing flowchart for a target model application scenario provided by an embodiment of this application. In this application scenario, the target model can be a text generation model used to understand long documents and generate concise and accurate summaries. Taking a document summarization application running on an edge device as an example, Figure 3 shows the specific processing flow of dynamically scheduling the static computation graph based on the context length during the complete inference process, from the user uploading a long document to the model outputting the summary content.
[0120] In a document summarization application, a user uploads a technical document that needs summarization via a device (such as a tablet). The document content is tokenized to obtain an input sequence. Assume that the document, after tokenization, yields an input sequence of 380 tokens. Upon receiving this input sequence, the electronic device first determines the context length corresponding to the input sequence, which is 380.
[0121] Based on the determined context length of 380, the electronic device determines a static computation graph A from a pre-built static computation graph library that can accommodate this length. The static computation graph library contains multiple static computation graphs corresponding to different context capacity ranges, such as graphs with capacities of 128, 256, 512, and 1024. According to the minimum-fit filtering mode, the electronic device selects the static computation graph with a capacity of 512 as the current execution graph, because 380 is greater than 256 but not greater than 512, so the graph with a capacity of 512 is the smallest graph that can accommodate this length. The electronic device then performs a static computation graph switching process: loading static computation graph A into the NPU, and simultaneously setting the key-value cache pointer of static computation graph A to point to the starting address of a pre-configured shared key-value cache area. The total capacity of the shared key-value cache area is not less than the maximum capacity limit of all static computation graphs (e.g., 1024), ensuring that it can accommodate the processing requirements of the longest document.
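The pointer binding described above can be illustrated with a small sketch in which each graph holds only a reference to one shared buffer and is restricted to positions below its own capacity. The classes `SharedKVCache` and `GraphBinding` are hypothetical stand-ins, and the string key-value pairs are placeholders for real tensors; an actual NPU runtime would bind device memory addresses instead.

```python
class SharedKVCache:
    """One buffer sized for the largest graph; each slot holds a (key, value) pair."""
    def __init__(self, total_capacity):
        self.slots = [None] * total_capacity

class GraphBinding:
    """A graph's window onto the shared cache, limited to its own capacity."""
    def __init__(self, cache, capacity):
        if capacity > len(cache.slots):
            raise ValueError("graph capacity exceeds the shared cache")
        self.cache = cache        # reference only; no cache data is copied
        self.capacity = capacity

    def _check(self, pos):
        if not 0 <= pos < self.capacity:
            raise IndexError(f"position {pos} outside graph capacity {self.capacity}")

    def write(self, pos, kv):
        self._check(pos)
        self.cache.slots[pos] = kv

    def read(self, pos):
        self._check(pos)
        return self.cache.slots[pos]

cache = SharedKVCache(1024)              # not less than the largest capacity limit
graph_a = GraphBinding(cache, 512)
for pos in range(380):                   # pre-filling writes positions 0..379
    graph_a.write(pos, (f"k{pos}", f"v{pos}"))
graph_b = GraphBinding(cache, 1024)      # binding a second graph copies nothing
```

In this sketch, `graph_b` sees everything `graph_a` wrote because both point at the same buffer, which is the property that makes switching possible without migrating cache data.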
[0122] The NPU begins computation based on the loaded static computation graph A. Since it is currently in the pre-filling phase, static computation graph A processes the 380 tokens of the long document in parallel, calculating the key and value vectors for each token, generating an initial key-value cache, and writing this data to the corresponding positions (positions 0 to 379) in the shared key-value cache. After the pre-filling computation is complete, the decoding generation phase begins, generating the summary content token by token. In the decoding generation phase, the NPU continues to decode and generate token by token based on static computation graph A. For each generated summary token, the electronic device updates the key-value data at the corresponding position in the shared key-value cache and outputs the currently generated token. Subsequently, the electronic device determines whether the currently output token is an end-of-sequence (EOS) token. If it is not, it continues to generate the next token, continuously monitoring changes in the current sequence length throughout the process.
[0123] During execution, the electronic device continuously checks whether the currently used static computation graph A can still accommodate the current sequence length. As summary tokens are generated, the current sequence length gradually increases from 380. When the 133rd summary token is generated, the total length reaches 513 (380 input tokens plus 133 generated tokens), exceeding the capacity limit of static computation graph A (512). At this point, the electronic device determines that static computation graph A can no longer accommodate the current sequence length and that a new static computation graph needs to be determined. Based on the increased length of 513, the electronic device determines a static computation graph B in the static computation graph library that can accommodate this length. According to the minimum-fit filtering mode, the static computation graph with a capacity of 1024 is selected as the new execution graph. After determining static computation graph B, the electronic device unloads the currently executing static computation graph A, loads static computation graph B into the NPU, and sets the key-value cache pointer of static computation graph B to point to the starting address of the shared key-value cache area. Since the capacity of static computation graph B is 1024, it can access positions 0 to 1023 in the shared cache, which already contains all the key-value data from positions 0 to 512 previously written by static computation graph A.
[0124] After loading, the NPU continues computation based on static computation graph B, resuming decoding and generation from the 514th token (i.e., the 134th summary token). Each time a new summary token is generated, the shared key-value cache is updated and the token is output, while length changes and the end-of-sequence (EOS) status continue to be monitored. The process ends when the generated summary reaches a preset length (e.g., 200 tokens) or an EOS token is encountered, and the complete document summary is output as the inference result.
[0125] Throughout the processing flow, unified management and dynamic scheduling based on a shared key-value cache enable seamless switching between static computation graphs of different capacities. During switching, only the new static computation graph structure needs to be loaded and its access pointers configured; no key-value cache data needs to be migrated, ensuring the continuity and efficiency of long document summarization tasks. Users enjoy a smooth document summarization experience, while the system dynamically adapts computing resources based on document length and generation progress, achieving a balance between performance and efficiency.
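The scheduling behavior in this scenario can be traced with a short simulation. It is only a sketch: it assumes generation always runs to the preset length with no early end-of-sequence token, models no actual inference, and uses the hypothetical helper names `pick` and `simulate`.

```python
def pick(capacities, length):
    """Minimum-fit rule: the smallest capacity that covers the length."""
    for cap in sorted(capacities):
        if cap >= length:
            return cap
    raise ValueError("no graph can hold this length")

def simulate(prompt_len, max_new_tokens, capacities=(128, 256, 512, 1024)):
    """Trace the active graph as the sequence grows.

    Returns (graph used for pre-filling, switch log, final length), where each
    log entry is (generated token index, total length, old graph, new graph).
    """
    active = pick(capacities, prompt_len)     # graph chosen for pre-filling
    first = active
    switches = []
    length = prompt_len
    for n in range(1, max_new_tokens + 1):    # n-th generated token
        length = prompt_len + n
        if length > active:                   # current graph no longer fits
            new = pick(capacities, length)
            switches.append((n, length, active, new))
            active = new
    return first, switches, length

first, switches, final_length = simulate(380, 200)
```

Running the simulation with a 380-token prompt and a 200-token limit reproduces the single switch described above: the 512-capacity graph handles pre-filling, and the 1024-capacity graph takes over when the 133rd token brings the total length to 513.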
[0126] Referring to Figure 4, in another embodiment of this application, a data processing apparatus is also provided, the apparatus comprising:
[0127] The acquisition unit 10 is used to acquire the length information of the current sequence to be processed in response to the target model performing an inference task.
[0128] The determining unit 20 is used to determine a target static computation graph in a static computation graph library based on the length information. The static computation graph library includes multiple static computation graphs corresponding to different context capacities. The static computation graphs represent the computational structure and data flow of the target model under the corresponding context capacity. Each static computation graph shares the model computation parameters of the target model.
[0129] The calling unit 30 is used to call the target static computation graph to process the sequence to be processed and obtain the model inference result corresponding to the sequence to be processed.
[0130] In one possible implementation, the apparatus further includes a graph library building unit, which comprises:
[0131] The first determining subunit is used to determine at least two context capacity ranges based on the statistical distribution of context length in the target application scenario;
[0132] The compilation subunit is used to compile and generate a corresponding static computation graph based on each of the context capacity ranges; wherein each of the static computation graphs shares the model computation parameters of the target model, and each of the static computation graphs is used to perform computation on the data sequence falling into the corresponding context capacity range;
[0133] The generation subunit is used to generate the static computation graph library based on each of the static computation graphs.
[0134] Optionally, the graph library building unit also includes:
[0135] The first configuration subunit is used to configure a shared key-value cache, wherein the total capacity of the shared key-value cache is not less than the upper limit of the capacity among the at least two context capacity ranges;
[0136] The second configuration subunit configures each static computation graph in the static computation graph library to access the shared key-value cache, and the operation range of each static computation graph on the shared key-value cache does not exceed the range of its corresponding context capacity.
[0137] In one possible implementation, the device further includes:
[0138] The monitoring unit is used to monitor the change in the length information of the current sequence to be processed during the process of calling the target static computation graph for processing;
[0139] An update unit is configured to, in response to detecting an increase in the length of the sequence to be processed, and the increased length exceeding the context capacity range corresponding to the currently invoked target static computation graph, determine a first static computation graph in the static computation graph library based on the increased length;
[0140] The switching unit is used to switch the current processing task from the target static computation graph to the first static computation graph for processing.
[0141] Optionally, the monitoring unit is configured as follows:
[0142] During the pre-filling stage, monitor the length of the input prompt sequence;
[0143] The change in the length of the input prompt sequence is determined as a change in the length information of the sequence to be processed;
[0144] During the decoding and generation phase, the length of the generated sequence is monitored;
[0145] The total length change between the generated sequence and the input prompt sequence is determined as the change in the length information of the sequence to be processed.
[0146] Specifically, the step of determining a target static computation graph in a static computation graph library based on the length information, and then using the target static computation graph to process the sequence to be processed to obtain a model inference result corresponding to the sequence to be processed, includes:
[0147] In the pre-filling stage, a second static computation graph is determined from the static computation graph library based on the length of the input prompt sequence, and the second static computation graph is called to perform pre-filling calculation on the input prompt sequence and generate an initial key-value cache;
[0148] During the decoding generation stage, a third static computation graph is determined from the static computation graph library based on the total length of the current sequence. The third static computation graph is then called to perform word-by-word decoding generation based on the initial key-value cache, and the generated word sequence is output as the model inference result.
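The difference between the two phases lies in which length drives graph selection. A minimal sketch, assuming the example capacities and using hypothetical names (`Phase`, `monitored_length`, `graph_for`):

```python
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"

def monitored_length(phase, prompt_len, generated_len=0):
    """Length that drives graph selection in each phase."""
    if phase is Phase.PREFILL:
        return prompt_len                    # only the input prompt counts
    return prompt_len + generated_len        # prompt plus tokens generated so far

def graph_for(phase, prompt_len, generated_len=0,
              capacities=(128, 256, 512, 1024)):
    """Smallest capacity that covers the monitored length for this phase."""
    length = monitored_length(phase, prompt_len, generated_len)
    return min(cap for cap in capacities if cap >= length)
```

With a 380-token prompt, the pre-filling call selects the 512-capacity graph, and the decoding call keeps returning 512 until the generated length pushes the total past that limit.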
[0149] In one possible implementation, the switching unit is configured as follows:
[0150] Pause the execution of the current processing task on the target static computation graph, and obtain the execution progress of the current processing task. The execution progress includes the position that the current processing task has reached and the intermediate result data that has been generated.
[0151] Load the first static computation graph;
[0152] Based on the context capacity range corresponding to the target static computation graph, determine the data access range of the first static computation graph in the shared key-value cache area;
[0153] Configure the first static computation graph to read cached key-value data from the data access range;
[0154] Based on the read key-value data and the execution progress, the first static computation graph is controlled to continue executing the current processing task.
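The switching steps listed above can be sketched as a single function. This is one possible reading rather than the application's implementation: the `Graph` and `Progress` types are hypothetical, "loading" is represented simply by the new object, and the data access range is assumed to be the part of the shared cache already filled, bounded by the old graph's capacity.

```python
from dataclasses import dataclass, field

@dataclass
class Progress:
    position: int                                    # cache positions filled so far
    generated: list = field(default_factory=list)    # intermediate output tokens

@dataclass
class Graph:
    capacity: int
    access_range: range = range(0)   # region of the shared cache it reads from
    running: bool = False

def switch_graph(old, new, progress):
    """Hand the current task from `old` to `new` without moving cache data."""
    old.running = False                              # pause the current graph
    if progress.position > new.capacity:
        raise ValueError("new graph cannot hold the current progress")
    # Assumption: previously cached data lies within the old graph's capacity
    # range, so that range bounds what the new graph needs to read.
    new.access_range = range(0, min(progress.position, old.capacity))
    new.running = True                               # resume on the new graph
    return progress.position                         # next cache position to write
```

Only bookkeeping changes hands in this sketch: the progress record and the access range. The cached key-value data itself stays where it is in the shared area.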
[0155] In one possible implementation, the determining unit includes:
[0156] The filtering subunit is used to select, from the static computation graph library, candidate static computation graphs whose context capacity range is not less than the length value corresponding to the length information;
[0157] The determination subunit is used to determine, among the candidate static computation graphs, the static computation graph with the smallest context capacity range as the target static computation graph.
[0158] Optionally, the compilation subunit is configured as follows:
[0159] Based on the upper limit of the context capacity range, determine the input / output tensor shape and memory layout of the computation nodes in the static computation graph;
[0160] Based on the input / output tensor shape, the memory layout, and the model calculation parameters of the target model, a static computation graph with the target computation structure is generated.
[0161] Wherein, the static computation graphs corresponding to different context capacity ranges have different input/output tensor shapes and memory layouts.
[0162] It should be noted that, for the specific implementation of each unit and subunit in this embodiment, reference may be made to the corresponding content above, which will not be described in detail here.
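How the upper limit of a context capacity range fixes the tensor shapes of a graph can be illustrated as follows. The tensor names and the model dimensions (`hidden`, `heads`, `layers`) are hypothetical, since the application does not specify a model architecture; the point is only that every shape is a function of the capacity.

```python
from math import prod

def graph_shapes(capacity, hidden=1024, heads=16, layers=4):
    """Static tensor shapes for a graph compiled at a given capacity.

    All dimensions other than `capacity` are illustrative assumptions.
    """
    head_dim = hidden // heads
    return {
        "input_ids": (1, capacity),
        "attention_mask": (1, capacity, capacity),
        "kv_window": (layers, 2, heads, capacity, head_dim),  # keys and values
        "hidden_states": (1, capacity, hidden),
    }

def tensor_bytes(shape, bytes_per_element=2):
    """Memory footprint of one tensor, assuming 16-bit elements by default."""
    return prod(shape) * bytes_per_element

small = graph_shapes(128)
large = graph_shapes(1024)
ratio_mask = tensor_bytes(large["attention_mask"]) // tensor_bytes(small["attention_mask"])
ratio_kv = tensor_bytes(large["kv_window"]) // tensor_bytes(small["kv_window"])
```

Under these assumptions, the key-value window grows linearly with capacity while the attention mask grows quadratically, which is why running a short sequence on the largest graph wastes both memory and computation.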
[0163] In another embodiment of this application, a readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the data processing method as described in any of the foregoing embodiments.
[0164] In another embodiment of this application, an electronic device is also provided, which may include:
[0165] A memory for storing computer programs and the data generated by the execution of said computer programs;
[0166] A processor for executing the computer program to achieve:
[0167] In response to the target model performing inference tasks, obtain the length information of the current sequence to be processed;
[0168] Based on the length information, a target static computation graph is determined in a static computation graph library, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model;
[0169] The target static computation graph is invoked to process the sequence to be processed, and the model inference result corresponding to the sequence to be processed is obtained.
[0170] It should be noted that, for the specific implementation of the processor in this embodiment, reference may be made to the corresponding content above, which will not be described in detail here.
[0171] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to the method section.
[0172] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0173] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.
[0174] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A data processing method, comprising: In response to the target model performing inference tasks, obtain the length information of the current sequence to be processed; Based on the length information, a target static computation graph is determined in a static computation graph library, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model; The target static computation graph is invoked to process the sequence to be processed, and the model inference result corresponding to the sequence to be processed is obtained.
2. The method according to claim 1, wherein the construction process of the static computational graph library includes: Based on the statistical distribution of context length in the target application scenario, determine at least two context capacity ranges; Based on each of the context capacity ranges, a corresponding static computation graph is compiled and generated; wherein each of the static computation graphs shares the model computation parameters of the target model, and each of the static computation graphs is used to perform computation on the data sequence falling into the corresponding context capacity range; The static computation graph library is generated based on each of the static computation graphs.
3. The method according to claim 2, further comprising: Configure a shared key-value cache, wherein the total capacity of the shared key-value cache is not less than the upper limit of the capacity of the at least two context capacity ranges; Each static computation graph in the static computation graph library is configured to access the shared key-value cache, and the operation range of each static computation graph on the shared key-value cache does not exceed the range of its corresponding context capacity.
4. The method according to claim 1, further comprising: During the process of calling the target static computation graph for processing, the change in the length information of the current sequence to be processed is monitored; In response to detecting an increase in the length of the sequence to be processed, and the increased length exceeding the context capacity range corresponding to the currently invoked target static computation graph, a first static computation graph is determined in the static computation graph library based on the increased length. The current processing task is switched from the target static computation graph to the first static computation graph for processing.
5. The method according to claim 4, wherein monitoring the change in the length information of the current sequence to be processed includes: During the pre-filling stage, monitor the length of the input prompt sequence; The change in the length of the input prompt sequence is determined as a change in the length information of the sequence to be processed; During the decoding and generation phase, the length of the generated sequence is monitored; The total length change between the generated sequence and the input prompt sequence is determined as the change in the length information of the sequence to be processed. Specifically, the step of determining a target static computation graph in a static computation graph library based on the length information, and then using the target static computation graph to process the sequence to be processed to obtain a model inference result corresponding to the sequence to be processed, includes: In the pre-filling stage, a second static computation graph is determined from the static computation graph library based on the length of the input prompt sequence, and the second static computation graph is called to perform pre-filling calculation on the input prompt sequence and generate an initial key-value cache; During the decoding generation stage, a third static computation graph is determined from the static computation graph library based on the total length of the current sequence. The third static computation graph is then called to perform word-by-word decoding generation based on the initial key-value cache, and the generated word sequence is output as the model inference result.
6. The method according to claim 4, wherein switching the current processing task from the target static computation graph to the first static computation graph for processing includes: Pause the execution of the current processing task on the target static computation graph, and obtain the execution progress of the current processing task. The execution progress includes the position that the current processing task has reached and the intermediate result data that has been generated. Load the first static computation graph; Based on the context capacity range corresponding to the target static computation graph, determine the data access range of the first static computation graph in the shared key-value cache area; Configure the first static computation graph to read cached key-value data from the data access range; Based on the read key-value data and the execution progress, the first static graph is controlled to continue executing the current processing task.
7. The method according to claim 1, wherein determining the target static computation graph in the static computation graph library based on the length information comprises: Candidate static computation graphs with a context capacity range not less than the length value corresponding to the length information are selected from the static computation graph library; Static computation graphs whose context capacity range is smaller than the target capacity range in the candidate static graphs are identified as the target static computation graphs.
8. The method according to claim 2, wherein compiling and generating a corresponding static computation graph based on each of the context capacity ranges comprises: Based on the upper limit of the context capacity range, determine the input/output tensor shape and memory layout of the computation nodes in the static computation graph; Based on the input/output tensor shape, the memory layout, and the model computation parameters of the target model, a static computation graph with the target computation structure is generated; wherein the static computation graphs corresponding to different context capacity ranges have different input/output tensor shapes and memory layouts.
9. A data processing apparatus, comprising: The acquisition unit is used to acquire the length information of the current sequence to be processed in response to the target model performing inference tasks. A determining unit is configured to determine a target static computation graph in a static computation graph library based on the length information, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model; The calling unit is used to call the target static computation graph to process the sequence to be processed and obtain the model inference result corresponding to the sequence to be processed.
10. An electronic device, comprising: A memory for storing computer programs and the data generated by the execution of said computer programs; A processor for executing the computer program to achieve: In response to the target model performing inference tasks, obtain the length information of the current sequence to be processed; Based on the length information, a target static computation graph is determined in a static computation graph library, wherein the static computation graph library includes multiple static computation graphs corresponding to different context capacities, and the static computation graphs characterize the computation structure and data flow of the target model under the corresponding context capacity; each of the static computation graphs shares the model computation parameters of the target model; The target static computation graph is invoked to process the sequence to be processed, and the model inference result corresponding to the sequence to be processed is obtained.