A data processing method, apparatus, and system

By pre-releasing memory space within the deep learning framework, the memory management problem caused by multi-stream concurrency is solved, improving memory utilization and computational efficiency, and enhancing the concurrent performance of the computation stream.

CN122220084A · Pending · Publication Date: 2026-06-16 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2024-12-16
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In deep learning frameworks, multi-stream concurrency can lead to excessive peak memory usage or excessive fragmentation, resulting in OOM (out of memory), which reduces memory utilization and computational efficiency.

Method used

Memory space is released in advance, using mapping relationships and cross-stream control operators: the release takes place after the first processing stream determines that the second processing stream has received the second operator, and before the second processing stream has finished executing the second operator.

Benefits of technology

It improves memory utilization and computational efficiency, reduces memory release waiting time, and enhances the concurrent performance of the computation stream.

Abstract

A data processing method applied to a control module, the method comprising: issuing a first operator to a first processing stream; issuing a second operator to a second processing stream, wherein an input of the second operator comprises an output of the first operator; and releasing a memory space when it is determined that the first processing stream is aware that the second processing stream has received the issued second operator, wherein the memory space stores the output calculated by the first processing stream through the first operator. Because issuing an operator takes far less time than executing it to completion, releasing the memory space at this point accelerates memory release and thereby improves memory utilization and computational efficiency.

Description

Technical Field

[0001] This application relates to the field of communication technology, and in particular to a data processing method, apparatus and system.

Background Technology

[0002] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0003] Streams (i.e., processing streams) play a crucial role in deep learning frameworks. A stream is a mechanism for parallel computing that allows a series of operations, such as memory transfers and kernel calls, to be executed asynchronously on a device, increasing throughput and reducing latency. Operations assigned to the same stream are executed on the device side in the order in which they were issued. In deep learning frameworks, multi-stream concurrency allows the framework to process multiple independent computational tasks at the same time, which significantly improves the utilization of computing resources and the speed of model training. This concurrency mechanism enables deep learning models to iterate and optimize faster; its role is particularly evident in the era of large models, where it has greatly promoted the advancement and application of deep learning technology.

[0004] However, multi-stream concurrency also increases the pressure on device memory. Each concurrent stream requires independent device memory to store its execution state and intermediate computation results; device memory cannot be directly reused between different streams. If device memory is not managed properly, the result can be excessive peak memory usage or excessive fragmentation, leading to OOM (out of memory) errors and reducing memory utilization and computational efficiency.

Summary of the Invention

[0005] In a first aspect, this application provides a data processing method, the method comprising: issuing a first operator to a first processing stream; issuing a second operator to a second processing stream, wherein the input of the second operator includes the output of the first operator; and releasing memory space after the first processing stream determines that the second processing stream has received the issued second operator, and before the second processing stream has finished executing the second operator, wherein the memory space stores the output calculated by the first processing stream through the first operator.

[0006] For example, memory space can be released when the first processing stream determines that the second processing stream has received the second operator.

[0007] In this process, after the first operator calculates the output, it needs to store the output in memory. The second operator, acting as a data consumer, can read the data from this memory and complete its own calculation. In existing technologies, the memory space is only released after the completion of the second operator's calculation is detected. For example, a reference event is inserted after the second operator to determine whether the second processing stream has completed the calculation of the second operator by polling the state of the reference event. However, since the calculation of the second operator often involves a certain delay, the release of the memory space requires a long wait, which leads to a decrease in memory utilization and computational efficiency.
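The existing approach described in the preceding paragraph can be pictured with a small host-side simulation. This is a minimal sketch under stated assumptions, not the implementation of any framework: `ReferenceEvent`, `BaselineAllocator`, and the block name are hypothetical, and the `done` flag stands in for the device reporting that the reference event placed after the second operator has completed.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceEvent:
    """Hypothetical stand-in for a device-side reference event."""
    done: bool = False  # becomes True once the device has executed past the event

    def query(self) -> bool:
        return self.done


@dataclass
class BaselineAllocator:
    """Frees a block only after the consumer's reference event has completed."""
    pending: dict = field(default_factory=dict)      # block name -> ReferenceEvent
    free_blocks: list = field(default_factory=list)  # blocks available for reuse

    def defer_free(self, block: str, event: ReferenceEvent) -> None:
        self.pending[block] = event

    def poll(self) -> None:
        # The host polls every pending event; a block is freed only when its event is done.
        for block, event in list(self.pending.items()):
            if event.query():
                self.free_blocks.append(block)
                del self.pending[block]


alloc = BaselineAllocator()
event_after_op2 = ReferenceEvent()
alloc.defer_free("output_of_op1", event_after_op2)

alloc.poll()
freed_while_op2_runs = list(alloc.free_blocks)  # nothing can be reused yet

event_after_op2.done = True  # the second operator has finished on the device
alloc.poll()
freed_after_op2_done = list(alloc.free_blocks)
```

In this sketch the block stays unavailable for the whole execution time of the second operator, which is the waiting period the application aims to shorten.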

[0008] In this embodiment, the memory space can be released before the second processing stream finishes executing the second operator, once the first processing stream determines that the second processing stream has received the second operator. Issuing an operator takes less time than executing it to completion, so releasing at this point accelerates memory release and improves memory utilization and computational efficiency.

[0009] Here, the first processing stream determining that the second processing stream has received the issued second operator can be understood as the first processing stream being aware that the second processing stream has received the issued second operator.

[0010] In one possible implementation, the method further includes: obtaining a mapping relationship, the mapping relationship including the latest operator received by the second processing stream as determined by the first processing stream; the first processing stream determining that the second processing stream has received the second operator includes: if the mapping relationship indicates that the latest operator received by the second processing stream as determined by the first processing stream is an operator following the second operator, the first processing stream determines that the second processing stream has received the second operator.

[0011] The mapping relationship can be used to represent, for each pair of processing streams, the latest issued operator of one stream that the other stream is aware of. For example, it can take the form of a table. In this way, the latest received operator that each computing stream knows about for the other streams can be determined from the mapping relationship, which in turn determines the timing of memory release.
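One way to picture such a table is a two-level dictionary keyed by observing stream and observed stream. The sketch below is illustrative only: `KnowledgeTable`, its method names, the stream names, and the integer operator IDs are assumptions, and the comparison uses "greater than or equal to" so that it covers both the second operator itself and any operator issued after it.

```python
from collections import defaultdict


class KnowledgeTable:
    """Hypothetical table: latest operator ID of `observed` that `observer` knows was issued."""

    def __init__(self):
        self._latest = defaultdict(dict)  # observer -> {observed: operator ID}

    def set_latest(self, observer: str, observed: str, op_id: int) -> None:
        # Knowledge only moves forward, so keep the larger ID.
        current = self._latest[observer].get(observed, -1)
        self._latest[observer][observed] = max(current, op_id)

    def knows_issued(self, observer: str, observed: str, op_id: int) -> bool:
        # True when the observer knows of this operator or a later one on the observed stream.
        return self._latest[observer].get(observed, -1) >= op_id


table = KnowledgeTable()
second_operator_id = 5  # assumed ID of the second operator on "stream2"

can_free_before = table.knows_issued("stream1", "stream2", second_operator_id)
table.set_latest("stream1", "stream2", 7)  # stream1 learns of a later operator on stream2
can_free_after = table.knows_issued("stream1", "stream2", second_operator_id)
```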

[0012] For computing stream A, knowledge of the latest operator received by computing stream B is often delayed. Only when stream A receives a cross-stream operator is its record of the latest operators received by other computing streams updated in the mapping relationship.

[0013] In one possible implementation, the method further includes: updating the mapping relationship when a third operator is issued to the first processing stream, and the third operator is an operator issued after the second operator, wherein the updated mapping relationship indicates that the most recently received operator by the second processing stream, as determined by the first processing stream, is the second operator; the third operator is a control operator for cross-stream operations performed between the first processing stream and other processing streams.

[0014] In one possible implementation, the first operator and the second operator are identified from a code file; the method further includes: when the code file does not contain a cross-stream control operator between the first operator and the second operator, issuing a fourth operator to the first processing stream after issuing the first operator to the first processing stream, and issuing a fifth operator to the second processing stream before issuing the second operator to the second processing stream; the fourth operator and the fifth operator are control operators for cross-stream operations from the first processing stream to the second processing stream.

[0015] In this embodiment of the application, cross-stream control operators can be issued automatically when cross-stream operations are present, thereby reducing the user's operating costs.
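The automatic insertion described in this implementation can be sketched as a pass over the operators in issue order. This is a simplified illustration under stated assumptions: the tuple layout `(name, stream, inputs)`, the function name `insert_cross_stream_controls`, and the `record(...)` / `wait(...)` labels are hypothetical, and the input is assumed to contain no cross-stream control operators of its own.

```python
def insert_cross_stream_controls(ops):
    """Return the issue sequence with a record after each cross-stream producer
    and a wait before each cross-stream consumer.

    ops: list of (name, stream, inputs) tuples in issue order, where `inputs`
    lists the names of the operators whose outputs are consumed.
    """
    stream_of = {name: stream for name, stream, _ in ops}
    # Producers whose output is consumed on a different stream need a record operator.
    needs_record = {
        src
        for _, stream, inputs in ops
        for src in inputs
        if stream_of[src] != stream
    }
    issued = []
    for name, stream, inputs in ops:
        for src in inputs:
            if stream_of[src] != stream:
                # Plays the role of the fifth operator: wait, issued before the consumer.
                issued.append((f"wait({src})", stream))
        issued.append((name, stream))
        if name in needs_record:
            # Plays the role of the fourth operator: record, issued after the producer.
            issued.append((f"record({name})", stream))
    return issued


script = [("op1", "s1", []), ("op2", "s2", ["op1"])]
sequence = insert_cross_stream_controls(script)
```

Here `op1` plays the role of the first operator and `op2` the role of the second operator; the two inserted entries correspond to the fourth and fifth operators.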

[0016] In one possible implementation, the method further includes: issuing a sixth operator to the other processing stream; the sixth operator and the third operator are control operators for cross-stream operations from the other processing stream to the first processing stream; the sixth operator blocks the third operator, that is, the third operator remains blocked until the sixth operator completes execution.

[0017] For example, the sixth operator can be the Event record operator, and the third operator can be the Event wait operator.

[0018] In one possible implementation, the first processing stream and the second processing stream are used to perform the same processing task for deep learning.

[0019] In one possible implementation, the method is executed on the host side, and the first processing stream and the second processing stream are executed on the device side.

[0020] In one possible implementation, the method further includes:

[0021] A sixth operator is issued to the second processing stream;

[0022] A third operator is issued to the first processing stream; wherein the third operator is an operator issued after the sixth operator and associated with the sixth operator, and the sixth operator and the third operator are control operators for cross-stream operations from the second processing stream to the first processing stream;

[0023] After the third operator is sent to the first processing stream, the mapping relationship is updated, and the latest received operator in the updated mapping relationship is the sixth operator.

[0024] When the first processing stream receives the third operator, and the sixth operator associated with the third operator is located in the second processing stream, the first processing stream can perceive that the latest operator received by the second processing stream is the sixth operator. Since the sixth operator is issued after the second operator, it can be inferred that the first processing stream is aware that the second processing stream has received the second operator, and the memory space can therefore be released.
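The update triggered by a record/wait pair can be sketched as host-side bookkeeping. This is a minimal sketch, assuming integer operator IDs drawn from a single increasing counter; the `Controller` class, its method names, and the stream names `s1` and `s2` are hypothetical and not taken from the application.

```python
import itertools


class Controller:
    """Hypothetical host-side controller that tracks what each stream knows."""

    def __init__(self, streams):
        self._ids = itertools.count(1)  # one increasing ID per issued operator
        # latest_known[a][b]: latest operator of stream b that stream a knows was issued
        self.latest_known = {s: {t: 0 for t in streams if t != s} for s in streams}

    def issue(self, stream: str) -> int:
        return next(self._ids)

    def issue_record(self, stream: str) -> int:
        return self.issue(stream)

    def issue_wait(self, stream: str, record_stream: str, record_id: int) -> int:
        op_id = self.issue(stream)
        # Receiving the wait tells `stream` that `record_stream` has received the record.
        known = self.latest_known[stream]
        known[record_stream] = max(known[record_stream], record_id)
        return op_id

    def can_release(self, producer_stream: str, consumer_stream: str, consumer_op: int) -> bool:
        return self.latest_known[producer_stream][consumer_stream] >= consumer_op


ctl = Controller(["s1", "s2"])
first_op = ctl.issue("s1")                  # producer on the first stream
second_op = ctl.issue("s2")                 # consumer on the second stream
before = ctl.can_release("s1", "s2", second_op)

sixth_op = ctl.issue_record("s2")           # record issued after the second operator
third_op = ctl.issue_wait("s1", "s2", sixth_op)
after = ctl.can_release("s1", "s2", second_op)
```

The release decision depends only on what has been issued, so in this sketch it never consults the execution state of the second operator.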

[0025] In one possible implementation, the method further includes:

[0026] A sixth operator is issued to the second processing stream;

[0027] A third operator is issued to a third processing stream; wherein the third operator is an operator issued after the sixth operator and associated with the sixth operator, and the sixth operator and the third operator are control operators for cross-stream operations from the second processing stream to the third processing stream;

[0028] A seventh operator is issued to the third processing stream or a fourth processing stream;

[0029] An eighth operator is issued to the first processing stream; wherein the eighth operator is an operator issued after the seventh operator and associated with the seventh operator, and the seventh operator and the eighth operator are control operators for cross-stream operations from the third processing stream or the fourth processing stream to the first processing stream;

[0030] After the eighth operator is sent to the first processing stream, the mapping relationship is updated, and the latest received operator in the updated mapping relationship is the sixth operator.

[0031] When the first processing stream receives the eighth operator, although the seventh operator associated with the eighth operator is not located in the second processing stream, this chain of cross-stream operations has previously passed through the second processing stream (via the sixth and third operators). Therefore, the first processing stream can perceive that the latest operator received by the second processing stream is the sixth operator. Since the sixth operator was issued after the second operator, it can be inferred that the first processing stream is aware that the second processing stream has received the second operator, and the memory space can therefore be released.
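The indirect case can be sketched by letting each record operator carry a snapshot of what its own stream knows, and merging that snapshot when the paired wait is issued. This resembles a vector-clock merge and is offered only as one plausible reading of the paragraphs above; the `Controller` class, the snapshot mechanism, and the stream names are assumptions.

```python
import itertools


class Controller:
    """Hypothetical controller whose knowledge propagates through chained record/wait pairs."""

    def __init__(self, streams):
        self._ids = itertools.count(1)
        # known[a][b]: latest operator of stream b that stream a knows was issued
        self.known = {s: {t: 0 for t in streams} for s in streams}
        self._snapshots = {}  # record operator ID -> copy of the recording stream's row

    def issue(self, stream: str) -> int:
        op_id = next(self._ids)
        self.known[stream][stream] = op_id  # a stream always knows its own latest operator
        return op_id

    def issue_record(self, stream: str) -> int:
        op_id = self.issue(stream)
        self._snapshots[op_id] = dict(self.known[stream])
        return op_id

    def issue_wait(self, stream: str, record_id: int) -> int:
        op_id = self.issue(stream)
        row = self.known[stream]
        # Merge everything the recording stream knew when the record was issued.
        for other, latest in self._snapshots[record_id].items():
            if other != stream:
                row[other] = max(row[other], latest)
        return op_id


ctl = Controller(["s1", "s2", "s3"])
first_op = ctl.issue("s1")
second_op = ctl.issue("s2")
sixth_op = ctl.issue_record("s2")        # record on the second stream
ctl.issue_wait("s3", sixth_op)           # third operator, on the third stream
seventh_op = ctl.issue_record("s3")      # record on the third stream
before = ctl.known["s1"]["s2"] >= second_op
ctl.issue_wait("s1", seventh_op)         # eighth operator, on the first stream
after = ctl.known["s1"]["s2"] >= second_op
```

After the eighth operator is issued, the first stream's entry for the second stream is the sixth operator, even though the first stream never waited on the second stream directly.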

[0032] In one possible implementation, the sixth operator blocks the third operator, that is, the third operator remains blocked until the sixth operator completes execution.

[0033] In a second aspect, this application provides a data processing apparatus, the apparatus comprising:

[0034] A transceiver module, configured to issue a first operator to a first processing stream and a second operator to a second processing stream; wherein the input of the second operator includes the output of the first operator;

[0035] A memory management module, configured to release memory space after the first processing stream determines that the second processing stream has received the second operator, and before the second processing stream finishes executing the second operator; wherein the memory space stores the output calculated by the first processing stream through the first operator.

[0036] In one possible implementation, the device further includes:

[0037] A mapping relationship management module is used to obtain mapping relationships, which include the latest operator received by the second processing stream as determined by the first processing stream;

[0038] The first processing stream determining that the second processing stream has received the second operator includes:

[0039] If the mapping relationship indicates that the latest operator received by the second processing stream as determined by the first processing stream is an operator following the second operator, the first processing stream determines that the second processing stream has received the second operator.

[0040] In one possible implementation, the mapping relationship management module is further configured to update the mapping relationship when a third operator is issued to the first processing stream, and the third operator is an operator issued after the second operator. The updated mapping relationship indicates that the operator most recently received by the second processing stream as determined by the first processing stream is the second operator. The third operator is a control operator for cross-stream operations performed between the first processing stream and other processing streams.

[0041] In one possible implementation, the first operator and the second operator are identified from a code file; the transceiver module is further configured to:

[0042] When the code file does not contain a cross-stream control operator between the first operator and the second operator, after the first operator is sent to the first processing stream, a fourth operator is sent to the first processing stream, and before the second operator is sent to the second processing stream, a fifth operator is sent to the second processing stream; the fourth operator and the fifth operator are control operators for cross-stream operations from the first processing stream to the second processing stream.

[0043] In one possible implementation, the transceiver module is further configured to:

[0044] Issue a sixth operator to the other processing streams; the sixth operator and the third operator are control operators for cross-stream operations from the other processing streams to the first processing stream; the sixth operator blocks the third operator, that is, the third operator remains blocked until the sixth operator completes execution.

[0045] In one possible implementation, the first processing stream and the second processing stream are used to perform the same processing task for deep learning.

[0046] In one possible implementation, the method is executed on the host side, while the first processing stream and the second processing stream are executed on the device side.

[0047] In a third aspect, embodiments of this application provide a data processing apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method described in the first aspect above and any of its optional methods.

[0048] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method described in the first aspect and any of its optional methods.

[0049] In a fifth aspect, embodiments of this application provide a computer program that, when run on a computer, causes the computer to perform the method described in the first aspect and any of its optional methods.

[0050] In a sixth aspect, this application provides a chip system including a processor configured to support a data processing apparatus in implementing the functions involved in the foregoing aspects, for example, sending or processing the data or information involved in the foregoing methods. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the execution device or training device. The chip system may consist of chips, or may include chips and other discrete devices.

Description of the Drawings

[0051] Figure 1 is a schematic diagram of an application architecture provided in this application;

[0052] Figure 2 is a schematic diagram of an application architecture provided in this application;

[0053] Figure 3 is a flowchart of a data processing method provided in an embodiment of this application;

[0054] Figures 4 to 9D are schematic diagrams of the system framework in embodiments of this application;

[0055] Figure 10 is a schematic diagram of the structure of a data processing apparatus provided in an embodiment of this application;

[0056] Figure 11 is a schematic diagram of the structure of the execution device provided in the embodiments of this application;

[0057] Figure 12 is a schematic diagram of the structure of the training device provided in the embodiments of this application;

[0058] Figure 13 is a schematic diagram of a chip structure provided in an embodiment of this application.

Detailed Implementation

[0059] First, some expressions that may appear in this application will be explained.

[0060] "First" and "second" are used to distinguish different objects or to differentiate different treatments of the same object, rather than to describe a specific order of objects.

[0061] "At least one" means one or more, while "multiple" means two or more.

[0062] "And / or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can mean: A exists alone, A and B exist simultaneously, or B exists alone. A and B can be singular or plural.

[0063] The character " / " generally indicates that the objects before and after it are in an "or" relationship. For example, A / B can mean A or B.

[0064] Furthermore, the terms "comprising," "including," and "having" used in the description of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the steps or units listed, but may optionally include other steps or units not listed, or may optionally include other steps or units inherent to such process, method, product, or apparatus.

[0065] It should be noted that in this application, the terms "exemplary" or "for example" are used to indicate that something is being described or illustrated. Any implementation or design scheme described as "exemplary" or "for example" (such as the embodiments in this application) should not be construed as being more preferred or advantageous than other implementations or design schemes. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.

[0066] In the specification and drawings of this application, the terms "of", "relevant", and "corresponding" may sometimes be used interchangeably. It should be noted that when the distinction is not emphasized, they have the same meaning.

[0067] Below is a brief description of some of the terms used in this application.

[0068] Terminals can include desktop, laptop, handheld, and vehicle-mounted user equipment (UE) devices, such as smartphones, cellular phones, desktop computers, tablets, smart TVs, smart TV boxes, ultra-mobile personal computers (UMPCs), laptops, personal digital assistants (PDAs), portable multimedia players (PMPs), dedicated media players, consumer communication devices, wearable devices (such as smartwatches), AR (augmented reality) / VR (virtual reality) devices, and other types of communication devices.

[0069] Storage space, also called address space, refers to one or more segments of addresses that can be used by a device or instance. For example, the virtual address space of a device or instance is the one or more segments of virtual addresses that can be used by that device or instance. The virtual address space of a device or instance is allocated by the operating system running that device or instance. Similarly, the physical address space of a device or instance is the one or more segments of physical addresses allocated to that device or instance. When a device or instance uses this physical address space, other devices or instances cannot use addresses within it. An instance's physical address space is allocated by the operating system running that instance. This allocation may be dynamic; for example, as the instance runs, the physical address space it occupies may grow, up to an upper limit. The size and range of a device's physical address space are usually fixed.

[0070] Streams (i.e., processing streams) play a crucial role in deep learning frameworks. A stream is a mechanism for parallel computing that allows a series of operations, such as memory transfers and kernel calls, to be executed asynchronously on a device, increasing throughput and reducing latency. Operations assigned to the same stream are executed on the device side in the order in which they were issued. In deep learning frameworks, multi-stream concurrency allows the framework to process multiple independent computational tasks at the same time, which significantly improves the utilization of computing resources and the speed of model training. This concurrency mechanism enables deep learning models to iterate and optimize faster; its role is particularly evident in the era of large models, where it has greatly promoted the advancement and application of deep learning technology.

[0071] However, multi-stream concurrency also increases the pressure on device memory. Each concurrent stream requires independent device memory to store its execution state and intermediate computation results; device memory cannot be directly reused between different streams. If device memory management is not done properly, it can lead to excessively high peak memory usage or excessive fragmentation, resulting in OOM (out of memory) errors, reducing memory utilization and computational efficiency.

[0072] To address the aforementioned problems, this application provides a data processing method.

[0073] Referring to Figure 1, Figure 1 is a schematic diagram of an application architecture of this application.

[0074] Figure 1 shows an example of a system 100 in an embodiment of this application. As shown in Figure 1, the system 100 may include at least one control module 110 and multiple data processing modules 120, with the control module 110 communicatively connected to each data processing module 120.

[0075] The data processing module 120 can be one or more hardware devices, or different instances of software, such as different processes.

[0076] The control module 110 can be used to coordinate and manage the operation of each data processing module 120, for example by allocating and issuing operators to each data processing module. When the system 100 includes one control module 110, that control module 110 performs the above operations within the system 100; when the system 100 includes multiple control modules 110, one of them acts as the main control module and performs the above operations within the system 100, while the other control modules 110 act as backup control modules. The data processing module 120 can be a working node used to execute operators.

[0077] Referring to Figure 2, Figure 2 is a schematic diagram of an application architecture of this application.

[0078] The architecture includes software modules: code files and a deep learning framework, with a new module added to the deep learning framework: online dependency computation;

[0079] It also includes hardware modules: CPU / NPU / GPU and other hardware, used to execute the framework software code and the deep learning operator code.

[0080] 1. Code files: used to describe the deep learning algorithms required by the user.

[0081] 2. Deep learning framework: framework software that compiles and executes deep learning models. It includes the following components:

[0082] a) Online dependency computation: dynamically reuses operator memory as operators are issued.

[0083] b) Operator distribution: issues operators from the host side to the device side.

[0084] Referring to Figure 3, Figure 3 is a flowchart of a data processing method provided in an embodiment of this application. As shown in Figure 3, the data processing method provided in this embodiment may include steps 301 to 303, which are described in detail below.

[0085] 301. Issue a first operator to a first processing stream;

[0086] 302. Issue a second operator to a second processing stream; wherein the input of the second operator includes the output of the first operator;

[0087] The control module can be a module of a deep learning framework. The control module can obtain code files, such as code files that can describe the deep learning algorithms required by the user. In typical deep learning scenarios, users can write their own algorithms, and the deep learning framework can send the user-written operators to the device side according to the script.

[0088] In this embodiment of the application, the device side may include multiple data processing modules (e.g., a first processing stream, a second processing stream, a third processing stream, etc.). Each data processing module may be responsible for processing a processing stream (or simply a stream), such as a computing stream or a communication stream. Each stream may contain multiple operators, and the data processing module may execute the stream by executing the operators.

[0089] The first operator and the second operator can be operators identified by the control module from code files (e.g., user scripts). The first operator needs to be sent to the first processing stream for execution, and the second operator needs to be sent to the second processing stream for execution. The input of the second operator includes the output of the first operator. That is, the first operator acts as the producer of data, and the second operator acts as the consumer of the data obtained by the first operator.

[0090] In one possible implementation, a first data processing module is used to execute the first processing stream, and a second data processing module is used to execute the second processing stream; furthermore, the control module can issue the first operator to the first processing stream and issue the second operator to the second processing stream.

[0091] In one possible implementation, the first processing stream and the second processing stream are used to perform the same processing task for deep learning.

[0092] The control module can issue operators to the various data processing modules (e.g., to the processing streams handled by the data processing modules) based on the code file. For example, the control module may include an operator label generator, which can be a monotonically incrementing number generator, so that each issued operator has a unique ID.

[0093] For example, the operators that can be issued may be as follows:

[0094] 1. Regular operator: represents a computation or communication operation;

[0095] 2. Event record: a record operation issued on this stream, which blocks the paired wait operation issued on another stream. It is used to control the execution order between different streams.

[0096] 3. Event wait: a wait operation issued on this stream, which is blocked by the paired record operation issued on another stream. It is used to control the execution order between different streams.

[0097] 4. Reference event: a special type of event that only has record and query operations. It is used by the host to query whether the device has finished executing up to the reference event.
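The label generator and the four kinds of issued operators can be modeled with a counter and an enumeration. This is an illustrative sketch only: `OpKind`, `IssuedOp`, `LabelGenerator`, and the stream names `compute` and `comm` are hypothetical names, and starting the IDs at 1 is an arbitrary choice.

```python
import itertools
from dataclasses import dataclass
from enum import Enum


class OpKind(Enum):
    REGULAR = "regular"                  # computation or communication
    EVENT_RECORD = "event_record"        # blocks the paired wait on another stream
    EVENT_WAIT = "event_wait"            # blocked by the paired record on another stream
    REFERENCE_EVENT = "reference_event"  # record and query only, polled by the host


@dataclass(frozen=True)
class IssuedOp:
    op_id: int
    kind: OpKind
    stream: str


class LabelGenerator:
    """Monotonically incrementing ID source, one ID per issued operator."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_id(self) -> int:
        return next(self._counter)


labels = LabelGenerator()
trace = [
    IssuedOp(labels.next_id(), OpKind.REGULAR, "compute"),
    IssuedOp(labels.next_id(), OpKind.EVENT_RECORD, "compute"),
    IssuedOp(labels.next_id(), OpKind.EVENT_WAIT, "comm"),
    IssuedOp(labels.next_id(), OpKind.REGULAR, "comm"),
]
ids = [op.op_id for op in trace]
```

Because the IDs increase in issue order, comparing two IDs is enough to tell which operator was issued first, which is what an "operator following the second operator" check relies on.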

[0098] For example, operators are assigned to different streams based on their computation types and the computational characteristics of the device side. A common strategy is to assign ordinary computation operators to one stream and communication operators to another. When the computation graph is issued, the computation graph obtained from the front end is executed at runtime, with online multi-stream memory reuse performed during execution. Take the computation graph shown in Figure 4 as an example: it consists of 6 operators distributed across three streams, and can be issued as shown in Figure 5.

[0099] 303. After the first processing stream determines that the second processing stream has received the second operator, and before the second processing stream finishes executing the second operator, the memory space is released; wherein the memory space stores the output calculated by the first processing stream through the first operator.

[0100] In this process, after the first operator calculates the output, it needs to store the output in memory. The second operator, acting as a data consumer, can read the data from this memory and complete its own calculation. In existing technologies, the memory space is only released after the completion of the second operator's calculation is detected. For example, a reference event is inserted after the second operator to determine whether the second processing stream has completed the calculation of the second operator by polling the state of the reference event. However, since the calculation of the second operator often involves a certain delay, the release of the memory space requires a long wait, which leads to a decrease in memory utilization and computational efficiency.

[0101] In this embodiment of the application, after the first processing stream detects that the second processing stream has received the second operator, the memory space can be released before the second operator finishes executing. Issuing an operator takes far less time than executing it, so the memory is released sooner, which improves memory utilization and computational efficiency.

[0102] It should be understood that when the data consumers of the output of the first operator include other operators besides the second operator, the first processing stream releases the memory space only after it also perceives that those other operators have been issued to their processing streams.

[0103] The following describes how to determine whether the first processing stream is aware that the second processing stream has received the issued second operator.

[0104] In one possible implementation, a mapping relationship can be obtained, which indicates the latest operator received by the second processing stream that the first processing stream can perceive; when the mapping relationship indicates that the latest received operator is an operator following the second operator, it is determined that the first processing stream perceives that the second processing stream has received the second operator.

[0105] The mapping relationship can represent, for each pair of processing streams, the latest issued operator of one stream that the other stream is aware of. For example, it can take the form of a table. In this way, the latest received operators that the different computing streams perceive of one another can be determined from the mapping relationship, and thus the timing of memory release.
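The release condition described above can be sketched as a simple comparison of operator labels, assuming labels increase in issue order. The dictionary layout and the function names `perceives_received` and `can_release` are assumptions made for this illustration; the check over several consumers reflects the case in which the output of the first operator has more than one consumer.

```python
from typing import Dict, Iterable, Tuple

# mapping[(observer, observed)] = label of the latest operator that
# `observer` knows `observed` has received (hypothetical layout).
Mapping = Dict[Tuple[int, int], int]


def perceives_received(mapping: Mapping, observer: int,
                       observed: int, op_label: int) -> bool:
    """True if the latest perceived operator follows `op_label`."""
    return mapping.get((observer, observed), -1) > op_label


def can_release(mapping: Mapping, producer_stream: int,
                consumers: Iterable[Tuple[int, int]]) -> bool:
    """Every (stream, label) consumer must be perceived as issued."""
    return all(perceives_received(mapping, producer_stream, stream, label)
               for stream, label in consumers)


# Stream 0 knows stream 1 up to label 6 and stream 2 up to label 3.
mapping = {(0, 1): 6, (0, 2): 3}
one_consumer = can_release(mapping, 0, [(1, 5)])
two_consumers = can_release(mapping, 0, [(1, 5), (2, 4)])
```

In this example the buffer can be released when its only consumer is label 5 on stream 1, but not when a second consumer with label 4 on stream 2 has not yet been perceived.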

[0106] For example, a mapping relationship can contain two data structures: a dependency table and an event list.

[0107] The dependency table is a two-dimensional table of size (number of streams) × (number of streams), with horizontal index x and vertical index y. Each entry indicates, for stream x, the operator label up to which stream y has been executed.

[0108] The event list is a list in which each element represents an event. The length of each element equals the number of streams, and the element indicates, for that event, the operator label up to which each of the other streams has been executed.
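A possible in-memory layout for these two structures is sketched below. The sentinel `NOT_SEEN`, the event id used as a dictionary key, and the concrete labels are assumptions chosen for illustration only.

```python
from typing import Dict, List

NOT_SEEN = -1  # hypothetical sentinel: nothing is known yet about that stream


def make_dependency_table(num_streams: int) -> List[List[int]]:
    """Square table: row x holds what stream x knows about every stream y."""
    return [[NOT_SEEN] * num_streams for _ in range(num_streams)]


dependency_table = make_dependency_table(3)
# The text describes a list; a dict keyed by event id is used here for lookup.
event_list: Dict[int, List[int]] = {}

# Issuing operators labeled 1, 2, 3 on streams 0, 1, 2 only touches the diagonal.
for stream, label in enumerate([1, 2, 3]):
    dependency_table[stream][stream] = label

# A record event (hypothetical event id 0, label 4) on stream 0 stores a copy of row 0.
dependency_table[0][0] = 4
event_list[0] = list(dependency_table[0])
```

Each event entry has one slot per stream, matching the statement that the length of each element equals the number of streams.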

[0109] By maintaining topological relationships through the dependency table and erasing reference events based on it, memory can be released early according to these topological relationships, so that the memory lifecycle is aligned with that of static memory management.

[0110] In one possible implementation, a sixth operator is sent to the second processing stream; a third operator is sent to the first processing stream; wherein the third operator is an operator sent after and associated with the sixth operator, and the sixth operator and the third operator are control operators for cross-stream operations from the second processing stream to the first processing stream; after the third operator is sent to the first processing stream, the mapping relationship is updated, and the latest received operator in the updated mapping relationship is the sixth operator.

[0111] In one possible implementation, the sixth operator is used to block the third operator, which is blocked by the sixth operator until the sixth operator completes execution.

[0112] For example, the sixth operator can be the Event record operator, and the third operator can be the Event wait operator.

[0113] When the first processing stream receives the third operator, and the sixth operator associated with the third operator is located in the second processing stream, the first processing stream can perceive that the latest operator sent by the second processing stream is the sixth operator. Since the sixth operator is sent after the second operator, it can be inferred that the first processing stream has perceived that the second processing stream has received the second operator, and thus can release memory space.

[0114] In one possible implementation, a sixth operator is sent to the second processing stream; a third operator is sent to the third processing stream; wherein the third operator is an operator sent after and associated with the sixth operator, and the sixth and third operators are control operators for cross-stream operations from the second processing stream to the third processing stream; a seventh operator is sent to the third or fourth processing stream; an eighth operator is sent to the first processing stream; wherein the eighth operator is an operator sent after and associated with the seventh operator, and the seventh and eighth operators are control operators for cross-stream operations from the third or fourth processing stream to the first processing stream; after the eighth operator is sent to the first processing stream, the mapping relationship is updated, and the latest received operator in the updated mapping relationship is the sixth operator.

[0115] For example, the sixth operator can be the Event record operator, and the third operator can be the Event wait operator.

[0116] For example, the seventh operator can be the Event record operator, and the eighth operator can be the Event wait operator.

[0117] When the first processing stream receives the eighth operator, although the seventh operator associated with the eighth operator is not located in the second processing stream, this cross-stream processing has historically passed through the second processing stream (based on the sixth and third operators). Therefore, the first processing stream can perceive that the latest operator sent by the second processing stream is the sixth operator. Since the sixth operator was sent after the second operator, it can be inferred that the first processing stream has perceived that the second processing stream has received the second operator, and thus can release memory space.
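The indirect case above can be traced with a small host-side simulation. This is a sketch under stated assumptions: `DependencyTracker` and its methods are hypothetical names, labels start at 0, and a record is assumed to snapshot its row after its own label has been written to the diagonal.

```python
from typing import Dict, List


class DependencyTracker:
    """Host-side bookkeeping only; no device work is modeled."""

    def __init__(self, num_streams: int) -> None:
        self.table: List[List[int]] = [[-1] * num_streams
                                       for _ in range(num_streams)]
        self.events: Dict[int, List[int]] = {}
        self._next_label = 0

    def _issue(self, stream: int) -> int:
        label = self._next_label
        self._next_label += 1
        self.table[stream][stream] = label  # a stream always knows itself
        return label

    def issue_regular(self, stream: int) -> int:
        return self._issue(stream)

    def issue_record(self, stream: int, event_id: int) -> int:
        label = self._issue(stream)
        self.events[event_id] = list(self.table[stream])  # snapshot the row
        return label

    def issue_wait(self, stream: int, event_id: int) -> int:
        snapshot = self.events.pop(event_id)
        label = self._issue(stream)
        # Element-wise maximum merges what the recording stream knew.
        self.table[stream] = [max(a, b)
                              for a, b in zip(self.table[stream], snapshot)]
        return label


FIRST, SECOND, THIRD = 0, 1, 2
tracker = DependencyTracker(3)
first_op = tracker.issue_regular(FIRST)
second_op = tracker.issue_regular(SECOND)
sixth_op = tracker.issue_record(SECOND, event_id=0)   # record on second stream
third_op = tracker.issue_wait(THIRD, event_id=0)      # wait on third stream
seventh_op = tracker.issue_record(THIRD, event_id=1)  # record on third stream
eighth_op = tracker.issue_wait(FIRST, event_id=1)     # wait on first stream

perceived = tracker.table[FIRST][SECOND]
```

After the eighth operator is issued, the first stream's row holds the label of the sixth operator for the second stream, which is larger than the label of the second operator, even though the first stream never waited on the second stream directly.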

[0118] For example, when issuing an event record operation, the event list can be updated by inserting the row of the dependency table corresponding to the current stream into the event list. The purpose of this step is to allow the dependency table state captured for the current event to be propagated to the waiting stream when the wait is issued later.

[0119] Before issuing operators, if a cross-stream memory reference is found, the dependency table is checked to see whether the producer has already executed for that stream. If not, an event record is issued to the producer stream and an event wait to the consumer stream. A reference event is attached to the referenced tensor and issued to the consumer stream. This step ensures the legality of memory usage and reuse even when cross-stream dependencies and record stream markers are not explicitly written in the script. It is also compatible with cross-stream dependency operations already present in the script, so the framework does not introduce additional cross-stream dependencies and the concurrency of different streams is not affected.

[0120] When issuing an event wait, the dependency table is updated with the corresponding event item in the event list by taking the maximum value for each element in the corresponding row, and the corresponding item in the event list is then cleared. All reference events on the stream whose labels are less than the values at the corresponding positions in the dependency table are erased. If a tensor has no remaining reference events, the tensor can be released.

[0121] The purpose of this step is to release cross-stream memory in advance according to the topological relationship, achieving the best memory reuse effect. It can be observed that after this processing, the lifecycle of the tensor is consistent with the graph's topological analysis results. Cross-stream tensor references can be released before returning from the underlying device side.
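The erasure step can be illustrated with a short function that takes one row of the dependency table and the reference events attached to each tensor. The tuple layout, tensor names, and label values below are made up for the example, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# A reference event is identified here by (stream it was issued on, its label).
RefEvent = Tuple[int, int]


def erase_reference_events(row: List[int],
                           refs: Dict[str, List[RefEvent]]) -> List[str]:
    """Drop reference events already covered by `row`.

    Returns the names of tensors left without any reference event,
    which are the tensors that can be released.
    """
    released = []
    for tensor, events in refs.items():
        remaining = [(stream, label) for stream, label in events
                     if not label < row[stream]]
        refs[tensor] = remaining
        if not remaining:
            released.append(tensor)
    return released


# Row of the waiting stream after the element-wise maximum update (made-up values).
row = [13, 9, 12]
refs = {
    "tensor0": [(1, 7)],   # label 7 < 9, so this reference event is erased
    "tensor1": [(2, 14)],  # label 14 is not < 12, so it is kept
}
released = erase_reference_events(row, refs)
```

Only the host-side table is consulted, which is why the release does not have to wait for the device to report completion.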

[0122] The dependency table and event list can be updated to the tables shown in Figure 6. The table header represents the streams, and each entry indicates which operator was most recently received by stream Y from the perspective of stream X. Since there are no cross-stream operations yet, only the data on the diagonal is updated.

[0123] Before issuing operator 4 (operator label 6 in the flow of Figure 5), a cross-stream memory reference is detected, requiring the insertion of a record event (label 4) and a wait event (label 5) to ensure the validity of the cross-stream memory reference. After label 5 is issued, the dependency table and event list are updated to the tables shown in Figure 7.

[0124] A reference event (label 7) is issued to extend the lifecycle of tensor 0. When issuing operator 5 (operator label 10 in the flow), a cross-stream memory reference is detected; as in the previous step, a record event (label 8) and a wait event (label 9) are issued.

[0125] The dependency table and event list are updated to the tables shown in Figure 8.

[0126] Before issuing operator 6 (not shown in the flow), a cross-stream reference is detected; as in the previous step, a record event (label 12) and a wait event (label 13) are issued. The dependency table and event list are updated to the tables shown in Figure 9A.

[0127] Erasing the reference event (label 7) allows tensor 0 to be freed, with timing consistent with that of the static graph.

[0128] In one possible implementation, the first operator and the second operator are identified from the algorithm script; if the algorithm script does not contain a cross-stream control operator between the first operator and the second operator, a fourth operator may be issued to the first processing stream after the first operator is issued, and a fifth operator may be issued to the second processing stream before the second operator is issued; the fifth operator is an operator issued after the fourth operator and associated with the fourth operator, and the fourth operator and the fifth operator are control operators for cross-stream operations from the first processing stream to the second processing stream.

[0129] In this embodiment of the application, cross-stream communication operators can be automatically issued when cross-stream operations are present, thereby reducing the user's operating costs.
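One way to picture this automatic insertion is as a rewrite of the issue sequence, sketched below. The tuple representation of the script, the operator names, and the prefix-based detection of existing control operators are simplifying assumptions; the sketch also assumes the producer appears before the consumer.

```python
from typing import List, Tuple

Op = Tuple[int, str]  # (stream index, operator name), hypothetical layout


def insert_cross_stream_control(script: List[Op], producer: str,
                                consumer: str) -> List[Op]:
    """Add a record after the producer and a wait before the consumer
    when the script has no such pair between them."""
    names = [name for _, name in script]
    p, c = names.index(producer), names.index(consumer)
    p_stream, c_stream = script[p][0], script[c][0]
    between = script[p + 1:c]
    has_record = any(s == p_stream and n.startswith("event_record")
                     for s, n in between)
    has_wait = any(s == c_stream and n.startswith("event_wait")
                   for s, n in between)
    if p_stream == c_stream or (has_record and has_wait):
        return list(script)  # nothing to add
    return (script[:p + 1] + [(p_stream, "event_record")] + between
            + [(c_stream, "event_wait")] + script[c:])


script = [(0, "op1"), (1, "other"), (1, "op2")]
patched = insert_cross_stream_control(script, "op1", "op2")
unchanged = insert_cross_stream_control(patched, "op1", "op2")
```

Running the function a second time leaves the sequence unchanged, which mirrors the requirement that control operators already written in the script are not duplicated.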

[0130] For example, the component structure of this embodiment is shown in Figure 9B. This implementation uses the open-source MindSpore deep learning framework to achieve memory reuse for static graphs. The component structure involves three modules: the MindSpore front-end Python graph construction module, the stream allocation module, and the runtime module. The runtime module adds the dependency table and event list data structures, as well as two sub-modules: an operator label generator and an online dependency calculation algorithm.

[0131] For example, the component structure of this embodiment is shown in Figure 9C. The scenario shown in Figure 9C is a dynamic graph. In a dynamic graph there is no concept of a computation graph; instead, single operators are issued according to the script, and the user determines the stream allocation for each operator. At runtime, online dependency calculation for memory can still be performed using reference counting to achieve optimal multi-stream memory reuse.

[0132] Referring to Figure 9D, Figure 9D is a schematic diagram of a process according to an embodiment of the present application, including:

[0133] S1: Operator issuance.

[0134] In typical deep learning scenarios, users write their own algorithms in Python, and the deep learning framework distributes the user-written operators to the device side according to the script. Depending on the operator type, different operations from S2 to S5 are performed.

[0135] S2: When issuing an event record operation, update the event list by inserting a row from the dependency table corresponding to the current stream into the event list.

[0136] The purpose of this step is to allow the dependency table state captured for the current event to be propagated to the waiting stream when the wait is issued later.

[0137] S3: Before issuing operators, if cross-stream memory references are found, first check the dependency table to see if the producer has already executed for that stream. If not, issue an event record to the producer stream and an event wait to the consumer stream. The referenced tensor is attached with a reference event and issued to the consumer stream.

[0138] The purpose of this step is to ensure the legality of memory usage and reuse even when the script does not explicitly specify cross-stream dependencies and record stream markers. It also ensures compatibility with cross-stream dependency operations in the script, preventing the introduction of additional cross-stream dependencies into the framework's processing and thus avoiding impacts on the concurrency of different streams.

[0139] S4: When issuing an event wait, update the dependency table with the corresponding event item in the event list by taking the maximum value for each element in the corresponding row, and then clear the corresponding item in the event list. Erase all reference events on this stream whose labels are less than the values at the corresponding positions in the dependency table. If a tensor has no remaining reference events, the tensor can be released.

[0140] The purpose of this step is to release cross-stream memory in advance according to the topological relationship, achieving the best memory reuse effect. It can be observed that after this processing, the lifecycle of the tensor is consistent with the graph's topological analysis results. Cross-stream tensor references can be released before returning from the underlying device side.

[0141] S5: All distributed operations trigger the operator label generator, updating the diagonal portion of the dependency table.

[0142] S6: Return to S1 and continue issuing operators in a loop.

[0143] Referring to Figure 10, Figure 10 is a schematic diagram of the structure of a data processing apparatus provided in an embodiment of this application. As shown in Figure 10, an embodiment of this application provides a data processing apparatus 1000, the apparatus comprising:

[0144] The transceiver module 1001 is used to send a first operator to the first processing stream and a second operator to the second processing stream; wherein the input of the second operator includes the output of the first operator;

[0145] For a detailed description of the transceiver module 1001, please refer to the description of step 501 in the above embodiment. The similarities will not be repeated here.

[0146] The memory management module 1002 is used to release memory space after the first processing stream determines that the second processing stream has received the second operator, and before the second processing stream finishes executing the second operator; wherein the memory space stores the output calculated by the first processing stream through the first operator.

[0147] For a detailed description of the memory management module 1002, please refer to the descriptions of steps 502 and 503 in the above embodiments. The similarities will not be repeated here.

[0148] In one possible implementation, the device further includes:

[0149] A mapping relationship management module is used to obtain mapping relationships, which include the latest operator received by the second processing stream as determined by the first processing stream;

[0150] The first processing stream determines that the second processing stream has received the second operator, including:

[0151] If the mapping relationship indicates that the latest operator received by the second processing stream as determined by the first processing stream is an operator following the second operator, the first processing stream determines that the second processing stream has received the second operator.

[0152] In one possible implementation, the mapping relationship management module is further configured to update the mapping relationship when a third operator is issued to the first processing stream, and the third operator is an operator issued after the second operator. The updated mapping relationship indicates that the operator most recently received by the second processing stream as determined by the first processing stream is the second operator. The third operator is a control operator for cross-stream operations performed between the first processing stream and other processing streams.

[0153] In one possible implementation, the first operator and the second operator are identified from a code file; the transceiver module 1001 is further configured to:

[0154] When the code file does not contain a cross-stream control operator between the first operator and the second operator, after the first operator is sent to the first processing stream, a fourth operator is sent to the first processing stream, and before the second operator is sent to the second processing stream, a fifth operator is sent to the second processing stream; the fourth operator and the fifth operator are control operators for cross-stream operations from the first processing stream to the second processing stream.

[0155] In one possible implementation, the transceiver module 1001 is further configured to:

[0156] A sixth operator is issued to the other processing streams; the sixth operator and the third operator are control operators for cross-stream operations from the other processing streams to the first processing stream; the sixth operator is used to block the third operator, and the third operator is used to be blocked by the sixth operator until the sixth operator completes execution.

[0157] In one possible implementation, the first processing stream and the second processing stream are used to perform the same processing task for deep learning.

[0158] In one possible implementation, the method is executed on the host side, while the first processing stream and the second processing stream are executed on the device side.

[0159] The following describes a terminal device provided in an embodiment of this application. Referring to Figure 11, Figure 11 is a schematic diagram of a terminal device provided in an embodiment of this application. The terminal device 900 can specifically be a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, etc., and is not limited thereto. Specifically, the terminal device 900 includes: a receiver 901, a transmitter 902, a processor 903, and a memory 904 (the terminal device 900 may have one or more processors 903; Figure 11 takes one processor as an example). The processor 903 may include an application processor 9031 and a communication processor 9032. In some embodiments of this application, the receiver 901, transmitter 902, processor 903, and memory 904 may be connected via a bus or other means.

[0160] Memory 904 may include read-only memory and random access memory, and provides instructions and data to processor 903. A portion of memory 904 may also include non-volatile random access memory (NVRAM). Memory 904 stores processor and operation instructions, executable modules, or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

[0161] Processor 903 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses are referred to as the bus system in the diagram.

[0162] The methods disclosed in the embodiments of this application can be applied to or implemented by the processor 903. The processor 903 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 903 or by instructions in software form. The processor 903 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor 903 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly manifested as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 904, and processor 903 reads the information from memory 904 and, in conjunction with its hardware, completes the steps of the above method.

[0163] Receiver 901 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 902 can be used to output digital or character information through the first interface; transmitter 902 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 902 may also include a display device such as a display screen.

[0164] This application also provides a server. Referring to Figure 12, Figure 12 is a schematic diagram of a server structure provided in an embodiment of this application. The server 1000 can vary significantly depending on configuration or performance. It may include one or more central processing units (CPUs) 1010 (e.g., one or more processors), memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) for storing application programs 1042 or data 1044. The memory 1032 and storage media 1030 can be temporary or persistent storage. The program stored in the storage media 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the CPU 1010 may be configured to communicate with the storage media 1030 and execute the series of instruction operations in the storage media 1030 on the server 1000.

[0165] Server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input / output interfaces 1058; or, one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0166] In this embodiment, the central processing unit 1010 is used to execute the actions described in the above embodiments.

[0167] This application also provides a computer program product that, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0168] This application also provides a computer-readable storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0169] The execution device, training device, or terminal device provided in this application embodiment can specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input / output interface, pins, or circuits. The processing unit can execute computer execution instructions stored in the storage unit to cause the chip within the execution device to execute the data processing method described in the above embodiments, or to cause the chip within the training device to execute the data processing method described in the above embodiments. Optionally, the storage unit can be a storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, such as random access memory (RAM).

[0170] Specifically, referring to Figure 13, Figure 13 is a schematic diagram of a chip provided in an embodiment of this application. The chip can be represented as a neural network processor (NPU) 1100. The NPU 1100 is mounted as a coprocessor on the host CPU, and tasks are assigned by the host CPU. The core part of the NPU is the arithmetic circuit 1103, which is controlled by the controller 1104 to extract matrix data from the memory and perform multiplication operations.

[0171] In some implementations, the arithmetic circuit 1103 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1103 is a two-dimensional systolic array. The arithmetic circuit 1103 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1103 is a general-purpose matrix processor.

[0172] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1102 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 1101 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 1108.
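The role of the accumulator can be illustrated in plain Python by computing C = A × B in slices along the shared dimension and summing the partial results. This is only an arithmetic illustration, not a model of the hardware; the function name and the `tile_k` parameter are assumptions.

```python
from typing import List

Matrix = List[List[int]]


def matmul_accumulate(a: Matrix, b: Matrix, tile_k: int = 2) -> Matrix:
    """Multiply a (m x k) by b (k x n), accumulating partial products."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]  # plays the role of the accumulator
    for k0 in range(0, k, tile_k):
        k1 = min(k0 + tile_k, k)
        for i in range(m):
            for j in range(n):
                # Partial result for this slice of the shared dimension.
                acc[i][j] += sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
    return acc


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
C = matmul_accumulate(A, B)
```

After the first slice the accumulator holds only partial sums; the final result is available once every slice has been added.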

[0173] Unified memory 1106 is used to store input and output data. Weight data is directly transferred to weight memory 1102 via Direct Memory Access Controller (DMAC) 1105. Input data is also transferred to unified memory 1106 via DMAC.

[0174] BIU stands for Bus Interface Unit, which is used for interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1109.

[0175] The Bus Interface Unit (BIU) 1110 is used by the instruction fetch buffer 1109 to fetch instructions from external memory, and also by the direct memory access controller 1105 to fetch the original data of the input matrix A or the weight matrix B from external memory.

[0176] The DMAC is mainly used to move input data from the external memory (DDR) to unified memory 1106, to move weight data to weight memory 1102, or to move input data to input memory 1101.

[0177] The vector computation unit 1107 includes multiple processing units that, when needed, further process the output of the computation circuit 1103, such as vector multiplication, vector addition, exponential operations, logarithmic operations, size comparisons, etc. It is mainly used for computation in non-convolutional / fully connected layers of neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.

[0178] In some implementations, vector computation unit 1107 can store the processed output vector in unified memory 1106. For example, vector computation unit 1107 can apply a linear function, or a nonlinear function, to the output of computation circuit 1103, such as linear interpolation of feature planes extracted by convolutional layers, or, for example, a vector of accumulated values, to generate activation values. In some implementations, vector computation unit 1107 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as activation input to computation circuit 1103, for example, for use in subsequent layers of the neural network.

[0179] The instruction fetch buffer 1109 connected to the controller 1104 is used to store the instructions used by the controller 1104;

[0180] Unified memory 1106, input memory 1101, weight memory 1102, and instruction fetch buffer 1109 are all on-chip memories. External memory is proprietary to this NPU hardware architecture.

[0181] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.

[0182] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0183] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0184] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0185] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims

1. A data processing method, characterized in that the method comprises: issuing a first operator to a first processing stream; issuing a second operator to a second processing stream, wherein an input of the second operator includes an output of the first operator; and releasing a memory space after the first processing stream determines that the second processing stream has received the second operator and before the second processing stream finishes executing the second operator, wherein the memory space stores the output calculated by the first processing stream through the first operator.

2. The method according to claim 1, characterized in that the method further comprises: obtaining a mapping relationship, wherein the mapping relationship includes the operator most recently received by the second processing stream as determined by the first processing stream; and the first processing stream determining that the second processing stream has received the second operator comprises: if the mapping relationship indicates that the operator most recently received by the second processing stream, as determined by the first processing stream, is an operator following the second operator, the first processing stream determines that the second processing stream has received the second operator.

3. The method according to claim 2, characterized in that the method further comprises: when a third operator is issued to the first processing stream and the third operator is an operator issued after the second operator, updating the mapping relationship, wherein the updated mapping relationship indicates that the operator most recently received by the second processing stream, as determined by the first processing stream, is the second operator; and the third operator is a control operator for a cross-stream operation performed between the first processing stream and other processing streams.

4. The method according to any one of claims 1 to 3, characterized in that the first operator and the second operator are identified from a code file, and the method further comprises: when the code file does not contain a cross-stream control operator between the first operator and the second operator, issuing a fourth operator to the first processing stream after the first operator is issued to the first processing stream, and issuing a fifth operator to the second processing stream before the second operator is issued to the second processing stream, wherein the fourth operator and the fifth operator are control operators for a cross-stream operation from the first processing stream to the second processing stream.

5. The method according to claim 3 or 4, characterized in that the method further comprises: issuing a sixth operator to the other processing streams, wherein the sixth operator and the third operator are control operators for a cross-stream operation from the other processing streams to the first processing stream; the sixth operator is used to block the third operator, and the third operator remains blocked until the sixth operator completes execution.

6. The method according to any one of claims 1 to 5, characterized in that the first processing stream and the second processing stream are used to implement the same deep learning processing task.

7. The method according to any one of claims 1 to 6, characterized in that the method is executed on a host side, and the first processing stream and the second processing stream are executed on a device side.

8. A data processing apparatus, characterized in that the apparatus comprises: a transceiver module, configured to issue a first operator to a first processing stream and issue a second operator to a second processing stream, wherein an input of the second operator includes an output of the first operator; and a memory management module, configured to release a memory space after the first processing stream determines that the second processing stream has received the second operator and before the second processing stream finishes executing the second operator, wherein the memory space stores the output calculated by the first processing stream through the first operator.

9. The apparatus according to claim 8, characterized in that the apparatus further comprises: a mapping relationship management module, configured to obtain a mapping relationship, wherein the mapping relationship includes the operator most recently received by the second processing stream as determined by the first processing stream; and the first processing stream determining that the second processing stream has received the second operator comprises: if the mapping relationship indicates that the operator most recently received by the second processing stream, as determined by the first processing stream, is an operator following the second operator, the first processing stream determines that the second processing stream has received the second operator.

10. The apparatus according to claim 9, characterized in that the mapping relationship management module is further configured to update the mapping relationship when a third operator is issued to the first processing stream and the third operator is an operator issued after the second operator, wherein the updated mapping relationship indicates that the operator most recently received by the second processing stream, as determined by the first processing stream, is the second operator; and the third operator is a control operator for a cross-stream operation performed between the first processing stream and other processing streams.

11. The apparatus according to any one of claims 8 to 10, characterized in that the first operator and the second operator are identified from a code file, and the transceiver module is further configured to: when the code file does not contain a cross-stream control operator between the first operator and the second operator, issue a fourth operator to the first processing stream after the first operator is issued to the first processing stream, and issue a fifth operator to the second processing stream before the second operator is issued to the second processing stream, wherein the fourth operator and the fifth operator are control operators for a cross-stream operation from the first processing stream to the second processing stream.

12. The apparatus according to claim 10 or 11, characterized in that the transceiver module is further configured to: issue a sixth operator to the other processing streams, wherein the sixth operator and the third operator are control operators for a cross-stream operation from the other processing streams to the first processing stream; the sixth operator is used to block the third operator, and the third operator remains blocked until the sixth operator completes execution.

13. The apparatus according to any one of claims 8 to 12, characterized in that the first processing stream and the second processing stream are used to implement the same deep learning processing task.

14. The apparatus according to any one of claims 8 to 13, characterized in that the apparatus runs on a host side, and the first processing stream and the second processing stream are executed on a device side.

15. A computer storage medium, characterized in that the computer storage medium stores one or more instructions which, when executed by one or more computers or processors, cause the one or more computers or processors to perform the method according to any one of claims 1 to 7.

16. A computer program product, characterized in that the computer program product comprises computer-readable instructions which, when executed on a computer device or processor, cause the computer device or processor to perform the method according to any one of claims 1 to 7.

17. A system, comprising at least one processor and at least one memory, wherein the at least one processor and the at least one memory are connected via a communication bus; the at least one memory is used to store code; and the at least one processor is used to execute the code to perform the method according to any one of claims 1 to 7.

18. A chip, comprising a processor, characterized in that the processor is used to support a data processing apparatus in implementing the method according to any one of claims 1 to 7.
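
Illustrative sketch (not part of the claims). The host-side bookkeeping recited in claims 1 to 5 can be pictured with the minimal Python model below. It is only one possible reading under stated assumptions: the names `ControlModule`, `issue`, `sync`, and `try_release`, the stream labels `"s1"` and `"s2"`, and the "record"/"wait" wording for the paired control operators are all hypothetical and do not come from this application. No operator is executed; the model only tracks what has been issued, and each buffer is assumed to have a single consumer.

```python
from collections import defaultdict


class ControlModule:
    """Hypothetical host-side model: tracks issued operators, executes nothing."""

    def __init__(self):
        self.issued = defaultdict(int)  # stream -> count of operators issued so far
        # seen[a][b]: sequence number of the latest operator on stream b that
        # stream a has determined b to have received (0 means none yet).
        self.seen = defaultdict(lambda: defaultdict(int))
        self.pending = {}   # buffer -> (producer stream, consumer stream, consumer seq)
        self.released = []  # buffers released so far, in release order
        self.log = []       # (stream, seq, name) for every issued operator

    def issue(self, stream, name, reads=(), writes=None):
        """Issue one operator to a stream and return its per-stream sequence number."""
        self.issued[stream] += 1
        seq = self.issued[stream]
        self.log.append((stream, seq, name))
        if writes is not None:
            self.pending[writes] = (stream, None, None)
        for buf in reads:
            producer, _, _ = self.pending[buf]
            self.pending[buf] = (producer, stream, seq)
        return seq

    def sync(self, src, dst):
        """Issue a paired cross-stream control operation from src to dst.

        The operator on dst is blocked until the operator on src completes,
        so dst can record everything issued to src up to this point.
        """
        self.issue(src, "record")
        self.issue(dst, "wait")
        self.seen[dst][src] = self.issued[src]  # update the mapping relationship

    def try_release(self, buf):
        """Release buf once its producer stream knows the consumer was received."""
        producer, consumer, seq = self.pending[buf]
        if consumer is None or self.seen[producer][consumer] < seq:
            return False
        del self.pending[buf]
        self.released.append(buf)
        return True


cm = ControlModule()
cm.issue("s1", "op1", writes="buf")    # first operator, output stored in buf
cm.sync("s1", "s2")                    # fourth/fifth operators: s2 waits on s1
cm.issue("s2", "op2", reads=["buf"])   # second operator consumes buf
early = cm.try_release("buf")          # s1 has no record of op2 yet
cm.sync("s2", "s1")                    # sixth/third operators: s1 waits on s2
late = cm.try_release("buf")           # mapping now covers op2
```

In this model the release decision depends only on issue order recorded on the host, never on whether the second operator has finished on the device, which is the property claim 1 relies on to shorten the wait before the memory space is released.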