Tensor processing method, apparatus, and system
The method optimizes tensor processing by managing memory and iterating through operations in loops, addressing the challenge of balancing performance and resource constraints in limited environments, enhancing efficiency and reducing power consumption.
Patent Information
- Authority / Receiving Office
- JP
- Patent Type
- Applications
- Current Assignee / Owner
- AIMOTIVE KFT
- Filing Date
- 2024-06-26
- Publication Date
- 2026-07-02
AI Technical Summary
High-performance computing systems struggle with balancing processing performance and memory consumption, particularly in limited-resource environments such as portable or on-board platforms, especially when handling complex tasks like machine learning computations and tensor data structures.
A method for processing tensors that involves iterating through operations in loops, managing memory usage efficiently by deallocating unused data and using a local memory unit to reduce external memory access, with strategies like depth-first search or breadth-first search to optimize resource utilization.
Enhances processing efficiency and reduces power consumption by minimizing memory access and optimizing resource use, particularly beneficial for devices with limited form factor or silicon area.
Description
Technical Field
[0001] This disclosure relates to optimizations in tensor processing, as well as related methods, media, devices, and systems.
Background Art
[0002] Modern data computing is increasing the need for high-performance computing systems, which often involve high-speed processors and high memory consumption. However, standard high-performance computing systems cannot be used on certain platforms such as portable or on-board platforms, and there is often a trade-off between the required processing performance and affordable computing means. This is particularly true of on-board processing systems, which may need to perform complex tasks with limited computing resources, such as machine learning computations involving complex data structures (e.g., tensor data structures) and complex processes including convolutional neural networks (CNNs).
Summary of the Invention
[0003] There is a need for optimization of processing techniques to better meet performance expectations with limited computing resources (e.g., in terms of processing power and memory) and reduced power consumption. Therefore, it is an object of the present invention to address the drawbacks identified above.
[0004] The present invention is defined by a method, one or more computer-readable media, and a device, as described in the independent claims. Preferred embodiments are specified in the dependent claims.
[0005] A first aspect of the present disclosure provides a method for processing a tensor. The method includes the steps of: providing components of a first tensor in a memory unit; determining one or more first operations that use at least the components of the first tensor as inputs; and iterating through the one or more first operations in a first loop. For a current operation of the one or more first operations in the first loop, if there is sufficient data in the first tensor containing the components of the first tensor for the current operation to calculate components of a second tensor, the iteration includes the steps of: calculating the components of the second tensor using the current operation applied to at least the components of the first tensor; writing the components of the second tensor to the memory unit; deallocating at least a portion of the components of the first tensor from the memory unit if that portion is no longer needed by any subsequent operation in the first loop; and starting and iterating through a second loop embedded in the first loop, wherein the second loop includes one or more second operations that use the components of the second tensor as input. If there is not sufficient data in the first tensor for the current operation to calculate the components of the second tensor, the iteration continues with a subsequent operation of the one or more first operations in the first loop.
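The iteration described in the first aspect can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `Op`, `Loop`, `run_loop`, the `window` field (input rows needed per output row), and the row-indexed dictionary used as the memory unit are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = List[float]
Memory = Dict[str, Dict[int, Row]]   # tensor name -> {row index: row data}


@dataclass
class Op:
    name: str
    window: int                        # input rows needed per output row (assumed)
    fn: Callable[[List[Row]], Row]     # maps a window of input rows to one output row
    out: str                           # tensor this operation writes to
    next_row: int = 0                  # first input row of the next window


@dataclass
class Loop:
    src: str                           # tensor read by every operation in the loop
    ops: List[Op]
    children: Dict[str, "Loop"] = field(default_factory=dict)  # keyed by output tensor


def run_loop(loop: Loop, memory: Memory) -> None:
    rows = memory.setdefault(loop.src, {})
    for op in loop.ops:
        needed = range(op.next_row, op.next_row + op.window)
        if not all(i in rows for i in needed):
            continue                   # not enough data: go on to the next operation
        out_row = op.fn([rows[i] for i in needed])
        memory.setdefault(op.out, {})[op.next_row] = out_row   # write the component
        op.next_row += 1
        # Deallocate input rows that no operation in this loop will read again.
        lowest = min(o.next_row for o in loop.ops)
        for i in [i for i in rows if i < lowest]:
            del rows[i]
        child = loop.children.get(op.out)
        if child is not None:
            run_loop(child, memory)    # embedded loop; control returns here afterwards
```

Calling `run_loop` each time a new row of the first tensor arrives lets downstream operations run as soon as their input window is complete, while rows that are no longer needed are dropped from memory.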
[0006] In a preferred embodiment, one or more first operations and one or more second operations may be operations on an operation graph that show input and output dependencies between operations. The input and output dependencies may be, for example, tensor dependencies between output tensors generated by operations, which can be used as input tensors for subsequent operations. An operation that generates inputs for other operations may be designated as a parent operation, and / or other operations that receive outputs may be designated as child operations. Thus, one or more loop-embedded parent operations of child operations may be designated as the parent loop of each child operation or the loop containing each child operation. Conversely, one or more loop-embedded child operations of a parent operation may be designated as the child loop of each parent operation or the parent loop containing each parent operation. For example, the first loop may be the parent loop of a second loop based on an operation graph.
[0007] In one embodiment, the operation graph may represent one or more operations performed on an input tensor to generate an output tensor. Thus, the execution of the entire operation graph on the input tensor may result in an output tensor. In one embodiment, the first tensor may be an input tensor to the operation graph. Additionally or alternatively, the second tensor may be an output tensor.
[0008] In a particularly preferred embodiment, the method may include the steps of: determining whether the first loop is embedded in another parent loop if the current operation is the last operation in the first loop; and, if the first loop is embedded in another parent loop, returning to the other parent loop and continuing the iteration through one or more operations in the other parent loop.
[0009] According to another embodiment, the iteration through the second loop includes, for a second current operation of the one or more second operations in the second loop, if there is sufficient data in the second tensor containing the components of the second tensor for the second current operation to calculate components of a third tensor, the steps of: calculating the components of the third tensor using the second current operation applied to the components of the second tensor; writing the components of the third tensor to the memory unit; deallocating at least a portion of the components of the second tensor from the memory unit if that portion is no longer needed by any subsequent operation in the second loop; and starting and iterating through a third loop embedded in the second loop, wherein the third loop includes one or more third operations that use at least the components of the third tensor as input. If there is not sufficient data in the second tensor for the second current operation to calculate the components of the third tensor, the method may continue with a subsequent operation of the one or more second operations in the second loop. According to some aspects of the present disclosure, the method for processing a tensor may preferably follow a depth-first search (DFS) strategy. However, the present disclosure is not limited to the application of such a strategy, and other types of graph traversal strategies are also contemplated. For example, the method may alternatively apply a breadth-first search (BFS) strategy or a hybrid form including DFS and BFS.
[0010] In yet another embodiment, for one or more current operations of the first operation in the first loop, if there is sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor, the method may further include the steps of allocating memory space for the components of the second tensor in a memory unit and writing the components of the second tensor to the allocated memory space in the memory unit.
[0011] According to another embodiment, the first tensor may be a three-dimensional tensor having horizontal, vertical, and depth dimensions, and the components of the first tensor may be groups of data of the first tensor arranged at least partially along one dimension. The first tensor may be defined as having x elements (components) in the horizontal dimension, y elements (components) in the vertical dimension, and / or z elements (components) in the depth dimension. Preferably, the components of the first tensor are groups of data of the first tensor arranged at least partially along the horizontal and / or depth dimensions. Alternatively, the components of the first tensor are groups of data of the first tensor arranged at least partially along the vertical and / or depth dimensions. Furthermore, iterating over one or more first operations in the first loop may include determining, for the current operation of one or more first operations in the first loop, whether there is enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of a second tensor using the components of the first tensor. In some embodiments, this determination may include determining whether the available data in the first tensor on which the current operation is performed extends across the entire depth dimension. In one embodiment, if it is determined that the available data extends across the entire depth dimension, it is determined that there is sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor. Additionally or alternatively, if it is determined that the available data does not extend across the entire depth dimension, it is determined that there is not sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor.
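The depth-based sufficiency determination might be sketched as follows. The bookkeeping structure `received` (a mapping from row index to the set of depth indices already stored) and the function name `has_sufficient_data` are hypothetical; the disclosure does not prescribe how availability is tracked.

```python
from typing import Dict, Set


def has_sufficient_data(received: Dict[int, Set[int]], row: int, depth: int) -> bool:
    """Return True only if every depth index 0..depth-1 of `row` is available."""
    return received.get(row, set()) >= set(range(depth))


# Row 0 is complete across a depth of 3; row 1 still lacks depth index 2.
received = {0: {0, 1, 2}, 1: {0, 1}}
print(has_sufficient_data(received, 0, 3))
print(has_sufficient_data(received, 1, 3))
```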
[0012] In a preferred embodiment, one or more first operations may be kernel operations, each having a kernel having a respective kernel size, such as a filter kernel operation, and computing a component of a second tensor using the current operation applied to at least the component of a first tensor includes applying the kernel of the current operation to the data of the component of the first tensor. Preferably, the method may further include determining, based on the maximum kernel size of the kernel of the next operation in the first loop, whether at least any part of the component of the first tensor is no longer needed by any subsequent operation in the first loop. Throughout this disclosure, the kernel may be referred to as a filter kernel or simply a filter. The kernel is a matrix of weights (typically 3x3, 5x5, 7x7, or any particular size) applied to a tensor (e.g., a first tensor) or a component of a tensor (e.g., a component of the first tensor), preferably through a convolution operation. This may involve sliding the kernel over the tensor, preferably taking the element-wise product of the overlapping entries in the tensor's components with each entry in the kernel, and then summing them up to produce a single output value. This process is repeated over the entire tensor to obtain a new tensor. The tensor and / or the new tensor may be called a feature map or convolutional layer input / output.
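As an illustration of the kernel operation described above, the following Python sketch slides a kernel over a two-dimensional tensor without padding and with a stride of one (the kernel is not flipped, as is common in CNN implementations). The helper `first_row_still_needed` shows one possible reading of the kernel-size-based deallocation test; that reading, the stride-one assumption, and both function names are assumptions made here, not details taken from the disclosure.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(tensor: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `tensor`; sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(tensor) - kh + 1
    out_w = len(tensor[0]) - kw + 1
    return [
        [
            sum(tensor[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def first_row_still_needed(latest_row: int, kernel_heights: List[int]) -> int:
    """Lowest input row that the next window of the tallest kernel will read.

    Assumes stride 1 and that the next window ends at row `latest_row + 1`;
    rows below the returned index could be deallocated.
    """
    return max(0, latest_row - max(kernel_heights) + 2)
```

For example, with rows 0 to 4 received and kernels of height 1 and 3 in the loop, the next height-3 window covers rows 3 to 5, so rows 0 to 2 would be candidates for deallocation under these assumptions.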
[0013] In one embodiment, the method may further include, for one or more current operations of a first operation in a first loop, marking the current operation as processed if there is sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of a second tensor using the components of the first tensor. For example, such marking may be performed after or before computing the components of the second tensor using the current operation applied to at least the components of the first tensor.
[0014] In another embodiment, the memory unit may be the device's local memory unit. In this embodiment, providing the components of the first tensor into the memory unit may include receiving the components of the first tensor from an external memory and storing the components of the first tensor in the device's local memory unit.
[0015] A second aspect of the present disclosure defines one or more computer-readable media containing instructions that, when executed on the device, configure the device to perform a method according to any embodiment of the first aspect of the present disclosure.
[0016] A third aspect of the present disclosure provides a device comprising a local memory unit and a processing unit, the processing unit configured to perform a method according to any embodiment of the first aspect of the present disclosure.
[0017] Preferably, a fourth aspect of the present disclosure provides a computing system comprising at least one processor, system memory, and at least one device. The at least one device may be a device according to a third aspect of the present disclosure. In a preferred embodiment, the at least one processor may be configured to store an input tensor in system memory, provide components of the input tensor to at least one device, receive an output tensor from at least one device, and store the output tensor in system memory.
[0018] Many of the associated features will be more readily appreciated by reference to the following detailed description, considered in conjunction with the attached drawings.
[Brief explanation of the drawings]
[0019] The specific features, aspects, and advantages of this disclosure will be better understood in relation to the following description and accompanying drawings.
[Figure 1A] Shows an exemplary computing system and device according to an embodiment of the present disclosure.
[Figure 1B] Shows an example of a tensor and its components according to an embodiment of the present disclosure.
[Figure 2] Shows an example of a computation graph associated with a device according to an embodiment of the present disclosure.
[Figure 3A] Shows a flowchart of a method according to an embodiment of the present disclosure.
[Figure 3B] Shows additional exemplary steps of a method according to an embodiment of the present disclosure.
[Figure 3C] Shows details of a method according to an embodiment of the present disclosure.
[Figure 4] Shows an exemplary flowchart of a method for processing an input tensor according to an embodiment of the present disclosure.
[Figure 5A] Shows another example of a computation graph according to an embodiment of the present disclosure.
[Figure 5B] Shows a snapshot of tensor memory allocation during the processing of operations in the computation graph of FIG. 5A according to an embodiment of the present disclosure.
Best Mode for Carrying Out the Invention
[Detailed Description]
[0020] In the following description, reference is made to the drawings, which show various embodiments by way of example. Various embodiments are also described below with reference to several examples. It should be understood that embodiments may include changes in design and structure without departing from the scope of the subject matter described in the claims. The examples are not intended to represent the only forms in which the embodiments can be configured or utilized; the same or equivalent functions and sequences may be achieved by different examples. Further, as used in this application and the claims, the singular forms "a", "an", and "the" include the plural unless the context clearly dictates otherwise. Further, the term "includes" means "comprises". Also, the term "coupled" encompasses mechanical, electrical, magnetic, optical, and other practical ways of coupling or connecting items, and does not exclude the presence of intermediate elements between the coupled items.
[0021] The techniques described herein may be implemented in various computing environments, examples of which are described in more detail below. Such environments generally involve the use of a suitably configured computing device that implements multiple modules, each of which provides one or more operations necessary to complete the execution of such techniques. Each module may be implemented in its own way, and not all need to be implemented in the same way. As used herein, a module may be a structural component of a system that performs an operational role, which may be part or all of a software element (e.g., a function of a process, a discrete process, or any other suitable implementation). A module may include computer-executable instructions and may be encoded on a computer storage medium. Modules may be executed in parallel or serially as needed and may exchange information with each other using shared memory on the executing computer, using a message-passing protocol, or in any other suitable way.
[0022] FIG. 1A shows an exemplary computing system 100 and device 101 according to an embodiment of the present disclosure. The computing system 100 may be provided to perform operations (e.g., arithmetic operations) or calculations on one or more tensors. A tensor may represent a multi-dimensional dataset. For example, a tensor may represent a data matrix of two or more dimensions.
[0023] FIG. 1B shows a tensor as a 3D data array having a horizontal dimension, a vertical dimension, and a depth dimension. Other types of tensors may be used and are contemplated by the present disclosure, such as zero- or one-dimensional datasets (sometimes called scalars and vectors, respectively), two-dimensional datasets (sometimes called matrices), three-dimensional datasets, four-dimensional datasets, or datasets of any other dimension.
[0024] The computing system 100 may include at least one processor (not shown) and system memory, and may be configured to perform at least one global computing task based on input data. The computing system 100 may also include data to be processed, stored in system memory, which is processed to obtain an output result. The global computing task may include one or more mathematical operations, including floating-point operations, integer operations, or Boolean operations, and may include one or more operations on one or more tensors. In some embodiments, at least some of the operations of the global computing task may represent operations of a convolutional neural network (CNN).
[0025] To efficiently handle global computing tasks, one or more devices 101 associated with the computing system 100 may be configured to process data to be processed and generate the required output results. For example, as shown in Figure 1A, at least one device 101 may communicate with the computing system 100 (e.g., at least one processor and / or system memory of the computing system 100) to process input data associated with one or more computational operations and return the respective output data to the computing system. The device 101 may be located within the computing system 100 (e.g., integrated within the computing system 100). For example, it may be a separate, dedicated device having its own processing unit within the computing system 100, such as an accelerator, CPU, or GPU platform, and / or may be embodied as a SoC (system on a chip), etc. For example, device 101 may be an accelerator device and / or may be at least partially implemented using a GPU-based component connected to the computing system 100. In other embodiments, device 101 may communicate with computing system 100 via (local) intermediate means (e.g., a bus, interconnect, or network) and / or be remote from computing system 100 and communicate with computing system 100 using network connectivity. Device 101 may include a local memory unit and / or a processing unit. The processing unit may be an arithmetic unit and may have one or more processing cores.
[0026] One or more computational operations associated with device 101 may include all operations of the global computation task, or only some of them. For example, one or more computational operations associated with device 101 may correspond to one or more operations of a particular subtask of the global computation task. In this case, the computing system 100 may comprise multiple devices 101, each configured to process a respective subtask of the global computation task. Each device 101 of the multiple devices 101 may be configured to process an associated subtask of the global computation task. The processing may include cooperative interaction with one or more other devices 101, and some of the multiple devices 101 may be configured to execute some subtasks of the global computation task in parallel. Each device 101 of the multiple devices 101 may receive its respective input data from system memory and return its respective output data to system memory.
[0027] In one embodiment, the data to be processed according to the global computing task may be one or more images or image frames (e.g., RGB and / or depth image data) captured by an imaging device and acquired by the computing system 100, and the global computing task may include image data processing subtasks based on one or more images or image frames. For example, the global computing task may include image filtering processing preferably performed in real time or near real time, and / or aimed at extracting one or more features from one or more images or image frames. However, the data to be processed is not limited to one or more images or image frames, and may alternatively or additionally include any combination of data from other hardware or composite sensors, such as acoustic data, radio wave data, lidar data, radar data, etc.
[0028] A global computation task may be based on one or more operations associated with a tensor and / or a CNN. For example, in one embodiment, the computing system 100 may be configured to pass one or more portions of the data to be processed to a device 101 formatted as one or more components of an input tensor. The device 101 may be configured to perform one or more associated computation operations based on one or more components of the input tensor to generate one or more components of an output tensor. The device 101 may then return one or more components of the output tensor to the computing system 100, for example, to be stored in system memory. For example, one or more components of the output tensor received from the device 101 may be written to system memory outside the device 101 (i.e., external memory of the device 101). In one embodiment, the device 101 may be configured to return one or more components of the output tensor only if all components of the output tensor (e.g., all components constituting the output tensor) have been generated.
[0029] In one embodiment, any tensor involved in one or more computational operations (e.g., an input tensor, an arbitrary intermediate tensor, and / or an output tensor) may be a 2D data array (e.g., a 2D matrix). In such a case, the components of the tensor may correspond to data lines or groups of data lines. For example, if the tensor has horizontal and vertical dimensions, the components may correspond to multiple vertically adjacent lines.
[0030] In one embodiment, device 101 may be configured to sequentially receive and process one or more components of an input tensor received from computing system 100, and sequentially generate the corresponding one or more components of an output tensor. For example, some components of the input tensor may be sequentially supplied to device 101 and processed in parallel within device 101 as soon as they are received to sequentially generate the corresponding components of the output tensor. This has the advantage of accelerating the generation of all components of the output tensor.
[0031] In one embodiment, considering a case where the data to be processed includes one or more images or image frames, the computing system 100 may be configured to acquire one or more portions of the image frame of the data to be processed sequentially and transmit them to the device 101 as one or more components of an input tensor. In one embodiment, at least a portion of the one or more portions of the image frame may correspond to sequential data acquired from scanning a portion of the image frame. For example, the portion may be acquired by scanning the frame horizontally or vertically (e.g., according to the raster scan order) and / or may correspond to sequential adjacent data or adjacent data blocks within the frame. For example, the portion may correspond to adjacent pixel data (e.g., a pixel row, a portion of a pixel row, or sequential data in one or more rows of the image frame). Additionally or alternatively, the portion of the image frame may correspond to data in adjacent blocks of pixels in the image frame (e.g., an 8x8 pixel cell).
[0032] In one embodiment, for example, one or more components of an input tensor received by device 101 from computing system 100 may be one or more data rows of a frame (e.g., one or more pixel rows of an image frame). As soon as one data row is received in the input tensor, device 101 may begin processing the data row and perform one or more computational operations related to device 101 based on the data row. In one embodiment, device 101 may be particularly configured to process one or more sequentially received data rows in parallel, for example, processing each data row according to one or more computational operations of device 101 as soon as each data row is received.
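The row-by-row behaviour can be illustrated with a short Python generator. It is a sequential sketch only (the parallel processing mentioned above is not modelled), and `process_stream` and its parameters are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List

Row = List[float]


def process_stream(rows: Iterable[Row], op: Callable[[Row], Row]) -> Iterator[Row]:
    """Yield one output row per input row, as soon as that input row arrives."""
    for row in rows:
        yield op(row)


# Each output row is produced before the next input row is requested.
for out_row in process_stream([[1, 2], [3, 4]], lambda r: [2 * v for v in r]):
    print(out_row)
```

Because the generator is lazy, no output depends on the whole frame being present; only the row currently being processed has to be held.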
[0033] In some embodiments, the input tensor may correspond to a three-dimensional tensor (e.g., a 3D data array) having horizontal, vertical, and depth dimensions. This is illustrated in Figure 1B. One of the one or more components of the input tensor received by device 101 may correspond to a group of data arranged at least partially along a certain dimension, for example, the horizontal dimension (e.g., x dimension), vertical dimension (e.g., y dimension), and / or depth dimension (e.g., z dimension). For example, if one or more components of the input tensor are received sequentially from a computing system, one or more components of the input tensor may be stored sequentially in the input tensor according to the order of arrival along the vertical dimension (e.g., y dimension). In this example, a stack of components stored sequentially in the input tensor may be constructed. For example, each component may correspond to a slice, a portion of a slice, or a group of one or more portions of a slice in the input tensor. One or more components of the input tensor may correspond to one or more data rows in the input tensor (e.g., juxtaposed in the horizontal dimension and / or each arranged across the entire depth dimension). In one embodiment, data forming one or more components of an input tensor may be received in the depth direction. In one embodiment, components of an input tensor may include one or more adjacent data rows that are arranged across the entire depth and cover a limited range in the vertical and / or horizontal directions. For example, a component of an input tensor may be considered successfully received only if its data is arranged across the entire depth in the input tensor.
[0034] According to one embodiment of the present disclosure, one or more computational operations related to device 101 may require a large number of memory accesses. However, while the system memory of the computing system 100 typically provides a large memory capacity, it can be slow and consume a lot of power. To avoid a decrease in computational throughput due to multiple memory accesses to the system memory and to minimize power consumption, device 101 may include a (local) memory unit.
[0035] For example, the (local) memory unit may preferably be configured to be faster and / or more power-efficient than the system memory. For example, the memory unit may be implemented as one or more dedicated cache memories and / or include one or more SRAMs and / or one or more high-speed DRAMs, but this disclosure is not limited to any particular hardware implementation. Device 101 may be configured to store in the (local) memory unit one or more components of an input tensor, one or more components of an output tensor, and / or one or more components of one or more intermediate tensors corresponding to one or more intermediate results of one or more computational operations. Thus, by using the (local) memory unit in device 101, the amount of memory access to (external) system memory can be reduced, thereby reducing the power consumption caused by external memory access. Furthermore, it is possible to maximize the data rate and / or throughput of the processing unit of device 101.
[0036] However, the size of the (local) memory unit may be limited because the space on the device 101 may be limited, for example, by constraints in terms of a limited form factor or silicon mounting area or chip area. Depending on the amount of data being processed, the device may be configured to implement memory management techniques for the (local) memory unit to avoid costly access to system memory.
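The interplay between a small local memory unit and slower external memory might be sketched as below. `LocalMemory`, its component-count `capacity`, and the `external_reads` counter are assumptions for illustration; a dictionary stands in for system memory, and no particular hardware behaviour is implied.

```python
from typing import Dict, List


class LocalMemory:
    """Toy model of a size-limited local memory unit in front of external memory."""

    def __init__(self, external: Dict[str, List[float]], capacity: int) -> None:
        self.external = external       # stands in for (external) system memory
        self.capacity = capacity       # maximum number of components held locally
        self.store: Dict[str, List[float]] = {}
        self.external_reads = 0        # counts costly external accesses

    def load(self, key: str) -> List[float]:
        if key not in self.store:
            if len(self.store) >= self.capacity:
                raise MemoryError("local memory full; deallocate a component first")
            self.store[key] = self.external[key]
            self.external_reads += 1
        return self.store[key]

    def free(self, key: str) -> None:
        """Deallocate a component that is no longer needed."""
        self.store.pop(key, None)
```

Repeated `load` calls for a component already held locally do not touch external memory, and `free` makes room for new components once earlier ones are no longer needed.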
[0037] Figure 2 shows an illustrative graph of one or more computational operations. For example, the graph may be an operation graph and may correspond to one or more computational operations associated with device 101, as shown in Figure 1A. The operation graph may represent one or more operations performed on an input tensor to generate an output tensor. Each operation may receive a tensor as input and produce an output (intermediate) tensor that can be supplied as input to other operations. For example, device 101 may be configured to receive one or more components of an input tensor and process one or more components of the input tensor according to the operation graph to generate one or more corresponding components of an output tensor.
[0038] An operation graph may define computational interdependencies between operations. It may contain any number N of operations, where N is an integer greater than or equal to 1. Figure 2 shows an exemplary operation graph containing nine operations (operation 0, operation 1, ..., operation 8). For example, operation 0 of the operation graph is performed on an input tensor. The result is an intermediate tensor T1, which then serves as input to operations 1 and 2 to produce another intermediate tensor T2. Furthermore, T2 is used as input by operations 3, 4, and 5. The results of operations 3 and 4 are provided to intermediate tensor T3, which itself is used as input by operations 6 and 7. However, the result of operation 5 is not provided to intermediate tensor T3, but is combined with the results of operations 6 and 7 to provide intermediate tensor T4. Finally, intermediate tensor T4 is used as input by operation 8 to produce an output tensor. Thus, the operation graph illustrates the input and output tensor dependencies between operations in the operation graph.
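The dependencies of the exemplary graph in Figure 2 can be written down as a small Python table mapping each operation to its input and output tensor. The dictionary layout and the helper names `children` and `parents` are illustrative assumptions; only the tensor dependencies themselves come from the description above.

```python
# operation -> (input tensor, output tensor), following the description of Figure 2
OPS = {
    0: ("Input", "T1"),
    1: ("T1", "T2"), 2: ("T1", "T2"),
    3: ("T2", "T3"), 4: ("T2", "T3"), 5: ("T2", "T4"),
    6: ("T3", "T4"), 7: ("T3", "T4"),
    8: ("T4", "Output"),
}


def children(op: int) -> list:
    """Operations that read the tensor written by `op`."""
    out = OPS[op][1]
    return [o for o, (src, _) in OPS.items() if src == out]


def parents(op: int) -> list:
    """Operations that write the tensor read by `op`."""
    src = OPS[op][0]
    return [o for o, (_, out) in OPS.items() if out == src]


print(children(0))   # operations fed by T1
print(parents(8))    # operations contributing to T4
```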
[0039] Preferably, operations in the operation graph may be organized into one or more operation loops. For example, one or more operations that use the same tensor as input may belong to the same operation loop. For example, operations 1 and 2 in Figure 2 use the intermediate tensor T1 as input and may therefore belong to the same operation loop (e.g., loop B in Figure 2). By analogy, operation 0, despite containing only one operation, can also represent its own operation loop (e.g., loop A in Figure 2). Similarly, operations 3, 4, and 5 in Figure 2 all use the intermediate tensor T2 as input and may therefore belong to the same operation loop (e.g., loop C in Figure 2). In Figure 2, operations 6 and 7 may belong to loop D, and operation 8 may belong to loop E.
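Grouping operations into loops by their shared input tensor can be sketched in a few lines of Python. The mapping `OP_INPUT` restates the inputs of the nine operations of Figure 2; the function name `group_into_loops` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# operation -> input tensor, following the description of Figure 2
OP_INPUT = {0: "Input", 1: "T1", 2: "T1", 3: "T2", 4: "T2", 5: "T2",
            6: "T3", 7: "T3", 8: "T4"}


def group_into_loops(op_input: Dict[int, str]) -> Dict[str, List[int]]:
    """Put operations that read the same tensor into the same operation loop."""
    loops = defaultdict(list)
    for op, src in op_input.items():
        loops[src].append(op)
    return dict(loops)


loops = group_into_loops(OP_INPUT)
print(loops)
```

The five resulting groups, keyed by "Input", "T1", "T2", "T3", and "T4", correspond to loops A to E of Figure 2.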
[0040] In relation to device 101, one or more computational operations shown in Figure 2 may include one or more of the following: floating-point operations, integer operations, Boolean operations, convolution operations, maximum pooling operations, and arg max operations. However, this list is not exhaustive, and other types of computational operations are also contemplated by this disclosure.
[0041] One or more computational operations associated with device 101 may include all operations of the global computation task defined above. For example, one or more components of the input tensor of device 101 may correspond to one or more parts of the data to be processed received from computing system 100, and one or more components of the output tensor that device 101 returns to computing system 100 may correspond to the required output results of the global computation task. Thus, the operation graph of device 101 may also be the global operation graph of the global computation task.
[0042] In another embodiment, one or more computational operations associated with device 101 correspond to each subtask of a global computation task. One or more components of the input tensor of device 101 may correspond to one or more components of each input tensor of a subtask, one or more components of the output tensor may correspond to one or more components of each output tensor of a subtask, and the operation graph of device 101 may correspond to a subgraph of the global operation graph of the global computation task.
[0043] In some embodiments, various strategies may be applied to process the operations in the operation graph of device 101. For example, breadth-first search (BFS) strategies, depth-first search (DFS) strategies, or hybrid approaches that combine BFS and DFS according to levels or thresholds may be used to iterate through the operations and levels of the operation graph. The choice of strategy may depend on the nature of the one or more computational operations associated with device 101, i.e., the nature of the tasks assigned to device 101, and / or on the configuration of device 101.
[0044] In some embodiments, operations on an operation graph may require sufficient input data to be available for execution as a whole. For example, each operation on the graph uses one or more components of a first tensor to compute components of a second tensor. However, depending on how the operation graph is traversed, components obtained from a first operation may become available and stored earlier than other components obtained from a second operation. If a third operation uses components obtained from both the first and second operations, the third operation may have to pause or wait until the second operation has completed before it can be executed. For example, operation 3 on the operation graph in Figure 2 may require that the result components of both operations 1 and 2 be available and stored in T2 in order to be executed. If the result component of operation 2 is not yet available, operation 3 may not be able to be executed. This can negatively impact execution speed and memory consumption. Therefore, the strategy used to traverse the operation graph is crucial for efficiently processing the received components of the input tensor and obtaining the corresponding components of the output tensor quickly and with low resource consumption.
[0045] Furthermore, in some embodiments, the input tensor, intermediate tensors T1 to T4, and output tensor, or at least some of them, may correspond to three-dimensional tensors having horizontal, vertical, and depth dimensions. Some operations on the operation graph may provide result components only over limited dimensions. For example, taking the example in Figure 2, one can imagine that operation 3 provides a first component of T3 only along a first depth range, and operation 4 provides a second component of T3 over a second depth range. In one embodiment, a global condition for at least some operations on the graph to be performed may be that there is sufficient data in the tensor used as input for them to be performed. In some embodiments, this may involve verifying whether one or more components of the tensor used as input provide data over the full depth. This is because the operation may always operate over the full depth while being configured to operate over limited ranges only in the other dimensions, for example, the vertical and horizontal dimensions.
[0046] For example, in some embodiments, if at least some operations are related to a CNN, one or more operations in the operation graph may be kernel operations, such as filter kernel operations, each having its own kernel size. A kernel may be called a filter kernel or simply a filter. Typically, a kernel is a matrix of weights applied to a tensor or components of a tensor (typically 3x3, 5x5, 7x7, or any particular size), preferably applied through a convolution operation. This may involve sliding the kernel over the tensor, preferably taking the element-wise product of overlapping entries in the tensor's components and each entry in the kernel, and then summing them up to produce a single output value.
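The sliding of a kernel over a tensor described above can be illustrated with a minimal two-dimensional sketch. It assumes no padding and a stride of 1, neither of which is specified in the disclosure, and the function name slide_kernel is hypothetical.

```python
def slide_kernel(tensor, kernel):
    """Slide `kernel` over a 2-D `tensor` (assumed: no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(tensor) - kh + 1
    out_w = len(tensor[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Element-wise product of the overlapping entries, then the sum,
            # which produces a single output value per kernel position.
            row.append(sum(tensor[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out
```

For a 3x3 tensor and a 2x2 kernel, this sketch produces a 2x2 result, one value per position of the kernel.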
[0047] In some embodiments, a filter kernel operation among the operations of the operation graph can be implemented as a planar operation. For example, the kernel / matrix (e.g., 3x3, 5x5, etc.) may be arranged along two dimensions (e.g., vertical and horizontal) but configured to be applied to the tensor components (used as input) over the entire range in a third dimension (e.g., depth dimension). For example, the matrix may be arranged vertically and horizontally but applied to the tensor components at full depth. However, if the tensor components are not yet available over the entire range in the third dimension (e.g., full depth), the execution of the filter kernel operation may be prevented. Therefore, in some embodiments, the global conditions for the execution of at least some operations of the graph may include a particular execution condition for an operation that is a filter kernel operation having a kernel arranged over a limited range in two dimensions but applied over the entire range in a third dimension, namely that the tensor used as input by the operation should have data provided over the entire range in the third dimension. For example, this could mean that for a given operation in an operation graph, data covering the full depth should be provided to the tensor used as input by that operation. In a particular example, the particular execution condition might be that, for a given operation in an operation graph to be performed, data covering the full depth should be provided to the components of the tensor used as input by that operation.
[0048] Therefore, in some embodiments, satisfying this particular execution condition may be sufficient to determine that the operation can be performed on one or more components of the tensor used as input. For example, this means that if it is determined that one or more components of the tensor provide data over their entire range in a third dimension (e.g., full depth), then the particular execution condition may be considered satisfied. In that case, it may mean that there is enough data to perform the operation using one or more components of the tensor and to compute the resulting components to be stored in another tensor in the graph. Conversely, if the particular execution condition is not satisfied, it may be determined (e.g., directly) that there is not enough data to perform the operation and compute the resulting components to be stored in another tensor in the graph.
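One possible way to evaluate the particular execution condition is sketched below. It assumes that the data available in a tensor is tracked as a list of half-open depth ranges; this representation and the name covers_full_depth are assumptions made for illustration.

```python
def covers_full_depth(depth_ranges, depth):
    """Return True if the half-open ranges [start, stop) jointly cover [0, depth)."""
    covered_up_to = 0
    for start, stop in sorted(depth_ranges):
        if start > covered_up_to:
            return False  # gap: data is missing for part of the depth
        covered_up_to = max(covered_up_to, stop)
    return covered_up_to >= depth
```

For example, a single component covering depth range [0, 4) of a tensor of depth 8 does not satisfy the condition, whereas two components covering [0, 4) and [4, 8) together do.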
[0049] As described above, having multiple devices on a computing system including device 101, with each device dedicated to computing its own subtask of the global computing task, can primarily help to compute the global computing task efficiently and quickly. However, the method of traversing the operation graph can have a significant impact on processing speed and memory usage, especially when available resources are limited. Optimized methods for processing tensors to reduce the consumption of computing resources are particularly needed for devices with limited form factor or silicon area, such as in portable or onboard platform scenarios.
[0050] Figure 3A discloses a flowchart of a method for addressing this problem. Method 300 may be applicable to any level of the operation graph (e.g., loops), for example, the operation graph in Figure 2 or the operation graph of one or more computational operations associated with device 101.
[0051] Method 300 may begin in step 302, where the components of the first tensor are provided to the memory unit. For example, taking the exemplary device 101 described above, this step may correspond to providing the components of the first tensor to the local memory unit of device 101. The first tensor may be, for example, the input tensor of device 101. In this case, providing the components of the first tensor to the memory unit may include receiving the components of the first tensor from external memory (e.g., the system memory of computing system 100) and storing the components of the first tensor in the local memory unit of device 101. Alternatively, the first tensor may be any intermediate tensor of an operation graph, for example, any intermediate tensor of the operation graph of one or more computational operations associated with device 101 (see, for example, any of T1 to T4 in Figure 2).
[0052] The processing of method 300 may be followed by step 304, which includes determining one or more first operations that use at least the components of a first tensor as inputs. For example, in the above example of device 101, the components of the first tensor may be objects that function as inputs for one or more first operations. For example, in the operation graph of Figure 2, taking the example of a given component of the intermediate tensor T2, one or more first operations that use at least the given component of T2 as inputs can be determined (e.g., operations 3, 4, and 5).
[0053] Method 300 may continue by iterating through the one or more first operations in the first loop (step 306), i.e., iterating through the one or more operations determined in step 304. For example, taking the above example based on T2 in Figure 2, the first loop may include operations 3, 4 and 5, i.e., loop C in Figure 2.
[0054] An iteration through the first loop may begin with a first operation and proceed through subsequent operations until the last operation in the loop is reached. Each iteration may be associated with a current operation, which may be the first operation, any subsequent operation, or the last operation in the loop. For a current operation of the one or more first operations in the first loop, if the first tensor containing the components of the first tensor has enough data for the current operation to compute the components of the second tensor using the components of the first tensor, then method 300 may perform the following steps:
[0055] Step 310: Compute the components of the second tensor using the current operation applied to at least the components of the first tensor.
Step 312: Write the components of the second tensor to the memory unit.
Step 314: If any portion of at least one component of the first tensor is no longer needed by any of the subsequent operations of the first loop, deallocate that portion from the memory unit.
Step 316: Invoke and iterate through a second loop embedded in the first loop, the second loop including one or more second operations that use at least the components of the second tensor as input.
[0056] However, if the current operation does not have enough data in the first tensor containing the components of the first tensor to compute the components of the second tensor using the components of the first tensor, method 300 may proceed to the next operation of one or more first operations in the first loop (step 320).
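The steps above can be sketched as a small recursive simulation on the graph of Figure 2. This is only a simplified model under stated assumptions: each tensor is represented by the set of operations whose results it currently holds, an operation has "enough data" when all the producers it requires are in that set, and a tensor is deallocated as a whole once every operation of its loop has been processed. The required sets (for example, operation 3 running on the result of operation 1 alone) and all names (OPS, iterate_loop, run_depth_first) are hypothetical.

```python
# Hypothetical model of Figure 2: op -> (input tensor, output tensor,
# producers whose results must already be stored in the input tensor).
OPS = {
    0: ("input", "T1", set()),
    1: ("T1", "T2", {0}),
    2: ("T1", "T2", {0}),
    3: ("T2", "T3", {1}),        # assumed to run on operation 1's result alone
    4: ("T2", "T3", {1, 2}),
    5: ("T2", "T4", {1, 2}),
    6: ("T3", "T4", {3, 4}),
    7: ("T3", "T4", {3, 4}),
    8: ("T4", "output", {5, 6, 7}),
}


def iterate_loop(tensor, memory, processed, trace):
    """Step 306 for the loop of operations that read `tensor`."""
    loop = [op for op, (src, _, _) in OPS.items() if src == tensor]
    for op in loop:
        _, dst, required = OPS[op]
        if op in processed:
            continue                                 # already done earlier
        if not required <= memory[tensor]:           # not enough data
            continue                                 # step 320: next operation
        memory.setdefault(dst, set()).add(op)        # steps 310 and 312
        processed.add(op)
        trace.append(op)
        if all(o in processed for o in loop):        # step 314 (whole tensor)
            del memory[tensor]
        iterate_loop(dst, memory, processed, trace)  # step 316: embedded loop


def run_depth_first():
    memory, processed, trace = {"input": set()}, set(), []
    iterate_loop("input", memory, processed, trace)
    return trace, memory
```

Under these assumptions, run_depth_first() visits the operations in the order 0, 1, 3, 2, 4, 6, 7, 5, 8 and leaves only the output tensor in memory. Operation 3 runs before operation 2 because the embedded loop is entered as soon as operation 1 has written its result to T2.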
[0057] For example, the current operation may be any operation in the first loop that is currently being processed after one or more operations preceding the current operation have been processed in the first loop. For example, taking again the above example based on T2 in Figure 2, the iteration of one or more first operations in the first loop (step 306) may begin with operation 3 in loop C in Figure 2. For example, it may be determined whether there is enough data in T2 containing the given components of T2 for operation 3 to compute components of a second tensor (i.e., components of the intermediate tensor T3 in Figure 2) using the given components of T2. For example, consider a case where operation 1 has provided a given component of T2, but operation 2 has not yet been processed. The given component of T2 may itself provide enough data for operation 3 to be performed and / or to generate at least one component of T3. However, operation 3 may alternatively require that a resulting component from operation 2 be provided in T2 in order to be performed. Therefore, in the embodiment, the execution of an operation that uses at least the components of a first tensor as input may be globally conditioned on whether there is enough data in the first tensor containing the components of the first tensor for the operation to compute the components of the second tensor using the components of the first tensor.
[0058] In a preferred embodiment, step 306 may include determining, for one or more first operations in the first loop, whether there is enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor (see step 308 in Figure 3C). Thus, in an embodiment, if it is determined that there is enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor, method 300 may proceed to steps 310 to 316 in Figure 3A. However, if it is determined that there is not enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor, method 300 may proceed to step 320 in Figure 3A.
[0059] In some embodiments, simply providing components of the first tensor may not be sufficient on its own for the current operation to be performed. This is because the current operation may require the results of two parent operations in the first tensor. In some embodiments, taking the example of Figure 2, operation 3 may be an operation that can already be performed based on a given component of T2 generated by operation 1, while operation 4 may be an operation that requires both the given component of T2 generated by operation 1 and the result component of operation 2 in T2.
[0060] In embodiments, some tensors associated with one or more computational operations performed by device 101 may be three-dimensional tensors having horizontal, vertical, and depth dimensions. In embodiments, as described above, the determination 308 may also include determining whether the particular execution condition is met. For example, if the first tensor is a three-dimensional tensor having horizontal, vertical, and depth dimensions, and the current operation is a filter kernel operation having a kernel that is arranged over a limited range in two dimensions but is applied over the entire range in the third dimension, step 308 may include determining whether the available data in the first tensor on which the current operation is performed extends over the entire range in the third dimension. Additionally or alternatively, if the first tensor is a three-dimensional tensor having horizontal, vertical, and depth dimensions, step 308 may also include determining whether the available data in the first tensor on which the current operation is performed extends over the full depth.
[0061] In one embodiment, if the available data in the first tensor on which the current operation is performed is located throughout the entire range or depth in the third dimension, it may mean that there is sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor. Conversely, additionally or alternatively, if it is determined that the available data in the first tensor on which the current operation is performed is not located throughout the entire range or depth in the third dimension, it may mean that there is not sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor. For example, in the example in Figure 2, the components obtained as a result of operation 3 in T3 may not be sufficient for operation 6 to be performed. It is possible that the component of T3 covers only a limited range in the third dimension (for example, in depth), and operation 4 provides another component of T3 that completes this component of T3, and both of these components within T3 together provide data that is located across the entire range in the third dimension (for example, at the entire depth). If this condition is met, operation 6 will have enough data in T3 to perform to compute the component of T4.
[0062] In one embodiment, the determination 308 may include determining whether the components of the first tensor extend across the entire depth dimension. If so, it may be determined that there is sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor, i.e., that the components of the first tensor are sufficient for the current operation to compute the components of the second tensor.
[0063] If there is not enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor, method 300 proceeds to step 320, i.e., continues with the next operation of one or more first operations in the first loop (see the left branch in Figure 3A). In one example, in that case, method 300 may stop processing the current operation and retry at a later point when there is enough input data to allow the current operation to be completed, and continue with the next operation in the first loop. In the explanatory example in Figure 2, this means, for example, stopping processing the current operation 3 and continuing with operation 4 in loop C. Thus, step 320 performs a provisional execution of any further operations in the first loop that may already have enough data to perform in the first tensor (containing the components of the first tensor).
[0064] If there is enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor (for example, for operation 3, in the explanatory example above, if the given components of T2 are actually sufficient for operation 3 to compute at least one component of T3), then method 300 proceeds to 310 to compute the components of the second tensor using the current operation applied to at least the components of the first tensor, and then proceeds to 312 to write the components of the second tensor to a memory unit (see the right branch in Figure 3A).
[0065] In some embodiments, if there is sufficient data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor, method 300 may include marking the current operation as processed. This may be done, for example, after step 310 in Figure 3A, or alternatively, before. Thus, if not all operations of the first loop are processed during the first iteration through the first loop, the second iteration through the first loop may still be performed later based on the components of the first tensor, with operations of the first loop that were marked as processed during the first iteration being skipped and any operations that were not processed during the first iteration being retried (for example, according to step 306 of method 300). For example, using the example in Figure 2, if in the first iteration through loop C operation 3 could be performed based on a given component of T2, but operations 4 and 5 could not be performed, then operation 3 is marked as processed and is therefore skipped in the second iteration through loop C.
[0066] In one embodiment, marking the current operation as processed may include removing the current operation from the list of operations to be processed. For example, the operation graph of device 101 may be associated with a list of operations to be processed. In one embodiment, the marking may correspond to the removal of the current operation from the list. Thus, in some embodiments, verifying that the one or more computational operations performed by device 101 have been processed may correspond to verifying that none of those operations remains in the list of operations to be processed.
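Marking by removal from a list, and the retry in a later iteration, may be sketched as follows. The function iterate_once, the `requirements` mapping, and the representation of the available data as a set of producer operations are hypothetical simplifications.

```python
def iterate_once(pending, available, requirements):
    """One iteration through a loop; returns the operations executed.

    `pending` is the list of operations still to be processed and is
    modified in place: an executed operation is removed from it, which
    is how this sketch marks the operation as processed.
    """
    executed = []
    for op in list(pending):               # iterate over a copy of the list
        if requirements[op] <= available:  # enough data for this operation
            executed.append(op)
            pending.remove(op)
    return executed
```

With loop C of Figure 2 and the assumption that operation 3 requires only the result of operation 1, a first call with only that result available executes operation 3 and leaves operations 4 and 5 in the list; a second call, made once the result of operation 2 is also available, skips operation 3 and executes operations 4 and 5.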
[0067] In some embodiments, in order to conserve memory space in the memory unit of device 101, the allocation of memory space for one or more tensors, and in particular one or more of their components, may depend on whether the one or more tensors, or their components, are to be computed. Therefore, in some embodiments, a step of allocating memory space for the components of the second tensor in the memory unit may precede step 310. For example, memory space may be allocated from a starting address applied by a command. Furthermore, in some embodiments, once the components of the second tensor have been computed using the current operation, step 312 may include writing the components of the second tensor to the allocated memory space in the memory unit. Accordingly, the memory space usage of the memory unit of device 101 is actively controlled and managed.
[0068] Furthermore, in some embodiments, if any portion of at least one component of the first tensor is no longer needed by any subsequent operation in the first loop, method 300 may proceed to deallocate the portion of at least one component of the first tensor from the memory unit (step 314). This may include removing the portion of at least one component of the first tensor from the memory unit in order to free up memory space. This may also include freeing the memory space associated with that portion so that it can be overwritten, for example, by a subsequent write operation. Step 314 may include determining whether any portion of at least one component of the first tensor is no longer needed by any subsequent operation in the first loop. For example, it may include determining which portion of the first tensor is no longer needed by any subsequent operation in the first loop. For example, in the explanatory example of T2 in Figure 2, it may be determined which portions of a given component of T2 are still needed by operations 4 and 5 of loop C, and / or which portions of a given component of T2 are no longer needed. The latter portions are the portions that should be deallocated. In some embodiments, deallocation may not involve an erase operation. It may simply involve freeing up the portion for further use. Additionally or alternatively, it may include determining which data items in the first tensor are not needed by any of the subsequent operations in the first loop. For example, for each subsequent operation in the first loop, it may include identifying which data in the first tensor stored in the memory unit is still needed. If any data item of the first tensor is not needed by any of the subsequent operations in the first loop, that data item may correspond to the portion that should be deallocated.
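Allocation before writing and deallocation without erasure may be illustrated with a toy model of a memory unit. The class LocalMemory, its first-fit placement policy, and its method names are assumptions made for this sketch and are not described in the disclosure.

```python
class LocalMemory:
    """Toy memory unit: a flat cell array plus a table of allocated regions."""

    def __init__(self, size):
        self.cells = [0] * size
        self.regions = {}  # start address -> length of the allocated region

    def allocate(self, length):
        """First-fit search for `length` free cells; returns the start address."""
        start = 0
        for used_start in sorted(self.regions):
            if used_start - start >= length:
                break  # the gap before this region is large enough
            start = used_start + self.regions[used_start]
        if start + length > len(self.cells):
            raise MemoryError("local memory unit is full")
        self.regions[start] = length
        return start

    def write(self, start, values):
        """Store `values` in the cells beginning at address `start`."""
        self.cells[start:start + len(values)] = values

    def deallocate(self, start):
        """Release a region without erasing it; later writes may overwrite it."""
        del self.regions[start]
```

In this sketch, a deallocated region keeps its old contents until a later allocation reuses the address range and a subsequent write overwrites it.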
[0069] In some embodiments, additionally or alternatively, where any of the subsequent operations in the first loop includes at least one kernel operation, the determination of whether any portion of the components of the first tensor is no longer needed by any of the subsequent operations in the first loop may be based on the size of the at least one kernel operation. As described above, at least some of the operations in the operation graph may be filter kernel operations, which may have a kernel with a limited extent along two dimensions and a full range in a third dimension. The limited extent may be the kernel size (e.g., 3×3, 5×5, 7×7, or any specific size). Applying a filter kernel may involve sliding the kernel over the tensor, preferably taking the element-wise product of the overlapping entries of the tensor's components and each entry in the kernel, and then summing them up to produce a single output value. Thus, in one embodiment, the determination of whether any portion of the components of the first tensor is no longer needed by any of the subsequent operations in the first loop may be based on the maximum kernel size among the kernels of the subsequent operations in the first loop. In a preferred embodiment, calculating the maximum kernel size may include taking the maximum horizontal and / or vertical size of the kernels of the subsequent operations of the first loop.
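A possible kernel-size-based determination is sketched below. It rests on assumptions that the disclosure does not state: rows are processed from top to bottom with a stride of 1, all kernels have odd sizes and are centred on the output row, and all subsequent operations of the loop are about to compute the same output row. The function names are hypothetical.

```python
def max_kernel_size(kernel_sizes):
    """Largest vertical and horizontal extent over (height, width) pairs."""
    return (max(h for h, _ in kernel_sizes), max(w for _, w in kernel_sizes))


def first_row_still_needed(next_output_row, kernel_sizes):
    """Lowest input row that any centred kernel can still touch.

    Rows above the returned index are candidates for deallocation.
    """
    max_height, _ = max_kernel_size(kernel_sizes)
    return max(0, next_output_row - (max_height - 1) // 2)
```

For example, if the subsequent operations use kernels of sizes 3x3, 5x5, and 7x3 and the next output row is row 10, the tallest kernel reaches up to input row 7, so under these assumptions rows 0 to 6 of the first tensor are no longer needed.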
[0070] Therefore, during the processing of one or more computational operations associated with device 101 on a given component of the input tensor, the memory unit is managed to hold only the components of the intermediate tensor, or any part thereof, that are still needed to compute the corresponding one or more components of the output tensor of device 101. This is illustrated in Figure 5B (see below). As a result, the memory usage of the memory unit dynamically adapts to the instantaneous memory needs of the tensor computation. Memory space consumption is actively reduced and optimized during computation, thereby avoiding memory congestion or a decrease in the data throughput of device 101.
[0071] Returning to Figure 3A, after step 314, method 300 may proceed to step 316, which includes invoking a second loop embedded in the first loop and iterating through it, where the second loop includes one or more second operations that use at least one component of the second tensor as input. Thus, this step attempts, early on, to perform one or more computational operations that use components of the second tensor as input, based on the newly obtained result of the current operation in the first loop. In some embodiments, this may correspond to a depth-first search (DFS) strategy that device 101 follows to traverse the operation graph. For example, in the example of Figure 2, after performing steps 310-314 for the current operation 3, step 316 may include invoking loop D and iterating through loop D, where loop D includes operations 6 and 7 that use at least components of T3 obtained from operation 3 as input. Thus, the second loop containing one or more second operations corresponds to the (sub)loop embedded in the first loop. In the example in Figure 2, this means that loop D is embedded within loop C. In some embodiments, step 316 may correspond to a recursive iteration of step 306. Thus, step 316 may include the same substeps as step 306 (e.g., steps 310-316 or 320 in Figure 3A), but based on a subsequent loop of operations (e.g., loop D in Figure 2). In this regard, the first loop (e.g., loop C) may be considered the parent loop or embedding loop of the second loop (e.g., loop D). Conversely, the second loop (e.g., loop D) may be considered the child loop or embedded loop of the first loop (e.g., loop C).
[0072] Therefore, method 300 does not proceed directly to the execution of the next operation in the first loop, but instead may proceed to the execution of the second loop embedded in the first loop before continuing with the next operation in the first loop. This has the advantage of allowing the execution of the embedded loop according to the operation graph to start at an early stage. This strategy rapidly distributes the amount of processing work across all levels of the operation graph and avoids congestion of processing means dedicated to the first level of the operation graph when one or more computations are performed by device 101. Furthermore, by attempting to process the lower levels of the operation graph at an early stage, the memory space associated with the tensor components that serve as inputs to these operations can already be deallocated and freed up for other operations, thereby improving the overall execution of one or more computations performed by device 101 in terms of resource consumption and throughput.
[0073] In an embodiment, method 300, shown in Figure 3A, processes operations according to a depth-first search (DFS) strategy. However, the disclosures herein are not limited to the application of such strategies, and other types of operation graph traversal strategies are also contemplated. For example, method 300 may alternatively apply a breadth-first search (BFS) strategy. In such a case, step 316 may be omitted, and method 300 may instead proceed to a modified step 320, which continues with the next operation of the one or more first operations in the first loop either after the completion of step 314 or if there is not enough data in the first tensor containing the components of the first tensor for the current operation to compute the components of the second tensor using the components of the first tensor. According to such a strategy, the first loop is repeated until each first operation in the first loop is completed, after which processing of the next (child) loop in the operation graph may be triggered.
[0074] Figure 3B shows one or more steps included in and / or following step 316. It illustrates the recursive nature of method 300 in an embodiment (such as that of Figure 3A) that follows a depth-first search (DFS) strategy. For example, initiating and iterating over a second loop embedded in a first loop (step 316) may include iterating over the second loop (step 306'), for example, iterating over one or more second operations that use components of a second tensor as input. Similar to the iterations in the first loop, it may include sequentially taking each operation of the second loop as a second current operation and performing one or more steps in Figure 3B for each second current operation of the second loop.
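The breadth-first variant may be sketched on the graph of Figure 2 as follows. As a simplified model, each tensor is represented by the set of operations whose results it holds, a loop is repeated until all of its operations are completed, and only then are the child loops processed in the order in which their input tensors were first written. The required sets and all names (OPS, run_breadth_first) are hypothetical.

```python
from collections import deque

# Hypothetical model of Figure 2: op -> (input tensor, output tensor,
# producers whose results must already be stored in the input tensor).
OPS = {
    0: ("input", "T1", set()),
    1: ("T1", "T2", {0}),
    2: ("T1", "T2", {0}),
    3: ("T2", "T3", {1}),
    4: ("T2", "T3", {1, 2}),
    5: ("T2", "T4", {1, 2}),
    6: ("T3", "T4", {3, 4}),
    7: ("T3", "T4", {3, 4}),
    8: ("T4", "output", {5, 6, 7}),
}


def run_breadth_first(root="input"):
    memory = {root: set()}
    queue, queued, trace = deque([root]), {root}, []
    while queue:
        tensor = queue.popleft()
        loop = [op for op, (src, _, _) in OPS.items() if src == tensor]
        pending = list(loop)
        while pending:  # repeat the loop until every operation is completed
            ready = [op for op in pending if OPS[op][2] <= memory[tensor]]
            if not ready:
                raise RuntimeError("loop cannot make progress")
            for op in ready:
                dst = OPS[op][1]
                memory.setdefault(dst, set()).add(op)  # compute and write
                trace.append(op)
                pending.remove(op)
                if dst not in queued:  # child loop is processed later
                    queued.add(dst)
                    queue.append(dst)
        if loop:
            del memory[tensor]  # every consumer has finished with this tensor
    return trace, memory
```

Under these assumptions the operations are visited in the order 0 to 8, level by level, and each tensor stays in memory until its entire loop has been completed.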
[0075] In one embodiment, if there is sufficient data in the second tensor containing the components of the second tensor for the second current operation of one or more second operations in the second loop to compute the components of the third tensor using the components of the second tensor, then method 300 may perform the following steps:
Step 310': Compute the components of the third tensor using the second current operation applied to at least the components of the second tensor.
Step 312': Write the components of the third tensor to the memory unit.
Step 314': If at least one part of the components of the second tensor is no longer needed by any subsequent operation of the second loop, deallocate that part of the components of the second tensor from the memory unit.
Step 316': Initiate and iterate through the third loop embedded in the second loop, the third loop including one or more third operations that use at least the components of the third tensor as input.
In contrast, if there is not enough data in the second tensor containing the components of the second tensor for the second current operation to compute the components of the third tensor using the components of the second tensor, iteration 306' may continue with the next operation of the one or more second operations in the second loop, which then represents the next current operation in the second loop (step 320').
[0076] In one embodiment, steps 310' to 316' and 320' substantially correspond to steps 310 to 316 and 320, although they differ in that they relate to child loops. Therefore, the implementation details and embodiments described above for steps 310 to 316 and 320 can also be applied to steps 310' to 316' and 320'.
[0077] Figure 3C shows further aspects of Method 300 according to several embodiments, in particular details of transitions between operations in different loops. In one embodiment, Method 300 may further include determining whether the first loop is embedded in another parent loop if the current operation of the first loop is the last operation of the first loop, and if the first loop is embedded in another parent loop, returning to the other parent loop and continuing the iteration through one or more operations within the other parent loop. Thus, once all operations of a given loop have been iterated through, the Method may, if necessary, return to the parent loop to continue the iteration within that parent loop. This may be necessary, for example, to complete all remaining operations in the parent loop in order to provide all the components necessary for all operations of the first loop to be performed. This also allows returning to the top / root loop of the operation graph once all operations in the operation graph have been traversed or performed.
[0078] For example, as shown in Figure 3C, method 300 may, in particular after step 306, include determining 308 whether there is sufficient data in the first tensor, including the components of the first tensor, for the current operation of the one or more first operations in the first loop to compute the components of the second tensor. If there is, method 300 may proceed to steps 310-316 in Figure 3A as described above. If there is not, method 300 may first proceed to determine 318 whether the current operation is the last operation in the first loop. If the current operation is not the last operation in the first loop, method 300 may proceed to step 320, i.e., continue with the next operation of the one or more first operations in the first loop, as described above for Figure 3A. However, if the current operation is determined to be the last operation in the first loop, method 300 may instead proceed to determine 322 whether the first loop is embedded in another parent loop. If so, method 300 may return 324 to the other parent loop and continue the iteration through one or more operations within the other parent loop.
[0079] In some embodiments, continuing the iteration through one or more operations in other parent loops may itself involve performing steps 318, 320, 322 and / or 324 for the other parent loops. This means that, in a particularly preferred embodiment, the method may perform step 318 in the other parent loop and proceed to step 320 or steps 322, 324 depending on the result of the determination in step 318 before continuing with the next operation in the other parent loop. For example, it may first be determined whether the current operation in the other parent loop is the last operation in the other parent loop (step 318 for the other parent loop). If not, the method proceeds to continue with the next operation in the other parent loop (step 320 for the other parent loop). However, if the current operation in the other parent loop is indeed the last operation in the other parent loop, the method may proceed to move further back in the operation graph. For example, it may next be determined whether the other parent loop is embedded in another higher loop in the operation graph (step 322 for the other parent loop). If so, the method can return to the other higher loop in the operation graph and continue the iteration through one or more operations in that other higher loop in the operation graph (step 324 of the other parent loop). Further details of the transitions between loops are described below with reference to Figure 4.
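The transitions between loops described above (steps 318, 320, 322 and 324) can be illustrated with an explicit stack of loop frames. The following Python sketch is illustrative only and is not taken from the disclosure: the dictionary layout, the names `traverse`, `ready` and `NEEDS`, and the loops P and Q are assumptions made for this example, and the mapping of code lines to step numbers is approximate.

```python
# Hypothetical two-level operation graph: loop Q is embedded in loop P.
Q = {"name": "Q", "ops": ["c"], "child": None}
P = {"name": "P", "ops": ["a", "b"], "child": Q}

# Assumed data dependencies: operation "c" needs the results of both "a" and "b".
NEEDS = {"a": set(), "b": set(), "c": {"a", "b"}}


def ready(op, done):
    """Stand-in for the sufficiency check (step 308)."""
    return NEEDS[op] <= done


def traverse(root):
    stack = [(root, 0)]          # frames: (loop, index of the current operation)
    done, events = set(), []
    while stack:
        loop, i = stack.pop()
        if i == len(loop["ops"]):            # no operation left in this loop (step 318, "YES")
            if stack:                        # embedded in a parent loop? (step 322)
                parent = stack[-1][0]
                events.append(f"return {loop['name']}->{parent['name']}")  # step 324
            continue
        op = loop["ops"][i]
        stack.append((loop, i + 1))          # resume with the next operation later (step 320)
        if op not in done and ready(op, done):
            done.add(op)                     # stands in for steps 310-314
            events.append(f"run {op}")
            if loop["child"] is not None:    # descend into the embedded loop (step 316)
                stack.append((loop["child"], 0))
    return events


events = traverse(P)
```

In this sketch, loop Q is entered twice: the first visit finds insufficient data and returns to loop P, and the second visit, after operation "b" has run, executes operation "c".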
[0080] In a preferred embodiment, method 300 is applicable to any component of any tensor once provided in the memory unit. Furthermore, it should be noted that further embodiments of method 300 shown in Figure 3C with respect to the current operation of the first loop are applicable to any level of the operation graph. For example, they can be similarly applied to iteration 306' through the second loop, as shown in Figure 3B, by considering the component of the second tensor (used as input), the second current operation of the second loop, and the component of the third tensor (used as output), instead of the component of the first tensor (used as input), the current operation of the first loop, and the component of the second tensor (used as output) in Figure 3C.
[0081] An exemplary traversal by method 300 is as follows. Taking the example of Figure 2 above, consider the case where a given component of T2, used as input by operations 3, 4 and 5 of loop C (i.e., the first loop), is generated by operation 1 of loop B. Consider further that operation 2 has not yet been processed, and that operation 3 can be processed because the given component of T2 by itself provides sufficient data in the first tensor for operation 3 to compute a component of T3 (see steps 302-312 in Figure 3A). After step 314 is performed, an iteration through loop D (i.e., the second loop) is initiated, where loop D is embedded in loop C and includes operations 6 and 7 that use at least the component of T3 (step 316; step 306' in Figure 3B). In loop D, it can be determined whether T3 contains sufficient data, including the component of T3, for operation 6 to compute a component of T4 (step 308 applied to the second loop, i.e., loop D). Assuming this is not the case, the method iterates further through loop D (steps 320, 306). Specifically, it first determines whether operation 6 is the last operation in loop D (step 318 in Figure 3C, applied to loop D); since it is not, the method proceeds to the next operation in loop D, i.e., operation 7 (step 320 in Figure 3C, applied to loop D). Suppose that T3 also does not contain sufficient data for operation 7 to compute a component of T4. Since operation 7 is the last operation in loop D (step 318 in Figure 3C, applied to loop D, "YES"), the method determines whether loop D is embedded in another loop (step 322 in Figure 3C, applied to loop D).
Since loop D is embedded in loop C, the method returns to loop C and continues iterating through loop C (step 324 in Figure 3C, applied to loop D). The current operation in loop C is still operation 3. Since it is not the last operation in loop C (step 318, "NO"), the method continues with operation 4 as the current operation in loop C (step 306 in Figures 3A and 3C). Assume, however, that T2 does not contain sufficient data, including the given component of T2, for operation 4 to compute a component of T3 (step 308 in Figure 3C, "NO"). The method then determines whether operation 4 is the last operation in loop C (step 318 in Figure 3C). Since it is not, the method continues iterating through loop C (step 320) and selects operation 5 as the current operation in loop C (step 306). Assume again that T2 does not contain sufficient data for operation 5 to compute a component of T3 (step 308 in Figure 3C, "NO"). The method then determines whether operation 5 is the last operation in loop C (step 318 in Figure 3C). Since operation 5 is the last operation in loop C (step 318, "YES"), the method returns to the parent loop B and continues the iteration within it, i.e., continues with operation 2 in loop B.
[0082] In that case, operation 2 is then executed, providing all the missing components of T2 for operations 4 and 5 to execute. After the successful execution of operation 2 (steps 308-314 in Figure 3A, which applies to the parent loop, i.e., loop B), the method proceeds again with the initiation and iteration of the first loop (i.e., loop C), but skips operation 3 because it has already been executed in the previous iteration of loop C (for example, marked as processed as described above). In that case, operation 4 is executed (steps 306, 310-314 in Figure 3A, 3C), followed by a new iteration of loop D.
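The walkthrough above can be replayed with a short recursive sketch in Python. It is a simplified illustration under stated assumptions, not the disclosed implementation: the `Loop` class, the `NEEDS` table and the function `iterate` are hypothetical, and "sufficient data" is reduced to "the listed producing operations have run". The dependency of operations 6 and 7 on all of operations 3, 4 and 5 is likewise assumed, since the text does not state when loop D finally succeeds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Loop:
    name: str
    ops: list                          # operation numbers, in loop order
    child: Optional["Loop"] = None     # loop embedded in this one


# Assumed dependencies: which operations must have run before each operation has enough data.
NEEDS = {1: set(), 2: set(), 3: {1}, 4: {2}, 5: {2}, 6: {3, 4, 5}, 7: {3, 4, 5}}


def iterate(loop, done, order):
    for op in loop.ops:                    # step 306: take each operation as current operation
        if op in done:                     # already processed in an earlier iteration
            continue
        if not NEEDS[op] <= done:          # step 308, "NO": not enough data yet
            continue                       # steps 318 / 320: move on to the next operation
        done.add(op)                       # stands in for steps 310-314
        order.append(op)
        if loop.child is not None:         # step 316: iterate through the embedded loop
            iterate(loop.child, done, order)
    # Falling out of the for-loop corresponds to returning to the parent loop (steps 322 / 324).


loop_d = Loop("D", [6, 7])
loop_c = Loop("C", [3, 4, 5], child=loop_d)
loop_b = Loop("B", [1, 2], child=loop_c)

order = []
iterate(loop_b, set(), order)
```

Under these assumptions the execution order is 1, 3, 2, 4, 5, 6, 7: operation 3 runs before operation 2, and operation 3 is skipped on the second pass through loop C, as in the walkthrough.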
[0083] According to a further aspect of this disclosure, Figure 4 provides a flowchart summarizing the processing of multiple components of an input tensor. The method 400 shown in Figure 4 relates in particular to a method for processing components of an input tensor, for example, multiple components of an input tensor, according to one or more computational operations associated with the device 101 for the components of the input tensor. The method 400 is configured in particular to traverse an operation graph like that shown in Figure 2.
[0084] In one embodiment, the computation graph traversed by method 400 may correspond to a computation graph associated with a specific subtask of a global computation task, such as the global computation task defined above in Figure 1A or 2. Alternatively, it may correspond to a global computation graph of a global computation task. For example, it may include all the computations of a global computation task performed on the data to be processed provided by computing system 100.
[0085] In one embodiment, Method 400 is preferably performed by a device such as the device 101 described above. For example, Method 400 may be applied to device 101 to process an operation graph associated with device 101, for example, an operation graph associated with a particular subtask of a global computing task. However, Method 400 may also be performed by any kind of data processing system having one or more processors and one or more memories and / or local and / or remote computing resources. It can also be performed by multiple devices, for example, multiple devices 101 associated with a computing system 100. In such a case, the multiple devices 101 may be configured to cooperatively process the same operation graph associated with their respective computations. The multiple devices 101 may be configured to cooperatively perform Method 400 to process, for example, the global operation graph of the global computing task described above. For example, one or more first devices of the multiple devices 101 may be configured to compute a first group of operations in the operation graph, and one or more second devices of the multiple devices 101 may be configured to compute a second group of operations in the operation graph. The first group and the second group may belong to the same loop in the operation graph, or they may belong to different loops in the operation graph (for example, consecutive loops).
[0086] Method 400 begins with receiving data for the components of an input tensor (step 401). In step 403, it can be determined whether the components of the input tensor provide enough data for a first operation of the operation graph to be performed. The first operation may be a first operation in the top / root loop of the operation graph. In embodiments similar to step 308, where the tensor processed by the operation graph is a three-dimensional tensor having horizontal, vertical, and depth dimensions, and at least a first operation of the operation graph is a filter kernel operation having a kernel located in a limited range in two dimensions but applied to the entire range in a third dimension, step 403 may include determining whether the components of the input tensor are located in the entire range in the third dimension. For example, it may include determining whether the components of the input tensor are located in the entire range in the depth dimension, e.g., whether they have reached the full depth in the input tensor.
[0087] If the condition in step 403 is not met, the method loops back as shown in Figure 4. If it is met, the components of the input tensor are selected for iterating through the first loop of the operation graph, e.g., loop A in Figure 2 (step 405).
[0088] In one embodiment, method 400 relates to method 300 applied to each level / loop of the operation graph. For example, it may include repeating method 300 to iterate through different loops of the operation graph. In some embodiments, steps 406, 408, 410-414, 416, 418, 420, 422, and 424 of method 400 shown in Figure 4 essentially correspond to steps 306, 308, 310-314, 316, 318, 320, 322, and 324 of method 300 described above in Figures 3A-C, respectively. For example, steps 406-424 may be performed according to steps 306-324 of method 300. In one example, the components of the input tensor selected in step 405 may correspond to the components of the first tensor provided in step 302 of method 300. For example, the repetition of steps 406-416 for iterating through an embedded (child) loop of the operation graph may be carried out according to steps 306'-316' in Figure 3B, for example. In another example, the next tensor component selected in step 416, used for iterating through an embedded loop restarting from 406, may correspond to the second tensor component used for iterating through a second loop embedded in the first loop according to steps 316 and 306' in Figures 3A and 3B. Therefore, further explanation of these steps is omitted. Furthermore, any details or features described above in method 300 are also applicable to method 400. Conversely, any additional steps or details of method 400 presented herein may also be applicable to method 300 presented earlier.
[0089] In particular, method 400 may include a step 407 that determines whether the current operation of the loop has already been performed, for example, during a previous iteration of the loop. If so, method 400 proceeds to step 418 to verify whether the current operation is the last operation of the loop and, if it is not, continues with the next operation of the loop (step 420).
[0090] Method 400 may also include step 415, which may be performed after step 414 but before step 416. This step determines whether the next tensor component computed by the current operation (e.g., the component of the second tensor computed using the current operation applied to the component of the first tensor according to step 310 of method 300) is intended only for the output tensor of the operation graph. If NO, the loop currently being iterated through is not the last loop of the operation graph, and method 400 may continue in step 416 to iterate through an embedded loop, starting again from step 406. If YES, the loop currently being iterated through is the last loop of the operation graph (e.g., loop E in Figure 2), and the next tensor component is the component to be written to the output tensor of the operation graph. In that case, method 400 proceeds to step 418 to verify whether the current operation is the last operation in the loop. If not, method 400 continues with the next operation in the loop (step 420). Otherwise, method 400 proceeds to step 422 to determine whether the loop is embedded in another loop. This is the case whenever the operation graph contains multiple loops, since the loop is then the last loop in the operation graph. Method 400 may therefore proceed to step 424, step back to the embedding (parent) loop, and continue at the current operation of that embedding (parent) loop. It may then proceed to step 418 to determine whether the current operation of that embedding (parent) loop is the last operation of that loop. In other words, method 400 thereby returns to the parent loop and performs any remaining operations there.
[0091] In one embodiment, if all operations in the operation graph have been performed, method 400 may, accordingly, repeat steps 418, 422, and 424 one or more times until it finally returns to the top loop of the operation graph (e.g., loop A in Figure 2). In this case, if the current operation in the top loop of the operation graph is the last operation in the top loop (step 418, "YES"), method 400 may proceed to step 422. However, in this case, since the top loop is not embedded in any other loop (step 422, "NO"), method 400 has successfully processed the components of the input tensor and written one or more components corresponding to the output tensor. In one embodiment, method 400 may terminate at this stage. Alternatively, method 400 may proceed to step 426 to verify whether all components of the output tensor have been generated. If not, method 400 may be repeated with the next components of the input tensor. If YES, it means that the output tensor has been fully computed and can be returned to the computing system 100.
[0092] Figures 5A and 5B illustrate another example of processing an input tensor using multiple operations to generate an output tensor, according to one or more embodiments of the present disclosure. Figure 5A shows another example of an operation graph applicable to any device (e.g., device 101) or any multiple devices (e.g., multiple devices 101), as described above. Figure 5B illustrates a memory allocation scheme in processing the operation graph for multiple components of an input tensor, in particular according to method 400 or 300 described above. Specifically, it shows a kind of instant view of memory allocation. Figure 5B also specifically refers to the data of the components of the tensor as tensor "pixels," however, this is for illustrative purposes only. For example, the tensor "pixels" in the input tensor may correspond to the data of rows in an image frame, for example. However, the memory management scheme shown in Figure 5B is not limited to this type of component and is equally applicable to various types of components of a tensor, as described above.
[0093] In the embodiment shown in Figure 5B, the tensors of the operation graph (e.g., input tensors T5-T9 and output tensors) may be three-dimensional tensors having horizontal, vertical, and depth dimensions, respectively. In one embodiment, the operations (e.g., operations 9-16) may be kernel operations, such as filter kernel operations, each associated with a corresponding filter kernel. Each operation belongs to a corresponding loop (e.g., loops F-K) of the operation graph, as shown in Figure 5A. The processing may reflect methods 300 and 400 described with respect to Figures 3A-3C and Figure 4. This may reflect the processing of layers in a convolutional neural network.
[0094] Figure 5B illustrates the application of an operation implemented as a filter kernel to the components of a tensor. For example, operation 9 may be a convolution with a filter kernel of size 5x5. The filter kernel is moved horizontally, i.e., along the lines of the input tensor, to compute the components of tensor T5. The filter kernel of an operation may also have its own input stride and output stride. The input stride and output stride parameters can describe how the filter kernel (or pooling operation) moves across the tensor (e.g., a feature map) and how it affects the size of the tensor that stores the results of applying the filter kernel at each layer. For example, the input stride of the filter kernel of operation 9 may be 2 and its output stride may be 1. Figure 5B specifically refers to an exemplary output stride associated with an operation.
[0095] In one embodiment, the input stride refers to the number of steps the filter kernel (or pooling window) moves between consecutive operations when applied to an input tensor. A common choice for the input stride is 1, which means the kernel moves one step at a time (both horizontally and vertically), resulting in an output tensor with similar dimensions to the input tensor. Increasing the input stride value reduces the spatial dimension of the output tensor, making subsequent operations faster and less memory-intensive, but potentially losing some fine-grained spatial information.
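The effect of the input stride on the spatial size of the result can be made concrete with the standard convolution size arithmetic. The Python sketch below is illustrative: the function name `conv_output_size` and the `padding` parameter are assumptions for this example (padding is not discussed in this passage), and the sizes used are arbitrary.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Number of kernel positions along one spatial dimension.

    Standard formula: floor((in_size + 2 * padding - kernel) / stride) + 1.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if in_size + 2 * padding < kernel:
        raise ValueError("kernel does not fit into the (padded) input")
    return (in_size + 2 * padding - kernel) // stride + 1


# A 5x5 kernel on a 32-wide input, with an assumed padding of 2:
same_size = conv_output_size(32, kernel=5, stride=1, padding=2)   # stride 1 keeps the size
halved = conv_output_size(32, kernel=5, stride=2, padding=2)      # stride 2 roughly halves it
unpadded = conv_output_size(32, kernel=5, stride=1)               # no padding shrinks it slightly
```

With an input stride of 1 the output keeps the input width when suitably padded, while an input stride of 2 halves it, which is the reduction in spatial dimension described above.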
[0096] In one embodiment, the output stride refers to the ratio between the spatial dimension of the tensor to which the filter kernel is applied and the tensor that stores the result of applying the filter kernel. This can be used as a measure of how much the spatial resolution has been expanded after passing through one or more computational layers. For example, an output stride of 8 means that the height and width of the final output tensor are each 8 times larger than those of the input tensor. The role of the output stride is to restore spatial detail in the output tensor.
[0097] In one embodiment, the filter kernel operation in Figure 5A may be configured to apply its filter kernel over the entire range along a third dimension, for example, the depth direction. Some operations in Figure 5A, such as operations 9, 10, 15, or 16, may provide data over the entire range in the third dimension. However, some of them may provide data over a limited range in the third dimension, and the resulting components may not be sufficient on their own for the child loop operations to perform. As described above in Figures 2 and 3A-C, two operations may each provide results with a limited range in the third dimension in the tensor, but their results may complement each other to provide data distributed over the entire range in the third dimension, enabling the child loop operations to perform. This is particularly illustrated in operations 14 and 13 in Figure 5B (see below).
[0098] In one embodiment, Figure 5B shows an instant view of memory usage during the processing of the computation graph. This specifically illustrates the active deallocation of memory space in the aforementioned memory unit (e.g., device 101) as soon as any component (or part thereof) of any tensor is no longer needed to process the computation graph. This allows for the rapid freeing of memory space that can be directly used for other ongoing computations. For example, the white area at the top of the tensor in Figure 5B represents the tensor component deallocated from the memory unit. In the case of an output tensor, this means that the corresponding component has already been returned to the computing system 100 (e.g., stored in its system memory). Areas marked with a forward slash correspond to tensor "pixels" that still need to be allocated to the memory unit, i.e., need to be processed. Thus, for all three-dimensional tensors in Figure 5B, the shown processing according to, for example, method 400 or 300, requires that only the tensor "pixels" marked with a backslash be stored and retained in the memory unit. These tensor "pixels" represent the tensor components stored across the entire depth dimension in Figure 5B. Therefore, the use of available computing resources is optimized, and power consumption is reduced. This may mean that it may be possible to process the computation graph with a limited-sized memory unit.
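The deallocation behaviour described above can be sketched as a small line buffer that retains only the input lines still needed by a kernel of a given height. This Python sketch is an illustration under stated assumptions, not the disclosed memory manager: the class `LineBuffer`, its attributes, and the column-wise sum used as a placeholder operation are all hypothetical.

```python
class LineBuffer:
    """Keeps only the input lines that a vertical window of height kernel_h still needs."""

    def __init__(self, kernel_h, stride=1):
        self.kernel_h = kernel_h
        self.stride = stride
        self.rows = {}         # line index -> line data currently held in memory
        self.next_out = 0      # index of the next output line to compute
        self.peak = 0          # largest number of lines held at once

    def push(self, index, row):
        """Store one input line and return the output lines that became computable."""
        self.rows[index] = row
        self.peak = max(self.peak, len(self.rows))
        outputs = []
        while True:
            top = self.next_out * self.stride
            window = range(top, top + self.kernel_h)
            if not all(r in self.rows for r in window):
                break          # not enough lines yet for the next output line
            # Placeholder operation: column-wise sum over the window.
            outputs.append([sum(col) for col in zip(*(self.rows[r] for r in window))])
            self.next_out += 1
            # Deallocate lines that no later window will read.
            keep_from = self.next_out * self.stride
            for r in [r for r in self.rows if r < keep_from]:
                del self.rows[r]
        return outputs


buf = LineBuffer(kernel_h=3)
outs = []
for i in range(6):
    outs.extend(buf.push(i, [i, i]))   # six input lines of width 2
```

In this example six input lines pass through the buffer, but never more than three are held at once, which mirrors the point that only the lines still required by pending operations need to remain in the memory unit.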
[0099] In one embodiment, in the instant view of Figure 5B, operations 11 and 12 of loop H have been successfully executed for the components marked with backslashes in T6. The resulting components are stored in T7 (see the tensor "pixels" marked with backslashes), for example, as a concatenation of a first component obtained from the execution of operation 11 and a second component obtained from the execution of operation 12. Together, these two components cover the entire depth of T7. The current operation in loop H is operation 12. According to methods 300 and 400, after operation 12 has been successfully executed (steps 310-314 / 410-414), an iteration through the child loop that uses T7, i.e., loop I, is initiated (step 316 / 416). Its single operation 14 is successfully executed (steps 310-314 applied to operation 14 in loop I), providing a resulting component to be stored in T8. However, since operation 13 in loop H has not yet been performed, the component resulting from operation 13, which should enter T8 according to the operation graph, is not yet available. The component resulting from operation 14 is marked in black in T8 and is held in the memory unit. It covers only a limited range in the depth direction of T8. Furthermore, according to step 316 / 416 applied in loop I, the next child loop (e.g., loop J) is tentatively iterated over. However, since the component resulting from operation 14 does not extend across the entire range of T8 in the depth dimension, T8 does not contain sufficient data for operation 15 (the single operation in loop J) to be performed. Therefore, the process branches in loop J through steps 318-322-324 / 418-422-424 and returns to loop I.
Similarly, since operation 14 is the last operation in loop I, the process branches again in loop I through steps 318-322-324 / 418-422-424 and returns to loop H, where the current operation is operation 12. Since operation 12 is not the last operation in loop H (step 318 / 418, "NO"), the process proceeds to step 320 / 420 to process the next operation in loop H, i.e., operation 13, which is executed (steps 310-314 / 410-414 applied to operation 13) and provides the previously missing component of T8 that operation 15 requires. The process then executes step 316 / 416 to initiate and iterate through a loop embedded in loop H, i.e., loop J, which contains one or more operations that use the component obtained as a result of operation 13 as input. The new iteration through loop J then successfully executes operation 15, allowing the graph traversal to be completed through the remaining operations in the operation graph (operation 16 in this case).
[0100] As shown in Figure 5B, the components obtained as a result of operations 11 and 12 may be concatenated at T7. Similarly, the components obtained as a result of operations 14 and 13 may be concatenated at T8. In one embodiment, only the concatenation of the resulting components allows these resulting components to provide sufficient data to perform a predetermined operation in the next loop. In one embodiment, only the concatenation of the resulting components allows there to be sufficient data arranged across the entire depth dimension to perform a predetermined operation.
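Whether concatenated components together span the full depth can be checked with simple interval arithmetic. The Python sketch below is a hypothetical illustration: the function `covers_full_depth`, the half-open depth ranges, and the depth of 32 with a split at 16 are assumptions chosen for the example and are not values from the disclosure.

```python
def covers_full_depth(parts, depth):
    """Return True if the half-open ranges in `parts` jointly cover [0, depth)."""
    reached = 0
    for start, stop in sorted(parts):
        if start > reached:
            return False               # gap in the depth dimension
        reached = max(reached, stop)
    return reached >= depth


DEPTH = 32                             # assumed depth of T8
from_op14 = (0, 16)                    # assumed depth range written by operation 14
from_op13 = (16, 32)                   # assumed depth range written by operation 13

before_op13 = covers_full_depth([from_op14], DEPTH)
after_op13 = covers_full_depth([from_op14, from_op13], DEPTH)
```

Before operation 13 has run, the check fails and operation 15 cannot be performed; once both parts are present, the concatenation covers the whole depth and the check succeeds.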
[0101] Figures 5A and 5B discuss an operation graph containing seven operations, but this technique is applicable to any operation graph having any number N of operations and / or any number k of loops, where N and k are positive integers. Furthermore, this technique is particularly well-suited for computing operations in one or more convolutional neural networks (CNNs).
[0102] In some embodiments of this disclosure, methods 300 and 400 in Figures 3A-3C and Figure 4, as well as the processes related to Figure 5, may be implemented using the following first pseudocode.
[0103]
(Listing of the first pseudocode; not reproduced in this text.)
[0104] This first pseudocode implements some of the main aspects of steps 401–414 of Method 400. In one example, the variable "Tensor" may be considered a first tensor, specifically, each component of the first tensor used by each UserOperation, i.e., each operation in a loop of one or more operations that takes "Tensor" as input.
[0105] Furthermore, the first pseudocode specifies an additional condition for executing the UserOperation (e.g., the current operation of iteration through a loop), which is that "all generating operations of the UserOperation" should have been processed. In one embodiment, this condition may complement steps 308 or 408 described above. For example, before steps 308 / 408, a step can be taken to determine whether all operations that produce result components in the first tensor have been processed. If not, the process can proceed directly to steps 318 / 418. This avoids the execution of steps 308 / 408, thereby saving computational effort and resources.
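The ordering of the two conditions can be sketched as follows. This Python fragment is only an illustration of the idea that the producer check may be evaluated first so that the data sufficiency check is skipped when it cannot succeed. The dictionary fields, the functions `can_execute` and `has_enough_data`, and the call counter are hypothetical and are not part of the first pseudocode.

```python
calls = {"data_check": 0}              # counts how often the costlier check is evaluated


def has_enough_data(op):
    """Stand-in for the sufficiency determination of steps 308 / 408."""
    calls["data_check"] += 1
    return op["lines_available"] >= op["lines_needed"]


def can_execute(op, processed):
    # First condition: all generating operations of the input tensor have been processed.
    if not all(p in processed for p in op["producers"]):
        return False                   # proceed directly to steps 318 / 418
    # Second condition: only evaluated if the first one holds.
    return has_enough_data(op)


op4 = {"name": "op4", "producers": ["op2"], "lines_available": 5, "lines_needed": 3}

first = can_execute(op4, processed={"op1"})
calls_after_first = calls["data_check"]
second = can_execute(op4, processed={"op1", "op2"})
```

In the first call the sufficiency check is never evaluated, because a generating operation is still outstanding.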
[0106] In yet another exemplary embodiment using the second pseudocode, the operations may be included in the array opsRemaining, which initially contains all operations, including all operations in the operation graph that may be relevant to the neural network.
[0107] vector<operation> opsRemaining;
The processed operations can be removed from the array. This solution may be a preferred embodiment of the step of marking the processed operations described above.
[0108] The main loop of the code can continue processing until the operation graph is fully scheduled and processed. The main loop can be defined as follows:
[0109]
(Listing of the main loop; not reproduced in this text.)
[0110] In this main loop, the variable "inputTensor" can, in one embodiment, refer to a component of the input tensor that has been received at full depth. For example, if the component is an input line and the input line is ready at full depth in the input tensor, its index is returned (i.e., "inputTensorIndex"), and inputTensor specifically contains that input line and uses it to perform "processGraphRecursive(inputTensor)".
[0111] The procedure processGraphRecursive() can define a recursive function that traverses the operation graph.
[0112]
(Listing of processGraphRecursive(); not reproduced in this text.)
[0113] The procedure opProcessed() can be used to determine whether an operation has already been processed. This can be done by checking whether the operation is included in the opsRemaining array. The procedure opProcessed() can be implemented as follows:
[0114]
(Listing of opProcessed(); not reproduced in this text.)
[0115] The procedure opProducersProcessed() is used to determine whether all parent operations have already been processed.
[0116]
(Listing of opProducersProcessed(); not reproduced in this text.)
[0117] In some embodiments, this procedure can represent a specific embodiment of the determination, in the first pseudocode, of whether "all generating operations of the UserOperation" have been processed, i.e., the optional step that can be performed in method 300 or 400 before step 308 / 408 to determine whether all operations that generate result components in the first tensor have been processed.
[0118] The procedure processOp() starts processing the operation. The procedure can first verify whether there are enough lines in the input tensor for the operation. This may depend on the filter kernel width and the processing state of the operation in higher (embedded) layers. If there are enough lines, memory space for the output can be allocated, the result can be computed, and lines of the input tensor that are no longer needed can be freed.
[0119]
(Listing of processOp(); not reproduced in this text.)
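The procedures named above can be tied together in a compact sketch. The Python code below is a hypothetical reconstruction written for illustration and should not be read as the second pseudocode itself: only the names opsRemaining, processGraphRecursive, opProcessed, opProducersProcessed, processOp and inputTensor come from the description, while the `Tensor` and `Operation` classes, the `connect` helper, the `kernel_h` field, the line counters, the `executed` list and the `max_lines` guard are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)            # compare by identity; the graph holds mutual references
class Tensor:
    name: str
    producers: list = field(default_factory=list)   # operations writing into this tensor
    users: list = field(default_factory=list)       # operations reading from this tensor
    lines: int = 0                                  # lines currently held in memory


@dataclass(eq=False)
class Operation:
    name: str
    inp: Tensor
    out: Tensor
    kernel_h: int = 1           # assumed: input lines needed before the operation can run


opsRemaining = []               # operations not yet processed
executed = []                   # execution order, recorded for illustration only


def connect(name, inp, out, kernel_h=1):
    op = Operation(name, inp, out, kernel_h)
    inp.users.append(op)
    out.producers.append(op)
    return op


def opProcessed(op):
    return op not in opsRemaining


def opProducersProcessed(op):
    return all(opProcessed(p) for p in op.inp.producers)


def processOp(op):
    if op.inp.lines < op.kernel_h:          # not enough input lines yet
        return False
    op.out.lines += 1                       # allocate and "compute" one output line
    opsRemaining.remove(op)
    executed.append(op.name)
    if all(opProcessed(u) for u in op.inp.users):
        op.inp.lines = 0                    # free input lines that no user still needs
    return True


def processGraphRecursive(tensor):
    for op in tensor.users:
        if opProcessed(op) or not opProducersProcessed(op):
            continue
        if processOp(op):
            processGraphRecursive(op.out)   # depth-first descent into users of the output


def mainLoop(inputTensor, ops, max_lines=100):
    opsRemaining[:] = ops
    fed = 0
    while opsRemaining and fed < max_lines:
        inputTensor.lines += 1              # stand-in for receiving one full-depth input line
        fed += 1
        processGraphRecursive(inputTensor)
    return fed


# Hypothetical graph: op1 and op2 both write into T1, which op3 reads.
src, T1, T2 = Tensor("input"), Tensor("T1"), Tensor("T2")
ops = [connect("op1", src, T1), connect("op2", src, T1, kernel_h=2), connect("op3", T1, T2)]
lines_fed = mainLoop(src, ops)
```

In this example op3 is visited after op1 has run but is deferred until op2, the second producer of T1, has also been processed. This is the role that the description assigns to opProducersProcessed().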
[0120] It should be understood that the first and second pseudocodes described above represent examples of use cases illustrating the implementation and processing of embodiments of the present disclosure. Many other embodiments are possible, and the present disclosure is not limited to any particular exemplary embodiment.
[0121] Embodiments of the subject matter, as well as actions and operations described herein, can be implemented in digital electronic circuits, in tangibly embodied computer software or firmware, in computer hardware including structures disclosed herein and their structural equivalents, or in one or more combinations thereof. Embodiments of the subject matter described herein can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions, encoded on a computer program carrier to be executed by a data processing device or to control the operation of a data processing device. For example, the computer program carrier may include one or more computer-readable storage media on which instructions are encoded or stored. The carrier may be a tangible, non-transitory computer-readable medium such as a magnetic disk, magneto-optical disk, or optical disk, solid-state drive, random-access memory (RAM), read-only memory (ROM), or other types of media. Alternatively or additionally, the carrier may be an artificially generated propagating signal, e.g., a machine-generated electrical, optical, or electromagnetic signal generated to encode information for transmission to a suitable receiver device for execution by a data processing device. Computer storage media may be machine-readable storage devices, machine-readable storage substrates, random or serial access memory devices, or a combination of one or more of these, or parts thereof. Computer storage media are not propagating signals.
[0122] A computer program may also be called or described as a program, software, a software application, an app, a module, a software module, an engine, a script, or code. It can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, engine, subroutine, or other unit suitable for execution in a computing environment that may include one or more computers interconnected by a data communication network in one or more locations.
[0123] Computer programs can, but are not required to, correspond to files in a file system. They can be part of files that hold other programs or data, such as one or more scripts stored in a markup language document, a single file dedicated to the program, or multiple collaborative files, such as one or more modules, subprograms, or files containing parts of code.
[0124] Processors for executing computer programs include, for example, both general-purpose and dedicated microprocessors, as well as any one or more processors in any type of digital computer. Generally, a processor receives instructions for a computer program to be executed, as well as data, from non-transitory computer-readable media coupled to the processor.
[0125] The term "data processing device" encompasses all kinds of equipment, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. A data processing device may include dedicated logic circuits, such as FPGAs (Field-Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), or GPUs (Graphics Processing Units). In addition to hardware, the device may also include code that creates an execution environment for computer programs, such as processor firmware, protocol stacks, database management systems, operating systems, or code comprising one or more of these.
[0126] The processes and logic flows described herein can be executed by one or more computers or processors running one or more computer programs that perform operations by manipulating input data and producing outputs. The processes and logic flows can also be executed by dedicated logic circuits, such as FPGAs, ASICs, or GPUs, or by a combination of dedicated logic circuits and one or more programmed computers.
[0127] A computer suitable for running computer programs can be based on a general-purpose or dedicated microprocessor, or both, or any other type of central processing unit. Generally, the central processing unit receives instructions and data from read-only memory or random-access memory, or both. The elements of a computer may include a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and memory may be supplemented by or incorporated into dedicated logic circuits.
[0128] Generally, a computer also includes, or is operable to be coupled to, one or more storage devices to receive data from or transfer data to them. Storage devices can be, for example, magnetic disks, magneto-optical disks, or optical disks, solid-state drives, or any other type of non-transitory computer-readable medium. However, a computer does not need to have such devices. Thus, a computer can be coupled to one or more storage devices, such as one or more local and/or remote memories. For example, a computer can include one or more local memories that are integral components of the computer, or a computer can be coupled to one or more remote memories located in a cloud network. Furthermore, a computer can be embedded in other devices, such as mobile phones, personal digital assistants (PDAs), portable audio or video players, game consoles, Global Positioning System (GPS) receivers, or portable storage devices, such as Universal Serial Bus (USB) flash drives.
[0129] Components can be “coupled” to one another by being interconnected in an intercommunicative manner, such as by being electrically or optically connected to each other directly or through one or more intermediate components. Components can also be “coupled” to one another if one component is integrated into another. For example, a memory component integrated into a processor (e.g., an L2 cache component) is “coupled” to the processor.
[0130] To provide user interaction, embodiments of the subject matter described herein may be implemented on or configured to communicate with a computer having a display device for displaying information to the user, such as an LCD (liquid crystal display) monitor, and an input device, such as a keyboard and a pointing device, such as a mouse, trackball, or touchpad, to which the user can provide input to the computer. Other types of devices may also be used to provide user interaction, for example, the feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or haptic feedback, and the input from the user may be received in any form, including acoustic, voice, or haptic input. Furthermore, the computer may interact with the user by sending documents to and receiving documents from the user's device, for example, by sending a web page to a web browser on the user's device in response to a request received from a web browser, or by interacting with an application running on the user's device, such as a smartphone or electronic tablet. The computer may also interact with the user by sending text messages or other forms of messages to a personal device, such as a smartphone running a messaging application, and receiving response messages from the user in return.
[0131] In this specification, the term "configured to" is used in relation to systems, devices, and computer program components. For one or more computer systems to be configured to perform a particular operation or action means that the system has software, firmware, hardware, or a combination thereof installed that, in operation, causes the system to perform the operation or action. For one or more computer programs to be configured to perform a particular operation or action means that the one or more programs contain instructions that, when executed by a data processing device, cause the device to perform the operation or action. For an application-specific logic circuit to be configured to perform a particular operation or action means that the circuit has electronic logic that performs that operation or action.
[0132] While this specification includes details of many specific embodiments, these should not be interpreted as limitations on the claimed scope as defined by the claims themselves, but rather as descriptions of features that may be specific to particular embodiments. Certain features described in this specification in the context of separate embodiments may also be realized in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be realized separately in multiple embodiments or in any suitable subcombination. Furthermore, even if features are described above as acting in a particular combination and initially claimed as such, one or more features from the claimed combination may, in some cases, be excluded from the combination, and the claims may cover subcombinations or variations of subcombinations.
[0133] Similarly, while operations are depicted in drawings and described in claims in a specific order, this should not be understood as requiring that such operations be performed in a specific order shown, or sequentially, or that all illustrated operations be performed in order to achieve the desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged in multiple software products.
[0134] While several embodiments have been described in detail, it should be understood that aspects of this disclosure can take many forms. In particular, the claimed subject matter may be carried out or implemented in ways different from the examples described, and the described features and characteristics may be carried out or implemented in any combination. For example, the processes shown in the accompanying figures do not necessarily require the specific order shown, or a sequential order, to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. The embodiments shown herein are intended to be illustrative and not to limit the invention as defined by the claims.
Claims
1. A method for processing tensors, comprising the steps of: providing components of a first tensor to a memory unit; determining one or more first operations that use at least the components of the first tensor as input; iterating through the one or more first operations in a first loop; for a current operation among the one or more first operations in the first loop, if there is enough data in the first tensor, including the components of the first tensor, for the current operation to compute components of a second tensor using the components of the first tensor: computing the components of the second tensor using the current operation applied to at least the components of the first tensor; writing the components of the second tensor to the memory unit; if at least a portion of the components of the first tensor is no longer needed by any subsequent operation in the first loop, freeing that portion of the components of the first tensor from the memory unit; and initiating and iterating through a second loop embedded in the first loop, wherein the second loop includes one or more second operations that use at least the components of the second tensor as input; and, if there is not enough data in the first tensor, including the components of the first tensor, for the current operation to compute the components of the second tensor using the components of the first tensor, continuing with the one or more first operations in the first loop.
2. The method according to claim 1, wherein one or more first operations and one or more second operations are operations on an operation graph that shows input and output dependencies between operations.
3. The method according to claim 2, wherein the first loop is the parent loop of a second loop based on an operation graph.
4. The method according to claim 2 or 3, wherein the operation graph represents one or more operations performed on an input tensor to generate an output tensor.
5. The method according to claim 4, wherein the first tensor is an input tensor to the operation graph.
6. The method according to claim 4 or 5, wherein the second tensor is an output tensor.
7. The method according to any one of claims 1 to 6, further comprising the steps of: determining whether the first loop is embedded in another parent loop if the current operation is the last operation in the first loop; and, if the first loop is embedded in another parent loop, returning to the other parent loop and continuing the iteration through one or more operations in the other parent loop.
8. The method according to any one of claims 1 to 7, wherein the step of iterating through the second loop comprises: for a second current operation among the one or more second operations in the second loop, if there is enough data in the second tensor, including the components of the second tensor, for the second current operation to compute components of a third tensor using the components of the second tensor: computing the components of the third tensor using the second current operation applied to at least the components of the second tensor; writing the components of the third tensor to the memory unit; if at least a portion of the components of the second tensor is no longer needed by any subsequent operation in the second loop, freeing that portion of the components of the second tensor from the memory unit; and initiating and iterating through a third loop embedded in the second loop, wherein the third loop includes one or more third operations that use at least the components of the third tensor as input; and, if there is not enough data in the second tensor, including the components of the second tensor, for the second current operation to compute the components of the third tensor using the components of the second tensor, continuing with the one or more second operations in the second loop.
9. The method according to any one of claims 1 to 8, further comprising, for the current operation among the one or more first operations in the first loop, if there is enough data in the first tensor, including the components of the first tensor, for the current operation to compute the components of the second tensor using the components of the first tensor, the steps of: allocating memory space for the components of the second tensor in the memory unit; and writing the components of the second tensor into the memory space allocated in the memory unit.
10. The method according to any one of claims 1 to 9, wherein the first tensor is a three-dimensional tensor having horizontal, vertical, and depth dimensions, and the components of the first tensor are groups of data of the first tensor arranged at least partially along one dimension.
11. The method according to any one of claims 1 to 10, wherein one or more first operations are kernel operations, each having a kernel having a respective kernel size, and the step of computing a component of a second tensor using a current operation applied to at least a component of a first tensor includes the step of applying the kernel of the current operation to the data of the component of the first tensor.
12. The method according to claim 11, further comprising the step of determining whether at least any portion of the components of the first tensor is no longer needed by any subsequent operation in the first loop, based on the maximum kernel size of the kernel of the next operation in the first loop.
13. The method according to any one of claims 1 to 12, further comprising the step of marking the current operation as processed if there is sufficient data in the first tensor containing the components of the first tensor for the current operation, in order to compute the components of the second tensor using the components of the first tensor for the current operation.
14. The method according to any one of claims 1 to 13, wherein the memory unit is the local memory unit of the device.
15. The method according to claim 14, wherein the step of providing the components of the first tensor to a memory unit includes the step of receiving the components of the first tensor from an external memory and storing the components of the first tensor in the local memory unit of the device.
16. One or more computer-readable media storing instructions that, when executed on the device, configure the device to perform the method according to any one of claims 1 to 15.
17. A device comprising: a local memory unit; and a processing unit, wherein the processing unit is configured to perform the method according to any one of claims 1 to 15.
18. A computing system comprising: at least one processor; system memory; and at least one device according to claim 17.
19. The computing system according to claim 18, wherein at least one processor is configured to store an input tensor in system memory, provide components of the input tensor to at least one device, receive an output tensor from at least one device, and store the output tensor in system memory.