Stream-based accelerator processing of computational graphs
By allocating subgraphs of the computation graph to multi-stream accelerator devices, optimizing the operation sequence and memory usage, the problem of long computation graph processing time is solved, and more efficient neural network operation execution is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2016-10-28
- Publication Date
- 2026-06-30
Description
[0001] Case Analysis
[0002] This application is a divisional application of Chinese Invention Patent Application No. 201680063365.6, filed on October 28, 2016.
Technical Field
[0003] This application relates to stream-based accelerator processing of computational graphs.
Background Technology
[0004] This specification relates to processing computation graphs representing neural networks by assigning subgraphs to accelerator devices (e.g., graphics processing units (GPUs)) with multiple streams and / or to using such processed computation graphs to process model inputs.
[0005] A neural network is a machine learning model that uses one or more layers to generate an output (e.g., one or more classifications) from received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as the input to the next layer in the network (i.e., the next hidden layer or output layer of the network). Each layer of the network generates an output from the received input based on the current values of the corresponding set of parameters used by the layer.
[0006] In existing systems, the operations of a computation graph can be processed by a single device. In some implementations, the device is a GPU. The device may have a processor that performs operations (e.g., generating output from input at a layer) and stores the outputs of the operations in memory. Because of the large number and size of the operations typically required to generate output in a computation graph, a single device can take a significant amount of time to process the graph's operations.
Summary of the Invention
[0007] Generally, this specification describes a system or method for processing a subgraph of a computation graph using a stream-based accelerator device (e.g., a GPU).
[0008] Generally, an innovative aspect of the subject matter described in this specification can be embodied in a method comprising the following actions: receiving a request to process a computation graph; obtaining data representing a subgraph of the computation graph, the computation graph comprising a plurality of nodes and directed edges, wherein each node represents a corresponding operation, wherein each directed edge connects a corresponding first node to a corresponding second node, the corresponding second node representing an operation that receives the output of the operation represented by the corresponding first node as input, the subgraph being assigned to a first device by a deployer in the computation graph system; determining that the first device includes a hardware accelerator having a plurality of streams; generating instructions in response to determining that the first device includes a graphics processing unit having a plurality of streams, the instructions, when executed by the first device, causing the first device to: assign the operation represented by each node in the subgraph to a corresponding stream among the plurality of streams of the graphics processing unit; and execute the operation represented by the node in the subgraph according to the assignment; and provide the instructions and the data to the first device. This aspect of the method can be a computer-implemented method. This aspect of the method can be performed by one or more computing devices, for example by one or more computing devices including a computation graph system.
[0009] Implementations may include one or more of the following features. The request specifies one or more specific outputs from one or more corresponding nodes in the subgraph, and the method further includes: receiving the one or more specific outputs from the first device; and providing the one or more specific outputs to the client. The instructions further cause the first device to store the one or more specific outputs in the memory of the first device. The operations for the subgraph include partial inference or training computations for a neural network. The method includes analyzing the subgraph to identify a group of nodes in the subgraph in a chain structure, wherein the instructions cause the first device to assign the group of nodes to one stream. The assignment includes: analyzing the subgraph to identify a first node in the subgraph having a plurality of directed edges as outputs, wherein the instructions cause the first device to assign the node pointed to by each of the directed edges to a unique stream of the graphics processing unit. The instructions cause the first device to determine, for each node, the corresponding amount of memory resources in the graphics processing unit consumed by the operation represented by that node based on the directed edges to that node, wherein the assignment is based at least on the corresponding amount of memory resources.
The instructions cause the first device to determine that a specific operation represented by a node has finished on a specific stream and, in response to determining that the specific operation has finished: to determine a first amount of memory, consumed by the specific operation, that will be released; to determine a corresponding estimated amount of memory consumed by each unassigned node in a group of unassigned nodes; to determine, from the group of unassigned nodes, a first unassigned node whose estimated amount of memory maximizes the use of the first amount of memory; and to assign the operation represented by the first unassigned node to the specific stream. In one embodiment, the method further includes: receiving model input; and processing the model input by the hardware accelerator according to the operations represented by the nodes in the subgraph.
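The memory-aware selection step described above can be sketched as follows. This is a minimal illustration, not the claimed method: the function name `pick_next_node`, the byte figures, and the reading of "maximizes the use of the first amount of memory" as "the largest estimate that still fits in the freed amount" are all assumptions made for the example.

```python
def pick_next_node(freed_bytes, estimates):
    """Pick the unassigned node whose estimated memory best fills freed_bytes.

    estimates maps a (hypothetical) node name to its estimated memory use in
    bytes. Nodes that need more than freed_bytes are skipped. Returns None
    when no unassigned node fits.
    """
    fitting = {name: mem for name, mem in estimates.items() if mem <= freed_bytes}
    if not fitting:
        return None
    # The largest estimate that still fits makes the most use of the freed
    # memory; ties are broken by node name so the result is deterministic.
    return max(fitting, key=lambda name: (fitting[name], name))


# Hypothetical numbers: an operation on some stream finished and freed 100 bytes.
unassigned = {"a": 40, "b": 90, "c": 120}
chosen = pick_next_node(100, unassigned)
```

Here `chosen` is `"b"`: node `c` does not fit, and `b` uses more of the freed memory than `a`. The chosen node's operation would then be queued on the stream where the finished operation ran.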
[0010] In another aspect, the subject matter described in this specification can be embodied in a method that may include actions such as: providing a machine learning model corresponding to the processed computational graph obtained by the method of the first aspect; and using the machine learning model to process model inputs.
[0011] In another aspect, the subject matter described in this specification can be embodied in a method that may include actions such as executing a subgraph of the processed computation graph obtained by the method of the first aspect by a hardware accelerator.
[0012] In these respects, the computational graph can be a representation of a machine learning model, such as a neural network.
[0013] Another innovative aspect includes the following actions: a graphics processing unit having multiple streams receives data representing a subgraph of the computation graph, the computation graph including multiple nodes and directed edges, wherein each node represents a corresponding operation, and each directed edge connects a corresponding first node to a corresponding second node, the corresponding second node representing an operation that receives the output of the operation represented by the corresponding first node as input, the subgraph being assigned to the graphics processing unit by a deployer in the computation graph system; assigning the operation represented by each node in the subgraph to a corresponding stream among the multiple streams of the graphics processing unit; and executing the operation represented by the node in the subgraph according to the assignment.
[0014] Implementations may include one or more of the following features: receiving a request that identifies one or more specific outputs from one or more corresponding nodes of the subgraph; and providing the one or more specific outputs to the client. Receiving data identifying a group of nodes in the subgraph in a chain structure; and assigning the group of nodes to one stream. The assignment includes: receiving data identifying a first node in the subgraph with multiple directed edges as outputs; and assigning the node pointed to by each of the directed edges to a unique stream of the graphics processing unit. For each node, determining the corresponding amount of memory resources in the graphics processing unit consumed by the operation represented by that node based on the directed edges to that node, wherein the assignment is based at least on the corresponding amount of memory resources. Determining that a specific operation represented by a node has finished on a specific stream and, in response to determining that the specific operation has finished: determining a first amount of memory, consumed by the specific operation, that will be released; determining a corresponding estimated amount of memory consumed by each unassigned node in a group of unassigned nodes; determining, from the group of unassigned nodes, a first unassigned node whose estimated amount of memory maximizes the use of the first amount of memory; and assigning the operation represented by the first unassigned node to the specific stream.
[0015] Other implementations of these and other aspects include corresponding systems, apparatuses, and computer programs configured to perform actions of methods encoded on computer storage devices (which may or may not be non-transitory storage devices).
[0016] Specific embodiments of the subject matter described in this specification can be implemented to achieve one or more of the following advantages. The operations of a neural network (e.g., operations for generating an inference from an input) can be represented as a computation graph of nodes and directed edges. The system processes this computation graph representation to perform the operations efficiently. The system achieves this efficiency because the computation graph is executed on multiple streams. Using multiple streams allows logically independent operations to be reordered or executed concurrently. When the goal is to reduce the end-to-end latency of the overall computation, the example system can reorder logically independent operations. When the goal is higher throughput, the example system can execute operations concurrently. The computation graph can be partitioned for parallel operation more easily than a conventional representation. As an illustration, subgraphs of the computation graph can be assigned to unique devices, each of which performs the operations in its respective subgraph, to reduce the total time required to perform the operations of the neural network.
[0017] The device assigned to the subgraph may be a GPU. The subgraph can be divided among multiple streams of the GPU to perform the subgraph's operations more efficiently. Details of one or more embodiments of the subject matter of this specification are set forth in the following figures and description. Other features, aspects, and advantages of the subject matter will become apparent from the specification, figures, and claims. It should be understood that aspects and embodiments can be combined, and the features described in one aspect or embodiment can be implemented in the context of other aspects or embodiments.
Attached Figure Description
[0018] Figure 1 shows an example computation graph system for distributing the operations of a neural network represented as a computation graph.
[0019] Figure 2 is a flowchart of an example process for using a GPU to process a subgraph of a computation graph.
[0020] Figure 3 shows an example subgraph of the computation graph processed by the GPU.
[0021] Figure 4 is a flowchart of an example process for assigning nodes to streams.
[0022] Similar reference numerals and names in the various figures indicate similar elements.
Detailed Implementation
[0023] This specification generally describes a computation graph system that performs operations represented by a computation graph in a distributed manner.
[0024] A computation graph consists of nodes connected by directed edges. Each node in the computation graph represents an operation. An incoming edge to a node represents the flow of input to that node, i.e., the input to the operation represented by the node. An outgoing edge from a node represents the flow of output from the operation represented by that node, which can be used as the input to an operation represented by another node. Therefore, a directed edge connecting a first node in the graph to a second node in the graph indicates that the output generated by the operation represented by the first node is used as the input to the operation represented by the second node.
[0025] Generally, the inputs and outputs flowing along directed edges in a computation graph are tensors. A tensor is a multidimensional array of numerical or other values (e.g., strings) that have a specific order corresponding to the dimensions of the array. For example, scalar values are 0th-order tensors, vectors of numerical values are 1st-order tensors, and matrices are 2nd-order tensors.
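The graph and tensor definitions above can be made concrete with a small sketch. The `Graph` class, the node names, and the nested-list tensor encoding are assumptions chosen for illustration only; the specification does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy computation graph: each node is an operation, each edge a data flow."""
    ops: dict = field(default_factory=dict)     # node name -> callable
    inputs: dict = field(default_factory=dict)  # node name -> producer node names

    def add_node(self, name, op, inputs=()):
        # Listing `src` in `inputs` records a directed edge src -> name.
        self.ops[name] = op
        self.inputs[name] = list(inputs)

    def evaluate(self, name, cache=None):
        # Compute the producers first, then apply this node's operation.
        cache = {} if cache is None else cache
        if name not in cache:
            args = [self.evaluate(src, cache) for src in self.inputs[name]]
            cache[name] = self.ops[name](*args)
        return cache[name]


def rank(tensor):
    """Order of a non-empty nested-list tensor: 0 scalar, 1 vector, 2 matrix."""
    order = 0
    while isinstance(tensor, list):
        tensor = tensor[0]
        order += 1
    return order


g = Graph()
g.add_node("x", lambda: [1.0, 2.0])  # 1st-order tensor (vector)
g.add_node("w", lambda: [3.0, 4.0])  # 1st-order tensor (vector)
g.add_node("dot", lambda a, b: sum(p * q for p, q in zip(a, b)), inputs=["x", "w"])
result = g.evaluate("dot")           # 0th-order tensor (scalar)
```

The two edges `x -> dot` and `w -> dot` carry vectors, and the `dot` node emits a scalar on its outgoing edge.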
[0026] In some implementations, the operations represented in the computation graph are neural network operations or operations for different types of machine learning models. A neural network is a machine learning model that uses one or more layers of non-linear units to predict outputs from received inputs. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to another layer in the network (i.e., another hidden layer, an output layer, or both). Some layers of the network generate outputs from the received inputs based on the current values of their respective parameter sets; however, other layers of the network may not have parameters.
[0027] For example, the operations represented by a computation graph can be the operations necessary for neural network inference, that is, for processing inputs through the layers of a neural network to generate neural network outputs for those inputs. As another example, the operations represented by a computation graph can be the operations necessary for training a neural network by executing a neural network training procedure that adjusts the values of the network's parameters (e.g., to determine trained values of the parameters from their initial values). In some cases, such as during the training of a neural network, the operations represented by a computation graph may include operations performed by multiple copies of the neural network.
[0028] As an illustration, a neural network layer receiving input from the previous layer can use a parameter matrix to perform matrix multiplication between the parameter matrix and the input. In some cases, this matrix multiplication can be represented as multiple nodes in a computation graph. For example, matrix multiplication can be divided into multiple multiplication and addition operations, and each operation can be represented by a different node in the computation graph. The operation represented by each node generates a corresponding output, which flows along directed edges to subsequent nodes. After the operation represented by the final node generates the result of the matrix multiplication, the result flows along directed edges to another node. This result is equivalent to the output of the neural network layer that performed the matrix multiplication.
[0029] In some other cases, the matrix multiplication is represented as a single node in the graph. The operation represented by the node can receive as input an input tensor on a first directed edge and a weight tensor (e.g., a parameter matrix) on a second directed edge. In some implementations, the weight tensor is associated with a shared persistent state of the model. The node can process the inputs, for example by performing a matrix multiplication of the input and weight tensors, to output an output tensor on a third directed edge, which is equivalent to the output of the neural network layer.
[0030] Other neural network operations that can be represented by nodes in the computation graph include other mathematical operations, such as subtraction, division, and gradient calculation; array operations, such as concatenation, splicing, splitting, or ranking; and neural network building block operations, such as SoftMax, Sigmoid, Rectified Linear Unit (ReLU), or convolution.
[0031] Representing neural networks as computational graphs provides a flexible and granular way to efficiently implement neural networks, especially if the operation of the neural network spans multiple devices with different hardware profiles.
[0032] Figure 1 shows an example computation graph system 100 for distributing the operations of a neural network represented as a computation graph. System 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0033] The user of client 102 can request to perform actions on the computation graph representing the neural network. For example, the client can register the graph with a session manager, feed data inputs into the graph, or evaluate one or more of the graph's outputs. Client 102 can be an application running on a computer.
[0034] As part of the request, client 102 provides system 100 with data that identifies the computation graph and specifies the type of action to be performed on the computation graph.
[0035] For example, a request could identify a computational graph representing inference for a particular neural network and identify the inputs on which inference should be performed.
[0036] As another example, the request can identify a computational graph representing a training process for a specific neural network and can identify the inputs, such as training data, on which training should be performed. In this example, upon receiving a request to process the computational graph representing the training procedure, system 100 can, for example, use conventional backpropagation or other neural network training techniques to determine modified values for the parameters of one or more edges of the computational graph. System 100 can store the modified parameters in the device's memory, and executor 106 can retrieve and store the addresses of the modified weights at system 100. Based on further requests from client 102 for inference, training, or other operations requiring the modified weights, system 100 can use the addresses to access the modified weights.
[0037] In some cases, a request can specify the response that should be transmitted in response to the request. For example, for a neural network training request, client 102 may request an indication that the requested neural network training operation has been completed and, optionally, the trained values of the neural network's parameters or an indication of a memory location where client 102 can access the trained values. As another example, for a neural network inference request, client 102 may request the output values representing an inference operation from one or more specific nodes in the computation graph.
[0038] System 100 performs the operations to generate the specific outputs by partitioning the operations represented by the computation graph across multiple devices 116-122. System 100 sends the operations to the multiple devices 116-122 over a data communication network 114 (e.g., a local area network (LAN) or a wide area network (WAN)). Devices 116-122 perform the operations and, if applicable, return corresponding outputs or indications to system 100, which can then return the requested outputs or indications to client 102.
[0039] Any device performing neural network operations (e.g., devices 116-122) may include memory (e.g., random access memory (RAM)) for storing instructions and data, and a processor for executing the stored instructions. Generally, each device is a hardware resource that performs operations independently of other devices. For example, each device may have its own processing unit. A device may be a graphics processing unit (GPU), a central processing unit (CPU), or other accelerators. As an illustration, a machine may host one or more devices, such as multiple CPUs and GPUs.
[0040] Each device can also have corresponding computing capabilities. That is, devices can have different amounts of memory, processing speeds, or other architectural characteristics. Therefore, some devices can perform operations that other devices cannot. For example, some operations require a certain amount of memory that only a specific device has, or some devices are configured to perform only specific types of operations, such as inference operations.
[0041] The session manager 104 in system 100 receives a request from client 102 to initiate a session to perform operations on the computation graph. The session manager 104 manages a set of devices capable of performing operations on the computation graph, such as devices 116-122, and can provide the deployer 108 with a set of devices available for performing operations.
[0042] For each operation to be performed in the computation graph, deployer 108 determines the corresponding target device, such as device 116, to perform the operation and, in some embodiments, the time it will take the target device to perform the operation. Deployer 108 determines the device assignment from estimates of how long the operation will take on each available device given the size of the input data. Deployer 108 obtains these processing-time estimates from measurements or from a predictive performance model. Some operations can be performed in parallel, while other operations require preceding operations in the computation graph to complete first, for example because they take the output of a preceding operation as input.
[0043] After a device performs the operation assigned by deployer 108 to generate output, executor 106 can retrieve the output. Executor 106 can generate an appropriate response to the request, such as the output or an indication that processing has completed. Executor 106 can then return the response to client 102. Although Figure 1 shows one executor 106, in one embodiment there is one executor per device. Each such executor issues an operation to its device when the operation becomes runnable (i.e., all of its inputs have been computed). This embodiment also has a graph manager that partitions the graph to run on multiple devices by invoking deployer 108 and creates the necessary executors.
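A minimal sketch of time-based device assignment is given below. The function `place`, the device names, the memory capacities, and the throughput-based cost model are hypothetical; the specification only says that processing time is estimated from measurements or a predictive model, and this sketch ignores load balancing and dependencies between operations.

```python
def place(ops, devices, estimate_seconds):
    """Assign each operation to the capable device with the lowest estimated time.

    ops maps operation name -> input size; devices maps device name -> memory
    capacity (same units as the input size). estimate_seconds(op, device, size)
    stands in for a measured or predictive performance model.
    """
    placement = {}
    for op, size in ops.items():
        # Only devices with enough memory can run the operation at all.
        capable = [d for d, capacity in devices.items() if capacity >= size]
        if not capable:
            raise ValueError(f"no device can run {op!r}")
        placement[op] = min(capable, key=lambda d: estimate_seconds(op, d, size))
    return placement


# Hypothetical devices and cost model: time = input size / device throughput.
devices = {"cpu:0": 64, "gpu:0": 16}
throughput = {"cpu:0": 1.0, "gpu:0": 8.0}
ops = {"matmul": 8, "embedding": 32}
placement = place(ops, devices, lambda op, d, size: size / throughput[d])
```

With these numbers `matmul` goes to `gpu:0` because it is faster there, while `embedding` goes to `cpu:0` because it does not fit in the GPU's memory.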
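The notion of an operation becoming runnable once all of its inputs have been computed can be sketched as a dependency count per node. The function `issue_order` and the four-node graph are illustrative assumptions, and the sketch treats each operation as complete at the moment it is issued, which a real device would not guarantee.

```python
from collections import deque


def issue_order(inputs):
    """Return an order in which operations can be issued to a device.

    inputs maps node -> list of nodes whose outputs it consumes. A node is
    issued only after every one of its inputs has been issued (and, in this
    simplified model, computed).
    """
    pending = {node: len(srcs) for node, srcs in inputs.items()}
    consumers = {node: [] for node in inputs}
    for node, srcs in inputs.items():
        for src in srcs:
            consumers[src].append(node)

    ready = deque(sorted(n for n, count in pending.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for consumer in consumers[node]:
            pending[consumer] -= 1
            if pending[consumer] == 0:  # all inputs now available
                ready.append(consumer)
    if len(order) != len(inputs):
        raise ValueError("graph contains a cycle")
    return order


order = issue_order({"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]})
```

Node `d` is issued last because it becomes runnable only after both `b` and `c` have produced their outputs.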
[0044] Session manager 104 also provides executor 106 with a set of operations to be executed in the computation graph. Executor 106 periodically retrieves runtime statistics from devices 116-122 associated with the graph execution of the operations. Executor 106 provides the runtime statistics to deployer 108, which can re-optimize the deployment and scheduling of further operations.
[0045] Figure 2 is a flowchart of an example process 200 for using a GPU to process a subgraph of a computation graph. For convenience, process 200 will be described as being executed by a system of one or more computers located in one or more locations. For example, a properly programmed computation graph system (e.g., the computation graph system 100 of Figure 1) can execute process 200.
[0046] The system receives a request from the client to process the computation graph (step 202). For example, the request could be to perform neural network inference represented by the computation graph on a specified input, to perform neural network training operations represented by the computation graph on a specified training dataset, or to perform other neural network operations represented by the computation graph, as described above with reference to Figure 1.
[0047] In some cases, the computation graph is sent along with a request from the client. In other cases, the request identifies the computation graph, and the system retrieves data representing the identified graph from memory.
[0048] The system can partition a computation graph into multiple subgraphs. In some implementations, the subgraphs are specified by the client sending the request, and the system partitions the computation graph according to a specification. In other implementations, the system partitions the computation graph such that each subgraph requires a similar amount of resources to perform operations compared to other subgraphs.
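One way to obtain subgraphs that need similar amounts of resources is a greedy balancing pass over per-node cost estimates, sketched below. This is an assumption-laden illustration: `partition_by_cost` and the costs are hypothetical, and the sketch balances cost only, ignoring graph connectivity, which a real partitioner would also have to consider.

```python
import heapq


def partition_by_cost(costs, k):
    """Split nodes into k groups with roughly equal total estimated cost.

    costs maps node -> estimated resource cost. Nodes are taken from most to
    least expensive and each is added to the currently lightest group.
    """
    heap = [(0, i) for i in range(k)]  # (current load, group index)
    parts = [[] for _ in range(k)]
    for node in sorted(costs, key=lambda n: (-costs[n], n)):
        load, i = heapq.heappop(heap)
        parts[i].append(node)
        heapq.heappush(heap, (load + costs[node], i))
    return parts


costs = {"a": 5, "b": 4, "c": 3, "d": 2}
parts = partition_by_cost(costs, 2)
loads = [sum(costs[n] for n in part) for part in parts]
```

For these hypothetical costs the two groups are `["a", "d"]` and `["b", "c"]`, each with a total cost of 7.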
[0049] The system can assign each subgraph to an available device, for example using the deployer 108 of Figure 1.
[0050] The system obtains data representing a specific subgraph of the partitioned computation graph (step 204). This data can be obtained from the system's database or memory. For illustration, the operations of the specific subgraph represent partial inference or training computations.
[0051] The system determines that the device to which the subgraph is assigned is a graphics processing unit or other hardware accelerator device with multiple streams (step 206). For illustration, the system can assess whether a device is a GPU with multiple streams by requesting the device type from the resource manager that manages the devices to be assigned to the computation graph. Each stream is an independent hardware queue that processes its operations sequentially.
[0052] The system generates instructions that, when executed by the device, cause the device to perform particular operations (step 208). In particular, the instructions cause the device to assign the operation represented by each node in the subgraph to a corresponding stream of the device.
[0053] The example system can assign computations from some hardware accelerators to streams in a specific way (e.g., if an operation is performed on stream A, then subsequent related operations must also be performed on stream A). For example, the first operation can be stateful and performed on stream A. Through execution, the first operation can change the internal state of the hardware in a way that must occur before the second operation can be performed. After the first operation completes, the second operation can then be performed on stream A.
[0054] In some implementations, two internal hardware resources cannot be used simultaneously and therefore need to be serialized.
[0055] Generally, devices assign operations that do not depend on each other to different streams. By assigning operations that do not depend on each other to different streams, the hardware does not need to know how long an operation will take and can choose from many available operations to perform the first operation that is ready to be performed without costly host intervention.
[0056] The instructions also cause the device to execute the operations represented by nodes in the subgraph according to the assignment. When operations are assigned to a specific stream, they are queued. The device can execute operations in a first-in, first-out (FIFO) manner. Therefore, if the device has only one stream, the operations assigned to the device are executed serially. If the device has multiple streams, operations in different streams can be executed in parallel and reordered relative to each other, while operations within a given stream are executed serially. Using multiple streams to execute operations reduces the total time for executing the subgraph's operations. This is described further below with reference to Figure 3 and Figure 4.
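The effect of FIFO streams on total execution time can be illustrated with a small timing model. Everything here is hypothetical: the operation names, the durations, and the function `finish_times`, which assumes that each stream runs its queue serially, that different streams run fully in parallel, and that synchronization costs nothing.

```python
def finish_times(ops, stream_of):
    """Compute when each operation finishes under a simple stream model.

    ops is a list of (name, duration, inputs) in enqueue order, which must
    follow the direction of data flow. An operation starts once its stream
    is free and all of its inputs have finished.
    """
    stream_free = {}  # stream -> time at which its queue is next free
    finish = {}
    for name, duration, inputs in ops:
        stream = stream_of[name]
        start = max([stream_free.get(stream, 0)] + [finish[i] for i in inputs])
        finish[name] = start + duration
        stream_free[stream] = finish[name]
    return finish


ops = [
    ("load", 2, []),
    ("conv", 3, ["load"]),
    ("pool", 4, ["load"]),
    ("add", 1, ["conv", "pool"]),
]
one_stream = finish_times(ops, {n: "s0" for n, _, _ in ops})
two_streams = finish_times(
    ops, {"load": "s0", "conv": "s0", "pool": "s1", "add": "s0"}
)
```

With a single stream the last operation finishes at time 10. With two streams, `conv` and `pool` overlap, and the last operation finishes at time 7.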
[0057] The system provides instructions and data to the device (step 210). In some implementations, the system sends a request to the device to initiate an operation. The device receives the request and, in response, executes the instructions received from the system. For example, the device may receive model input and process the model input according to the operations represented by nodes in the subgraph.
[0058] Figure 3 shows an example subgraph 316 of the computation graph processed by accelerator 302. Subgraph 316 has nodes 308-314, each of which represents an operation to be performed by accelerator 302. A computation graph system (e.g., the system 100 of Figure 1) assigns subgraph 316 to accelerator 302.
[0059] Accelerator 302 has two streams, 304 and 306. These streams share the utilization of accelerator 302. In a GPU, streams can be symmetric, meaning that all operations can be performed on any stream. This symmetry may not apply to all accelerator devices. For example, on a particular accelerator device, certain streams must be used to perform the operation of copying data between host and device memory.
[0060] The computation graph system can analyze subgraph 316 to determine how subgraph 316 is assigned to the multiple streams 304 and 306. In some implementations, the system generates instructions that cause accelerator 302 to assign the nodes of subgraph 316 in a manner that minimizes the number of directed edges connecting different streams. Enforcing dependencies between streams can have a performance cost. Ordering instructions have some overhead. Each ordering dependency also reduces the number of execution orderings available to the device, which reduces scheduling flexibility. Whenever a directed edge connects a first stream to a second stream, the second stream waits until the operation at the source of that edge in the first stream has finished. This waiting can leave the second stream idle, leading to inefficient use of the GPU.
[0061] In some implementations, the system generates instructions that cause accelerator 302 to assign the nodes of subgraph 316 based on the characteristics of accelerator 302. For example, accelerator 302 has a fixed number of streams, namely streams 304 and 306. The system can assign nodes so that each stream is similarly utilized by accelerator 302. For accelerators that are GPUs, all streams share a single large thread pool.
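The quantity being minimized, the number of directed edges whose endpoints sit on different streams, is easy to compute for a candidate assignment. The helper `cross_stream_edges`, the node names, and both candidate assignments below are hypothetical and serve only to compare two options.

```python
def cross_stream_edges(edges, stream_of):
    """Return the directed edges whose two endpoints are on different streams."""
    return [(u, v) for u, v in edges if stream_of[u] != stream_of[v]]


# Hypothetical subgraph: a feeds b and d, and b feeds c.
edges = [("a", "b"), ("a", "d"), ("b", "c")]

# Candidate 1 keeps the chain b -> c together on one stream.
grouped = {"a": "s0", "b": "s1", "c": "s1", "d": "s0"}
# Candidate 2 alternates streams without looking at the edges.
alternating = {"a": "s0", "b": "s1", "c": "s0", "d": "s1"}

grouped_cost = len(cross_stream_edges(edges, grouped))
alternating_cost = len(cross_stream_edges(edges, alternating))
```

The grouped assignment leaves one cross-stream edge, while the alternating one leaves three, so the grouped assignment needs fewer ordering dependencies.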
[0062] Some streams also perform specific operations that other streams do not. For example, stream 306 can perform direct memory access (DMA) operations, while stream 304 does not. Therefore, the system can analyze each node to determine the type of operation represented by the node, and the system can assign that node to a stream capable of performing that type of operation. In a GPU, the primary congested resource is the DMA engine that copies data between host and device memory. The DMA engine can be used by any stream. If a stream is performing a DMA operation, that stream cannot perform computations concurrently. The example system therefore ensures that at least one other stream has some computation to perform concurrently. The system can analyze the subgraph to identify and thus generate instructions that cause the software module or driver managing the assignment of operations to assign nodes according to the following two general rules. First, the system attempts to assign nodes in a chain structure to the same stream. Nodes in a chain structure are connected to each other by following a directed edge from node to node. Therefore, a node in a chain must wait for the operation at the previous node in the chain to complete before it can compute its own operation. The chain of assigned nodes is not always possible because branching and merging occur in the graph, for example from shared input variables or common subexpressions.
[0063] Secondly, the system can selectively generate instructions that cause the accelerator 302 to assign multiple nodes, each receiving input from a single node, to a single stream. That is, if the first node has multiple outputs to multiple different nodes, the system assigns each of the different nodes to the single stream. Each of the different nodes has a data dependency on any of the other different nodes, and therefore, efficiency is improved when operating on non-interacting data.
[0064] As an illustration, accelerator 302 receives subgraph 316. Instructions generated by the system cause accelerator 302 to assign initial node 308 to a first stream, stream 306. Initial node 308 has two outputs: a directed edge to node 310 and a directed edge to node 314. Therefore, using the second rule, the instructions cause accelerator 302 to assign nodes 310 and 314 to different streams. Node 312 receives only the output of node 310 as input. Therefore, using the first rule, the system assigns node 312 to the same stream as node 310, i.e., stream 304.
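The two assignment rules can be illustrated with a minimal Python sketch. It is not the implementation described in this specification: the function name `assign_streams`, the use of integer stream indices, the round-robin placement of initial nodes, and the choice of keeping the first stream a merge node receives are all assumptions made for illustration.

```python
from collections import deque


def assign_streams(edges, num_streams):
    """Map each node to a stream index using the chain and fan-out rules.

    edges: list of (source, target) directed edges of the subgraph.
    Assumption: if a node fans out to more targets than there are
    streams, stream indices wrap around and are reused.
    """
    successors, indegree = {}, {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree.setdefault(src, 0)
        indegree[dst] = indegree.get(dst, 0) + 1

    stream_of = {}
    ready = deque()
    # Initial nodes (no inputs) are spread round-robin (an assumption).
    initial = [n for n in indegree if indegree[n] == 0]
    for i, node in enumerate(initial):
        stream_of[node] = i % num_streams
        ready.append(node)

    # Walk the graph in the direction of data flow.
    while ready:
        node = ready.popleft()
        outs = successors[node]
        for i, nxt in enumerate(outs):
            if nxt not in stream_of:
                if len(outs) == 1:
                    # Rule 1: a chain stays on the same stream.
                    stream_of[nxt] = stream_of[node]
                else:
                    # Rule 2: fan-out targets go to distinct streams.
                    stream_of[nxt] = (stream_of[node] + 1 + i) % num_streams
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return stream_of


# Subgraph 316: node 308 feeds nodes 310 and 314; node 310 feeds node 312.
example = assign_streams([(308, 310), (308, 314), (310, 312)], num_streams=2)
```

If index 0 is read as stream 306 and index 1 as stream 304, the result places node 308 on stream 306, nodes 310 and 312 on stream 304, and node 314 on stream 306, which agrees with the illustration above.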
[0065] As described above, a stream is a hardware queue whose operations are executed in order. Therefore, the order in which accelerator 302 assigns nodes to streams matters. Accelerator 302 assigns nodes to streams in the direction of data flow in the subgraph. That is, accelerator 302 identifies one or more initial nodes in the subgraph and assigns them. Then, accelerator 302 follows the directed edges that are the outputs of the one or more initial nodes to identify subsequent nodes, and assigns these subsequent nodes to the corresponding streams. Accelerator 302 continues assigning nodes until every node in the subgraph has been assigned. As a result of assigning nodes in this order, operations within a given stream are also executed in the order in which they were assigned. When the inputs to an operation A are generated on different streams, it is necessary to ensure that they have all been computed before operation A is executed. Execution on the stream to which operation A is assigned should be stalled until all inputs to operation A have been computed. The precise stalling mechanism is device-specific. For GPU devices, an event can be created for each input stream, and an instruction can be added to each such stream to signal the event. For each input, an instruction can also be added to the stream to which operation A is assigned so that the operation waits for the relevant event before executing. When one or more inputs to operation A are computed on the same stream as operation A, the corresponding data-flow dependency instructions can be safely removed, resulting in better performance. Within a given stream, an operation whose output is used as input by operations represented by one or more other nodes assigned to that stream will already have been computed, or scheduled to be computed, by the time accelerator 302 executes the operations represented by those other nodes.
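The event-based stalling described above can be sketched as a host-side simulation in Python. The sketch only builds per-stream instruction lists; it does not call any real GPU API. The function name `build_stream_queues`, the instruction tuples `("launch", ...)`, `("record", ...)` and `("wait", ...)`, and the choice of one event per producing node are assumptions made for illustration.

```python
def build_stream_queues(order, inputs, stream_of):
    """Build an ordered instruction list for each stream.

    order:     nodes listed in the direction of data flow.
    inputs:    dict mapping a node to the nodes that produce its inputs.
    stream_of: dict mapping a node to the stream it is assigned to.
    """
    # A producer needs an event only if some consumer sits on another stream.
    needs_event = set()
    for node in order:
        for producer in inputs.get(node, []):
            if stream_of[producer] != stream_of[node]:
                needs_event.add(producer)

    queues = {}
    for node in order:
        queue = queues.setdefault(stream_of[node], [])
        for producer in inputs.get(node, []):
            # Same-stream inputs are already ordered by the queue itself,
            # so no wait instruction is emitted for them.
            if stream_of[producer] != stream_of[node]:
                queue.append(("wait", producer))
        queue.append(("launch", node))
        if node in needs_event:
            queue.append(("record", node))
    return queues


queues = build_stream_queues(
    order=[308, 310, 314, 312],
    inputs={310: [308], 314: [308], 312: [310]},
    stream_of={308: 306, 310: 304, 314: 306, 312: 304},
)
# queues[306] -> [("launch", 308), ("record", 308), ("launch", 314)]
# queues[304] -> [("wait", 308), ("launch", 310), ("launch", 312)]
```

Only node 310 waits on an event, because its input comes from stream 306. Nodes 314 and 312 receive their inputs on their own streams, so no dependency instructions are emitted for them.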
[0066] Continuing the illustration above, because data flows from node 310 to node 312, node 310 is assigned to stream 304 first, followed by node 312. When the operations on the stream are performed, accelerator 302 first executes the operation represented by node 310 and then executes the operation represented by node 312.
[0067] After the final nodes (i.e., nodes 312 and 314) have performed their operations, accelerator 302 returns the outputs of those nodes, or an indication that the operations are complete, to the system. In the example system, there is a special "send" node that copies the computation results from the memory of accelerator 302 back to host memory, where they can be handed to a different device by a receive node or returned to the client in a Remote Procedure Call (RPC) response. If necessary, the system can then return the outputs or the indication to the client.
[0068] Another implementation of assigning nodes to streams is described further below with reference to Figure 4.
[0069] Figure 4 is a flowchart of an example process 400 for assigning a subgraph to a device. For convenience, process 400 will be described as being executed by a system (e.g., a GPU). For example, the GPU may receive instructions generated by a computation graph system (e.g., the computation graph system 100 of Figure 1) that, when executed, cause the GPU to execute process 400.
[0070] The system can assign a specific node to a stream based on the amount of memory resources consumed by the node or by a previously assigned node. For example, the system can calculate the size of the tensor on each directed edge to and from each node in a subgraph. The size of a tensor indicates the amount of memory consumed by the device in performing the operation. The system may need to calculate all the dimensions of the tensor to determine that size. The system can then assign a specific node with a tensor that consumes a specific amount of memory to a device having that amount of memory.
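As a rough sketch of this estimate, the size of a tensor can be computed as the product of its dimensions multiplied by the width of one element. The element widths, the table `DTYPE_BYTES`, and the function names below are assumptions made for illustration; the specification does not state how element width is taken into account.

```python
from math import prod

# Bytes per element for a few common element types (illustrative only).
DTYPE_BYTES = {"float16": 2, "float32": 4, "float64": 8, "int32": 4}


def tensor_bytes(shape, dtype="float32"):
    """Size of one tensor: product of all dimensions times element width."""
    return prod(shape) * DTYPE_BYTES[dtype]


def node_memory_estimate(in_edges, out_edges):
    """Sum the sizes of the tensors on all edges into and out of a node.

    Each edge is given as a (shape, dtype) pair.
    """
    return sum(tensor_bytes(shape, dtype) for shape, dtype in in_edges + out_edges)


# A node that reads a 32x128 float32 tensor and writes a 32x10 float32 tensor.
estimate = node_memory_estimate(
    in_edges=[((32, 128), "float32")],
    out_edges=[((32, 10), "float32")],
)
# estimate == 16384 + 1280 == 17664 bytes
```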
[0071] Specifically, when a device performs an operation, the software driver or executor allocates memory to store any inputs and any outputs calculated as a result of the operation. Because the amount of memory on the device is limited, the device releases the memory when it is no longer needed.
[0072] As an illustration, the system determines whether the operation represented by a node has ended at a specific point in a stream (step 402). For example, the system may periodically poll the stream to determine whether an operation in that stream has ended. The stream may support actions that allow the host to determine how far execution has progressed through the list of operations in the stream. In some implementations, events or flags may signal how far execution has progressed. When an event occurs, it may be added to a special hardware operation queue in the stream. The host may poll this queue to determine which operations have occurred. Other stream implementations may only allow the host to determine when all queued operations have completed. Alternatively or additionally, the hardware may provide an interrupt or callback when the stream reaches a certain point.
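The polling behaviour of step 402 can be mimicked with a small host-side simulation. The class `SimulatedStream` and its methods are hypothetical stand-ins and do not correspond to any real driver interface; `run_next` plays the role of the device finishing the operation at the head of the queue.

```python
from collections import deque


class SimulatedStream:
    """Hypothetical stand-in for a hardware stream with a completion queue."""

    def __init__(self, operations):
        self._pending = deque(operations)  # operations still to execute, in order
        self._events = deque()             # completions the host has not yet seen

    def run_next(self):
        """Simulate the device finishing the operation at the head of the queue."""
        if self._pending:
            self._events.append(self._pending.popleft())

    def poll(self):
        """Return the operations completed since the last poll, in order."""
        done = list(self._events)
        self._events.clear()
        return done


stream_304 = SimulatedStream([310, 312])
stream_304.run_next()
finished = stream_304.poll()  # [310]: node 310 has ended, node 312 has not
```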
[0073] Once an operation is complete, the system can determine that the memory used for the operation's inputs can be released for use in other operations. The system does not release the memory used for the operation's outputs because the outputs can be used in subsequent nodes.
[0074] Therefore, the system determines the amount of consumed memory to be released (step 404). The system may send a request to the software driver or executor to identify the size of the memory to be released.
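A minimal sketch of step 404 follows, assuming the sizes of an operation's input and output tensors are already known. The `Op` record and the `still_needed` parameter are illustrative assumptions; `still_needed` models input tensors that other pending operations still read, a case the text above does not address explicitly.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    input_bytes: dict   # input tensor name -> size in bytes
    output_bytes: dict  # output tensor name -> size in bytes


def bytes_to_release(finished, still_needed=frozenset()):
    """Memory that can be released once `finished` has ended.

    Only input tensors are counted. Outputs are kept because
    subsequent nodes may still use them.
    """
    return sum(
        size
        for tensor, size in finished.input_bytes.items()
        if tensor not in still_needed
    )


matmul = Op("matmul", input_bytes={"x": 16384, "w": 5120}, output_bytes={"y": 1280})
freed = bytes_to_release(matmul)                           # 21504 bytes
freed_keeping_w = bytes_to_release(matmul, frozenset({"w"}))  # 16384 bytes
```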
[0075] In some implementations, the example system allows the use of a Remote Direct Memory Access (RDMA) network interface, which a remote machine can use to directly transfer data to the hardware accelerator's memory at any point in time. This memory must not be used by any other operation running on any stream. The example system may not need to know precisely how far an operation on each stream has progressed. However, the system should track memory that is known not to be used by any stream. This free memory can then be used for RDMA.
[0076] The system determines the estimated amount of memory consumed by each unassigned node in the unassigned node group (step 406). Unassigned nodes may include nodes that receive input from nodes whose operations have been completed. Unassigned nodes may also include nodes that are independent of the nodes whose operations have been completed but still require processing by the accelerator. As described above, the estimated amount of memory can be determined by evaluating the sizes of the tensors corresponding to the unassigned node.
[0077] The system determines, from the group of unassigned nodes, a first unassigned node representing an operation that, when executed by the accelerator on the stream, maximizes the use of the amount of memory to be released (step 408). If the operation represented by an unassigned node requires more memory for execution than the amount of memory to be released, the unassigned node will not be assigned to the stream. If a first operation and a second operation each require an estimated amount of memory less than or equal to the amount of memory to be released, the system selects the operation that maximizes the use of the amount of memory to be released. In other words, in this case, the system determines the node representing the selected operation as the first unassigned node. The example system does not enqueue operations on the stream until it can determine which areas of the accelerator memory will be used to hold temporary workspaces and outputs for the operations. In cases of memory scarcity, the example system may choose to enqueue operations that require less memory, or to prioritize enqueueing operations that consume large input tensors, thereby allowing those tensors to be deallocated.
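The selection in step 408 resembles a best-fit choice and can be sketched as follows. The function name `pick_next_operation` and the dictionary of estimates are assumptions made for illustration, and ties are broken by insertion order, which the specification does not prescribe.

```python
def pick_next_operation(freed_bytes, estimates):
    """Choose the unassigned operation that best uses the released memory.

    freed_bytes: amount of memory to be released (step 404).
    estimates:   dict mapping each unassigned operation to its estimated
                 memory requirement in bytes (step 406).
    Returns None when no operation fits in the released memory.
    """
    # Operations that need more than the released memory are not eligible.
    fitting = {op: need for op, need in estimates.items() if need <= freed_bytes}
    if not fitting:
        return None
    # Among those that fit, take the one that uses the most memory.
    return max(fitting, key=fitting.get)


chosen = pick_next_operation(21504, {"a": 30000, "b": 8192, "c": 20000})
# chosen == "c": "a" does not fit, and "c" uses more of the memory than "b"
```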
[0078] The system assigns the operation represented by the first unassigned node to a specific stream (step 410). The system can then cause the specific stream to perform the operation, and the system can continue operating as described above with reference to Figures 2-3.
[0079] Embodiments of the subject matter and the functional operations described in this specification may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in a combination of one or more of these. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs (i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus). Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these. However, the computer storage medium is not a propagated signal.
[0080] The term "data processing apparatus" encompasses all kinds of devices, apparatuses, and machines for processing data, including, for example, programmable processors, computers, or multiple processors or computers. The apparatus may include special-purpose logic circuitry, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). In addition to hardware, the apparatus may include code that creates an execution environment for the computer program, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, or combinations thereof.
[0081] A computer program (which may also be referred to or described as a program, software, software application, module, software module, script, or code) may be written in any form of programming language, including compiled or interpreted languages or declarative or procedural languages, and it may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but does not necessarily, correspond to a file in a file system. It may be stored as a part of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to said program, or in multiple coordinating files (e.g., a file storing portions of one or more modules, subroutines, or code). A computer program may be deployed to execute on a single computer or on multiple computers located at a single site or distributed across multiple sites and interconnected by a communication network.
[0082] As used herein, "engine" or "software engine" refers to a software-implemented input / output system that provides outputs distinct from its inputs. An engine can be a functional block of code, such as a library, platform, software development kit ("SDK"), or object. Each engine can be implemented on any suitable type of computing device, including one or more processors and computer-readable media, such as a server, mobile phone, tablet computer, laptop computer, music player, e-book reader, laptop or desktop computer, PDA, smartphone, or other fixed or portable device. Additionally, two or more of these engines can be implemented on the same computing device or on different computing devices.
[0083] The processes and logic flows described in this specification can be executed by one or more programmable computers executing one or more computer programs to perform functions by manipulating input data and generating output. The processes and logic flows can also be executed by dedicated logic circuitry, and the apparatus can also be implemented as dedicated logic circuitry, such as FPGA (Field-Programmable Gate Array) or ASIC (Application-Specific Integrated Circuit).
[0084] As an example, a computer suitable for executing computer programs may be based on a general-purpose microprocessor or a special-purpose microprocessor or both, or any other type of central processing unit. Generally, the central processing unit receives instructions and data from read-only memory or random access memory or both. The essential components of a computer are the central processing unit for executing or carrying out instructions and one or more storage devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices (e.g., magnetic disks, magneto-optical disks, or optical disks) for storing data. However, a computer does not necessarily have to have such devices. Furthermore, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), etc.
[0085] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. Processors and memory may be supplemented by dedicated logic circuitry or incorporated into dedicated logic circuitry.
[0086] To provide interaction with a user, embodiments of the subject matter described herein can be implemented on a computer having a display device for displaying information to the user, such as a CRT (cathode ray tube) monitor, an LCD (liquid crystal display) monitor, or an OLED display, and an input device for providing input to the computer, such as a keyboard, a mouse, or a presence-sensitive display or other surface. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including sound, speech, or tactile input. Furthermore, the computer can interact with the user by sending resources to and receiving resources from a device used by the user; for example, by sending a web page to a web browser on the user's client device in response to a request received from a web browser.
[0087] Embodiments of the subject matter described herein may be implemented in computing systems including back-end components (e.g., as a data server), or middleware components (e.g., an application server), or front-end components (e.g., a client computer having a graphical user interface or web browser that a user can use to interact with embodiments of the subject matter described herein), or any combination of one or more such back-end, middleware, or front-end components. The components of the system may be interconnected via digital data communication (e.g., a communication network) of any form or medium. Examples of communication networks include local area networks (“LANs”) and wide area networks (“WANs”), such as the Internet.
[0088] A computing system may include clients and servers. Clients and servers are generally remote from each other and typically interact via a communication network. The relationship between client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.
[0089] While this specification contains numerous details of specific implementations, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of a particular invention. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, individually or in any suitable sub-combination. Furthermore, although features may be described above as acting in certain combinations, and even initially claimed as such, one or more features from a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
[0090] Similarly, although operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring that such operations be performed in the specific order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired result. In some cases, multitasking and parallel processing can be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0091] Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve the desired result. As an example, the processes depicted in the drawings do not necessarily require a specific order or sequence to achieve the desired result. In some embodiments, multitasking and parallel processing can be advantageous.
Claims
1. One or more non-transitory computer-readable storage media comprising instructions that, when executed by a system comprising a plurality of hardware streams, cause the system to perform program operations, the program operations comprising: for each of a plurality of operations represented in a computation graph, assigning the operation to a corresponding stream among the plurality of hardware streams of the system, each stream being configured to queue the operations assigned to the stream and to execute the queued operations in a defined order on the corresponding hardware resources of the stream; configuring a first stream of the plurality of hardware streams to stall the execution of a first operation assigned to the first stream until all inputs to the first operation have been computed, wherein a first input to the first operation includes the output of a second operation assigned to a different, second stream of the plurality of hardware streams; and performing the operations assigned to each of the plurality of hardware streams in the defined order, including performing at least one operation on the first stream in parallel with at least one operation on the second stream.
2. The one or more non-transitory computer-readable storage media according to claim 1, wherein the program operations further comprise: receiving, from a client, a request identifying one or more specific outputs from one or more operations represented in the computation graph; and providing the one or more specific outputs to the client.
3. The one or more non-transitory computer-readable storage media according to claim 1, wherein the program operations further comprise: receiving data identifying a group of operations represented in the computation graph, the operations in the group being connected to each other by following a directed edge from operation to operation represented in the computation graph; and assigning the group of operations to one stream.
4. The one or more non-transitory computer-readable storage media according to claim 1, wherein the assigning comprises: receiving data identifying a first operation represented in the computation graph that has a plurality of directed edges as outputs; and for each of the plurality of directed edges, assigning the target operation pointed to by the directed edge to a unique hardware stream of the system, each target operation being assigned to a different unique hardware stream.
5. The one or more non-transitory computer-readable storage media according to claim 1, wherein the program operations further comprise: for each of a plurality of nodes that each represent a corresponding operation among the plurality of operations in the computation graph, determining a corresponding amount of memory resources consumed by the operation represented by the node based on information about the directed edges to the node, wherein assigning the operation represented by each node in the computation graph to the corresponding hardware stream is based at least on the corresponding amount of memory resources consumed by the operation represented by the node.
6. The one or more non-transitory computer-readable storage media according to claim 1, wherein the program operations further comprise: determining that a specific operation represented in the computation graph has ended at a specific hardware stream; in response to determining that the specific operation has ended, determining a first amount of memory, consumed by the specific operation, that is to be released; for each unassigned operation in a group of unassigned operations, determining a corresponding estimated amount of memory that will be consumed by the unassigned operation; determining, from the group of unassigned operations and using the corresponding estimated amounts of memory, a first unassigned operation having an estimated amount of memory that maximizes the use of the first amount of memory; and based on the determination that the first unassigned operation maximizes the use of the first amount of memory, assigning the first unassigned operation to the specific hardware stream.
7. The one or more non-transitory computer-readable storage media according to claim 2, wherein the program operations further comprise storing the one or more specific outputs in the memory of a hardware accelerator.
8. The one or more non-transitory computer-readable storage media according to claim 1, wherein the program operations further comprise: determining that a specific operation, represented in the computation graph and assigned to a specific hardware stream, has finished executing; and in response to determining that the specific operation has finished executing: identifying at least one subsequent operation that uses the output of the specific operation as input, and reusing the memory allocated for the output of the specific operation after the at least one subsequent operation has been performed.
9. The one or more non-transitory computer-readable storage media according to claim 1, wherein assigning each operation in the computation graph to a corresponding stream among the plurality of hardware streams comprises: assigning the operations so as to minimize the number of directed edges that cross streams, wherein a directed edge that crosses streams is an instance where the input to an operation in one stream is received from the output of an operation in another stream.
10. The one or more non-transitory computer-readable storage media of claim 1, wherein performing the operations represented by the nodes in the computation graph comprises: identifying, at the point immediately preceding the execution of the first operation in the first stream, that the output of the second operation assigned to the second stream has not yet been computed; and stalling execution of the first operation in the first stream until the output of the second operation from the second stream can be used as input to the first operation in the first stream.
11. The one or more non-transitory computer-readable storage media according to claim 10, wherein stalling execution of the first operation in the first stream further stalls execution of any additional operations downstream of the first operation in the first stream.
12. The one or more non-transitory computer-readable storage media of claim 1, wherein the computation graph is a subgraph corresponding to a portion of a larger computation graph.
13. A method performed by a system comprising a plurality of hardware streams, the method comprising: for each of a plurality of operations represented in a computation graph, assigning the operation to a corresponding stream among the plurality of hardware streams of the system, each stream being configured to queue the operations assigned to the stream and to execute the queued operations in a defined order on the corresponding hardware resources of the stream; configuring a first stream of the plurality of hardware streams to stall the execution of a first operation assigned to the first stream until all inputs to the first operation have been computed, wherein a first input to the first operation includes the output of a second operation assigned to a different, second stream of the plurality of hardware streams; and performing the operations assigned to each of the plurality of hardware streams in the defined order, including performing at least one operation on the first stream in parallel with at least one operation on the second stream.
14. The method of claim 13, further comprising: receiving, from a client, a request identifying one or more specific outputs from one or more operations represented in the computation graph; and providing the one or more specific outputs to the client.
15. The method of claim 13, further comprising: receiving data identifying a group of operations represented in the computation graph, the operations in the group being connected to each other by following a directed edge from operation to operation represented in the computation graph; and assigning the group of operations to one stream.
16. The method according to claim 13, wherein the assigning comprises: receiving data identifying a first operation represented in the computation graph that has a plurality of directed edges as outputs; and for each of the plurality of directed edges, assigning the target operation pointed to by the directed edge to a unique hardware stream of the system, each target operation being assigned to a different unique hardware stream.
17. The method of claim 13, further comprising: for each of a plurality of nodes that each represent a corresponding operation among the plurality of operations in the computation graph, determining a corresponding amount of memory resources consumed by the operation represented by the node based on information about the directed edges to the node, wherein assigning the operation represented by each node in the computation graph to the corresponding hardware stream is based at least on the corresponding amount of memory resources consumed by the operation represented by the node.
18. The method of claim 13, further comprising: determining that a specific operation represented in the computation graph has ended at a specific hardware stream; in response to determining that the specific operation has ended, determining a first amount of memory, consumed by the specific operation, that is to be released; for each unassigned operation in a group of unassigned operations, determining a corresponding estimated amount of memory that will be consumed by the unassigned operation; determining, from the group of unassigned operations and using the corresponding estimated amounts of memory, a first unassigned operation having an estimated amount of memory that maximizes the use of the first amount of memory; and based on the determination that the first unassigned operation maximizes the use of the first amount of memory, assigning the first unassigned operation to the specific hardware stream.
19. The method of claim 14, further comprising: storing the one or more specific outputs in the memory of a hardware accelerator.
20. A system comprising: a plurality of hardware streams; and one or more non-transitory computer-readable storage media encoded with instructions that, when executed, cause the system to perform program operations, the program operations comprising: for each of a plurality of operations represented in a computation graph, assigning the operation to a corresponding stream among the plurality of hardware streams of the system, each stream being configured to queue the operations assigned to the stream and to execute the queued operations in a defined order on the corresponding hardware resources of the stream; configuring a first stream of the plurality of hardware streams to stall the execution of a first operation assigned to the first stream until all inputs to the first operation have been computed, wherein a first input to the first operation includes the output of a second operation assigned to a different, second stream of the plurality of hardware streams; and performing the operations assigned to each of the plurality of hardware streams in the defined order, including performing at least one operation on the first stream in parallel with at least one operation on the second stream.