Neural architecture and hardware accelerator search
A controller strategy generates candidate architectures for both a neural network and a hardware accelerator, and reinforcement learning techniques adjust that strategy. This addresses the low efficiency with which existing technologies determine architectures and enables an efficient joint search of neural networks and hardware accelerators that satisfies computational and hardware constraints.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2021-10-01
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to efficiently combine the optimal architecture of neural networks and the optimal hardware architecture of hardware accelerators, leading to wasted computing resources and increased complexity in the search process.
The system employs a controller strategy to generate output sequences, trains instances of sub-neural networks and hardware accelerators, adjusts the controller strategy using reinforcement learning techniques to optimize network and accelerator performance metrics, and jointly searches for the architecture of neural networks and hardware accelerators.
It effectively identifies neural network architectures and hardware accelerator architectures that can efficiently execute machine learning tasks within the target latency, reducing computational resource consumption, meeting hardware constraints, and improving search efficiency.
Figure CN116324807B_ABST
Description
[0001] Cross-reference to related applications
[0002] This application claims priority to U.S. Provisional Application No. 63/087,143, filed October 2, 2020. The disclosure of the earlier application is considered part of the disclosure of this application and is incorporated herein by reference.
Background Technology
[0003] This specification relates to determining neural network architecture and hardware accelerator design.
[0004] A neural network is a machine learning model that uses one or more non-linear units to predict the output from a received input. In addition to the output layer, some neural networks also include one or more hidden layers. The output of each hidden layer is used as the input to the next layer in the network (i.e., the next hidden layer or output layer). Each layer of the network generates an output based on the received input, using the current values of its corresponding set of parameters.
[0005] Hardware accelerators are computing devices with dedicated hardware configured to perform specialized computations, such as graphics processing units (“GPUs”), field-programmable gate arrays (“FPGAs”), and application-specific integrated circuits (“ASICs”), including tensor processing units (“TPUs”).
Summary of the Invention
[0006] This specification describes a system implemented as a computer program on one or more computers in one or more locations, the system being able to jointly (e.g., simultaneously) determine (i) an optimal network architecture for a neural network configured to perform a specific machine learning task and (ii) an optimal hardware architecture for a hardware accelerator that serves as (part of) the target computing device on which the neural network is to be implemented.
[0007] Depending on the task, a neural network can be configured (i.e., trained) to receive any kind of numerical data input and generate any kind of score, classification, or regression output based on the input.
[0008] Once trained, the neural network can be implemented on a target computing device, which in turn includes one or more hardware accelerators. A hardware accelerator is a computing device that includes dedicated hardware for performing certain types of operations (such as matrix multiplication), and is more efficient than non-dedicated or "general-purpose" computing devices. Different hardware accelerators can have different hardware characteristics, such as in the number of computing units, parallelism, computation-to-memory ratio, bandwidth, etc.
[0009] As an example, the target computing device, including one or more hardware accelerators, can be a single, specific edge device, such as a mobile phone, a smart speaker, or another embedded computing device or other edge device. As a particular example, the edge device can be a mobile phone or other device with a specific type of hardware accelerator or other computer chip on which the neural network will be deployed.
[0010] As another example, a target computing device that includes one or more hardware accelerators can be a collection of multiple hardware accelerator devices, such as ASICs, FPGAs, or tensor processing units (TPUs) on real-world intelligent agents (e.g., vehicles, such as autonomous cars) or robots.
[0011] As another example, a target computing device that includes one or more hardware accelerator devices can be a collection of hardware accelerators in a data center.
[0012] In general, one innovative aspect of the subject matter described in this specification can be implemented in a method comprising: generating a batch of one or more output sequences using a controller strategy, each output sequence in the batch defining (i) a corresponding architecture of a sub-neural network configured to perform a specific neural network task and (ii) a corresponding architecture of a hardware accelerator on which a trained instance of the sub-neural network is to be implemented; for each output sequence in the batch: training a corresponding instance of the sub-neural network having the architecture defined by the output sequence to perform the specific neural network task; evaluating the network performance of the trained instance of the sub-neural network on the specific neural network task to determine a network performance metric for the trained instance; and evaluating the accelerator performance of a corresponding instance of the hardware accelerator having the architecture defined by the output sequence to determine an accelerator performance metric for the instance of the hardware accelerator in supporting the trained instance of the sub-neural network as it performs the specific neural network task; and adjusting the controller strategy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator.
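The generate, train, evaluate, and adjust cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the search space `SPACE`, the stand-in metrics `network_metric` and `accelerator_metric`, and the plain REINFORCE update with a batch-mean baseline are all hypothetical choices made for the example. A real system would train each candidate network and query a hardware simulator instead.

```python
import math
import random

# Hypothetical search space: every decision has a short list of candidate values.
SPACE = {
    "kernel_size": [3, 5, 7],     # network hyperparameter
    "expansion": [1, 3, 6],       # network hyperparameter
    "num_pes": [4, 8, 16],        # hardware parameter
    "simd_units": [16, 32, 64],   # hardware parameter
}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits, rng):
    """Sample one output sequence: a candidate index for every decision."""
    return {
        name: rng.choices(range(len(values)), weights=softmax(logits[name]))[0]
        for name, values in SPACE.items()
    }


def network_metric(seq):
    """Toy stand-in for training the candidate network and measuring accuracy."""
    return 0.5 + 0.1 * seq["kernel_size"] + 0.1 * seq["expansion"]


def accelerator_metric(seq):
    """Toy stand-in for a simulator score (higher is better)."""
    return 0.25 * seq["num_pes"] + 0.25 * seq["simd_units"]


def reinforce_update(logits, batch, rewards, lr=0.5):
    """Adjust the controller logits with REINFORCE and a batch-mean baseline."""
    baseline = sum(rewards) / len(rewards)
    # Freeze the probabilities so every sequence uses the same policy snapshot.
    probs = {name: softmax(values) for name, values in logits.items()}
    for seq, reward in zip(batch, rewards):
        advantage = reward - baseline
        for name, chosen in seq.items():
            for i, p in enumerate(probs[name]):
                grad = (1.0 if i == chosen else 0.0) - p  # d log pi / d logit_i
                logits[name][i] += lr * advantage * grad / len(batch)


def search(steps=200, batch_size=8, seed=0):
    rng = random.Random(seed)
    logits = {name: [0.0] * len(values) for name, values in SPACE.items()}
    for _ in range(steps):
        batch = [sample_sequence(logits, rng) for _ in range(batch_size)]
        rewards = [network_metric(s) + accelerator_metric(s) for s in batch]
        reinforce_update(logits, batch, rewards)
    return logits
```

Calling `search()` returns the controller logits after the final update; taking the highest-scoring candidate of each decision yields one network and accelerator pair.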
[0013] The controller strategy can be implemented using a controller neural network with multiple controller network parameters; and adjusting the controller strategy may include adjusting the current values of multiple controller network parameters.
[0014] Adjusting the controller strategy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator may include: using reinforcement learning techniques to train the controller neural network to generate output sequences that result in increased network performance metrics for the sub-neural network and increased accelerator performance metrics for the hardware accelerator.
[0015] Reinforcement learning techniques can include proximal policy optimization (PPO) techniques.
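For reference, the standard PPO clipped surrogate objective can be written as a short function. The sketch below shows only that objective for a batch of sampled sequences; the function name, argument names, and the clipping value 0.2 are assumptions chosen for illustration, and the specification does not state which PPO variant or hyperparameters are used.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    new_log_probs: log-probability of each sampled sequence under the current policy
    old_log_probs: log-probability under the policy that generated the samples
    advantages:    reward minus a baseline, one value per sampled sequence
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes the incentive to move the policy
        # further than the clip range in a single update.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the current and old policies agree, the ratio is 1 and the objective reduces to the mean advantage.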
[0016] Each output sequence may include the value of a corresponding hyperparameter of the sub-neural network at each of a first plurality of time steps.
[0017] Each output sequence may include the value of a corresponding hardware parameter of the hardware accelerator at each of a second plurality of time steps.
[0018] The controller neural network can be a recurrent neural network, comprising: one or more recurrent neural network layers configured to, for a given output sequence at each time step: receive the values of hyperparameters or hardware parameters from the previous time step in the given output sequence as input, and process the input to update the current hidden state of the recurrent neural network; and a corresponding output layer for each time step, wherein each output layer is configured to, for the given output sequence: receive an output layer input including the updated hidden state at that time step, and generate an output for that time step, the output defining a score distribution over possible values of the hyperparameters or hardware parameters at that time step.
[0019] Generating one or more output sequences in a batch using a controller strategy may include, for each output sequence in the batch, for each of the plurality of time steps: providing the controller neural network with the value of the hyperparameter or hardware parameter at the previous time step in the output sequence as input to generate an output for that time step, the output defining a score distribution over possible values of the hyperparameter or hardware parameter at that time step; and sampling from the possible values according to the score distribution to determine the value of the hyperparameter or hardware parameter at that time step in the output sequence.
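The step-by-step sampling described above can be illustrated with a toy controller. In this sketch, `ToyController` is a hypothetical stand-in for the recurrent controller network: it keeps a single scalar hidden state and one small output layer per time step, which is far simpler than a real recurrent network but follows the same flow of feeding in the previous value, updating the hidden state, producing a score distribution, and sampling.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyController:
    """Hypothetical stand-in for the recurrent controller neural network."""

    def __init__(self, choices_per_step, seed=0):
        rng = random.Random(seed)
        self.choices_per_step = choices_per_step
        # One output layer (a weight per candidate value) for each time step.
        self.out_w = [[rng.uniform(-1, 1) for _ in range(n)] for n in choices_per_step]
        self.hidden = 0.0

    def reset(self):
        self.hidden = 0.0

    def step(self, t, prev_value):
        # Recurrent update: mix the old hidden state with the previous choice.
        self.hidden = math.tanh(0.5 * self.hidden + 0.1 * (prev_value + 1))
        # Output layer for step t: score distribution over this step's candidates.
        return softmax([w * self.hidden for w in self.out_w[t]])


def sample_output_sequence(controller, rng):
    """Sample one value per time step, feeding each choice back as the next input."""
    controller.reset()
    prev_value = -1  # start token: there is no previous decision yet
    sequence, log_prob = [], 0.0
    for t in range(len(controller.choices_per_step)):
        probs = controller.step(t, prev_value)
        prev_value = rng.choices(range(len(probs)), weights=probs)[0]
        sequence.append(prev_value)
        log_prob += math.log(probs[prev_value])
    return sequence, log_prob
```

The returned log-probability is what a policy-gradient method would use when adjusting the controller parameters.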
[0020] The specific neural network task can be an object classification and / or detection task, an object pose estimation task, or a semantic segmentation task; the sub-neural network can be a convolutional neural network including one or more depthwise separable convolutional layers; and the hyperparameters can include the hyperparameters of each depthwise separable convolutional layer in the sub-neural network.
[0021] The sub-neural network may include one or more inverted residual layers and one or more linear bottleneck layers; and the hyperparameters may include the hyperparameters of each inverted residual layer and linear bottleneck layer in the sub-neural network.
[0022] The corresponding hardware characteristics of the hardware accelerator may include one or more of the following: the bandwidth of the hardware accelerator, the number of processing elements included in the hardware accelerator, the layout of the processing elements on the hardware accelerator, the number of single instruction multiple data (SIMD) multiply-accumulate operations in each processing element, the number of computation channels in each processing element, the size of the shared memory in each processing element, or the size of the register file in each processing element.
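One way to picture such a hardware search space is as a table of candidate values per parameter, with each sampled decision index selecting one value. The sketch below is illustrative only: the parameter names, the candidate values, and the `AcceleratorConfig` and `decode` helpers are assumptions made for the example and are not taken from the specification.

```python
import math
from dataclasses import dataclass

# Illustrative candidate values only; the specification does not list them.
HARDWARE_SPACE = {
    "pes_x": [1, 2, 4, 6, 8],               # processing-element grid width
    "pes_y": [1, 2, 4, 6, 8],               # processing-element grid height
    "simd_units": [16, 32, 64, 128],        # SIMD multiply-accumulate units per PE
    "compute_lanes": [1, 2, 4, 8],          # compute lanes per PE
    "local_memory_mb": [0.5, 1, 2, 3, 4],   # shared memory per PE
    "register_file_kb": [8, 16, 32, 64, 128],
    "io_bandwidth_gbps": [5, 10, 15, 20, 25],
}


@dataclass(frozen=True)
class AcceleratorConfig:
    pes_x: int
    pes_y: int
    simd_units: int
    compute_lanes: int
    local_memory_mb: float
    register_file_kb: int
    io_bandwidth_gbps: int

    @property
    def num_pes(self):
        return self.pes_x * self.pes_y


def decode(indices):
    """Turn one decision index per parameter into a concrete configuration."""
    names = list(HARDWARE_SPACE)
    if len(indices) != len(names):
        raise ValueError(f"expected {len(names)} indices, got {len(indices)}")
    return AcceleratorConfig(**{n: HARDWARE_SPACE[n][i] for n, i in zip(names, indices)})


def space_size():
    """Number of distinct accelerator configurations in the space."""
    return math.prod(len(values) for values in HARDWARE_SPACE.values())
```

Even this small table yields tens of thousands of configurations, which suggests why a learned controller is preferred over exhaustive enumeration once the network choices are multiplied in.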
[0023] The accelerator performance metrics, which measure how well the instance of the hardware accelerator supports the trained instance of the sub-neural network, may include one or more of the following: the estimated area of the hardware accelerator, the estimated power consumption of the hardware accelerator, or the estimated latency of the neural network for performing the specific neural network task when deployed on the hardware accelerator.
[0024] Evaluating the accelerator performance of a corresponding instance of the hardware accelerator having the architecture defined by the output sequence may include: using a cycle-accurate performance simulator to determine the estimated latency of the neural network for performing the specific neural network task when deployed on the hardware accelerator, based on (i) the corresponding architecture of the sub-neural network and (ii) the corresponding architecture of the hardware accelerator defined by the batch of output sequences.
[0025] Evaluating the accelerator performance of a corresponding instance of the hardware accelerator having the architecture defined by the output sequence may include: using an analytical area estimator to determine the estimated area of the hardware accelerator based on the corresponding architecture of the hardware accelerator defined by the batch of output sequences.
[0026] Adjusting the current values of the controller network parameters of the controller neural network using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator may include: assigning different weights to one or more of the accelerator performance metrics; and adjusting the current values of the controller network parameters of the controller neural network according to the different weights.
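A simple way to apply different weights to the accelerator metrics is a weighted sum folded into the reward. The sketch below assumes that each metric has already been normalized so that lower is better, and that the weighted cost is subtracted from the network metric. Both choices, along with the function names, are assumptions for illustration; the specification does not fix the form of the combination.

```python
def accelerator_score(metrics, weights):
    """Weighted sum of normalized accelerator costs, negated so higher is better.

    metrics: e.g. {"latency": 0.5, "area": 0.25, "power": 1.0}, lower is better
    weights: relative importance per metric; metrics without a weight are ignored
    """
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"no measurement for weighted metrics: {sorted(missing)}")
    return -sum(weights[name] * metrics[name] for name in weights)


def combined_reward(network_metric, metrics, weights):
    """Reward used to adjust the controller: accuracy minus weighted hardware cost."""
    return network_metric + accelerator_score(metrics, weights)
```

Raising the weight on one metric, for example area, steers the search toward smaller accelerators at some cost in the other metrics.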
[0027] Adjusting the controller strategy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator may further include: fixing the network performance metrics of the trained instances of the sub-neural network for the specific neural network task and adjusting the current values of the controller network parameters of the controller neural network using only the determined accelerator performance metrics of the instances of the hardware accelerator.
[0028] The method may further include: generating a final output sequence that defines the final architecture of the sub-neural network based on the adjusted values of the controller network parameters.
[0029] The method may further include: processing the received network input using a sub-neural network having the final architecture, and performing the specific neural network task on the received network input.
[0030] Another innovative aspect of the subject matter described in this specification can be implemented in a method comprising: receiving data specifying one or more target hardware constraints of a hardware accelerator on which a neural network for performing a specific machine learning task is to be deployed; receiving training data and validation data for the specific machine learning task; and selecting, using the training data and the validation data, (a) a network architecture for the neural network from a candidate network architecture space and (b) a hardware architecture for the hardware accelerator from a candidate hardware architecture space, wherein each candidate network architecture in the space is defined by a corresponding set of decision values that includes a corresponding decision value for each of a first plurality of classification decisions, and each candidate hardware architecture in the space is defined by a corresponding set of decision values that includes a corresponding decision value for each of a second plurality of classification decisions. The selection includes jointly updating (i) a set of controller policy parameters that define, for each of the first plurality of classification decisions and the second plurality of classification decisions, a corresponding probability distribution over the decision values of that classification decision, and (ii) a shared set of model parameters, wherein: updating the set of controller policy parameters includes updating them by reinforcement learning to maximize a reward function that measures (i) the estimated quality of a candidate hardware architecture and (ii) the estimated quality of a candidate network architecture, each defined by a set of decision values sampled from the probability distributions generated using the controller policy parameters; and updating the shared set of model parameters includes updating them to optimize an objective function that measures the performance, on the specific machine learning task, of the candidate network architecture defined by a set of decision values sampled from the probability distributions generated using the controller policy parameters. After the joint update, a candidate network architecture defined by a corresponding specific decision value of each of the first plurality of classification decisions is selected as the network architecture of the neural network, and a candidate hardware architecture defined by a corresponding specific decision value of each of the second plurality of classification decisions is selected as the hardware architecture of the hardware accelerator.
[0031] The method may further include: receiving data that specifies a target latency for the neural network to perform the specific machine learning task when deployed on the hardware accelerator.
[0032] The reward function may include a quality term that measures (i) the estimated quality of the candidate hardware architecture and (ii) the estimated quality of the candidate network architecture, and a latency term that measures the ratio between the estimated latency of the candidate architecture and the target latency.
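A reward of this shape, a quality term combined with a term based on the ratio of estimated to target latency, could look like the sketch below. The multiplicative form and the exponent `beta` are assumptions borrowed from common latency-aware architecture search practice; the specification states only that the term measures the ratio, not how the two terms are combined.

```python
def latency_aware_reward(quality, est_latency_ms, target_latency_ms, beta=-0.07):
    """Quality scaled by (estimated latency / target latency) ** beta.

    With a negative beta, candidates slower than the target are penalized and
    faster candidates receive a mild bonus. beta is a hypothetical tuning knob.
    """
    if est_latency_ms <= 0 or target_latency_ms <= 0:
        raise ValueError("latencies must be positive")
    ratio = est_latency_ms / target_latency_ms
    return quality * ratio ** beta
```

A candidate that exactly meets the target latency keeps its quality score unchanged, because the ratio is 1.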
[0033] The joint update includes repeatedly performing operations that may include: using the validation data to determine the estimated quality of a neural network with a candidate architecture for the specific machine learning task, the candidate architecture having a subset of a shared set of model parameters defined by selected decision values of the first plurality of classification decisions, wherein the quality is estimated based on the current value of the subset of the shared set of model parameters defined by the selected decision values of the first plurality of classification decisions.
[0034] The joint update may include repeated operations, the operations including: using the validation data and a latency simulator to determine the estimated latency of the neural network having the candidate network architecture when performing the specific machine learning task, the candidate network architecture having a subset of a shared set of model parameters defined by selected decision values of the first plurality of classification decisions, wherein the neural network is deployed on the hardware architecture having the subset of a shared set of model parameters defined by selected decision values of the second plurality of classification decisions.
[0035] The joint update may include repeating an operation that includes: using an area simulator to determine the estimated quality of the candidate hardware architecture, the candidate hardware architecture having a subset of a shared set of model parameters defined by the selected decision values of the second plurality of classification decisions.
[0036] Each of the latency simulator and the area simulator can be a corresponding neural network trained on labeled training data generated using an accelerator simulator.
[0037] Another innovative aspect of the subject matter described in this specification can be implemented in a machine learning task-specific hardware accelerator whose architecture is defined by performing a process comprising the respective operations of any one of the preceding methods.
[0038] Other embodiments of the described aspect include corresponding computer systems, apparatuses, and computer programs recorded on one or more computer storage devices, each computer system, apparatus, and computer program configured to perform actions of the method. A system of one or more computers may be configured to perform a specific operation or action by means of software, firmware, hardware, or any combination thereof installed on the system, which, in operation, causes the system to perform actions. One or more computer programs may be configured to perform a specific operation or action by means of instructions that, when executed by a data processing device, cause the device to perform actions.
[0039] The subject matter described in this specification can be implemented in particular embodiments to achieve one or more of the following advantages.
[0040] Hardware accelerators are specialized hardware configured to perform specific computations and are generally more computationally efficient than their general-purpose counterparts, but are also typically more expensive due to the cost of the hardware itself and the associated energy costs of powering and maintaining the accelerator. Performing machine learning tasks (such as vision tasks, natural language processing tasks, or other tasks requiring near real-time responses to be provided to the user) using neural networks deployed on hardware accelerators requires (i) an accurate and computationally efficient neural network architecture to generate inferences from inputs with a specific target latency, and (ii) a hardware accelerator architecture that has been tailored for the machine learning task.
[0041] The described technique can be used to search for neural network architectures capable of performing tasks, while simultaneously searching for hardware accelerator architectures that can supply sufficient computational resources (e.g., memory, computing power, or both) to support the network's performance of the task while satisfying hardware constraints (e.g., resource consumption constraints, area constraints, or both). It can thus identify (i) a single neural network architecture or a range of neural network architectures that can be efficiently deployed to compute inferences within the target latency and (ii) a single hardware architecture or a range of hardware architectures on which the identified neural networks can be deployed and which can efficiently support the network's performance of the task while satisfying hardware architecture constraints.
[0042] Moreover, because the described technology allows the system to jointly identify network architectures with the hardware architecture, the search process consumes far fewer computational resources than existing technologies that search for neural network or hardware accelerator architectures on an independent (or alternating) basis.
[0043] Details of one or more embodiments of the subject matter of this specification are set forth in the following drawings and description. Other features, aspects, and advantages of the subject matter will become apparent from the description, drawings, and claims.
Description of the Drawings
[0044] Figure 1 shows an example neural architecture and hardware architecture search system.
[0045] Figure 2 is a flowchart of an example process for updating a controller policy.
[0046] Figure 3 is a flowchart of an example process for selecting the architectures of a neural network and a hardware accelerator by jointly updating a set of controller policy parameters and a shared set of parameters.
[0047] Figure 4 is a diagram illustrating the neural architecture of a neural network and the hardware architecture of a hardware accelerator.
[0048] In the various figures, the same reference numerals and names indicate the same elements.
Detailed Description
[0049] This specification describes a system implemented as a computer program on one or more computers in one or more locations, the system being able to jointly (e.g., simultaneously) determine (i) an optimal network architecture for a neural network configured to perform a specific machine learning task and (ii) an optimal hardware architecture for a hardware accelerator, which serves as part of the target computing device on which the neural network is to be implemented, i.e., the architecture of the hardware accelerator on which the neural network will be deployed after the neural network has been trained.
[0050] In some cases, the neural network is configured to perform an image processing task (i.e., to receive an input image and process it to generate a network output). In this specification, processing an input image refers to using a neural network to process the intensity values of the image pixels. For example, the task could be image classification, and the output generated by the neural network for a given image could be a score for each object category in a set of object categories, each score representing an estimated probability that the image contains an image of an object belonging to that category. As another example, the task could be image embedding generation, and the output generated by the neural network could be a numeric embedding of the input image. As yet another example, the task could be object detection, and the output generated by the neural network could identify locations in the input image where specific types of objects are depicted. As yet another example, the task could be image segmentation, and the output generated by the neural network could assign each pixel of the input image to a category from a set of categories.
[0051] As another example, if the input to the neural network is an internet resource (e.g., a webpage), a document, or a portion of a document, or features extracted from an internet resource, document, or portion of a document, then the task could be to classify the resource or document. That is, the output generated by the neural network for a given internet resource, document, or portion of a document could be a score for each topic in a set of topics, with each score representing an estimated probability that the internet resource, document, or portion of a document is related to that topic.
[0052] As another example, if the input to the neural network is features of the impression context of a particular ad, the output generated by the neural network can be a score representing the estimated probability that the particular ad will be clicked.
[0053] As another example, if the input to the neural network is features for personalized recommendations to a user, such as features characterizing the recommendation context or features characterizing the user's previous actions, then the output generated by the neural network could be a score for each content item in the set of content items, with each score representing an estimated probability that the user will respond positively to the recommended content item.
[0054] As another example, if the input to the neural network is a sequence of text in one language, the output generated by the neural network can be a score for each text segment in a set of text segments in another language, where each score represents an estimated probability that the text segment in the other language is a correct translation of the input text.
[0055] As another example, the task could be an audio processing task. For instance, if the input to the neural network is a sequence representing spoken utterances, the output generated by the neural network could be a score for each text segment in a set of text segments, with each score representing an estimated probability that the text segment is a correct transcription of the utterance.
[0056] As another example, the task could be a keyword detection task, where if the input to the neural network is a sequence representing spoken utterances, the output generated by the neural network can indicate whether a particular word or phrase (“hot word”) is spoken in the utterance. As yet another example, if the input to the neural network is a sequence representing spoken utterances, the output generated by the neural network can identify the natural language spoken by that utterance.
[0057] As another example, the task can be a natural language processing or understanding task that operates on a sequence of natural language text, such as an entailment task, a paraphrasing task, a text similarity task, a sentiment task, a sentence completion task, a grammaticality task, etc.
[0058] As another example, the task could be a text-to-speech task, where the input is text in natural language or text features in natural language, and the network output is a spectrogram or other data used to define the audio of the spoken text in natural language.
[0059] As another example, the task could be a health prediction task, where the input is a patient's electronic health record data, and the output is a prediction related to the patient's future health, such as a predicted treatment that should be administered to the patient, the likelihood of the patient experiencing an adverse health event, or a predicted diagnosis. Physiological data (such as heart rate, blood pressure, blood glucose levels, blood chemistry, etc.) can be used as input, and the output is the probability of one or more health events occurring and / or the probability of one or more diagnoses. For example, if the input includes blood glucose measurements (e.g., a sequence of blood glucose readings), the output could include the probability of a hypoglycemic or hyperglycemic event. If the input includes blood pressure measurements and / or heart rate, the output could include the probability of a cardiac event and / or the presence of a heart condition.
[0060] As another example, the task could be an agent control task, where the input is an observation characterizing the state of the environment, and the output defines the action the agent must perform in response to the observation. For example, the agent could be a real-world or simulated robot, a control system for an industrial facility, or a control system that controls different types of agents.
[0061] As another example, the task could be a genomics task, where the input is a sequence representing a fragment of a DNA sequence or other molecular sequence, and the output is, for example, an embedding of a fragment for a downstream task using unsupervised learning techniques on a dataset of DNA sequence fragments, or the output of a downstream task. Examples of downstream tasks include promoter site prediction, methylation analysis, prediction of functional effects of non-coding variants, etc.
[0062] In some cases, a machine learning task is a combination of multiple individual machine learning tasks; that is, a neural network is configured to perform multiple different individual machine learning tasks, such as two or more of the machine learning tasks mentioned above. For example, a neural network can be configured to perform multiple individual natural language understanding tasks. Optionally, the network input may include identifiers of the individual natural language understanding tasks to be performed on the network input. As another example, a neural network can be configured to perform multiple individual image processing or computer vision tasks, that is, to generate the outputs of multiple different individual image processing tasks in parallel by processing a single input image.
[0063] Figure 1 shows an example neural architecture and hardware architecture search system 100. The neural architecture and hardware architecture search system 100 is an example of a system implemented as a computer program on one or more computers at one or more locations, in which the systems, components, and techniques described below can be implemented.
[0064] The neural architecture and hardware architecture search system 100 is a system that obtains training data 102 and validation data 104 for a specific machine learning task and selects the network architecture 150 of the neural network and the hardware architecture 160 of the hardware accelerator on which the neural network is to be deployed to perform the task using the training data 102 and validation data 104.
[0065] Typically, both training data 102 and validation data 104 include sets of neural network inputs (also known as training or validation examples), and for each network input, include a corresponding target output that should be generated by the neural network to perform a specific task. Training data 102 and validation data 104 can include different sets of neural network inputs, i.e., such that validation data 104 can be used to effectively measure how well a neural network already trained on training data 102 performs on new inputs.
[0066] System 100 can receive training data 102 and validation data 104 in any of a variety of ways. For example, system 100 can receive training data as an upload from a remote user of the system via a data communication network, for example, using an application programming interface (API) provided by system 100. System 100 can then randomly divide the received training data into training data 102 and validation data 104. As another example, system 100 can receive input from a user specifying which data that system 100 has maintained should be used to train the neural network.
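The random division of uploaded data into training data 102 and validation data 104 can be sketched as follows. This is a minimal illustration only: the function name `split_train_validation`, the 25% validation fraction, and the fixed seed are assumptions made for the example and are not specified in this description.

```python
import random


def split_train_validation(examples, validation_fraction=0.25, seed=0):
    """Randomly split examples into (training, validation) lists.

    The fraction and seed are illustrative assumptions, not values
    taken from the description.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation example so the split is never empty.
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Each example pairs a network input with its target output.
examples = [(i, i % 2) for i in range(8)]
train_102, val_104 = split_train_validation(examples)
```

Because the two lists are disjoint, the validation examples are new to a network trained only on the training list.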
[0067] System 100 may also receive, for example from a user, data specifying one or more search targets 106, which typically define the expected performance requirements or constraints of the neural network, the hardware accelerator, or both. Several example search targets are described below.
[0068] For example, the search objective could include the target accuracy for performing a machine learning task. Target accuracy can be evaluated, for instance, by calculating the loss of a trained neural network on a validation dataset or by some other measure of model accuracy calculated on the validation dataset.
[0069] As another example, the search objective can include the target latency for performing machine learning tasks after training and during inference—that is, for processing new inputs for a specific task after the architecture has been determined. Typically, target latency is the target latency of the neural network when deployed on the target computing device. When the neural network is deployed on the target computing device, target latency measures the time required, for example, in milliseconds, to perform inference on a batch of one or more examples—that is, to process each example in the batch using the neural network.
[0070] As another example, the search objective could include constraints on the configuration or design of the underlying hardware accelerator that supports neural network operations. Example hardware configuration or design constraints could include the area of the hardware accelerator, its power (or energy consumption), etc.
[0071] In some implementations, this search target can be symbolically represented as:
[0072]
[0073] $\mathrm{Latency}(\alpha, h) \le T_{\mathrm{latency}}, \quad \mathrm{Area}(h) \le T_{\mathrm{area}}$
[0074] Here, $w_\alpha$ denotes the weights of the architecture $\alpha$ and $h$ denotes the hardware parameters; the formula also refers to the objective function of the task and to the training and evaluation sets. $T_{\mathrm{latency}}$ is the target runtime latency of the trained neural network when performing the task, and $T_{\mathrm{area}}$ is the target hardware accelerator area; both can be specified in the search target data.
[0075] Therefore, using the techniques described below, system 100 can effectively determine (i) the architecture of a neural network configured to perform a machine learning task and (ii) the hardware architecture of the hardware accelerator on which the neural network will be deployed, while satisfying one or more search objectives.
[0076] As a concrete example, system 100 can determine a specific architecture for a neural network. When deployed on a specific hardware accelerator with an architecture determined by the system and an area not exceeding the maximum permissible hardware area, the neural network can be configured to perform a specific machine learning task with acceptable accuracy (e.g., accuracy approximately equal to the target accuracy) while having a runtime latency not exceeding the maximum permissible latency. In this example, the maximum permissible hardware area, target accuracy, and maximum permissible latency can all be specified in the search target data 106.
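The concrete example above amounts to checking a candidate against three targets. The sketch below illustrates such a check under stated assumptions: the names `SearchTargets` and `satisfies_targets`, the units (milliseconds, square millimetres), and the accuracy tolerance used to model "approximately equal" are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchTargets:
    """Hypothetical container for search target data 106."""
    target_accuracy: float
    max_latency_ms: float   # maximum permissible runtime latency
    max_area_mm2: float     # maximum permissible accelerator area


def satisfies_targets(accuracy, latency_ms, area_mm2, targets,
                      accuracy_tolerance=0.01):
    """Return True if a candidate pair meets all three targets.

    The tolerance is an assumed way of modelling accuracy that is
    approximately equal to the target accuracy.
    """
    return (accuracy >= targets.target_accuracy - accuracy_tolerance
            and latency_ms <= targets.max_latency_ms
            and area_mm2 <= targets.max_area_mm2)


targets = SearchTargets(target_accuracy=0.75, max_latency_ms=10.0,
                        max_area_mm2=50.0)
feasible = satisfies_targets(0.76, 8.0, 40.0, targets)
too_slow = satisfies_targets(0.76, 12.0, 40.0, targets)
```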
[0077] Then, system 100 uses training data 102, validation data 104, and search target data 106 to determine the neural network architecture and hardware accelerator architecture by searching a joint search space consisting of the space of candidate neural network architectures and the space of candidate hardware accelerator architectures.
[0078] Typically, the architecture of a neural network defines the number of layers in the network, the operations performed by each layer, and the connectivity between layers—that is, which layers receive input from which other layers in the network.
[0079] Specifically, the search space for candidate neural network architectures can be defined by the set of possible values of hyperparameters; that is, it can include a set of hyperparameters, each of which can have a predetermined set of possible values. The selected values of the hyperparameters can be set before the training of the neural network begins and can affect the operations performed by the neural network. Overall, the selected values of the hyperparameters can define the architecture of the neural network.
[0080] Examples of neural architecture search spaces and the sets of corresponding hyperparameters that define these search spaces are described below.
[0081] For example, the search space could be built specifically for mobile edge processors and based on the MobileNetV2 backbone, which includes stacks of inverted bottleneck layers. The neural architecture search space in this example could include efficient neural network components such as mobile inverted bottleneck convolution (MBConv) layers, each of which in turn includes one or more inverted residual layers, one or more linear bottleneck layers, and one or more convolutional layers (e.g., one or more depthwise separable convolutional layers). The searchable hyperparameters could then include corresponding hyperparameters associated with the depthwise separable convolutional layer, the inverted residual layer, or the linear bottleneck layer. Specifically, the searchable hyperparameters could include the kernel size and expansion ratio of each inverted bottleneck convolutional layer. For example, the kernel size value could be chosen from the set of possible integer values {3, 5, 7}, and the expansion ratio could be chosen from the set of possible integer values {1, 3, 6}. The MobileNetV2 search space is described in more detail in Sandler, M., et al., “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” arXiv preprint arXiv:1801.04381 (2019), the entire contents of which are incorporated herein by reference.
[0082] As another example, the search space can be constructed based on the standard EfficientNet-B0 backbone, which consists of stacks of inverted residual blocks. The EfficientNet search space can be constructed with a larger cardinality than the MobileNetV2 search space to better utilize modern edge accelerators, which typically have a larger number of computational units and memory capacity. Similarly, searchable hyperparameters in the EfficientNet-B0 search space can include the kernel size and scaling ratio of each residual block. The EfficientNet search space is described in more detail in Tan, M., et al., “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” arXiv preprint arXiv:1905.11946 (2019), the entire contents of which are incorporated herein by reference.
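A search space defined by per-layer kernel sizes and expansion ratios can be sketched as below. The value sets {3, 5, 7} and {1, 3, 6} come from the paragraph above; the function names and the number of layers are hypothetical and chosen only for illustration.

```python
import random

KERNEL_SIZES = (3, 5, 7)        # possible kernel sizes per layer
EXPANSION_RATIOS = (1, 3, 6)    # possible expansion ratios per layer


def sample_architecture(num_layers, rng):
    """Pick one kernel size and one expansion ratio for each layer."""
    return [
        {"kernel_size": rng.choice(KERNEL_SIZES),
         "expansion_ratio": rng.choice(EXPANSION_RATIOS)}
        for _ in range(num_layers)
    ]


def search_space_size(num_layers):
    """Number of distinct architectures when layers are chosen independently."""
    return (len(KERNEL_SIZES) * len(EXPANSION_RATIOS)) ** num_layers


# num_layers=4 is an assumed value for the example only.
architecture = sample_architecture(num_layers=4, rng=random.Random(0))
size = search_space_size(num_layers=4)
```

The size grows exponentially with the number of searchable layers, which is why exhaustive enumeration quickly becomes impractical.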
[0083] The search space for candidate hardware accelerator architectures can be defined by the possible values of a set of searchable hardware parameters. Example hardware parameters can include the number of compute units, parallelism, compute-to-memory ratio, bandwidth, etc., associated with a given hardware accelerator (e.g., an industry-standard, highly parameterized edge accelerator), collectively specifying the hardware architecture that includes the corresponding computational characteristics of the hardware accelerator. Each hardware parameter is typically associated with one or more values, such as integers or floating-point values, which can be selected from the set of possible values for the hardware parameter.
[0084] Examples of hardware search spaces and the sets of corresponding hardware parameters that define these search spaces are described in Table 1 below.
[0085] Table 1 below illustrates an example candidate architecture design space, where “PE” refers to a processing element capable of performing matrix multiplication in a single instruction multiple data (SIMD) paradigm, and “PEs_in_x_dimension” refers to the number of processing elements along the horizontal dimension of the hardware accelerator. Typically, the number of PEs in each dimension defines the aspect ratio of the hardware accelerator. Within each PE, there can be multiple compute lanes sharing local memory, and each lane can have a register file and a series of SIMD-style multiply-accumulate (MAC) compute units.
[0086]
Parameter             Type     Search space
PEs_in_x_dimension    int      1, 2, 4, 6, 8
PEs_in_y_dimension    int      1, 2, 4, 6, 8
SIMD_units            int      16, 32, 64, 128
register_file_KB      int      8, 16, 32, 64, 128
local_memory_MB       int      0.5, 1, 2, 3, 4
compute_lanes         int      1, 2, 4, 8
io_bandwidth_gbps     float    5, 10, 15, 20, 25
[0087] Table 1
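The design space of Table 1 can be written down as a mapping from each hardware parameter to its possible values, as in the sketch below. The parameter names and values are copied from Table 1; the helper names `hardware_space_size` and `sample_hardware_config` are hypothetical.

```python
import math
import random

# Possible values per hardware parameter, as listed in Table 1.
HARDWARE_SEARCH_SPACE = {
    "PEs_in_x_dimension": (1, 2, 4, 6, 8),
    "PEs_in_y_dimension": (1, 2, 4, 6, 8),
    "SIMD_units": (16, 32, 64, 128),
    "register_file_KB": (8, 16, 32, 64, 128),
    "local_memory_MB": (0.5, 1, 2, 3, 4),
    "compute_lanes": (1, 2, 4, 8),
    "io_bandwidth_gbps": (5, 10, 15, 20, 25),
}


def hardware_space_size(space=HARDWARE_SEARCH_SPACE):
    """Count configurations when every parameter is chosen independently."""
    return math.prod(len(values) for values in space.values())


def sample_hardware_config(rng, space=HARDWARE_SEARCH_SPACE):
    """Draw one value per parameter uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


num_configs = hardware_space_size()
config = sample_hardware_config(random.Random(0))
```

With the values of Table 1, the count is 5 x 5 x 4 x 5 x 5 x 4 x 5 = 50,000 candidate accelerator configurations.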
[0088] Specifically, in this example, searchable hardware parameters may include one or more of the following: the bandwidth of the hardware accelerator, the number of processing elements included in the hardware accelerator, the layout of the processing elements on the hardware accelerator, the number of single instruction multiple data (SIMD) multiply-accumulate (MAC) operations in each processing element, the number of compute lanes in each processing element, the size of shared memory in each processing element, or the size of register files in each processing element.
[0089] While a total of three example search spaces (two for neural network architectures and one for hardware accelerator architectures) have now been described, it should be understood that the described techniques can be used to search any search space defined by the possible values of hyperparameters or sets of parameters or other tunable variables. For example, different neural network architecture search spaces can have layers composed of different kinds of operations, such as different kinds of residual blocks or different kinds of convolutional operations, such as dilated convolution, spatial convolution, etc. Similarly, different hardware accelerator architecture search spaces can have hardware components that perform different operations or supply different resources, such as different kinds of memory, such as PE memory, core memory, parameter memory, etc.
[0090] In some implementations, each candidate neural network architecture in the joint search space has a different subset of a shared set of parameters, and the corresponding values of the shared set of parameters are jointly updated by the system during the search process. This can improve search efficiency, thereby saving computational resources (e.g., in terms of processing cycles) required to determine the final neural network architecture and the final hardware accelerator architecture.
[0091] Specifically, in these implementations, each candidate neural network architecture performs a set of operations that use a different subset of a shared set of model parameters. The subset for each candidate neural network architecture is defined by a corresponding set of decision values, which includes the decision values for each of a plurality of first classification decisions. In other words, the decision values of the first classification decisions specify which operations are performed by the candidate neural network architecture and, correspondingly, which model parameters from the shared set are used by the neural network architecture.
[0092] For example, the possible values of the first classification decision define one or more aspects of the neural network architecture. Any aspects not defined by the first classification decision are fixed; that is, they are identical for all architectures in the candidate neural network architecture space. The first classification decision can include multiple different types of classification decisions, each corresponding to a specific point in the neural network.
[0093] As an example, the first classification decision may include a binary decision to determine whether a corresponding layer (or other operation) in the neural network is skipped or included in the neural network architecture. As another example, the first classification decision may include a decision specifying which operation(s) from the corresponding set of operations are performed at a given point in the neural network. For example, the first classification decision may specify whether a given layer in the architecture is a convolutional layer, an inverse bottleneck layer, etc. As yet another example, the first classification decision may specify which convolution in a different set of convolutions is performed, for example, by specifying the spatial size of the filters in the convolutional layers of a convolutional neural network.
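How a set of decision values selects a subset of the shared model parameters can be sketched as follows. This is a toy illustration under stated assumptions: the three-layer layout, the operation names `conv3x3` and `conv5x5`, and the placeholder parameter values are hypothetical, and real shared parameters would be weight tensors.

```python
# Hypothetical shared set of model parameters: one entry per (layer, operation).
SHARED_PARAMETERS = {
    (layer, op): [0.0, 0.0, 0.0]   # placeholder values standing in for weights
    for layer in range(3)
    for op in ("conv3x3", "conv5x5")
}


def parameters_used(decisions):
    """Map decision values to the keys of the shared parameters they select.

    `decisions` holds one value per layer: "skip" omits the layer,
    any other value names the operation performed at that layer.
    """
    return {(layer, op) for layer, op in enumerate(decisions) if op != "skip"}


candidate_a = parameters_used(["conv3x3", "skip", "conv5x5"])
candidate_b = parameters_used(["conv3x3", "conv5x5", "conv5x5"])
# Parameters used by both candidates are shared and updated jointly.
shared_between = candidate_a & candidate_b
```

Here the two candidates differ at the second layer but reuse the same entries for the first and third layers, which is the source of the efficiency gain described above.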
[0094] In some implementations, each candidate hardware accelerator architecture has a set of hardware characteristics defined by a set of hardware parameters. The set of hardware parameters for each candidate hardware accelerator architecture is defined by a corresponding set of decision values, which includes the decision value for each of the second plurality of classification decisions. In other words, the decision values of the hardware accelerator classification decisions specify which hardware characteristics the candidate hardware accelerator architecture should possess.
[0095] For example, the possible values of the second classification decision define one or more aspects of the hardware characteristics of the hardware accelerator.
[0096] The neural architecture and hardware architecture search system 100 automatically searches the joint search space using a controller policy 110, a training engine 120, and a controller policy adjustment engine 130 to determine the neural network architecture 150 and the hardware accelerator architecture 160.
[0097] The controller policy 110 is typically implemented as software that is configurable to generate a policy output that includes values for a set of hyperparameters and a set of hardware parameters. The values of the set of hyperparameters collectively define a possible architecture of the neural network, and the values of the set of hardware parameters collectively define a possible architecture of the hardware accelerator. For example, the software has adjustable settings for generating different values for different hyperparameters or hardware parameters.
[0098] In some implementations, the controller policy 110 may be implemented as a neural network, hereinafter referred to as a "controller neural network". The controller neural network is a neural network with parameters (referred to herein as "controller network parameters") and is configured to generate output sequences 112 based on the controller network parameters. Each output sequence 112 generated by the controller neural network defines a corresponding possible architecture for a candidate neural network (hereinafter referred to as a "sub-neural network") and a corresponding possible architecture for a candidate hardware accelerator.
[0099] In some of these implementations, each output sequence 112 includes a corresponding output at each of a plurality of time steps, and each time step in the output sequence corresponds to a different hyperparameter of the sub-neural network architecture or a different hardware parameter of the hardware accelerator architecture. Therefore, each output sequence 112 includes a corresponding value of the corresponding hyperparameter or a corresponding value of the corresponding hardware parameter at each time step. Overall, the values of the hyperparameters in a given output sequence define the architecture of the sub-neural network, while the values of the hardware parameters in a given output sequence define the architecture of the hardware accelerator.
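The correspondence between time steps and parameters can be sketched as a simple decoding step. The ordering in `TIME_STEPS` and the function name `decode_output_sequence` are hypothetical; the listed values are taken from the example search spaces described earlier.

```python
# Hypothetical ordering of time steps: (kind, parameter name, possible values).
TIME_STEPS = [
    ("hyperparameter", "kernel_size", (3, 5, 7)),
    ("hyperparameter", "expansion_ratio", (1, 3, 6)),
    ("hardware", "PEs_in_x_dimension", (1, 2, 4, 6, 8)),
    ("hardware", "compute_lanes", (1, 2, 4, 8)),
]


def decode_output_sequence(output_sequence):
    """Split one output sequence into network and accelerator architectures."""
    if len(output_sequence) != len(TIME_STEPS):
        raise ValueError("expected one value per time step")
    network, accelerator = {}, {}
    for value, (kind, name, allowed) in zip(output_sequence, TIME_STEPS):
        if value not in allowed:
            raise ValueError(f"{value!r} is not a possible value of {name}")
        target = network if kind == "hyperparameter" else accelerator
        target[name] = value
    return network, accelerator


network_arch, accelerator_arch = decode_output_sequence([5, 6, 4, 8])
```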
[0100] Alternatively, in some other embodiments, the controller policy may include a set of controller policy parameters that define a corresponding probability distribution over the possible values of each hyperparameter of the neural network architecture and each hardware parameter of the hardware accelerator architecture. The system 100 can then use the controller policy parameters to select candidate neural network architectures and candidate hardware accelerator architectures. In some of these embodiments, each output sequence 112 may include corresponding values of the hyperparameters and hardware parameters sampled by the system 100 from the possible values according to the probability distributions.
[0101] In other embodiments, the controller policy 110 may include a set of controller policy parameters that define a corresponding probability distribution for each of the first plurality of classification decisions and the second plurality of classification decisions, and the system 100 may use the controller policy parameters to select candidate neural network architectures and candidate hardware accelerator architectures. That is, in these embodiments, the candidate neural network architectures and candidate hardware accelerator architectures are defined by a set of decision values sampled from probability distributions generated using the controller policy parameters. In some of these embodiments, each output sequence 112 alternatively includes a set of decision values for each of the first plurality of classification decisions and the second plurality of classification decisions.
[0102] During the search process, the system 100 repeatedly adjusts the controller policy 110 using the controller policy adjustment engine 130 to determine the architecture of the sub-neural network and the architecture of the hardware accelerator, so that the controller policy 110 can propose a neural network architecture and a hardware accelerator architecture that satisfy one or more search objectives 106.
[0103] In some implementations where the controller policy 110 is implemented as a controller neural network, the system can achieve this by adjusting the values of the controller network parameters. Specifically, during each iteration of the training procedure, the system 100 generates a batch of output sequences 112 using the controller neural network based on the current values of the controller network parameters. For each output sequence 112 in the batch, the training engine 120 trains an instance of a sub-neural network with the architecture defined by the output sequence on training data 102 and evaluates the performance of the trained instance on validation data 104. For each output sequence 112 in the batch, the system 100 also evaluates the performance of the hardware accelerator supporting the operation of the sub-neural network, for example, by using appropriate computer architecture simulation tools or techniques. The controller policy adjustment engine 130 then uses the evaluation results for the output sequences 112 in the batch (i.e., neural network performance metric 122 and accelerator performance metric 124) to update the current values of the controller network parameters, so as to improve the expected performance, on the task, of the neural network architectures and hardware accelerator architectures defined by the output sequences generated by the controller neural network.
[0104] Alternatively, in some other embodiments, the controller policy 110 includes a set of controller policy parameters that define a corresponding probability distribution over the possible values of each hyperparameter of the candidate neural network and each hardware parameter of the candidate hardware accelerator (or that define a corresponding probability distribution for each of a first plurality of classification decisions and a second plurality of classification decisions). In these embodiments, the controller policy adjustment engine 130 can update the controller policy 110 through reinforcement learning to maximize a reward function that depends on the neural network performance metric 122 and the accelerator performance metric 124 of the candidate neural network architecture and the candidate hardware accelerator architecture, which are respectively defined by the values of the hyperparameters and hardware parameters (or the sets of decision values) sampled from the probability distributions generated using the controller policy parameters. In some of these embodiments, the training engine 120 jointly updates a shared set of model parameters to optimize an objective function that measures the performance of the candidate neural network architecture on a specific machine learning task.
[0105] By repeatedly updating controller policy 110, system 100 can encourage controller policy 110 to generate output sequences that result in improved neural network performance on the specific task when the sub-neural network is deployed on a hardware accelerator with improved hardware accelerator performance. For example, the updates can maximize the expected accuracy on the validation set 104 of the neural network with the neural network architecture proposed by controller policy 110, while minimizing the runtime latency of the neural network and minimizing the area of the hardware accelerator with the hardware accelerator architecture proposed by controller policy 110.
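One way to update a policy made of independent probability distributions is a REINFORCE-style policy-gradient step with a mean-reward baseline, sketched below. This description does not name a specific reinforcement learning algorithm, so the choice of REINFORCE, the logit parameterization, the learning rate, and all function names are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_decisions(policy_logits, rng):
    """Sample one value index per decision from its own distribution."""
    return [rng.choices(range(len(logits)), weights=softmax(logits))[0]
            for logits in policy_logits]


def reinforce_update(policy_logits, samples, rewards, learning_rate=0.5):
    """Apply one in-place REINFORCE step using the batch mean as baseline."""
    baseline = sum(rewards) / len(rewards)
    # Probabilities are fixed before the update so every sample in the
    # batch is scored against the policy that generated it.
    probs = [softmax(logits) for logits in policy_logits]
    for decisions, reward in zip(samples, rewards):
        advantage = reward - baseline
        for d, chosen in enumerate(decisions):
            for i in range(len(policy_logits[d])):
                indicator = 1.0 if i == chosen else 0.0
                grad_log_prob = indicator - probs[d][i]
                policy_logits[d][i] += (
                    learning_rate * advantage * grad_log_prob / len(samples))


# One decision with two possible values; value 0 earned the higher reward.
policy = [[0.0, 0.0]]
reinforce_update(policy, samples=[[0], [1]], rewards=[1.0, 0.0])
drawn = sample_decisions(policy, random.Random(0))
```

After the step, the logit of the better-rewarded value is larger, so that value is sampled more often in later batches. In a full system the reward would be computed from performance metrics 122 and 124.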
[0106] Figure 4 is a diagram illustrating the joint search over the neural architecture of the neural network and the hardware architecture of the hardware accelerator. Specifically, Figure 4 shows an example of determining a specific architecture for a neural network which, when deployed on a specific hardware accelerator with an architecture determined by the system, can perform a specific machine learning task with acceptable accuracy and acceptable runtime latency.
[0107] As illustrated, at each iteration, controller policy 410 generates a policy output that includes the values of a set of hyperparameters that jointly define a possible architecture of neural network 412 and the values of a set of hardware parameters that jointly define a possible architecture of hardware accelerator 414. Training engine 420 trains an instance of a sub-neural network with the architecture 412 defined by the policy output on training data and evaluates the performance of the trained instance on a validation set. Accelerator performance estimator 430 simulates an instance of the hardware accelerator to estimate the effect of deploying the sub-neural network on the hardware accelerator and thereby determine an estimated latency. Then, controller policy tuning engine 440 uses the evaluation results (i.e., accuracy and latency) to update controller policy 410 to improve the performance of the new neural network architecture and the new hardware accelerator architecture defined by the policy output generated by controller policy 410 in the next iteration.
[0108] After the controller policy 110 has been updated, for example, once the controller neural network has been trained, the system 100 can select the neural network architecture and hardware accelerator architecture that best satisfy the search objective 106 as the final architecture of the sub-neural network and the final architecture of the hardware accelerator, respectively. Alternatively or additionally, the system 100 can generate a new output sequence by, for example, using the updated controller policy 110 based on the trained values of the controller network parameters, and use the neural network architecture and hardware accelerator architecture defined by the new output sequence as the final architecture of the sub-neural network and the final architecture of the hardware accelerator, respectively.
[0109] Then, the neural architecture and hardware architecture search system 100 can generate as output (i) neural network architecture data 150 specifying the architecture of the sub-neural network, such as data specifying the layers that are part of the sub-neural network, the connectivity between the layers, and the operations performed by the layers, and (ii) hardware accelerator architecture data 160 specifying the architecture of the hardware accelerator, such as data specifying the layout of processing elements on the hardware accelerator, the number of compute lanes, and the size of local memory.
[0110] For example, the neural architecture and hardware architecture search system 100 can output neural network architecture data 150 and hardware accelerator architecture data 160 to the user who provided the search target 106. As another example, the system 100 can output the hardware accelerator architecture data, for example via a wired or wireless network, to a semiconductor manufacturing facility containing semiconductor manufacturing equipment that can be used to manufacture hardware accelerators with the final hardware architecture. In some cases, the output data also includes trained values of the parameters of the sub-neural network, taken from a trained instance of the sub-neural network having the architecture.
[0111] In some implementations, instead of or in addition to outputting neural network architecture data 150 and hardware accelerator architecture data 160, system 100 trains an instance of the neural network with the determined architecture, for example, from scratch or by fine-tuning the parameter values generated as a result of training the instance of the sub-neural network with that architecture, and then uses the trained neural network to process requests received from users (e.g., via an API provided by the system). That is, system 100 can receive an input to be processed, process the input using the trained neural network, and, in response to the received input, provide the output generated by the trained neural network or data derived from the generated output.
[0112] In some implementations, system 100 may be included as part of software tools (such as electronic design automation (EDA) tools) for designing and / or analyzing integrated circuits, and then hardware accelerator architecture data may be provided to another component of the tool for further refinement or evaluation before the hardware accelerator is manufactured.
[0113] In implementations where the controller policy is implemented as a controller neural network, system 100 can train the controller neural network in a distributed manner. That is, system 100 includes multiple replicas of the controller neural network. In some of these distributed training implementations, each replica has a dedicated training engine that generates performance metrics for the batches of output sequences generated by the replica, and a dedicated controller policy tuning engine that uses the performance metrics to determine updates to the controller network parameters. Once the controller policy tuning engine has determined the updates, it can transmit the updates to a central policy tuning server accessible to all controller policy tuning engines. The central policy tuning server can update the values of the controller network parameters maintained by the server and send the updated values to the controller policy tuning engines. In some cases, each of the multiple replicas and its corresponding training engine and policy tuning engine can operate asynchronously with respect to each of the other sets of replicas, training engines, and policy tuning engines.
[0114] Figure 2 is a flowchart of an example process 200 for updating a controller policy. For convenience, process 200 will be described as being executed by a system of one or more computers located in one or more locations. For example, a properly programmed system (e.g., the neural architecture and hardware architecture search system 100 of Figure 1) can execute process 200.
[0115] The system can repeatedly execute process 200 to iteratively determine the update of the controller policy.
[0116] The system uses the controller policy to generate a batch of one or more output sequences (step 202). Each output sequence in the batch defines (i) the corresponding architecture of a sub-neural network configured to perform a specific machine learning task and (ii) the corresponding architecture of the hardware accelerator on which the trained instance of the sub-neural network is to be deployed.
[0117] Depending on the details of the controller policy, the system can generate each output sequence in any of a variety of ways. For example, when generating the output sequence, the system can first generate the corresponding hyperparameter values of the sub-neural network, and then generate the corresponding hardware parameter values of the hardware accelerator. That is, the output sequence can include the values of the corresponding hyperparameters of the sub-neural network at each time step in a first plurality of time steps and the values of the corresponding hardware parameters of the hardware accelerator at each time step in a second plurality of time steps after the last time step in the first plurality of time steps. As another example, the system can first generate the corresponding hardware parameter values of the hardware accelerator, and then generate the corresponding hyperparameter values of the sub-neural network. As yet another example, the system can generate the corresponding hyperparameter values of the sub-neural network and the corresponding hardware parameter values of the hardware accelerator in an interleaved manner.
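The three generation orders described above can be sketched as different arrangements of time steps. The function name `time_step_order`, the mode strings, and the simple alternating rule used for the interleaved case are assumptions for illustration; the description does not fix a particular interleaving pattern.

```python
from itertools import zip_longest


def time_step_order(hyperparameters, hardware_parameters, mode):
    """Return the parameter generated at each time step, tagged by kind."""
    nn_steps = [("network", name) for name in hyperparameters]
    hw_steps = [("hardware", name) for name in hardware_parameters]
    if mode == "network_first":
        return nn_steps + hw_steps
    if mode == "hardware_first":
        return hw_steps + nn_steps
    if mode == "interleaved":
        # Alternate one network step and one hardware step; once the
        # shorter list runs out, the remaining steps follow in order.
        return [step
                for pair in zip_longest(nn_steps, hw_steps)
                for step in pair
                if step is not None]
    raise ValueError(f"unknown mode: {mode!r}")


hyperparameters = ["kernel_size", "expansion_ratio"]
hardware_parameters = ["PEs_in_x_dimension", "compute_lanes", "SIMD_units"]
interleaved = time_step_order(hyperparameters, hardware_parameters,
                              "interleaved")
```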
[0118] In some implementations, the controller policy can be implemented as a controller neural network. In some such implementations, the neural network can be a recurrent neural network comprising one or more recurrent neural network layers configured to receive, for each time step, the value of the hyperparameter (or hardware parameter) corresponding to the previous time step in a given output sequence as input, and to process the input to update the current hidden state of the recurrent neural network. For example, the recurrent layers in the controller neural network can be long short-term memory (LSTM) layers or gated recurrent unit (GRU) layers.
[0119] Therefore, to generate the hyperparameter (or hardware parameter) value for a given time step in the output sequence, the system provides the controller neural network with the value of the hyperparameter (or hardware parameter) at the previous time step in the output sequence as input, and the controller neural network generates the output for that time step, which defines a score distribution over the possible values of the hyperparameter (or hardware parameter) at that time step. The system can generate the score distribution using the output layer of the controller neural network, which can be configured as a softmax layer. For the first time step in the output sequence, since there is no previous time step, the system can instead provide a predetermined placeholder input. The system then samples from the possible values according to the score distribution to determine the value of the hyperparameter (or hardware parameter) at that time step in the output sequence. The possible values that a given hyperparameter (or hardware parameter) can take are fixed before training, and the number of possible values can vary for different hyperparameters (or hardware parameters).
[0120] When a batch includes more than one output sequence (e.g., eight, sixteen, thirty-two, or sixty-four sequences), the sequences in the batch will typically differ, even though each of them is generated based on the same controller parameter values, because the system samples from a score distribution when generating each hyperparameter (or hardware parameter) value in the output sequence.
[0121] In some other implementations, instead of being configured as a neural network, the controller policy may include a set of controller policy parameters that define a corresponding probability distribution over the possible values of each hyperparameter of the neural network architecture and each hardware parameter of the hardware accelerator architecture. To generate one or more output sequences that each define the corresponding architectures of (i) the sub-neural network and (ii) the hardware accelerator, the system then repeatedly samples from the possible values according to the probability distributions to determine the corresponding values of the hyperparameters and hardware parameters to be included in the output sequences.
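The step-by-step sampling described above can be sketched as follows. The recurrent controller network is replaced here by a hypothetical callable `score_fn(step, previous_value, num_values)` that returns one score per possible value; `uniform_scores` is a stand-in that ignores its inputs, so the sketch shows only the softmax-and-sample loop and not a trained controller.

```python
import math
import random


def softmax(scores):
    """Normalize scores into a distribution, as a softmax output layer does."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_output_sequence(possible_values, score_fn, rng, placeholder=None):
    """Generate one output sequence, one sampled value per time step."""
    sequence = []
    previous = placeholder   # the first step has no previous value
    for step, values in enumerate(possible_values):
        scores = score_fn(step, previous, len(values))
        value = rng.choices(values, weights=softmax(scores))[0]
        sequence.append(value)
        previous = value     # fed back as input at the next time step
    return sequence


def uniform_scores(step, previous_value, num_values):
    """Stand-in for the controller network: equal scores for all values."""
    return [0.0] * num_values


# The number of possible values may differ from step to step.
POSSIBLE_VALUES = [(3, 5, 7), (1, 3, 6), (1, 2, 4, 8)]
sequence = sample_output_sequence(POSSIBLE_VALUES, uniform_scores,
                                  random.Random(0))
```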
[0122] For each output sequence in the batch, the system trains a corresponding instance of a sub-neural network having the architecture defined by the output sequence to perform the specific machine learning task (step 204). That is, for each output sequence in the batch, the system instantiates a neural network having the architecture defined by the output sequence and trains the instance on the received training data to perform the specific machine learning task using a conventional machine learning training technique suitable for that task (e.g., stochastic gradient descent with backpropagation or backpropagation through time). In some implementations, the system parallelizes the training of the sub-neural networks to reduce the total training time of the controller neural network. The system can train each sub-neural network for a specified amount of time or a specified number of training iterations.
[0123] For each output sequence in the batch, the system evaluates the network performance of the corresponding trained instance of the sub-neural network on the specific machine learning task to determine a network performance metric for the trained instance (step 206). For example, the performance metric can be the accuracy of the trained instance on a validation set, as measured by an appropriate accuracy measure. For instance, the accuracy can be a perplexity measure when the outputs are sequences, or a classification error rate when the task is a classification task. As another example, the performance metric can be the average or the maximum of the accuracies of the instance over each of the last two, five, or ten epochs of its training.
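As a small illustration of the last example, the following sketch reduces a per-epoch accuracy history to a single metric by averaging or taking the maximum over the most recent epochs. The function name `network_performance_metric`, its arguments, and the accuracy values are assumptions made for this example.

```python
def network_performance_metric(epoch_accuracies, last_n=5, reduce="mean"):
    """Reduces the accuracies of the last `last_n` epochs to a single number."""
    if not epoch_accuracies:
        raise ValueError("at least one epoch accuracy is required")
    window = epoch_accuracies[-last_n:]
    if reduce == "mean":
        return sum(window) / len(window)
    if reduce == "max":
        return max(window)
    raise ValueError("reduce must be 'mean' or 'max'")


# Illustrative validation accuracies, one per training epoch.
history = [0.52, 0.61, 0.68, 0.71, 0.70, 0.73, 0.72]
mean_metric = network_performance_metric(history, last_n=5, reduce="mean")
max_metric = network_performance_metric(history, last_n=5, reduce="max")
```

Averaging over several epochs is less sensitive to the noise of a single evaluation than the final-epoch accuracy alone, which is one plausible reason to prefer it as a reward signal.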
[0124] Additionally, for each output sequence in the batch, the system evaluates the accelerator performance of a corresponding instance of the hardware accelerator having the architecture defined by the output sequence to determine an accelerator performance metric for the hardware accelerator instance (step 208). The accelerator performance metric measures how well the hardware accelerator instance supports the operations of the trained instance of the sub-neural network, having the architecture defined by the output sequence, on the specific machine learning task.
[0125] In some implementations, various tools suitable for evaluating hardware design alternatives can be used to assess hardware accelerator performance. One example of such a tool is a cycle-accurate performance simulator. The system can use a cycle-accurate performance simulator, together with, for example, simulation data specifying (i) the corresponding architecture of the sub-neural network and (ii) the corresponding architecture of the hardware accelerator defined by the output sequence, to determine the estimated latency (e.g., in milliseconds) of the neural network in performing the specific machine learning task when deployed on a (simulated) instance of the hardware accelerator.
[0126] Another example of such a tool is an analytical area estimator. The system can use an analytical area estimator, together with, for example, simulation data specifying the corresponding architecture of the hardware accelerator defined by the output sequence, to determine the estimated area (e.g., in square millimeters) of an instance of the hardware accelerator.
[0127] In some other implementations, various machine learning-based techniques can instead be used to determine the accelerator performance metrics. Unlike expensive simulators, which can take an hour or more to evaluate the performance of a single hardware accelerator having a proposed hardware architecture, machine learning-based techniques such as neural networks are generally much faster and more resource-efficient at determining performance metrics.
[0128] For example, the system can use a neural network (e.g., a feedforward neural network) configured to receive as input data specifying the corresponding architecture of the hardware accelerator and, in some cases, data specifying the corresponding architecture of the sub-neural network, and to process the input according to the current values of the parameters of the neural network to generate as output a prediction of the area of the hardware accelerator. As another example, the system can use another neural network to generate a prediction of the model accuracy of the neural network, or a prediction of the latency of the neural network when deployed on the hardware accelerator. So that the neural network can effectively predict performance metrics, it can be trained using supervised training techniques on labeled training data generated using the simulators described above.
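The supervised training of such a predictor can be sketched as follows, assuming a very small one-hidden-layer feedforward network trained by plain stochastic gradient descent. The function `stand_in_simulator_area` is a hypothetical closed-form substitute for a real area simulator, and the two normalized input features are illustrative only.

```python
import math
import random


def stand_in_simulator_area(pe_fraction, memory_fraction):
    """Hypothetical stand-in for an area simulator; returns an area in mm^2."""
    return 0.5 + 2.0 * pe_fraction + 1.0 * memory_fraction + 0.5 * pe_fraction * memory_fraction


def init_params(num_inputs, num_hidden, rng):
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(num_inputs)] for _ in range(num_hidden)],
        "b1": [0.0] * num_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(num_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Returns the hidden activations and the predicted area."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    prediction = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return hidden, prediction


def sgd_step(params, x, target, learning_rate):
    """One gradient step on the squared error 0.5 * (prediction - target)^2."""
    hidden, prediction = forward(params, x)
    error = prediction - target
    for j, h in enumerate(hidden):
        # Backpropagate through tanh using the output weight before updating it.
        grad_pre_activation = error * params["w2"][j] * (1.0 - h * h)
        params["w2"][j] -= learning_rate * error * h
        for i, xi in enumerate(x):
            params["w1"][j][i] -= learning_rate * grad_pre_activation * xi
        params["b1"][j] -= learning_rate * grad_pre_activation
    params["b2"] -= learning_rate * error


def mean_squared_error(params, dataset):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in dataset) / len(dataset)


rng = random.Random(0)
# Labeled training data: hardware parameters paired with simulator outputs.
inputs = [[rng.random(), rng.random()] for _ in range(64)]
dataset = [(x, stand_in_simulator_area(*x)) for x in inputs]

params = init_params(num_inputs=2, num_hidden=4, rng=rng)
initial_mse = mean_squared_error(params, dataset)
for _ in range(200):
    for x, y in dataset:
        sgd_step(params, x, y, learning_rate=0.05)
final_mse = mean_squared_error(params, dataset)

# Once trained, a prediction costs a single forward pass.
predicted_area = forward(params, [0.5, 0.5])[1]
```

The design point illustrated here is the trade: the simulator is queried once per labeled example up front, after which each candidate evaluated during the search needs only a forward pass.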
[0129] The system uses (i) the network performance metrics of the trained instances of the sub-neural networks and (ii) the accelerator performance metrics of the instances of the hardware accelerator to adjust the controller policy (step 210).
[0130] Typically, the system adjusts the controller policy in a way that encourages it to generate output sequences that lead to increased performance metrics for both the sub-neural network architecture and the hardware accelerator architecture. However, in some cases, depending on the actual progress of the joint search relative to the search objective, the system can adjust the immediate focus of the joint search, for example, by holding fixed the network performance metrics of the trained instances of the sub-neural network on the specific neural network task and adjusting the controller policy using only the determined accelerator performance metrics of the instances of the hardware accelerator.
[0131] In some implementations where the controller policy is implemented as a controller neural network configured as a recurrent neural network, the system adjusts the current controller parameter values by training the controller neural network using reinforcement learning techniques. More specifically, the system trains the controller neural network to generate an output sequence that maximizes the received reward determined based on network performance metrics of the trained neural network instance and accelerator performance metrics of the hardware accelerator.
[0132] Specifically, the reward for a given output sequence is a function of both network performance metrics and accelerator performance metrics. For example, the reward can be calculated by combining (e.g., multiplying) different reward terms that depend on neural network accuracy, runtime latency, and hardware accelerator area, respectively. That is, the system trains a controller neural network to generate an output sequence that maximizes:
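[0133] $$\max_{\alpha,\,h}\ \mathrm{Accuracy}(\alpha, h) \times \left[\frac{\mathrm{Latency}(\alpha, h)}{T_{\mathrm{latency}}}\right]^{w_0} \times \left[\frac{\mathrm{Area}(h)}{T_{\mathrm{area}}}\right]^{w_1}$$
[0134] where $w_0$ and $w_1$ are weighting factors:
[0135] $$w_0 = \begin{cases} p, & \text{if } \mathrm{Latency}(\alpha, h) \le T_{\mathrm{latency}} \\ q, & \text{otherwise} \end{cases} \qquad w_1 = \begin{cases} p, & \text{if } \mathrm{Area}(h) \le T_{\mathrm{area}} \\ q, & \text{otherwise} \end{cases}$$
[0136] and where $\alpha$ denotes the hyperparameters defining the neural network architecture, $h$ denotes the hardware parameters defining the hardware accelerator architecture, $T_{\mathrm{latency}}$ is the target runtime latency of the trained sub-neural network when performing the task, and $T_{\mathrm{area}}$ is the target hardware accelerator area, both of which can be specified in the search target data.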
[0137] In this example, during the search, the system can impose soft constraints on latency, area, or both, for example, by setting p and q to the same non-zero value, such as -0.071. Conversely, to impose a hard constraint, such as a hard constraint on latency, the system can set p = 0 and q = -1. In that case, the system primarily uses accuracy as the search objective as long as the estimated latency meets (e.g., is not greater than) the target latency, and the reward is sharply reduced only if the latency constraint is violated.
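The effect of p and q can be illustrated with a short sketch. It assumes the reward takes a multiplicative form, namely accuracy multiplied by the latency-to-target ratio raised to a weight and by the area-to-target ratio raised to a weight, where each weight equals p when the corresponding constraint is met and q otherwise. The function names and the numeric values are assumptions made for this example.

```python
def constraint_weight(value, target, p, q):
    """Returns p when the constraint is met (value <= target), else q."""
    return p if value <= target else q


def search_reward(accuracy, latency_ms, area_mm2,
                  target_latency_ms, target_area_mm2, p, q):
    """Assumed multiplicative reward over accuracy, latency, and area."""
    w0 = constraint_weight(latency_ms, target_latency_ms, p, q)
    w1 = constraint_weight(area_mm2, target_area_mm2, p, q)
    return (accuracy
            * (latency_ms / target_latency_ms) ** w0
            * (area_mm2 / target_area_mm2) ** w1)


# Hard constraint (p = 0, q = -1): no effect while both targets are met.
hard_ok = search_reward(0.75, 0.8, 50.0, 1.0, 100.0, p=0.0, q=-1.0)
# Latency at twice the target halves the reward.
hard_violated = search_reward(0.75, 2.0, 50.0, 1.0, 100.0, p=0.0, q=-1.0)
# Soft constraint (p = q = -0.071): the same violation costs only a few percent.
soft_violated = search_reward(0.75, 2.0, 100.0, 1.0, 100.0, p=-0.071, q=-0.071)
```

With the hard setting, any design that meets both targets is ranked by accuracy alone. With the soft setting, latency and area trade off smoothly against accuracy on both sides of the targets.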
[0138] In some of these implementations, the system trains the controller neural network, i.e., determines trained values of the controller network parameters from their initial values, to maximize the expected reward using a policy gradient technique. For example, the policy gradient technique can be the REINFORCE technique or a proximal policy optimization (PPO) technique.
[0139] In some other implementations where the controller policy includes a set of controller policy parameters that define a corresponding probability distribution over possible values of each hyperparameter (or hardware parameter) for each hyperparameter of the neural network architecture (or hardware parameter of the hardware accelerator architecture), the system can similarly adjust the current value of the set of controller policy parameters using policy gradient techniques.
[0140] Figure 3 is a flowchart of an example process 300 for selecting the architecture of a neural network and a hardware accelerator by jointly updating a set of controller policy parameters and a set of shared parameters. For convenience, process 300 will be described as being executed by a system of one or more computers located in one or more locations. For example, a properly programmed system (e.g., the neural architecture and hardware architecture search system 100 of Figure 1) can execute process 300.
[0141] The system receives data specifying one or more target hardware constraints on a hardware accelerator on which a neural network for performing a specific machine learning task should be deployed (step 302). For example, the received data may specify the target area or power consumption of the hardware accelerator. As another example, the received data may specify the target latency for the neural network to perform a specific machine learning task when deployed on the hardware accelerator. For example, the target latency may be a measure of the time required for the trained neural network to process a single input or a batch of multiple inputs when deployed on the hardware accelerator.
[0142] The system receives training data and validation data for a specific machine learning task (step 304).
[0143] The system then performs steps 306 to 310 to select a network architecture for a neural network to perform a specific machine learning task from a candidate network architecture space using training and validation data. Additionally, the system performs steps 306 to 310 to select a hardware architecture for a hardware accelerator on which the neural network performing the specific machine learning task will be deployed from a candidate hardware architecture space.
[0144] As described above, both the candidate network architecture space and the candidate hardware architecture space can be part of a larger joint search space. Each candidate neural network architecture in the space is defined by a set of corresponding decision values, which includes the corresponding decision values of each of the first plurality of classification decisions. Similarly, each candidate hardware accelerator architecture in the space is defined by a set of corresponding decision values, which includes the corresponding decision values of each of the second plurality of classification decisions.
[0145] In the example of Figure 3, the system uses a controller policy that includes multiple controller policy parameters to generate a corresponding probability distribution for each of the first and second plurality of classification decisions based on the current values of the controller policy parameters. Specifically, for each classification decision, the controller policy parameters may include a corresponding parameter for each possible decision value of that decision. The system can generate the probability distribution for a given classification decision by applying a softmax function to the current values of the parameters corresponding to the possible decision values of that decision. For example, to select a corresponding decision value for each of the first and second plurality of classification decisions, the system can, for each classification decision, sample a decision value from the probability distribution of that decision.
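A minimal sketch of this parameterization follows: one parameter per possible decision value, a softmax over those parameters, and one sample per decision. The decision names, the parameter values, and the function names are hypothetical.

```python
import math
import random

# Hypothetical controller policy parameters: one value per possible decision
# value, for both network decisions and hardware decisions.
policy_params = {
    "kernel_size": {3: 0.0, 5: 0.0, 7: 0.0},    # first plurality (network)
    "expansion_ratio": {3: 0.2, 6: -0.1},       # first plurality (network)
    "pe_rows": {2: 0.0, 4: 1.0, 8: 0.0},        # second plurality (hardware)
}


def decision_distribution(decision_params):
    """Applies a softmax to the parameters of a single decision."""
    top = max(decision_params.values())
    exps = {value: math.exp(param - top) for value, param in decision_params.items()}
    total = sum(exps.values())
    return {value: e / total for value, e in exps.items()}


def sample_decisions(params, rng):
    """Samples one decision value per decision from its distribution."""
    sampled = {}
    for decision, decision_params in params.items():
        dist = decision_distribution(decision_params)
        values = list(dist)
        sampled[decision] = rng.choices(values, weights=[dist[v] for v in values], k=1)[0]
    return sampled


rng = random.Random(0)
candidate = sample_decisions(policy_params, rng)
```

When all parameters of a decision are equal, as for `kernel_size` here, the softmax is uniform, so an untrained policy explores all values of that decision with equal probability.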
[0146] To select an architecture, the system jointly updates (i) the set of controller policy parameters, which define a corresponding probability distribution over the decision values of each of the first plurality of classification decisions and the second plurality of classification decisions, and (ii) a shared set of model parameters (step 306). In other words, the system repeatedly performs steps 308 and 310, once in each iteration of the joint update. Each iteration may begin with the values of the shared set of model parameters determined in the previous iteration.
[0147] Typically, during the joint update, the system updates the set of controller policy parameters through reinforcement learning to maximize a reward function for the candidate neural network architecture and the candidate hardware accelerator architecture that are defined by a set of decision values sampled from the probability distributions generated using the controller policy parameters (step 308).
[0148] For example, the reward function may include a quality term that measures (i) the estimated quality of the candidate hardware accelerator architecture and (ii) the estimated quality of the candidate neural network architecture, as well as a latency (or power consumption) term that measures the ratio between the estimated latency (or estimated power consumption) of the candidate network architecture and the target latency (or target power consumption).
[0149] The system can use the validation data to determine the estimated quality, on the specific machine learning task, of a neural network having the candidate architecture, where that neural network uses the subset of the shared set of model parameters defined by the selected decision values of the first plurality of classification decisions. Specifically, the system determines the estimated quality based on the current values of the shared set of model parameters.
[0150] As a specific example, the system can determine the estimated quality as the quality of a neural network with a candidate architecture on multiple validation examples from a batch of validation data. That is, the system can use a neural network with a candidate architecture and process each validation input in the batch according to the current values of a corresponding subset of a shared set of model parameters to generate a predicted output, and then use the target output of the validation inputs to compute the accuracy of the predicted output for a machine learning task or other appropriate performance metric.
[0151] The system can use appropriate computer architecture simulation tools or techniques (such as area simulators) to determine the estimated quality of candidate hardware architectures that have a subset of a shared set of model parameters defined by selected decision values of the second plurality of classification decisions.
[0152] The system can use validation data to determine the estimated latency (or power consumption) when performing a specific machine learning task with a neural network having a subset of a shared set of model parameters defined by the selected decision values of the classification decision.
[0153] For example, when a neural network with a candidate neural network architecture is deployed on an instance of a hardware accelerator with a candidate hardware accelerator architecture, the system determines the latency of each example in a batch of validation examples. That is, the system can use a neural network with a candidate architecture deployed on an instance of the hardware accelerator to process each validation input in the batch to generate a predicted output, and then measure the latency of processing the batch.
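The measurement can be sketched as follows. Note the substitution: no hardware accelerator is involved here, and an ordinary Python callable stands in for the deployed network so that the timing logic itself can be shown. The names `measure_batch_latency_ms` and `toy_model`, the warm-up run, and the use of the median over several timed runs are assumptions made for this example.

```python
import statistics
import time


def measure_batch_latency_ms(run_inference, batch, warmup_runs=1, timed_runs=5):
    """Times how long `run_inference` takes to process the whole batch."""
    # Warm-up passes keep one-time setup costs out of the measurement.
    for _ in range(warmup_runs):
        for example in batch:
            run_inference(example)
    timings_ms = []
    outputs = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        outputs = [run_inference(example) for example in batch]
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(timings_ms), outputs


def toy_model(example):
    """Stand-in for a deployed candidate network (not a real model)."""
    return sum(value * value for value in example)


validation_batch = [[float(i), float(i + 1)] for i in range(16)]
latency_ms, predictions = measure_batch_latency_ms(toy_model, validation_batch)
```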
[0154] As another example, the system can use a computer architecture simulator that simulates instances of hardware accelerators with candidate hardware accelerator architectures to simulate the effect of deploying neural networks on hardware accelerators in order to determine estimated latency or estimated power consumption.
[0155] As another example, the system can use a latency simulation neural network and an area simulation neural network to determine predictions of the latency and the area, respectively. These neural networks can be trained on labeled training data generated using a computer architecture simulator.
[0156] The system then determines, through reinforcement learning, an update to the controller policy parameters that improves the reward function based on the estimated quality of the candidate hardware accelerator architecture, the estimated quality of the candidate neural network architecture, and the estimated latency. Specifically, the system can perform an update step of a policy gradient reinforcement learning algorithm (e.g., the REINFORCE algorithm) on the computed reward (i.e., the output of the reward function for the estimated quality and the estimated latency) to determine the update to the controller policy parameters.
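One such update step can be sketched for a single decision whose distribution is a softmax over per-value parameters. The sketch uses a REINFORCE-style estimator in which the gradient of the log-probability of the sampled value with respect to the parameters is the one-hot vector of the sample minus the softmax probabilities. The baseline subtracted from the reward, the learning rate, and the function names are assumptions of this example rather than details given in this specification.

```python
import math


def softmax(params):
    top = max(params)
    exps = [math.exp(p - top) for p in params]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(params, sampled_index, reward, baseline, learning_rate):
    """Returns updated parameters after one REINFORCE-style ascent step."""
    probs = softmax(params)
    advantage = reward - baseline  # baseline is an assumed variance-reduction term
    return [
        param + learning_rate * advantage * ((1.0 if i == sampled_index else 0.0) - prob)
        for i, (param, prob) in enumerate(zip(params, probs))
    ]


# One decision with three possible values, initially uniform.
params_before = [0.0, 0.0, 0.0]
params_after = reinforce_update(
    params_before, sampled_index=1, reward=1.0, baseline=0.5, learning_rate=0.1
)
prob_before = softmax(params_before)[1]
prob_after = softmax(params_after)[1]
```

A reward above the baseline raises the probability of the sampled value, and a reward below it lowers that probability, which is how the policy drifts toward decision values that yield better rewards.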
[0157] During the joint update, the system also updates the shared set of model parameters to optimize an objective function that measures the performance, on the particular machine learning task, of the candidate neural network architecture defined by the set of decision values sampled, for the first plurality of classification decisions, from the probability distributions generated using the controller policy parameters (step 310).
[0158] For example, the system can sample a batch of training examples from the training data and perform training steps on the sampled batch using an appropriate deep learning algorithm (e.g., stochastic gradient descent) to compute gradient updates, that is, compute the gradient of the objective function relative to a subset of model parameters and then apply the gradient updates to the current values of the subset.
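The key property of this step, that only the subset of shared parameters selected by the sampled decisions is updated, can be sketched with a deliberately tiny model. Everything here is a toy assumption: each shared parameter is a single scalar, the candidate network predicts the input multiplied by the sum of its selected scalars, and the loss is a mean squared error.

```python
# Hypothetical shared parameters, keyed by (decision, decision value).
shared_params = {
    ("layer_0", "conv3x3"): 0.0,
    ("layer_0", "conv5x5"): 0.0,
    ("layer_1", "conv3x3"): 0.0,
    ("layer_1", "conv5x5"): 0.0,
}


def predict(params, selected_keys, x):
    """Toy candidate network: scales x by the sum of the selected weights."""
    return x * sum(params[key] for key in selected_keys)


def batch_loss(params, selected_keys, batch):
    return sum(0.5 * (predict(params, selected_keys, x) - y) ** 2
               for x, y in batch) / len(batch)


def train_step(params, selected_keys, batch, learning_rate):
    """Applies one gradient step to the selected subset only."""
    # In this toy model every selected weight has the same gradient.
    grad = sum((predict(params, selected_keys, x) - y) * x for x, y in batch) / len(batch)
    for key in selected_keys:
        params[key] -= learning_rate * grad


# Decision values sampled from the controller policy for this iteration.
sampled_decisions = {"layer_0": "conv3x3", "layer_1": "conv5x5"}
selected_keys = list(sampled_decisions.items())

training_batch = [(1.0, 2.0), (2.0, 4.0)]
loss_before = batch_loss(shared_params, selected_keys, training_batch)
train_step(shared_params, selected_keys, training_batch, learning_rate=0.05)
loss_after = batch_loss(shared_params, selected_keys, training_batch)
```

The parameters of the operations that were not sampled, such as `("layer_0", "conv5x5")`, are left unchanged and carry over to later iterations in which they may be selected.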
[0159] After the joint update, the system selects a candidate neural network architecture as the neural network architecture for performing a specific machine learning task. This candidate neural network architecture is defined by the corresponding specific decision value of each of the first plurality of classification decisions (step 312).
[0160] The system selects a candidate hardware accelerator architecture as the hardware accelerator architecture on which the neural network will be deployed. This candidate hardware accelerator architecture is defined by the specific decision value of each of the second plurality of classification decisions (step 314).
[0161] For example, the system can choose the candidate neural network architecture or the candidate hardware accelerator architecture by selecting, as the specific decision value for each of the first or second plurality of classification decisions, the decision value with the highest probability in the probability distribution of that classification decision (or, equivalently, the decision value with the highest corresponding parameter value).
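This final selection amounts to an argmax per decision, as the sketch below shows. Because the softmax function is monotonic, taking the argmax over the raw parameter values gives the same result as taking it over the probabilities. The parameter values and names are hypothetical.

```python
# Hypothetical controller policy parameters after the joint update.
trained_policy_params = {
    "kernel_size": {3: 0.4, 5: 1.7, 7: -0.3},   # network decision
    "pe_rows": {2: -1.2, 4: 0.1, 8: 2.5},       # hardware decision
}


def select_final_architecture(params):
    """Picks, for each decision, the value with the highest parameter."""
    return {
        decision: max(decision_params, key=decision_params.get)
        for decision, decision_params in params.items()
    }


final_architecture = select_final_architecture(trained_policy_params)
```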
[0162] This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform a specific operation or action means that the system has installed on it software, firmware, hardware, or a combination thereof that, in operation, causes the system to perform the operation or action. For one or more computer programs to be configured to perform a specific operation or action means that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operation or action.
[0163] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in combinations thereof. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by a data processing apparatus or for controlling the operation of such a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or combinations thereof. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to a suitable receiver apparatus for execution by the data processing apparatus.
[0164] The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of devices, apparatuses, and machines used for processing data, including, by example, programmable processors, computers, or multiple processors or computers. The apparatus may also be or further include special-purpose logic circuit systems, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). In addition to hardware, the apparatus may optionally include code that creates an execution environment for computer programs, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, or combinations thereof.
[0165] Computer programs (which may also be referred to or described as programs, software, software applications, applications, modules, software modules, scripts, or code) can be written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they can be deployed in any form (including as standalone programs or as modules, components, subroutines, or other units suitable for use in a computing environment). A program may, but does not necessarily, correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or portions of code). Computer programs can be deployed to execute on a single computer or on multiple computers located at a single site or distributed across multiple sites and interconnected via a data communication network.
[0166] In this specification, the term "database" is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Therefore, an index database, for example, may include multiple collections of data, each of which can be organized and accessed in different ways.
[0167] Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process programmed to perform one or more specific functions. Typically, an engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same one or more computers.
[0168] The processes and logic flows described in this specification can be executed by one or more programmable computers, which execute one or more computer programs to perform functions by manipulating input data and generating output. The processes and logic flows can also be executed by a dedicated logic circuit system (such as an FPGA or ASIC) or a combination of a dedicated logic circuit system and one or more programmable computers.
[0169] A computer suitable for executing computer programs can be based on a general-purpose or special-purpose microprocessor, or both, or any other type of central processing unit. Typically, the central processing unit receives instructions and data from read-only memory or random access memory, or both. The essential components of a computer are the central processing unit for executing or carrying out instructions and one or more memory devices for storing instructions and data. The central processing unit and memory may be supplemented by or incorporated into a dedicated logic circuit system. Typically, a computer will also include one or more mass storage devices (e.g., disks, magneto-optical disks, or optical disks) for storing data, or the computer may be operatively coupled to receive data from or transfer data to or both of these mass storage devices. However, a computer does not necessarily need to have such devices. Furthermore, a computer can be embedded in another device, such as, to name just a few, mobile phones, personal digital assistants (PDAs), mobile audio or video players, game consoles, GPS receivers, or portable storage devices (e.g., Universal Serial Bus (USB) flash drives).
[0170] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0171] To provide interaction with the user, embodiments of the subject matter described in this specification can be implemented on a computer having: a display device for displaying information to the user, such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor; and a keyboard and pointing device, such as a mouse or trackball, through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form (including acoustic input, voice input, or tactile input). Additionally, the computer can interact with the user by sending documents to and receiving documents from the device used by the user; for example, by sending web pages to a web browser on the user's device in response to a request received from a web browser. Moreover, the computer can interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smartphone) running a messaging application and receiving response messages from the user as a reply.
[0172] The data processing apparatus used to implement machine learning models may also include, for example, dedicated hardware accelerator units for handling the common, compute-intensive parts of machine learning training or production (i.e., inference) workloads.
[0173] Machine learning models can be implemented and deployed using machine learning frameworks, such as the TensorFlow framework, the Microsoft Cognitive Toolkit framework, the Apache Singa framework, or the Apache MXNet framework.
[0174] Embodiments of the subject matter described in this specification can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., client computers with a graphical user interface, web browser, or application through which users can interact with embodiments of the subject matter described in this specification), or computing systems that include one or more such backend components, middleware components, or frontend components. The components of the system can be interconnected via digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks (LANs) and wide area networks (WANs), such as the Internet.
[0175] A computing system may include clients and servers. Clients and servers are typically located remotely to each other and usually interact via a communication network. The client-server relationship is established by means of computer programs running on respective computers and having a client-server relationship with each other. In some embodiments, the server transmits data (e.g., HTML pages) to a user device, for example, to display data to a user interacting with the device and to receive user input from that user, the device acting as a client. Data generated at the user device (e.g., the result of user interaction) can be received from the device at the server.
[0176] While this specification contains numerous specific implementation details, these details should not be construed as limiting the scope of any invention or the scope of what may be claimed, but rather as descriptions of features specific to particular embodiments of a particular invention. Certain features described in the context of individual embodiments in this specification may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments. Furthermore, although features may be described above as functioning in certain combinations, or even as originally claimed, one or more features from a claimed combination may be removed from the combination in some cases, and the claimed combination may involve sub-combinations or variations thereof.
[0177] Similarly, although operations are depicted in a specific order in the accompanying drawings and described in the claims, this should not be construed as requiring such operations to be performed in the specific order shown or in a sequential order, or that all illustrated operations be performed to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0178] Specific embodiments of this subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve the desired result. As an example, the processes depicted in the drawings do not necessarily require the specific order or sequential order shown to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous.
Claims
1. A method for determining a neural network architecture and a hardware accelerator design, the method comprising:
receiving data specifying one or more target hardware constraints on a hardware accelerator on which a neural network for performing a specific neural network task is to be deployed, the one or more target hardware constraints including a maximum area of the hardware accelerator;
generating, using a controller policy, a batch of one or more output sequences, each output sequence in the batch defining (i) a corresponding architecture of a sub-neural network configured to perform the specific neural network task and (ii) a corresponding architecture of the hardware accelerator on which a trained instance of the sub-neural network is to be implemented;
for each output sequence in the batch:
  training a corresponding instance of the sub-neural network having the architecture defined by the output sequence to perform the specific neural network task;
  evaluating the network performance of the trained instance of the sub-neural network on the specific neural network task to determine a network performance metric of the trained instance of the sub-neural network for the specific neural network task; and
  evaluating the accelerator performance of a corresponding instance of the hardware accelerator having the architecture defined by the output sequence to determine an accelerator performance metric of the instance of the hardware accelerator in supporting the trained instance of the sub-neural network having the architecture defined by the output sequence on the specific neural network task, wherein the accelerator performance metric includes an estimated area of the hardware accelerator, and wherein evaluating the accelerator performance of the corresponding instance of the hardware accelerator includes determining the estimated area of the hardware accelerator using a first neural network configured to receive, as input, data specifying the corresponding architecture of the hardware accelerator and to process the input according to values of parameters of the first neural network to generate, as output, a prediction of the area of the hardware accelerator;
adjusting the controller policy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator; and
generating a final output sequence according to the adjusted controller policy, the final output sequence defining a final architecture of the sub-neural network and a final architecture of the hardware accelerator.
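The area-estimation step of claim 1 can be pictured as a small regression network applied to an encoded accelerator configuration. The following is a minimal sketch under stated assumptions, not the claimed implementation: the parameter names, the log-scaled encoding, the layer sizes, and the names `encode_accelerator`, `AreaPredictor`, and `within_area_budget` are all hypothetical, and the weights are random placeholders rather than trained values.

```python
import math
import random

# Hypothetical hardware parameters; names and units are illustrative only.
FEATURE_KEYS = ("pes_x", "pes_y", "simd_macs_per_pe", "compute_lanes_per_pe",
                "shared_memory_kb", "register_file_kb")


def encode_accelerator(config):
    """Turn an accelerator configuration into a log-scaled feature vector."""
    return [math.log2(config[key]) for key in FEATURE_KEYS]


class AreaPredictor:
    """One-hidden-layer MLP mapping encoded hardware parameters to an area estimate."""

    def __init__(self, n_inputs, n_hidden=8, seed=0):
        rng = random.Random(seed)  # random weights stand in for trained ones
        self.w1 = [[rng.gauss(0.0, 0.3) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.gauss(0.0, 0.3) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict(self, features):
        # ReLU hidden layer followed by a scalar output.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        raw = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        # Numerically stable softplus keeps the predicted area non-negative.
        return max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))


def within_area_budget(predicted_area, max_area):
    """Check a candidate against the maximum-area target constraint."""
    return predicted_area <= max_area


candidate = {"pes_x": 4, "pes_y": 4, "simd_macs_per_pe": 64,
             "compute_lanes_per_pe": 4, "shared_memory_kb": 2048,
             "register_file_kb": 32}
predictor = AreaPredictor(n_inputs=len(FEATURE_KEYS))
estimated_area = predictor.predict(encode_accelerator(candidate))
```

In practice such a predictor would first be fitted to area figures for known accelerator configurations; the sketch only shows the forward pass from a configuration to an estimate that can be compared with the maximum-area constraint.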
2. The method according to claim 1, wherein:
the controller policy is implemented using a controller neural network having a plurality of controller network parameters; and
adjusting the controller policy includes adjusting current values of the plurality of controller network parameters.
3. The method of claim 2, wherein adjusting the controller policy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator includes: training the controller neural network, using a reinforcement learning technique, to generate output sequences that result in increased network performance metrics for the sub-neural network and increased accelerator performance metrics for the hardware accelerator.
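Claim 3 does not name a particular reinforcement learning technique. As one illustration, the sketch below uses a REINFORCE-style policy gradient with a batch-mean baseline. It makes several simplifying assumptions: the policy is a table of independent per-step logits rather than a controller neural network, and `toy_reward` is a synthetic stand-in for the combined network and accelerator metrics. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits_per_step, rng):
    """Sample one choice index per time step from the current policy."""
    return [rng.choices(range(len(logits)), weights=softmax(logits))[0]
            for logits in logits_per_step]


def reinforce_step(logits_per_step, batch, rewards, baseline, learning_rate):
    """Apply one batch-averaged REINFORCE update to the logits in place."""
    for step, logits in enumerate(logits_per_step):
        probs = softmax(logits)  # probabilities before this step's update
        for choice in range(len(logits)):
            gradient = 0.0
            for sequence, reward in zip(batch, rewards):
                chosen = 1.0 if sequence[step] == choice else 0.0
                # d log pi / d logit = chosen - prob, scaled by the advantage.
                gradient += (reward - baseline) * (chosen - probs[choice])
            logits[choice] += learning_rate * gradient / len(batch)


def toy_reward(sequence, target):
    """Synthetic reward: fraction of time steps that match a hidden target."""
    return sum(a == b for a, b in zip(sequence, target)) / len(target)


def run_search(target, n_choices=3, iterations=200, batch_size=16, seed=0):
    """Sample batches, score them, and update the policy; returns the logits."""
    rng = random.Random(seed)
    logits_per_step = [[0.0] * n_choices for _ in target]
    for _ in range(iterations):
        batch = [sample_sequence(logits_per_step, rng) for _ in range(batch_size)]
        rewards = [toy_reward(seq, target) for seq in batch]
        baseline = sum(rewards) / len(rewards)  # batch mean reduces variance
        reinforce_step(logits_per_step, batch, rewards, baseline,
                       learning_rate=0.5)
    return logits_per_step
```

Calling `run_search(target=[2, 0, 1, 1])` shifts probability mass toward the target choices at each step. In the claimed method the reward would instead come from training and evaluating each sampled sub-neural network and evaluating the corresponding accelerator instance.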
4. The method of claim 1, wherein each output sequence includes a value of a corresponding hyperparameter of the sub-neural network at each of a first plurality of time steps.
5. The method of claim 1, wherein each output sequence includes a value of a corresponding hardware parameter of the hardware accelerator at each of a second plurality of time steps.
6. The method of claim 2, wherein the controller neural network is a recurrent neural network comprising:
one or more recurrent neural network layers configured to, for a given output sequence and at each time step, receive as input the value of the hyperparameter or hardware parameter at the preceding time step in the given output sequence, and process the input to update a current hidden state of the recurrent neural network; and
a corresponding output layer for each time step, wherein each output layer is configured to, for the given output sequence, receive an output layer input including the updated hidden state at the time step and generate an output for the time step, the output defining a score distribution over possible values of the hyperparameter or hardware parameter at the time step.
7. The method of claim 6, wherein generating the one or more output sequences in the batch using the controller policy includes, for each output sequence in the batch and for each of the plurality of time steps:
providing, as input to the controller neural network, the value of the hyperparameter or hardware parameter at the preceding time step in the output sequence to generate an output for the time step, the output defining a score distribution over possible values of the hyperparameter or hardware parameter at the time step; and
sampling from the possible values according to the score distribution to determine the value of the hyperparameter or hardware parameter at the time step in the output sequence.
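The autoregressive sampling described in claims 6 and 7 can be sketched as follows. This is an illustration under assumptions, not the claimed controller: it uses a single tanh recurrent layer, one hypothetical embedding table per time step to feed the previously chosen value back in, one output layer per time step, and random untrained weights. The class name `TinyRNNController` is invented for the example.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNNController:
    """Recurrent controller with a separate output layer for each time step."""

    def __init__(self, choices_per_step, hidden_size=8, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.choices_per_step = choices_per_step
        self.hidden_size = hidden_size
        self.w_hh = matrix(hidden_size, hidden_size)
        # embed[t][k]: input vector used when value k was chosen at step t.
        self.embed = [matrix(n, hidden_size) for n in choices_per_step]
        # w_out[t]: output layer producing one score per possible value at step t.
        self.w_out = [matrix(n, hidden_size) for n in choices_per_step]

    def sample(self, rng):
        """Sample one output sequence; returns (sequence, log-probability)."""
        hidden = [0.0] * self.hidden_size
        prev_input = [0.0] * self.hidden_size  # start token for the first step
        sequence, log_prob = [], 0.0
        for step, n_values in enumerate(self.choices_per_step):
            # Update the hidden state from the previous state and previous value.
            hidden = [math.tanh(sum(w * h for w, h in zip(row, hidden)) + x)
                      for row, x in zip(self.w_hh, prev_input)]
            # The step's output layer maps the hidden state to a score distribution.
            scores = [sum(w * h for w, h in zip(row, hidden))
                      for row in self.w_out[step]]
            probs = softmax(scores)
            value = rng.choices(range(n_values), weights=probs)[0]
            sequence.append(value)
            log_prob += math.log(probs[value])
            prev_input = self.embed[step][value]
        return sequence, log_prob
```

For example, `TinyRNNController([3, 4, 2, 5]).sample(random.Random(1))` yields a four-element sequence in which the first entries could index network hyperparameters and the remaining ones hardware parameters. The returned log-probability is what a policy-gradient update would differentiate.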
8. The method according to claim 4, wherein:
the specific neural network task is an object classification and/or detection task, an object pose estimation task, or a semantic segmentation task;
the sub-neural network is a convolutional neural network comprising one or more depthwise separable convolutional layers; and
the hyperparameters include hyperparameters of each depthwise separable convolutional layer in the sub-neural network.
9. The method according to claim 4, wherein:
the sub-neural network includes one or more inverted residual layers and one or more linear bottleneck layers; and
the hyperparameters include hyperparameters of each inverted residual layer and each linear bottleneck layer in the sub-neural network.
10. The method of claim 1, wherein the hardware characteristics of the hardware accelerator include one or more of the following:
the bandwidth of the hardware accelerator;
the number of processing elements included in the hardware accelerator;
the layout of the processing elements on the hardware accelerator;
the number of single instruction, multiple data (SIMD) multiply-accumulate (MAC) operations in each processing element;
the number of compute lanes in each processing element;
the size of the shared memory in each processing element; and
the size of the register file in each processing element.
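The hardware characteristics listed in claim 10 lend themselves to a discrete search space in which each time step of an output sequence selects one value. The sketch below is illustrative only: every parameter name, unit, and candidate value is an assumption made for the example, and the processing-element count and layout are represented by hypothetical `pes_x` and `pes_y` grid dimensions.

```python
import math

# Hypothetical candidate values for each hardware parameter.
ACCELERATOR_SEARCH_SPACE = {
    "io_bandwidth_gbps": [5, 10, 15, 20, 25],
    "pes_x": [1, 2, 4, 6, 8],                 # processing elements per row
    "pes_y": [1, 2, 4, 6, 8],                 # processing elements per column
    "simd_macs_per_pe": [16, 32, 64, 128],
    "compute_lanes_per_pe": [1, 2, 4, 8],
    "shared_memory_kb": [512, 1024, 2048, 3072, 4096],
    "register_file_kb": [8, 16, 32, 64, 128],
}


def decode_hardware_choices(indices):
    """Map one sampled index per parameter to a concrete configuration."""
    if len(indices) != len(ACCELERATOR_SEARCH_SPACE):
        raise ValueError("expected one index per hardware parameter")
    return {name: values[index]
            for (name, values), index
            in zip(ACCELERATOR_SEARCH_SPACE.items(), indices)}


def search_space_size():
    """Number of distinct accelerator configurations in the space."""
    return math.prod(len(values) for values in ACCELERATOR_SEARCH_SPACE.values())
```

With these example values, `decode_hardware_choices([4, 2, 2, 3, 1, 2, 0])` selects a 4 x 4 grid of processing elements with 128 SIMD MACs and 2 compute lanes each, and the space holds 50,000 configurations before it is combined with the network hyperparameters.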
11. The method of claim 1, wherein the accelerator performance metric of the instance of the hardware accelerator in supporting the trained instance of the sub-neural network includes one or more of the following:
the estimated power consumption of the hardware accelerator; and
the estimated latency for performing the specific neural network task when the neural network is deployed on the hardware accelerator.
12. The method according to claim 11, wherein evaluating the accelerator performance of the corresponding instance of the hardware accelerator includes: determining, using a cycle-accurate simulator and based on (i) the corresponding architecture of the sub-neural network and (ii) the corresponding architecture of the hardware accelerator defined by the output sequence in the batch, the estimated latency for performing the specific neural network task when the neural network is deployed on the hardware accelerator.
13. The method of claim 2, wherein adjusting the controller policy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator includes:
assigning different weights to the one or more accelerator performance metrics; and
adjusting the current values of the controller network parameters of the controller neural network according to the different weights.
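Claim 13 does not fix how the differently weighted accelerator metrics enter the controller update. One possible arrangement, sketched here purely as an assumption, folds them into a single scalar reward by multiplying the network metric by a penalty factor for each accelerator metric that exceeds its target, with a separate negative exponent per metric. The function name, metric names, targets, and weights are all hypothetical.

```python
def weighted_reward(accuracy, metrics, targets, weights):
    """Combine a network metric with individually weighted accelerator metrics.

    Each accelerator metric that exceeds its target scales the reward by
    (metric / target) ** weight, where weight is negative. Metrics that meet
    their target leave the reward unchanged.
    """
    reward = accuracy
    for name, weight in weights.items():
        ratio = metrics[name] / targets[name]
        if ratio > 1.0:
            reward *= ratio ** weight
    return reward


# Hypothetical targets and per-metric weights; area is penalized most strongly.
targets = {"area_mm2": 100.0, "latency_ms": 5.0, "power_w": 2.0}
weights = {"area_mm2": -2.0, "latency_ms": -1.0, "power_w": -0.5}

on_target = weighted_reward(
    0.8, {"area_mm2": 90.0, "latency_ms": 4.0, "power_w": 1.5}, targets, weights)
too_slow = weighted_reward(
    0.8, {"area_mm2": 100.0, "latency_ms": 10.0, "power_w": 2.0}, targets, weights)
too_large = weighted_reward(
    0.8, {"area_mm2": 200.0, "latency_ms": 5.0, "power_w": 2.0}, targets, weights)
```

With these example weights, doubling the latency halves the reward while doubling the area quarters it, so the controller update is steered more strongly away from oversized accelerators than from slow ones.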
14. The method of claim 2, wherein adjusting the controller policy using (i) the network performance metrics of the trained instances of the sub-neural network and (ii) the accelerator performance metrics of the instances of the hardware accelerator includes: adjusting the current values of the controller network parameters of the controller neural network by holding fixed the network performance metrics of the trained instances of the sub-neural network for the specific neural network task and using only the determined accelerator performance metrics of the instances of the hardware accelerator.
15. The method of any one of claims 1 to 14, further comprising: performing the specific neural network task on a received network input by processing the received network input using a sub-neural network having the final architecture, wherein the specific neural network task includes:
an image processing task comprising receiving an input image and processing the input image to generate a network output for the input image;
an audio processing task comprising receiving a sequence representing a spoken utterance as input and processing the input to generate a score for each text segment in a set of text segments, each score representing an estimated probability that the text segment is a correct transcription of the utterance; or
an agent control task comprising receiving, as input, observations characterizing the state of an environment and processing the input to generate an output, the output defining an action to be performed by an agent in response to the observations.
16. A machine learning task-specific hardware accelerator having an architecture defined by performing a process, the process comprising the corresponding operations of the method according to any one of claims 1 to 15.
17. A system comprising one or more computers and one or more storage devices storing instructions, wherein the instructions, when executed by the one or more computers, cause the one or more computers to perform the operations of the method according to any one of claims 1 to 15.
18. A non-transitory computer-readable storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method according to any one of claims 1 to 15.